# Week 6 - Classification models  

## Part 3: Travel mode choice - Probit regression

In this part we will revisit our real-world problem of travel mode choice. 

The first part is very similar to the previous notebook for part 2: loading data, preprocessing, train/test split, etc. However, in this part, we will consider a Probit regression model. For the sake of simplicity, let's assume that we are only interested in distinguishing between car vs non-car (a binary classification problem).

Let's start by running the parts corresponding to imports, data loading, preprocessing, train/test split, etc.

Import required libraries:

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import linear_model
import pystan
import pystan_utils

# fix random generator seed (for reproducibility of results)
np.random.seed(42)

# matplotlib style options
plt.style.use('ggplot')
%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 10)

Load data:

In [2]:
# load csv
df = pd.read_csv("modechoice_data.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,individual,hinc,psize,ttme_air,invc_air,invt_air,gc_air,ttme_train,invc_train,invt_train,gc_train,ttme_bus,invc_bus,invt_bus,gc_bus,invc_car,invt_car,gc_car,mode_chosen
0,0,70.0,30.0,4.0,10.0,61.0,80.0,73.0,44.0,24.0,350.0,77.0,53.0,19.0,395.0,79.0,4.0,314.0,52.0,1.0
1,1,8.0,15.0,4.0,64.0,48.0,154.0,71.0,55.0,25.0,360.0,80.0,53.0,14.0,462.0,84.0,4.0,351.0,57.0,2.0
2,2,62.0,35.0,2.0,64.0,58.0,74.0,69.0,30.0,21.0,295.0,66.0,53.0,24.0,389.0,83.0,7.0,315.0,55.0,2.0
3,3,61.0,40.0,3.0,45.0,75.0,75.0,96.0,44.0,33.0,418.0,96.0,53.0,28.0,463.0,98.0,5.0,291.0,49.0,1.0
4,4,27.0,70.0,1.0,20.0,106.0,190.0,127.0,34.0,72.0,659.0,143.0,35.0,33.0,653.0,104.0,44.0,592.0,108.0,1.0


Preprocess data:

In [3]:
# separate between features/inputs (X) and target/output variables (y)
mat = df.values
X = mat[:,2:-1]
print(X.shape)
y = mat[:,-1].astype("int")
print(y.shape)
ind = mat[:,1].astype("int")
print(ind.shape)

(394, 17)
(394,)
(394,)


### This part is important!

This is where we turn our previous 4-class problem into a binary classification problem: car (mode 4, encoded as 1) vs non-car (encoded as 0).

In [4]:
# transform to binary problem: car vs non-car
y = (y == 4).astype("int")

In [5]:
# standardize input features
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X = (X - X_mean) / X_std

Train/test split:

In [6]:
train_perc = 0.66 # percentage of training data
split_point = int(train_perc*len(y))
perm = np.random.permutation(len(y))
ix_train = perm[:split_point]
ix_test = perm[split_point:]
X_train = X[ix_train,:]
X_test = X[ix_test,:]
y_train = y[ix_train]
y_test = y[ix_test]
print("num train: %d" % len(y_train))
print("num test: %d" % len(y_test))

num train: 260
num test: 134


Again, for the purpose of comparison, we run the logistic regression method from sklearn. But note that although sklearn has an implementation of logistic regression, it is not a Bayesian approach, nor does it support probit regression or some other variant that you may think is more appropriate for your particular problem. On the other hand, STAN offers us complete flexibility!

In [7]:
# create and fit logistic regression model
logreg = linear_model.LogisticRegression(solver='lbfgs', multi_class='auto')
logreg.fit(X_train, y_train)

# make predictions for test set
y_hat = logreg.predict(X_test)
print("predictions:", y_hat)
print("true values:", y_test)

# evaluate prediction accuracy
print("Accuracy:", 1.0*np.sum(y_hat == y_test) / len(y_test))

predictions: [0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0
 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 1
 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0 1 0 1
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
true values: [1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0
 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 1
 1 1 0 0 1 0 1 1 0 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1
 1 1 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 1]
Accuracy: 0.7164179104477612


Ok, time to implement binary logistic regression in STAN!

Your turn now :-)

Note: don't forget to include an explicit intercept parameter $\alpha$ in the model!
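
For reference, one way to write down the model you are asked to implement is given below. The $\mathcal{N}(0, 5^2)$ priors are an assumption that mirrors the `normal(0, 5)` priors used in the code cells of this notebook (5 is the standard deviation); other weakly informative priors would work too. Here $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic sigmoid and $\mathbf{x}_n$ is the feature vector of observation $n$.

$$
\begin{aligned}
\alpha &\sim \mathcal{N}(0, 5^2) \\
\beta_d &\sim \mathcal{N}(0, 5^2), \quad d = 1, \dots, D \\
y_n &\sim \text{Bernoulli}\big(\sigma(\alpha + \mathbf{x}_n^\top \boldsymbol{\beta})\big), \quad n = 1, \dots, N
\end{aligned}
$$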

In [26]:
# define Stan model
model_definition = """
data {
  int<lower=0> N;    // shape of data
  int<lower=0> D;    // dimensions of x
  int<lower=0> C;    // number of possible choices
  matrix[N,D]  X;    // feature matrix
  int          y[N]; // response vector
}
parameters {
  vector[C]    alpha; // bias of each choice
  matrix[C, D] beta;  // parameters
}
model{
  matrix[N, C] z;

  // linear predictor: row n holds the C class scores for observation n
  z = rep_matrix(alpha', N) + X * beta';

  //for (c in 1:C){
  //  beta[c, D]  ~ normal(0., 5);  
  //}
  to_vector(beta) ~ normal(0, 5);

  alpha  ~ normal(0., 5);
  
  for (n in 1:N){
    y[n] ~ categorical_logit(z[n]');  
  }
}
"""

# compile model
sm = pystan.StanModel(model_code=model_definition)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_0e4214910074a4517de4eacafaee6a38 NOW.


In [33]:
# define Stan model
model_definition = """
data {
  int<lower=0> N;    // shape of data
  int<lower=0> D;    // dimensions of x
  int<lower=0> C;    // number of possible choices
  matrix[N,D]  X;    // feature matrix
  int          y[N]; // response vector
}
parameters {
  real      alpha; // bias 
  vector[D] beta;  // parameters
}
model{
  beta   ~ normal(0, 5);
  alpha  ~ normal(0., 5);
  
  for (n in 1:N){
    // logistic link: P(y[n] = 1) = inv_logit(alpha + X[n] * beta)
    y[n] ~ bernoulli_logit(alpha + X[n] * beta);
  }
}
"""

# compile model
sm = pystan.StanModel(model_code=model_definition)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_82e3a775f1499b9c0446fb3f3a86bc06 NOW.


Prepare input data for STAN and run inference (MCMC) with the model compiled above:

In [None]:
# prepare data for Stan model
N, D = X_train.shape
C = int(y_train.max())
print("N=%d, D=%d, C=%d" % (N,D,C))
data = {'N': N, 'D': D, 'C': C, 'X': X_train, 'y': y_train}

N=260, D=17, C=1


In [None]:
%%time
fit = sm.sampling(data=data, iter=1000, chains=4, algorithm="NUTS", seed=42, verbose=True)

In [None]:
print(fit)

Extract samples from posterior, make predictions and compute accuracy (make sure that you understand all the code!):

In [14]:
samples = fit.extract(permuted=True)  # return a dictionary of arrays

In [15]:
# make predictions for test set
mu = np.mean(samples["alpha"].T + np.dot(X_test, samples["beta"].T), axis=1)
y_hat = (mu > 0).astype("int")
print("predictions:", y_hat)
print("true values:", y_test)

# evaluate prediction accuracy
print("Accuracy:", 1.0*np.sum(y_hat == y_test) / len(y_test))

predictions: [0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 1
 1 1 0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 1
 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1]
true values: [1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0
 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 1
 1 1 0 0 1 0 1 1 0 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1
 1 1 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 1]
Accuracy: 0.7238805970149254
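
To make the shapes in the prediction code above easier to follow, here is a small self-contained sketch. It uses random stand-in arrays rather than real posterior draws; the names `alpha_s`, `beta_s`, `X_new` and the sizes `S`, `D`, `N_test` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D, N_test = 2000, 17, 5            # hypothetical: posterior draws, features, test points

# stand-ins for samples["alpha"] and samples["beta"] (random numbers, NOT real posterior draws)
alpha_s = rng.normal(size=S)          # shape (S,)
beta_s = rng.normal(size=(S, D))      # shape (S, D)
X_new = rng.normal(size=(N_test, D))  # shape (N_test, D)

# one linear predictor per (test point, posterior draw)
eta = alpha_s + X_new @ beta_s.T      # (S,) broadcasts across the rows of (N_test, S)
mu = eta.mean(axis=1)                 # average over draws -> shape (N_test,)
y_pred = (mu > 0).astype(int)         # sigmoid(0) = 0.5, so mu > 0 means p > 0.5

print(eta.shape, mu.shape, y_pred)
```

Because the linear predictor is linear in the parameters, averaging it over the draws gives the same result as plugging in the posterior means of `alpha` and `beta`.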


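Nice, it seems that we are doing at least as well as sklearn: 0.724 vs 0.716 accuracy, which is one more correct prediction out of the 134 test points, so the two models are essentially on par.

Ok, now let's try a **probit regression model in STAN**.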

Can you implement it?

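Note: the probit model uses the standard normal CDF as its link function. STAN provides it as *Phi*, along with a fast approximation called *Phi_approx*, which is the function to use here.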
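
To get a feeling for how these functions relate, the following sketch compares the exact normal CDF, the logistic approximation that the Stan documentation gives for `Phi_approx`, and the plain logistic sigmoid. The helper names `phi`, `phi_approx` and `sigmoid` are defined here for illustration only; they are plain Python, not calls into STAN.

```python
import math

def phi(x):
    """Exact standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_approx(x):
    """Logistic approximation to the normal CDF (formula documented for Stan's Phi_approx)."""
    return 1.0 / (1.0 + math.exp(-(0.07056 * x**3 + 1.5976 * x)))

def sigmoid(x):
    """Logistic sigmoid, the link used in logistic regression."""
    return 1.0 / (1.0 + math.exp(-x))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print("x=%5.1f  Phi=%.4f  Phi_approx=%.4f  sigmoid=%.4f"
          % (x, phi(x), phi_approx(x), sigmoid(x)))

# largest gap between Phi and Phi_approx on a grid over [-5, 5]
max_gap = max(abs(phi(i / 100.0) - phi_approx(i / 100.0)) for i in range(-500, 501))
print("max |Phi - Phi_approx| on the grid:", max_gap)
```

All three functions equal 0.5 at zero, so the decision rule "predict 1 when the linear predictor is positive" is the same for both links. The main difference is in the tails: the logistic sigmoid approaches 0 and 1 more slowly than the normal CDF.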

In [16]:
# define Stan model
model_definition = """
data {
  int<lower=0> N;    // shape of data
  int<lower=0> D;    // dimensions of x
  int<lower=0> C;    // number of possible choices
  matrix[N,D]  X;    // feature matrix
  int          y[N]; // response vector
}
parameters {
  real      alpha; // bias 
  vector[D] beta;  // parameters
}
model{
  beta   ~ normal(0, 5);
  alpha  ~ normal(0., 5);
  
  for (n in 1:N){
    // probit link: P(y[n] = 1) = Phi_approx(alpha + X[n] * beta)
    y[n] ~ bernoulli(Phi_approx(alpha + X[n] * beta));
  }
}
"""

# compile model
sm = pystan.StanModel(model_code=model_definition)

Prepare input data for STAN and run inference (MCMC) with the model compiled above:

In [17]:
# prepare data for Stan model
N, D = X_train.shape
C = int(y_train.max())
print("N=%d, D=%d, C=%d" % (N,D,C))
data = {'N': N, 'D': D, 'C': C, 'X': X_train, 'y': y_train}

N=260, D=17, C=1


In [None]:
%%time
fit = sm.sampling(data=data, iter=1000, chains=4, algorithm="NUTS", seed=42, verbose=True)

In [None]:
print(fit)

Extract samples from posterior, make predictions and compute accuracy (make sure that you understand all the code!):

In [None]:
samples = fit.extract(permuted=True)  # return a dictionary of arrays

In [None]:
# make predictions for test set
mu = np.mean(samples["alpha"].T + np.dot(X_test, samples["beta"].T), axis=1)
y_hat = (mu > 0).astype("int")
print("predictions:", y_hat)
print("true values:", y_test)

# evaluate prediction accuracy
print("Accuracy:", 1.0*np.sum(y_hat == y_test) / len(y_test))
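
The code above thresholds the posterior mean of the linear predictor at zero. An alternative, sketched below, is to push every posterior draw through the probit link and average the resulting probabilities, which gives an estimate of $P(y = 1)$ for each test point. This is not what the notebook does; it is shown only as a variant, and it runs on random stand-in arrays (`alpha_s`, `beta_s`, `X_new` and the sizes are hypothetical) rather than real posterior draws.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF applied elementwise to an array."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

rng = np.random.default_rng(1)
S, D, N_test = 500, 3, 4              # hypothetical sizes
alpha_s = rng.normal(size=S)          # stand-in for samples["alpha"]
beta_s = rng.normal(size=(S, D))      # stand-in for samples["beta"]
X_new = rng.normal(size=(N_test, D))  # stand-in for X_test

eta = alpha_s + X_new @ beta_s.T      # (N_test, S) linear predictors
p_car = normal_cdf(eta).mean(axis=1)  # average probability over draws, one per test point
y_pred = (p_car > 0.5).astype(int)    # classify as car when P(y = 1) exceeds 0.5

print(p_car, y_pred)
```

The two decision rules usually agree, but they can differ for test points whose linear predictor is close to zero, because the link function is nonlinear.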

How do your results compare to the version with the logistic sigmoid?

In some cases, using a probit function instead of the logistic sigmoid can make a significant difference. In other cases, it doesn't. You have to consider what makes more sense for the specific problem that you are trying to solve. Or you can simply try different approaches; that is perfectly fine, and STAN makes it very easy to try all these variants.