# Model-evaluation and K-fold cross-validation

In this notebook, we use K-fold cross-validation to estimate the generalization performance of two logistic regression models. scikit-learn provides two useful modules:

1. `sklearn.metrics`: scoring functions such as `log_loss`.
2. `sklearn.model_selection`: data-splitting utilities such as `KFold` and `RepeatedKFold`.

## Data - car marketing study

Notes 1 slide 6

The data in `Car.csv` record the car-purchasing behavior of n = 33 households in a marketing survey. The response is binary (1 if a household purchased a car; 0 otherwise).

**Attribute information**:

1. `y`: binary response variable
2. `income`: household income (thousands)
3. `car_age`: age of oldest automobile

Compare two different logistic regression models:

1. no interaction term
$$
    \log\left(\frac{p(\mathbf{x})}{1-p(\mathbf{x})}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2
$$
2. with an interaction term

$$
    \log\left(\frac{p(\mathbf{x})}{1-p(\mathbf{x})}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2
$$
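As a rough illustration of how the two linear predictors differ, the sketch below builds both design matrices by hand with NumPy. The predictor values and the coefficient vector `beta` are made up for illustration and are not estimates from `Car.csv`.

```python
import numpy as np

# toy predictor values (illustrative only)
income = np.array([32.0, 45.0, 60.0])
car_age = np.array([3.0, 2.0, 2.0])

# model 1: intercept, x1, x2
X_main = np.column_stack([np.ones_like(income), income, car_age])

# model 2: same columns plus the elementwise product x1 * x2
X_inter = np.column_stack([X_main, income * car_age])

# hypothetical coefficients (beta_0, beta_1, beta_2, beta_12)
beta = np.array([-4.0, 0.05, 0.5, 0.001])

eta = X_inter @ beta              # log-odds for each household
p = 1.0 / (1.0 + np.exp(-eta))    # inverse logit gives probabilities
print(X_inter)
print(p)
```

The only difference between the two models is the extra product column, which lets the effect of one predictor on the log-odds depend on the value of the other.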

In [1]:
import numpy as np
import pandas as pd

# models
import statsmodels.api as sm
import statsmodels.formula.api as smf

# metrics
from sklearn.metrics import log_loss # negative log-likelihood

%matplotlib inline

In [2]:
car = pd.read_csv('../data/Car.csv')
car.head()

Unnamed: 0,y,income,car_age
0,0,32,3
1,0,45,2
2,1,60,2
3,0,53,1
4,0,25,4


In [3]:
# income and car_age as predictors
clf1 = smf.glm('y~income+car_age',data=car,family=sm.families.Binomial()).fit()

# interaction term
clf2 = smf.glm('y~income+car_age + income:car_age',data=car,family=sm.families.Binomial()).fit()

In [4]:
print(clf1.summary2())

              Results: Generalized linear model
Model:              GLM              AIC:            42.6896 
Link Function:      logit            BIC:            -68.2056
Dependent Variable: y                Log-Likelihood: -18.345 
Date:               2023-01-25 10:12 LL-Null:        -22.494 
No. Observations:   33               Deviance:       36.690  
Df Model:           2                Pearson chi2:   33.6    
Df Residuals:       30               Scale:          1.0000  
Method:             IRLS                                     
-------------------------------------------------------------
               Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-------------------------------------------------------------
Intercept     -4.7393   2.1019 -2.2547 0.0242 -8.8591 -0.6196
income         0.0677   0.0281  2.4141 0.0158  0.0127  0.1227
car_age        0.5986   0.3901  1.5347 0.1249 -0.1659  1.3631



In [5]:
# training-set negative log-likelihood, summed over all observations
# (for 0/1 responses the deviance equals -2 * log-likelihood)
log_loss1 = clf1.deviance/2
log_loss2 = clf2.deviance/2

print(log_loss1,log_loss2)

18.34481579484669 17.70205090961161
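Note that these training values are sums over all households, whereas `log_loss` from `sklearn.metrics` returns the mean per observation by default. A minimal NumPy sketch of the two conventions follows; the helper `binary_nll` and the toy labels and probabilities are hypothetical and exist only for this illustration.

```python
import numpy as np

def binary_nll(y, p, average=True):
    """Negative Bernoulli log-likelihood of labels y under probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    per_obs = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return per_obs.mean() if average else per_obs.sum()

# toy labels and predicted probabilities (made up for illustration)
y_toy = np.array([0, 0, 1, 1])
p_toy = np.array([0.2, 0.4, 0.7, 0.9])

nll_sum = binary_nll(y_toy, p_toy, average=False)  # summed, like deviance/2
nll_mean = binary_nll(y_toy, p_toy)                # averaged, like log_loss
print(nll_sum, nll_mean)
```

To put the training value for the first model on the per-observation scale, divide by n = 33: 18.345 / 33 is roughly 0.556.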


In [4]:
# per-observation AIC, using the AIC computed by statsmodels
n = car.shape[0]
clf1.aic/n,clf2.aic/n

(1.2936251996876782, 1.315275812703734)

In [5]:
# per-observation AIC by hand: (deviance + 2k)/n, with k = 3 and k = 4 coefficients
aic1 = clf1.deviance/n + 2*3/n
aic2 = clf2.deviance/n + 2*4/n
print(aic1,aic2)

1.2936251996876782 1.315275812703734
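The manual calculation agrees with statsmodels because, for ungrouped 0/1 responses, the saturated model has log-likelihood zero, so the deviance $D$ equals $-2\ell(\hat{\beta})$. Writing $k$ for the number of estimated coefficients (3 and 4 here), and plugging in the rounded deviance of the first model from the summary above:

$$
    \frac{\mathrm{AIC}}{n} = \frac{-2\,\ell(\hat{\beta}) + 2k}{n} = \frac{D}{n} + \frac{2k}{n},
    \qquad
    \frac{36.690 + 2\cdot 3}{33} \approx 1.2936
$$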


## K-fold cross-validation

We will use the `KFold` class from scikit-learn's `model_selection` module to generate the fold partitions.

In [6]:
from sklearn.model_selection import KFold

# data with 10 entries
Z = np.concatenate([np.zeros(5),np.ones(5)])

# create K-fold object with 5 folds
kf = KFold(n_splits=5) 

print('%-17s %-8s'%('Train idx','Test idx'))
for train_index,test_index in kf.split(Z):
    print(train_index,test_index)

Train idx         Test idx
[2 3 4 5 6 7 8 9] [0 1]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]


In [7]:
print('%-25s %-9s'%('Train data','Test data'))
for train_index,test_index in kf.split(Z):
    print(Z[train_index],Z[test_index])

Train data                Test data
[0. 0. 0. 1. 1. 1. 1. 1.] [0. 0.]
[0. 0. 0. 1. 1. 1. 1. 1.] [0. 0.]
[0. 0. 0. 0. 1. 1. 1. 1.] [0. 1.]
[0. 0. 0. 0. 0. 1. 1. 1.] [1. 1.]
[0. 0. 0. 0. 0. 1. 1. 1.] [1. 1.]


`KFold` does not shuffle the data by default: the folds are consecutive blocks of indices, so running it repeatedly yields the same partition. To randomize the folds, pass `shuffle=True`, and optionally set `random_state` for reproducibility.
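Both behaviours can be checked directly. The sketch below is an illustration on ten placeholder observations; the helper `collect_test_folds` and the seed value `0` are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10)  # ten placeholder observations

def collect_test_folds(splitter, data):
    """Collect the test indices of every fold as plain lists."""
    return [test.tolist() for _, test in splitter.split(data)]

unshuffled = collect_test_folds(KFold(n_splits=5), X_toy)
seeded_a = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=0), X_toy)
seeded_b = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=0), X_toy)

print(unshuffled)            # consecutive blocks of indices
print(seeded_a == seeded_b)  # same seed gives the same partition
```

Fixing `random_state` matters when two models are compared: both should be scored on the same folds so that differences in the estimate are not caused by the partition.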

In [8]:
# impact of shuffling
# create K-fold object with 5 folds
kf = KFold(n_splits=5,shuffle=True,random_state=8) 

print('%-17s %-5s'%('Train','Test'))
for train_index,test_index in kf.split(Z):
    print(train_index,test_index)

Train             Test 
[0 1 2 3 4 5 7 9] [6 8]
[1 2 3 4 5 6 7 8] [0 9]
[0 1 3 4 6 7 8 9] [2 5]
[0 2 3 4 5 6 8 9] [1 7]
[0 1 2 5 6 7 8 9] [3 4]


In [9]:
# impact of shuffling
# create K-fold object with 5 folds
kf = KFold(n_splits=5,shuffle=True,random_state=9) 

print('%-17s %-5s'%('Train','Test'))
for train_index,test_index in kf.split(Z):
    print(train_index,test_index)

Train             Test 
[0 1 2 3 5 6 7 9] [4 8]
[0 1 3 4 5 6 8 9] [2 7]
[0 2 3 4 5 6 7 8] [1 9]
[1 2 4 5 6 7 8 9] [0 3]
[0 1 2 3 4 7 8 9] [5 6]


In [10]:
# create the KFold object
kf = KFold(n_splits=10,random_state=1,shuffle=True)

logloss1 = []
logloss2 = []

for train_index,test_index in kf.split(car):
    
    #### model with no interaction term ####
    # fit model on the remaining folds
    clf1 = (
        smf.glm('y~income+car_age',data=car.loc[train_index,:],family=sm.families.Binomial())
        .fit()
    )
    # obtain predictions on the held out fold and compute logistic loss
    p_pred1 = clf1.predict(car.loc[test_index,:]) # returns probabilities
    logloss1.append(log_loss(car['y'].loc[test_index],p_pred1,labels=[0,1]))
    

    #### model with interaction term ####
    # fit model on the remaining folds
    clf2 = (
        smf.glm('y~income+car_age+income:car_age',data=car.loc[train_index,:],family=sm.families.Binomial())
        .fit()
    )
    # obtain predictions on the held out fold and compute logistic loss
    p_pred2 = clf2.predict(car.loc[test_index,:])
    logloss2.append(log_loss(car['y'].loc[test_index],p_pred2,labels=[0,1]))

# print CV estimate
print('NLL for model1: %5.3f'%np.mean(logloss1))
print('NLL for model2: %5.3f'%np.mean(logloss2))

NLL for model1: 0.712
NLL for model2: 0.741
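With n = 33 and 10 folds, each held-out fold contains only three or four households, which is one reason the per-fold losses (and hence the estimate) are noisy. The sketch below reproduces the fold sizes under the sizing rule documented for `KFold`, where the first `n_samples % n_splits` folds get one extra observation. The helper `kfold_test_sizes` is a hypothetical name used only here.

```python
import numpy as np

def kfold_test_sizes(n_samples, n_splits):
    """Test-fold sizes when n_samples is split as evenly as possible."""
    sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
    sizes[: n_samples % n_splits] += 1  # the first few folds absorb the remainder
    return sizes

sizes = kfold_test_sizes(33, 10)
print(sizes)
```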


### Using a different K-fold partition

In [11]:
# create the KFold object
kf = KFold(n_splits=10,random_state=17,shuffle=True)

logloss1 = []
logloss2 = []

for train_index,test_index in kf.split(car):
    
    #### model with no interaction term ####
    # fit model on the remaining folds
    clf1 = (
        smf.glm('y~income+car_age',data=car.loc[train_index,:],family=sm.families.Binomial())
        .fit()
    )
    # obtain predictions on the held out fold and compute logistic loss
    p_pred1 = clf1.predict(car.loc[test_index,:]) # returns probabilities
    logloss1.append(log_loss(car['y'].loc[test_index],p_pred1,labels=[0,1]))
    

    #### model with interaction term ####
    # fit model on the remaining folds
    clf2 = (
        smf.glm('y~income+car_age+income:car_age',data=car.loc[train_index,:],family=sm.families.Binomial())
        .fit()
    )
    # obtain predictions on the held out fold and compute logistic loss
    p_pred2 = clf2.predict(car.loc[test_index,:])
    logloss2.append(log_loss(car['y'].loc[test_index],p_pred2,labels=[0,1]))
    
# print CV estimate
print('NLL for model1: %5.3f'%np.mean(logloss1))
print('NLL for model2: %5.3f'%np.mean(logloss2))

NLL for model1: 0.644
NLL for model2: 0.650


## Replicated cross-validation

To run replicated cross-validation, we could use a for loop to repeat the above calculations over different fold partitions. Alternatively, the `model_selection` module provides a `RepeatedKFold` class that generates multiple replicates.

The for loop syntax using `RepeatedKFold` is exactly the same as when using `KFold`.
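As a quick sanity check of what the splitter yields, the sketch below (toy data, arbitrary seed `0`) confirms that `RepeatedKFold` produces `n_splits * n_repeats` train/test pairs, and that within each replicate the test folds cover every index exactly once.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

n_splits, n_repeats = 5, 3
rkf_toy = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

X_toy = np.arange(10)  # ten placeholder observations
splits = list(rkf_toy.split(X_toy))
print(len(splits))  # n_splits * n_repeats pairs in total

# splits are yielded replicate by replicate
coverage = []
for r in range(n_repeats):
    rep = splits[r * n_splits:(r + 1) * n_splits]
    covered = np.sort(np.concatenate([test for _, test in rep]))
    coverage.append(covered.tolist())
    print(r, covered)
```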

In [12]:
from sklearn.model_selection import RepeatedKFold
# data with 10 entries
Z = np.concatenate([np.zeros(5),np.ones(5)])

# replicated k-fold with 3 replicates
rkf = RepeatedKFold(n_splits=5,n_repeats=3,random_state=1) 

print('%-17s %-8s'%('Train idx','Test idx'))
ct = 1
for train_index,test_index in rkf.split(Z):
    print(train_index,test_index)
    if ct%5==0:
        # demarcating replicates
        print('*****************')
    ct+= 1

Train idx         Test idx
[0 1 3 4 5 6 7 8] [2 9]
[0 1 2 3 5 7 8 9] [4 6]
[1 2 4 5 6 7 8 9] [0 3]
[0 2 3 4 5 6 8 9] [1 7]
[0 1 2 3 4 6 7 9] [5 8]
*****************
[0 1 2 3 4 6 7 8] [5 9]
[1 2 4 5 6 7 8 9] [0 3]
[0 1 2 3 5 6 7 9] [4 8]
[0 3 4 5 6 7 8 9] [1 2]
[0 1 2 3 4 5 8 9] [6 7]
*****************
[0 1 2 4 5 6 7 9] [3 8]
[0 1 2 3 4 6 7 8] [5 9]
[1 2 3 4 5 7 8 9] [0 6]
[0 2 3 4 5 6 8 9] [1 7]
[0 1 3 5 6 7 8 9] [2 4]
*****************


In [13]:
# 5 replicates of 10-fold cross-validation
rkf = RepeatedKFold(n_splits=10,n_repeats=5,random_state=1)

logloss1 = []
logloss2 = []

for train_index,test_index in rkf.split(car):
    
    #### model with no interaction term ####
    # fit model on the remaining folds
    clf1 = (
        smf.glm('y~income+car_age',data=car.loc[train_index,:],family=sm.families.Binomial())
        .fit()
    )
    # obtain predictions on the held out fold and compute logistic loss
    p_pred1 = clf1.predict(car.loc[test_index,:]) # returns probabilities
    logloss1.append(log_loss(car['y'].loc[test_index],p_pred1,labels=[0,1]))
    

    #### model with interaction term ####
    # fit model on the remaining folds
    clf2 = (
        smf.glm('y~income+car_age+income:car_age',data=car.loc[train_index,:],family=sm.families.Binomial())
        .fit()
    )
    # obtain predictions on the held out fold and compute logistic loss
    p_pred2 = clf2.predict(car.loc[test_index,:])
    logloss2.append(log_loss(car['y'].loc[test_index],p_pred2,labels=[0,1]))

# print CV estimate
print('NLL for model1: %5.3f'%np.mean(logloss1))
print('NLL for model2: %5.3f'%np.mean(logloss2))

NLL for model1: 0.674
NLL for model2: 0.716
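Because the folds are yielded replicate by replicate, the flat list of fold losses can be reshaped to see how much the estimate moves between replicates. The sketch below uses randomly generated placeholder losses rather than the values from this notebook; with the real results, a call such as `np.reshape(logloss1, (5, 10))` should play the same role.

```python
import numpy as np

n_repeats, n_splits = 5, 10

# placeholder fold losses (random numbers, not results from Car.csv)
rng = np.random.default_rng(0)
fold_losses = rng.uniform(0.3, 1.2, size=n_repeats * n_splits)

# one row per replicate, one column per fold
by_replicate = fold_losses.reshape(n_repeats, n_splits)
rep_means = by_replicate.mean(axis=1)

print(rep_means.round(3))        # one CV estimate per replicate
print(rep_means.mean())          # overall replicated-CV estimate
print(rep_means.std(ddof=1))     # spread across replicates
```

The spread across replicates gives a sense of how much of the gap between two models could be due to the choice of partition alone.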


In [14]:
logloss1

[0.8824128186916258,
 0.7423322543177083,
 0.37163639827807765,
 0.6895255662225539,
 0.4876876616091002,
 1.8723431643825992,
 0.4446211590973837,
 0.4557258794640511,
 0.5123470365825539,
 0.6621216843186849,
 1.0528540322572453,
 0.33241954849805,
 0.7554616914557136,
 0.3428810771412291,
 0.5179831928014335,
 0.7032840484619672,
 0.3061130950311808,
 0.6134716591258881,
 0.6907315182914652,
 1.1610794039990464,
 0.8870478752079196,
 0.4907311006143295,
 1.1328871274355574,
 0.4629217726195587,
 0.3261367775725105,
 0.5791864782577943,
 1.0743292160533844,
 0.28551487617950594,
 0.7975733772118082,
 0.8428034301655057,
 0.7110524358838342,
 1.4570610486867728,
 0.6521643847452291,
 0.3703127309454037,
 0.39947136703210906,
 0.2731829624709366,
 0.6864666661821245,
 0.734379212423702,
 0.809411991186796,
 0.3574360788565973,
 0.9856668283250295,
 0.8222474351472615,
 0.6780182208678622,
 0.5503576055835853,
 0.5365426404107375,
 0.26544053245049815,
 0.6559343004452943,
 0.9248336210