Cross-validation (CV) is a model evaluation method that estimates how well a model will perform on new, unseen data.
This notebook covers two cross-validation methods:

1) The Validation Set Approach: Randomly divide the available observations into two parts, a training set and a validation (hold-out) set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate (MSE for a quantitative response, misclassification rate for a qualitative one) provides an estimate of the test error rate.

2) Leave-One-Out Cross-Validation: Again split the dataset into a training set and a validation set, but instead of creating two subsets of comparable size, use a single observation as the validation set and the remaining n - 1 observations as the training set. Fit the statistical learning method on the training set only, predict the response of the one held-out observation, and compute its error (MSE for a quantitative response). Repeat the process n times, holding out each observation once, and average the n errors.
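
For reference, the quantity that leave-one-out cross-validation reports can be written as the average of the n single-observation errors. The symbols below (CV_(n), MSE_i, Err_i, and the prediction \hat{y}_i from the model fit without observation i) are not defined in this notebook; they follow common textbook notation and are introduced here only as an illustration.

```latex
% Quantitative response: average of the n squared errors
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MSE}_i,
\qquad \mathrm{MSE}_i = \left(y_i - \hat{y}_i\right)^2

% Qualitative response (the case used in this notebook): average misclassification
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}_i,
\qquad \mathrm{Err}_i = I\left(y_i \neq \hat{y}_i\right)
```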

  

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from IPython.display import display, HTML
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import datasets
from scipy import stats
from sklearn.datasets import make_classification

In [3]:
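# Data generation: 1000 observations, 5 informative features, binary target.
# No random_state is set, so the generated values differ from run to run.
# Note: weights = [.2, .3] does not sum to 1; in this run the classes come out
# at roughly 45% / 55% (see the row totals of the LOOCV confusion table below).
features, output = make_classification(n_samples = 1000,
                                       n_features = 5,
                                       n_informative = 5,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.2, .3])
print()
print("Target Class: ")
print(pd.DataFrame(output, columns=["TargetClass"]).head())
print("Feature Matrix: ")

# Combine features and target into one DataFrame with named columns
df = pd.DataFrame(np.hstack((features, output.reshape(-1, 1))),
                  columns=['x1', 'x2', 'x3', 'x4', 'x5', 'y'])
df.head()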


Target Class: 
   TargetClass
0            1
1            1
2            1
3            0
4            1
Feature Matrix: 


         x1        x2        x3        x4        x5    y
0 -0.085933 -0.387788 -1.125196 -0.003456 -1.864406  1.0
1 -1.777683  0.206918 -0.433815  1.576912  0.268240  1.0
2 -1.161370  1.939040 -0.404667  1.989070  0.361136  1.0
3 -3.754439 -1.218731  0.696586  3.217664  2.876382  0.0
4 -1.452071  1.657636 -0.259698  0.766656 -2.083101  1.0


The Validation Set Approach: 
   1)  Split the sample set into a training set and a validation set.
   2)  Fit a multiple logistic regression model using only the training observations.
   3)  Use the fitted model to predict the responses for the observations in the validation set.
   4)  Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
   5)  Repeat the process three times, using three different splits of the observations into a training set and a validation set.
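
The five steps above can also be sketched with scikit-learn's `train_test_split`. This is a minimal illustration under stated assumptions, not the notebook's own procedure: the helper `validation_set_error` and the stand-in data `X_demo`, `y_demo` are hypothetical names introduced here, and `train_test_split` produces an exact 50/50 split, whereas the notebook cell draws a random boolean mask whose split is only approximately 50/50.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def validation_set_error(X, y, seed, test_size=0.5):
    """Misclassification rate on one random hold-out split (hypothetical helper)."""
    # Step 1: split into a training set and a validation set
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    # Step 2: fit logistic regression on the training observations only
    model = LogisticRegression().fit(X_train, y_train)
    # Steps 3-4: predict on the validation set and compute the fraction misclassified
    return float(np.mean(model.predict(X_val) != y_val))


# Synthetic stand-in data (an assumption; not the notebook's df)
X_demo, y_demo = make_classification(n_samples=1000, n_features=5,
                                     n_informative=5, n_redundant=0,
                                     random_state=0)

# Step 5: repeat with three different splits
seeds = (1, 2, 3)
errors = [validation_set_error(X_demo, y_demo, seed) for seed in seeds]
for seed, err in zip(seeds, errors):
    print('seed = {}: validation error = {:.4f}'.format(seed, err))
```

The three error rates will generally differ from one another, which illustrates how much the validation set estimate can depend on which observations happen to land in each half.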

In [4]:
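# Helper that formats a 2x2 confusion matrix as a labeled table
def confusion_table(confusion_mtx):
    """Return a DataFrame with true classes as rows, predicted classes as columns, and totals."""
    confusion_df = pd.DataFrame({'y_pred=0': np.append(confusion_mtx[:, 0], confusion_mtx.sum(axis=0)[0]),
                                 'y_pred=1': np.append(confusion_mtx[:, 1], confusion_mtx.sum(axis=0)[1]),
                                 'Total': np.append(confusion_mtx.sum(axis=1), ''),
                                 '': ['y=0', 'y=1', 'Total']}).set_index('')
    return confusion_df

def total_error_rate(confusion_mtx):
    """Return the fraction of misclassified observations (off-diagonal share)."""
    # The parameter was previously named confusion_matrix, which shadowed the
    # sklearn function and left the body reading the global confusion_mtx.
    return 1 - np.trace(confusion_mtx) / np.sum(confusion_mtx)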



for s in range(1,4):
    display(HTML('<h3>Random seed = {}</h3>'.format(s)))
    # Create index for 50% holdout set
    np.random.seed(s)
    train = np.random.rand(len(df)) < 0.5
    
    response   = 'y'
    predictors = ['x1', 'x2', 'x3', 'x4', 'x5']
    
    X_train = np.array(df[train][predictors])
    X_test  = np.array(df[~train][predictors])
    y_train = np.array(df[train][response])
    y_test  = np.array(df[~train][response])
    
    # Logistic regression
    logit       = LogisticRegression()
    model_logit = logit.fit(X_train, y_train)
    
    # Predict
    y_pred = model_logit.predict(X_test)
    
    # Confusion mtx
    confusion_mtx = confusion_matrix(y_test, y_pred)
    display(confusion_table(confusion_mtx))
    
    total_error_rate_pct = np.around(total_error_rate(confusion_mtx) * 100, 4)
    print('total_error_rate: {}%'.format(total_error_rate_pct))


       y_pred=0  y_pred=1  Total
y=0       196.0      33.0  229.0
y=1        23.0     254.0  277.0
Total     219.0     287.0


total_error_rate: 11.0672%


       y_pred=0  y_pred=1  Total
y=0       187.0      28.0  215.0
y=1        30.0     221.0  251.0
Total     217.0     249.0


total_error_rate: 12.4464%


       y_pred=0  y_pred=1  Total
y=0       187.0      36.0  223.0
y=1        29.0     260.0  289.0
Total     216.0     296.0


total_error_rate: 12.6953%


Leave-One-Out Cross-Validation:
    1) Fit a logistic regression model that predicts the response (y) from x1, x2, x3, x4, x5 using all but the first observation.
    2) Check whether the held-out first observation is classified correctly.

In [5]:
# using all but the first observation

train = df.index > 0

response   = 'y'
predictors = ['x1', 'x2', 'x3', 'x4', 'x5']

X_train = np.array(df[train][predictors])
X_test  = np.array(df[~train][predictors])
y_train = np.array(df[train][response])
y_test  = np.array(df[~train][response])

# Logistic regression
logit       = LogisticRegression(fit_intercept=True)
model_logit = logit.fit(X_train, y_train)

# Predict
y_pred = model_logit.predict(X_test)

# Analysis (labels=[0, 1] keeps the matrix 2x2 even though y_test holds a single observation)
confusion_mtx = confusion_matrix(y_test, y_pred, labels=[0, 1])

display(confusion_table(confusion_mtx))
total_error_rate_pct = np.around(total_error_rate(confusion_mtx) * 100, 4)
print('total_error_rate: {}%'.format(total_error_rate_pct))


       y_pred=0  y_pred=1  Total
y=0         0.0       0.0    0.0
y=1         1.0       0.0    1.0
Total       1.0       0.0


total_error_rate: 100.0%


Repeat the process above n times by writing a loop from i = 1 to i = n, where n is the number of
observations in the data set, holding out a different observation on each pass.

In [6]:
response   = 'y'
predictors = ['x1', 'x2', 'x3', 'x4', 'x5']
y_pred = []

for i in range(df.shape[0]):
  
    train = df.index != i
    
    X_train = np.array(df[train][predictors])
    X_test  = np.array(df[~train][predictors])
    y_train = np.array(df[train][response])
    
    # Logistic regression
    logit       = LogisticRegression()
    model_logit = logit.fit(X_train, y_train)
    
    # Predict the single held-out observation and store its label as a scalar
    y_pred.append(model_logit.predict(X_test)[0])
    
y_pred = np.array(y_pred)
y_test = df[response]

# Analysis
confusion_mtx = confusion_matrix(y_test, y_pred)
display(confusion_table(confusion_mtx))

total_error_rate_pct = np.around(total_error_rate(confusion_mtx) * 100, 4)
print('total_error_rate: {}%'.format(total_error_rate_pct))


       y_pred=0  y_pred=1  Total
y=0       387.0      61.0  448.0
y=1        68.0     484.0  552.0
Total     455.0     545.0


total_error_rate: 12.9%
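
The manual loop above can also be expressed with scikit-learn's `LeaveOneOut` splitter and `cross_val_predict`. The following is a minimal sketch rather than a replacement for the notebook's code: the stand-in data `X_small`, `y_small` are hypothetical and deliberately small (60 observations) so the n model fits run quickly, which means the resulting error rate is not comparable to the 12.9% reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in data (an assumption; not the notebook's df)
X_small, y_small = make_classification(n_samples=60, n_features=5,
                                       n_informative=5, n_redundant=0,
                                       random_state=0)

# LeaveOneOut yields n splits, each holding out exactly one observation
loo = LeaveOneOut()
n_splits = loo.get_n_splits(X_small)

# Each prediction comes from a model fit on the other n - 1 observations
loo_pred = cross_val_predict(LogisticRegression(), X_small, y_small, cv=loo)

# LOOCV error: fraction of held-out observations that were misclassified
loo_error = float(np.mean(loo_pred != y_small))
print('number of fits: {}'.format(n_splits))
print('LOOCV error rate: {:.4f}'.format(loo_error))
```

Because every observation is held out exactly once, `loo_pred` lines up with `y_small` element by element, so it could be passed to `confusion_matrix` in the same way as `y_pred` in the cell above.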
