# Nested Cross-Validation
Reference: [Sergey Feldman: You Should Probably Be Doing Nested Cross-Validation | PyData Miami 2019](https://www.youtube.com/watch?v=DuDtXtKNpZs&ab_channel=PyData)
## Inner Loop
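*   Data: the training set (which includes the validation folds)
*   Run K-fold cross-validation on it and pick the hyper-parameters with the best average validation score
*   Averaging over K validation folds reduces the variance of the validation error compared with a single validation split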
------------------ 
Steps for K-fold CV in the inner loop:
* run CV to find the best set of hyper-parameters
* retrain a model with the best hyper-parameters on train + validation
* evaluate the retrained model on the test set


## Outer Loop
*  Data: the whole dataset (training set + test set)
*  Split the whole dataset into N folds: (N - 1) folds for training and 1 fold for model evaluation
*  Apply the inner-loop K-fold CV to the (N - 1) training folds
*  Repeating this for each of the N folds reduces the variance of the test error that comes from randomly choosing a single test set
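The variance mentioned in the last bullet can be seen by repeating a single train/test split with different random seeds. The following is a minimal sketch, not part of the original notebook: the helper name `single_split_error`, the seed range, and `n_estimators=50` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)


def single_split_error(seed):
    """Test error of one model evaluated on one random 80/20 split (hypothetical helper)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Fix the model seed so that only the split changes between runs.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    return np.mean(model.predict(X_te) != y_te)


# Ten different single splits give ten different test-error estimates.
errors = np.array([single_split_error(seed) for seed in range(10)])
print(f"min={errors.min():.3f}  max={errors.max():.3f}  std={errors.std():.3f}")
```

The spread between `min` and `max` is the uncertainty that a single test split hides; the outer loop averages over several such splits instead of trusting one.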



In [34]:
import sklearn
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score,cross_val_predict

In [21]:
X, y =  load_iris(return_X_y= True)

In [27]:
# display(X_train_val)

##  The normal way 
- Train & Validation: K-fold to find the best set of hyper-parameters
- Test: Assess the model performance on the unseen test data

In [39]:
X_train_val, X_test,  y_train_val, y_test = train_test_split(X ,y , test_size = 0.2 )

In [29]:
p_grid = {'max_depth' :[2, 5, 10]}
rf = RandomForestClassifier()
clf = GridSearchCV( estimator = rf , param_grid = p_grid, cv = 5 )
clf.fit( X_train_val , y_train_val)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [2, 5, 10]})

In [30]:
best_rf = RandomForestClassifier(**clf.best_params_)

In [31]:
best_rf.fit(X_train_val , y_train_val) 

RandomForestClassifier(max_depth=2)

In [36]:
test_error = np.mean( best_rf.predict(X_test) !=  y_test) 

In [37]:
test_error

0.1
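The retraining step above can also be left to `GridSearchCV` itself: with its default `refit=True`, the search refits the best configuration on all of the data passed to `fit` and exposes it as `best_estimator_`. A minimal sketch, assuming a fixed `random_state` and `n_estimators=50` (both are illustrative choices, not taken from the notebook):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 5, 10]},
    cv=5,
)  # refit=True is the default
search.fit(X_train_val, y_train_val)

# Already refit on all of X_train_val with the best hyper-parameters.
best_rf = search.best_estimator_

# score() uses best_estimator_; for a classifier the default score is accuracy.
test_error = 1.0 - search.score(X_test, y_test)
print(search.best_params_, test_error)
```

This avoids constructing a second `RandomForestClassifier` by hand, at the cost of being slightly less explicit about when the refit happens.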

##  Nested CV
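* `GridSearchCV` + `cross_val_score`: concise, but `cross_val_score` returns only the outer-fold scores, so you cannot see which hyper-parameters were selected in each inner loop
* Customized function: write the outer loop by hand, which exposes the selected hyper-parameters for every outer fold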

In [53]:
p_grid = {'max_depth' :[2, 5, 10]}
rf = RandomForestClassifier()
# 4-outer, 5-inner nested CV 
clf = GridSearchCV( estimator = rf , param_grid = p_grid, cv = 5 )
nested_scores = cross_val_score( estimator = clf, X=X, y=y, cv = 4 ,verbose = 1  )

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    9.9s finished


In [41]:
nested_scores

array([0.94736842, 0.94736842, 0.94594595, 1.        ])

In [52]:
type(clf)

sklearn.model_selection._search.GridSearchCV

In [81]:
# cross_val_predict(estimator = clf, X=X, y=y, cv = 4 ,verbose = 1  )
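`cross_val_score` returns only the scores. One possible alternative, which the original notebook does not use, is `cross_validate` with `return_estimator=True`: it keeps the fitted `GridSearchCV` object from each outer fold, so the selected hyper-parameters can be read from `best_params_`. A sketch under the same 4-outer, 5-inner setup; `n_estimators=50` and the fixed `random_state` are assumptions made for speed and repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate

X, y = load_iris(return_X_y=True)

# Inner loop: 5-fold grid search over max_depth.
inner_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 5, 10]},
    cv=5,
)

# Outer loop: 4 folds; keep the fitted search object from each fold.
results = cross_validate(inner_search, X, y, cv=4, return_estimator=True)

for fold, (fitted_search, score) in enumerate(
    zip(results["estimator"], results["test_score"])
):
    print(fold, fitted_search.best_params_, round(score, 3))
```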

### Customized Function

In [82]:
X, y =  load_iris(return_X_y= True)
outer_loop_kf = KFold(n_splits= 5  )
n = 0 
# 5 -outer loop
for train_index, test_index in outer_loop_kf.split(X):
    print(f"CV = {n}")
    n +=1 
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train_val, X_test = X[train_index], X[test_index]
    y_train_val, y_test = y[train_index], y[test_index]
    
    # 5-inner loop
    p_grid = {'max_depth' :[2, 5, 10]}
    rf = RandomForestClassifier()
    clf = GridSearchCV( estimator = rf , param_grid = p_grid, cv = 5 )
    clf.fit( X_train_val , y_train_val)
    print(f"best parameters: {clf.best_params_}")
    best_rf = RandomForestClassifier(**clf.best_params_)
    best_rf.fit(X_train_val , y_train_val) 
    
    # evaluate the performance on the test set
    # the mean of correct predictions is the accuracy, not the error
    test_accuracy = np.mean(best_rf.predict(X_test) == y_test) 
    print(f"test accuracy is: {test_accuracy}")

CV = 0
best parameters: {'max_depth': 10}
test accuracy is: 1.0
CV = 1
best parameters: {'max_depth': 2}
test accuracy is: 1.0
CV = 2
best parameters: {'max_depth': 2}
test accuracy is: 0.8333333333333334
CV = 3
best parameters: {'max_depth': 5}
test accuracy is: 0.9333333333333333
CV = 4
best parameters: {'max_depth': 2}
test accuracy is: 0.7666666666666667
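The rows returned by `load_iris` are ordered by class, so an unshuffled `KFold` produces outer test folds that contain only one or two of the three classes. This probably contributes to the fold-to-fold spread in the output above. A hedged variant of the loop, assuming a shuffled `StratifiedKFold` for the outer split and collecting the per-fold results for a summary (`n_estimators=50` and the seeds are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

# Outer loop: 5 shuffled folds that preserve the class proportions.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

outer_accuracies, chosen_params = [], []
for train_index, test_index in outer_cv.split(X, y):
    # Inner loop: 5-fold grid search on the outer training folds only.
    search = GridSearchCV(
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),
        param_grid={"max_depth": [2, 5, 10]},
        cv=5,
    )
    search.fit(X[train_index], y[train_index])
    chosen_params.append(search.best_params_)
    # score() evaluates the refit best estimator; accuracy for classifiers.
    outer_accuracies.append(search.score(X[test_index], y[test_index]))

print("selected hyper-parameters per fold:", chosen_params)
print(f"accuracy: mean={np.mean(outer_accuracies):.3f}, std={np.std(outer_accuracies):.3f}")
```

The mean over outer folds is the performance estimate; the list of selected hyper-parameters shows how stable the inner-loop choice is across folds.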
