# Note for Nested CV

[reference - machine learning mastery](https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/)
[reference - sklearn](https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)

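Nested CV, also called double cross-validation, uses two cross-validation loops: an inner loop that tunes the hyperparameters and an outer loop that estimates the performance of the tuned model.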

A downside of nested cross-validation is the sharp increase in the number of model fits: the entire hyperparameter search is repeated once for every outer fold.
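To make the cost concrete, here is a minimal sketch that counts model fits. The helper names `flat_cv_fit_count` and `nested_cv_fit_count` are hypothetical (not part of scikit-learn), and the count assumes an exhaustive grid search that refits the best configuration once per search, which is the `GridSearchCV` default (`refit=True`).

```python
def flat_cv_fit_count(n_configs: int, k: int, refit: bool = True) -> int:
    """Fits performed by one grid search with k-fold CV (hypothetical helper)."""
    # one fit per (configuration, fold), plus one optional refit of the best configuration
    return n_configs * k + (1 if refit else 0)


def nested_cv_fit_count(n_configs: int, k_inner: int, k_outer: int, refit: bool = True) -> int:
    """Fits performed when that grid search runs inside every outer fold (hypothetical helper)."""
    # the whole inner search is repeated once per outer fold
    return k_outer * flat_cv_fit_count(n_configs, k_inner, refit)


# 6 configurations (3 values of C x 2 values of gamma) with 4 inner and 4 outer folds,
# matching the iris example in this note
print(flat_cv_fit_count(6, 4))       # 25
print(nested_cv_fit_count(6, 4, 4))  # 100
```

Under these assumptions the nested procedure costs `k_outer` times as much as a single search, which is why the inner loop is often given fewer folds.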

It is common to use a larger k for the outer loop and a smaller k for the inner loop, for example k=10 for the outer loop and k=3 or k=5 for the inner loop.
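A minimal sketch of that setup follows. It assumes 10 outer folds and 3 inner folds, and it borrows the iris data and the SVC grid purely for illustration. Passing a `GridSearchCV` object to `cross_val_score` makes the search run again inside every outer training split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# outer loop: more folds, used only to estimate performance (k=10 is an assumption)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
# inner loop: fewer folds, used only to select hyperparameters (k=3 is an assumption)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)

# the search is itself an estimator, so it is re-run on each outer training split
search = GridSearchCV(
    estimator=SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1]},
    cv=inner_cv,
)
outer_scores = cross_val_score(search, X, y, cv=outer_cv)

print("mean accuracy: {:.3f} (std {:.3f})".format(outer_scores.mean(), outer_scores.std()))
```

Each entry of `outer_scores` comes from a model whose hyperparameters were chosen without seeing the corresponding outer test fold.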

__TBD - Algorithm Summary__

# Example 

In [3]:
# imports
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt 
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold 
import numpy as np

## 1. Constants and Setup

In [4]:
# number of random trials 
NUM_TRIALS = 30

# load the dataset 
iris = load_iris()
X_iris = iris.data 
y_iris = iris.target

# parameters
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

# use support vector classifier with "rbf" kernel 
svm = SVC(kernel = "rbf")

# arrays to store scores
non_nested_scores = np.zeros(NUM_TRIALS)
nested_scores = np.zeros(NUM_TRIALS)

## 2. Loop

In [8]:
# loop over the trials
for i in range(NUM_TRIALS):

    # choose the cross-validation splitters for the inner and outer loops
    inner_cv = KFold(n_splits = 4, shuffle = True, random_state = i)
    outer_cv = KFold(n_splits = 4, shuffle = True, random_state = i)

    # non-nested parameter search and scoring
    clf = GridSearchCV(estimator = svm, param_grid = p_grid, cv = outer_cv)
    clf.fit(X_iris, y_iris)
    non_nested_scores[i] = clf.best_score_
    
    # nested cv with parameter optimization 
    clf = GridSearchCV(estimator = svm, param_grid = p_grid, cv = inner_cv)
    nested_score = cross_val_score(clf, X = X_iris, y = y_iris, cv = outer_cv)
    print("During the {} trial, we have outer cv scores {}".format(i, nested_score))
    nested_scores[i] = nested_score.mean()

During the 0 trial, we have outer cv scores [0.97368421 0.89473684 1.         0.91891892]
During the 1 trial, we have outer cv scores [0.97368421 0.89473684 0.97297297 0.94594595]
During the 2 trial, we have outer cv scores [1.         1.         0.97297297 0.91891892]
During the 3 trial, we have outer cv scores [0.94736842 0.94736842 1.         0.94594595]
During the 4 trial, we have outer cv scores [0.97368421 0.94736842 0.94594595 0.97297297]
During the 5 trial, we have outer cv scores [0.97368421 0.92105263 1.         0.97297297]
During the 6 trial, we have outer cv scores [0.97368421 0.97368421 0.97297297 0.97297297]
During the 7 trial, we have outer cv scores [0.89473684 1.         0.97297297 0.97297297]
During the 8 trial, we have outer cv scores [0.92105263 0.94736842 1.         0.97297297]
During the 9 trial, we have outer cv scores [1.         0.97368421 0.91891892 0.97297297]
During the 10 trial, we have outer cv scores [0.97368421 0.92105263 1.         0.94594595]
[output truncated]

In [7]:
clf.best_score_

0.9603485064011379
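The two score arrays are most useful when compared. The non-nested `best_score_` is measured on the same folds that selected the hyperparameters, so it tends to be optimistic relative to the nested estimate. The following is a self-contained sketch of that comparison, not the notebook's own cell: it repeats the loop with only 5 trials (an assumption made to keep it quick), and `score_difference` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

NUM_TRIALS = 5  # assumption: fewer trials than the 30 used above
X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

non_nested = np.zeros(NUM_TRIALS)
nested = np.zeros(NUM_TRIALS)

for i in range(NUM_TRIALS):
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

    # non-nested: the same folds both select and score the hyperparameters
    search = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid, cv=outer_cv)
    non_nested[i] = search.fit(X, y).best_score_

    # nested: selection happens inside each outer training split
    search = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)
    nested[i] = cross_val_score(search, X, y, cv=outer_cv).mean()

# per-trial gap between the two estimates
score_difference = non_nested - nested
print("average difference: {:.6f} (std {:.6f})".format(
    score_difference.mean(), score_difference.std()))
```

A positive average difference would be consistent with the non-nested score overestimating performance. On a dataset as small as iris the gap varies from trial to trial.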

# Example 2

In [9]:
# manual nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [10]:
# create dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    random_state=1,
    n_informative=10,
    n_redundant=10,
)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)