# Nested Cross-Validation

- Nested cross-validation combines hyperparameter tuning with model evaluation. An inner loop runs a grid search, which tries every combination in a specified parameter grid and keeps the one that performs best on a chosen metric. An outer loop then scores the tuned model on data the search never saw. Because both loops use several train/test splits rather than one, a single lucky or unlucky split cannot skew the result, so different models can be compared more robustly.
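To make the two loops concrete, here is a minimal sketch that writes the outer loop by hand. It is an illustration under stated assumptions, not the notebook's own code: it loads iris directly from scikit-learn instead of the CSV, uses a reduced `max_depth` grid, and fixes `random_state=0` on the tree for repeatability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris loaded straight from scikit-learn rather than from the CSV
X, y = load_iris(return_X_y=True)

outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
param_grid = {"max_depth": [2, 3, 4]}  # small illustrative grid

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    # Inner loop: the grid search only ever sees the outer training fold
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid,
        scoring="accuracy",
        cv=inner_cv,
    )
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: score the refit best model on the untouched outer test fold
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

outer_scores = np.array(outer_scores)
print(outer_scores, outer_scores.mean())
```

Passing a `GridSearchCV` object to `cross_val_score` performs the same procedure in a single call.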

### Import Required Packages

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

### Load in the Data

In [3]:
iris = pd.read_csv('iris_sklearn_data.csv') # data was obtained via sklearn

# Visualize the dataframe
iris.head()

Unnamed: 0,Petal_Length,Petal_Width,Sepal_Length,Sepal_Width,Species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
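The notebook does not show how the CSV was produced. As a rough sketch, comparable data can be pulled straight from scikit-learn. The column names here are scikit-learn's own and differ from the CSV header above, and the export line is a hypothetical step, left commented out.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame holds the features plus a 'target' column
iris_frame = load_iris(as_frame=True).frame
print(iris_frame.shape)
print(list(iris_frame.columns))

# Hypothetical export step (file name taken from the notebook; column names would differ)
# iris_frame.to_csv('iris_sklearn_data.csv')
```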


In [4]:
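# Separate the features from the labels
# Note: the Petal_* and Sepal_* column names in this CSV appear swapped relative to
# scikit-learn's iris data (5.1 and 3.5 are sepal measurements there). The model is
# unaffected because all four features are used.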
X = iris[['Petal_Length', 'Petal_Width', 'Sepal_Length', 'Sepal_Width']]
y = iris['Species']

### Nested Cross-Validation via Grid Search

In [5]:
# Specify the inner (parameter tuning) and outer (model evaluation) cross-validation splits,
# along with the scoring metric used to pick the best model parameters
inner_cv = KFold(n_splits = 4, shuffle = True, random_state = 0)
outer_cv = KFold(n_splits = 4, shuffle = True, random_state = 0)
scoring = 'accuracy'
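To see what a four-way `KFold` produces, the sketch below prints the train and test sizes of each outer split. It assumes the data has 150 rows (the size of scikit-learn's iris dataset) and uses a placeholder array, since `KFold` only needs the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 150  # assumption: the CSV holds the full 150-row iris dataset
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# KFold only looks at the number of rows, so a placeholder array is enough
placeholder_X = np.zeros((n_samples, 1))
fold_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in outer_cv.split(placeholder_X)
]
print(fold_sizes)  # [(112, 38), (112, 38), (113, 37), (113, 37)]
```

Since 150 is not divisible by 4, the first two test folds receive one extra sample. During the grid search, `inner_cv` splits each outer training fold of 112 or 113 rows four ways again.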

In [6]:
# Model Creation

# Initialize the desired model (a decision tree in this case)
dt = DecisionTreeClassifier()

# Specify the parameters you would like the gridsearch to attempt
parameters = {'min_samples_split': [2, 3, 4, 5], 
              'max_depth': [2, 3, 4, 5, 6, 7, 8, 9]}

# Specify the classifier with the range of parameters to discover the best model
dt_clf = GridSearchCV(estimator = dt, scoring = scoring, param_grid = parameters, cv = inner_cv)

# Score the tuned model across all four splits created by the outer cross-validation layer
nested_scores = cross_val_score(estimator = dt_clf, X = X, y = y, scoring = scoring, cv = outer_cv)

# Display all of the scores obtained
print('Four Nested Scores (Accuracy)')
print(nested_scores)

# Calculate the average score across all of the validation splits
avg_score = nested_scores.mean()
print('Average Accuracy: {}'.format(avg_score))

Four Nested Scores (Accuracy)
[0.97368421 0.92105263 0.97297297 0.94594595]
Average Accuracy: 0.9534139402560456
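When comparing models, the spread across outer folds can be as informative as the mean. The sketch below summarizes the four scores printed above. The values are copied from that output, and reporting the standard deviation is a suggestion, not something the original notebook does.

```python
import numpy as np

# Scores copied from the output above
nested_scores = np.array([0.97368421, 0.92105263, 0.97297297, 0.94594595])

mean_score = nested_scores.mean()
std_score = nested_scores.std()  # population standard deviation across the four folds
print('Accuracy: {:.3f} +/- {:.3f}'.format(mean_score, std_score))
```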


In [9]:
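# Fit the grid search on a single 80/20 training split and display the best parameters it finds.
# Note: these come from one separate search, not from the nested cross-validation above,
# where each outer fold may select different parameters.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

dt_clf.fit(X_train, y_train)
dt_clf.best_params_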

{'max_depth': 3, 'min_samples_split': 3}
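The cell above creates `X_test` and `y_test` but never uses them. A possible follow-up, sketched here, scores the tuned model on that held-out 20%. It assumes iris is loaded directly from scikit-learn and fixes `random_state=0` on the tree for repeatability, so the selected parameters may differ from those shown above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris loaded straight from scikit-learn rather than from the CSV
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

parameters = {'min_samples_split': [2, 3, 4, 5],
              'max_depth': [2, 3, 4, 5, 6, 7, 8, 9]}
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0), parameters,
                      scoring='accuracy', cv=inner_cv)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)  # mean inner-CV accuracy of the winning combination

# Accuracy of the refit best model on the held-out 20%
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

`best_score_` is optimistic because the same folds were used to choose the parameters. The held-out accuracy, like the nested scores earlier, is measured on data the search never saw.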