<h1 style='color:blue;' align='center'>KFold Cross Validation Python Tutorial</h1>

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

**Logistic Regression**

In [5]:
# The score is about 96% on this split, so logistic regression performs well here
lr = LogisticRegression(solver='liblinear',multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9611111111111111

**SVM**

In [6]:
# The score is only about 53% on this split, so SVM with gamma='auto' performs poorly here
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.5333333333333333
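The low SVM score is likely tied to the `gamma='auto'` setting rather than to SVM itself: `'auto'` sets gamma to 1 / n_features and ignores the scale of the pixel values (0 to 16). As a minimal sketch, assuming a fixed `random_state=0` for the split (not part of the original notebook), the comparison below fits the same model with `gamma='scale'`, which also takes the feature variance into account.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()

# Assumption: random_state=0 is fixed here only to make the comparison repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# Same model, same split, only the gamma setting differs
score_auto = SVC(gamma='auto').fit(X_tr, y_tr).score(X_te, y_te)
score_scale = SVC(gamma='scale').fit(X_tr, y_tr).score(X_te, y_te)

print("gamma='auto': ", score_auto)
print("gamma='scale':", score_scale)
```

The rest of this tutorial keeps `gamma='auto'` so that the numbers stay comparable with the outputs shown.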

**Random Forest**

In [9]:
# The score is about 97% on this split, so random forest performs well here
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9740740740740741

In [16]:
# define a method called get_score
# Input: model, X_train, X_test, y_train, y_test
# Output: model score
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [20]:
get_score(LogisticRegression(solver='liblinear',multi_class='ovr'),X_train, X_test, y_train, y_test)

0.9611111111111111

In [21]:
get_score(SVC(gamma='auto'),X_train, X_test, y_train, y_test)

0.5333333333333333

In [22]:
get_score(RandomForestClassifier(n_estimators=40),X_train, X_test, y_train, y_test)

0.975925925925926

# Conclusion: 

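A single train_test_split gives a score that depends on which samples happen to land in the test set, so you would have to run it multiple times to decide which model is performing better. K-Fold cross validation does this systematically: every sample is used for testing exactly once, and you compare the models on the scores from all folds.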
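To see how much one split can move the score, here is a small sketch that repeats train_test_split with different seeds. The seed values, the `random_state=0` on the forest, and the name `split_scores` are assumptions made for illustration; they are not in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()

split_scores = []
for seed in range(5):
    # A different seed puts different samples into the test set
    X_tr, X_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed
    )
    # Fix the forest's own randomness so only the split changes
    model = RandomForestClassifier(n_estimators=40, random_state=0)
    split_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

print(split_scores)
print("mean:", np.mean(split_scores), "std:", np.std(split_scores))
```

The spread between the five scores is the uncertainty you would be ignoring if you judged a model on one split only.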

<h2 style='color:purple'>KFold cross validation</h2>

**Basic example**

In [10]:
from sklearn.model_selection import KFold
# n_splits refers to the number of folds that we want to create
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [15]:
# With n_splits=3 and 9 samples (indices 0 to 8), each fold contains 3 samples
# That is, fold number 1 --> [0 1 2], fold number 2 --> [3 4 5], fold number 3 --> [6 7 8]
# In 1st iteration --> fold number 2 & fold number 3 are used for training & fold number 1 is used for testing
# That is, in 1st iteration --> [3 4 5 6 7 8] [0 1 2]
# In 2nd iteration --> fold number 1 & fold number 3 are used for training & fold number 2 is used for testing
# That is, in 2nd iteration --> [0 1 2 6 7 8] [3 4 5]
# In 3rd iteration --> fold number 1 & fold number 2 are used for training & fold number 3 is used for testing
# That is, in 3rd iteration --> [0 1 2 3 4 5] [6 7 8]
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
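The folds above are contiguous because `shuffle=False` is the default. As a hedged sketch, the variant below turns shuffling on (the `random_state=0` value and the variable names are illustrative assumptions) and checks that the test folds still cover every index exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(9)

# shuffle=True assigns samples to folds at random instead of in contiguous blocks
kf = KFold(n_splits=3, shuffle=True, random_state=0)

test_folds = []
for train_index, test_index in kf.split(samples):
    print(train_index, test_index)
    test_folds.append(test_index)

# Every index should appear in exactly one test fold
all_test = np.sort(np.concatenate(test_folds))
print(all_test)
```

Shuffling matters when the rows of a dataset are ordered, for example sorted by class label.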


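**Use KFold for our digits example**

We use StratifiedKFold here, a variant of KFold that keeps the proportion of each digit class roughly the same in every fold.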

In [24]:
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

scores_logistic = []
scores_svm = []
scores_rf = []

for train_index, test_index in folds.split(digits.data,digits.target):
    X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], \
                                       digits.target[train_index], digits.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))  
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

In [25]:
scores_logistic

[0.8948247078464107, 0.9532554257095158, 0.9098497495826378]

In [26]:
scores_svm

[0.3806343906510851, 0.41068447412353926, 0.5125208681135225]

In [27]:
scores_rf

[0.9181969949916527, 0.9449081803005008, 0.9232053422370617]
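Three scores per model are easier to compare once they are summarised. The sketch below copies the fold scores printed above into a dictionary (the names `fold_scores` and `best` are illustrative) and reports the mean and standard deviation for each model.

```python
import numpy as np

# Fold scores copied from the outputs above
fold_scores = {
    'logistic': [0.8948247078464107, 0.9532554257095158, 0.9098497495826378],
    'svm': [0.3806343906510851, 0.41068447412353926, 0.5125208681135225],
    'rf': [0.9181969949916527, 0.9449081803005008, 0.9232053422370617],
}

for name, scores in fold_scores.items():
    print(f"{name}: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")

# Model with the highest mean score across the folds
best = max(fold_scores, key=lambda name: np.mean(fold_scores[name]))
print("best:", best)
```

On these numbers random forest has the highest mean, with logistic regression close behind.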

# Conclusion: 

Instead of writing the loop above by hand, you can use the ready-made sklearn function cross_val_score, which does the same thing.

In your own projects you can use cross_val_score directly.

<h2 style='color:purple'>cross_val_score function</h2>

In [32]:
from sklearn.model_selection import cross_val_score

**Logistic regression model performance using cross_val_score**

In [33]:
cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), digits.data, digits.target,cv=3)

array([0.89482471, 0.95325543, 0.90984975])

**svm model performance using cross_val_score**

In [34]:
cross_val_score(SVC(gamma='auto'), digits.data, digits.target,cv=3)

array([0.38063439, 0.41068447, 0.51252087])

**random forest performance using cross_val_score**

In [35]:
cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target,cv=3)

array([0.92988314, 0.95659432, 0.91652755])

For classifiers, cross_val_score with an integer cv uses stratified k-fold (StratifiedKFold) by default.
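One way to check this default is to pass a StratifiedKFold object explicitly and compare the result with `cv=3`. This is a minimal sketch; the variable names are illustrative, and SVC is used because it gives the same result on every run.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()
model = SVC(gamma='auto')

# Integer cv: sklearn picks the splitter for us
scores_int = cross_val_score(model, digits.data, digits.target, cv=3)

# Explicit splitter: the same stratified folds, written out
scores_skf = cross_val_score(
    model, digits.data, digits.target, cv=StratifiedKFold(n_splits=3)
)

same = np.allclose(scores_int, scores_skf)
print(scores_int, scores_skf, same)
```

Passing the splitter explicitly is also how you would enable shuffling, for example `StratifiedKFold(n_splits=3, shuffle=True, random_state=0)`.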

<h2 style='color:purple'>Parameter tuning using k fold cross validation</h2>

In [39]:
# Increasing the number of decision trees in a random forest generally raises the score, with diminishing returns

In [40]:
scores1 = cross_val_score(RandomForestClassifier(n_estimators=5),digits.data, digits.target, cv=10)
np.average(scores1)

0.8764525139664805

In [41]:
scores2 = cross_val_score(RandomForestClassifier(n_estimators=20),digits.data, digits.target, cv=10)
np.average(scores2)

0.9398975791433891

In [42]:
scores3 = cross_val_score(RandomForestClassifier(n_estimators=30),digits.data, digits.target, cv=10)
np.average(scores3)

0.9482526381129732

In [43]:
scores4 = cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target, cv=10)
np.average(scores4)

0.9465704531346988

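Here we used cross_val_score to fine tune our random forest classifier. The average score rises sharply from 5 to 20 trees and then levels off: 30 and 40 trees score about the same (0.948 vs 0.947), so around 30 to 40 trees is enough for this dataset.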
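The four cells above can be folded into one loop. This is a sketch under assumptions: `random_state=0` is added so that reruns give the same numbers, and the `results` dictionary is an illustrative name, neither of which appears in the original cells.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()

results = {}
for n in [5, 20, 30, 40]:
    # Assumption: a fixed random_state makes the comparison repeatable
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model, digits.data, digits.target, cv=10)
    results[n] = (np.mean(scores), np.std(scores))
    print(f"n_estimators={n}: mean={results[n][0]:.4f}, std={results[n][1]:.4f}")
```

Looking at the standard deviation next to the mean helps to judge whether a small gap, such as the one between 30 and 40 trees, is a real difference or just noise.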

<h2 style='color:purple'>Exercise</h2>

Use the iris flower dataset from the sklearn library and apply cross_val_score to the following
models to measure the performance of each. At the end, figure out which model performs best:
1. Logistic Regression
2. SVM
3. Decision Tree
4. Random Forest