# K Fold cross validation

K-fold cross-validation is a valuable technique for model evaluation and helps in obtaining a more accurate and reliable estimate of a model's performance on unseen data.

We divide the samples into folds. Say we have 100 samples: we split them into 5 folds of 20 samples each. We then train the model on the last 4 folds, test it on the first fold, and note down the score.

Then we use the 2nd fold for testing and the rest of the folds for training, and repeat the process until every fold has served as the test set once. Finally, we average all the scores.
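The splitting procedure described above can be sketched in plain Python. This is only a minimal illustration, not how scikit-learn implements it: the helper name `k_fold_indices` is hypothetical, and it assumes the number of samples divides evenly by the number of folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds.

    Hypothetical helper for illustration; assumes n_samples % k == 0.
    """
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        test = indices[start:stop]                # the i-th fold is held out for testing
        train = indices[:start] + indices[stop:]  # all other folds are used for training
        yield train, test


splits = list(k_fold_indices(100, 5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train={len(train)} samples, test={len(test)} samples")
```

Each of the 5 iterations trains on 80 samples and tests on the remaining 20, and every sample is used for testing exactly once.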

# Importing Libraries

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits

# Loading Data

In [2]:
digits = load_digits()

# Train, Test and Split

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(digits.data, digits.target, test_size=0.3)

# Using Logistic Regression 

In [4]:
lr = LogisticRegression(max_iter=1000, solver='sag')
lr.fit(X_train, Y_train)
print(f"Accuracy from Logistic Regression: {lr.score(X_test, Y_test)*100} %")

Accuracy from Logistic Regression: 96.85185185185186 %


# Using SVM

In [5]:
svm = SVC(gamma='auto')
svm.fit(X_train, Y_train)
print(f"Accuracy from SVM: {svm.score(X_test, Y_test)*100} %")

Accuracy from SVM: 36.851851851851855 %


# Using Random Forest

In [6]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, Y_train)
print(f"Accuracy from Random Forest: {rf.score(X_test, Y_test)*100} %")

Accuracy from Random Forest: 95.74074074074073 %


## Summary 

After trying several classifiers on a single train/test split, we can see that the measured accuracy is not stable. train_test_split picks a random split on every run, so the scores change each time the cells are re-executed. From one split alone we cannot say whether a model's accuracy is consistent, or which model is better.
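To make this variation visible, the sketch below repeats the same split with a few different seeds and collects the scores. The seeds 0 to 4 and the fixed `random_state=0` on the model are assumptions made only for this illustration; the cells above use unseeded splits.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()

split_scores = []
for seed in range(5):  # arbitrary seeds, chosen only for this illustration
    X_tr, X_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed
    )
    # Fix the model's own randomness so that only the split changes between runs.
    model = RandomForestClassifier(n_estimators=40, random_state=0)
    model.fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print(split_scores)
print("spread:", max(split_scores) - min(split_scores))
```

The same model gets a different score on each split, which is why a single split is a weak basis for comparing models.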

# K Fold Validation.

### Basic example using some numbers.

In [7]:
from sklearn.model_selection import KFold
kf = KFold(n_splits = 3) # n_splits means the number of folds('groups')
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [8]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


kf.split([1,2,3,4,5,6,7,8,9]) returns an iterator. Each iteration yields train_index and test_index, which hold positions (indices) in the input, not the values themselves. As we supplied n_splits = 3, it divides the set into 3 folds. In the first iteration, one fold [0,1,2] is used for testing and the other two folds [3,4,5,6,7,8] for training.

In the second iteration, [0,1,2,6,7,8] goes for training and [3,4,5] goes for testing.

The third iteration repeats the pattern with [6,7,8] as the test fold.

That completes the basic example.
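Because the split yields positions rather than values, the indices have to be mapped back to the data. A small sketch of that lookup, reusing the same nine numbers (the variable names `data` and `all_test_indices` are introduced here for illustration):

```python
from sklearn.model_selection import KFold

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kf = KFold(n_splits=3)

all_test_indices = []
for train_index, test_index in kf.split(data):
    # Indices are positions in data; use them to look up the actual values.
    test_values = [data[i] for i in test_index]
    print("test indices:", test_index, "-> test values:", test_values)
    all_test_indices.extend(test_index.tolist())

# Every position is used for testing exactly once across the three folds.
print(sorted(all_test_indices) == list(range(len(data))))
```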

### Creating a reusable function.

This helper fits a model and returns its score on the test data, so we can reuse it for LogisticRegression, SVC, and RandomForestClassifier alike.

In [9]:
def get_score(model, X_train, X_test, Y_train, Y_test):
    model.fit(X_train, Y_train)
    return model.score(X_test, Y_test)

# Using K Fold for digits example.

In [10]:
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

'StratifiedKFold' is similar to KFold, but it is usually a better choice for classification: when separating the folds, it keeps the proportion of each class roughly the same in every fold.
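One way to see the difference is to count how many samples of each digit land in each test fold. The sketch below is an illustration under our own naming (the helper `class_counts_per_test_fold` is hypothetical); it only uses KFold, StratifiedKFold, and np.bincount.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, StratifiedKFold

digits = load_digits()
n_classes = len(np.unique(digits.target))


def class_counts_per_test_fold(splitter):
    """Return an array of shape (n_folds, n_classes) with per-class counts in each test fold."""
    rows = []
    for _, test_index in splitter.split(digits.data, digits.target):
        rows.append(np.bincount(digits.target[test_index], minlength=n_classes))
    return np.array(rows)


plain_counts = class_counts_per_test_fold(KFold(n_splits=3))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold:\n", plain_counts)
print("StratifiedKFold:\n", stratified_counts)
```

Each row is one test fold and each column one digit class. With StratifiedKFold the columns should be nearly constant from row to row.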

In [11]:
scores_svm = []
scores_lr = []
scores_rf = []

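# Note: this loop iterates over kf (the plain 3-fold KFold from In [7]), not folds.
# To use the stratified folds, iterate over folds.split(digits.data, digits.target).
for train_index, test_index in kf.split(digits.data):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    Y_train, Y_test = digits.target[train_index], digits.target[test_index]
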
    scores_lr.append(get_score(LogisticRegression(max_iter=1000, solver='sag'), X_train, X_test, Y_train, Y_test))
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, Y_train, Y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, Y_train, Y_test))
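For comparison, here is a sketch of the same kind of loop driven by the StratifiedKFold object instead. It is not the code that produced the outputs in the following cells. To keep it short it covers only SVC and RandomForestClassifier (LogisticRegression could be added to the dictionary the same way), and the names `model_factories` and `stratified_scores` are introduced here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

digits = load_digits()
folds = StratifiedKFold(n_splits=3)

# Factories so that each fold gets a fresh, unfitted model.
model_factories = {
    "svm": lambda: SVC(gamma="auto"),
    "rf": lambda: RandomForestClassifier(n_estimators=40),
}
stratified_scores = {name: [] for name in model_factories}

# StratifiedKFold needs the labels to balance the classes, so pass digits.target too.
for train_index, test_index in folds.split(digits.data, digits.target):
    X_tr, X_te = digits.data[train_index], digits.data[test_index]
    y_tr, y_te = digits.target[train_index], digits.target[test_index]
    for name, make_model in model_factories.items():
        model = make_model()
        model.fit(X_tr, y_tr)
        stratified_scores[name].append(model.score(X_te, y_te))

print(stratified_scores)
```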

In [12]:
scores_lr

[0.9282136894824707, 0.9432387312186978, 0.9098497495826378]

In [13]:
scores_svm

[0.41068447412353926, 0.41569282136894825, 0.4273789649415693]

In [14]:
scores_rf

[0.9265442404006677, 0.9582637729549248, 0.9181969949916527]
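The three lists are easier to compare once each is reduced to a mean and a spread. A small sketch using only the standard library, with the scores copied from the outputs above (the dictionary name `recorded_scores` is ours):

```python
from statistics import mean, stdev

# Scores copied from the outputs of In [12], In [13] and In [14].
recorded_scores = {
    "lr": [0.9282136894824707, 0.9432387312186978, 0.9098497495826378],
    "svm": [0.41068447412353926, 0.41569282136894825, 0.4273789649415693],
    "rf": [0.9265442404006677, 0.9582637729549248, 0.9181969949916527],
}

for name, scores in recorded_scores.items():
    print(f"{name}: mean={mean(scores):.3f}, std={stdev(scores):.3f}")
```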

Instead of writing so much code by hand, as in the loop above, we can use 'cross_val_score'.

# cross_val_score function

In [15]:
from sklearn.model_selection import cross_val_score
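cross_val_score takes the number of folds through its cv parameter. In recent scikit-learn releases it defaults to 5 folds, and for a classifier an integer cv is turned into a StratifiedKFold. The sketch below checks that equivalence for cv=3. The fixed `random_state=0` is an assumption added only so that both calls train identical forests.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

digits = load_digits()
# random_state is fixed here only so that both calls train identical forests.
model = RandomForestClassifier(n_estimators=40, random_state=0)

scores_int_cv = cross_val_score(model, digits.data, digits.target, cv=3)
scores_explicit_cv = cross_val_score(
    model, digits.data, digits.target, cv=StratifiedKFold(n_splits=3)
)

print(scores_int_cv)
print(scores_explicit_cv)
print(np.allclose(scores_int_cv, scores_explicit_cv))
```

Passing a splitter object instead of an integer is useful when we want control over the folds, for example to shuffle them.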

### Logistic Regression model performance using cross_val_score

In [16]:
cross_val_score(LogisticRegression(max_iter=1000, solver='sag'), digits.data, digits.target)

array([0.91944444, 0.86666667, 0.94428969, 0.93593315, 0.89415042])

### SVM model performance using cross_val_score

In [17]:
cross_val_score(SVC(gamma='auto'), digits.data, digits.target)

array([0.41111111, 0.45      , 0.454039  , 0.44846797, 0.47910864])

### Random Forest model performance using cross_val_score

In [18]:
cross_val_score(RandomForestClassifier(n_estimators=40), digits.data, digits.target)

array([0.94166667, 0.90555556, 0.93871866, 0.95264624, 0.93314763])

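Comparing the mean scores, RandomForestClassifier performs best here (about 0.93), ahead of LogisticRegression (about 0.91) and well ahead of SVC with gamma='auto' (about 0.45).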

# Checking the performance after tuning 

In [19]:
score1 = cross_val_score(RandomForestClassifier(n_estimators=5),digits.data, digits.target, cv=10)
np.average(score1)                       

0.8775729360645561

In [20]:
score2 = cross_val_score(RandomForestClassifier(n_estimators=20),digits.data, digits.target, cv=10)
np.average(score2)                       

0.9337647423960271

In [21]:
score3 = cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target, cv=10)
np.average(score3)                       

0.9465859714463066

In [22]:
score4 = cross_val_score(RandomForestClassifier(n_estimators=80),digits.data, digits.target, cv=10)
np.average(score4)                       

0.9510304158907511
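In the four cells above the mean accuracy rises with n_estimators (about 0.878, 0.934, 0.947 and 0.951), with only a small gain from 40 to 80 trees. The same comparison can be written as one loop. This is a sketch; the names `mean_scores` and `best_n` are introduced here, and the exact numbers will differ from run to run because the forests are not seeded.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()

mean_scores = {}
for n in (5, 20, 40, 80):  # the same values tried in In [19] to In [22]
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=n), digits.data, digits.target, cv=10
    )
    mean_scores[n] = np.average(scores)

for n, score in mean_scores.items():
    print(f"n_estimators={n}: mean accuracy={score:.4f}")

# Pick the setting with the highest mean cross-validated accuracy.
best_n = max(mean_scores, key=mean_scores.get)
print("best n_estimators:", best_n)
```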