# K-Fold Cross Validation

K-fold cross-validation gives a more reliable estimate of how well a model generalizes than a single train/test split, which makes overfitting easier to detect. In this approach we divide the data into K sets (folds). Each fold is held out in turn as the test set while the model is trained on the remaining K-1 folds, so every sample is used for testing exactly once. At the end we take the average of the K scores to get the final metric from the K-fold cross-validation.
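To make the fold bookkeeping concrete, here is a minimal pure-Python sketch of how the sample indices could be partitioned. The helper name `k_fold_indices` is hypothetical and only for illustration; it uses contiguous, unshuffled folds, whereas real data is usually shuffled or stratified first (the iris samples, for example, are ordered by species).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))  # the held-out fold
        train_idx = list(range(0, start)) + list(range(stop, n_samples))  # the other k-1 folds
        yield train_idx, test_idx
        start = stop


# Hypothetical toy case: 10 samples split into 5 folds.
folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each index appears in exactly one test fold, which is what lets the K scores be averaged into a single estimate.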

In [2]:
#import numpy for scientific calculations
import numpy as np

#train_test_split splits arrays or matrices into random train and test subsets; cross_val_score evaluates a model by cross-validation
from sklearn.model_selection import train_test_split, cross_val_score

#embeds some small toy datasets
from sklearn import datasets

#set of supervised learning methods used for classification, regression and outliers detection
from sklearn import svm

#load and return the iris dataset (classification).
iris = datasets.load_iris() 


A single train/test split is made easy with the train_test_split function in the sklearn.model_selection module:

In [3]:
# Split the iris data into train/test data sets with 40% reserved for testing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
#iris.data contains the measurements of each flower
#iris.target is what we want to predict, in this case the species of each flower
#test_size=0.4 reserves 40% of the data for testing and leaves 60% for training
#X_train holds 60% of the measurements, X_test the other 40%
#y_train and y_test hold the species labels for the matching rows

# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='linear', C=2).fit(X_train, y_train)
#The SVC model is fit with a linear kernel using only the training measurements and training species; we call this model clf
# Now measure its performance with the test data
clf.score(X_test, y_test) #score returns the mean accuracy of clf on the held-out test measurements and species

0.9

The printed score is the fraction of test irises whose species the model predicts correctly, based only on the measurements of flowers it has never seen before.

K-Fold cross validation is just as easy; let's use a K of 10:

In [6]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_val_score(clf, iris.data, iris.target, cv=10) #pass the model (clf), the measurements, the species labels, and cv=10 to split the data into 10 folds

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 10 folds:
print(scores.mean())

[1.         1.         1.         1.         0.86666667 1.
 0.93333333 1.         1.         1.        ]
0.9800000000000001


Here we get the accuracy for each fold and then take the average to get the overall accuracy across the 10 folds, which is 98%. Cross-validation does not change the model itself; it gives a more reliable estimate of its accuracy than the single-split score above.
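The mean alone hides how much the folds disagree with each other. As a small sketch, the fold accuracies printed above can be summarized with both mean and spread; the values are copied from that output, and the variable names are only illustrative.

```python
import statistics

# Fold accuracies copied from the 10-fold output printed above.
fold_scores = [1.0, 1.0, 1.0, 1.0, 0.86666667, 1.0, 0.93333333, 1.0, 1.0, 1.0]

mean_score = statistics.mean(fold_scores)
# Population standard deviation, the same convention NumPy's scores.std() uses by default.
std_score = statistics.pstdev(fold_scores)

print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

A large spread relative to the gap between two models is a hint that the difference between them may not be meaningful.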

Our model is even better than we thought! Can we do better? Let's try a different kernel (poly):

This checks whether a linear or a polynomial relation between the iris measurements and the species fits the data better.

In [8]:
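clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train) #rebuild clf, this time with a polynomial kernel
#cross_val_score fits a fresh copy of clf on each fold, so the fit on X_train above does not affect these scores
scores = cross_val_score(clf, iris.data, iris.target, cv=5)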
print(scores)
print(scores.mean())

[1.         1.         0.9        0.93333333 1.        ]
0.9666666666666666




No! The more complex polynomial kernel produced a lower mean accuracy (0.967) than the simple linear kernel (0.98), which suggests the polynomial kernel is overfitting. But we couldn't have told that with a single train/test split:

In [9]:
# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='poly', C=2).fit(X_train, y_train)

# Now measure its performance with the test data
clf.score(X_test, y_test)   



0.9666666666666667

Judged on this single train/test split alone, the polynomial kernel looks no worse than the linear kernel, so only cross-validation exposed the difference.
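The cells above change C and the number of folds along with the kernel, so a like-for-like comparison may be a useful check. The sketch below assumes C=1 and a shared 5-fold shuffled `StratifiedKFold` for both kernels. These settings are illustrative choices rather than the ones used above, and the resulting scores are not shown because they depend on those choices.

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()

# Assumed settings: the same C and the same folds for both kernels,
# so the kernel is the only thing that varies.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for kernel in ("linear", "poly"):
    model = svm.SVC(kernel=kernel, C=1)
    # cross_val_score fits a fresh copy of the model on each training fold.
    scores = cross_val_score(model, iris.data, iris.target, cv=cv)
    results[kernel] = scores
    print(kernel, scores.mean(), scores.std())
```

Holding everything but the kernel fixed makes it easier to attribute any difference in mean accuracy to the kernel itself.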