<a href="https://colab.research.google.com/github/satishgunjal/Machine-Learning-Using-Python/blob/master/12_K_Fold_Cross_Validation/K_Fold_Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# K Fold Cross Validation
1. So far we have split the data once into a training set and a test set to train and evaluate a model. K fold cross validation does not rely on a single fixed split.
2. In K fold cross validation we divide the given dataset into K folds (batches) and run K iterations. In each iteration we use one fold as the test set and the remaining K-1 folds as the training set, and compute the score for each algorithm. At the end we have K scores per algorithm, and their average is the final score.
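
The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how sklearn implements it: `k_fold_scores` and `score_fn` are hypothetical names, and the toy `score_fn` below only returns the fraction of samples used for training, standing in for fitting and scoring a real model.

```python
import numpy as np

def k_fold_scores(n_samples, k, score_fn):
    """Run k iterations; each fold is the test set exactly once (assumes k >= 2)."""
    folds = np.array_split(np.arange(n_samples), k)  # k near-equal index batches
    scores = []
    for i in range(k):
        test_idx = folds[i]
        # All other folds together form the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(score_fn(train_idx, test_idx))
    return scores

# Hypothetical stand-in for "fit on train, score on test"
def toy_score(train_idx, test_idx):
    return len(train_idx) / (len(train_idx) + len(test_idx))

scores = k_fold_scores(n_samples=9, k=3, score_fn=toy_score)
final_score = float(np.mean(scores))  # average of the K scores
print(len(scores), final_score)
```

In real use, `score_fn` would fit a model on the training indices and return its score on the test indices.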

**Problem Statement: Classify the sklearn digits dataset into one of 10 categories (0 to 9). We will use different algorithms and evaluate each algorithm's performance using K Fold Cross Validation.**

Import required libraries

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits

In [5]:
digits = load_digits()
dir(digits)

['DESCR', 'data', 'images', 'target', 'target_names']

## Understanding the digits dataset
1. digits.DESCR > Description of the dataset
2. digits.data > Contains 1797 examples. Since each image is 8x8 pixels, each example has 64 values
3. digits.target > Contains the target value for each example, so it contains 1797 y labels
4. digits.target_names > Contains the name of each target. Since we have 10 possible classes, it contains only 10 names

* Here digits.data is our independent/input/X variable
* And digits.target is our dependent/target/y variable
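
The sizes listed above can be checked directly from the array shapes. The names `X` and `y` are only illustrative shorthand for `digits.data` and `digits.target`.

```python
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

print(X.shape)                   # (1797, 64): one row of 64 pixel values per image
print(y.shape)                   # (1797,): one label per image
print(digits.images.shape)       # (1797, 8, 8): the same pixels kept as 8x8 images
print(len(digits.target_names))  # 10: one name per class
```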

## Let's split the data

In [15]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(digits.data,digits.target,test_size=0.3)

print("len of X_train is %s" % (len(X_train)))
print("len of X_test is %s" % (len(X_test)))
print("len of y_train is %s" % (len(y_train)))
print("len of y_test is %s" % (len(y_test)))

len of X_train is 1257
len of X_test is 540
len of y_train is 1257
len of y_test is 540


## Let's test using LogisticRegression

In [23]:
# Note: we use the 'liblinear' solver instead of the default 'lbfgs', since 'lbfgs' failed to converge
model_lr = LogisticRegression(solver='liblinear')
model_lr.fit(X_train,y_train)
model_lr.score(X_test,y_test)

0.9703703703703703
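
A convergence failure with 'lbfgs' like the one noted in the cell above is often addressed by scaling the features and raising the iteration limit. The sketch below assumes that approach. The pipeline, `max_iter=1000`, and `random_state=0` are illustrative choices not taken from the original notebook, so the resulting score will differ from the one shown above.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
# random_state is an assumption, added only to make the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# Standardize the pixel values, then fit with the default 'lbfgs' solver
model_lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_lr_scaled.fit(X_train, y_train)
lbfgs_score = model_lr_scaled.score(X_test, y_test)
print(lbfgs_score)
```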

## Test using SVM model

In [24]:
model_svm = SVC()
model_svm.fit(X_train,y_train)
model_svm.score(X_test,y_test)

0.9925925925925926

## Test using RandomForest Algorithm

In [26]:
model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)
model_rf.score(X_test,y_test)

0.975925925925926

# Using K Fold Cross Validation

Basic example:

In [0]:
from sklearn.model_selection import KFold

kf = KFold(n_splits = 3) # Here K = 3
kf

In [29]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
  print(train_index,test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


Note that above we use [1,2,3,4,5,6,7,8,9] as a sample dataset. Observe the train and test index values: each iteration holds out a different third of the indices as the test set and trains on the rest.
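
As a quick check of that observation, the sketch below counts how often each sample lands in a test fold and confirms that the train and test indices never overlap. The variable `test_counts` is introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kf = KFold(n_splits=3)

test_counts = np.zeros(len(data), dtype=int)
for train_index, test_index in kf.split(data):
    # A sample is never in both the train and test set of the same iteration
    assert set(train_index).isdisjoint(test_index)
    test_counts[test_index] += 1

print(test_counts)  # [1 1 1 1 1 1 1 1 1]: every sample is tested exactly once
```

By default `KFold` does not shuffle, which is why the test folds are contiguous blocks. Passing `shuffle=True` (optionally with `random_state`) randomizes which samples end up in each fold.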


## Now let's use K Fold on our digits dataset



In [0]:
# Function to get the score for a model
def get_score(model, X_train, X_test, y_train, y_test):
  model.fit(X_train,y_train)
  return model.score(X_test,y_test)

In [48]:
from sklearn.model_selection import StratifiedKFold

folds = StratifiedKFold(n_splits = 3) # means K = 3

# Store the score for each model
score_lr = []
score_svm = []
score_rf = []

# Since K = 3 , for loop will run for 3 iterations
for train_index, test_index in folds.split(digits.data, digits.target):
  X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], digits.target[train_index], digits.target[test_index]

  score_lr.append(get_score(LogisticRegression(solver='liblinear') , X_train, X_test, y_train, y_test))
  score_svm.append(get_score(SVC() , X_train, X_test, y_train, y_test))
  score_rf.append(get_score(RandomForestClassifier() , X_train, X_test, y_train, y_test))

print("LogisticRegression scores are %s" % (score_lr))
print("SVM scores are %s" % (score_svm))
print("Random Forest scores are %s" % (score_rf))

LogisticRegression scores are [0.8948247078464107, 0.9532554257095158, 0.9098497495826378]
SVM scores are [0.9649415692821369, 0.9799666110183639, 0.9649415692821369]
Random Forest scores are [0.9415692821368948, 0.9565943238731218, 0.9265442404006677]


## Using Sklearn cross_val_score function
**NOTE: For classifiers, sklearn's cross_val_score uses StratifiedKFold by default when cv is an integer**
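
To see what stratification changes, the sketch below compares the per-class counts in each test fold for KFold and StratifiedKFold. The label array `y_toy` (six samples of class 0 followed by three of class 1) is a made-up example, chosen because sorted labels make the difference obvious.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels, sorted by class on purpose
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

def class_counts_per_fold(splitter):
    """Return [count of class 0, count of class 1] for each test fold."""
    return [
        np.bincount(y_toy[test_index], minlength=2).tolist()
        for _, test_index in splitter.split(X_toy, y_toy)
    ]

kfold_counts = class_counts_per_fold(KFold(n_splits=3))
stratified_counts = class_counts_per_fold(StratifiedKFold(n_splits=3))

print(kfold_counts)       # [[3, 0], [3, 0], [0, 3]]: folds contain a single class
print(stratified_counts)  # [[2, 1], [2, 1], [2, 1]]: class ratio preserved per fold
```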

Instead of writing the fold loop by hand, we can use sklearn's 'cross_val_score' function

In [0]:
from sklearn.model_selection import cross_val_score

### Let's test the LogisticRegression model performance using cross_val_score

In [54]:
cross_val_score(LogisticRegression(solver='liblinear'), digits.data, digits.target,cv=3)

array([0.89482471, 0.95325543, 0.90984975])

### Let's test the SVM model performance using cross_val_score

In [55]:
cross_val_score(SVC(),digits.data,digits.target,cv=3)

array([0.96494157, 0.97996661, 0.96494157])

### Let's test the RandomForest model performance using cross_val_score

In [62]:
cross_val_score(RandomForestClassifier(),digits.data,digits.target,cv=3)

array([0.93656093, 0.95993322, 0.93489149])
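
Since the final K fold score is the average of the per-fold scores, the three models can be compared on their mean score. The sketch below is one way to do that and makes two assumptions not in the original cells: LogisticRegression is wrapped in `OneVsRestClassifier`, which makes explicit the one-vs-rest scheme that 'liblinear' applies to multiclass problems, and `random_state=0` is set on RandomForest for repeatability. Exact values will vary with the sklearn version.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

digits = load_digits()

models = {
    # Explicit one-vs-rest wrapper around the 'liblinear' solver (assumption)
    "LogisticRegression": OneVsRestClassifier(LogisticRegression(solver="liblinear")),
    "SVM": SVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, digits.data, digits.target, cv=3)
    mean_scores[name] = float(np.mean(scores))  # average of the 3 fold scores
    print(f"{name}: mean score = {mean_scores[name]:.4f}")
```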

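## Fine Tuning
From the above results, SVM gave the highest score on every fold, with RandomForest second. RandomForest exposes the 'n_estimators' parameter, so we use it here to demonstrate fine tuning.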

**NOTE: Default value of 'n_estimators' is '100' for RandomForest algorithm**

Let's fine tune RandomForest by varying 'n_estimators':

In [63]:
# Try 1: Using n_estimators=5 and cv=10
scores1 = cross_val_score(RandomForestClassifier(n_estimators=5),digits.data, digits.target, cv=10)
np.average(scores1)

0.8653258845437616

In [64]:
# Try 2: Using n_estimators=20 and cv=10
scores2 = cross_val_score(RandomForestClassifier(n_estimators=20),digits.data, digits.target, cv=10)
np.average(scores2)

0.9337771570453134

In [65]:
# Try 3: Using n_estimators=30 and cv=10
scores3 = cross_val_score(RandomForestClassifier(n_estimators=30),digits.data, digits.target, cv=10)
np.average(scores3)

0.9460117939168218

In [66]:
# Try 4: Using n_estimators=40 and cv=10
scores4 = cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target, cv=10)
np.average(scores4)

0.9449037864680323
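
The four tries above repeat the same call with a different 'n_estimators', so they can be folded into a single loop. This is a sketch under one added assumption: `random_state=0` is set so that repeated runs give the same averages, which the original cells do not do. The names `results` and `best_n` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()

results = {}
for n in [5, 20, 30, 40]:
    # random_state is an assumption, added for repeatable scores
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model, digits.data, digits.target, cv=10)
    results[n] = float(np.average(scores))
    print(f"n_estimators={n}: average score = {results[n]:.4f}")

# Value of n_estimators with the highest average score among those tried
best_n = max(results, key=results.get)
print("best n_estimators:", best_n)
```

Because each forest is randomized, small differences between nearby averages (such as the 30 and 40 tree runs above) may come from chance rather than a real effect of the parameter.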