<h1 style='color:blue;' align='center'>KFold Cross Validation Python Tutorial</h1>

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()

# The dataset is loaded, but we don't yet know which algorithm works best for it.
# So we first split the data into training and test sets, then train and test each model to compare their performance.

In [2]:
# dividing the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

**Logistic Regression**

In [22]:
lr = LogisticRegression(solver='liblinear',multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

# Logistic regression on our data gives a score of about 0.91

0.9115191986644408

**SVM**

In [23]:
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

# SVM on our data gives a score of about 0.43

0.4273789649415693

**Random Forest**

In [24]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

# Random forest on our data gives a score of about 0.92

# The problem with this approach is that the train/test split is different on every run, so each model's score changes too.
# For example, the SVM score is about 0.43 here, but rerunning gave about 0.64.
# So we will use K-fold cross validation to decide which model suits our data best.

0.9181969949916527
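
The run-to-run variation described above can be made visible by repeating the split with different seeds. The following is a minimal sketch, not part of the original notebook: the seed values and the `random_state=0` on the forest are assumptions chosen only so that the split is the one thing that changes between runs.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()

# Illustrative seeds (an assumption); each one produces a different train/test split.
seeds = [0, 1, 2, 3, 4]
split_scores = []
for seed in seeds:
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed
    )
    # The model seed is fixed so that only the split varies between runs.
    model = RandomForestClassifier(n_estimators=40, random_state=0)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print(split_scores)
print("spread:", max(split_scores) - min(split_scores))
```

A single score from one split is therefore only one sample of the model's performance, which is the motivation for averaging over several folds.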

<h2 style='color:purple'>KFold cross validation</h2>

**Basic example**

In [6]:
# We will divide the data into 3 folds (sets)
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [7]:
# [1,2,3,4,5,6,7,8,9] is the dataset we pass to KFold. Since it makes 3 folds, the output has 3 rows.
# Making k folds means training and testing on the dataset k times, then averaging the scores.
# Note that split() yields index positions, not the values themselves.
# 1st iteration: train on indices [3 4 5 6 7 8], test on [0 1 2]
# 2nd iteration: train on indices [0 1 2 6 7 8], test on [3 4 5]
# 3rd iteration: train on indices [0 1 2 3 4 5], test on [6 7 8]
# So we train and test 3 times, and the 3 scores are then averaged.


for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
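
To see where these index sets come from, the same folds can be rebuilt by hand. This is a rough sketch that assumes the default `shuffle=False`; the names `manual_splits` and `kfold_splits` are made up for illustration. It cuts the index range into 3 contiguous test blocks and uses everything else for training.

```python
import numpy as np
from sklearn.model_selection import KFold

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
indices = np.arange(len(data))

# Cut the index range into 3 contiguous blocks; each block is the test set once.
manual_splits = []
for test_idx in np.array_split(indices, 3):
    train_idx = np.setdiff1d(indices, test_idx)  # every index not in the test block
    manual_splits.append((train_idx, test_idx))
    print(train_idx, test_idx)

# KFold without shuffling, for comparison with the hand-built folds.
kfold_splits = list(KFold(n_splits=3).split(data))
```

Every sample lands in the test set exactly once across the 3 iterations, so each one contributes to the averaged score.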


**Use KFold for our digits example**

In [8]:
# a function which takes model, data as input and returns the score
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [25]:
# The above was just an example of how K-fold works, using only 9 entries [1,2,3,4,5,6,7,8,9]. Now we do the same on our real dataset.
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

scores_logistic = []  # to store score of logistic regression
scores_svm = []   # to store score of svm
scores_rf = []     # to store score of random forest

# This loop runs k times (once per fold), scores each model, and appends the score to that model's list

for train_index, test_index in folds.split(digits.data,digits.target):
    X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], \
                                       digits.target[train_index], digits.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))  
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

In [26]:
# scores of logistic regression 
scores_logistic

[0.8953488372093024, 0.9499165275459098, 0.9093959731543624]

In [27]:
# scores of svm
scores_svm

[0.39368770764119604, 0.41068447412353926, 0.4597315436241611]

In [28]:
# scores of random forest
# Now average each list to see which model is best.
# Here we had to write the helper function and the k-fold loop ourselves, but the cross_val_score function does all of this for us.
scores_rf

[0.9285714285714286, 0.9515859766277128, 0.9295302013422819]
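
The averaging step mentioned in the comments can be sketched as follows. The per-fold values are copied from the outputs above; the dictionary and its labels are only illustrative names.

```python
import numpy as np

# Per-fold scores copied from the cell outputs above.
scores_logistic = [0.8953488372093024, 0.9499165275459098, 0.9093959731543624]
scores_svm = [0.39368770764119604, 0.41068447412353926, 0.4597315436241611]
scores_rf = [0.9285714285714286, 0.9515859766277128, 0.9295302013422819]

mean_scores = {
    "logistic regression": np.mean(scores_logistic),
    "svm": np.mean(scores_svm),
    "random forest": np.mean(scores_rf),
}
for name, mean in mean_scores.items():
    print(f"{name}: {mean:.3f}")

# Pick the model with the highest mean score across folds.
best_model = max(mean_scores, key=mean_scores.get)
print("best:", best_model)
```

For these particular folds, random forest has the highest mean (about 0.937), followed by logistic regression (about 0.918) and SVM (about 0.421).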

<h2 style='color:purple'>cross_val_score function</h2>

In [29]:
from sklearn.model_selection import cross_val_score

**Logistic regression model performance using cross_val_score**

In [31]:
# scores of logistic regression 

cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), digits.data, digits.target,cv=3)

array([0.89534884, 0.94991653, 0.90939597])

**svm model performance using cross_val_score**

In [32]:
# scores of svm

cross_val_score(SVC(gamma='auto'), digits.data, digits.target,cv=3)

array([0.39368771, 0.41068447, 0.45973154])

**random forest performance using cross_val_score**

In [33]:
# scores of random forest

cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target,cv=3)

array([0.93521595, 0.94156928, 0.93288591])

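cross_val_score uses stratified k-fold by default when the estimator is a classifier and cv is an integer. This is why the logistic regression and SVM scores here match the manual StratifiedKFold loop above; the random forest scores differ only because the forest itself is randomized.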
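
One way to check this behaviour is to pass an explicit `StratifiedKFold` and compare it with the integer `cv`. The sketch below uses a decision tree with a fixed `random_state` as a stand-in model (an assumption, chosen only because it is deterministic, so any difference would have to come from the folds).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

# Deterministic stand-in model: same training data gives the same tree.
model = DecisionTreeClassifier(random_state=0)

# Integer cv: the splitter is chosen by cross_val_score.
scores_int = cross_val_score(model, digits.data, digits.target, cv=3)

# Explicit stratified splitter with the same number of folds.
scores_explicit = cross_val_score(
    model, digits.data, digits.target, cv=StratifiedKFold(n_splits=3)
)

print(scores_int)
print(scores_explicit)
print(np.allclose(scores_int, scores_explicit))
```

Stratification keeps the class proportions in every fold close to those of the full dataset, which matters most when some classes are rare.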

<h2 style='color:purple'>Parameter tuning using k fold cross validation</h2>

In [34]:
# Earlier we used k-fold to find which model suits our data. We can also use it to find which parameter values suit a given model.
# This is called parameter tuning: we fix one model, here random forest, and decide which parameter values to choose.
# Parameters and features are different things.
# A random forest builds x decision trees, each trained on a random sample of the data, and lets all x trees predict;
# the class that the majority of trees vote for is the output.
# So for random forest, one parameter is x (the number of decision trees, n_estimators).
# For linear regression trained with gradient descent, a parameter is alpha (the step size).

# Here we use 5 trees and check the score
scores1 = cross_val_score(RandomForestClassifier(n_estimators=5),digits.data, digits.target, cv=10)
np.average(scores1)

0.8793698637138034

In [35]:
# Here we use 20 trees and check the score
scores2 = cross_val_score(RandomForestClassifier(n_estimators=20),digits.data, digits.target, cv=10)
np.average(scores2)

0.9358915527370766

In [36]:
# Here we use 30 trees and check the score
scores3 = cross_val_score(RandomForestClassifier(n_estimators=30),digits.data, digits.target, cv=10)
np.average(scores3)

0.9494121957589801

In [37]:
scores4 = cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target, cv=10)
np.average(scores4)

0.9482877777434258

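Here we used cross_val_score to fine-tune our random forest classifier and found that, among the values tried, around 30 to 40 trees gives the best result (about 0.949 with 30 trees and 0.948 with 40). The gap between those two is small enough to change from run to run.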
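
The four cells above can be folded into a single loop. This is a minimal sketch rather than the notebook's own code: the `random_state=0` is an added assumption to make reruns repeatable, so the numbers it prints will not exactly match the outputs shown above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()

# Try each candidate number of trees and record the mean 10-fold score.
mean_scores = {}
for n_trees in [5, 20, 30, 40]:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(model, digits.data, digits.target, cv=10)
    mean_scores[n_trees] = np.mean(scores)
    print(n_trees, round(mean_scores[n_trees], 4))

# Candidate with the highest mean score.
best_n = max(mean_scores, key=mean_scores.get)
print("best n_estimators:", best_n)
```

scikit-learn also provides `GridSearchCV` in `sklearn.model_selection`, which automates this kind of search over a parameter grid.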

<h2 style='color:purple'>Exercise</h2>

Use the iris flower dataset from the sklearn library and apply cross_val_score to the following
models to measure the performance of each. In the end, figure out which model performs best.
1. Logistic Regression
2. SVM
3. Decision Tree
4. Random Forest