## Introduction to K Fold Cross Validation

Link to the YouTube video tutorial: https://www.youtube.com/watch?v=gJo0uNL-5Qw&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=13

**Ways to train your machine learning model: K-fold cross validation technique**  <br />
Theory behind k-fold cross validation:
1) First, divide the dataset into folds (groups). In this example, a dataset of 100 samples is divided into 5 folds of 20 samples each (100/5 = 20). Then run several iterations. In the 1st iteration, folds 2 to 5 form the train set used to train the model, and fold 1 is the test set used to evaluate it. Note down the score.  <br />
<img src="hidden\cv1.png" alt="This image describes k-fold cross validation technique, part 1" style="width: 400px;"/>  <br />

2) In the 2nd iteration, folds 1, 3, 4, and 5 form the train set, and fold 2 is the test set. Again, note down the score.  <br />
<img src="hidden\cv2.png" alt="This image describes k-fold cross validation technique, part 2" style="width: 400px;"/>  <br />

3) Repeat the process until the last iteration, where fold 5 is the test set and the remaining folds form the train set.  <br />
<img src="hidden\cv3.png" alt="This image describes k-fold cross validation technique, part 3" style="width: 400px;"/>  <br />

4) Once you have the score from each iteration, average them.  <br />
<img src="hidden\cv4.png" alt="This image describes k-fold cross validation technique, part 4" style="width: 400px;"/>  <br />

5) K-fold cross validation works well because every sample is used for testing exactly once and for training in the other iterations, so the averaged score depends less on one particular train/test split.

6) For a more detailed explanation with visualizations of k-fold cross validation, see: https://www.statology.org/k-fold-cross-validation/
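
Steps 1 to 3 above can be sketched with plain NumPy. This is only an illustration of how the 100 sample indices fall into 5 folds; the names `n_samples`, `n_folds`, `folds`, and `split_sizes` are made up for this sketch, and no model is trained.

```python
import numpy as np

n_samples, n_folds = 100, 5                 # same numbers as the example above
indices = np.arange(n_samples)              # sample indices 0..99
folds = np.array_split(indices, n_folds)    # 5 consecutive folds of 20 indices each

split_sizes = []
for i, test_idx in enumerate(folds):
    # every fold except fold i forms the train set for this iteration
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    split_sizes.append((len(train_idx), len(test_idx)))
    print(f"iteration {i + 1}: test fold = indices {test_idx[0]}..{test_idx[-1]}, "
          f"train on {len(train_idx)} samples, test on {len(test_idx)} samples")
```

Each iteration trains on 80 samples and tests on the remaining 20, and every sample ends up in the test set exactly once.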

Note:
A classifier is a machine learning model that tries to assign each sample to a class (category).

### Load the dataset

In [218]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

import numpy as np
from sklearn.datasets import load_digits

# load the dataset
digits = load_digits()

### Additional steps

###### Encode the dependent variable
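Encode the dependent variable from categorical names (text labels) into integer labels using a label encoder. StratifiedKFold needs the dependent variable to identify the classes, but it does not strictly require integer labels, and digits.target already contains integers, so this step is shown here mainly for completeness.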

In [219]:
from sklearn.preprocessing import LabelEncoder

# create a label encoder called le_DependentVariable
le_DependentVariable = LabelEncoder() 
# encode the data available in the attribute of target of the dataset into integer labels. Then, save the outputs to variable called Y_encoded
Y_encoded = le_DependentVariable.fit_transform(digits.target)

import pandas as pd
# show the encoded values for each data available in the attribute of target of the dataset, in the format of dataframe and manually set the column name as target.
pd.DataFrame(Y_encoded,columns=['target'])

Unnamed: 0,target
0,0
1,1
2,2
3,3
4,4
...,...
1792,9
1793,0
1794,8
1795,9
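
Because digits.target is already numeric, the effect of the label encoder is easier to see on text labels. The sketch below uses a small made-up list of labels (`labels` is not part of the digits dataset) to show what `fit_transform`, `classes_`, and `inverse_transform` do.

```python
from sklearn.preprocessing import LabelEncoder

# made-up text labels, used only for illustration
labels = ["cat", "dog", "bird", "dog", "cat"]

le = LabelEncoder()
encoded = le.fit_transform(labels)        # learn the classes and encode in one step
decoded = le.inverse_transform(encoded)   # map the integers back to the text labels

print(le.classes_)   # the classes in sorted order: bird, cat, dog
print(encoded)       # integer label of each sample: 1, 2, 0, 2, 1
print(decoded)
```

The integer assigned to a class is its position in the sorted `classes_` array.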


Verify the encoded dependent variable  <br />
Check Y_encoded against digits.target. Because digits.target already holds integer labels, the two should be identical.

In [220]:
# show the data available in the attribute of target of the dataset, in the format of dataframe & its column name is manually set as target
pd.DataFrame(digits.target,columns=['target'])

Unnamed: 0,target
0,0
1,1
2,2
3,3
4,4
...,...
1792,9
1793,0
1794,8
1795,9
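
Comparing two truncated dataframes by eye only checks the rows that are displayed. A programmatic check over all samples could look like the following sketch, which reloads the dataset so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import LabelEncoder

digits = load_digits()
Y_encoded = LabelEncoder().fit_transform(digits.target)

# True only if every encoded label equals the original label
labels_match = np.array_equal(Y_encoded, digits.target)
print(labels_match)
```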


#### Define a function that trains a machine learning model on the given train set and returns its performance metric on the given test set

In [221]:
# write a generic method (self-defined function) called get_score
# the get_score takes model, train set, and test set as inputs
def get_score(model, X_train, X_test, Y_train, Y_test):
    # train the model
    model.fit(X_train,Y_train) 
    # return/provide the score of the trained model
    return model.score(X_test, Y_test) 

#### Introduction to KFold concept
A simple example to introduce & explain KFold function

In [222]:
from sklearn.model_selection import KFold

# create a KFold object that will divide a dataset into 3 folds/groups (n_splits=3)
kf = KFold(n_splits=3)

for train_index, test_index in kf.split([21,22,23,24,25,26,27,28,29]):
    print(train_index,test_index)

'''
In this example:
[21,22,23,24,25,26,27,28,29] is the dataset passed to KFold (usually the independent variables (features) are passed as the dataset).
The dataset has 9 samples, so their indices (row numbers) are 0,1,2,3,4,5,6,7,8.
KFold splits the dataset into 3 folds (9 samples in dataset / 3 folds = 3 samples per fold):
1st fold -> indices [0,1,2]
2nd fold -> indices [3,4,5]
3rd fold -> indices [6,7,8]
At each iteration of the for loop, kf.split yields one pair of index arrays:
train_index holds the indices of the samples used to train the model (train set) in that iteration,
test_index holds the indices of the samples used to test the model (test set) in the same iteration.
Since KFold splits the dataset into 3 folds, the for loop runs 3 times and prints 3 such pairs (one line per iteration).
'''

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
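
The printed arrays are indices, not samples. A minimal sketch of how they map back to the values, assuming the list is first wrapped in a NumPy array so it can be indexed with an index array (`data` and `splits` are names introduced here):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.array([21, 22, 23, 24, 25, 26, 27, 28, 29])
kf = KFold(n_splits=3)

splits = []
for train_index, test_index in kf.split(data):
    # index the array with the index arrays to get the actual samples
    splits.append((data[train_index], data[test_index]))
    print("train values:", data[train_index], "test values:", data[test_index])

# Optional variant: shuffle the samples before forming the folds.
# random_state=0 is an arbitrary seed chosen to make the shuffle repeatable.
kf_shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
for train_index, test_index in kf_shuffled.split(data):
    print("shuffled test values:", data[test_index])
```

Without shuffling, the folds are consecutive blocks, so the first test fold holds the values 21, 22, 23.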



### Develop machine learning models without cross validation

##### Data preprocessing
Split the dataset into train set and test set using train_test_split

In [223]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(digits.data, digits.target, test_size=0.3)
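
A score from a single train/test split depends on which samples happen to land in the test set, which is the motivation for cross validation. The sketch below repeats the split with three arbitrary seeds (the `random_state` values 0, 1, 2 are assumptions for illustration) and scores a default SVC each time.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()

split_scores = []
for seed in (0, 1, 2):  # arbitrary seeds, one split per seed
    X_train, X_test, Y_train, Y_test = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed)
    # train on this particular split and record the accuracy on its test set
    split_scores.append(SVC().fit(X_train, Y_train).score(X_test, Y_test))

print(split_scores)
```

The scores usually differ slightly from seed to seed, so a single number from one split is a noisy estimate of model performance.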

##### Quick way of measuring the performance of different machine learning models on the same train and test sets:

###### (Logistic regression model) Create, train, and calculate the performance metric of logistic regression model in separate lines

In [224]:
# Logistic regression model

# create a logistic regression model
logistic_regression = LogisticRegression(max_iter=1000) # specify the max_iter parameter to avoid the warning
# train the logistic regression model 
logistic_regression.fit(X_train,Y_train)
# evaluate the performance of the trained logistic regression model 
lr_sl = logistic_regression.score(X_test,Y_test)

print('Without cross validation, without involving self-defined function, the performance metric of the logistic regression model: ',lr_sl)

Without cross validation, without involving self-defined function, the performance metric of the logistic regression model:  0.9703703703703703


###### (Logistic regression model) Create, train, and calculate the performance metric of logistic regression model using a single line (using self-defined function, get_score)

In [225]:
# Logistic regression model
lr_sdf = get_score(LogisticRegression(max_iter=1000), X_train, X_test, Y_train, Y_test)

print('Without cross validation, by involving self-defined function, the performance metric of the logistic regression model: ',lr_sdf)

Without cross validation, by involving self-defined function, the performance metric of the logistic regression model:  0.9703703703703703


###### (SVM model) Create, train, and calculate the performance metric of SVM model in separate lines

In [226]:
# SVM model

# create a SVM model
svm = SVC()
# train the SVM model 
svm.fit(X_train,Y_train)
# evaluate the performance of the trained SVM model 
svm_sl = svm.score(X_test,Y_test)

print('Without cross validation, without involving self-defined function, the performance metric of the SVM model: ',svm_sl)

Without cross validation, without involving self-defined function, the performance metric of the SVM model:  0.9888888888888889


###### (SVM model) Create, train, and calculate the performance metric of SVM model using a single line (using self-defined function, get_score)

In [227]:
# SVM model
svm_sdf = get_score(SVC(), X_train, X_test, Y_train, Y_test)

print('Without cross validation, by involving self-defined function, the performance metric of the SVM model: ',svm_sdf)

Without cross validation, by involving self-defined function, the performance metric of the SVM model:  0.9888888888888889


###### (Random forest model) Create, train, and calculate the performance metric of random forest model in separate lines

In [228]:
# Random forest model

# create a random forest model
random_forest = RandomForestClassifier()
# train the random forest model 
random_forest.fit(X_train,Y_train)
# evaluate the performance of the trained random forest model 
rf_sl = random_forest.score(X_test,Y_test)

print('Without cross validation, without involving self-defined function, the performance metric of the random forest model: ',rf_sl)

Without cross validation, without involving self-defined function, the performance metric of the random forest model:  0.9814814814814815


###### (Random forest model) Create, train, and calculate the performance metric of random forest model using a single line (using self-defined function, get_score)

In [229]:
# Random forest model
rf_sdf = get_score(RandomForestClassifier(), X_train, X_test, Y_train, Y_test)

print('Without cross validation, by involving self-defined function, the performance metric of the random forest model: ',rf_sdf)

Without cross validation, by involving self-defined function, the performance metric of the random forest model:  0.9796296296296296


### Develop machine learning models with cross validation using different methods

#### Cross validation with KFold to split a dataset into train and test sets + self-defined function to create, train, and calculate the performance metric of a machine learning model

In [230]:
# create an empty variable to store the performance metric of logistic regression model for different folds of dataset as test set later
scores_logis_kf = [] 
# create an empty variable to store the performance metric of SVM model for different folds of dataset as test set later
scores_svm_kf = []
# create an empty variable to store the performance metric of random forest model for different folds of dataset as test set later
scores_randforest_kf = []

## Divide/split the dataset into N folds using KFold
# KFold only needs the number of samples in the dataset, so either the independent variables or the dependent variable can be passed to split()
from sklearn.model_selection import KFold

# create KFold object that will divide/split a dataset into 3 folds (n_splits=3)
kf = KFold(n_splits=3)

# split the dataset into 3 folds (the for loop will iterate 3 times)
for train_index_kf, test_index_kf in kf.split(digits.data):
    # At each iteration of the for loop, kf.split yields one pair of index arrays.
    # train_index_kf holds the indices (row numbers) of the samples used to train the model (train set) in this iteration,
    # test_index_kf holds the indices (row numbers) of the samples used to test the model (test set) in the same iteration.
    # Use the indices in train_index_kf to select the train samples of the independent and dependent variables
    # and store them in X_train_kf and Y_train_kf respectively. Do the same with test_index_kf for X_test_kf and Y_test_kf.
    X_train_kf, X_test_kf, Y_train_kf, Y_test_kf = digits.data[train_index_kf], digits.data[test_index_kf], \
                                                   Y_encoded[train_index_kf], Y_encoded[test_index_kf]
    
    '''
    Create, train, and calculate the performance metric of logistic regression model using a single line (using self-defined function, get_score). 
    Then, append the performance metric of logistic regression model for different folds of dataset as test set to scores_logis_kf variable.
    '''
    scores_logis_kf.append(get_score(LogisticRegression(max_iter=1000),X_train_kf,X_test_kf,Y_train_kf,Y_test_kf))
    '''
    Create, train, and calculate the performance metric of SVM model using a single line (using self-defined function, get_score). 
    Then, append the performance metric of SVM model for different folds of dataset as test set to scores_svm_kf variable.
    '''
    scores_svm_kf.append(get_score(SVC(),X_train_kf,X_test_kf,Y_train_kf,Y_test_kf))
    '''
    Create, train, and calculate the performance metric of random forest model using a single line (using self-defined function, get_score). 
    Then, append the performance metric of random forest model for different folds of dataset as test set to scores_randforest_kf variable.
    '''
    scores_randforest_kf.append(get_score(RandomForestClassifier(),X_train_kf,X_test_kf,Y_train_kf,Y_test_kf))

# show the performance metric of different machine learning models
print('Using KFold to split dataset into N folds, the performance metric of different machine learning models:\nLogistic regression model\t:'\
      +str(scores_logis_kf)+'\nSVM model\t\t\t:'+str(scores_svm_kf)+'\nRandom forest model\t\t:'+str(scores_randforest_kf))

Using KFold to split dataset into N folds, the performance metric of different machine learning models:
Logistic regression model	:[0.9248747913188647, 0.9432387312186978, 0.9148580968280468]
SVM model			:[0.9666110183639399, 0.9816360601001669, 0.9549248747913188]
Random forest model		:[0.9382303839732888, 0.9565943238731218, 0.9282136894824707]
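
Step 4 of the theory, averaging the per-fold scores, is not shown in the cell above. A minimal sketch using the scores copied from the printed output (a fresh run will give slightly different random forest numbers, since that model is randomized):

```python
import numpy as np

# per-fold scores copied from the output above
scores_logis_kf = [0.9248747913188647, 0.9432387312186978, 0.9148580968280468]
scores_svm_kf = [0.9666110183639399, 0.9816360601001669, 0.9549248747913188]
scores_randforest_kf = [0.9382303839732888, 0.9565943238731218, 0.9282136894824707]

mean_scores = {
    "Logistic regression": float(np.mean(scores_logis_kf)),
    "SVM": float(np.mean(scores_svm_kf)),
    "Random forest": float(np.mean(scores_randforest_kf)),
}

for name, mean in mean_scores.items():
    print(f"{name}: mean score over 3 folds = {mean:.4f}")
```

On these numbers the SVM has the highest average, followed by the random forest and then logistic regression.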


#### Cross validation with Stratified KFold to split a dataset into train and test sets + self-defined function to create, train, and calculate the performance metric of a machine learning model

Parameters of cross_val_score:
<img src="hidden\crossvalscore.png" alt="This image describes cross_val_score parameters" style="width: 400px;"/>  <br />

In [231]:
# create an empty variable to store the performance metric of logistic regression model for different folds of dataset as test set later
scores_logis_kf_stratified = [] 
# create an empty variable to store the performance metric of SVM model for different folds of dataset as test set later
scores_svm_kf_stratified = []
# create an empty variable to store the performance metric of random forest model for different folds of dataset as test set later
scores_randforest_kf_stratified = []

## Divide/split the dataset into N folds using StratifiedKFold
# Advantage of StratifiedKFold over KFold: each fold keeps roughly the same proportion of every class as the full dataset, which usually makes it the better choice for classification.
# StratifiedKFold therefore needs the dependent variable (class labels) as a second input to split(); the labels do not have to be integer-encoded.
from sklearn.model_selection import StratifiedKFold

# create a StratifiedKFold object that will divide/split a dataset into 3 folds (n_splits=3)
kf_stratified = StratifiedKFold(n_splits=3) 

# split the dataset into 3 folds (the for loop will iterate 3 times)
for train_index_kf_stratified, test_index_kf_stratified in kf_stratified.split(digits.data,Y_encoded):
    # At each iteration of the for loop, kf_stratified.split yields one pair of index arrays.
    # train_index_kf_stratified holds the indices (row numbers) of the samples used to train the model (train set) in this iteration,
    # test_index_kf_stratified holds the indices (row numbers) of the samples used to test the model (test set) in the same iteration.
    # Use the indices in train_index_kf_stratified to select the train samples of the independent and dependent variables
    # and store them in X_train_kf_stratified and Y_train_kf_stratified respectively. Do the same with test_index_kf_stratified for X_test_kf_stratified and Y_test_kf_stratified.
    X_train_kf_stratified, X_test_kf_stratified, Y_train_kf_stratified, Y_test_kf_stratified = digits.data[train_index_kf_stratified], digits.data[test_index_kf_stratified], \
                                                                                                Y_encoded[train_index_kf_stratified], Y_encoded[test_index_kf_stratified]
    
    '''
    Create, train, and calculate the performance metric of logistic regression model using a single line (using self-defined function, get_score). 
    Then, append the performance metric of logistic regression model for different folds of dataset as test set to scores_logis_kf_stratified variable.
    '''
    scores_logis_kf_stratified.append(get_score(LogisticRegression(max_iter=1000),X_train_kf_stratified,X_test_kf_stratified,Y_train_kf_stratified,Y_test_kf_stratified))
    '''
    Create, train, and calculate the performance metric of SVM model using a single line (using self-defined function, get_score). 
    Then, append the performance metric of SVM model for different folds of dataset as test set to scores_svm_kf_stratified variable.
    '''
    scores_svm_kf_stratified.append(get_score(SVC(),X_train_kf_stratified,X_test_kf_stratified,Y_train_kf_stratified,Y_test_kf_stratified))
    '''
    Create, train, and calculate the performance metric of random forest model using a single line (using self-defined function, get_score). 
    Then, append the performance metric of random forest model for different folds of dataset as test set to scores_randforest_kf_stratified variable.
    '''
    scores_randforest_kf_stratified.append(get_score(RandomForestClassifier(),X_train_kf_stratified,X_test_kf_stratified,Y_train_kf_stratified,Y_test_kf_stratified))

# show the performance metric of different machine learning models
print('Using StratifiedKFold to split dataset into N folds, the performance metric of different machine learning models:\nLogistic regression model\t:'\
      +str(scores_logis_kf_stratified)+'\nSVM model\t\t\t:'+str(scores_svm_kf_stratified)+'\nRandom forest model\t\t:'+str(scores_randforest_kf_stratified))

Using StratifiedKFold to split dataset into N folds, the performance metric of different machine learning models:
Logistic regression model	:[0.9198664440734557, 0.9415692821368948, 0.9165275459098498]
SVM model			:[0.9649415692821369, 0.9799666110183639, 0.9649415692821369]
Random forest model		:[0.9382303839732888, 0.9649415692821369, 0.9181969949916527]
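
To see what stratification changes, one can count how many samples of each digit land in every test fold. The helper `class_counts_per_test_fold` below is a hypothetical name introduced for this sketch; it is not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, StratifiedKFold

digits = load_digits()
X, y = digits.data, digits.target

def class_counts_per_test_fold(splitter):
    """Return one row per fold: the number of test samples of each digit (0-9)."""
    return np.array([np.bincount(y[test_idx], minlength=10)
                     for _, test_idx in splitter.split(X, y)])

kfold_counts = class_counts_per_test_fold(KFold(n_splits=3))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold test-fold class counts:\n", kfold_counts)
print("StratifiedKFold test-fold class counts:\n", stratified_counts)
```

In each printed matrix a row is a fold and a column is a digit. With StratifiedKFold the counts within a column should be nearly equal across folds, while plain KFold simply takes whatever classes fall in each consecutive block of samples.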


#### Cross validation with cross_val_score to split a dataset into train and test sets + create, train, and calculate the performance metric of a machine learning model (without self-defined function to print performance metric)

In [232]:
from sklearn.model_selection import cross_val_score
'''
Split the dataset into 3 folds, then get the performance metric of the logistic regression model with each fold used once as the test set.
Store the outputs in the lr_cvs variable. digits.data is passed as the independent variables,
while digits.target is passed as the dependent variable. The cv parameter sets the
number of folds/groups the dataset is split into.
'''
lr_cvs = cross_val_score(LogisticRegression(max_iter=1000),digits.data,digits.target,cv=3)

'''
Split the dataset into 3 folds, then get the performance metric of the SVM model with each fold used once as the test set.
Store the outputs in the svm_cvs variable. digits.data is passed as the independent variables,
while digits.target is passed as the dependent variable. The cv parameter sets the
number of folds/groups the dataset is split into.
'''
svm_cvs = cross_val_score(SVC(),digits.data,digits.target,cv=3)

'''
Split the dataset into 3 folds, then get the performance metric of the random forest model with each fold used once as the test set.
Store the outputs in the rf_cvs variable. digits.data is passed as the independent variables,
while digits.target is passed as the dependent variable. The cv parameter sets the
number of folds/groups the dataset is split into.
'''
rf_cvs = cross_val_score(RandomForestClassifier(),digits.data,digits.target,cv=3)

# show the performance metric of different machine learning models
print('Using cross_val_score to split dataset into N folds then create, train, and calculate the performance metric of a machine learning model , the performance metric of different machine learning models:\nLogistic regression model\t:'\
      +str(lr_cvs)+'\nSVM model\t\t\t:'+str(svm_cvs)+'\nRandom forest model\t\t:'+str(rf_cvs))


Using cross_val_score to split dataset into N folds then create, train, and calculate the performance metric of a machine learning model , the performance metric of different machine learning models:
Logistic regression model	:[0.91986644 0.94156928 0.91652755]
SVM model			:[0.96494157 0.97996661 0.96494157]
Random forest model		:[0.93489149 0.96661102 0.92988314]
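
The logistic regression and SVM scores above are the same as in the StratifiedKFold section. That is expected: when `cv` is an integer and the estimator is a classifier, `cross_val_score` uses StratifiedKFold for the splits. The sketch below checks this for the SVM and shows how to pass a splitter object explicitly; the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# integer cv: stratified splits are chosen automatically for a classifier
scores_int = cross_val_score(SVC(), digits.data, digits.target, cv=3)
# the same splits, requested explicitly
scores_stratified = cross_val_score(SVC(), digits.data, digits.target,
                                    cv=StratifiedKFold(n_splits=3))
# plain KFold splits, for comparison with the manual KFold loop
scores_kfold = cross_val_score(SVC(), digits.data, digits.target,
                               cv=KFold(n_splits=3))

print("cv=3              :", scores_int, "mean:", scores_int.mean())
print("StratifiedKFold(3):", scores_stratified, "mean:", scores_stratified.mean())
print("KFold(3)          :", scores_kfold, "mean:", scores_kfold.mean())
```

Calling `.mean()` on the returned array gives the single averaged score described in step 4 of the theory.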
