<a id='home'></a>
### K Fold Cross validation

### Model validation
We rarely have as much data as we would like for building an ML model. Even so, we have to split it into training and validation/testing sets, because without validation we cannot know how well the model performs. We also cannot use the same data for both training and validation, as that would not give a good estimate of performance on unseen data.

### Problem with traditional validation method
In the traditional approach, we split the data in a ratio such as 70:30 or 80:20, use the larger chunk for training and the rest for validation/testing. This is known as the `Holdout method`. The problem with this approach is that the model never sees the held-out portion, so it may not learn the full pattern of the data, and if the dataset is not large enough the model can underfit. The resulting score also depends on which samples happen to land in the test set. <br>
> <img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/07/15185319/blogs-15-7-2020-02-1024x565.jpg" width="500">

The holdout method is similar to teaching a kid the first 7 chapters (or 7 random chapters) of a book and then testing what they learned on the remaining chapters.

### How cross validation solves the issue
Cross validation is one of the most popular yet simplest resampling techniques. The basic idea is to use the whole dataset for both training and validation: every sample is used for training in some rounds and for validation in another. Because of this, the evaluation makes full use of the given dataset, which helps with the underfitting issue described above. Cross validation is also used for comparing different algorithms to find the best performing model, or for finding the best parameters of a particular model.
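As a minimal sketch of the parameter-finding use mentioned above, scikit-learn's `GridSearchCV` scores each candidate setting with cross validation and keeps the best one. The choice of `KNeighborsClassifier` and the candidate values in `param_grid` are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Candidate values are illustrative assumptions, not tuned recommendations.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Every candidate is scored with 5-fold cross validation on the same data.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)            # setting with the highest mean fold accuracy
print(round(search.best_score_, 4))   # that mean accuracy
```

`best_score_` is the mean cross-validated accuracy of the winning setting, so it is directly comparable to the averaged fold scores computed by hand later in this notebook.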


### Why do we need cross validation
Since we rarely have enough data to train our model, removing a part of it for validation poses a risk of underfitting. By reducing the training data, we risk losing important patterns/trends in the dataset, which in turn increases the error induced by bias. So what we require is a method that provides ample data for training the model and also leaves ample data for validation. K Fold cross validation does exactly that.

### Types of Cross Validation
1. [KFold Cross Validation](#kfold)
2. [Stratified KFold Cross Validation](#scv)
3. [Leave-P-Out Cross Validation](#lcv)
---

## Building different validation models

In [1]:
import pandas
from sklearn.datasets import load_wine # <-- dataset we will be using
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
# dataset 
features, label = load_wine(return_X_y=True)

print(features[:5])
print('--'*20)
print(label[:5])

[[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
  2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 1.120e+01 1.000e+02 2.650e+00 2.760e+00
  2.600e-01 1.280e+00 4.380e+00 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 1.860e+01 1.010e+02 2.800e+00 3.240e+00
  3.000e-01 2.810e+00 5.680e+00 1.030e+00 3.170e+00 1.185e+03]
 [1.437e+01 1.950e+00 2.500e+00 1.680e+01 1.130e+02 3.850e+00 3.490e+00
  2.400e-01 2.180e+00 7.800e+00 8.600e-01 3.450e+00 1.480e+03]
 [1.324e+01 2.590e+00 2.870e+00 2.100e+01 1.180e+02 2.800e+00 2.690e+00
  3.900e-01 1.820e+00 4.320e+00 1.040e+00 2.930e+00 7.350e+02]]
----------------------------------------
[0 0 0 0 0]


In [3]:
print(features.shape, label.shape)

(178, 13) (178,)


## Splitting data
### Holdout method

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=41)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(142, 13) (36, 13) (142,) (36,)


In [7]:
# Training and validation
model = RandomForestClassifier().fit(x_train, y_train)
y_pred = model.predict(x_test)
print('Training Score\n', model.score(x_train, y_train), '\nTesting Score\n',model.score(x_test, y_test))
print('-'*20, '\nClassification Report\n', classification_report(y_test, y_pred))

Training Score
 1.0 
Testing Score
 0.9722222222222222
-------------------- 
Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       1.00      0.92      0.96        12
           2       0.94      1.00      0.97        15

    accuracy                           0.97        36
   macro avg       0.98      0.97      0.97        36
weighted avg       0.97      0.97      0.97        36



> The accuracy of a model depends on the data it is trained on, and that data depends on how we split it. Here we split the data with the `Holdout method`, in which a portion of the data (the part used for validation) is never used for training, so the model does not get the opportunity to learn the complete pattern of the data.

> For example, if we change `random_state` in the `train_test_split` function (which controls the random split), the model will give a different accuracy, though the difference may or may not be large.

In [32]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=56)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(142, 13) (36, 13) (142,) (36,)


In [8]:
model = RandomForestClassifier().fit(x_train, y_train)

# Test report 
y_pred = model.predict(x_test)
print('Training Score\n', model.score(x_train, y_train), '\nTesting Score\n',model.score(x_test, y_test))
print('-'*20, '\nClassification Report\n', classification_report(y_test, y_pred))

Training Score
 1.0 
Testing Score
 0.9444444444444444
-------------------- 
Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       1.00      0.83      0.91        12
           2       0.88      1.00      0.94        15

    accuracy                           0.94        36
   macro avg       0.96      0.94      0.95        36
weighted avg       0.95      0.94      0.94        36



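> As you can see, the test accuracy changed from about 0.97 to about 0.94 just by changing the value of `random_state`. (`RandomForestClassifier` is itself randomized, so part of the difference can also come from the model.)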
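To see how much a holdout score can move with the split, the following sketch repeats the 80:20 split for ten different `random_state` values and collects the test accuracies. The seed range, `n_estimators=50` and the fixed model seed are assumptions chosen so that only the split changes between runs.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

holdout_scores = []
for seed in range(10):
    # Only the split changes from run to run; the model seed stays fixed.
    x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_tr, y_tr)
    holdout_scores.append(clf.score(x_te, y_te))

holdout_scores = np.array(holdout_scores)
print(holdout_scores.round(3))
print("min:", holdout_scores.min(), "max:", holdout_scores.max())
```

The gap between the minimum and maximum gives a rough feel for how unreliable a single holdout estimate can be on a dataset of this size.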

<a id='kfold'></a>
### 1. K Fold
It first divides the whole dataset into `K` subsets, or folds (K is chosen by the user). The model is then trained `K` times, each time on `K-1` folds and validated on the 1 remaining fold, so that every fold serves as the validation set exactly once. Basically, it applies the `Holdout method` K times with a different training/testing split each time. After training and validating the model K times we get K different accuracy scores, which we average to get the overall effectiveness of the model.
> <img src="https://zitaoshen.rbind.io/project/machine_learning/machine-learning-101-cross-vaildation/featured.png" width="600">

**[▲ Go To Top](#home)**
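The procedure described above can be sketched in a few lines of NumPy. `k_fold_indices` is a hypothetical helper written for illustration; it is not how scikit-learn implements `KFold`, only a minimal version of the same idea.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs. Hypothetical helper, not sklearn's code."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle the positions once
    folds = np.array_split(indices, k)     # K nearly equal chunks
    for i in range(k):
        test_idx = folds[i]                # fold i is held out this round
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield np.sort(train_idx), np.sort(test_idx)

splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Each position appears in exactly one test fold and in the training set of the other K-1 rounds.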

In [9]:
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=True) # <-- here n_splits parameter is the value of K

In [10]:
# how KFold works (demonstration)
for train, test in cv.split(range(10)):
    print(train, test)

[1 2 3 5 6 7 8 9] [0 4]
[0 1 3 4 5 7 8 9] [2 6]
[0 2 3 4 6 7 8 9] [1 5]
[0 1 2 4 5 6 7 9] [3 8]
[0 1 2 3 4 5 6 8] [7 9]


> In the output above you can see 5 (the value of K) different train/test index arrays, one pair per split. Similarly, if we apply KFold to our data it will create 5 different splits.

In [11]:
# cv.split(data) returns indices (positions), not the actual values (remember that)
for train, test in cv.split(range(100, 110)):
    print(train, test)

[0 1 3 4 5 6 8 9] [2 7]
[0 1 2 3 5 6 7 9] [4 8]
[0 2 3 4 6 7 8 9] [1 5]
[0 1 2 4 5 6 7 8] [3 9]
[1 2 3 4 5 7 8 9] [0 6]
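Because `split` returns positions, the actual values have to be looked up by indexing into the array. A small sketch of that step, where the fixed `random_state=0` is an assumption added only to make the run repeatable:

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(100, 110)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_values = []
for train_idx, test_idx in kf.split(data):
    # train_idx / test_idx are positions; index into the array to get the values
    print("test positions:", test_idx, "-> test values:", data[test_idx])
    fold_values.append(data[test_idx])
```

This is the same indexing pattern used when slicing `features` and `label` with the fold indices.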


In [12]:
# splitting data with KFold
scores = []
for train_index, test_index in cv.split(features):
    x_train, x_test, y_train, y_test = features[train_index], features[test_index], label[train_index], label[test_index]
    model = RandomForestClassifier().fit(x_train, y_train)
    
    # for simplicity we only record the test score
    test_score = model.score(x_test, y_test)
    scores.append(test_score)
    print(test_score)

0.9722222222222222
1.0
0.9722222222222222
1.0
1.0


In [13]:
# to estimate the model's overall performance, take the average of the fold scores
print(sum(scores)/len(scores))

0.9888888888888889


> So this is an estimate of the performance we can expect from Random Forest on this data. We can try different models with cross validation to find the best one.

> There is a shorter and easier way to run K Fold cross validation: the `cross_val_score` function. Note that when `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds. Let's see how to use it.

In [40]:
from sklearn.model_selection import cross_val_score
score = cross_val_score(RandomForestClassifier(), features, label, cv=5, scoring='accuracy')
print(score)
# average the scores returned by cross_val_score (`score`, not the earlier `scores` list)
print(f'{score.mean():.4f}')

[0.94444444 0.94444444 0.91666667 1.         0.97142857]
0.9554
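Since `SVC` and `KNeighborsClassifier` were imported at the top of the notebook, `cross_val_score` can also be used to compare several algorithms on identical folds. The sketch below is one way to do that; the default hyperparameters and the fixed seeds are assumptions made for repeatability, not tuned choices.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# A seeded splitter yields the same folds for every model, so scores are comparable.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, estimator in candidates.items():
    fold_scores = cross_val_score(estimator, X, y, cv=folds, scoring="accuracy")
    results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much each model's score varies from fold to fold.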


<a id='scv'></a>
### 2. StratifiedKFold
In some cases, there may be a large imbalance in the response (target) variable. For example, in a dataset of house prices, there might be a disproportionately large number of high-priced houses. Or, in classification, there might be several times more negative samples than positive samples. For such problems, a slight variation of K Fold cross validation is used, in which each fold contains approximately the same percentage of samples of each target class as the complete set or, for regression problems, approximately the same mean response value in every fold. This variation is known as Stratified K Fold. scikit-learn's `StratifiedKFold`, used below, stratifies on class labels.

**[▲ Go To Top](#home)**

In [15]:
from sklearn.model_selection import StratifiedKFold
s_cv = StratifiedKFold(n_splits=5, shuffle=True)

In [24]:
scores = []

# here features and label are NumPy arrays
for train_index, test_index in s_cv.split(features, label):
    x_train, x_test, y_train, y_test = features[train_index], features[test_index], label[train_index], label[test_index]
    model = RandomForestClassifier().fit(x_train, y_train)
    
    # for simplicity we only record the test score
    test_score = model.score(x_test, y_test)
    scores.append(test_score)
    print(test_score)

1.0
1.0
1.0
0.9142857142857143
1.0


In [25]:
# to estimate the model's overall performance, take the average of the fold scores
print(sum(scores)/len(scores))

0.9828571428571429
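To check that stratification keeps the class proportions, the following sketch counts the classes in each test fold for plain `KFold` and for `StratifiedKFold` on the wine data. The `random_state=0` values are assumptions added for repeatability.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_wine(return_X_y=True)
print("full data class counts:", np.bincount(y))

splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
}

test_counts = {}
for name, splitter in splitters.items():
    # KFold ignores y; StratifiedKFold uses it to balance classes per fold.
    test_counts[name] = [np.bincount(y[test_idx], minlength=3)
                         for _, test_idx in splitter.split(X, y)]
    print(name)
    for counts in test_counts[name]:
        print("  test fold class counts:", counts)
```

With stratification the per-class counts differ by at most one sample between folds, while plain `KFold` leaves them to chance.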


<a id='lcv'></a>
### 3. Leave-P-Out Cross Validation
This approach leaves p data points out of the training data: if there are n data points in the original sample, n-p samples are used to train the model and p points are used as the validation set. This is repeated for all combinations in which the original sample can be separated this way, and the error is then averaged over all trials to give the overall effectiveness.
This method is exhaustive in the sense that it needs to train and validate the model for all possible combinations, and for moderately large p it can become **computationally infeasible**. With p = 1 this is leave-one-out cross validation, which is what the code below uses.

**[▲ Go To Top](#home)**
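The cost of the exhaustive search can be made concrete by counting the rounds: leaving p points out of n gives "n choose p" train/validate rounds. The sketch below compares `LeavePOut.get_n_splits` with `math.comb` for a few small values of p; the values of p shown are arbitrary examples.

```python
from math import comb

from sklearn.datasets import load_wine
from sklearn.model_selection import LeavePOut

X, y = load_wine(return_X_y=True)
n_samples = X.shape[0]

rounds = {}
for p in (1, 2, 3):
    # get_n_splits reports how many train/validate rounds split() would produce
    rounds[p] = LeavePOut(p).get_n_splits(X)
    print(f"p={p}: {rounds[p]} rounds (n choose p = {comb(n_samples, p)})")
```

Each round trains a fresh model, so the number of rounds is a direct measure of how expensive a given p would be.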

In [30]:
from sklearn.model_selection import LeavePOut
l_cv = LeavePOut(1)

In [31]:
scores = []
for train_index, test_index in l_cv.split(features, label):
    x_train, x_test, y_train, y_test = features[train_index], features[test_index], label[train_index], label[test_index]
    model = RandomForestClassifier().fit(x_train, y_train)
    
    # for simplicity we only record the test score
    test_score = model.score(x_test, y_test)
    scores.append(test_score)
    print(test_score)

1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0


In [34]:
print(sum(scores)/len(scores))

0.9775280898876404


In [32]:
len(scores)

178

In [33]:
len(x_train)

177

Here you can see we trained 178 models, one for each possible way of leaving a single sample out: each model was trained on 177 samples and validated on the remaining one.
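Leaving one sample out at a time is common enough that scikit-learn provides a dedicated `LeaveOneOut` splitter, which can be passed straight to `cross_val_score`. A minimal sketch follows; `KNeighborsClassifier` is swapped in here as an assumption, only because it is quick to fit many times.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# One fit per sample: each round trains on n-1 samples and tests on the one left out.
loo_scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())

print(len(loo_scores))     # one score per left-out sample
print(loo_scores.mean())   # fraction of left-out samples predicted correctly
```

Every individual score is either 0.0 or 1.0 because each test set holds a single sample, so only the mean is informative.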