# K-Fold Cross Validation

* Evaluating a Machine Learning model can be quite tricky. 


* Usually, we split the data set into training and testing sets and use the training set to train the model and testing set to test the model. 


* We then evaluate the model performance based on an error metric to determine the accuracy of the model. 


* This method however, is not very reliable as the accuracy obtained for one test set can be very different to the accuracy obtained for a different test set. 


* K-fold Cross Validation(CV) provides a solution to this problem by dividing the data into folds and ensuring that each fold is used as a testing set at some point. 


# What is K-Fold Cross Validation?


* K-Fold CV is where a given data set is split into a K number of sections/folds where each fold is used as a testing set at some point. 


#### Lets take the scenario of 5-Fold cross validation(K=5). 

* Here, the data set is split into 5 folds. 


* In the first iteration, the first fold is used to test the model and the rest are used to train the model. 


* In the second iteration, 2nd fold is used as the testing set while the rest serve as the training set. 


* This process is repeated until each fold of the 5 folds have been used as the testing set.


<img src='img/kfold.JPG'>


# Evaluating a ML model using K-Fold CV

Lets evaluate a simple regression model using K-Fold CV. 

Here we will be performing 10-Fold cross validation using the RBF kernel of the SVR model

# Importing libraries


In [1]:
import pandas
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
import numpy as np


# Reading the data set

In [2]:
dataset = pandas.read_csv('dataset/boston.csv')

# Pre-processing

In [3]:
X = dataset.iloc[:, [0, 12]]
y = dataset.iloc[:, 13]

The above code indicates that all the rows of column index 0-12 are considered as features and the column with the index 13 to be the dependent variable A.K.A the output. 

Now, lets apply the MinMax scaling pre-processing technique to normalize the data set.

In [4]:
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

This technique re-scales the data between a specified range(in this case, between 0–1), to ensure that certain features do not affect the final prediction more than the other features.

# K-Fold CV

In [5]:
scores = []
best_svr = SVR(kernel='rbf')


cv = KFold(n_splits=10, random_state=42, shuffle=False)


for train_index, test_index in cv.split(X):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)

    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    best_svr.fit(X_train, y_train)
    scores.append(best_svr.score(X_test, y_test))

Train Index:  [ 51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104
 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158
 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194
 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212
 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230
 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248
 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266
 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284
 285 286 287 288 289 290 291 292 293 


Test Index:  [306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323
 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341
 342 343 344 345 346 347 348 349 350 351 352 353 354 355]
Train Index:  [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 1



* We are using the RBF kernel of the SVR model, implemented using the sklearn library for the evaluation purpose of this example. 


* First, we indicate the number of folds we want our data set to be split into. 


* Here, we have used 10-Fold CV (n_splits=10), where the data will be split into 10 folds. 


* We are printing out the indexes of the training and the testing sets in each iteration to clearly see the process of K-Fold CV where the training and testing set changes in each iteration.


# Next, we specify the training and testing sets to be used in each iteration. 


* For this, we use the indexes(train_index, test_index) specified in the K-Fold CV process. 


* Then, we train the model in each iteration using the train_index of each iteration of the K-Fold process and append the error metric value to a list(scores ).

In [6]:
best_svr.fit(X_train, y_train)
scores.append(best_svr.score(X_test, y_test))




# The error metric computed using the best_svr.score() function is the r2 score. 


* Each iteration of F-Fold CV provides an r2 score. 


* We append each score to a list and get the mean value in order to determine the overall accuracy of the model.

In [7]:
print(np.mean(scores))

0.125187416587258


# Alternatively use following to do the same task of 10-Fold cross validation. The first method will give you a list of r2 scores and the second will give you a list of predictions.


In [8]:
from sklearn.model_selection import cross_val_score, cross_val_predict

cross_val_score(best_svr, X, y, cv=10)



array([ 0.57587479,  0.31598287,  0.29073841, -0.36956896, -0.10400539,
       -0.63991482,  0.1434353 ,  0.44181703, -0.1596671 ,  0.44118473])

In [9]:
cross_val_predict(best_svr, X, y, cv=10)



array([25.36718928, 23.06977613, 25.868393  , 26.4326278 , 25.17432617,
       25.24206729, 21.18313164, 17.3573978 , 12.07022251, 18.5012095 ,
       16.63900232, 20.69694063, 19.29837052, 23.51331985, 22.37909146,
       23.39547762, 24.4107798 , 19.83066293, 21.54450501, 21.78701492,
       16.23568134, 20.3075875 , 17.49385015, 16.87740936, 18.90126376,
       18.77344633, 19.76058911, 18.32650026, 20.90814231, 21.35963996,
       15.41341146, 20.71688823, 12.91193149, 17.70743788, 16.56846148,
       22.76029078, 21.76864141, 23.2741799 , 22.49430392, 25.71585402,
       26.9177714 , 25.43184683, 24.90897453, 24.01209495, 22.82900453,
       22.44898945, 20.18410789, 17.54318606, 11.71340068, 19.00728416,
       20.59593149, 22.99702181, 25.40080635, 23.58576963, 19.79970832,
       25.66798186, 25.12626447, 26.14488041, 24.4908987 , 23.11597693,
       20.77068049, 19.99941657, 24.56955668, 22.94840212, 23.8109607 ,
       25.74429478, 22.51664327, 23.77844313, 20.80775266, 23.36