# K-fold cross-validation

So far we have loaded the data, split it into training and test sets, fitted a regression model to the training data, made predictions with it, and evaluated those predictions on the test data.

But what if the split was not random? What if it accidentally separated the data into a low-income group and a high-income group? This is where cross-validation comes in.

<img src="http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/06/07_cross_validation_diagram.png" width="285" height="393" alt="K-fold cross-validation diagram">

In k-fold cross-validation we split the data into k subsets (or folds). We train the model on k-1 folds and hold out the remaining fold as test data. We repeat this k times, so that each fold serves as the test fold exactly once, and average the k scores to estimate the model's performance. After that we finalize the model and test it against the held-out test set.
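The procedure can be sketched by hand in plain NumPy. This is only an illustration under stated assumptions: the helper names `k_fold_indices` and `cross_validate_mse` are hypothetical, the data is synthetic, and ordinary least squares stands in for whatever model is being evaluated.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds (hypothetical helper)."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # k nearly equal chunks
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_validate_mse(X, y, k):
    """Fit least squares on k-1 folds and score MSE on the held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        # Add an intercept column, then solve the least-squares problem.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        pred = A_test @ coef
        scores.append(np.mean((pred - y[test_idx]) ** 2))
    return np.array(scores)

# Synthetic data (an assumption for illustration): y = 3x + 1 plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 3 * X_demo[:, 0] + 1 + rng.normal(0, 0.1, size=50)

fold_mse = cross_validate_mse(X_demo, y_demo, k=5)
print("per-fold MSE:", fold_mse)
print("mean MSE:", fold_mse.mean())
```

Each sample is used for testing exactly once and for training k-1 times, which is why the averaged score depends less on one lucky or unlucky split.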

In [1]:
from sklearn.model_selection import KFold # import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]]) # create an array
y = np.array([1, 2, 3, 4]) # Create another array
kf = KFold(n_splits=2) # Define the split - into 2 folds 
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
print(kf) 

KFold(n_splits=2, random_state=None, shuffle=False)


In [2]:
for train_index, test_index in kf.split(X):
    print('TRAIN INDEX:', train_index, 'TEST INDEX:', test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN INDEX: [2 3] TEST INDEX: [0 1]
TRAIN INDEX: [0 1] TEST INDEX: [2 3]


As you can see, `split` yields index arrays that partition the original data into complementary training and test subsets, and each sample appears in a test subset exactly once.
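The folds above are contiguous blocks because `shuffle` defaults to `False`. If the rows are ordered in some meaningful way (the low-income versus high-income concern raised earlier), a shuffled split may be safer. A minimal sketch, assuming a toy array `X_small` that is not part of the original data:

```python
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(12).reshape(6, 2)  # six toy samples, two features each

# shuffle=True permutes the row indices before they are cut into folds;
# random_state fixes the permutation so the split is reproducible.
kf_shuffled = KFold(n_splits=3, shuffle=True, random_state=0)

test_folds = []
for train_index, test_index in kf_shuffled.split(X_small):
    print('TRAIN INDEX:', train_index, 'TEST INDEX:', test_index)
    test_folds.append(test_index)

# Every sample still lands in exactly one test fold.
all_test = np.sort(np.concatenate(test_folds))
print(all_test)
```

Note that `random_state` only has an effect when `shuffle=True`.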

## A familiar data set

Now let's go back to the data set we used before for evaluation metrics.
This time we will use **k-fold cross-validation** to evaluate a model on the Pima Indians Diabetes Data Set.

Pima Indians Diabetes Data Set: https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes

In [3]:
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
dataframe.head()

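|   | preg | plas | pres | skin | test | mass | pedi  | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1 | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2 | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3 | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4 | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |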


Convert the dataframe into a NumPy array:

In [4]:
array = dataframe.values
array

array([[   6.   ,  148.   ,   72.   , ...,    0.627,   50.   ,    1.   ],
       [   1.   ,   85.   ,   66.   , ...,    0.351,   31.   ,    0.   ],
       [   8.   ,  183.   ,   64.   , ...,    0.672,   32.   ,    1.   ],
       ..., 
       [   5.   ,  121.   ,   72.   , ...,    0.245,   30.   ,    0.   ],
       [   1.   ,  126.   ,   60.   , ...,    0.349,   47.   ,    1.   ],
       [   1.   ,   93.   ,   70.   , ...,    0.315,   23.   ,    0.   ]])

Separate the feature columns from the class label:

In [5]:
X = array[:,0:8] # [preg, plas, pres, skin, test, mass, pedi, age]
Y = array[:,8] # [class]
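Positional slicing works, but it silently breaks if the column order changes. As an alternative sketch, the same separation can be done by column name. The toy frame `toy` below is an assumption for illustration; it holds only the first two rows shown by `dataframe.head()`.

```python
import pandas as pd

# Toy frame holding the first two rows shown by dataframe.head().
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
toy = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0]],
    columns=names,
)

feature_names = names[:-1]               # every column except 'class'
X_toy = toy[feature_names].to_numpy()    # shape (n_samples, 8)
y_toy = toy['class'].to_numpy()          # shape (n_samples,)

print(X_toy.shape, y_toy.shape)
```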

In [6]:
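# KFold without shuffling uses contiguous folds, so no seed is needed;
# random_state only has an effect together with shuffle=True, and recent
# scikit-learn versions raise an error if it is set without it.
kfold = model_selection.KFold(n_splits=10)
model = LogisticRegression()
# cross_val_score fits a fresh copy of the model per fold and returns one accuracy per fold
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))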

Accuracy: 76.951% (4.841%)


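We report both the mean and the standard deviation of the accuracy across the 10 folds: the mean estimates the model's performance, and the standard deviation shows how much that performance varies from fold to fold.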
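To see where the two numbers come from, it can help to look at the per-fold scores directly. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the diabetes data set), and `max_iter=1000` is set only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the diabetes data set).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

cv = KFold(n_splits=10)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_syn, y_syn, cv=cv)

print(scores)          # one accuracy value per fold
print(scores.mean())   # average accuracy over the 10 folds
print(scores.std())    # spread across folds (population std, ddof=0)
```

A large standard deviation relative to the mean suggests the estimate is sensitive to which samples end up in the test fold.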

---