# Why do we use cross validation? How do we use cross validation, and how do we train the model?

When you do K-fold cross validation, you are testing how well your model can be trained on some data and then predict data it hasn't seen. We use cross validation for this because if you train on all the data you have, you have none left for testing. You could do a single split, say by using 80% of the data to train and 20% to test, but what if the 20% you happened to pick contains a bunch of points that are particularly easy (or particularly hard) to predict? Then we will not have come up with the best possible estimate of the model's ability to learn and predict.
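To see how much a single 80/20 split can depend on which points land in the test set, here is a minimal sketch. It assumes the iris dataset and a linear SVM purely for illustration, and the ten seeds are arbitrary; none of these choices come from the question itself.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Repeat a single 80/20 split with ten different (arbitrary) seeds.
split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=seed
    )
    clf = svm.SVC(kernel="linear").fit(X_train, y_train)
    split_scores.append(clf.score(X_test, y_test))

# Any gap between the lowest and highest score comes from the split alone,
# since the model and the data are the same every time.
print("min: %0.2f, max: %0.2f" % (min(split_scores), max(split_scores)))
```

If the minimum and maximum differ, a single split would have given you a different impression of the model depending on luck, which is the problem cross validation is meant to smooth out.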

We want to use all of the data. So to continue the above example of an 80/20 split, we would do 5-fold cross validation by training the model 5 times on 80% of the data and testing on 20%. We ensure that each data point ends up in the 20% test set exactly once. We've therefore used every data point we have to contribute to an understanding of how well our model performs the task of learning from some data and predicting some new data.
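The claim that each data point ends up in the test set exactly once can be checked directly. The sketch below uses scikit-learn's `KFold` on the iris data; the dataset, `shuffle=True`, and `random_state=0` are illustrative assumptions, not part of the explanation above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold

iris = datasets.load_iris()
n_samples = len(iris.data)  # 150 samples

# 5 folds means each fold trains on 80% and tests on 20% of the data.
# shuffle and random_state are illustrative choices.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each sample is used as a test point.
test_counts = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in kf.split(iris.data):
    test_counts[test_idx] += 1

print(len(train_idx), len(test_idx))  # 120 30
print(np.all(test_counts == 1))       # True
```

Every sample is tested once and used for training in the other four folds, so all 150 points contribute to the performance estimate.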

But the purpose of cross-validation is not to come up with our final model. We don't use these 5 instances of our trained model to do any real prediction. For that we want to use all the data we have to come up with the best model possible. The purpose of cross-validation is model checking, not model building.

Now, suppose we have two models, say a linear regression model and a neural network. How can we say which model is better? We can do K-fold cross-validation on both and see which one proves better at predicting the test set points. But once we have used cross-validation to select the better performing model, we train that model (whether it be the linear regression or the neural network) on all the data. We don't use the actual model instances we trained during cross-validation for our final predictive model.
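As a rough sketch of that select-then-refit workflow: the two candidates below (a linear SVM and a decision tree) are stand-ins chosen only because they run quickly on the iris data; they are not the linear regression and neural network from the example, and the names `candidates`, `mean_scores`, and `final_model` are made up for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

# Two hypothetical candidate models (stand-ins for "model A" and "model B").
candidates = {
    "linear_svm": svm.SVC(kernel="linear", C=1.0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Step 1: model checking. Score each candidate with 5-fold cross validation.
# cross_val_score fits clones, so the objects in `candidates` stay unfitted.
mean_scores = {
    name: cross_val_score(model, iris.data, iris.target, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores, "->", best_name)

# Step 2: model building. Refit the selected candidate on ALL the data.
final_model = candidates[best_name].fit(iris.data, iris.target)
```

The models fitted inside `cross_val_score` are thrown away; only `final_model`, trained on the full dataset, would be used for real predictions.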

Note that there is a technique called bootstrap aggregation (usually shortened to 'bagging') that does use model instances, each trained on a different resampling of the data much as in cross-validation, to build up an ensemble model, but that is an advanced technique beyond the scope of your question here.

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

In [10]:
# direct testing, no cross validation yet
# load the dataset: data + labels
iris = datasets.load_iris()

# split the data: 60% for training, 40% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=1)

# train a linear SVM model
clf = svm.SVC(kernel='linear').fit(X_train, y_train)

# see the score (mean accuracy) of the model on the held-out test set
clf.score(X_test, y_test)


0.9833333333333333

In [11]:
from sklearn.model_selection import cross_val_score

In [19]:
# k-fold cross validation

# instead of training a model ourselves, we only define it;
# cross_val_score fits a fresh copy on each fold
clf = svm.SVC(kernel='linear', C=1.0)

# now, we can do 5-fold cross validation
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# print the mean score and twice the standard deviation across folds
# (a rough spread that is often quoted as an approximate 95% interval)
print("Accuracy: %0.2f(+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98(+/- 0.03)
