# EVeMa 2018

![logo](assets/logo.jpg "Logo")

- Instructor: Žiga Emeršič.

- Authors: 
    - Saúl Calderón, Martín Solís, Ángel García, Blaž Meden, Felipe Meza, Juan Esquivel
    - Mauro Méndez, Manuel Zumbado. 

# Cross Validation

The most straightforward approach to model training and evaluation is to split the data into two parts: a training set and a test set. However, one of the parts could contain unwanted properties or relations that are not representative of the whole dataset.

This can mean that our training will be unsuccessful or that we will evaluate the predictions erroneously. For example, imagine that the test data contains only the simplest possible cases. The calculated prediction performance will then be misleadingly high.

To counter this to some extent, we should always split the data randomly (e.g. by permuting the data prior to splitting it). However, there is still no guarantee that unwanted relations will not appear, especially when we are dealing with small amounts of data.
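
As a minimal sketch of why the permutation matters, the iris dataset (used later in this notebook) is stored sorted by class, so splitting it without shuffling hides a whole class from training. The seed and the 90/60 split below are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

split = 90  # illustrative choice: first 90 samples for training, last 60 for testing

# Without shuffling: the samples are sorted by class, so class 2 never reaches training
unshuffled_train_classes = np.unique(y[:split])
print(unshuffled_train_classes)  # → [0 1]

# With shuffling: permute the indices first, then split
rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
perm = rng.permutation(len(y))
X_train, X_test = X[perm[:split]], X[perm[split:]]
y_train, y_test = y[perm[:split]], y[perm[split:]]

print(np.unique(y_train))  # all classes are now expected to appear in training
```

In practice `train_test_split` shuffles by default, so this manual permutation is only needed when we slice the arrays ourselves.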

Would it not be great if we could train and test our model on the whole dataset? This is where cross validation comes in. We split the data into equal parts and then loop through them, so that each part is used once for testing while the remaining parts are used for training.

There are many different types of cross validation. For the ones available in scikit-learn (e.g. from sklearn.model_selection import StratifiedKFold) check http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

In stratified k-fold cross-validation, the folds are selected so that each fold contains roughly the same proportions of class labels.
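
To make the stratification visible, the following sketch counts the class labels in the test part of each fold. The choice of 5 folds and the random seed are assumptions for illustration; since iris has 50 samples per class, each of the 5 test folds should receive 10 samples of every class.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many test samples of each class this fold received
    counts = np.bincount(y[test_idx], minlength=3)
    fold_class_counts.append(counts.tolist())

print(fold_class_counts)
# → [[10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10]]
```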

In repeated k-fold cross-validation, the divisions are repeated n times, each time with a different random split.
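
A short sketch of the repeated variant, assuming scikit-learn's `RepeatedKFold` and the same linear SVM used later in this notebook. The values 5 folds and 3 repetitions are arbitrary choices for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import RepeatedKFold, cross_val_score

iris = datasets.load_iris()

# 5 folds, repeated 3 times, each repetition with a different shuffle
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=rkf)

print(len(scores))  # 5 folds x 3 repetitions = 15 scores
print(scores.mean(), scores.std())
```

Averaging over several repetitions makes the final estimate less dependent on one particular random division of the data.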

With cross validation we trade off bias against variance. If we do not perform cross validation at all and rely on a single split, we get a single estimate and therefore see no variance between folds, but the bias could be high. If, on the other hand, we split the data into as many parts as there are samples, we decrease the bias but increase the variance.

## K-Folds Cross Validation
In K-Folds Cross Validation we split our data into k different subsets (or folds). We use k-1 subsets to train our model and leave the remaining subset (or fold) as test data, repeating this so that each fold is left out once. We then average the model's score across the folds and finalize our model. After that we test it against the held-out test set.
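
The loop described above can be sketched explicitly with scikit-learn's `KFold`. This is only an illustration under assumed settings (k = 5, a fixed seed, a linear SVM); it covers the fold loop and the averaging, not the final evaluation on a separate held-out set.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

k = 5
# shuffle=True because the iris samples are stored sorted by class
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the one fold that was left out
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

mean_score = np.mean(fold_scores)
print(fold_scores)
print(mean_score)
```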

How many folds? The more folds we have, the more we reduce the error due to bias, but the more we increase the error due to variance. The computational price goes up too: the more folds you have, the longer the computation takes, since one model is trained per fold, and the more memory you need. With a lower number of folds, we reduce the error due to variance, but the error due to bias is bigger.
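
One way to get a feeling for this trade-off is to run the same model with several values of k and compare the mean score, its spread across folds, and the running time. The values of k below are arbitrary, and on a dataset as small as iris the differences may be minor.

```python
import time
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

results = {}
for k in (2, 5, 10):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    start = time.perf_counter()
    scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
    elapsed = time.perf_counter() - start
    # One model is fitted per fold, so the cost grows with k
    results[k] = (scores.mean(), scores.std(), elapsed)
    print(f"k={k:2d}  mean={scores.mean():.3f}  std={scores.std():.3f}  time={elapsed:.3f}s")
```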


In [2]:
# Load data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

((150, 4), (150,))

*What we would usually do ...*

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.45, random_state=0)

X_train.shape, y_train.shape
X_test.shape, y_test.shape

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9705882352941176

Is this really OK? What if, just by chance, we performed a very bad split? E.g. trivial samples in the train set and very difficult ones in the test set, or vice versa.

Possible solution:

In [16]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
#print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.96666667 1.         0.96666667 0.96666667 1.        ]
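
The five scores are usually summarised as a mean together with a measure of spread. The sketch below does this for the values printed above, copied in by hand so that it runs on its own; reporting two standard deviations follows the commented-out line in the cell above.

```python
import numpy as np

# Scores copied from the cross_val_score output above
scores = np.array([0.96666667, 1.0, 0.96666667, 0.96666667, 1.0])

summary = "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)
print(summary)  # → Accuracy: 0.98 (+/- 0.03)
```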


## Leave One Out Cross Validation

Leave One Out Cross Validation is another method for cross validation (these two methods are not the only ones; there are many other methods for cross validation). In this type of cross validation, the number of folds (subsets) equals the number of observations we have in the dataset. Each sample is used once as a test set (singleton) while the remaining samples form the training set.

This method is computationally very expensive and should only be used on small datasets. If we have large amounts of data, k-fold is the better choice.
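
A minimal sketch with scikit-learn's `LeaveOneOut`, using the same linear SVM as before. For iris this means 150 separate fits, one per sample, which illustrates why the method becomes expensive on larger datasets.

```python
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

loo = LeaveOneOut()
# One fit per sample: each test set contains exactly one observation
scores = cross_val_score(clf, iris.data, iris.target, cv=loo)

print(len(scores))    # one score per sample in the dataset
print(scores.mean())  # fraction of left-out samples classified correctly
```

Each individual score is either 0 or 1, because every test set contains a single sample; only the mean is informative.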

## Is there anything else?

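If we do not have enough data, or have a very variable model that introduces too much variance into our experiments, we can use the bootstrap (the resampling technique on which Bagging, i.e. bootstrap aggregating, is built). Here we create new datasets by drawing samples from the original data with replacement, so some samples are repeated and others are left out. In Bagging, the predictions of models trained on these resampled datasets are then averaged.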
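
As a rough sketch of bootstrap resampling, we can draw indices with replacement using `sklearn.utils.resample` and evaluate each model on the samples that were not drawn (the "out-of-bag" samples). The number of rounds and the use of out-of-bag samples as test data are assumptions made for this illustration, not a prescribed procedure.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.utils import resample

iris = datasets.load_iris()
X, y = iris.data, iris.target
indices = np.arange(len(y))

n_rounds = 20  # hypothetical number of bootstrap rounds
oob_scores = []
for seed in range(n_rounds):
    # Draw as many indices as there are samples, with replacement
    boot_idx = resample(indices, replace=True, n_samples=len(indices), random_state=seed)
    # Samples that were never drawn are "out of bag" and act as test data
    oob_idx = np.setdiff1d(indices, boot_idx)
    clf = svm.SVC(kernel='linear', C=1).fit(X[boot_idx], y[boot_idx])
    oob_scores.append(clf.score(X[oob_idx], y[oob_idx]))

print(np.mean(oob_scores), np.std(oob_scores))
```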

Authors: *Saúl Calderón, Ángel García, Blaž Meden, Felipe Meza, Juan Esquivel, Martín Solís, Žiga Emeršič, Mauro Méndez, Manuel Zumbado*