---
# Evaluating Models
---
## Cross-validation

A common situation in machine learning is that your model classifies the training data perfectly but performs poorly on new, unseen data. This phenomenon is called __overfitting__. Another problem is that your model may actually learn an unrelated feature that happens to correlate with your target. For example, your training data of cats and dogs could accidentally consist of only red cats and brown dogs. This can lead to a model that distinguishes between red and brown instead of between the characteristics of dogs and cats.  
<br>
There are several approaches to avoid these problems and to measure how well the model generalizes. One is to split your dataset into a training and a testing set, or even into a training, validation and testing set.  
But these approaches still have problems: imagine, for example, accidentally splitting your data so that the training and testing sets are nearly identical (except for their size). Then you cannot detect whether your model overfits.  
Splitting into three sets is also impractical when the dataset is small, because each part ends up with too few samples.
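
To make the dependence on a single split concrete, here is a minimal sketch, assuming the iris data and a k-nearest-neighbors classifier with 5 neighbors (both chosen only for illustration). It repeats the same train/test split with different random seeds and collects the test accuracy of each run; the scores may differ from split to split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# evaluate the same model on five different random 80/20 splits
holdout_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # n_neighbors=5 is an arbitrary choice for this illustration
    model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

# one accuracy per split; any spread comes only from how the data was split
print(holdout_scores)
```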

Cross-validation is a method that lets you train and test on the whole dataset by rotating which part of the data is held out for testing. It is often used during model development to compare different sets of hyperparameters.

### K-fold cross-validation

In k-fold cross-validation you split the data into k folds. Then:

- the model is trained on _k - 1_ folds
- the resulting model is tested on the remaining fold

This is repeated k times, so that each fold serves as the test set exactly once. The performance is then reported as the average of the k scores computed in the loop.

<img src="images/k-fold-cv.png" width="700" style="float: center;"/>

Image from [scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html)
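
The loop described above can be written out by hand. The following is a minimal sketch, assuming scikit-learn's `KFold` for creating the folds and a k-nearest-neighbors classifier with 5 neighbors as the model (both are illustrative choices). Shuffling is switched on because the iris samples are stored sorted by class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5 folds; shuffle because the iris samples are ordered by class
kf = KFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # train on k - 1 folds ...
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    # ... and test on the remaining fold
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# the reported performance is the average over all folds
print("fold scores:", fold_scores)
print("mean accuracy:", np.mean(fold_scores))
```

In practice `cross_val_score` wraps this loop in a single call, so the manual version is mainly useful for seeing what happens in each round.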

__Splitting the data:__  
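A common first step is to set aside a test set that is not touched during cross-validation. For that we can use `train_test_split` from `sklearn.model_selection`. Note that `train_test_split` creates a single train/test split, not k folds; the folds themselves are created by `cross_val_score`, which is used further down.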

In [1]:
# first we take the iris data set:
from sklearn.datasets import load_iris
iris = load_iris()

In [3]:
# then we use train_test_split
from sklearn.model_selection import train_test_split

# X is the data and y the target:
X = iris.data
y = iris.target

# use train_test_split: 80% training data, 20% test data (random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


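Cross-validation can now be used to compare different hyperparameter values of a model. To see what this looks like, we will again use knn. Note that `cross_val_score` creates the folds itself, so the cell below passes it the full `X` and `y`.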

In [17]:
# we need to import cross_val_score from model selection
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# we first select which k-values (numbers of neighbors) we want to test
# and create an array with one score slot per k-value
k_values = list(range(1, 15))
scores = np.zeros(len(k_values))

# then we iterate through all the k-values
for i, k in enumerate(k_values):
    # create the model (cross_val_score fits it on each training fold)
    knn = KNeighborsClassifier(n_neighbors=k)

    # 5-fold cross-validation returns one accuracy per fold; we store their mean
    fold_scores = cross_val_score(knn, X, y, cv=5)
    scores[i] = np.mean(fold_scores)

# mean and spread of the cross-validated accuracy across the tested k-values
print(f"{np.mean(scores):.3f} mean accuracy over all k with a standard deviation of {np.std(scores):.3f}")
print('max_score:', np.max(scores))
# map the position of the best score back to the corresponding k-value
print('k-value:', k_values[np.argmax(scores)])

0.971 mean accuracy over all k with a standard deviation of 0.009
max_score: 0.9800000000000001
k-value: 6
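
The held-out test set from `train_test_split` can be combined with this search. As a hedged sketch of one possible workflow (not necessarily the one intended here), the cross-validation runs only on the training split, and the best k-value is then checked once on the test split. The names `cv_means`, `best_k`, `final_model` and `test_accuracy` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# cross-validate each candidate k on the training split only
k_values = list(range(1, 15))
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5).mean()
    for k in k_values
]
best_k = k_values[int(np.argmax(cv_means))]

# refit on the full training split and evaluate once on the untouched test split
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print("best k:", best_k)
print("test accuracy:", test_accuracy)
```

Because the test split plays no part in choosing k, its accuracy is a less optimistic estimate than the cross-validation score that was used for the selection.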
