<center> <img src="res/ds3000.png"> </center>

<center> <h1> Week 10 - Day 2 </h1> </center>

<center> <h2> Part 6: Cross Validation </h2></center>

## Outline
1. <a href='#1'>Cross Validation</a>
2. <a href='#2'>KFold Class</a>
3. <a href='#3'>Calling Function cross_val_score</a>
4. <a href='#4'>Cross Validation: Summary</a>

<a id="1"></a>

## 1. Cross Validation
* Uses **all of your data** for both **training and testing**: each sample is used for testing exactly once
* Gives a more reliable estimate of how well your model predicts unseen data than a single train/test split
* **Splits the dataset** into **_k_ folds of (nearly) equal size** (this **_k_** is unrelated to the **_k_** in the k-nearest neighbors algorithm)
* **Repeatedly trains** your model with **_k_ – 1 folds** and **tests the model** with the **remaining fold**
* Consider using **_k_ = 10** with **folds numbered 1 through 10**
	* **train** with **folds 1–9**, then **test** with **fold 10**
	* **train** with **folds 1–8 and 10**, then **test** with **fold 9**
	* **train** with **folds 1–7 and 9–10**, then **test** with **fold 8**
	* ...
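
The fold rotation described above can be sketched with plain NumPy. This is only an illustration of the idea, not how scikit-learn implements it; the sample count `n_samples = 25` and the variable names are assumptions made for the example.

```python
import numpy as np

k = 10
n_samples = 25                       # hypothetical small dataset
indices = np.arange(n_samples)

# Split the sample indices into k folds of nearly equal size
folds = np.array_split(indices, k)

splits = []
# Test with fold 10 first, then fold 9, and so on, as in the list above
for i in range(k - 1, -1, -1):
    test_idx = folds[i]
    # Train on every fold except the one held out for testing
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, test_idx))
    print(f'test fold {i + 1}: {len(train_idx)} training samples, '
          f'{len(test_idx)} test samples')
```

Across the ten rounds every sample lands in the test set exactly once and in the training set nine times.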

In [None]:
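import pandas as pd
from sklearn.datasets import load_digits

# Load the digits dataset (8x8 images of handwritten digits, flattened to 64 features)
digits = load_digits()

# Put the pixel features in a DataFrame and append the class labels
df = pd.DataFrame(digits.data)
df["target"] = digits.target

# Separate the feature columns from the target column
features = df.drop("target", axis=1)
target = df["target"]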

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

<a id="2"></a>

## 2. `KFold` Class
* The **`KFold`** class and the **`cross_val_score`** function together perform **k-fold cross validation**
* **`n_splits=10`** specifies the **number of folds**
* **`shuffle=True`** **randomizes** the order of the samples before **splitting them into folds**
	* Particularly **important** if the **samples** might be **ordered** or **grouped** (as in the **Iris dataset**, which we'll see later)
* **`random_state=11`** seeds the shuffle so the folds are **reproducible**

In [None]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, random_state=11, shuffle=True)
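
To see what a `KFold` object actually produces, the sketch below calls its `split` method on a tiny array and prints the test indices of each fold, with and without shuffling. The array `X_demo`, the choice of 5 folds, and the dictionary names are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # hypothetical data: 10 samples, 2 features

splitters = {
    'shuffle=False': KFold(n_splits=5),
    'shuffle=True': KFold(n_splits=5, shuffle=True, random_state=11),
}

test_folds = {}
for name, splitter in splitters.items():
    # split() yields one (train indices, test indices) pair per fold
    test_folds[name] = [test_idx for _, test_idx in splitter.split(X_demo)]
    print(name)
    for i, test_idx in enumerate(test_folds[name], start=1):
        print(f'  fold {i} test indices: {test_idx}')
```

Without shuffling, each test fold is a block of consecutive rows, which is why ordered or grouped data can give misleading scores unless `shuffle=True` is set.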

<a id="3"></a>

## 3. Calling Function `cross_val_score` to Train and Test Your Model
* **`estimator=knn`**: the **estimator** to validate
* **`X=features`**: the **samples** to use for training and testing
* **`y=target`**: the **target labels** (the known correct classes) for the samples
* **`cv=kfold`**: the **cross-validation generator** that defines how to **split** the **samples** and **targets** for training and testing
* Returns an **array with one score per fold** (for a classifier, the default score is accuracy)
* Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=knn, X=features, y=target, cv=kfold)

In [None]:
scores

In [None]:
scores.min()

In [None]:
scores.max()

In [None]:
print(f'Mean accuracy: {scores.mean():.2%}')

In [None]:
print(f'Accuracy standard deviation: {scores.std():.2%}')
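
As a rough picture of what `cross_val_score` does with these arguments, the sketch below runs the same train/test loop by hand and compares the result with the library call. It is a simplified approximation, not scikit-learn's actual implementation. It reloads the digits data as NumPy arrays (`X`, `y`) instead of the DataFrame so the fold indices can be applied directly, and `manual_scores` and `library_scores` are names chosen for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# NumPy arrays, so the integer index arrays from KFold.split apply directly
X, y = load_digits(return_X_y=True)

knn = KNeighborsClassifier()
kfold = KFold(n_splits=10, random_state=11, shuffle=True)

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    model = clone(knn)                     # fresh, unfitted copy for each fold
    model.fit(X[train_idx], y[train_idx])  # train on k - 1 folds
    # score() returns accuracy on the held-out fold
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(estimator=knn, X=X, y=y, cv=kfold)

print(f'Manual mean accuracy:  {manual_scores.mean():.2%}')
print(f'Library mean accuracy: {library_scores.mean():.2%}')
```

Because `random_state` is fixed, both approaches see the same folds, so the per-fold scores should agree.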

<a id="4"></a>

## 4. Cross Validation: Summary

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, random_state=11, shuffle=True)

from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=knn, X=features, y=target, cv=kfold)

print(f'Mean accuracy: {scores.mean():.2%}')
print(f'Accuracy standard deviation: {scores.std():.2%}')