**Table of contents**<a id='toc0_'></a>    
- [Bias-variance trade-off](#toc1_)    
  - [From previous class: Breast cancer dataset](#toc1_1_)    
- [Cross Validation](#toc2_)    
    - [**What is cross-validation?**](#toc2_1_1_)    
    - [**Why cross-validation?**](#toc2_1_2_)    
  - [Stratified K-Fold Cross Validation](#toc2_2_)    
  - [Repeated KFold](#toc2_3_)    
    - [**How to choose K?**](#toc2_3_1_)    
  - [Shuffle Split](#toc2_4_)    
  - [Stratified Shuffle Split](#toc2_5_)    
  - [Time Series Cross Validation](#toc2_6_)    
  - [Extra: Leave-One-Out Cross-Validation](#toc2_7_)    
- [Pickling](#toc3_)    
  - [Save the model](#toc3_1_)    
  - [Load the model](#toc3_2_)    
  - [Save the data](#toc3_3_)    
  - [Load the data](#toc3_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Bias-variance trade-off](#toc0_)

![](https://miro.medium.com/v2/resize:fit:828/format:webp/1*9hPX9pAO3jqLrzt0IE3JzA.png)  
(Source: [Understanding the Bias-Variance Tradeoff, Medium](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229))

**What is bias?**
> Bias is the **difference between the average prediction of our model and the correct value** which we are trying to predict. 
> Models with high bias pay very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data. [$^{[1]}$](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229)

**What is variance?**

> Variance is the **variability of model prediction for a given data point** or a value which tells us spread of our data. Model with high variance pays a lot of attention to training data and does not generalize on the data which it hasn’t seen before. As a result, such models perform very well on training data but has high error rates on test data. [$^{[1]}$](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229)

**[Extra: The maths of it all](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote12.html)**  

$\underbrace{E_{\mathbf{x}, y, D} \left[\left(h_{D}(\mathbf{x}) - y\right)^{2}\right]}_\mathrm{Expected\;Test\;Error} = \underbrace{E_{\mathbf{x}, D}\left[\left(h_{D}(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^{2}\right]}_\mathrm{Variance} + \underbrace{E_{\mathbf{x}, y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^{2}\right]}_\mathrm{Noise} + \underbrace{E_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^{2}\right]}_\mathrm{Bias^2}$

**What is noise?**  

> This error measures ambiguity due to your data distribution and feature representation. You can never beat this, it is an aspect of the data. [$^{[2]}$](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote12.html)

**What do we want?**

A model complex enough to understand the data but not so complex that it memorizes the data.

![](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)  

(Source: [Understanding the Bias-Variance Tradeoff, Scott Fortmann-Roe](https://scott.fortmann-roe.com/docs/BiasVariance.html))
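
The decomposition above can be checked numerically. The following is a minimal sketch, not part of the original class material: it assumes a made-up target function `sin(2πx)`, Gaussian noise, and polynomial fits of degree 1 and 9, all chosen purely for illustration. It refits each model on many simulated training sets and estimates the squared bias and the variance of the predictions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical "ground truth" used only for this illustration
    return np.sin(2 * np.pi * x)

x = np.linspace(0, 1, 25)   # fixed input locations
noise_sd = 0.3              # standard deviation of the label noise
n_datasets = 200            # number of simulated training sets D

def bias2_and_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_datasets, x.size))
    for i in range(n_datasets):
        # Draw a fresh noisy training set and fit h_D on it
        y = true_fn(x) + rng.normal(0, noise_sd, size=x.size)
        preds[i] = Polynomial.fit(x, y, degree)(x)
    mean_pred = preds.mean(axis=0)                   # average model, h-bar(x)
    bias2 = np.mean((mean_pred - true_fn(x)) ** 2)   # (h-bar(x) - y-bar(x))^2, averaged over x
    variance = np.mean(preds.var(axis=0))            # spread of h_D(x) around h-bar(x)
    return bias2, variance

decomposition = {degree: bias2_and_variance(degree) for degree in (1, 9)}
for degree, (bias2, variance) in decomposition.items():
    print(f'degree={degree}: bias^2={bias2:.4f}, variance={variance:.4f}')
```

Under these assumptions, the straight line should show the larger squared bias (it underfits) and the degree-9 polynomial the larger variance (it follows the noise more closely).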

## <a id='toc1_1_'></a>[From previous class: Breast cancer dataset](#toc0_)

In [None]:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
cancer = load_breast_cancer()

In [None]:
# Extract dataset into pandas
features = pd.DataFrame(cancer['data'], columns = cancer['feature_names'])
labels = pd.Series(cancer['target'], name = 'labels')

In [None]:
# Display features & labels
display(features)
display(labels)

In [None]:
from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=2)

In [None]:
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier

# Initialize and fit model
rf_model = RandomForestClassifier(max_depth=3, n_estimators=100)
rf_model.fit(X_train, y_train)

In [None]:
# Review overall accuracy score
from sklearn.metrics import accuracy_score
print('Train accuracy:', accuracy_score(y_train, rf_model.predict(X_train)))
print('Test accuracy:', accuracy_score(y_test, rf_model.predict(X_test)))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, rf_model.predict(X_test)))

# <a id='toc2_'></a>[Cross Validation](#toc0_)

### <a id='toc2_1_1_'></a>[**What is cross-validation?**](#toc0_)

![](https://imgs.search.brave.com/tEBDW7f_GRHyGUhYVI0mmwKHv5NYPdYEFKxDqBUF3mk/rs:fit:860:0:0/g:ce/aHR0cHM6Ly93d3cu/c2VjdGlvbi5pby9l/bmdpbmVlcmluZy1l/ZHVjYXRpb24vaG93/LXRvLWltcGxlbWVu/dC1rLWZvbGQtY3Jv/c3MtdmFsaWRhdGlv/bi81LWZvbGQtY3Yu/anBlZw)  
(Source: [How to Implement K fold Cross-Validation in Scikit-Learn, Section.io](https://www.section.io/engineering-education/how-to-implement-k-fold-cross-validation/))

### <a id='toc2_1_2_'></a>[**Why cross-validation?**](#toc0_)

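When we use test-set accuracy to tune hyperparameters, we indirectly feed the model information about the test set, i.e. which hyperparameters work best on it, so the test score stops being an honest estimate of performance on unseen data. Cross-validation addresses this by repeatedly splitting the training data into training and validation folds and averaging the validation scores, which leaves the test set untouched until the final evaluation.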
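
To make the mechanics concrete, here is a minimal from-scratch sketch of how k-fold splitting works. The helper name `k_fold_indices` is hypothetical and written only for this illustration; in practice `sklearn`'s splitters do this job.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)   # shuffle the row indices once
    folds = np.array_split(shuffled, k)     # k roughly equal-sized folds
    for i in range(k):
        val_idx = folds[i]
        # All other folds together form the training set for this round
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f'Fold {fold}: train={sorted(train_idx.tolist())}, validation={sorted(val_idx.tolist())}')
```

A model would be fitted on `train_idx` and scored on `val_idx` in every round, and the k scores averaged.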

In [None]:
import pandas as pd
import numpy as np

## <a id='toc2_2_'></a>[Stratified K-Fold Cross Validation](#toc0_)

This is what you should do whenever you tune hyperparameters, i.e. choose the optimal parameters for your model, unless the training process is too expensive or takes too long to repeat for every fold. For classification, the cross-validation should be **stratified based on the target**; `sklearn`'s cross-validation functions do this by default when the estimator is a classifier and `cv` is an integer.
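
As a small illustration of what stratification buys you, the sketch below compares plain `KFold` with `StratifiedKFold` on a toy, deliberately imbalanced target (90 negatives followed by 10 positives, an assumption made only for this example). It also uses `check_cv` to show which splitter `sklearn` picks for a classifier when `cv` is an integer.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Toy imbalanced target: 10% positives, sorted by class (illustrative only)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))   # the feature values do not matter for splitting

def positive_share_per_fold(splitter):
    """Share of positive labels in each validation fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]

plain = positive_share_per_fold(KFold(n_splits=5))
stratified = positive_share_per_fold(StratifiedKFold(n_splits=5))
print('KFold:          ', plain)
print('StratifiedKFold:', stratified)

# Splitter sklearn builds for a classifier when cv is an integer
default_cv = check_cv(cv=5, y=y, classifier=True)
print(type(default_cv).__name__)
```

Without shuffling, plain `KFold` puts all positives in the last fold here, while the stratified folds each keep the overall 10% share.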

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_009.png)  
(Source: [3.1. Cross-validation: evaluating estimator performance, scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators))

In [None]:
# Applying an example of cross validation
from sklearn.model_selection import cross_validate

# Cross-validate the random forest with 10 folds
results = cross_validate(rf_model, features, labels, cv=10)
print(results.keys())

In [None]:
# Review test scores per validation set
results['test_score']

In [None]:
# Review overall test score
results['test_score'].mean()

In [None]:
# Use a different scoring metric
results = cross_validate(rf_model, features, labels, cv=10, scoring='recall')
print(results.keys())

In [None]:
# Review overall test score
results['test_score'].mean()

## <a id='toc2_3_'></a>[Repeated KFold](#toc0_)

In [None]:
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score

In [None]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Cross-validate with 10 folds, repeated 3 times (30 scores in total)
scores = cross_val_score(rf_model, features, labels, scoring='accuracy', cv=cv, n_jobs=-1)
print(scores)
scores.mean()

In [None]:
# Use a different scoring metric
scores = cross_val_score(rf_model, features, labels, scoring='recall', cv=cv, n_jobs=-1)
print(scores)
scores.mean()

### <a id='toc2_3_1_'></a>[**How to choose K?**](#toc0_)

> Typical values for k are k=3, k=5, and k=10, with 10 representing the most common value. This is because, given extensive testing, 10-fold cross-validation provides a good balance of low computational cost and low bias in the estimate of model performance as compared to other k values and a single train-test split. [$^{[3]}$](https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/)
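
The trade-off in the quote can be made tangible with a bit of arithmetic. This is only a rough sketch: the helper `fold_summary` is hypothetical, and the sample count of 569 is the size of the breast cancer dataset used earlier. A larger k means more model fits (cost) but a larger share of the data used for training in each fit.

```python
def fold_summary(n_samples, k):
    """Return (number of model fits, fraction of data used for training per fit)."""
    return k, (k - 1) / k

n_samples = 569  # rows in the breast cancer dataset

# The last value, k = n_samples, corresponds to leave-one-out cross-validation
for k in (3, 5, 10, n_samples):
    n_fits, train_fraction = fold_summary(n_samples, k)
    print(f'k={k:>3}: {n_fits:>3} fits, {train_fraction:.1%} of the data used for training in each fit')
```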

## <a id='toc2_4_'></a>[Shuffle Split](#toc0_)

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_008.png)  
(Source: [3.1. Cross-validation: evaluating estimator performance, scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators))

> The ShuffleSplit iterator will generate a user defined number of independent train / test dataset splits. Samples are first shuffled and then split into a pair of train and test sets. It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state pseudo random number generator. [$^{[4]}$](https://scikit-learn.org/stable/modules/cross_validation.html#shufflesplit)

Even though Shuffle Split is a valid cross-validation strategy, Stratified Shuffle Split is usually the better choice for classification, because it keeps the proportion of target classes the same across all train-validation splits.
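
For reference, a minimal sketch of plain `ShuffleSplit` on 20 toy samples (the data and parameter values are chosen only for illustration). Unlike k-fold, each split is drawn independently, so the same sample can land in the test set of more than one split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(-1, 1)   # 20 toy samples

# 3 independent random splits, each holding out 25% of the samples
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
splits = list(ss.split(X))

for i, (train_idx, test_idx) in enumerate(splits):
    print(f'Split {i}: train size={len(train_idx)}, test indices={sorted(test_idx.tolist())}')
```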

## <a id='toc2_5_'></a>[Stratified Shuffle Split](#toc0_)

> StratifiedShuffleSplit is a variation of ShuffleSplit, which returns stratified splits, i.e which creates splits by preserving the same percentage for each target class as in the complete set. [$^{[4]}$](https://scikit-learn.org/stable/modules/cross_validation.html#shufflesplit)

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_012.png)  
(Source: [3.1. Cross-validation: evaluating estimator performance, scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators))

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Set up the cross validator
cv_sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
cv_sss.get_n_splits(features, labels)

In [None]:
# Check what the stratified shuffle split does
for i, (train_indices, test_indices) in enumerate(cv_sss.split(features, labels)):
    print('Split no:', i)
    print('Train indices:', train_indices[:5])
    print('Test indices:', test_indices[:5])

In [None]:
# Now see it in action! ...manually
results = []
for train_index, test_index in cv_sss.split(features, labels):
    X_train, X_test = features.iloc[train_index], features.iloc[test_index]
    y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
    rf_model.fit(X_train, y_train)
    pred = rf_model.predict(X_test)
    results.append(accuracy_score(y_test, pred))

In [None]:
results

In [None]:
# And now using sklearn's cross_val_score
scores = cross_val_score(rf_model, features, labels, scoring='accuracy', cv=cv_sss, n_jobs=-1)
print(scores)
scores.mean()

## <a id='toc2_6_'></a>[Time Series Cross Validation](#toc0_)

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_013.png)   
(Source: [3.1. Cross-validation: evaluating estimator performance, scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators))
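
The key property of the time series split is that the model is always trained on the past and validated on the future. A minimal sketch on 12 toy, time-ordered samples (an assumption for illustration only) shows the expanding training window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 toy observations, already in time order

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(splits):
    # The training window grows; the test window always comes right after it
    print(f'Split {i}: train={train_idx.tolist()}, test={test_idx.tolist()}')
```

This is why the data must be sorted by time before splitting, and why the rows are never shuffled.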

In [None]:
occupancy = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/occupancy.csv')
occupancy.set_index('date', inplace=True)
occupancy.head()

In [None]:
features = occupancy.drop('Occupancy', axis=1)
labels = occupancy['Occupancy']

In [None]:
from sklearn.model_selection import TimeSeriesSplit

# Set up the cross validator
ts_sss = TimeSeriesSplit(n_splits=6)
ts_sss.get_n_splits(features)

In [None]:
# Review how the time series split works
for i, (train_index, test_index) in enumerate(ts_sss.split(features)):
    print('Split no:', i)
    print('Train set size:', len(train_index))
    print('Test set size:', len(test_index))

In [None]:
# And see it in action!... manually
results = []
for train_index, test_index in ts_sss.split(features):
    X_train, X_test = features.iloc[train_index], features.iloc[test_index]
    y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
    rf_model.fit(X_train, y_train)
    pred = rf_model.predict(X_test)
    results.append(accuracy_score(y_test, pred))

In [None]:
results

In [None]:
# And now using sklearn's cross_val_score
scores = cross_val_score(rf_model, features, labels, scoring='accuracy', cv=ts_sss, n_jobs=-1)
print(scores)
scores.mean()

## <a id='toc2_7_'></a>Extra: [Leave-One-Out Cross-Validation](https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/) [&#8593;](#toc0_)
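
Leave-one-out is k-fold cross-validation with k equal to the number of samples. The sketch below is a minimal illustration on a small synthetic two-class dataset with a k-nearest-neighbours classifier; both are assumptions chosen to keep the run fast and are not the dataset or model used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two small synthetic clusters, 15 points each (illustrative data)
X = np.concatenate([rng.normal(0, 1, size=(15, 2)), rng.normal(3, 1, size=(15, 2))])
y = np.array([0] * 15 + [1] * 15)

loo = LeaveOneOut()
print('Number of fits:', loo.get_n_splits(X))

# Each score is the accuracy on a single held-out sample, so it is either 0 or 1
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=loo)
print('Mean accuracy:', scores.mean())
```

Because one model is fitted per sample, this becomes expensive quickly on larger datasets or slower models.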

# <a id='toc3_'></a>[Pickling](#toc0_)

We can pickle many Python objects, including fitted ML models and pandas DataFrames.

In [None]:
import pickle

## <a id='toc3_1_'></a>[Save the model](#toc0_)

In [None]:
with open('rf_model.pkl', 'wb') as file:
    pickle.dump(rf_model, file)

In [None]:
# One-line alternative; the file is not closed explicitly, so the `with` form is safer
pickle.dump(rf_model, open('rf_model.pkl', 'wb'))

## <a id='toc3_2_'></a>[Load the model](#toc0_)

In [None]:
with open('rf_model.pkl', 'rb') as file:
    rf_model = pickle.load(file)

In [None]:
# One-line alternative; the file is not closed explicitly, so the `with` form is safer
rf_model = pickle.load(open('rf_model.pkl', 'rb'))
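
A quick way to convince yourself that the round trip worked is to compare predictions before and after loading. The following is a minimal, self-contained sketch that trains a small random forest on synthetic data and writes to a temporary directory; the data, the variable names, and the temporary path are assumptions for illustration only.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))       # synthetic features
y = (X[:, 0] > 0).astype(int)      # synthetic binary target

model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'rf_model.pkl'
    with open(path, 'wb') as file:   # save the fitted model
        pickle.dump(model, file)
    with open(path, 'rb') as file:   # load it back
        restored = pickle.load(file)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print('Predictions identical after round trip:', same_predictions)
```

Two practical notes: only unpickle files from sources you trust, since loading a pickle can execute arbitrary code, and load models with the same `sklearn` version that saved them where possible.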

## <a id='toc3_3_'></a>[Save the data](#toc0_)

In [None]:
X_train.to_pickle('train_data.pkl')
y_train.to_pickle('train_label.pkl')

X_test.to_pickle('test_data.pkl')
y_test.to_pickle('test_label.pkl')

## <a id='toc3_4_'></a>[Load the data](#toc0_)

In [None]:
X_train = pd.read_pickle('train_data.pkl')
X_train