# Model Selection and parameter tuning

In [27]:
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from matplotlib import cm
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context('poster')
sns.set(rc={'figure.figsize': (16., 9.)})
sns.set_style('whitegrid')

# Modeling libraries
import statsmodels.formula.api as smf # welcome!!
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve, GridSearchCV
from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn import set_config
set_config(display='diagram')

import pandas as pd
np.random.seed(123)

import warnings
warnings.filterwarnings('ignore')


## validation set

When training a model, different parameter settings may lead to very different solutions. We could choose the ones that minimize the error on the **train set**, but can you recall why that was not a good idea?

OK, perhaps we should then choose the parameters that minimize the error on the **test set**. But this turns out to yield overoptimistic results: we would be choosing the parameter settings that perform best on our particular test set, and those do not necessarily generalize well to other unseen observations.

The solution is to create a **validation set** that is used to validate the parameter selections. Remember, the **test set** is to be used *exclusively* for assessing the quality of your model.

![image.png](attachment:image.png)

### Example: tune regularization in Lasso

In [43]:
# load data


In [44]:
# Split target and predictors


# Split data into train / validation / test


In [45]:
# Select the best model with the validation set approach


7.979797979797979
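
The cells above are left empty for the exercise. As a minimal sketch of the validation set approach, assuming synthetic data from `make_regression` in place of the notebook's dataset and a hypothetical grid of `alpha` values, one possible version is:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic data stands in for the dataset loaded in the notebook
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=123)

# First hold out 20% as the test set, then split the rest 75/25,
# which gives 60% train / 20% validation / 20% test overall
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=123)

# Hypothetical grid of regularization strengths
alphas = np.linspace(0.01, 10, 100)

# Fit on the train set only, score each alpha on the validation set
val_errors = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

best_alpha = alphas[int(np.argmin(val_errors))]

# Refit with the chosen alpha on train + validation,
# then touch the test set exactly once
final_model = Lasso(alpha=best_alpha).fit(X_rest, y_rest)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(best_alpha, test_mse)
```

The test set plays no role in choosing `best_alpha`, so `test_mse` remains an honest estimate of the error on unseen data.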

## k-fold cross validation


![image.png](attachment:image.png)
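
In k-fold cross validation the training data is split into k folds, and each fold serves once as the validation set while the model is fit on the remaining k-1 folds. A minimal sketch with `cross_val_score`, assuming synthetic data and an arbitrary `alpha=1.0`, could look like this:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data stands in for the notebook's dataset
X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=123)

# cv=5 requests 5 folds; sklearn scorers follow "higher is better",
# so the mean squared error is returned with a negative sign
scores = cross_val_score(Lasso(alpha=1.0), X, y, cv=5,
                         scoring='neg_mean_squared_error')

# Average over the folds and flip the sign to recover the MSE
cv_mse = -scores.mean()
print(scores, cv_mse)
```

Averaging over folds makes the estimate less dependent on one particular train/validation split than the single validation set approach.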

### GridSearch

We explore a grid of parameter values and evaluate the performance of the model for each combination, typically with cross validation. Then we keep the best combination and use it to calculate our error on the test set.

**Note** These few lines are really enough to train good models!
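
As a hedged illustration of the idea, the sketch below tunes the `alpha` of a Lasso with `GridSearchCV`. The synthetic data, the grid values and `max_iter=10000` are assumptions made for this example, not settings taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic data stands in for the notebook's dataset
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Hypothetical grid: 10 values of alpha between 0.01 and 10
param_grid = {'alpha': np.logspace(-2, 1, 10)}

# Each alpha is scored with 5-fold cross validation on the train set
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5,
                      scoring='neg_mean_squared_error')
search.fit(X_train, y_train)

best_alpha = search.best_params_['alpha']

# By default (refit=True) the best model is refit on the whole train set,
# so search.predict uses it directly
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(best_alpha, test_mse)
```

Because the cross validation happens inside `fit`, no separate validation set has to be carved out by hand.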

## K-Nearest Neighbors (classification)



[See Wikipedia article](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)

[See sklearn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneig#sklearn.neighbors.KNeighborsClassifier)

![image.png](attachment:image.png)

In [24]:
df_cancer = pd.read_csv('../datasets/breast_cancer.csv')

In [82]:
# [self-guided] Apply a KNN classifier with a GridSearchCV

# How does this result compare to the one using LogisticRegression?
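
One possible way to approach the self-guided exercise is sketched below. It assumes the copy of the breast cancer data bundled with scikit-learn (`load_breast_cancer`) instead of the notebook's `breast_cancer.csv`, whose columns are not shown here, and the grid values are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: sklearn's bundled dataset stands in for ../datasets/breast_cancer.csv
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)

# Hypothetical grid: odd numbers of neighbors and both weighting schemes
param_grid = {
    'n_neighbors': list(range(1, 31, 2)),
    'weights': ['uniform', 'distance'],
}

# 5-fold cross validation on the train set for every combination
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5,
                      scoring='accuracy')
search.fit(X_train, y_train)

# Accuracy of the refit best model on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

The features are used unscaled in this sketch, which matters for a distance-based model such as KNN.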

## Standardization of variables

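Standardization is needed with coefficient shrinkage (Lasso and Ridge) and whenever the model relies on distances between observations (such as KNN), because both are sensitive to the scale of each variable.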
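
A minimal sketch of standardization with `StandardScaler`, again assuming sklearn's bundled breast cancer data as a stand-in. The scaler is fit on the train set only, so no information from the test set leaks into the preprocessing.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: sklearn's bundled dataset stands in for the notebook's data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)

# Learn each column's mean and standard deviation from the train set only
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to both sets
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Train columns now have mean ~0 and standard deviation ~1;
# test columns are only approximately standardized
print(X_train_std.mean(axis=0).round(2))
print(X_train_std.std(axis=0).round(2))
```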

## Extra: sklearn pipelines

A pipeline encapsulates all parts of the modeling process, such as preprocessing and the final estimator, in a single object.
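
As a sketch under the same assumptions as before (sklearn's bundled breast cancer data, a hypothetical grid), a pipeline can chain the scaler and a KNN classifier and be tuned as one estimator. The step names are the ones `make_pipeline` generates automatically from the lowercased class names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: sklearn's bundled dataset stands in for the notebook's data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)

# Scaling and classification wrapped in one estimator
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Parameters of a step are addressed as <step name>__<parameter>
param_grid = {'kneighborsclassifier__n_neighbors': list(range(1, 31, 2))}

# Inside each fold the scaler is refit on that fold's training part only
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Putting the scaler inside the pipeline keeps the cross validation honest, since the validation fold never influences the scaling statistics.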

## References
* [Introduction to Statistical Learning ISL (Chapter 2)](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf)
* [repo from tdpetrou with materials from ISL](https://github.com/tdpetrou/Machine-Learning-Books-With-Python/tree/master/Introduction%20to%20Statistical%20Learning)
