## Validation curves (plotting scores to evaluate models) and Grid Search
https://scikit-learn.org/stable/modules/learning_curve.html#learning-curve<br>
https://scikit-learn.org/stable/modules/grid_search.html#grid-search

In [71]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Temporarily Suppressing Warnings
import warnings
warnings.filterwarnings("ignore")

In [72]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

In [6]:
# np.random.seed(1)
# x = 10 * np.random.rand(50)
# y = np.sin(x) + 0.1 * np.random.randn(50)
# plt.scatter(x, y)

In [5]:
# poly_model = make_pipeline(PolynomialFeatures(),
#                            LinearRegression())

In [3]:
# poly_model.named_steps

## Validation curves in Scikit-Learn

In [1]:
# # http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.validation_curve.html
# from sklearn.model_selection import validation_curve
# degree = np.arange(0, 25)

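# # param_name and param_range must be passed by keyword in recent scikit-learn releases
# train_score, val_score = validation_curve(poly_model, x[:, np.newaxis], y,
#                                           param_name='polynomialfeatures__degree',
#                                           param_range=degree, cv=5)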
# print(train_score.shape)
# print(train_score[0:5,])
# print(val_score.shape)
# print(val_score[0:5,])

In [2]:
# plt.figure(figsize=(12,4))
# plt.plot(degree, np.median(train_score, 1), color='blue', label='training score')
# plt.plot(degree, np.median(val_score, 1), color='red', label='validation score')
# plt.xticks(degree)
# plt.legend(loc='best')
# plt.ylim(0, 1)
# plt.xlabel('degree')
# plt.ylabel('score');

This shows precisely the qualitative behavior we expect: the training score is everywhere higher than the validation score; the training score is monotonically improving with increased model complexity; and the validation score reaches a maximum before dropping off as the model becomes over-fit.

Optimal degree is ???
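
One way to read the optimal degree off the curve is to take the degree whose median validation score is highest. The sketch below is only an illustration under stated assumptions: it rebuilds the synthetic data with a local `RandomState(1)` generator, restricts the search to degrees 1 through 12 to keep it quick, and introduces the names `median_val` and `best_degree`, which are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following the recipe in the commented-out cell above
# (assumption: a local RandomState with seed 1 stands in for np.random.seed(1)).
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)

# Assumption: degrees 1..12 instead of 0..24, to keep the example fast.
degree = np.arange(1, 13)

train_score, val_score = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    x[:, np.newaxis],
    y,
    param_name="polynomialfeatures__degree",
    param_range=degree,
    cv=5,
)

# One row per degree, one column per CV fold; summarise each row by its median.
median_val = np.median(val_score, axis=1)
best_degree = int(degree[np.argmax(median_val)])
print("degree with the highest median validation score:", best_degree)
```

Neighbouring degrees often score almost identically, so the exact winner can shift with the random seed or the number of folds; the shape of the curve matters more than the single argmax.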

## Validation in Practice: Grid Search

In practice, models generally have more than one knob to turn, and plots of validation curves change from lines to multi-dimensional surfaces. In these cases, such visualizations are difficult, and we would rather simply find the particular model that maximizes the validation score.

Scikit-Learn provides automated tools to do this in the grid search module.
Here is an example of using grid search to find the optimal polynomial model.
We will explore a two-dimensional grid of model settings: the polynomial degree and the flag that tells the regression whether to normalize.
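
A minimal runnable sketch of such a search is given here, under stated assumptions. Because the `normalize` option of `LinearRegression` was removed in scikit-learn 1.2, this sketch swaps in `fit_intercept` as the second boolean flag so that it runs on current releases; that substitution, the shorter degree range (4 through 10), and the locally seeded data are choices made for illustration, not the notebook's own grid.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following the recipe used earlier in the notebook
# (assumption: a local RandomState with seed 1).
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)

poly_model = make_pipeline(PolynomialFeatures(), LinearRegression())

# Two-dimensional grid: 7 degrees x 2 flag values = 14 candidate models.
# Assumption: fit_intercept replaces the removed normalize flag.
param_grid = {
    "polynomialfeatures__degree": np.arange(4, 11),
    "linearregression__fit_intercept": [True, False],
}

grid = GridSearchCV(poly_model, param_grid, cv=5)
grid.fit(x[:, np.newaxis], y)

print(grid.best_params_)  # parameter combination with the best mean CV score
print(grid.best_score_)   # that mean cross-validated R^2
```

Parameter names follow the `<step name>__<parameter>` pattern, where the step names are the lowercased class names that `make_pipeline` assigns automatically.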

In [7]:
# # http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
# from sklearn.model_selection import GridSearchCV

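# # Note: `normalize` was removed from LinearRegression in scikit-learn 1.2;
# # on newer versions use another boolean flag, e.g. 'linearregression__fit_intercept'.
# param_grid = {'polynomialfeatures__degree': np.arange(4, 17),
#               'linearregression__normalize': [True, False]}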

# grid = GridSearchCV(poly_model, param_grid, cv=5)

In [9]:
# grid.fit(x[:, np.newaxis], y)

In [10]:
# grid.best_params_

In [11]:
# grid.best_score_

In [13]:
# grid.cv_results_  # replaces grid_scores_, which was removed in scikit-learn 0.20

In [14]:
# # Build a model using best parameters

# poly_model = make_pipeline(PolynomialFeatures(10),
#                            LinearRegression())

# poly_model.fit(x[:, np.newaxis], y)
# xfit = np.linspace(0, 10, 1000)
# yfit = poly_model.predict(xfit[:, np.newaxis])


# plt.scatter(x, y)
# plt.plot(xfit, yfit);

## Learning curves in Scikit-Learn

In [15]:
# # http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
# from sklearn.model_selection import learning_curve

# N, train_lc, val_lc = learning_curve(poly_model, x[:, np.newaxis], y, cv=5, 
#                                      train_sizes=np.linspace(0.3, 1, 20))

In [16]:
# np.linspace(0.3, 1, 20)

In [17]:
# N

In [18]:
# plt.figure(figsize=(12,4))

# plt.plot(N, np.mean(train_lc, 1), color='blue', label='training score')
# plt.plot(N, np.mean(val_lc, 1), color='red', label='validation score')
# plt.hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1],
#                  color='gray', linestyle='dashed')

# plt.ylim(0, 1)
# plt.xlim(N[0], N[-1])
# plt.xlabel('training size')
# plt.ylabel('score')
# plt.title('degree = 10')
# plt.legend(loc='best');
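
The learning-curve cells above can be condensed into a self-contained sketch like the following. It is an illustration under assumptions: the data are regenerated with a local `RandomState(1)`, only 8 training sizes are used instead of 20 to keep it quick, and the scores are printed rather than plotted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following the recipe used earlier in the notebook
# (assumption: a local RandomState with seed 1).
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)

# Degree 10 matches the plot title in the cell above.
poly_model = make_pipeline(PolynomialFeatures(10), LinearRegression())

# Fractions in train_sizes are relative to the largest training split,
# which is 40 of the 50 samples under 5-fold cross-validation.
N, train_lc, val_lc = learning_curve(
    poly_model,
    x[:, np.newaxis],
    y,
    cv=5,
    train_sizes=np.linspace(0.3, 1, 8),
)

# One row per training size, one column per fold.
for n, tr, va in zip(N, train_lc.mean(axis=1), val_lc.mean(axis=1)):
    print(f"n={n:2d}  train={tr:.3f}  validation={va:.3f}")
```

With very few training points a degree-10 polynomial can produce strongly negative validation scores, which is why the plotting cell clips the y-axis to the range 0 to 1.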