# Regularisation (L2 or Ridge)
## Cambridge ML Commando Course

In this notebook we will:
- create noisy data based on a pure signal
- create regressors with various non-linear features
- test their fits and plot them, along with printing their cross-validation scores
- apply ridge regression to reduce overfitting, and inspect how it works
- plot a validation curve for our ridge regressor
- plot learning curves for all our regressors

In [None]:
%matplotlib inline
%pylab inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

import scipy as sp
import sklearn
import IPython
import platform

from sklearn import preprocessing

# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"

print ('Python version:', platform.python_version())
print ('IPython version:', IPython.__version__)
print ('numpy version:', np.__version__)
print ('scikit-learn version:', sklearn.__version__)
print ('matplotlib version:', matplotlib.__version__)


### Generate noisy data
Start by creating a "truth" function (here the sine function) and use it to generate some noisy samples

In [None]:
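X_pure = np.arange(0, 2*np.pi, 0.1) # evenly spaced grid over one period of the sine

true_fun = lambda x : np.sin(x)

np.random.seed(666)
# pick 30 distinct grid points as sample locations
# (np.random.choice is spelled out rather than relying on the bare `random` injected by %pylab)
X = np.sort(np.random.choice(X_pure, size=30, replace=False))

y_pure = true_fun(X_pure)
# y = np.sin(X) + (np.random.random(len(X))-0.5)*1.0 # generate points with uniform noise
y = true_fun(X) + np.random.randn(len(X))*0.25 # generate points with Gaussian noise
plt.plot(X_pure, true_fun(X_pure))
plt.scatter(X,y)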

### Create and train regressors
Import the tools we need from sklearn.  We use PolynomialFeatures to create higher-order features from our samples in X (with a single input feature there are no interaction terms, only powers of x), and then standardise them with StandardScaler.

We use Pipeline objects to organise these feature creation and scaling steps so that we don't need to apply each step explicitly.
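
To make the feature expansion concrete, here is a minimal sketch on a hypothetical toy input (`X_demo`, `y_demo` and `pipe_demo` are illustrative names, not part of the notebook's data). It shows the columns that a degree-3 `PolynomialFeatures` produces for one input feature, and the same expansion wrapped in a pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical toy input: four samples of a single feature.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])

# A degree-3 expansion of one feature gives the columns 1, x, x^2, x^3.
poly_demo = PolynomialFeatures(degree=3)
X_poly = poly_demo.fit_transform(X_demo)
print(X_poly)
print(poly_demo.powers_.ravel())  # exponent of x used for each column

# The same steps chained in a pipeline: expand, scale, then regress.
pipe_demo = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("scale", StandardScaler()),
    ("reg", LinearRegression()),
])
y_demo = X_demo.ravel() ** 2  # a target that a cubic can represent exactly
pipe_demo.fit(X_demo, y_demo)
pred_demo = pipe_demo.predict(X_demo)
print(pred_demo)
```

Calling `fit` or `predict` on the pipeline runs every step in order, so the expansion and scaling never have to be applied by hand.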

Here we create:
- reg: standard linear regression
- reg3: order-3 (cubic) polynomial regression
- reg15: order-15 (quindecic) polynomial regression
- ridge15: another quindecic polynomial regression, but this time with L2 regularisation

Note how:
- The line _underfits_ the true curve
- The cubic fits pretty well
- The order-15 curve _overfits_ the true curve
- The ridge-regression curve, built on the same order-15 features, is a better fit than the unregularised order-15 curve


In [None]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
print(X.shape, y.shape)

X = X.reshape(-1,1)
X_pure = X_pure.reshape(-1,1)

plt.ylim(-1.6, 1.5)
plt.plot(X_pure,true_fun(X_pure), linestyle="--", label="true")

reg = LinearRegression()
reg.fit(X,y)
plt.scatter(X, y)
plt.plot(X_pure, reg.predict(X_pure), label="linear")

scores = cross_val_score(reg, X, y, scoring="neg_mean_squared_error", cv=10)

scaler3 = StandardScaler()
poly3 = PolynomialFeatures(3)
steps = [
    ("poly",poly3),
    ("scale",scaler3),
    ("reg",LinearRegression())
]
reg3 = Pipeline(steps)

reg3.fit(X, y)
plt.plot(X_pure, reg3.predict(X_pure), label="cubic")
scores3 = cross_val_score(reg3, X, y, scoring="neg_mean_squared_error", cv=10)


scaler15 = StandardScaler()
poly15 = PolynomialFeatures(15)
steps = [
    ("poly",poly15),
    ("scale",scaler15),
    ("reg",LinearRegression())
]
reg15 = Pipeline( steps )

reg15.fit(X,y)
plt.plot(X_pure, reg15.predict(X_pure), label="quindecic (15)")
scores15 = cross_val_score(reg15, X, y, scoring="neg_mean_squared_error", cv=10)

steps = [
    ("poly",poly15),
    ("scale",scaler15),
    ("reg",Ridge(alpha=0.01))
]
ridge15 = Pipeline( steps )
ridge15.fit(X,y)
plt.plot(X_pure, ridge15.predict(X_pure), label="ridge (15)")
ridge_scores = cross_val_score(ridge15, X, y, scoring="neg_mean_squared_error", cv=10)

plt.legend()


print("Linear model")
print(-np.mean(scores), np.std(scores))

print("Cubic model")
print(-np.mean(scores3), np.std(scores3))

print("Quindecic model")
print(-np.mean(scores15), np.std(scores15))

print("Ridge regression regularisation")
print(-np.mean(ridge_scores), np.std(ridge_scores))

plt.gcf().set_size_inches(10,8)

### Varying the regularisation weight
In this section we vary the _alpha_ value (often styled $\lambda$ in the literature) to see how it changes the smoothness of the fitted curve.  We plot the fit for several values.
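
For reference, the objective that scikit-learn's `Ridge` minimises can be written as below, where $w$ is the coefficient vector and $\alpha$ is the regularisation weight (the intercept is not penalised):

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

With $\alpha = 0$ this reduces to ordinary least squares; as $\alpha$ grows, large coefficients are penalised more heavily and the fitted curve becomes smoother.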

Try out different values and see what happens.

In [None]:
alphas = np.logspace(-7, 3, 7)
ax = plt.figure().gca()
ax.scatter(X, y)
for a in alphas:
    steps = [
    ("poly",poly15),
    ("scale",scaler15),
    ("reg",Ridge(alpha=a))
    ]
    new_ridge = Pipeline( steps )
    new_ridge.fit(X,y)
#     new_ridge_scores = cross_val_score(new_ridge, X, y, scoring="neg_mean_squared_error", cv=10)
    ax.plot(X_pure, new_ridge.predict(X_pure), label=a)
plt.legend(title="alpha")
plt.gcf().set_size_inches(10,8)
plt.title('Ridge (15): Fit to datapoints under increasing regularisation')
plt.axis('tight')
plt.ylim(-1.5,1.5) # set after axis('tight'), which would otherwise reset the y-limits
plt.show()

### Constraining the coefficients
Ridge regression is just Linear Regression with L2 regularisation.  As we increase the regularisation hyperparameter (alpha), we cause the coefficients of the regression terms to shrink.

Here we plot them for various values of alpha.
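
The shrinkage can also be checked numerically. The sketch below is an illustration on hypothetical random data (`A`, `b` and `ridge_closed_form` are names introduced here, not part of the notebook). It solves the ridge normal equations $(A^\top A + \alpha I)\,w = A^\top b$ directly, without an intercept, and prints the norm of the coefficient vector for increasing alpha.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical toy regression problem, centred so that no intercept is needed.
A = rng.normal(size=(30, 5))
A = A - A.mean(axis=0)
w_true = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
b = A @ w_true + rng.normal(scale=0.1, size=30)
b = b - b.mean()

def ridge_closed_form(A, b, alpha):
    """Solve (A^T A + alpha * I) w = A^T b for the ridge coefficients w."""
    n_features = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n_features), A.T @ b)

# The norm of the coefficient vector falls as alpha grows.
demo_alphas = [0.0, 1.0, 100.0, 10000.0]
norms = [np.linalg.norm(ridge_closed_form(A, b, a)) for a in demo_alphas]
for a, n in zip(demo_alphas, norms):
    print(f"alpha={a:>8}: ||w|| = {n:.4f}")

# Cross-check one value against scikit-learn's Ridge (intercept disabled).
sk_coef = Ridge(alpha=1.0, fit_intercept=False).fit(A, b).coef_
print(np.allclose(sk_coef, ridge_closed_form(A, b, 1.0), atol=1e-6))
```

The closed form is only meant to show where the shrinkage comes from; the notebook itself relies on `Ridge` inside a pipeline.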

In [None]:
n_alphas = 20
alphas =np.logspace(-4, 3, n_alphas)

coefs = []
powers = None
scores = []
for a in alphas:
    steps = [
    ("poly",poly15),
    ("scale",scaler15),
    ("reg",Ridge(alpha=a))
    ]
    new_ridge = Pipeline( steps )
    new_ridge.fit(X,y)
    coefs.append(new_ridge.named_steps["reg"].coef_)
    score = np.mean(cross_val_score(new_ridge, X, y, scoring="neg_mean_squared_error", cv=10))
    scores.append(score)
    if powers is None: # get these for use in labelling the series, so we can see what Poly features we have
        powers = new_ridge.named_steps["poly"].powers_

coefs = np.array(coefs).T # one row per coefficient, one column per alpha value
# #############################################################################
# Display results

plt.gcf().set_size_inches(10,8)
ax = plt.gca()

ax.set_xscale('log')
# ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
ax.set_ylabel('Coefficient weight')
axsc = ax.twinx()
print(alphas.shape, len(coefs))

for power,coef_vals in zip(powers, coefs):
    # each row of powers_ is a length-1 array (one input feature), so index it before converting to int
    ax.plot(alphas, coef_vals, label="$x^{{{}}}$".format(int(power[0])))
ax.legend()
axsc.plot(alphas, scores, linestyle="--", label="-ve MSE")
axsc.set_ylabel('-ve MSE score (higher=better!)')
axsc.legend()
    
plt.xlabel('alpha')
plt.title('Ridge coefficients as a function of the regularisation param')
plt.axis('tight')
plt.show()


### Validation curve
Validation curves are a useful tool for finding the best value of our regularisation hyperparameter.  Here we search through several values of alpha and plot how they affect performance on the training and cross-validation datasets.
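
As a minimal sketch of reading a best value off a validation curve, the example below uses hypothetical data and names (`X_demo`, `y_demo`, `pipe_demo`, `alpha_grid`). It assumes the ridge step is named `"reg"`, so the hyperparameter is addressed as `reg__alpha`, following the pipeline's `step__parameter` naming convention. Passing the whole pipeline means the polynomial expansion and scaling are refitted on each training fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical noisy sine samples, similar in spirit to the notebook's data.
X_demo = rng.uniform(0, 2 * np.pi, size=(40, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.25, size=40)

pipe_demo = Pipeline([
    ("poly", PolynomialFeatures(15)),
    ("scale", StandardScaler()),
    ("reg", Ridge()),
])

alpha_grid = np.logspace(-6, 3, 10)
train_scores, test_scores = validation_curve(
    pipe_demo, X_demo, y_demo,
    param_name="reg__alpha", param_range=alpha_grid,
    scoring="neg_mean_squared_error", cv=5,
)

# Scores are negative MSE, so the best alpha maximises the mean CV score.
mean_test = test_scores.mean(axis=1)
best_alpha = alpha_grid[np.argmax(mean_test)]
print("best alpha:", best_alpha)
```

The selected value depends on the random data and on the grid, so treat it as a starting point rather than a definitive answer.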

In [None]:
from sklearn.model_selection import validation_curve
def plot_validation_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, param_name="C", param_range = np.logspace(-3, 5, 10)):
    """
    Generate a simple plot of the training and cross-validation scores
    as a function of one hyperparameter (a validation curve).

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the scikit-learn default (5-fold since version 0.22; 3-fold before),
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    param_name : string
        Name of the estimator hyperparameter to vary, e.g. "alpha" for Ridge
        (default: "C").

    param_range : array-like, shape (n_values,)
        Values of the hyperparameter to evaluate; they are plotted on a
        logarithmic x-axis (default: np.logspace(-3, 5, 10)).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    
    train_scores, test_scores = validation_curve(
    estimator, X, y, param_name=param_name, scoring="neg_mean_squared_error", param_range=param_range,
    cv=cv, n_jobs=n_jobs)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
            
    plt.grid()

    plt.xlabel(param_name)
    plt.ylabel("Score")
    lw = 2
    plt.semilogx(param_range, train_scores_mean, label="Training score",
                 color="darkorange", lw=lw)
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2,
                     color="darkorange", lw=lw)
    plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
                 color="navy", lw=lw)
    plt.fill_between(param_range, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.2,
                     color="navy", lw=lw)

    plt.gcf().set_size_inches(10,5)
    plt.legend(loc="best")
    return plt


In [None]:
X15=scaler15.transform(poly15.transform(X)) # make a scaled 15-degree polynomial version of X
# this estimator is cloned for each value of alpha; it gets its own name so that
# the linear `reg` used for the learning curves is not overwritten
ridge = Ridge()

plot_validation_curve(ridge, "Ridge15", X15, y, (-2,2), 
                      cv=10, 
#                       n_jobs=-1, 
                      param_name="alpha", 
                      param_range=np.logspace(-10, 5, 30))

### Learning curves
Learning curves are another useful diagnostic: they show how our estimator performs on the training and cross-validation datasets as the number of training samples increases.  This helps us decide whether we need to get more data, add/remove features, regularise, or change our ML algorithm completely.
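
Before plotting, it can help to look at the raw numbers behind a learning curve. The sketch below uses hypothetical data (`X_demo`, `y_demo`) and a plain linear model. It prints the mean training and cross-validation scores at each training-set size, along with the gap between them. As a rule of thumb, a large gap points towards overfitting, while two low scores that sit close together point towards underfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

rng = np.random.default_rng(0)

# Hypothetical noisy sine samples.
X_demo = rng.uniform(0, 2 * np.pi, size=(60, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.25, size=60)

cv_demo = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X_demo, y_demo, cv=cv_demo,
    scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.2, 1.0, 5),
)

# One row per training-set size, one column per CV split.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for n, tr, te in zip(sizes, train_mean, test_mean):
    print(f"n={n:2d}  train={tr:.3f}  cv={te:.3f}  gap={tr - te:.3f}")
```

Scores here are negative mean squared errors, so values closer to zero are better.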

In [None]:
from sklearn.model_selection import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the scikit-learn default (5-fold since version 0.22; 3-fold before),
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, scoring="neg_mean_squared_error", train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
#     print(test_scores_mean)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", alpha=0.5,
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", alpha=0.5,
             label="Cross-validation score")

    plt.legend(loc="best")
    plt.gcf().set_size_inches(10,5)
    return plt

In [None]:
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0) # works better for small dataset

train_sizes = np.linspace(0.1, 1.0, 20)
print(train_sizes)
plot_learning_curve(reg, "Underfit (linear)", X, y, (-2,0.5), cv=cv, n_jobs=4, train_sizes=train_sizes)
plot_learning_curve(reg3, "Good fit (cubic)", X, y, (-2,0.5), cv=cv, n_jobs=4, train_sizes=train_sizes)
plot_learning_curve(reg15, "Overfit (15)", X, y, (-.2e8,.25e7), cv=cv, n_jobs=4, train_sizes=train_sizes)
plot_learning_curve(ridge15, "Ridge (15)", X, y, (-4,1), cv=cv, n_jobs=4, train_sizes=train_sizes)

### Summary
In this notebook we:
- Created some noisy samples based on a true underlying function
- Trained several regressors of increasing polynomial order
- Checked for under- and over-fit (and good fit)
- Trained a ridge regression model to mitigate overfit
- Iterated over values of alpha to see how these affect regression curve complexity, and to see how coefficients shrink as alpha grows
- Plotted a validation curve for alpha to look for the best value
- Plotted learning curves of our regressors to look for tell-tale signs of good and bad fitting