In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We load the test dataset for this exercise.

In [481]:
# Assumes x and y already exist in the session; skip this cell if data.csv is already on disk.
pd.DataFrame({"x": x, "y": y}).to_csv("data.csv")

In [482]:
tmp = pd.read_csv("data.csv",index_col=0)

x = tmp["x"].values
y = tmp["y"].values

In [None]:
plt.scatter(x,y,s=1.5)

### Let's compare models of different complexity

1. Let's create polynomial features again, up to degree 10, using the [`PolynomialFeatures` transformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) (don't forget to drop the degree-0 feature, i.e. the bias column)

In [484]:
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=10,include_bias=False)

xx = pf.fit_transform(x.reshape(-1,1))

xx.shape

(800, 10)
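
To make the effect of `include_bias=False` concrete, here is a minimal sketch on a tiny made-up input (`x_demo`, `pf_demo` and `xx_demo` are illustrative names, not part of the exercise). Each row should contain the successive powers of the input, with no leading column of ones.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Tiny illustrative input (not the notebook's data): two samples, one feature.
x_demo = np.array([2.0, 3.0])

# degree=3 keeps the output small; include_bias=False drops the degree-0 column.
pf_demo = PolynomialFeatures(degree=3, include_bias=False)
xx_demo = pf_demo.fit_transform(x_demo.reshape(-1, 1))

# Each row holds x, x**2, x**3.
print(xx_demo)
```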

2. Let's set aside some of the data to evaluate generalization performance: split into `xtrain, xtest, ytrain, ytest`

3. We fit a linear regression on the **original** feature only (degree 1) and save its predictions in `y_predict_1`
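
The split in step 2 is not shown in code. A minimal sketch of one way to do it uses `train_test_split`. The synthetic `x` and `y` below stand in for the notebook's data, and `test_size=0.3` and `random_state=42` are arbitrary assumptions rather than the values used originally.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the notebook's data (an assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=800)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=800)
xx = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x.reshape(-1, 1))

# Hold out part of the rows; test_size and random_state are arbitrary choices.
xtrain, xtest, ytrain, ytest = train_test_split(
    xx, y, test_size=0.3, random_state=42
)

print(xtrain.shape, xtest.shape)
```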


In [486]:
from sklearn.linear_model import LinearRegression

lg1 = LinearRegression()

lg1.fit(xtrain[:,0].reshape(-1,1),ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

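4. We fit two more linear regressions, one using the first 3 polynomial features and one using all 10 features. Their predictions are named `y_predict_3` and `y_predict_15` respectively (despite the names, `lg5` is the 3-feature model and `lg15` is the degree-10 model)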

In [488]:

x5 = xtrain[:,:3]

lg5 = LinearRegression()
lg5.fit(x5,ytrain)

lg15 = LinearRegression()
lg15.fit(xtrain,ytrain)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

We now have three trained models with three different predictive capacities. The higher the polynomial degree, the more complex the model.

5. We can now plot the three lines corresponding to the three different models on three different plots

In [489]:
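# Sort whole rows by x so the fitted curve is drawn left to right.
# np.sort(xtrain, axis=0) sorts each column independently, which mismatches rows when x is negative.
x_ordered = xtrain[np.argsort(xtrain[:, 0])]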
y_predict_1 = lg1.predict(x_ordered[:,0].reshape(-1,1))
y_predict_3 = lg5.predict(x_ordered[:,:3])
y_predict_15 = lg15.predict(x_ordered)

In [None]:
plt.figure(figsize=(18,5))

plt.subplot(1, 3, 1)
plt.scatter(x,y,s=1.5)
plt.plot(x_ordered[:,0], y_predict_1, c="r")
plt.title('1')

plt.subplot(1, 3, 2)
plt.scatter(x,y,s=1.5)
plt.plot(x_ordered[:,0], y_predict_3, c="r")
plt.title('3')

plt.subplot(1, 3, 3)
plt.scatter(x,y,s=1.5)
plt.plot(x_ordered[:,0], y_predict_15, c="r")
plt.title('10')


plt.show();

We have two problems:
1. We are sensitive to outliers that are not representative of our data.
2. We are overfitting the training set.

Let's look at the mean squared error (MSE) on the test set:

In [None]:
from sklearn.metrics import mean_squared_error

# mean_squared_error expects (y_true, y_pred)
print(mean_squared_error(ytest, lg1.predict(xtest[:, 0].reshape(-1, 1))))  # 1 feature
print(mean_squared_error(ytest, lg5.predict(xtest[:, :3])))                # 3 features
print(mean_squared_error(ytest, lg15.predict(xtest)))                      # all 10 features
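
Overfitting shows up as a gap between training and test error, so it can help to print both side by side. The sketch below does this on synthetic data (an assumption, since the notebook's dataset is not reproduced here); `mse_by_degree` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the notebook's data (an assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=200)
xx = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x.reshape(-1, 1))
xtrain, xtest, ytrain, ytest = train_test_split(xx, y, test_size=0.3, random_state=0)

# Fit on the first n polynomial features and record (train MSE, test MSE).
mse_by_degree = {}
for n_features in (1, 3, 10):
    model = LinearRegression().fit(xtrain[:, :n_features], ytrain)
    train_mse = mean_squared_error(ytrain, model.predict(xtrain[:, :n_features]))
    test_mse = mean_squared_error(ytest, model.predict(xtest[:, :n_features]))
    mse_by_degree[n_features] = (train_mse, test_mse)
    print(n_features, train_mse, test_mse)
```

The training error can only go down as features are added; whether the test error follows depends on the data.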

### Which regularization term do we choose?

We can train a Ridge (L2) and a Lasso (L1) with different values of alpha, which controls how much influence the penalty term has.
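
For reference, these are the two penalized objectives as scikit-learn documents them, with $X$ the feature matrix, $w$ the weights and $n$ the number of samples (the notation is chosen here for illustration):

$$
\text{Ridge:}\quad \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

In both cases $\alpha = 0$ recovers ordinary least squares, and a larger $\alpha$ pulls the weights harder toward zero.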

❓Train a Ridge regression for each of the alpha values listed below (the same loop works with `Lasso` for the L1 penalty). Save the coefficients into a list, and keep track of the best score and the best ridge model

In [None]:
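from sklearn.linear_model import Ridge

coefs = []
alphas = np.logspace(-10, 3, 100)

best_score = -np.inf  # R^2 can be negative, so 0 is not a safe starting value
best_ridge = None

for alpha in alphas:

    # create a ridge regression with given alpha
    ridge = Ridge(alpha=alpha)

    # fit the regression
    ridge.fit(xtrain, ytrain)

    # append the coefs found by the regression
    coefs.append(ridge.coef_)

    # save the score (assumption: R^2 on the held-out split)
    score = ridge.score(xtest, ytest)

    # see if the score is better than the best score we got so far. If so, save this regression as the best one
    if score > best_score:
        best_score = score
        best_ridge = ridge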


❓We can now compare our best ridge to the unregularized linear regression on all 10 features

In [None]:
plt.figure(figsize=(18,5))

plt.subplot(1, 2, 1)
plt.scatter(x,y,s=1.5)
plt.plot(x_ordered[:,0], best_ridge.predict(x_ordered), c="r")
plt.title('ridge');

plt.subplot(1, 2, 2)
plt.scatter(x,y,s=1.5)
plt.plot(x_ordered[:,0], y_predict_15, c="r")
plt.title('linear regression, degree-10 polynomial')


plt.show();

❓We want to see the impact of the penalty term. For every alpha, we can plot the coefficients (weights) of the associated ridge regression.

In [2]:
# plot every coef as a function of alpha (very straightforward)
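
One possible way to draw this plot is sketched below. It is self-contained, so it refits a ridge for each alpha on synthetic data (an assumption); in the notebook the `coefs` list collected in the loop could be passed to `plt.plot` directly.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the notebook's data (an assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=300)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=300)
xx = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x.reshape(-1, 1))

# One row of coefficients per alpha: shape (n_alphas, n_features).
alphas = np.logspace(-10, 3, 100)
coefs = np.array([Ridge(alpha=a).fit(xx, y).coef_ for a in alphas])

plt.figure(figsize=(8, 5))
plt.plot(alphas, coefs)  # draws one line per coefficient (column)
plt.xscale("log")        # alphas span 13 orders of magnitude
plt.xlabel("alpha")
plt.ylabel("coefficient value")
plt.title("Ridge coefficients as a function of alpha")
plt.show()
```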

As we can see, the Ridge (L2) penalty shrinks the weights smoothly toward zero, whereas the Lasso (L1) penalty goes further and sets some of them exactly to zero.
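
To check which penalty actually zeroes out weights, here is a minimal sketch that counts exactly-zero coefficients for both models. The data is synthetic and `alpha = 0.1` is an arbitrary choice, both assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the notebook's data (an assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=300)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=300)
xx = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x.reshape(-1, 1))

alpha = 0.1  # arbitrary penalty strength, used for both models
ridge = Ridge(alpha=alpha).fit(xx, y)
lasso = Lasso(alpha=alpha, max_iter=100_000).fit(xx, y)

# Count the coefficients that are exactly zero in each model.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("ridge zeros:", n_zero_ridge, "lasso zeros:", n_zero_lasso)
```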

Hypothetically, we could go further in the modeling work and try other 