# Lab 1: Linear Regression

## G3 SDI - Machine Learning

In this lab, we implement linear regression and ridge regression on a medical dataset. The data come from a medical study (Stamey et al., 1989) whose goal was to predict the log of the prostate-specific antigen level (`lpsa`) from clinical measurements. These clinical exams are carried out before a possible prostatectomy.

The measurements are log cancer volume `lcavol`, log prostate weight `lweight`, age of the patient `age`, log of benign prostatic hyperplasia amount `lbph`, seminal vesicle invasion `svi`, log of capsular penetration `lcp`, Gleason score `gleason`, and percent of Gleason scores 4 or 5 `pgg45`. The variable `svi` is binary, `gleason` is ordinal, and the others are quantitative.

### Instructions
* Rename your notebook with your surnames as `lab1_Name1_Name2.ipynb`, and include your names in the notebook.
* Your code and its output must be commented!
* Please upload your notebook to Moodle in the dedicated section before the deadline.

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Report written by [name1], [name2], date.
</div>

In [None]:
# Import usual libraries
import numpy as np
from matplotlib import pyplot as plt

### Part 1 - Linear regression

In this first part, we focus on using linear regression.

**Q1.** Load the data from the `.npy` files included in the archive (use `np.load`). How many examples are there? How many features?

In [None]:
##########
## YOUR CODE HERE
##########

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Your answer here
</div>

**Q2.** Check whether there are any missing entries in the dataset (in both `X` and `y`). Use `np.isnan`.
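
As a hint, here is a minimal sketch of how `np.isnan` behaves, assuming a small toy array (`X_toy` is a hypothetical name and its values are made up; it is not the lab data). The same pattern should carry over to the arrays loaded in Q1.

```python
import numpy as np

# Toy array standing in for the lab data (hypothetical values)
X_toy = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [5.0, 6.0]])

# np.isnan returns a boolean mask with the same shape as its input
nan_mask = np.isnan(X_toy)

# Total number of missing entries
n_missing = int(nan_mask.sum())

# Indices of the rows that contain at least one missing entry
rows_with_nan = np.where(nan_mask.any(axis=1))[0]

print(n_missing, rows_with_nan)
```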

In [None]:
##########
## YOUR CODE HERE
##########

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Your answer here
</div>

**Q3.** Divide the dataset into a training set (80%) and a test set (20%), using `train_test_split` with `random_state = 0` (documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).
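
The call pattern can be sketched on toy arrays as follows. The names `X_toy` and `y_toy` and their contents are assumptions made for illustration only; the 80/20 split and `random_state=0` follow the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 examples, 2 features (hypothetical values)
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

# test_size=0.2 keeps 20% of the examples for testing;
# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
```

Rows of `X` and entries of `y` are shuffled together, so each example keeps its own target after the split.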

In [None]:
from sklearn.model_selection import train_test_split

##########
## YOUR CODE HERE
##########

**Q4.** Standardize the training set, and apply the same operation to the test set. Use `StandardScaler` (documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)). Recall what standardization means.
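
The following sketch, on synthetic data, illustrates the intended usage: the scaler's statistics are estimated on the training set only and then reused on the test set. The arrays `X_tr` and `X_te` are randomly generated stand-ins, not the lab data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test sets
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=3.0, size=(50, 3))
X_te = rng.normal(loc=5.0, scale=3.0, size=(10, 3))

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)  # estimate mean/std on the training set
X_te_std = scaler.transform(X_te)      # reuse the training mean/std

# Same operation written by hand: subtract the mean, divide by the std
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(np.allclose(X_tr_std, manual))
```

Note that the standardized test set generally does not have exactly zero mean and unit variance, since it is transformed with the training statistics.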

In [None]:
from sklearn.preprocessing import StandardScaler

##########
## YOUR CODE HERE
##########

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Your answer here
</div>

**Q5.** Compute the auto-covariance matrix from the training set, and display it (you might want to use `plt.imshow`). What can we learn from this?
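
One possible way to compute such a matrix is sketched here on synthetic standardized data (`Z` is a hypothetical array in which two features are deliberately made correlated). It assumes the normalization by the number of examples $n$; the lecture may use a different convention.

```python
import numpy as np

# Synthetic data: features 0 and 1 are strongly correlated by construction
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 4))
Z[:, 1] = Z[:, 0] + 0.1 * rng.normal(size=100)

# Standardize each column (zero mean, unit variance)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# Covariance of the columns, normalized by n (assumed convention)
n = Z.shape[0]
C = Z.T @ Z / n

# Same matrix with NumPy: rowvar=False treats columns as variables,
# bias=True normalizes by n instead of n - 1
C_np = np.cov(Z, rowvar=False, bias=True)

print(np.round(C, 2))
```

For standardized data, the diagonal equals 1 and the off-diagonal entries are the pairwise correlations between features, so the matrix can be passed directly to `plt.imshow`.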

In [None]:
##########
## YOUR CODE HERE
##########

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Your answer here
</div>

**Q6.** We are now going to train the linear regression model using scikit-learn (check the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)). Use the `.fit` method on the training set. Retrieve the coefficients obtained by scikit-learn using the attributes `.intercept_` and `.coef_`, and check that they correspond to the closed-form solution from the lecture (you might want to use `np.hstack` to concatenate `X` with a column of ones).
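
The comparison can be sketched on synthetic data as follows, assuming the closed form from the lecture is the usual normal-equations solution $\hat{\theta} = (X^\top X)^{-1} X^\top y$ with a column of ones appended to $X$. The data and the names `X_toy`, `y_toy`, `X_aug`, and `theta` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (hypothetical coefficients and noise level)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = 1.5 + X_toy @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=60)

# scikit-learn fit
model = LinearRegression().fit(X_toy, y_toy)

# Closed form: prepend a column of ones so that theta[0] is the intercept
X_aug = np.hstack([np.ones((X_toy.shape[0], 1)), X_toy])

# Solve (X^T X) theta = X^T y rather than inverting X^T X explicitly
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_toy)

print(model.intercept_, model.coef_)
print(theta)
```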

In [None]:
from sklearn.linear_model import LinearRegression

##########
## YOUR CODE HERE
##########

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Your answer here
</div>

**Q7.** Obtain the model predictions on the test set using the `.predict` method. Then compute the MSE and the MAE (you may want to use the functions below).
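
As a reminder of what the two metrics compute, here is a small sketch on made-up values (`y_true` and `y_pred` are hypothetical): the MSE averages the squared errors, and the MAE averages the absolute errors.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

mse = mean_squared_error(y_true, y_pred)   # mean of (y_true - y_pred)**2
mae = mean_absolute_error(y_true, y_pred)  # mean of |y_true - y_pred|

print(mse, mae)  # 0.5625 0.625
```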

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

##########
## YOUR CODE HERE
##########

### Part 2 - Ridge regression

In this second part, we now turn to ridge regression.

**Q1.** Fit the ridge regression model (documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)) with $\lambda = 1$ (the `alpha` parameter in scikit-learn), using the `.fit` method on the training set. Again, retrieve the coefficients, and check that they match the closed-form solution from the lecture. How do they differ from the ones obtained with linear regression?
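
A possible check is sketched here on synthetic data. It assumes the objective minimized by scikit-learn's `Ridge`, namely $\|y - Xw - b\|_2^2 + \lambda \|w\|_2^2$ with an unpenalized intercept $b$; the lecture's closed form may use a different scaling of $\lambda$. All variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (hypothetical coefficients and noise level)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = 1.5 + X_toy @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=60)

lam = 1.0
ridge = Ridge(alpha=lam).fit(X_toy, y_toy)

# Centering X and y removes the intercept from the problem,
# so only the weights w are penalized
x_mean, y_mean = X_toy.mean(axis=0), y_toy.mean()
Xc, yc = X_toy - x_mean, y_toy - y_mean

# Closed form: w = (Xc^T Xc + lam * I)^{-1} Xc^T yc
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(Xc.shape[1]), Xc.T @ yc)
b = y_mean - x_mean @ w

# Least-squares weights (lam = 0) for comparison
w_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

print(w, b)
print(np.linalg.norm(w), np.linalg.norm(w_ols))
```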

In [None]:
from sklearn.linear_model import Ridge

##########
## YOUR CODE HERE
##########

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Your answer here
</div>

**Q2.** Obtain the model predictions on the test set using the `.predict` method, then compute the MSE and the MAE. Do we get better or worse predictions than before? Comment.

In [None]:
##########
## YOUR CODE HERE
##########

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Your answer here
</div>

**Q3.** We are now going to assess the impact of the regularization coefficient $\lambda$.

To do so, vary $\lambda$ from $10^{-3}$ to $10^3$ (use `np.logspace`), and for each value of $\lambda$, retrain the ridge regression model and keep the values of the coefficients (ignoring the intercept).

Display the evolution of the coefficients w.r.t. $\lambda$ (use a logarithmic scale for the x-axis). Comment.
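
The sweep over $\lambda$ could look like the following sketch, written on synthetic data with a hypothetical grid of 50 values. It stores one row of coefficients per value of $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (hypothetical coefficients and noise level)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=60)

# 50 values evenly spaced on a log scale between 1e-3 and 1e3
lambdas = np.logspace(-3, 3, 50)

# One row per lambda, one column per feature (intercept ignored)
coefs = np.array(
    [Ridge(alpha=lam).fit(X_toy, y_toy).coef_ for lam in lambdas]
)

# Norm of the coefficient vector for each lambda
norms = np.linalg.norm(coefs, axis=1)

print(coefs.shape, norms[0], norms[-1])
```

With matplotlib, `plt.semilogx(lambdas, coefs)` then draws one curve per feature on a logarithmic x-axis.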

In [None]:
##########
## YOUR CODE HERE
##########

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Your answer here
</div>

**Q4.** The remaining question is how to choose the optimal $\lambda$. We are going to select it with 5-fold cross-validation.

Display the evolution of the cross-validated MSE w.r.t. $\lambda$ (again, use a logarithmic scale for the x-axis), and mark the best $\lambda$ with `plt.axvline`.

Now retrain the ridge regression model with the selected $\lambda$, and assess its performance in terms of MSE and MAE. Comment.

In [None]:
from sklearn.model_selection import KFold

# Set up cross-validation
kf = KFold(n_splits=5)

for train_index, val_index in kf.split(X_train):
    X_train_new, X_val = X_train[train_index], X_train[val_index]
    y_train_new, y_val = y_train[train_index], y_train[val_index]

    ##########
    ## YOUR CODE HERE
    ##########
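
One possible organization of the selection procedure is sketched here on synthetic data: an outer loop over the values of $\lambda$, and an inner loop over the folds that averages the validation MSE. The data, the grid of 30 values, and the names `cv_mse` and `best_lambda` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression problem with two irrelevant features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)

lambdas = np.logspace(-3, 3, 30)
kf = KFold(n_splits=5)
cv_mse = np.zeros(len(lambdas))

for i, lam in enumerate(lambdas):
    fold_mse = []
    for train_index, val_index in kf.split(X_toy):
        # Fit on four folds, evaluate on the held-out fold
        model = Ridge(alpha=lam).fit(X_toy[train_index], y_toy[train_index])
        y_val_pred = model.predict(X_toy[val_index])
        fold_mse.append(mean_squared_error(y_toy[val_index], y_val_pred))
    # Cross-validated MSE: average over the five folds
    cv_mse[i] = np.mean(fold_mse)

# Lambda with the smallest cross-validated MSE
best_lambda = lambdas[np.argmin(cv_mse)]
print(best_lambda)
```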

### Part 3 (Bonus) - LASSO

Display the same kind of plots as in Part 2, but using LASSO regression instead of ridge regression (see [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)). In particular, comment on the following points:
* Do the regression coefficients evolve in the same way as with ridge regression? What kind of solutions do we obtain?
* Do we get the same optimal $\lambda$?
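
As a starting point, the sweep from Part 2 can be adapted as in the sketch below, written on synthetic data. The grid from $10^{-3}$ to $10^{1}$ is an assumption chosen for this toy problem, not a prescribed range. Note also that scikit-learn's `Lasso` minimizes $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha \|w\|_1$, so its `alpha` is not on the same scale as the one used by `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression problem with two irrelevant features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

# Hypothetical grid of regularization strengths
alphas = np.logspace(-3, 1, 30)

# One row per alpha, one column per feature
coefs = np.array(
    [Lasso(alpha=a).fit(X_toy, y_toy).coef_ for a in alphas]
)

# Number of coefficients that are exactly zero for each alpha
n_zero = (coefs == 0).sum(axis=1)

print(n_zero)
```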

In [None]:
##########
## YOUR CODE HERE
##########

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
Your answer here
</div>