# M2 DataScience (Part II) - Lab session

It's time to practice all (or at least some of) the notions we covered during the second part of this course.

This notebook is split into three (independent) parts: 

- I. Optimization (not strictly related to machine learning). 
- II. Classification and Deep Learning on image datasets. 
- III. Regression and regularization.

**Note:** The lab is a bit long. Do not worry if you don't finish everything in 4 hours; you can choose which part to work on first depending on your interests. 

Before starting, you will need the following packages. Most of them are not part of the standard Python distribution and must be installed first. You can check that the installation went correctly by running the import cell below. It should not raise any error (warnings, such as "you don't have a GPU", are OK). 

**Note:** After installing a package, you (unfortunately) need to restart the notebook kernel (the loop-arrow on the top horizontal banner). 

Packages and versions used (note: a similar version will most likely work the same, so do not worry if you have numpy `1.22.2` for instance; the versions below are simply the ones used when this notebook was designed):
- `numpy` version `1.22.4`
- `scikit-learn` version `1.1.1` 
- `matplotlib` version `3.5.2`
- `jax` version `0.4.1` 
- `tensorflow` version `2.8.2` 
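
If you want to compare your installed versions with the list above without importing the (sometimes heavy) libraries, the following sketch may help. It relies only on the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_versions` is ours and not part of the lab material.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version (None if absent)."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as used by pip (note: "scikit-learn", not "sklearn").
lab_packages = ["numpy", "scikit-learn", "matplotlib", "jax", "tensorflow"]
for name, ver in installed_versions(lab_packages).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```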

**Note:** In case you struggle to install the libraries, you can use an online environment such as [Google Colab](https://colab.research.google.com/) (requires a Google account, as far as I remember) or [Binder](https://mybinder.org/). 

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Run this cell to see if you're ready to get started. 

# numpy : the 101 library for scientific Python
import numpy as np

# Scikit-learn : the most well-known library for basic Machine Learning in Python
import sklearn.linear_model as skl
from sklearn.preprocessing import PolynomialFeatures

# Matplotlib : to plot mathematical content in python.
import matplotlib.pyplot as plt
# A complementary line to use LaTeX with matplotlib. You can try removing it if you have compilation issues. 
plt.rcParams.update({
    "text.usetex": True,
    "font.family": "Helvetica"
})

# Jax : a Google developed library that provides auto-diff. 
# One cool aspect: jax.numpy is a numpy-like API (i.e. you can switch between np.blabla and jnp.blabla faithfully).
import jax
import jax.numpy as jnp

# Tensorflow : a Google developed library for neural networks.
import tensorflow as tf

# A complementary file that provides utility functions for the lab.
import utils

_Cell needed to define $\LaTeX$ macros._

$\newcommand{\R}{\mathbb{R}}$

# I. Optimization and automatic differentiation with JAX.

We recall (briefly) the basics of gradient descent: to (locally) minimize a function $F : \R^d \to \R$, we start from some (typically random) $x_0$, and then define a sequence 

$$x_{t+1} = x_t - \eta_t \nabla F(x_t)$$

where $\eta_t$ is a pre-determined parameter (_learning rate_) that may depend on $t$ (or not). Under reasonable assumptions, $x_t$ should converge toward a local minimizer of $F$ (global if $F$ is convex). 
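
As a small worked illustration (our own toy example, not part of the lab questions), take $d = 1$, $F(x) = x^2$ and a constant learning rate $\eta_t = \eta$. Since $\nabla F(x) = 2x$, the update reads

$$x_{t+1} = x_t - 2 \eta x_t = (1 - 2\eta)\, x_t, \qquad \text{hence} \qquad x_t = (1 - 2\eta)^t\, x_0.$$

With $\eta = 0.1$ and $x_0 = 1$, this gives $x_1 = 0.8$, $x_2 = 0.64$, and more generally $x_t = 0.8^t \to 0$, the unique minimizer of $F$. In this example the sequence converges exactly when $|1 - 2\eta| < 1$, that is $0 < \eta < 1$: the choice of the learning rate matters even for a very simple convex function.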

## 1. Computing gradients with JAX. 

Note in particular that we need to compute $\nabla F$. For simple functions you can do that "by hand", but things quickly become messy. 
Thankfully, we can rely on _automatic differentiation_, provided (in this lab) by the library `jax`, developed by Google. 
You can find the complete documentation [here](https://github.com/google/jax). 
Note that you can do similar things using `PyTorch`. 

In a nutshell, it works in the following way:
- Define your variables (at least the ones you want to optimize) as `jax.numpy.array`.
- Define functions `f` using such variables. 
- Get the gradient of a function by simply typing `jax.grad(f)`.

See below for an example. 

In [None]:
def f(x):
    r'''
    Function that takes x \in \R^d and returns its squared Euclidean norm. 
    '''
    return jnp.linalg.norm(x)**2

And now we can just _define_ its gradient with jax:

In [None]:
Df = jax.grad(f)

The object `Df` is a function: it takes $x$ and returns $\nabla f(x)$. 

Note however that `jax` always returns arrays, even if they contain a single number. Note also that it strictly requires floating-point inputs when computing gradients. That is, write `x = 1.` and not `x = 1`. 
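
The point about floating-point inputs can be checked with a minimal sketch such as the one below. The function `square` and the variable names are ours, chosen for illustration; the exact wording of the error message may vary between `jax` versions.

```python
import jax


def square(x):
    """Toy scalar function used only for this illustration."""
    return x ** 2


d_square = jax.grad(square)

# A float input works; the result is a 0-dimensional jax array, not a Python float.
value = d_square(3.0)
print(type(value), value)

# An integer input is rejected when computing the gradient.
try:
    d_square(3)
    int_input_rejected = False
except TypeError as err:
    int_input_rejected = True
    print("Integer input rejected:", err)
```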

In [None]:
x = jnp.array([3., 4.])
# Compute the value of f at x
f(x)  # Should be 25. 

In [None]:
# Now compute its gradient at x. 
Df(x)

`JAX` provides further control. For instance, you may consider functions of several arguments, some being your true variables of interest and the others being fixed parameters. As an example, let us consider the following function: 

$$F : \R^d \ni x \mapsto a \sin(\|x\|^2) + b e^{\|x\|^2}.$$

Here, $a$ and $b$ should be considered as fixed parameters, and $x$ is the variable. The natural implementation is:

In [None]:
def f(x, a, b):
    nx = jnp.sum(x**2)  # squared Euclidean norm of x, valid in any dimension d
    return a * jnp.sin(nx) + b * jnp.exp(nx)

Computing the gradient of this function with respect to $x$ by hand is already a bit painful. Let `JAX` do the work. 

_Note:_ By default, `JAX` computes the gradient with respect to the first variable of `f` (here `x`). If, for some reason, we want the gradient for one or several other parameters, we can provide the keyword `argnums` when calling `jax.grad()`, telling jax the variables with respect to which we want to compute the gradient. 

In [None]:
Df = jax.grad(f)  # the gradient of x \mapsto f(x, a, b)
Df_v2 = jax.grad(f, argnums = (1,2))  # the gradient of (a,b) \mapsto f(x, a, b)

In [None]:
x = jnp.array([1., 1.])
a,b = 2., 3.

print("Grad with respect to x:", Df(x, a=a, b=b))
print("Grad with respect to (a,b):", Df_v2(x, a, b))

_Remark:_ Be careful about shapes and the like. The gradient with respect to `x` is a single array of shape `(2,)` (we differentiate with respect to one variable of dimension 2), while the gradient with respect to `(a, b)` is a tuple of two scalar arrays (two variables of dimension 1). We could harmonize things by replacing the two variables `a, b` by a single one of dimension 2. 
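
To make this remark about shapes concrete, here is a minimal sketch comparing the two conventions. The function `f_packed` and the variable `params` are hypothetical names introduced for this illustration: they pack `a` and `b` into a single array of dimension 2.

```python
import jax
import jax.numpy as jnp


def f(x, a, b):
    nx = jnp.sum(x ** 2)
    return a * jnp.sin(nx) + b * jnp.exp(nx)


def f_packed(x, params):
    """Same function, with (a, b) packed into one array `params` of shape (2,)."""
    nx = jnp.sum(x ** 2)
    return params[0] * jnp.sin(nx) + params[1] * jnp.exp(nx)


x = jnp.array([1., 1.])

# Two separate scalar arguments: jax.grad returns a tuple of two scalar arrays.
grads_ab = jax.grad(f, argnums=(1, 2))(x, 2., 3.)
print(type(grads_ab), [g.shape for g in grads_ab])

# One packed argument: jax.grad returns a single array of shape (2,).
grad_wrt_params = jax.grad(f_packed, argnums=1)(x, jnp.array([2., 3.]))
print(grad_wrt_params.shape)
```

Both calls compute the same two partial derivatives; only the container differs, which matters as soon as you want to write an update such as `params - lr * grad`.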

---

**Question 1:** 

$\bullet$ Implement the function (using `jnp` to get access to automatic differentiation) 

$$ F( \cdot ; \theta) : \R^d \ni x \mapsto \|x\|^2 + 3 \cdot \sin(\theta \cdot x),$$

where $\theta \in \R^d$. 

$\bullet$ Define two functions: `DF_x` and `DF_theta`, that respectively compute the gradient of $F$ with respect to $x$ and to $\theta$. 

_Indication:_ The scalar product is obtained with `jnp.dot`.

In [None]:
def F(x, theta):
    # Complete the following
    ...
    
DF_x = ...
DF_theta = ...

---

The function $x \mapsto \nabla F(x)$, which here goes from $\R^2$ to $\R^2$, is called a _gradient field_. Just for fun, let us visualize one.
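
To give an idea of what "evaluating a gradient field" means in code, here is a minimal sketch on a different toy function (the squared norm, whose gradient is $2x$). It uses `jax.vmap` to apply the gradient to a batch of grid points; the names `sq_norm`, `points` and `field` are ours, and this is not necessarily how `utils.plot_gradient_field` proceeds.

```python
import jax
import jax.numpy as jnp


def sq_norm(x):
    """Toy function x -> ||x||^2, whose gradient is 2x."""
    return jnp.sum(x ** 2)


# Build a small 3 x 3 grid of points in R^2, stored as an array of shape (9, 2).
grid_1d = jnp.linspace(-1., 1., 3)
xx, yy = jnp.meshgrid(grid_1d, grid_1d)
points = jnp.stack([xx.ravel(), yy.ravel()], axis=1)

# jax.vmap applies the gradient function to each row of `points`.
field = jax.vmap(jax.grad(sq_norm))(points)
print(field.shape)  # one gradient vector per grid point
```

Each row of `field` is the arrow you would draw at the corresponding row of `points`.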

**Question 2:** Run the following code (which uses the function `plot_gradient_field` provided by the file `utils.py`), with $\theta = (1, 1)$. 

From the plot, can you say if the function $F$ is convex (visually)?

In [None]:
theta = jnp.array([1., 1.])
utils.plot_gradient_field(DF_x, theta, num=15)

-- Write your comments here --

## 2. (Stochastic) Gradient descent

It's time to implement the gradient descent (and its stochastic version later). 

---

**Question 3:** Complete the following code to implement a (standard) gradient descent over a function `F`, starting from `x0`, with learning rate `lr`, and run for a fixed number of steps `n_step`. 

For display purposes, the function should return three things:
- the list of positions $(x_t)_t$ visited during the descent,
- the list of losses $(F(x_t))_t$,
- the list of gradients $(\nabla F(x_t))_t$. 

In [None]:
def gradient_descent(F, x0, lr, n_step):
    """
    Run the gradient descent algorithm
    
    :param F: a real-valued function, should be compatible with jax.grad
    :param x0: starting point for the GD, should be legal input for F
    :param lr: the learning_rate parameter (float)
    :param n_step: number of steps in the iterative loop
    
    :returns: lists of positions [x_i], losses [F(x_i)] and gradients [grad(F)(x_i)] encountered during the descent. 
    """
    x_current = x0
    grad_F = jax.grad(F)
    
    all_x = [x0]
    all_losses = [F(x0)]
    all_grad = [grad_F(x0)]
    
    for t in range(n_step):
        # TO COMPLETE
        ...
        
    return all_x, all_losses, all_grad

You can test your function on the following example, which tries to minimize the simple function from $\R$ to $\R$ defined by $F(x) = \frac{x^2}{2} + \sin(x)^4$, and plots the result using the function `plot_gd_1d` from `utils.py`. 

In [None]:
def F(x):
    return x**2 / 2 + jnp.sin(x)**4

all_x, all_losses, all_grad = gradient_descent(F, x0=3., lr=0.1, n_step=100)
utils.plot_gd_1d(F, all_x, all_losses, all_grad, lr=0.1, xs=np.linspace(-4, 4))

---

**Question 4:** Run a gradient descent and plot it for the function $x \mapsto e^{-\frac{1}{x^2}}$, starting from $x_0 = 1$, with $\eta = 0.1$ and $T = 100$ steps. What do you observe? What's the reason behind this phenomenon?

Same question for the map $x \mapsto |x|$. 

In [None]:
# WRITE YOUR CODE HERE

-- Write your comment here --

---

**Question 5:** Here we run the gradient descent for $T = 100$ steps before stopping. Maybe this is too few (we stop before convergence), or maybe this is too many (we waste computational time). 

What would you suggest to improve on this arbitrary stopping criterion? Under which assumptions should it work?

_Bonus:_ Implement and test your proposition.

-- Write your comments here --

---

$\bullet$ Let us now consider a more ML-oriented task, and with it the stochastic version of the gradient descent. 

We have data given by observations and labels $(x,y) \in \R \times \R$, and we suspect that the relation between $y$ and $x$ is of the form 

$$y = x \cdot \exp\left({-\frac{(x-a)^2}{2}}\right) + \exp\left({-\frac{(x-b)^2}{2}}\right) + \epsilon,$$

where $\epsilon \sim \mathcal{N}(0,\sigma)$ is some additional noise that we model using a Gaussian distribution. We want to estimate $a,b$ from our sample $(x_i, y_i)_{i=1}^n$. For compactness, we will set $\theta = (a,b) \in \R^2$. 

We consider a _regression_ task with the mean squared error. 
Thus, the training step aims at minimizing the empirical risk, a.k.a. loss function

$$L : \theta \mapsto \frac{1}{n} \sum_{i=1}^n \| y_i - F(x_i ; \theta)\|^2,$$

with $F(x ; \theta) = x \cdot \exp\left({-\frac{(x-a)^2}{2}}\right) + \exp\left({-\frac{(x-b)^2}{2}}\right)$. 

In [None]:
# Step 1: generate and plot the data

x_train, y_train = utils.data_generation()

fig, ax = plt.subplots()
ax.scatter(x_train, y_train)
ax.grid()

---

**Question 6:** Define a function `F(x,theta)` that computes $F(x ; \theta)$ with $\theta = (a,b)$.

_Note:_ Make it compatible with JAX syntax, or you will have to compute the gradient by yourself! ;-)

In [None]:
def F(x,theta):
    # TO COMPLETE
    ...

---

We recall that the stochastic gradient descent is about producing a sequence of the form 

$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \left(\| y_i -  F(x_i ; \theta_t)\|^2\right)$$

for a _single_ point $(x_i, y_i)$, where $i \sim \mathrm{Unif}(\{1,\dots,n\})$. 

**Question 7:** Implement the stochastic gradient descent algorithm to minimize $L$. It must return the list of $(\theta_t)_t$ recorded during the descent. 

Comment on the following points: 
- Does your SGD converge? How many steps do you run?
- What kind of stopping criterion did/should you use?
- Does the output quality depend on the initialization? For instance, try starting from $\theta_0 = (-1, 1)$, and then $\theta_0 = (1, -1)$. What can you conclude about the objective function $L$?

_Indication:_ to sample a uniform index $i$ between $0$ and $n-1$,  you can use `np.random.randint(n)`. 

_Note:_ This question is a bit open, think about everything you need to implement the SGD, including bonus options to optimize your algorithm. 

In [None]:
def sgd(F, x_train, y_train, theta_0, lr, n_step):
    """
    Run the *stochastic* gradient descent algorithm
    
    :param F: a real-valued function, should be compatible with jax.grad. It takes two arguments (x,theta), and we consider its derivative w.r.t. theta. 
    :param x_train: list (or other iterable) of training observations. 
    :param y_train: list (or other iterable) of training labels. 
    :param theta_0: starting point for the SGD, should be legal input for F
    :param lr: the learning_rate parameter (float)
    :param n_step: number of steps in the iterative loop
    
    :returns: lists of positions [theta_i] encountered during the descent. 
    """
    # WRITE YOUR CODE HERE...
    ...
    return thetas

You can run the following cell to check the behavior of your code (it takes `thetas`, the output of your `sgd`).  

In [None]:
thetas = sgd(F, x_train, y_train, theta_0 = jnp.array([-1, 1.]), lr=0.1, n_step=1000)

In [None]:
utils.plot_sgd(F, x_train, y_train, thetas)

---

# II. Classification and Deep Learning in tensorflow

This part is dedicated to a "real" ML task: classification of images. We are given (reasonably small, but not that small) sets of images that belong to (10) different classes. 

Our goal is to maximize the _test accuracy_ (i.e. proportion of correct predictions by the model on unseen data). As explained in the lectures, this is done by minimizing the _cross entropy_. 

Our models will be neural networks, implemented in `tensorflow` (those motivated can try to do the same in `PyTorch` and check the differences). 
One good aspect of `tensorflow` for Deep Learning is that it provides a simple and efficient API (built on `keras`) to design neural networks.

## 1. The MNIST dataset

This is probably the most celebrated image dataset for machine learning. It's a toy dataset made of greyscale images of size $28 \times 28$ (that is, an image is an `array` of shape `(28, 28)` filled with integers between `0` and `255`) representing handwritten digits between $0$ and $9$; the digit is the label of the corresponding image. 

In [None]:
# Load the dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize so that values lie between 0 and 1. Not necessary, but common practice to avoid some numerical effects.
x_train, x_test = x_train / 255.0, x_test / 255.0

Let's first investigate the dataset and do some visualization to fix the ideas.

In [None]:
print("Shape of the training set:", x_train.shape)
print("Small snapshot of training labels:", y_train[:10])
print("Shape of the test set:", x_test.shape)

In [None]:
index = np.random.randint(x_train.shape[0])  # sample a random index in the dataset
im = x_train[index]  # select the corresponding observation
label = y_train[index]  # select the corresponding label

fig, ax = plt.subplots()
ax.imshow(im, cmap='Greys')  # plot the image
ax.set_xticks([])
ax.set_yticks([])
ax.set_title("Image representing a %s" %label, fontsize=18)

Let us design our first neural network and train it to (hopefully) solve this learning task. Here is the general pipeline:

- Define the model using `tensorflow.keras.models.Sequential`. 
- Compile the model, defining (i) the _optimization_ procedure used to optimize its parameters, (ii) the loss we want to minimize, (iii) the metric we are interested in.
- (Optional) Display a summary of your model, just to check its number of parameters. 
- Train the model on the training set. 
- Test the actual performance of the model on the test set. 

To help you get started quickly, we provide a "minimal working example" below. 

**Note:** `loss` and `metric` look similar, but basically the `loss` is what you _minimize_ (e.g. the cross entropy), while the `metric` is what you are interested in "at the end of the day" as a human being, such as the accuracy (which is far more interpretable than saying "we reach a cross entropy of 0.134"). 
In some cases (typically, for regression tasks) they can be the same (the MSE in both cases), but for classification tasks they usually differ.
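
To illustrate the difference between the two quantities, here is a minimal NumPy sketch computing both from the same predictions. The predicted probabilities and labels are made up for the illustration (3 samples, 3 classes); they are not outputs of any model in this lab.

```python
import numpy as np

# Made-up predicted class probabilities (one row per sample) and true labels.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.4, 0.5, 0.1]])
labels = np.array([0, 2, 0])

# Loss: average negative log-probability assigned to the true class.
true_class_probs = probs[np.arange(len(labels)), labels]
cross_entropy = -np.mean(np.log(true_class_probs))

# Metric: proportion of samples whose most likely class is the true one.
accuracy = np.mean(np.argmax(probs, axis=1) == labels)

print("cross entropy:", cross_entropy)
print("accuracy:", accuracy)
```

The cross entropy changes smoothly with the predicted probabilities, which is why it can be minimized with gradient-based methods, whereas the accuracy only changes when a prediction flips.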

In [None]:
# Instantiate a tensorflow neural network via keras

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10)
])

---

**Question 1:** Quickly describe this block of code. Intuitively, (i) what does `Sequential` mean? (ii) why is the final layer of dimension `10`? (You can check the documentation of `Flatten` and `Dense` layers [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten) and [there](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense).)

-- Write your comments here --

---

In [None]:
# Now we compile the model and print a summary of it. 

model.compile(optimizer='SGD',  # the optimization procedure (here, Stochastic Gradient Descent)
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # The loss we minimize.
              metrics=['accuracy'])  # the metric we are eventually interested in.

model.summary()

---

**Question 2:** 
- What does the parameter `from_logits=True` mean in `SparseCategoricalCrossentropy`? We discussed it during the lectures, but if you don't remember you can check the documentation [here](https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy). 
- What is `activation='relu'`? Are there other possible choices? What happens if we don't specify the activation (as we do for the last layer)? You can check the documentation [here](https://www.tensorflow.org/api_docs/python/tf/keras/activations). 

-- Write your comments here --

---

In [None]:
# Now we train the model
model.fit(x_train, y_train, epochs=5)

**Question 3:** 
- What is `epochs` here? 
- What does the counter `../1875` represent for each epoch?
- Do you think that our model converged?
- What's the final training accuracy of the model? 

-- Write your comments here --

---

In [None]:
# Finally, we test our model. 
model.evaluate(x_test,  y_test, verbose=2)

---

**Question 4:** What's the test accuracy of your model? Does it overfit? 

-- Write your comments here --

---

We now go a bit deeper. 

In [None]:
i = np.random.randint(x_test.shape[0])

logit = model.predict(np.array([x_test[i]]))

sm = tf.keras.layers.Softmax()(logit)

pred = np.argmax(sm, axis=1)[0]

fig, axs = plt.subplots(1, 2, figsize=(15, 6))
ax = axs[0]
ax.imshow(x_test[i], cmap='Greys')
ax.set_title("True Label = %s, prediction %s" %(y_test[i],pred))
ax.set_xticks([])
ax.set_yticks([])

ax = axs[1]
ax.bar(np.arange(10), sm[0])
ax.set_xticks(np.arange(10))
ax.set_ylim(0,1)

**Question 5:** Interpret the above code (and output).

-- Write your comment there --

---

**Question 6:** Design a similar architecture (still `Sequential` with `Dense` layers) but with a few more layers (not too many, or your laptop will struggle). Discuss the differences (training time, overfitting... anything you find interesting).

In [None]:
# Write your code here

-- Write some comments here --

---

## 2. The CIFAR10 dataset

At this point, you should be satisfied with your performance on MNIST. 
We will thus consider a more advanced dataset, the CIFAR10 dataset. 
It contains images of 10 different types of objects. 
Images are of size $32 \times 32$ and are RGB-colored (Red Green Blue), so a given image is of shape `32 x 32 x 3` (three _channels_, one per color). 

This part of the lab is heavily inspired by [this tutorial](https://www.tensorflow.org/tutorials/images/cnn).

The labels are given as digits between `0` and `9`, and correspond to the different types of objects. 
We provide an array to do the conversion:

In [None]:
# Conversion digit to name:
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

In [None]:
# Load the dataset
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

In [None]:
# Let's do some investigation and visualization. 
print("Shape of training set:", train_images.shape)
print("Shape of test set:", test_images.shape)

i = np.random.randint(train_images.shape[0])  # random index in the training set

fig, ax = plt.subplots(figsize=(8, 8))

ax.imshow(train_images[i])
ax.set_title("Image number %s representing a %s." %(i,class_names[train_labels[i][0]]))

---

**Question 7:** Design a `Sequential` network made of `Dense` layers (with correct input and output shapes), compile it, then train and test it on this dataset. Comment on the results.

In [None]:
# Write your code here.

-- Write comments here --

---

To improve our results on this harder dataset, we will now consider layers more tailored for image classification, namely [Convolutional layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D). 

We provide a Minimal Working Example below. 

**Note:** `Conv2D` layers are made to handle 2D images, so we do not need the `Flatten` layer first (as we did when we wanted to use `Dense` layers), but we should do it at the penultimate step to turn our "image" into a vector of logits. 

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(32, 32, 3)), 
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax')
])

We compile the model, using the [adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) optimizer, an improved version of the vanilla SGD, commonly used in practice. 

In [None]:
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
model.summary()

In [None]:
# Train the model
model.fit(train_images, train_labels, epochs=2)

In [None]:
# And now evaluate it. 
model.evaluate(test_images, test_labels, verbose=2)

---

**Question 8:** Briefly comment on the code and the results. 

-- Write your comments here --

---

**Question 9:** Design a model that reaches better performance. You should "easily" reach 75%, possibly up to 85% with such methods. To reach state-of-the-art results, you need more advanced techniques such as [ResNet](https://en.wikipedia.org/wiki/Residual_neural_network). If you beat 99.2%, you're officially world champion, congrats (but it starts to look suspicious, because [some labels are known to be **wrong**](https://franky07724-57962.medium.com/once-upon-a-time-in-cifar-10-c26bb056b4ce)). 

In [None]:
# Write your code here

---

**Question 10 (optional):** Find a way to visualize the intermediate representations of your CNN. That is, if your CNN is encoding $F = f_L \circ f_{L-1} \circ \dots \circ f_1(x)$, plot $f_\ell \circ \dots \circ f_1(x)$ for the intermediate $\ell \in \{1,\dots,L\}$. 

In [None]:
# Write your code here.

---

## 3. Some other advantages of CNN over Dense layers. 

As you can see, with a similar number of parameters, CNNs are much better than naive fully-connected NNs when it comes to learning from _natural_ 2D images. 

Interestingly, they have some other advantages. 

In this section, we will go back to the MNIST dataset (which was seemingly reasonably "solved" by a simple fully-connected network). However, we will consider the context of _distribution shift_: when your **test data** are not exactly similar to the ones used at training time.
This is a very important situation that occurs in practical applications: you train a model on clean data, but when deployed in "real life", data are slightly different (e.g. some noise, not exactly the same population, etc.) and this can have dramatic consequences in terms of performance. 

In this section, we will consider a simple noise model on test data: a random uniform noise $\epsilon \sim \mathcal{U}(0,\sigma)$, for some $\sigma$, is added to each pixel, and the result is clipped at $1$ (so that pixel values stay in $[0,1]$). 

In [None]:
def image_corruption(images_set, sigma):
    """
    Corrupt a set of images by adding i.i.d. uniform noise in [0, sigma) to each pixel, then clipping at 1. 
    """
    shape = images_set.shape
    
    corrupted_set = np.minimum(images_set + sigma * np.random.rand(*shape), 1)
    
    return corrupted_set

In [None]:
# Load the MNIST dataset again, just to be sure
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize so that values lie between 0 and 1. Not necessary, but common practice. 
x_train, x_test = x_train / 255.0, x_test / 255.0

In [None]:
# Build the "corrupted" test set.
corrupted_test_set = image_corruption(x_test, sigma=0.5)

In [None]:
# Visualisation of a corrupted image.
i = np.random.randint(corrupted_test_set.shape[0])  # sample random index
fig, ax = plt.subplots()
ax.imshow(corrupted_test_set[i], cmap='Greys')
ax.set_title("Corrupted image of a %s" %y_test[i])

---

**Question 11:** Design `model_FC` and `model_CNN` (respectively a fully-connected NN and a CNN) and train them on MNIST. Both should reach, say, above 95% _training_ accuracy (it does not matter if one is slightly better than the other), but keep them as simple as possible. Then test them on the `corrupted_test_set`.

What do you observe?

In [None]:
# Write your code here. 

model_FC = ...

In [None]:
model_CNN = ...

---

# III. Regression and regularization

The goal of this part of the lab session is to showcase an important technique in training machine learning models (unfortunately not studied in detail during the lectures): _regularization_. 

The key idea is the following: we want a large class of models $\{F(\cdot, \theta),\ \theta \in \Theta\}$, but at the same time we want to prevent overfitting (which is more likely to occur with larger classes of models, as they are more likely to interpolate the training data exactly). For this, we will _regularize_ on $\theta$, roughly saying "if you want a complex model (more likely to overfit), you have to pay for it, so only do it if this is strictly necessary". 

**Polynomial regression:** For the sake of simplicity, we stick to a simple use case: data and labels are simply in $\R$, and we suspect a polynomial relation between the two, that is

$$y = \sum_{i=0}^d a_i x^i \quad + \epsilon$$

where $\epsilon \sim \mathcal{N}(0,1)$ is some random Gaussian noise. The key point is that the degree $d$ is unknown, but say we know for sure that it is at most $10$. Meanwhile, we want to learn the parameters $a_i$. 

We recall the following trick: if we let $\mathbf{x} = (1, x, x^2,\dots,x^d)$ and $\theta = (a_0,a_1,\dots,a_d)$, the relation above can be written

$$y = \theta \cdot \mathbf{x} + \epsilon,$$

so _polynomial regression_ is simply a _linear regression_ on the _augmented variable_ $\mathbf{x}$. 
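
The trick above can be checked numerically with a minimal sketch. The sample points `x` and the coefficients `theta` below are made up for the illustration; `PolynomialFeatures` is the `scikit-learn` preprocessing class imported at the top of the notebook.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up 1D observations and polynomial coefficients (a_0, a_1, a_2, a_3).
x = np.array([0.5, 2.0, -1.0])
theta = np.array([1.0, 0.0, -2.0, 0.5])

# Augmented variables: each row is (1, x, x^2, x^3).
x_augmented = PolynomialFeatures(degree=3).fit_transform(x[:, None])

# The same matrix built by hand, to see what PolynomialFeatures does in 1D.
x_manual = np.vander(x, N=4, increasing=True)
print(np.allclose(x_augmented, x_manual))

# Evaluating the polynomial is now a plain linear map on the augmented variable.
y_linear = x_augmented @ theta
y_direct = 1.0 - 2.0 * x**2 + 0.5 * x**3
print(np.allclose(y_linear, y_direct))
```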

$\bullet$ Let us first generate and plot the data. 

In [None]:
x_train, y_train, x_test, y_test = utils.generate_data_regression()

$\bullet$ Now, we will train a natural Linear Regression using `scikit-learn` (more precisely, the module `sklearn.linear_model`, imported as `skl` in the first cell of the notebook), see [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for the documentation. We use the `PolynomialFeatures()` preprocessing method also provided by `scikit-learn` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).

We provide the code below to save some time :-). Please **read it carefully** to understand what is happening.

In [None]:
# We define the model (as an object)
model = skl.LinearRegression(fit_intercept=False)  # don't worry about this fit_intercept. 

# We define the maximum degree we consider.
d_max = 10

# We define the "PolynomialFeatures" object that helps us build augmented variables. 
PF = PolynomialFeatures(degree=d_max)

# We build the augmented variables
x_train_augmented = PF.fit_transform(x_train[:,None])

# Now we use our LinearRegression to fit the augmented variable <==> polynomial regression.
model.fit(x_train_augmented, y_train)

# We can now check the coefficients of our trained model. We use a function from utils for nicer display. 
utils.display_polynom(model)

Though this polynomial looks weird, maybe it's the correct one. Let us look at its MSE (Mean Squared Error), which, we recall, is

$$ \frac{1}{n} \sum_{i=1}^n (\theta \cdot \mathbf{x}_i - y_i)^2. $$

This quantity represents, as its name suggests, the average _squared_ error our model makes. So an MSE of $16$ would mean "on average, the square of the error we make is $16$". 
To get a quantity easier to interpret, we take the square root of it (with the previous example, it yields a typical error of $4$). 
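
As a small numerical illustration (with made-up predictions and targets, not the lab data), the sketch below computes the MSE and its square root, and compares the latter with the mean absolute error to show that the two notions of "average error" need not coincide.

```python
import numpy as np

# Made-up predictions and targets: three perfect predictions and one error of 8.
predictions = np.array([10.0, -3.0, 7.0, 1.0])
targets = np.array([10.0, -3.0, 7.0, 9.0])
residuals = predictions - targets

mse = np.mean(residuals ** 2)      # (0 + 0 + 0 + 64) / 4
rmse = np.sqrt(mse)                # square root of the MSE
mae = np.mean(np.abs(residuals))   # (0 + 0 + 0 + 8) / 4

print("MSE:", mse, "| sqrt(MSE):", rmse, "| mean absolute error:", mae)
```

Here the square root of the MSE is larger than the mean absolute error, because squaring gives more weight to the single large error.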

In [None]:
# We check the training loss. 
print("Training loss (square root MSE):", np.mean((model.predict(x_train_augmented) - y_train)**2)**(0.5))

You should get a seemingly small training loss (the data should vary between $-30$ and $30$, so an average error smaller than $1$, for instance, is quite good). 

But of course, what matters is the test loss. 

In [None]:
# We check the test loss. 
print("Test loss (square root MSE):", np.mean((model.predict(PF.transform(x_test[:,None])) - y_test)**2)**(0.5))

It is quite likely that the test loss is much higher than the training loss... What happened?

To get a visual idea of how our model behaves in practice, as we are in 1D, we can simply evaluate it on "all points in $(-3.5,3.5)$" (up to a discretization). 
Recall that, from `sklearn` perspective, our model is a `LinearRegression` on augmented data obtained through `PolynomialFeatures`. 

In [None]:
t = np.linspace(-3.5, 3.5, 500)
fig, ax = plt.subplots()
ax.scatter(x_train, y_train, label='Training set', c='blue')
ax.scatter(x_test, y_test, label='Test set', c='orange')
ax.plot(t, model.predict(PolynomialFeatures(degree=d_max).fit_transform(t[:,None])), c='red', label='Our model')
ax.set_ylim(min(np.min(y_train), np.min(y_test)) - 5, max(np.max(y_train), np.max(y_test)) + 5)
ax.legend()
ax.grid()

---

**Question 1:** What conclusion can you reach from this plot and previous observations?

-- Write your comments here --

---

We definitely must improve on this. Intuitively, 
- dramatic overfitting is due to the polynomial encoded by the model varying too much,
- variations in the polynomial are due to large (in absolute value) coefficients (why?), that is, to $\theta$ having a high norm.

In [None]:
# Let us check the norm of our Theta:
np.linalg.norm(model.coef_)

From this observation, a natural attempt to mitigate overfitting is to **control** the coefficients of our polynomial, that is, to _penalize_ parameters $\theta$ with high norm (in the $L^2$ or $L^1$ sense). 

In a nutshell, instead of only minimizing the MSE in $\theta$, we will minimize the following loss:

$$L : \theta \mapsto \frac{1}{n} \sum_{i=1}^n (\theta \cdot \mathbf{x}_i  - y_i)^2 + \alpha \|\theta\|^2.$$

This training procedure is called a **Ridge regression**.

---

**Question 2:** Is this loss function convex? Is there a closed form for the optimal $\theta$? What is the role of the parameter $\alpha \in \R$?

-- Write your comments here --

---

**Question 3:** Adapting the previous code, build a `Ridge` model (representing a Ridge regression, see [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)), train it, check the training and test loss (for the standard MSE, not the Ridge loss), plot the model, compute the norm of the resulting $\theta$, and comment on the results. 

_Note:_ Try for different values of `alpha`. 

In [None]:
model2 = ...