<a href="https://colab.research.google.com/github/vladbug/AA-googlecolab/blob/main/LinearRegression_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear regression

The goal of this exercise is to implement the linear regression method seen in the course.

Please copy this notebook to your Google account or download it as a Jupyter Notebook.

We start by importing numpy and matplotlib.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Minimize the mean squared error

To generate the data for the exercise, we will use the `scikit-learn` library <https://scikit-learn.org>. It provides a huge selection of already implemented machine learning algorithms for classification, regression or clustering.

If you use Anaconda or Colab, `scikit-learn` should already be installed. Otherwise, install it with `pip` (you may need to restart this notebook afterwards):

```
pip install scikit-learn
```

We will use the method `sklearn.datasets.make_regression` to generate the data. The documentation of this method is available at <https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html>.

The following cell imports the method:

In [None]:
from sklearn.datasets import make_regression

We can now generate the data. We start with the simplest case where the inputs have only one dimension. We will generate 100 samples $(x_i, t_i)$ linked by a linear relationship, plus some noise.

The following code generates the data:

In [None]:
N = 100
X, t = make_regression(n_samples=N, n_features=1, noise=15.0)

`n_samples` is the number of samples generated, `n_features` is the number of input variables, and `noise` is the standard deviation of the Gaussian noise added to the outputs, i.e. how much the points deviate from the linear relationship.

**Q:** Print the shape of the arrays `X` and `t` to better understand what is generated. Visualize the dataset using matplotlib (`plt.scatter`). Vary the value of the `noise` argument in the previous cell and visualize the data again.
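One possible way to inspect and plot the data is sketched below. The `random_state=0` argument is an assumption added here only to make the figure reproducible; it is not part of the original cell.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

N = 100
# random_state is an assumption: it fixes the random draw for reproducibility
X, t = make_regression(n_samples=N, n_features=1, noise=15.0, random_state=0)

print(X.shape)  # (100, 1): one row per sample, one column per feature
print(t.shape)  # (100,): one target value per sample

# Scatter plot of the single input feature against the target
plt.scatter(X[:, 0], t)
plt.xlabel("x")
plt.ylabel("t")
plt.show()
```

Re-running the sketch with a smaller or larger `noise` value should make the cloud of points tighter or wider around the underlying line.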

Now is the time to solve the Linear Regression problem with numpy.

Remember the problem we need to solve:

$$\mathop{\text{minimize} }_\alpha \|X\alpha - y\|^2$$

where
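$X\in \mathbb{R}^{N \times (d+1)}$ is the augmented data matrix such that each row $i$ is $x_i = (x_{i1} \; 1)$, containing the features of example $i$ followed by a constant 1, and $y \in \mathbb{R}^N$ is the vector of targets (the array `t` generated above).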

**Q:** Create matrix $X$ and vector $y$.
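A minimal sketch of one way to build the augmented matrix, assuming the data comes from `make_regression` as above. The name `X_aug` is hypothetical and is used here only to avoid overwriting the raw inputs; `random_state=0` is likewise an assumption for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_regression

# random_state is an assumption added for reproducibility
X, t = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=0)

# Append a column of ones so that the last weight plays the role of the intercept
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
y = t  # the targets are used unchanged

print(X_aug.shape)  # (100, 2)
print(X_aug[:3])    # each row is (x_i1, 1)
```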


**Q:** Identify the optimization variable of the problem.

This is a (convex) quadratic problem. To solve it, we simply differentiate it, obtaining the gradient, and equate the gradient to zero, obtaining

$$ X^T(X\alpha-y)=0.$$

By solving this equation with respect to $\alpha$ we get
$$\alpha = (X^TX)^{-1} X^Ty.$$
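For reference, the differentiation step can be written out as follows. This is a sketch of the standard derivation, and it assumes that $X$ has full column rank so that $X^TX$ is invertible:

$$\|X\alpha - y\|^2 = \alpha^T X^T X \alpha - 2\, y^T X \alpha + y^T y$$

$$\nabla_\alpha \|X\alpha - y\|^2 = 2\, X^T X \alpha - 2\, X^T y = 2\, X^T (X\alpha - y)$$

Setting the gradient to zero and dividing by 2 gives $X^T(X\alpha - y) = 0$, that is $X^TX\alpha = X^Ty$, from which the closed-form expression for $\alpha$ follows.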

In NumPy, check how to do matrix multiplication and how to compute the inverse of a matrix. Read the documentation of [pinv](https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html), [matmul](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html) and [dot](https://numpy.org/doc/stable/reference/generated/numpy.dot.html) to start.
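As a quick illustration of these functions, here is a small sketch on a toy 2×2 matrix. The arrays `A` and `b` are made up for this example and are unrelated to the regression data.

```python
import numpy as np

# Toy matrix and vector, chosen only for illustration
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Three equivalent ways to multiply a 2-D array by a 1-D array
print(A @ b)            # [4. 7.]
print(np.matmul(A, b))  # [4. 7.]
print(np.dot(A, b))     # [4. 7.]

# For an invertible matrix, the pseudo-inverse coincides with the inverse
A_inv = np.linalg.pinv(A)
print(np.allclose(A_inv @ A, np.eye(2)))  # True
```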


**Q:** Using the above identity, compute the regression weights $\alpha$.
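A possible sketch of this computation, assuming the augmented matrix is built by appending a column of ones. The names `X_aug` and `a_lstsq` are hypothetical, and `random_state=0` is an assumption for reproducibility. The result is cross-checked against `np.linalg.lstsq`, which solves the same least-squares problem without forming the inverse explicitly.

```python
import numpy as np
from sklearn.datasets import make_regression

# random_state is an assumption added for reproducibility
X, t = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=0)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # rows are (x_i1, 1)
y = t

# alpha = (X^T X)^{-1} X^T y, using the pseudo-inverse of X^T X
a = np.linalg.pinv(X_aug.T @ X_aug) @ X_aug.T @ y

# Cross-check with NumPy's built-in least-squares solver
a_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print("slope, intercept:", a)
print("matches lstsq:", np.allclose(a, a_lstsq))
```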



**Q:** Visualize the quality of the fit by superposing the learned model to the data with matplotlib.

*Tip*: you can get the extreme values of the x-axis with `X.min()` and `X.max()`. To visualize the model, you just need to plot a line between the points `(X.min(), a[0]*X.min()+a[1])` and `(X.max(), a[0]*X.max()+a[1])`, where `a` is the array holding the weights $\alpha$.
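The tip could be turned into code roughly as follows. This is a sketch, where `x_line` and `y_line` are hypothetical names and the data is regenerated with an assumed `random_state=0` so that the block runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# random_state is an assumption added for reproducibility
X, t = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=0)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
y = t
a = np.linalg.pinv(X_aug.T @ X_aug) @ X_aug.T @ y  # a[0]: slope, a[1]: intercept

# Two end points are enough to draw a straight line
x_line = np.array([X.min(), X.max()])
y_line = a[0] * x_line + a[1]

plt.scatter(X[:, 0], y, label="data")
plt.plot(x_line, y_line, color="red", label="learned model")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```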

Another option is to predict a value for all inputs and plot this vector $\hat{y}$ against the desired values $y$.

**Q:** Make a scatter plot where $y$ is the x-axis and $\hat{y} = \alpha[0]\, x + \alpha[1]$ is the y-axis. How should the points be arranged in the ideal case? Also plot what this ideal relationship should be.
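One way to draw this plot is sketched below, with the same assumed setup as before (`random_state=0`, hypothetical names `X_aug`, `y_hat` and `lims`). In the ideal case every prediction equals its target, so the points would lie on the diagonal $\hat{y} = y$, which is drawn as a dashed line.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# random_state is an assumption added for reproducibility
X, t = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=0)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
y = t
a = np.linalg.pinv(X_aug.T @ X_aug) @ X_aug.T @ y

# Predictions for all inputs: y_hat_i = a[0] * x_i + a[1]
y_hat = X_aug @ a

plt.scatter(y, y_hat, label="predictions")
lims = [y.min(), y.max()]
plt.plot(lims, lims, "k--", label="ideal: y_hat = y")
plt.xlabel("y")
plt.ylabel("y_hat")
plt.legend()
plt.show()
```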

## Scikit-learn

The code that you have written is functional, but hand-written numerical code can become extremely slow, especially if it uses for loops in Python. With so few data samples it does not make a difference, but with millions of samples this would start to be a problem.

The solution is to use optimized implementations of the algorithms, running in C++ or FORTRAN under the hood. We will use here the ordinary least squares solver provided by `scikit-learn`, as you have already installed it and it is very simple to use. Note that one could use TensorFlow too, but that would be killing a fly with a sledgehammer.

`scikit-learn` provides a `LinearRegression` object that implements the training procedure above. The documentation is at: <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html>.

You simply import it with:

```python
from sklearn.linear_model import LinearRegression
```

You create the object with:

```python
reg = LinearRegression()
```

`reg` is now an object whose methods (`fit()`, `predict()`) accept array-like data and perform linear regression.

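To train the model on the data $(X, y)$, simply use the following call. Pass the original, non-augmented inputs here, since `LinearRegression` fits the intercept itself by default: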

```python
reg.fit(X, y)
```

The parameters of the model are obtained with `reg.coef_` for $\alpha[0]$ and `reg.intercept_` for $\alpha[1]$.

You can predict outputs for new inputs using:

```python
y_pred = reg.predict(X)  # stored under a new name so the targets y are not overwritten
```

**Q:** Apply linear regression on the data using `scikit-learn`. Check the model parameters after learning and compare them to what you obtained previously. Print the mean squared error (MSE) and make a plot comparing the predictions with the data.
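A possible end-to-end sketch is given below. It assumes the same generated data as before (with an added `random_state=0` for reproducibility) and recomputes the closed-form weights under the hypothetical names `X_aug` and `a` so that both solutions can be compared in one place.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# random_state is an assumption added for reproducibility
X, y = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=0)

# Fit on the non-augmented inputs: the intercept is handled by scikit-learn
reg = LinearRegression()
reg.fit(X, y)
y_pred = reg.predict(X)

# Closed-form solution on the augmented matrix, for comparison
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
a = np.linalg.pinv(X_aug.T @ X_aug) @ X_aug.T @ y

print("scikit-learn:", reg.coef_[0], reg.intercept_)
print("closed form: ", a[0], a[1])

mse = mean_squared_error(y, y_pred)
print("MSE:", mse)

# Data and fitted line on the same axes
order = np.argsort(X[:, 0])
plt.scatter(X[:, 0], y, label="data")
plt.plot(X[order, 0], y_pred[order], color="red", label="prediction")
plt.legend()
plt.show()
```

Note that `reg.coef_` is an array with one entry per feature, hence the `[0]` index, while `reg.intercept_` is a scalar here.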