
# Introduction

## Linear Regression

* Acknowledgment: This notebook is being used with the kind permission of Kelwin Fernandes and Ricardo Cruz

We will use linear regression as a short introduction.

See the Linear Regression [documentation here.](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

A linear regression is a model of the type:

$y = w\cdot x + b$

(where $w$ and $b$ are discovered automatically based on the data.)
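As a minimal sketch of what "discovered automatically" means, the cell below builds hypothetical toy data that lies exactly on the line $y = 2x + 1$ and recovers $w$ and $b$ with `np.polyfit`. The names `x_demo`, `y_demo`, `w_demo` and `b_demo` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 2*x + 1.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0

# With deg=1, np.polyfit returns the least-squares [slope, intercept].
w_demo, b_demo = np.polyfit(x_demo, y_demo, deg=1)
print(w_demo, b_demo)
```

Because the data has no noise, the recovered values should match the slope and intercept used to build it, up to floating-point error.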

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Let's invent our own data.

In [None]:
b = 1000
w0 = -5
w1 = 1e-2
w2 = -6e-6
w3 = 4e-10

x = np.linspace(100, 1000, 25)
y = b + w0*x + w1*x*x + w2*x*x*x + w3*x*x*x*x

plt.plot(x, y, 'o')
plt.xlabel('Area (m²)')
plt.ylabel('Price (1000 €)')
plt.show()

Let's add some random Gaussian noise.

In [None]:
y = b + w0*x + w1*x*x + w2*x*x*x + w3*x*x*x*x
y += np.random.randn(len(x))*20

plt.plot(x, y, 'o')
plt.xlabel('Area (m²)')
plt.ylabel('Price (1000 €)')
plt.show()

Let's fit a linear regression:

$Price = b + w_0Area$

A linear regression will try to find the values of $b$ and $w_0$ which minimize the difference between the real $Price$ and the predicted Price.

In mathematical terms, we want to $\min_{b,w_0}\|Price-\hat{Price}\|^2$.
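The objective above can be evaluated directly. The sketch below uses hypothetical data and a hypothetical helper `squared_error`, solves the least-squares problem with `np.linalg.lstsq`, and checks that nudging the intercept away from the solution increases the squared error.

```python
import numpy as np

def squared_error(b, w0, area, price):
    """Sum of squared differences between real and predicted prices."""
    predicted = b + w0 * area
    return np.sum((price - predicted) ** 2)

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
area = np.linspace(100, 1000, 25)
price = 300 + 0.5 * area + rng.normal(scale=20, size=area.size)

# Least-squares solution: the columns are [1, Area], the unknowns are [b, w0].
A = np.column_stack([np.ones_like(area), area])
(b_hat, w0_hat), *_ = np.linalg.lstsq(A, price, rcond=None)

best = squared_error(b_hat, w0_hat, area, price)
worse = squared_error(b_hat + 10, w0_hat, area, price)
print(best, worse)
```

`LinearRegression` solves the same kind of least-squares problem, so the values it reports in `intercept_` and `coef_` play the role of `b_hat` and `w0_hat` here.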

In [None]:
from sklearn.linear_model import LinearRegression
m = LinearRegression()
m.fit(x[:, np.newaxis], y)
print('score:', m.score(x[:, np.newaxis], y))

b = m.intercept_
w0 = m.coef_[0]

print(b, w0)

plt.plot(x, y, 'o')
plt.plot(x, b+w0*x, '-')
plt.show()

This model is not expressive enough, right?

Let us try this model:

$Price = b + w_0Area + w_1{Area}^2$

In [None]:
from sklearn.linear_model import LinearRegression
m = LinearRegression()
_x = np.c_[x, x*x]
m.fit(_x, y)
print('score:', m.score(_x, y))

b = m.intercept_
w0 = m.coef_[0]
w1 = m.coef_[1]
print(b, w0, w1)

plt.plot(x, y, 'o')
plt.plot(x, b+w0*x+w1*x*x, '-')
plt.show()

It's better, but the coefficients are still very different from those of the original model...

Let us try a model similar to the original model:

$Price = b + w_0Area + w_1{Area}^2 + w_2{Area}^3 + w_3{Area}^4$

In [None]:
order = 4

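from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# `normalize=True` was removed in scikit-learn 1.2, so the features are
# standardized in a pipeline instead.
m = make_pipeline(StandardScaler(), LinearRegression())
_x = np.array([x**(i+1) for i in range(order)]).T
m.fit(_x, y)
print('score:', m.score(_x, y))

# Map the coefficients back to the original (unscaled) features.
scaler, reg = m[0], m[-1]
ws = reg.coef_ / scaler.scale_
b = reg.intercept_ - np.sum(ws * scaler.mean_)
print(b, ws)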

plt.plot(x, y, 'o')
plt.plot(x, b+np.sum([ws[i]*x**(i+1) for i in range(order)], 0), '-')
plt.show()

**Exercise:** change the previous code to use a polynomial of **order=40** !

$Price = b + w_0Area + w_1{Area}^2 + w_2{Area}^3 + w_3{Area}^4 + \dots + w_{37}{Area}^{38} + w_{38}{Area}^{39} + w_{39}{Area}^{40}$

The training error is smaller, but does this model explain the numbers better?

Look at the coefficients... Will this model give good results in the real world?
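One way to probe the "real world" question is to hold out some points and score the model on data it has not seen. The sketch below is only an illustration under assumptions: it regenerates data like the cells above with a fixed seed, rescales the area to roughly $[0, 1]$ for numerical stability, and compares a degree-4 and a degree-15 polynomial. The degree 15 and the 70/30 split are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Regenerate data like the cells above (fixed seed so the run is repeatable).
rng = np.random.default_rng(0)
x_all = np.linspace(100, 1000, 25)
y_all = 1000 - 5*x_all + 1e-2*x_all**2 - 6e-6*x_all**3 + 4e-10*x_all**4
y_all = y_all + rng.normal(scale=20, size=x_all.size)

# Rescale the area to roughly [0, 1] so that high powers stay well-behaved.
X_all = (x_all / 1000)[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0)

scores = {}
for degree in (4, 15):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    # Store (training score, held-out score) for this degree.
    scores[degree] = (model.score(X_train, y_train),
                      model.score(X_test, y_test))
    print(degree, scores[degree])
```

The number to watch is the second entry of each pair: a model can score very well on its training points and still do poorly on the held-out ones.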

----

The problem is that we often do not know the structure of the problem. Here we only know that price as a function of area follows a polynomial of order 4 because we invented the data ourselves.

What we can do in these cases is use a high-order polynomial and then penalize coefficients that are **too big!**

In mathematical terms, we want to $\min_{b,\vec{w}}\|Price-\hat{Price}\|^2 + \alpha\sum_i\|w_i\|$.

We will use the Lasso model for that. See here [the documentation.](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
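To see what the penalty term does, the sketch below evaluates the penalized objective for two coefficient vectors that make identical predictions. The helper `lasso_objective` and the toy data are hypothetical. Note that scikit-learn's `Lasso` scales the error term by $1/(2n)$, so its `alpha` is not numerically the same as the $\alpha$ in this unscaled version.

```python
import numpy as np

def lasso_objective(b, ws, features, price, alpha):
    """Squared error plus alpha times the sum of absolute coefficients."""
    predicted = b + features @ ws
    return np.sum((price - predicted) ** 2) + alpha * np.sum(np.abs(ws))

# Hypothetical toy data: two identical feature columns.
a = np.array([1.0, 2.0, 3.0, 4.0])
features = np.column_stack([a, a])
price = 2.0 * a

small_ws = np.array([2.0, 0.0])     # one active coefficient
large_ws = np.array([52.0, -50.0])  # same predictions, much larger coefficients

for alpha in (0.0, 1.0):
    print(alpha,
          lasso_objective(0.0, small_ws, features, price, alpha),
          lasso_objective(0.0, large_ws, features, price, alpha))
```

Without the penalty (`alpha = 0`) the two solutions are indistinguishable. With `alpha = 1` the objective prefers the one with small coefficients.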

In [None]:
order = 40
alpha = 100

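from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# `normalize=True` was removed in scikit-learn 1.2, so the features are
# standardized in a pipeline instead. Multiplying alpha by sqrt(n_samples)
# is intended to keep the penalty comparable to the old `normalize=True` scaling.
m = make_pipeline(StandardScaler(),
                  Lasso(alpha=alpha*np.sqrt(len(x)), max_iter=100000))
m.fit(np.array([x**(i+1) for i in range(order)]).T, y)

# Map the coefficients back to the original (unscaled) features.
scaler, reg = m[0], m[-1]
ws = reg.coef_ / scaler.scale_
b = reg.intercept_ - np.sum(ws * scaler.mean_)
print(b, ws)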

plt.plot(x, y, 'o')
plt.plot(x, b+np.sum([ws[i]*x**(i+1) for i in range(order)], 0), '-')
plt.show()

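# Lasso sets inactive coefficients to exactly zero; a fixed threshold such as
# 1e-10 would hide high-order terms, whose unscaled coefficients are tiny.
print("Active coefficients:", (np.arange(len(ws)) + 1)[ws != 0])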

**Exercise:** change the `alpha` hyperparameter and see how that changes the plot (e.g. 0.001).
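The effect of `alpha` can also be explored with a small sweep that counts how many coefficients stay active. This is a sketch under assumptions: the data is regenerated with a fixed seed, the features are standardized with `StandardScaler`, and the three `alpha` values are arbitrary, so they are not directly comparable to the values used in the cell above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Regenerate data like the cells above (fixed seed so the run is repeatable).
rng = np.random.default_rng(0)
x_demo = np.linspace(100, 1000, 25)
y_demo = 1000 - 5*x_demo + 1e-2*x_demo**2 - 6e-6*x_demo**3 + 4e-10*x_demo**4
y_demo = y_demo + rng.normal(scale=20, size=x_demo.size)

order = 40
features = np.array([x_demo**(i+1) for i in range(order)]).T

active_counts = {}
for alpha in (1e6, 10.0, 1.0):
    model = make_pipeline(StandardScaler(),
                          Lasso(alpha=alpha, max_iter=100000))
    model.fit(features, y_demo)
    # Coefficients in the standardized feature space; zeros are exact.
    coefs = model[-1].coef_
    active_counts[alpha] = int(np.sum(coefs != 0))
    print(alpha, active_counts[alpha])
```

A very large `alpha` drives every coefficient to zero, which leaves only the intercept (a flat line). Smaller values let some terms back in.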

The result is not as good as the 4th-order polynomial, but if we do not know the structure of the data, it is better than using a very high-order polynomial.

This technique is used a lot, including in neural networks, to make sure that simpler hypotheses are favored over very complex ones.