# Amphi 6 - An Introduction to Regression: Linear Regression, Ridge, Lasso

# 1. Introduction

In amphi 5, we classified the basic problems in Machine Learning into supervised and unsupervised learning, and we looked at examples of regression and classification problems.

## 1.1 The loss function

We return to the regression problem. Suppose that we would like to predict the values of a variable $y$ as a function of some variable $\mathbf X \in \mathbf R^D$. Let $\mathbf X_1, \ldots, \mathbf X_N$ and $y_1, \ldots, y_N$ be the values of $\mathbf X$ and $y$, respectively, in $N$ observations. We want to define a function $g$ that describes the relation between $\mathbf X$ and $y$, ideally with $y = g(\mathbf X)$ or $y \approx g(\mathbf X)$.

How do we check that $g(\mathbf X)$ is a good prediction of $y$? We should define a **loss function** $L(g, \mathbf X, y)$ which takes a small value when $g(\mathbf X) \approx y$ and a large value when $g(\mathbf X)$ is far from $y$.

In regression, one of the most common choices is the **square loss**. It is convenient because it is differentiable everywhere, which simplifies the calculus.

**Square loss (with respect to the estimation of $y$ by $g(\mathbf X)$)**
$$ L(g, \mathbf X, y) = \vert y - g(\mathbf X) \vert^2 $$

If $g(\mathbf X)$ is a good prediction of $y$, then $g(\mathbf X_i) \approx y_i$ for the $N$ observations $(\mathbf X_i, y_i), i = 1, \ldots, N$. In practice, we define the loss with respect to the estimation of $y_1, \ldots, y_N$ by $g(\mathbf X_1), \ldots, g(\mathbf X_N)$ as the sum of the square losses over the observations.

**Square loss (for $N$ observations)**
$$ L(g, \mathbf X_1, \ldots, \mathbf X_N, y_1, \ldots, y_N) = \sum_{n=1}^N \vert y_n - g(\mathbf X_n) \vert^2 $$
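The sum above translates directly into a few lines of NumPy. The following is a minimal sketch, not part of the original notes: the helper `square_loss`, the toy data `X_toy`, `y_toy` and the predictor `g_sum` are all hypothetical names chosen for illustration.

```python
import numpy as np

def square_loss(g, X, y):
    """Sum of squared errors of the predictor g over the N rows of X."""
    predictions = np.array([g(x) for x in X])
    return float(np.sum((y - predictions) ** 2))

# Hypothetical toy data: N = 3 observations with D = 2 features.
X_toy = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]])
y_toy = np.array([3.0, 1.0, 2.5])

# Hypothetical predictor g(X) = X[0] + X[1].
def g_sum(x):
    return x[0] + x[1]

loss_toy = square_loss(g_sum, X_toy, y_toy)
print(loss_toy)
```

Only the third observation is mispredicted here (2 instead of 2.5), so the loss reduces to that single squared error.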

## 1.2 Linear model

In regression, the model is called linear if $g(\mathbf X)$ is of the form:
$$
g(\mathbf X) = \mathbf X \cdot \mathbf w + b
$$
where $\mathbf w \in \mathbf R^D$ and $b \in \mathbf R$.

By adding a new coordinate to the variable $\mathbf X$ if necessary, we can suppose that the last coordinate of $\mathbf X$ is always 1. Then the linear model has the form:
$$
g(\mathbf X) = \mathbf X \cdot \mathbf w
$$
Here $b$ from the first representation becomes the last coordinate of $\mathbf w$.

Hence, without loss of generality, we will use $g(\mathbf X) = \mathbf X \cdot \mathbf w$ as the general form of linear models.
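As a quick check of this reduction, the sketch below appends a column of ones to a small data matrix and verifies that both forms give the same predictions. The helper name `add_bias_column` and the numbers are assumptions made for illustration only.

```python
import numpy as np

def add_bias_column(X):
    """Append a constant 1 as the last coordinate of every observation."""
    X = np.asarray(X, dtype=float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

# Hypothetical data and parameters for the affine form X . w + b.
X_raw = np.array([[1.0, 2.0], [3.0, 4.0]])
w_raw = np.array([0.5, -1.0])
b_raw = 2.0

# In the augmented form, b becomes the last coordinate of w.
X_aug = add_bias_column(X_raw)
w_aug = np.append(w_raw, b_raw)

pred_affine = X_raw.dot(w_raw) + b_raw
pred_linear = X_aug.dot(w_aug)
print(np.allclose(pred_affine, pred_linear))
```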

## 1.3 Minimizing the loss function

If we choose the square loss as our loss function (a criterion to evaluate which model is better), the linear model as our model, and the observations $(\mathbf X_n, y_n)_{n = 1, \ldots, N}$ as training data, then the evident strategy is to find the $\mathbf w$ that minimizes the loss function over the training data. The problem becomes:

$$
\min\limits_{\mathbf w \in \mathbf R^D} \sum_{n=1}^N \vert y_n - \mathbf X_n \cdot \mathbf w \vert^2
$$

Let $\mathbf y = (y_1, \ldots, y_N)^t$ denote the vector in $\mathbf R^N$ whose coordinates are the $N$ observations of $y$, and let $\mathbf \Phi$ denote the matrix in $\mathbf R^{N \times D}$ whose rows are $\mathbf X_n^t$, the $N$ observations of $\mathbf X$.

Then $\sum_{n=1}^N \vert y_n - \mathbf X_n \cdot \mathbf w \vert^2$ becomes $\Vert \mathbf y - \mathbf \Phi \mathbf w \Vert^2$. The problem becomes:

$$
\min\limits_{\mathbf w \in \mathbf R^D} \Vert \mathbf y - \mathbf \Phi \mathbf w \Vert^2
$$

Let $\mathcal L(\mathbf w) = \Vert \mathbf y - \mathbf \Phi \mathbf w \Vert^2$. This is a function $\mathbf R^D \to \mathbf R$ that is convex in $\mathbf w$, hence any local minimum is a global minimum. The minimizer is unique when $\mathbf \Phi^t \mathbf \Phi$ is invertible.

The minimum can be found by solving:
$$
\nabla_{\mathbf w} \mathcal L = \mathbf 0 \Leftrightarrow 2\mathbf \Phi^t(\mathbf \Phi \mathbf w - \mathbf y) = 0
$$
$$
\Leftrightarrow \mathbf \Phi^t \mathbf \Phi \mathbf w = \mathbf \Phi^t \mathbf y
$$
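The gradient used here can be checked by expanding the squared norm, using only the definitions of this section:
$$
\mathcal L(\mathbf w) = (\mathbf y - \mathbf \Phi \mathbf w)^t (\mathbf y - \mathbf \Phi \mathbf w) = \mathbf y^t \mathbf y - 2 \mathbf w^t \mathbf \Phi^t \mathbf y + \mathbf w^t \mathbf \Phi^t \mathbf \Phi \mathbf w
$$
$$
\nabla_{\mathbf w} \mathcal L = -2 \mathbf \Phi^t \mathbf y + 2 \mathbf \Phi^t \mathbf \Phi \mathbf w = 2 \mathbf \Phi^t (\mathbf \Phi \mathbf w - \mathbf y)
$$
The Hessian is $2 \mathbf \Phi^t \mathbf \Phi$, which is positive semi-definite; this is the reason $\mathcal L$ is convex.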

If $\mathbf \Phi^t \mathbf \Phi$ is invertible, the solution is
$$
\mathbf w = (\mathbf \Phi^t \mathbf \Phi)^{-1} \mathbf \Phi^t \mathbf y
$$
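In code, the invertible case is usually handled by solving the linear system $\mathbf \Phi^t \mathbf \Phi \mathbf w = \mathbf \Phi^t \mathbf y$ rather than forming the inverse explicitly. The sketch below assumes hypothetical data generated exactly by $y = 2x + 1$, with the constant 1 stored as the last column of `Phi`.

```python
import numpy as np

# Hypothetical data: y = 2 * x + 1 exactly, bias column appended last.
Phi = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y_vec = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equations: (Phi^t Phi) w = Phi^t y.
gram = Phi.T.dot(Phi)
rhs = Phi.T.dot(y_vec)

# Phi^t Phi is invertible here, so the system has a unique solution.
w_hat = np.linalg.solve(gram, rhs)
print(w_hat)
```

Since the data are noise-free, the recovered `w_hat` should match the slope 2 and the intercept 1 up to rounding.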

If $\mathbf \Phi^t \mathbf \Phi$ is not invertible, the inverse can be replaced by the (Moore-Penrose) pseudo inverse matrix or any other generalized inverse:
$$
\mathbf w = (\mathbf \Phi^t \mathbf \Phi)^{+} \mathbf \Phi^t \mathbf y
$$

## 1.4 The Moore-Penrose pseudo inverse matrix in Python

Use **numpy.linalg.pinv** to find the pseudo inverse.

In [34]:
import numpy as np
X = np.array([[1, 1, 1], [1, 2, 3]])
y = np.array([2, 3])
print(X.transpose().dot(X))  # Phi^t Phi

[[ 2  3  4]
 [ 3  5  7]
 [ 4  7 10]]


In [35]:
# Phi^t Phi is singular here (rank 2), so the ordinary inverse is not usable:
# np.linalg.inv(X.transpose().dot(X))

In [36]:
print(np.linalg.pinv(X.transpose().dot(X)))

[[ 2.02777778  0.44444444 -1.13888889]
 [ 0.44444444  0.11111111 -0.22222222]
 [-1.13888889 -0.22222222  0.69444444]]


In [39]:
w = np.linalg.pinv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
print(w)

[ 1.16666667  0.66666667  0.16666667]


In [40]:
print(X.transpose().dot(X).dot(w))  # Phi^t Phi w

[  5.   8.  11.]


In [42]:
print(X.transpose().dot(y))  # Phi^t y

[ 5  8 11]
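The same solution can be reached by other routes. The sketch below, which is an addition to the original cells, reuses the same `X` and `y` and compares the pseudo inverse of the normal matrix with `np.linalg.pinv(X)` applied directly and with `np.linalg.lstsq`; for this underdetermined system all three should return the minimum-norm least-squares solution.

```python
import numpy as np

X = np.array([[1, 1, 1], [1, 2, 3]], dtype=float)
y = np.array([2, 3], dtype=float)

# Route 1: pseudo inverse of Phi^t Phi, as in the cells above.
w_normal = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)

# Route 2: pseudo inverse of Phi itself, using Phi^+ = (Phi^t Phi)^+ Phi^t.
w_pinv = np.linalg.pinv(X).dot(y)

# Route 3: least-squares solver; rcond=None selects the current default cutoff.
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_pinv), np.allclose(w_normal, w_lstsq))
print(rank)
```

Routes 2 and 3 avoid forming $\mathbf \Phi^t \mathbf \Phi$, which squares the condition number of the problem, so they tend to be the safer choice numerically.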


## 1.5 Complexity

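The closed-form solution costs $O(ND^2 + D^3)$: forming $\mathbf \Phi^t \mathbf \Phi$ takes $O(ND^2)$ and solving the resulting $D \times D$ system takes $O(D^3)$. The first term dominates when $D \ll N$, the second when $N \ll D$.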

## 1.6 Implementation

# 2. Gradient descent

# 3. Feature scaling

# 4. Polynomial regression

# 5. Overfitting

# 6. Ridge and Lasso