# Linear Regression

## Introduction

Linear regression is a **supervised machine learning** method: the training data must consist of input features paired with an output target. It performs a regression task, namely finding the **linear relationship** between the **independent features** and the **target**. Its uses mostly fall into two categories: forecasting (or prediction) and explaining the variation in the target.

The mathematical form of linear regression describes the linear dependence of the output variable (target) on the input variables (features). Let $\boldsymbol{x} = (x_1, \dots , x_m)$ be the independent variables, $y$ the dependent variable, and $\boldsymbol{\theta} = (\theta_0, \dots, \theta_m)$ the coefficients. The model is
$$
y = \theta_0 + \theta_1 x_1 + \dots + \theta_m x_m + \epsilon = \boldsymbol{\overline{x}}^\top \boldsymbol{\theta} + \epsilon,
$$
where $\epsilon$ is a random error, and $\boldsymbol{\overline{x}} = (1, x_1, \dots, x_m)$.
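
To make the notation concrete, here is a minimal sketch that evaluates the noiseless part of the model, $\boldsymbol{\overline{x}}^\top \boldsymbol{\theta}$, for a single sample. The function name `predict_one` and all numbers are illustrative assumptions, not part of the original derivation.

```python
import numpy as np


def predict_one(x, theta):
    """Return x_bar^T theta, where x_bar = (1, x_1, ..., x_m)."""
    x_bar = np.concatenate(([1.0], x))  # prepend the constant 1 that multiplies theta_0
    return float(x_bar @ theta)


# Hypothetical values: m = 2 features, theta = (theta_0, theta_1, theta_2).
theta = np.array([1.0, 2.0, -0.5])
x = np.array([3.0, 4.0])
prediction = predict_one(x, theta)  # 1 + 2*3 - 0.5*4
```

Prepending the constant $1$ is what lets the intercept $\theta_0$ be handled by the same inner product as the other coefficients.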

## Data Modeling
Let $X = ({\boldsymbol{\overline{x}}^{(i)}}^\top)_{i=1}^{n}$ be the input data, and $\boldsymbol{y} = (y^{(i)})_{i=1}^{n}$ be the targets, then the data model can be written as
$$
X \boldsymbol{\theta} \approx \boldsymbol{y}.
$$
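
As a rough illustration, the matrix $X$ can be built by stacking one row $(1, x_1, \dots, x_m)$ per sample. The helper name `design_matrix` and the data below are hypothetical and only meant to show the shapes involved.

```python
import numpy as np


def design_matrix(features):
    """Stack a column of ones in front of an (n, m) feature array."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])  # shape (n, m + 1)


# Hypothetical data: n = 3 samples, m = 2 features.
features = np.array([[3.0, 4.0],
                     [0.0, 1.0],
                     [2.0, 2.0]])
theta = np.array([1.0, 2.0, -0.5])

X = design_matrix(features)
X_theta = X @ theta  # one value per row of X, to be compared with the targets y
```
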
If we call $\hat{\boldsymbol{y}} = X \boldsymbol{\theta}$ the predictions, then the error vector is $\boldsymbol{\epsilon} = \boldsymbol{y} - \hat{\boldsymbol{y}}$. Our goal is to find the $\boldsymbol{\theta}$ that minimizes the squared error $\|\boldsymbol{\epsilon}\|^2$. The problem thus reduces to an **optimization problem**, so we need to define a **loss function** (or **objective function**, in optimization terms) with respect to $\boldsymbol{\theta}$. The loss function is defined as

$$
\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) & = \frac{1}{2} \sum_{i=1}^{n}{(y^{(i)} - \hat{y}^{(i)})^2} \\
& = \frac{1}{2} \sum_{i=1}^{n}{(y^{(i)} - {\boldsymbol{\overline{x}}^{(i)}}^\top \boldsymbol{\theta})^2} \\
& = \frac{1}{2} (\boldsymbol{y} - X \boldsymbol{\theta})^\top (\boldsymbol{y} - X \boldsymbol{\theta}).
\end{aligned}
$$
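
The sum form and the matrix form of the loss should give the same number. The sketch below checks this on a tiny made-up data set; the function names `loss_sum` and `loss_vectorized` are assumptions introduced for illustration.

```python
import numpy as np


def loss_sum(X, y, theta):
    """Loss as an explicit sum over the samples."""
    total = 0.0
    for x_bar_i, y_i in zip(X, y):
        total += (y_i - x_bar_i @ theta) ** 2
    return 0.5 * float(total)


def loss_vectorized(X, y, theta):
    """Loss in matrix form: 0.5 * (y - X theta)^T (y - X theta)."""
    residual = y - X @ theta
    return 0.5 * float(residual @ residual)


# Hypothetical data; the first column of X is the constant 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 4.0])
theta = np.array([1.0, 1.0])

# Predictions are (1, 2, 3), residuals are (0, 1, 1), so the loss is 0.5 * 2.
value_sum = loss_sum(X, y, theta)
value_vec = loss_vectorized(X, y, theta)
```

In practice the vectorized form is usually preferred, since it avoids a Python-level loop over the samples.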

At this point, the reader may be wondering: "Why this function? Why the squared error?" The signed error is a poor objective because positive and negative errors cancel each other, and their sum can be driven toward negative infinity even though the individual errors are huge. "Then what about the absolute error?" It is not a smooth function: it is not differentiable where an error equals $0$, which is exactly the point we want to reach. Furthermore, the squared-error loss arises naturally when we consider linear regression in terms of **probabilistic modeling** (which will be explained later).
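
One way to see the problem with the raw signed error is to try it on the simplest possible model, a constant prediction $\hat{y} = \theta_0$. The targets and candidate values below are hypothetical.

```python
import numpy as np

y = np.array([1.0, 2.0, 6.0])  # hypothetical targets, mean 3.0


def signed_error_sum(theta_0):
    """Sum of raw residuals for the constant model y_hat = theta_0."""
    return float(np.sum(y - theta_0))


def squared_error_sum(theta_0):
    """Half the sum of squared residuals for the same model."""
    return 0.5 * float(np.sum((y - theta_0) ** 2))


candidates = [0.0, 3.0, 10.0, 1000.0]
signed = [signed_error_sum(t) for t in candidates]
squared = [squared_error_sum(t) for t in candidates]
# The signed sum keeps decreasing as theta_0 grows, so it has no minimizer,
# and it is exactly 0 at theta_0 = 3.0 although no single residual is 0.
# The squared sum is smallest at the mean of y (3.0) among these candidates.
```
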

## Optimal Solution

### Least Squares Problem
Our loss function is a convex quadratic function, and we want to find its minimum. Fortunately, convexity guarantees that any stationary point is a global minimum, so one of the most popular approaches is to set the gradient to zero and solve the resulting equation.

$$
\begin{aligned}
\frac{\partial \mathcal{L}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} & = \boldsymbol{0} \\
X^\top ( \boldsymbol{y} - X \boldsymbol{\theta}) & = \boldsymbol{0} \\
X^\top \boldsymbol{y} & = X^\top X \boldsymbol{\theta}.
\end{aligned}
$$
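
The gradient of the loss is $-X^\top (\boldsymbol{y} - X \boldsymbol{\theta})$; setting it to zero gives the second line above, since the sign does not matter there. As a sanity check, the sketch below compares this analytic gradient with a finite-difference approximation on random data. The function names and the data are assumptions made for illustration.

```python
import numpy as np


def loss(X, y, theta):
    residual = y - X @ theta
    return 0.5 * float(residual @ residual)


def gradient(X, y, theta):
    """Analytic gradient of the loss: -X^T (y - X theta)."""
    return -X.T @ (y - X @ theta)


def numerical_gradient(X, y, theta, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (loss(X, y, theta + step) - loss(X, y, theta - step)) / (2 * h)
    return grad


rng = np.random.default_rng(0)  # hypothetical random data
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
theta = rng.normal(size=4)

# Largest coordinate-wise disagreement between the two gradients.
gap = np.max(np.abs(gradient(X, y, theta) - numerical_gradient(X, y, theta)))
```

A small `gap` suggests the analytic expression is consistent with the loss defined earlier.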

Set $X^\top \boldsymbol{y} = \boldsymbol{b}$ and $X^\top X = A$, then

$$
A \boldsymbol{\theta} = \boldsymbol{b}.
$$

This is a system of linear equations. If the reader is familiar with linear algebra, the solution is $\boldsymbol{\theta} = A^{-1} \boldsymbol{b}$ when $A$ is invertible (non-singular). However, this is not the end of the story. What if $A$ is not invertible (singular)? We can still solve the equation with the pseudo-inverse of $A$, denoted $A^{\dagger}$, which picks the solution with the smallest norm.

$$
\begin{aligned}
\boldsymbol{\theta} & = A^{\dagger} \boldsymbol{b} \\
& = (X^\top X)^\dagger X^\top \boldsymbol{y}.
\end{aligned}
$$
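
The following sketch applies the pseudo-inverse formula to a deliberately singular case, using NumPy's `np.linalg.pinv`, and compares it with `np.linalg.lstsq`, which works on $X$ directly. The data and the explicit cutoff `rcond=1e-10` are assumptions chosen for this small example.

```python
import numpy as np

# Hypothetical data: the third column duplicates the second,
# so A = X^T X is singular and has no ordinary inverse.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # equals 1 + 2 * (second column)

A = X.T @ X
b = X.T @ y

# theta = A^+ b; rcond treats singular values below 1e-10 * (largest) as zero.
theta_pinv = np.linalg.pinv(A, rcond=1e-10) @ b
# Least-squares solver applied to X directly, without forming A.
theta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Both should satisfy the normal equations A theta = b.
normal_eq_residual = np.max(np.abs(A @ theta_pinv - b))
```

Here any $\boldsymbol{\theta}$ with $\theta_0 = 1$ and $\theta_1 + \theta_2 = 2$ fits the data exactly; the pseudo-inverse selects the minimum-norm one, $(1, 1, 1)$. Working on $X$ directly, as `lstsq` does, tends to be numerically safer than forming $X^\top X$ first.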

### Gradient Descent
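
As a minimal sketch of the idea, assuming the loss $\mathcal{L}(\boldsymbol{\theta})$ and its gradient $-X^\top (\boldsymbol{y} - X \boldsymbol{\theta})$ from the previous sections, gradient descent repeatedly moves $\boldsymbol{\theta}$ against the gradient. The function name, the fixed step size, the iteration count, and the data are all assumptions chosen for illustration.

```python
import numpy as np


def gradient_descent(X, y, n_iters=2000):
    """Minimize 0.5 * ||y - X theta||^2 with a fixed step size."""
    theta = np.zeros(X.shape[1])
    # Assumed step size: 1 / (largest eigenvalue of X^T X), i.e. one over the
    # squared largest singular value of X, which keeps the iteration stable.
    step_size = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iters):
        grad = -X.T @ (y - X @ theta)  # gradient of the loss at the current theta
        theta = theta - step_size * grad
    return theta


# Hypothetical data generated from y = 1 + 2 x, without noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_gd = gradient_descent(X, y)
theta_exact = np.linalg.lstsq(X, y, rcond=None)[0]  # reference solution
```

On this well-conditioned toy problem the iterates approach the least-squares solution; poorly scaled features would typically need more iterations or feature scaling.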

## A Probabilistic Interpretation

## A Practical Example of Linear Regression

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib notebook
```

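```python
# Load the Boston housing data and the description of its columns.
data = pd.read_csv('data/boston_housing.csv')
with open('data/readme.txt', 'r') as f:
    lines = f.readlines()

# Uncomment to print the column descriptions and preview the first rows.
# for line in lines:
#     print(line)
# data.head()
```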

Variables:

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per \$10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in \$1000's


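| Unnamed: 0 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |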
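
As a small illustration of the least-squares solution on this data, the sketch below fits MEDV against RM and LSTAT using only the five rows shown above, hard-coded so the example is self-contained. The choice of these two features is an assumption made for illustration, and five rows are far too few for a meaningful model.

```python
import numpy as np

# The five rows displayed above, restricted to RM and LSTAT (features) and MEDV (target).
# With the full DataFrame, the same arrays could be taken from
# data[['RM', 'LSTAT']].to_numpy() and data['MEDV'].to_numpy().
features = np.array([[6.575, 4.98],
                     [6.421, 9.14],
                     [7.185, 4.03],
                     [6.998, 2.94],
                     [7.147, 5.33]])
medv = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

X = np.column_stack([np.ones(len(features)), features])  # add the intercept column
theta = np.linalg.lstsq(X, medv, rcond=None)[0]          # least-squares coefficients
predictions = X @ theta
residuals = medv - predictions
```

At the least-squares solution the residuals are orthogonal to every column of `X`, which is a restatement of the normal equations $X^\top (\boldsymbol{y} - X \boldsymbol{\theta}) = \boldsymbol{0}$.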
