# 1. Linear Regression

## Linear Models

A **`linear model`** is defined as:

$$\hat{y} = w_1  x_1 + ... + w_d  x_d + b$$

If we store all the **features** in a vector $\mathbf{x} \in \mathbb{R}^d$ and all the **weights** in a vector $\mathbf{w} \in \mathbb{R}^d$, then we have:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

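For a dataset of $n$ samples stacked as the rows of a design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, the **vectorized** version is given as: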

$${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b$$

A linear model performs an **affine transformation** of the input features, which is a combination of a **linear transformation** (by the weights) and a **translation** (by the bias term).
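As a minimal sketch of the vectorized formula, the snippet below computes predictions for three samples with NumPy. The function name `linear_predict` and the numbers are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def linear_predict(X, w, b):
    """Affine map: one prediction per row of X (shape (n, d))."""
    return X @ w + b  # the scalar bias b is broadcast over all n rows

# Hypothetical toy inputs: n = 3 samples, d = 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])
b = 2.0

y_hat = linear_predict(X, w, b)
print(y_hat)  # [ 0.5 -0.5 -1.5]
```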

## Loss Function

The most commonly used **loss function** is the **`squared error`**:

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

where $\hat{y}^{(i)}$ is the **prediction** on the $i^{th}$ sample and $y^{(i)}$ is the **true label**.

The **cost function** over the entire dataset is given as:

$$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$$

The **optimal parameters** are found by minimizing the cost function:

$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\  L(\mathbf{w}, b)$$
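The per-sample loss and the dataset-level cost can be sketched in a few lines of NumPy. The helper names `squared_loss` and `cost` and the toy data are assumptions made for illustration.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Per-sample squared error with the 1/2 factor."""
    return 0.5 * (y_hat - y) ** 2

def cost(X, y, w, b):
    """Average squared error over all n samples."""
    return squared_loss(X @ w + b, y).mean()

# Hypothetical toy data: every prediction is off by exactly 0.5
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 0.0, -1.0])
w = np.array([0.5, -1.0])
b = 2.0

value = cost(X, y, w, b)
print(value)  # 0.125
```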

## Analytical Solution

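If we absorb the bias $b$ into $\mathbf{w}$ by appending a column of ones to $\mathbf{X}$, the **`analytical solution`** of a linear regression model is given as:

$$\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}$$

which is obtained by setting the gradient of the cost function $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ with respect to $\mathbf{w}$ to zero. The solution is unique only when $\mathbf X^\top \mathbf X$ is invertible.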
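The closed-form solution can be sketched with NumPy as follows. This is one possible implementation under two assumptions: the bias is handled by appending a column of ones to $\mathbf{X}$, and $\mathbf X^\top \mathbf X$ is invertible. The function name `fit_normal_equation` and the synthetic, noise-free data are hypothetical.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, with a ones column for the bias."""
    n = X.shape[0]
    X_aug = np.hstack([X, np.ones((n, 1))])  # last column carries the bias
    # Solving the linear system avoids forming the explicit inverse.
    theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    return theta[:-1], theta[-1]  # (weights, bias)

# Synthetic noise-free data with illustrative "true" parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w, true_b = np.array([2.0, -3.4]), 4.2
y = X @ true_w + true_b

w_hat, b_hat = fit_normal_equation(X, y)
```

Because the data here contain no noise, `w_hat` and `b_hat` should match the generating parameters up to floating-point error. `np.linalg.solve` is used instead of an explicit matrix inverse because it is generally more stable numerically.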

## Minibatch Stochastic Gradient Descent

We can use **`gradient descent`** to find the **optimal parameters**.

To speed up training, we usually compute the **gradient** on a small, randomly sampled subset (a **minibatch** $\mathcal{B}$) of the training data.

Then, we update the parameters according to a preset **learning rate** $\eta$:

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$$

More specifically, we have:

$$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} -   \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b -  \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b)  = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$

where $|\mathcal{B}|$ is the **batch size**.
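The update rule above can be sketched in NumPy as follows. The function name `minibatch_sgd`, the hyperparameter defaults, and the synthetic data are assumptions chosen for illustration. Minibatches are drawn by reshuffling the data once per epoch, which is one common way to sample them.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=10, epochs=50, seed=0):
    """Fit w and b with the minibatch update rule for the squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Residual w^T x + b - y for every sample in the minibatch
            err = X[batch] @ w + b - y[batch]
            # Average the per-sample gradients, then step against them
            w = w - lr * (X[batch].T @ err) / len(batch)
            b = b - lr * err.sum() / len(batch)
    return w, b

# Synthetic noise-free data with illustrative "true" parameters
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([2.0, -3.4]), 4.2
y = X @ true_w + true_b

w_hat, b_hat = minibatch_sgd(X, y)
```

On this noise-free toy problem the estimates should approach the generating parameters. With noisy data they would fluctuate around the minimizer instead, and the learning rate and batch size would need tuning.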

## Normality

For a **normally distributed** random variable $x$, if it has mean $\mu$ and variance $\sigma^2$, then the **probability density function** is given as

$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right)$$
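As a quick sanity check of the density formula, the sketch below evaluates it on a grid and approximates its integral with a Riemann sum. The function name `normal_pdf` and the grid settings are illustrative assumptions.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and variance sigma**2."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Evaluate on a fine grid and approximate the integral numerically
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]
area = normal_pdf(grid, mu=0.0, sigma=1.0).sum() * dx  # should be close to 1

peak = normal_pdf(0.0, mu=0.0, sigma=1.0)  # equals 1 / sqrt(2 * pi)
```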

The **squared error** is a natural loss function for linear regression under the assumption that the **noise** $\epsilon$ in the data is **normally distributed**:

$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

Thus, the **likelihood** of $y$ given $\mathbf{x}$ can be written as:

$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right)$$

Following the principle of **`maximum likelihood estimation`**, we seek the parameters $\mathbf{w}$ and $b$ that maximize the **likelihood** of the entire training set:

$$P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} P(y^{(i)} \mid \mathbf{x}^{(i)})$$

This is equivalent to minimizing the **negative log-likelihood**:

$$-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \left[\frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2\right]$$

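Given that $\sigma$ is a constant, the first term does not depend on $\mathbf{w}$ or $b$, and the second term is the squared error scaled by the constant $\frac{1}{\sigma^2}$. Minimizing the negative log-likelihood is therefore equivalent to minimizing the squared error loss. In other words, under the assumption of normally distributed noise, **minimizing the squared error is the same as maximum likelihood estimation**.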
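The relationship can be checked numerically. For a fixed $\sigma$, the negative log-likelihood equals $\frac{n}{2}\log(2\pi\sigma^2) + \frac{n}{\sigma^2} L(\mathbf{w}, b)$, so both objectives share the same minimizer. The sketch below verifies this identity on synthetic data. The function names, data, and parameter values are illustrative assumptions.

```python
import numpy as np

def negative_log_likelihood(X, y, w, b, sigma):
    """Gaussian negative log-likelihood summed over all samples."""
    resid = y - X @ w - b
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

def cost(X, y, w, b):
    """Average squared error with the 1/2 factor."""
    return np.mean(0.5 * (X @ w + b - y) ** 2)

# Synthetic data with Gaussian noise (all values are hypothetical)
rng = np.random.default_rng(0)
n, sigma = 50, 0.7
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + rng.normal(scale=sigma, size=n)

# Any candidate parameters work; the identity holds for all of them
w, b = np.array([0.8, -1.5, 0.0]), 0.1
nll = negative_log_likelihood(X, y, w, b, sigma)
shifted = 0.5 * n * np.log(2 * np.pi * sigma**2) + (n / sigma**2) * cost(X, y, w, b)
```

Here `nll` and `shifted` agree up to floating-point error, which shows that the two objectives differ only by a constant offset and a positive scale factor.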

## Linear Regression as a Neural Network

![Linear regression as a single-neuron neural network](http://d2l.ai/_images/singleneuron.svg)