### Simple Linear Regression Summary ###

Approximate **linear relationship** between $Y$ and $X$ (regress $Y$ on $X$) is:

$$ Y \approx \beta_0 + \beta_1 X$$ 

where $\beta_0$ is the intercept and $\beta_1$ is the slope. 

Use the **training data** to estimate $\beta_0$ and $\beta_1$:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$

Estimate the coefficients using $n$ observation pairs $(x_1, y_1), \cdots, (x_n, y_n)$.

The goal is to obtain **coefficient estimates** 
$\hat{\beta}_0$ and $\hat{\beta}_1$ such that the linear model fits the data well: 

$y_i \approx \hat{\beta}_0 + \hat{\beta}_1 x_i$ for $i = 1, \dots, n$.


**Residual** is the difference between the $i$th observed response value and the $i$th response value predicted by the linear model:
$$e_i = y_i - \hat{y}_i.$$ 

**Residual Sum of Squares ($RSS$)**, $RSS = e_1^2 + \dots + e_n^2$, is minimized using least squares to find the best $\hat{\beta}_0$, $\hat{\beta}_1$ in the model. 

**Best Coefficient Estimate, $\hat{\beta_0}$, $\hat{\beta_1}$**:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

$$\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i -\bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$

**Sample means**:

$\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$, $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$.

Note: $\sum_{i=1}^n y_i = n \bar{y}$, $\sum_{i=1}^n x_i = n \bar{x}$.
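
As a minimal sketch of the closed-form estimates above, the following NumPy snippet computes $\hat{\beta}_1$ and $\hat{\beta}_0$ directly from the sums. The four data points and all variable names are illustrative assumptions, not part of any real data set.

```python
import numpy as np

# Small illustrative data set (an assumption for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared x-deviations.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: forces the fitted line through (x_bar, y_bar).
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x   # fitted values
residuals = y - y_hat               # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)        # residual sum of squares

print(beta0_hat, beta1_hat, rss)
```

For these points the slope is 2.4, the intercept is -1, and the RSS is 1.2, up to floating-point rounding.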


**True relationship**:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

where $\epsilon$ is a mean-zero random error (noise) term. 


**Population regression line**: the best linear approximation to the true relationship between $X$ and $Y$; the coefficients $\beta_0$, $\beta_1$ define the population regression line. 

**Least squares line**: the line characterized by the least squares coefficient estimates. The more spread out the $x_i$'s, the more precisely the slope is estimated, because of the denominator of $\hat{\beta_1}$, $\sum_{i=1}^n(x_i - \bar{x})^2$. If the points are concentrated, small changes in the data can easily "turn" the slope; if they are spread out, the slope is pinned down more easily. For experimental design we therefore prefer $x_i$'s that are more spread out. 

**Population mean**: the mean of the random variable $Y$ is $\mu$.

**Sample mean**: given access to $n$ observations from $Y$, a reasonable estimate for $\mu$ is 

$$\hat{\mu}=\bar{y}=\frac{1}{n}\sum_{i=1}^ny_i$$

where $\hat{\mu}$ and $\mu$ are different, but in general $\hat{\mu}$ is a good estimate of $\mu$. The sample mean is an unbiased estimate of the population mean: if we could average the estimates $\hat{\mu}$ obtained from a very large number of data sets, that average would equal $\mu$ exactly.


**Standard error**: roughly, the average amount by which the estimate $\hat{\mu}$ differs from $\mu$. It shrinks as $n$ grows:

$$Var(\hat\mu)=SE(\hat{\mu})^2=\frac{\sigma^2}{n}$$

where $\sigma$ is the standard deviation of each realization $y_i$ of $Y$. 
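
The unbiasedness of $\hat{\mu}$ and the $\sigma/\sqrt{n}$ standard error can be checked with a small simulation. This is only a sketch: the normal population and the values of `mu`, `sigma`, `n`, and `n_repeats` are hypothetical choices, and the seed is fixed so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
mu, sigma, n = 5.0, 2.0, 25      # hypothetical population mean, sd, sample size
n_repeats = 20_000               # number of simulated data sets

# Each row is one sample of n observations; each row mean is one estimate mu_hat.
samples = rng.normal(loc=mu, scale=sigma, size=(n_repeats, n))
mu_hats = samples.mean(axis=1)

avg_of_estimates = mu_hats.mean()     # should be close to mu (unbiasedness)
spread_of_estimates = mu_hats.std()   # should be close to sigma / sqrt(n)

print(avg_of_estimates, spread_of_estimates, sigma / np.sqrt(n))
```

The average of the simulated estimates should land near `mu`, and their standard deviation near `sigma / sqrt(n)`, which is 0.4 here.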


$$SE(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n(x_i - \bar{x})^2} \right]$$

$$SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}.$$

Here $\sigma^2 = Var(\epsilon)$. For the formulas for $SE(\hat{\beta}_0)^2$ and $SE(\hat{\beta}_1)^2$ to be valid, the errors $\epsilon_i$ for each observation must have common variance $\sigma^2$ and be uncorrelated. 

**Leverage**: $SE(\hat{\beta}_1)$ is smaller when the $x_i$ are more spread out, because there is then more leverage to estimate the slope. 
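
To make the effect of spread concrete, the sketch below evaluates $SE(\hat{\beta}_1) = \sigma / \sqrt{\sum_{i}(x_i - \bar{x})^2}$ for two hypothetical designs with the same mean. The helper `slope_se`, the two designs, and `sigma = 1` are assumptions made for illustration.

```python
import numpy as np

sigma = 1.0  # hypothetical error standard deviation, held fixed for both designs

def slope_se(x, sigma):
    """SE(beta1_hat) = sigma / sqrt(sum((x_i - x_bar)^2))."""
    x = np.asarray(x, dtype=float)
    return sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

x_concentrated = [2.4, 2.5, 2.5, 2.6]  # points bunched near the mean
x_spread = [1.0, 2.0, 3.0, 4.0]        # same mean, wider spread

se_concentrated = slope_se(x_concentrated, sigma)
se_spread = slope_se(x_spread, sigma)
print(se_concentrated, se_spread)
```

The spread-out design gives a much smaller standard error for the slope, since its sum of squared deviations is 5 rather than 0.02.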

**Residual standard error ($RSE$)**: $\sigma^2$ is usually not known, but $\sigma$ can be estimated from the data by 

$$RSE = \sqrt{\frac{RSS}{n-2}}$$

where $RSS = \sum_{i=1}^n(y_i - \hat{y}_i)^2$. 
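
A short sketch of how the $RSE$ can stand in for the unknown $\sigma$ in the standard error formulas. The data set is a small illustrative assumption, and the variable names are hypothetical.

```python
import numpy as np

# Small illustrative data set (an assumption for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])
n = len(x)

# Least squares fit (closed form).
sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)
rse = np.sqrt(rss / (n - 2))   # estimate of sigma

# Plug the RSE in for sigma in the standard error formulas.
se_beta0 = rse * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)
se_beta1 = rse / np.sqrt(sxx)
print(rse, se_beta0, se_beta1)
```

Here $RSS = 1.2$, so $RSE^2 = 0.6$, and the estimated squared standard errors are 0.9 for the intercept and 0.12 for the slope.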

## Cubic vs. Linear Model ##
---

For the training data we use least squares to minimize the residual sum of squares over a larger parameter space (the parameter vector $\hat{\beta}$ is 4 by 1 instead of 2 by 1). Minimizing over a larger set of possible solutions can't give a higher minimum RSS: at worst it reproduces the linear fit by setting the extra coefficients to 0, and it can often reduce the training RSS even further. 

## Linear Model Fit

$$
\hat{\beta}_{\text{lin}}
= 
\begin{pmatrix}
\beta_0\\
\beta_1
\end{pmatrix}
$$

is chosen to minimize

$$
\mathrm{RSS}_{\text{lin}}
= \sum_{i=1}^n 
\Bigl[
    y_i \;-\; (\beta_0 + \beta_1\,x_i)
\Bigr]^2.
$$

In matrix form, let

$$
X_{\text{lin}}
=
\begin{pmatrix}
1 & x_1\\
1 & x_2\\
\vdots & \vdots\\
1 & x_n
\end{pmatrix},
\quad
y
=
\begin{pmatrix}
y_1\\
y_2\\
\vdots\\
y_n
\end{pmatrix}.
$$

Then

$$
\mathrm{RSS}_{\text{lin}}
= 
\min_{\beta_0,\;\beta_1}
\;\bigl\|\,y \;-\; X_{\text{lin}}\,\beta\bigr\|^2.
$$

---

## Cubic Model Fit

$$
\hat{\beta}_{\text{cub}}
=
\begin{pmatrix}
\beta_0\\
\beta_1\\
\beta_2\\
\beta_3
\end{pmatrix}
$$

is chosen to minimize

$$
\mathrm{RSS}_{\text{cub}}
= \sum_{i=1}^n 
\Bigl[
    y_i 
    \;-\; 
    (\beta_0 + \beta_1\,x_i + \beta_2\,x_i^2 + \beta_3\,x_i^3)
\Bigr]^2.
$$

In matrix form, let

$$
X_{\text{cub}}
=
\begin{pmatrix}
1 & x_1 & x_1^2 & x_1^3\\
1 & x_2 & x_2^2 & x_2^3\\
\vdots & \vdots & \vdots & \vdots\\
1 & x_n & x_n^2 & x_n^3
\end{pmatrix}.
$$

Then

$$
\mathrm{RSS}_{\text{cub}}
= 
\min_{\beta_0,\;\beta_1,\;\beta_2,\;\beta_3}
\;\bigl\|\,y \;-\; X_{\text{cub}}\,\beta\bigr\|^2.
$$

## Minimizing Over a Larger Set

In least squares, 
$$
\hat{\beta}_{\text{cub}}
\text{ is chosen to minimize }
\|\,y - X_{\text{cub}}\,\beta\|^2
\text{ over }
\mathbb{R}^4.
$$

Because $\mathbb{R}^2$ (linear parameters) is embedded in $\mathbb{R}^4$ (cubic parameters) via
$$
(\beta_0,\,\beta_1)
\;\mapsto\;
(\beta_0,\,\beta_1,\,0,\,0),
$$
the minimum over the bigger space can only be smaller than or equal to the minimum over the smaller space. Symbolically:

$$
\min_{(\beta_0,\beta_1,\beta_2,\beta_3)\in \mathbb{R}^4}
\;\bigl\|\,y - X_{\text{cub}}\,\beta\bigr\|^2
\;\;\le\;\;
\min_{(\beta_0,\beta_1)\in \mathbb{R}^2}
\;\bigl\|\,y - X_{\text{cub}}(\beta_0,\beta_1,0,0)\bigr\|^2.
$$

But
$$
X_{\text{cub}}(\beta_0,\,\beta_1,\,0,\,0)
= 
X_{\text{lin}}(\beta_0,\,\beta_1).
$$

For any $(\beta_0,\beta_1)$, the vector of fitted values in the cubic model collapses to the linear model’s fitted values if $\beta_2=0,\;\beta_3=0$. Hence:
$$
\bigl\|\,y - X_{\text{cub}}(\beta_0,\beta_1,0,0)\bigr\|^2
=
\bigl\|\,y - X_{\text{lin}}(\beta_0,\beta_1)\bigr\|^2.
$$

Therefore,
$$
\min_{\beta_0,\beta_1,\beta_2,\beta_3}
\bigl\|\,y - X_{\text{cub}}\,\beta\bigr\|^2
\;\;\le\;\;
\min_{\beta_0,\beta_1}
\bigl\|\,y - X_{\text{lin}}(\beta_0,\beta_1)\bigr\|^2.
$$

By definition, these minima are precisely the **training RSS** for each model:

$$
\mathrm{RSS}_{\text{cub}}
\;\le\;
\mathrm{RSS}_{\text{lin}}.
$$


Let’s use 4 observations:

$$
x = [\,1,\;2,\;3,\;4\,], 
\quad
y = [\,2,\;3,\;6,\;9\,].
$$

So we have $n=4$ data points:
$$
(x_1,\,y_1)=(1,\,2), 
\quad
(x_2,\,y_2)=(2,\,3), 
\quad
(x_3,\,y_3)=(3,\,6), 
\quad
(x_4,\,y_4)=(4,\,9).
$$

---

### Design Matrices

The linear model is

$$
Y = \beta_0 \;+\; \beta_1\,X \;+\; \varepsilon.
$$

In matrix form, we write $X_{\text{lin}}$ as

$$
X_{\text{lin}}
=
\begin{pmatrix}
1 & 1\\
1 & 2\\
1 & 3\\
1 & 4
\end{pmatrix}_{4\times 2}.
$$

- The first column is all 1’s (for $\beta_0$).
- The second column is $x_i$.


The cubic model is

$$
Y = \beta_0 \;+\; \beta_1\,X \;+\; \beta_2\,X^2 \;+\; \beta_3\,X^3 \;+\; \varepsilon.
$$

Hence $X_{\text{cub}}$ is

$$
X_{\text{cub}}
=
\begin{pmatrix}
1 & 1 & 1^2 & 1^3\\
1 & 2 & 2^2 & 2^3\\
1 & 3 & 3^2 & 3^3\\
1 & 4 & 4^2 & 4^3
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1\\
1 & 2 & 4 & 8\\
1 & 3 & 9 & 27\\
1 & 4 & 16 & 64
\end{pmatrix}_{4\times 4}.
$$
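
The two design matrices above can be built and fitted numerically. The sketch below uses NumPy's `np.vander` and `np.linalg.lstsq`; the helper name `training_rss` is a hypothetical choice for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])

# Design matrices: columns are 1, x (linear) and 1, x, x^2, x^3 (cubic).
X_lin = np.vander(x, N=2, increasing=True)
X_cub = np.vander(x, N=4, increasing=True)

def training_rss(X, y):
    """Least squares fit, then the sum of squared residuals on the same data."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta_hat) ** 2)

rss_lin = training_rss(X_lin, y)
rss_cub = training_rss(X_cub, y)
print(rss_lin, rss_cub)
```

With four distinct $x$ values and four coefficients, the cubic passes through every point, so its training RSS is zero up to rounding, while the linear fit has a training RSS of 1.2.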

---


When you switch to the test $RSS$, there is no guarantee that the cubic model will have an equal or lower $RSS$ than the linear model. In fact, the test $RSS$ can get worse when you add parameters, due to **overfitting**.

### Overfitting:

When the cubic model uses the extra coefficients $\beta_2$ and $\beta_3$, it can chase random noise in the training set that doesn't generalize. This might reduce the training $RSS$ but might not improve predictions on new data. Sometimes the cubic model captures the true relationship better (especially if it is nonlinear!), giving it a lower test $RSS$ than the linear model. In other cases the extra parameters cause overfitting, and the test $RSS$ goes up for the cubic model. Therefore, unlike the training $RSS$, the test $RSS$ can increase, decrease, or stay approximately the same when parameters are added. 
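
A small simulation can illustrate the difference between training and test $RSS$. Everything here is an assumption made for the sketch: the true relationship `f_true` is taken to be exactly linear, the noise is standard normal, and the sample sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed; other seeds give different numbers

def f_true(x):
    return 1.0 + 2.0 * x        # hypothetical truth: exactly linear

def make_data(n):
    x = rng.uniform(0.0, 4.0, size=n)
    return x, f_true(x) + rng.normal(scale=1.0, size=n)

x_train, y_train = make_data(10)
x_test, y_test = make_data(1000)

def rss(degree, x_fit, y_fit, x_eval, y_eval):
    """Fit a polynomial of the given degree by least squares, score on x_eval."""
    coefs = np.polyfit(x_fit, y_fit, deg=degree)
    return np.sum((y_eval - np.polyval(coefs, x_eval)) ** 2)

train_lin = rss(1, x_train, y_train, x_train, y_train)
train_cub = rss(3, x_train, y_train, x_train, y_train)
test_lin = rss(1, x_train, y_train, x_test, y_test)
test_cub = rss(3, x_train, y_train, x_test, y_test)
print(train_lin, train_cub, test_lin, test_cub)
```

The training comparison always goes one way (`train_cub` is never larger than `train_lin`), whereas the test comparison depends on the particular draw and on the true relationship.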

### Bias-Variance Trade-Off: 
The expected test MSE at a point $x_0$ is 

$$Test\:MSE(\hat{f}) =\mathbb{E}(y_0 - \hat{f}(x_0))^2=\underbrace{Bias^2[\hat{f}(x_0)]}_\text{systematic deviation} + \underbrace{Var[\hat{f}(x_0)]}_\text{variance} + \text{irreducible error}$$

where $\text{irreducible error}=Var(\epsilon)$. Adding parameters like $\beta_2, \beta_3$ usually reduces bias but increases variance. Therefore, if the increase in variance is greater than the reduction in squared bias, the test error can go up. On the training set, the extra variance does not hurt, since more parameters allow you to fit at least as well, so the training RSS never rises. 
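
For least squares with fixed training inputs, the bias and variance terms at $x_0$ can be computed exactly, because the prediction is a linear combination $h^T y$ of the responses. The sketch below does this for a linear and a cubic fit. The true function `f_true`, the grid of inputs, `x0`, and `sigma` are all hypothetical, and the errors are assumed uncorrelated with common variance.

```python
import numpy as np

sigma = 1.0                          # hypothetical noise standard deviation
x = np.linspace(0.0, 1.0, 20)        # fixed training inputs (an assumption)
x0 = 1.0                             # point where we predict

def f_true(t):
    return 3.0 * t ** 2              # hypothetical nonlinear truth

def bias_and_variance(degree):
    """Exact bias and variance of the least squares prediction at x0."""
    X = np.vander(x, N=degree + 1, increasing=True)
    v0 = np.vander(np.array([x0]), N=degree + 1, increasing=True)[0]
    # Prediction is h @ y with weights h = X (X^T X)^{-1} v0.
    h = X @ np.linalg.solve(X.T @ X, v0)
    bias = h @ f_true(x) - f_true(x0)   # E[f_hat(x0)] - f(x0)
    variance = sigma ** 2 * (h @ h)     # Var[f_hat(x0)]
    return bias, variance

for degree in (1, 3):
    bias, variance = bias_and_variance(degree)
    # degree, squared bias, variance, expected test MSE at x0
    print(degree, bias ** 2, variance, bias ** 2 + variance + sigma ** 2)
```

In this setup the linear fit is biased because the assumed truth is quadratic, while the cubic fit has essentially zero bias but a larger variance. Which one has the lower expected test MSE depends on `sigma` and the number of training points.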



---
Show that for the model

$$y_i = \beta_0 + \beta_1 x_i$$

the $\hat{\beta}_0$, $\hat{\beta}_1$ that minimize $RSS$ using least squares are 

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2}$$  

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^ny_i$, $\bar{x} = \frac{1}{n}\sum_{i=1}^nx_i$.

The residual, or error, for the $i$th observation between the actual and estimated response is: 

$e_i = y_i - \hat{y_i}$

and for $n$ observations the $RSS$ or residual sum of squares is 

$$\begin{align*}
RSS &= \sum_{i=1}^n e_i^2 \\
    &= e_1^2 + \dots + e_n^2 \\
    &= (y_1 - \hat{y}_1)^2 + \dots + (y_n - \hat{y}_n)^2
\end{align*}
$$

We now must minimize the $RSS$ with respect to $\hat{\beta}_0$, $\hat{\beta}_1$:

$$\begin{align*}
RSS(\hat{\beta}_0,\hat{\beta}_1) &=(y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2
\end{align*}
$$

such that 

$$\begin{align*}
\frac{\partial RSS(\hat{\beta}_0,\hat{\beta}_1)}{\partial \hat{\beta}_0} &= 2(y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)\cdot(-1) + \dots + 2(y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)\cdot(-1) \\
&=2\sum_{i=1}^n (\hat{\beta}_0 +\hat{\beta}_1 x_i - y_i) = 0  
\end{align*}
$$

$$\begin{align*}
\frac{\partial RSS(\hat{\beta}_0,\hat{\beta}_1)}{\partial \hat{\beta}_1} &= 2(y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)\cdot(-x_1) + \dots + 2(y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)\cdot(-x_n) \\
&=2\sum_{i=1}^n x_i (\hat{\beta}_0 +\hat{\beta}_1 x_i - y_i) = 0  
\end{align*}
$$ 

Dividing each condition by 2, we have the following two equations: 

$\sum_{i=1}^n (\hat{\beta}_0 +\hat{\beta}_1 x_i - y_i)=0$

$\sum_{i=1}^nx_i (\hat{\beta}_0 +\hat{\beta}_1 x_i - y_i) = 0$.

For the first equation, we use the facts that $\sum_{i=1}^n\hat{\beta}_0 = n\hat{\beta}_0$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^nx_i$, and solve for $\hat{\beta}_0$: 

$$\begin{align*}
\hat{\beta}_0 &= \frac{1}{n}\sum_{i=1}^ny_i - \frac{\hat{\beta}_1}{n}\sum_{i=1}^nx_i \\
             &= \bar{y} - \hat{\beta}_1\bar{x}
\end{align*}
$$

For the second equation we can identify the following two factorizations: 

$$\begin{align*}
\sum_{i=1}^n(x_i - \bar{x})(x_i - \bar{x}) &= \sum x_i^2 - \bar{x}\sum x_i - \bar{x}\sum x_i + \sum \bar{x}^2 \\
&= \sum x_i^2 - n\bar{x}^2 - n\bar{x}^2 + n\bar{x}^2 \\ 
&= \sum x_i^2 - n\bar{x}^2
\end{align*}
$$

$$\begin{align*}
\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y}) &= \sum x_iy_i - \bar{x}\sum y_i - \bar{y}\sum x_i + \sum \bar{x}\bar{y} \\
&= \sum x_iy_i  - n\bar{x}\bar{y} - n\bar{y}\bar{x} + n\bar{x}\bar{y}\\ 
&= \sum x_iy_i - n\bar{x}\bar{y}
\end{align*}
$$

Then from the second equation we have: 

$$\begin{align*}
\sum_{i=1}^n x_i y_i &= \hat{\beta_0}\sum_{i=1}^nx_i + \hat{\beta_1}\sum_{i=1}^n x_i^2 \\
 &= (\bar{y} - \hat{\beta}_1 \bar{x})\sum_{i=1}^nx_i + \hat{\beta_1}\sum_{i=1}^n x_i^2 \\
 &=(\bar{y} - \hat{\beta}_1 \bar{x})(n\bar{x}) + \hat{\beta_1}\sum_{i=1}^n x_i^2 
\end{align*}
$$

Collecting the $\hat{\beta_1}$ terms on one side:

$$\begin{align*}
\hat{\beta_1}(\sum_{i=1}^n x_i^2 - n\bar{x}^2) = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} 
\end{align*}
$$

and using the previous two factorizations we have: 

$$\hat{\beta_1}\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n (x_i - \bar{x})(y_i -\bar{y})$$ 

and solving for $\hat{\beta_1}$:

$$\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i -\bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$
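
The derivation can be sanity-checked numerically: at the closed-form estimates, both partial derivatives of the $RSS$ should vanish, and nudging either coefficient should not lower the $RSS$. The data and the perturbation size 0.1 are illustrative assumptions.

```python
import numpy as np

# Small illustrative data set (an assumption for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# The two first-order conditions (partial derivatives of RSS) at the estimates.
d_beta0 = 2.0 * np.sum(beta0_hat + beta1_hat * x - y)
d_beta1 = 2.0 * np.sum(x * (beta0_hat + beta1_hat * x - y))

def rss(b0, b1):
    return np.sum((y - b0 - b1 * x) ** 2)

# Nudging either coefficient away from the estimate should not lower the RSS.
rss_at_min = rss(beta0_hat, beta1_hat)
rss_nearby = min(rss(beta0_hat + 0.1, beta1_hat), rss(beta0_hat, beta1_hat - 0.1))
print(d_beta0, d_beta1, rss_at_min, rss_nearby)
```

Both derivatives come out as zero up to rounding, and the perturbed coefficients give a larger $RSS$ than the least squares estimates.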

---
Let's now derive the OLS solution using the matrix gradient of the squared norm (which is the same as the RSS). First, recall the following facts for linear and quadratic functions of $\beta \in \mathbb{R}^p$: 

For any $\mathbf{c} \in \mathbb{R}^p$, we have $\nabla_{\beta}(\mathbf{c}^T\mathbf{\beta})=\mathbf{c}$.

For any $A \in \mathbb{R}^{p\times p}$, we have $\nabla_{\beta}(\mathbf{\beta}^TA\mathbf{\beta})=(A + A^T)\beta$.

We are trying to minimize the squared norm with respect to $\beta$: 
$$\begin{align*}
0 &= \nabla||\bold{y} - X\bold{\beta}||^2 \\
&=\nabla (\bold{y} - X\beta)^T(\bold{y} - X\beta)\\
&= \nabla (\bold{y}^T - \beta^TX^T)(\bold{y} - X\beta)\\
&= \nabla(\bold{y}^T\bold{y} - \beta^TX^T\bold{y} - \bold{y}^TX\beta +\beta^TX^TX\beta)
\end{align*}
$$

Since
$\beta \in \mathbb{R}^p$, $\mathbf{y} \in \mathbb{R}^n$, and $X \in \mathbb{R}^{n\times p}$, both $\beta^TX^T\bold{y}$ and $\bold{y}^TX\beta$ are scalars, and each is the transpose of the other, so $\beta^TX^T\bold{y}=\bold{y}^TX\beta$. Continuing on, and applying the two gradient facts with $\mathbf{c} = X^T\bold{y}$ and the $p \times p$ matrix $X^TX$, we get 

$$\begin{align*}
0 &= \nabla(\bold{y}^T\bold{y} - \beta^TX^T\bold{y} - \bold{y}^TX\beta +\beta^TX^TX\beta)\\
&= \nabla(\bold{y}^T\bold{y} - 2\beta^TX^T\bold{y} + \beta^TX^TX\beta) \\
&= - 2X^T\bold{y} + (X^TX + X^TX)\beta \\
&= - 2X^T\bold{y}  + 2X^TX\beta
\end{align*}
$$

Solving for $\beta$ (assuming $X^TX$ is invertible) we arrive at the least squares estimate 

$$\hat{\beta} = (X^TX)^{-1}X^T\bold{y}.$$
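
As a sketch of the matrix solution, the snippet below solves the normal equations $X^TX\beta = X^T\mathbf{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`. The data are an illustrative assumption, and the approach assumes $X^TX$ is invertible.

```python
import numpy as np

# Small illustrative data set (an assumption for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])
X = np.column_stack([np.ones_like(x), x])   # n x p design matrix, p = 2

# Solve the normal equations X^T X beta = X^T y (assumes X^T X is invertible).
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient -2 X^T y + 2 X^T X beta should vanish at the solution.
gradient = -2.0 * X.T @ y + 2.0 * X.T @ X @ beta_normal
print(beta_normal, beta_lstsq, gradient)
```

Both routes give an intercept of -1 and a slope of 2.4 for these points, matching the scalar formulas for $\hat{\beta}_0$ and $\hat{\beta}_1$.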