# Linear Regression

- Housing price prediction (sq ft -> price): $\hat y = \theta_0 + \theta_1 x_1$
- Housing price prediction (sq ft, # bedrooms, ... -> price): $\hat y = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$
- Definitions: 
    - inputs $x$ are "features"
    - outputs $y$ are "targets"
    - the $\hat y$ are "predictions"
    - the pairs $(x,y)$ are "training examples"
    - coefficients $\boldsymbol{\theta} = (\theta_0,\cdots,\theta_n)$ are "parameters"
    - $\theta_0$ is the "bias"
    - $\theta_1,\cdots,\theta_n$ are the "weights"
    - loss function $L(\hat y, y) = J(\boldsymbol{\theta})$
- Notation zoo:
    - $m$: # training examples
    - $n$: # features
    - i<sup>th</sup> training example: $\boldsymbol{x}^{(i)}$
    - j<sup>th</sup> feature vector: $\boldsymbol{x}_j$
    - data matrix: $\boldsymbol{X}$ of shape $(m,n+1)$ if we take $\boldsymbol{x}_0 = (1, \cdots, 1)$
    - target vector: $\boldsymbol{y}$ of shape $(m,1)$
- Goal: Choose "best" function $f(\boldsymbol{x})$ such that $y \approx \hat y = f(\boldsymbol{x})$, in the sense that some loss function $J(\boldsymbol{\theta}) = L(\hat y, y)$ is minimized, where in this case
$$J(\boldsymbol{\theta}) = \frac{1}{2}||f(\boldsymbol{X}|\boldsymbol{\theta}) - \boldsymbol{y}||_2^2$$
- Least Squares: Goal is to find the optimal $\boldsymbol{\hat \theta}$ that solves the problem
$$\underset{\boldsymbol{\theta}}{\text{minimize}} \ \frac{1}{2}\big(f(\boldsymbol{X}|\boldsymbol{\theta}) - \boldsymbol{y}\big)^\top \big(f(\boldsymbol{X}|\boldsymbol{\theta}) - \boldsymbol{y}\big).$$
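The objective above can be evaluated directly once the data matrix is built. The following is a minimal sketch, assuming the linear model $f(\boldsymbol{X}|\boldsymbol{\theta}) = \boldsymbol{X}\boldsymbol{\theta}$ and NumPy; the helper names `add_bias_column` and `least_squares_loss` and the toy numbers are illustrative assumptions, not part of the notes.

```python
import numpy as np

def add_bias_column(features):
    """Prepend x_0 = (1, ..., 1) so that theta_0 plays the role of the bias."""
    m = features.shape[0]
    return np.hstack([np.ones((m, 1)), features])

def least_squares_loss(X, y, theta):
    """J(theta) = 0.5 * ||X @ theta - y||_2^2 for the assumed linear model."""
    residual = X @ theta - y
    return 0.5 * float(np.sum(residual ** 2))

# Toy data (made up for illustration): m = 3 examples, n = 1 feature.
features = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [3.0], [4.0]])          # target vector, shape (m, 1)
X = add_bias_column(features)                # data matrix, shape (m, n + 1)

theta_zero = np.zeros((2, 1))                # predicts 0 everywhere
theta_exact = np.array([[1.0], [1.0]])       # y = 1 + x fits the toy data exactly

loss_zero = least_squares_loss(X, y, theta_zero)
loss_exact = least_squares_loss(X, y, theta_exact)
```

Here `loss_exact` is zero because the toy targets lie exactly on a line, while `loss_zero` is half the sum of the squared targets.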
- Supposing the linear model $f(\boldsymbol{X}|\boldsymbol{\theta}) = \boldsymbol{X}\boldsymbol{\theta}$ (consistent with $\boldsymbol{X}$ having shape $(m,n+1)$), the gradient $\frac{d}{d\boldsymbol{\theta}} J(\boldsymbol{\theta})$, written as a column vector of the same shape as $\boldsymbol{\theta}$, is just
$$\frac{d}{d\boldsymbol{\theta}} J( \boldsymbol{\theta}) = \boldsymbol{X}^{\top} \big(f(\boldsymbol{X}|\boldsymbol{\theta}) - \boldsymbol{y}\big).$$
- Using gradient descent with learning rate $\alpha > 0$, solve for $\boldsymbol{\hat \theta}$ by repeating the update
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \cdot \boldsymbol{X}^{\top} \big(f(\boldsymbol{X}|\boldsymbol{\theta}) - \boldsymbol{y}\big).$$
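As a rough illustration of the update rule, here is a minimal NumPy sketch of batch gradient descent for the linear model $\boldsymbol{X}\boldsymbol{\theta}$, with the gradient taken as the column vector $\boldsymbol{X}^{\top}(\boldsymbol{X}\boldsymbol{\theta} - \boldsymbol{y})$. The function name `gradient_descent`, the default step size, the fixed number of steps, and the toy data are all assumptions made for the example.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_steps=1000):
    """Minimize 0.5 * ||X @ theta - y||^2 with fixed-step gradient descent."""
    theta = np.zeros((X.shape[1], 1))        # start from theta = 0
    for _ in range(num_steps):
        gradient = X.T @ (X @ theta - y)     # column vector, same shape as theta
        theta = theta - alpha * gradient     # theta <- theta - alpha * gradient
    return theta

# Toy data (made up): targets follow y = 1 + 2x exactly.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                   # first column is x_0 = 1
y = np.array([[1.0], [3.0], [5.0]])

theta_gd = gradient_descent(X, y)
```

Because the loss here is a sum over examples rather than an average, the largest stable step size shrinks as $m$ grows; for this toy problem `alpha=0.1` is small enough to converge, but that value should not be read as a general recommendation.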
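- Alternatively, solve for $\boldsymbol{\hat \theta}$ analytically by setting the gradient to zero, which gives the normal equation and, when $\boldsymbol{X}^{\top}\boldsymbol{X}$ is invertible, its closed-form solution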
$$\boldsymbol{X}^{\top}\boldsymbol{y} = \boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{\hat \theta},$$
$$\boldsymbol{\hat \theta} = \big(\boldsymbol{X}^{\top}\boldsymbol{X} \big)^{-1} \boldsymbol{X}^{\top}\boldsymbol{y}.$$
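The closed-form solution can be sketched in a few lines of NumPy. This example solves the linear system $\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{\hat\theta} = \boldsymbol{X}^{\top}\boldsymbol{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the notes prescribe; `solve_normal_equation` and the toy data are assumed names and values.

```python
import numpy as np

def solve_normal_equation(X, y):
    """Solve X^T X theta = X^T y; assumes X^T X is invertible."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data (made up): targets follow y = 1 + 2x exactly.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                   # first column is x_0 = 1
y = np.array([[1.0], [3.0], [5.0]])

theta_hat = solve_normal_equation(X, y)

# np.linalg.lstsq minimizes the same least-squares objective and also
# handles the case where X^T X is singular.
theta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```

Both routes should agree whenever $\boldsymbol{X}^{\top}\boldsymbol{X}$ is well conditioned.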
- Consider showing gradient descent work interactively here, for both 2D and 3D situations.
- Statistical model: treat the targets as a random variable $\boldsymbol{y} = f(\boldsymbol{X}) + \boldsymbol{\varepsilon}$, where the noise $\boldsymbol{\varepsilon} \sim p(\boldsymbol{\varepsilon})$ has mean zero.
    - Common choice (Gaussian noise, which also satisfies the Gauss-Markov assumptions): $\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2\boldsymbol{I})$, i.e. $\varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$
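To make the noise model concrete, the sketch below simulates data from $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\theta} + \boldsymbol{\varepsilon}$ with i.i.d. Gaussian noise and checks that least squares recovers parameters close to the ones used to generate the data. The true parameters, noise level `sigma`, sample size, and random seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed for reproducibility

m = 500                                      # number of training examples
x = rng.uniform(0.0, 10.0, size=(m, 1))      # a single feature
X = np.hstack([np.ones((m, 1)), x])          # data matrix with x_0 = 1

theta_true = np.array([[1.0], [2.0]])        # assumed "true" bias and weight
sigma = 0.5                                  # assumed noise standard deviation
eps = rng.normal(0.0, sigma, size=(m, 1))    # eps_i ~ N(0, sigma^2), i.i.d.

y = X @ theta_true + eps                     # noisy targets

# Least-squares estimate of theta from the noisy data.
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```

With a few hundred examples the estimate `theta_hat` should land near `theta_true`, and the gap shrinks as `m` grows or `sigma` decreases.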