# Linear Regression

- Housing price prediction (sq ft -> price): $\hat y = \theta_0 + \theta_1 x_1$
- Housing price prediction (sq ft, # bedrooms, ... -> price): $\hat y = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$
- Definitions: 
    - inputs $x$ are "features"
    - outputs $y$ are "targets"
    - the $\hat y$ are "predictions"
    - the pairs $(x,y)$ are "training examples"
    - coefficients $\boldsymbol{\theta} = (\theta_0,\cdots,\theta_n)$ are "parameters"
    - $\theta_0$ is the "bias"
    - $\theta_1,\cdots,\theta_n$ are the "weights"
- Notation zoo (Ng notation):
    - $m$: # training examples
    - $n$: # features
    - i<sup>th</sup> training example: $\mathbf{x}^{(i)}$
    - j<sup>th</sup> feature vector: $\mathbf{x}_j$
    - data matrix: $\mathbf{X}$ of shape $(m,n+1)$ if we prepend the constant feature $\mathbf{x}_0 = (1, \cdots, 1)$, else shape $(m,n)$
    - target vector: $\mathbf{y}$ of shape $(m,1)$
    - cost function $J(\theta) = \langle L(\hat y, y) \rangle$
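A minimal NumPy sketch of the shapes above. The numbers are made-up toy values, and the names `features`, `x_i`, `x_j` are only illustrative:

```python
import numpy as np

# Hypothetical toy data: m = 4 houses, n = 2 features (sq ft, # bedrooms).
features = np.array([
    [2104.0, 3.0],
    [1600.0, 3.0],
    [2400.0, 4.0],
    [1416.0, 2.0],
])
y = np.array([[400.0], [330.0], [369.0], [232.0]])  # targets, shape (m, 1)

m, n = features.shape

# Prepend the all-ones column x_0 so that theta_0 acts as the bias.
X = np.hstack([np.ones((m, 1)), features])  # shape (m, n + 1)

x_i = X[1]      # one training example: a row of X (NumPy rows are 0-indexed)
x_j = X[:, 1]   # one feature vector: a column of X (here sq ft)

print(X.shape, y.shape)  # (4, 3) (4, 1)
```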
- Goal: Choose "best" function $f(\mathbf{X})$ such that $y \approx \hat y = f(\mathbf{X})$, in the sense that some loss function $J(\boldsymbol{\theta}) = L(\hat y, y)$ is minimized, where in this case
$$J(\boldsymbol{\theta}) = \frac{1}{2}||f_{\boldsymbol{\theta}}(\mathbf{X}) - \mathbf{y}||_2^2$$
- Least Squares: The goal is to find the optimal $\boldsymbol{\hat \theta}$ that solves the problem
$$\underset{\boldsymbol{\theta}}{\text{minimize}} \ \frac{1}{2}\big(f_{\boldsymbol{\theta}}(\mathbf{X}) - \mathbf{y}\big)^\top \big(f_{\boldsymbol{\theta}}(\mathbf{X}) - \mathbf{y}\big).$$
- Supposing $f_{\boldsymbol{\theta}}(\mathbf{X}) = \mathbf{X}\boldsymbol{\theta}$, the gradient $\frac{d}{d\boldsymbol{\theta}} J(\boldsymbol{\theta})$, written as a column vector with the same shape as $\boldsymbol{\theta}$, is just
$$\frac{d}{d\boldsymbol{\theta}} J( \boldsymbol{\theta}) = \mathbf{X}^{\top} \big(f_{\boldsymbol{\theta}}(\mathbf{X}) - \mathbf{y}\big).$$
- Using gradient descent with step size $\alpha$, solve for $\boldsymbol{\hat \theta}$ by making updates
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \cdot \mathbf{X}^{\top} \big(f_{\boldsymbol{\theta}}(\mathbf{X}) - \mathbf{y}\big).$$
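A rough sketch of this update rule in NumPy, with the gradient stored as a column vector $\mathbf{X}^{\top}(\mathbf{X}\boldsymbol{\theta} - \mathbf{y})$. The synthetic data, the step size `alpha`, and the iteration count are assumptions chosen for this toy problem, not recommended defaults:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2 - 3x + noise.
m = 100
x = rng.uniform(-1.0, 1.0, size=(m, 1))
X = np.hstack([np.ones((m, 1)), x])          # shape (m, 2), bias column first
theta_true = np.array([[2.0], [-3.0]])
y = X @ theta_true + 0.1 * rng.normal(size=(m, 1))

def cost(theta, X, y):
    """J(theta) = 1/2 * ||X theta - y||^2."""
    r = X @ theta - y
    return 0.5 * float(np.sum(r ** 2))

def gradient(theta, X, y):
    """Gradient of J as a column vector: X^T (X theta - y)."""
    return X.T @ (X @ theta - y)

alpha = 1e-3                 # step size, hand-picked for this toy data
theta = np.zeros((2, 1))     # start from all zeros
for _ in range(2000):
    theta = theta - alpha * gradient(theta, X, y)

print(theta.ravel())         # should land near theta_true
```

If `alpha` is too large the iterates diverge. Because this gradient is a sum over all $m$ examples rather than an average, a workable step size shrinks as $m$ grows.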
- Alternatively, solve for $\boldsymbol{\hat \theta}$ analytically by setting the gradient to zero, which gives the normal equation (the second form assumes $\mathbf{X}^{\top}\mathbf{X}$ is invertible)
$$\mathbf{X}^{\top}\mathbf{y} = \mathbf{X}^{\top}\mathbf{X}\boldsymbol{\hat \theta},$$
$$\boldsymbol{\hat \theta} = \big(\mathbf{X}^{\top}\mathbf{X} \big)^{-1} \mathbf{X}^{\top}\mathbf{y}.$$
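A small sketch of the analytic solution in NumPy, on synthetic data (an assumption for illustration). It solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks against `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative assumption): 3 features plus a bias column.
m, n = 200, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # shape (m, n + 1)
theta_true = np.array([[1.0], [0.5], [-2.0], [3.0]])
y = X @ theta_true + 0.05 * rng.normal(size=(m, 1))

# Solve X^T X theta = X^T y as a linear system instead of inverting X^T X.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq minimizes ||X theta - y||_2 directly and also handles rank-deficient X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))
```

Solving the system is usually preferred over computing $(\mathbf{X}^{\top}\mathbf{X})^{-1}$ explicitly, since it tends to be cheaper and numerically better behaved.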
- Linear regression can fit more than just lines. Any fixed transformation $g(\mathbf{X})$ of the features also works, since the model stays linear in $\boldsymbol{\theta}$,
$$f_{\boldsymbol{\theta}}(\mathbf{X}) = g(\mathbf{X})\boldsymbol{\theta}.$$
- Common choices for $g(x)$ include powers $x^k$, $\log(x)$, $\sqrt{x}$
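As a sketch of fitting a non-linear curve with a linear model, the snippet below uses a hypothetical feature map `g` with columns $[1, \sqrt{x}]$ on synthetic data (both are assumptions for illustration). The fit itself is still ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (illustrative assumption): y = 1 + 2*sqrt(x) + noise, x > 0.
m = 200
x = rng.uniform(0.5, 10.0, size=(m, 1))
y = 1.0 + 2.0 * np.sqrt(x) + 0.05 * rng.normal(size=(m, 1))

def g(x):
    """Hypothetical feature map: columns [1, sqrt(x)]."""
    return np.hstack([np.ones_like(x), np.sqrt(x)])

G = g(x)                                           # shape (m, 2)
theta_hat, *_ = np.linalg.lstsq(G, y, rcond=None)  # still linear in theta

# Predict at a new point by applying the same transformation first.
x_new = np.array([[4.0]])
y_new = g(x_new) @ theta_hat                       # roughly 1 + 2*sqrt(4) = 5
```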
- Consider showing gradient descent work interactively here, for both 2D and 3D situations.
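- We are modeling the random variable $\mathbf{y} = f_{\boldsymbol{\theta}}(\mathbf{X}) + \boldsymbol{\varepsilon}$, where the noise $\boldsymbol{\varepsilon}$ is drawn from some distribution with mean zero.
    - Common choice (Gaussian noise, which satisfies the Gauss-Markov assumptions): $\varepsilon^{(i)} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$, i.e. $\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2\boldsymbol{I})$
    - Equivalently, $\mathbf{y}|\mathbf{X},\boldsymbol{\theta} \sim \mathcal{N}(f_{\boldsymbol{\theta}}(\mathbf{X}), \sigma^2\boldsymbol{I})$ (the $y^{(i)}$ are independent, but not identically distributed since their means differ)
    - Heuristic justification via the Central Limit Theorem: if each error is the sum of many small, independent unmodeled effects, it is approximately Gaussian. This is not guaranteed in general.
    - With Gaussian errors, the squared-error loss equals the negative log-likelihood up to a positive scale factor and an additive constant, so minimizing $J(\boldsymbol{\theta})$ maximizes the likelihood $\mathcal{L}(\boldsymbol{\theta}) = p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta})$, and $\boldsymbol{\hat \theta}$ is the maximum likelihood estimate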
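A short worked version of the likelihood claim, assuming Gaussian errors with known variance $\sigma^2$ and $m$ training examples. The density of $\mathbf{y}$ is
$$p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta}) = (2\pi\sigma^2)^{-m/2} \exp\bigg(-\frac{1}{2\sigma^2}||f_{\boldsymbol{\theta}}(\mathbf{X}) - \mathbf{y}||_2^2\bigg),$$
so taking the negative log gives
$$-\log p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta}) = \frac{m}{2}\log(2\pi\sigma^2) + \frac{1}{\sigma^2} J(\boldsymbol{\theta}).$$
The first term does not depend on $\boldsymbol{\theta}$ and $1/\sigma^2 > 0$, so the $\boldsymbol{\theta}$ that minimizes $J$ is the one that maximizes the likelihood.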
- Parametric vs non-parametric models:
    - Parametric: number of parameters is fixed ahead of time
    - Non-parametric: number of parameters can grow with the size of the data
        - need to keep all of the training data around just to make predictions (sklearn does this for you)
        - don't need to feature engineer as much
- Locally weighted regression: Instead of trying to fit the entire training set, when predicting a given point, just fit a line to the training points around that point in real time, then make a prediction based on that.
    - Use some kind of weighting function in the loss to enforce this, $J(\theta) = \sum_{i=1}^m \color{red}{w_i(x)}(f(x^{(i)}|\theta) - y^{(i)})^2$
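    - Common choice is a Gaussian weighting function, $w_i(x) = \exp\bigg(-\frac{(x-x^{(i)})^2}{2\tau^2}\bigg)$, which is ~1 when $x^{(i)}$ is near the query point $x$ and ~0 when it is far away.
    - Parameter $\tau$ is a "bandwidth" that sets the width of the window. It also controls over/underfitting: a small $\tau$ tends to overfit, while a large $\tau$ approaches ordinary (global) linear regression and can underfit.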
    - Common name: LO(W)ESS for locally estimated (weighted) scatterplot smoothing
    - Shows up in the time series STL decomposition
    - When to use: few features, lots of data, don't want to think about hand-engineering features
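A minimal sketch of locally weighted regression for one scalar feature, assuming Gaussian weights and a weighted normal equation $(\mathbf{X}^{\top}\mathbf{W}\mathbf{X})\boldsymbol{\theta} = \mathbf{X}^{\top}\mathbf{W}\mathbf{y}$ with $\mathbf{W} = \text{diag}(w_i(x))$. The helper name `lwr_predict` and the noisy-sine data are made up for illustration:

```python
import numpy as np

def lwr_predict(x_query, x_train, y_train, tau):
    """Predict at a scalar x_query by fitting a weighted line around it."""
    # Gaussian weights w_i(x) = exp(-(x - x_i)^2 / (2 tau^2)).
    w = np.exp(-(x_query - x_train) ** 2 / (2.0 * tau ** 2))
    X = np.column_stack([np.ones_like(x_train), x_train])   # bias + feature
    # Weighted normal equation: (X^T W X) theta = X^T W y with W = diag(w).
    XtW = X.T * w                                            # equals X.T @ diag(w)
    theta = np.linalg.solve(XtW @ X, XtW @ y_train)
    return theta[0] + theta[1] * x_query

rng = np.random.default_rng(3)

# Synthetic data (illustrative assumption): a noisy sine curve.
x_train = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=300))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=300)

x_query = np.pi / 2                                            # sin(x_query) = 1
y_local = lwr_predict(x_query, x_train, y_train, tau=0.3)     # narrow window
y_global = lwr_predict(x_query, x_train, y_train, tau=100.0)  # nearly one global line

print(y_local, y_global)
```

A new line is fit for every query point, which is why the training data has to stay around at prediction time.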
- Regularized Least Squares: Minimize $J(\theta) = ||X\theta - y||^2 + \lambda ||\theta||^2$ with $\lambda \geq 0$
    - Least squares solution is given by the modified normal equation $X^\top y = (X^\top X + \lambda I) \theta$,
    $$\hat \theta = (X^\top X + \lambda I)^{-1} X^\top y$$
    - Also called ridge regression, $\lambda$ the "ridge parameter"
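A small sketch of the ridge solution in NumPy, assuming the usual convention where the penalty $\lambda ||\theta||^2$ is added to the squared error, so the system to solve is $(X^\top X + \lambda I)\theta = X^\top y$. The helper name `ridge_fit` and the nearly collinear synthetic data are assumptions for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(4)

# Synthetic data (illustrative assumption): two nearly collinear features.
m = 50
x1 = rng.normal(size=(m, 1))
x2 = x1 + 1e-3 * rng.normal(size=(m, 1))
X = np.hstack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=(m, 1))

theta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
theta_ridge = ridge_fit(X, y, lam=1.0)   # lam > 0 shrinks the weights

print(np.linalg.norm(theta_ols), np.linalg.norm(theta_ridge))
```

With $\lambda > 0$ the matrix $X^\top X + \lambda I$ is always invertible, which is one reason ridge helps when features are nearly collinear.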
- Start off with a simple example $y=\theta x$, with one feature, no bias, and a single training example $(x, y)$ with $x \neq 0$. Then $\hat \theta = \frac{y}{x}=x^{-1}y$, the scalar analogue of the normal equation.
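A short worked extension of this example to $m$ training examples, assuming the same no-bias model and stacking the inputs and targets into vectors $\mathbf{x}$ and $\mathbf{y}$ of length $m$ (notation introduced here only for illustration):
$$J(\theta) = \frac{1}{2}\sum_{i=1}^m \big(\theta x^{(i)} - y^{(i)}\big)^2, \qquad \frac{dJ}{d\theta} = \sum_{i=1}^m x^{(i)}\big(\theta x^{(i)} - y^{(i)}\big) = 0,$$
$$\hat \theta = \frac{\sum_{i=1}^m x^{(i)} y^{(i)}}{\sum_{i=1}^m \big(x^{(i)}\big)^2} = \big(\mathbf{x}^{\top}\mathbf{x}\big)^{-1}\mathbf{x}^{\top}\mathbf{y}.$$
For $m = 1$ this reduces to $\hat \theta = y/x$.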