# Linear regression

## Simple linear regression
In the simplest scenario, the hypothesis, $h(x)$, models a linear effect of an independent variable, $x$, on a dependent variable, $y$.

$$y = h(x) =\theta_{0} + \theta_{1} x$$

The underlying relationship of the $m$ points $\{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{m}, y_{m})\}$ can therefore be described as

$$y_{i} = \theta_{0} + \theta_{1} x_{i} + \varepsilon_{i}$$

where $\varepsilon_{i}$ are the residuals, i.e. the differences between the actual and predicted values of the dependent variable. The "line of best fit" is found by minimizing the sum of squared residuals.
$$\min_{\theta_{0}, \theta_{1}} J(\theta_{0}, \theta_{1}), \quad J(\theta_{0}, \theta_{1}) = \sum_{i=1}^{m} \varepsilon_{i}^{2} = \sum_{i=1}^{m} \left(y_{i} - \theta_{0} - \theta_{1} x_{i}\right)^{2}$$

The minimum can be found analytically at the stationary point, where both partial derivatives of $J$ vanish
$$\frac{\partial J}{\partial \theta_{0}} = 0, \quad \frac{\partial J}{\partial \theta_{1}} = 0$$

I'm skipping some math because it's a bit more typing than I want to do in LaTeX, but the end result is

$$\theta_{1} = \frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(x)} = \frac{\sum_{i=1}^{m} (x_{i} - \overline{x})(y_{i} - \overline{y})}{\sum_{i=1}^{m} (x_{i} - \overline{x})^{2}}$$

$$\theta_{0} = \overline{y} - \theta_{1} \overline{x}$$
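
As a quick check of the closed-form result, here is a minimal sketch in plain Python. The function name `fit_simple_linear` and the sample points are made up for illustration. The points lie exactly on $y = 1 + 2x$, so the fit should recover those two parameters.

```python
from statistics import fmean


def fit_simple_linear(xs, ys):
    """Return (theta0, theta1) minimizing the sum of squared residuals."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar, y_bar = fmean(xs), fmean(ys)
    # Numerator and denominator of Cov(x, y) / Var(x); the 1/m factors cancel.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    theta1 = s_xy / s_xx
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


# Made-up points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_simple_linear(xs, ys)
print(theta0, theta1)  # 1.0 2.0
```

The guard on `s_xx` covers the degenerate case where $Var(x) = 0$, in which the slope formula would divide by zero.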

## More general linear regression

Now consider, more generally, a model with $n$ input variables, where $x_{0} = 1$ by convention so that $\theta_{0}$ acts as the intercept.

$$h(x) = \theta_{0} x_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \ldots + \theta_{n} x_{n} = \sum_{j=0}^{n} \theta_{j} x_{j} = \theta^{T}x$$

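The cost function over $m$ training examples can be represented as

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}$$

where $x^{(i)}$ is the feature vector of the $i$-th example and $y^{(i)}$ is its target value.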
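
To make the hypothesis and cost function concrete, here is a small sketch in plain Python. The names `hypothesis` and `cost` and the data are illustrative assumptions. Each row of `X` is one example and is assumed to already contain the constant term $x_{0} = 1$.

```python
def hypothesis(theta, x):
    """h(x) = theta^T x, where x already includes the constant term x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared prediction errors over m examples."""
    m = len(X)
    return sum((hypothesis(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


# Made-up data: each row is [x_0 = 1, x_1]; targets follow y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

perfect = cost([1.0, 2.0], X, y)  # parameters that fit the data exactly
zeros = cost([0.0, 0.0], X, y)    # all-zero parameters
print(perfect, zeros)
```

The exact parameters give a cost of zero, while the all-zero parameters give $(1 + 9 + 25) / 6 \approx 5.83$.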

There are two ways to find the minimum:
1. Numerically with __gradient descent__
  - More on the algorithm shortly
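  - Pros: Works well even when $n$ is large; each iteration costs $O(mn)$, so $k$ iterations cost $O(kmn)$ (sometimes quoted as $O(kn^{2})$)
  - Cons: Need to tune $\alpha$ (the learning rate) and need to iterate (convergence can be sped up with *feature scaling* and *mean normalization*)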
2. Analytically with the __normal equation__
  - $\theta = \left(X^{T}X\right)^{-1}X^{T}y$
  - Pros: No need to tune $\alpha$, no need to iterate, no need for feature scaling
  - Cons: Slow if $n$ is large, calculating the inverse of $(X^{T}X)$ is $O(n^{3})$
    - As a rule of thumb, if $n > 10{,}000$, it is better to use gradient descent
    - $X^{T}X$ can be noninvertible if two or more features are redundant (i.e. linearly dependent) or if there are too many features relative to training examples ($m \le n$)
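
The two approaches above can be compared side by side. This is a dependency-free sketch under stated assumptions, not a production implementation. The function names, the learning rate `alpha=0.1`, the iteration count, and the data are all made up for illustration. The gradient step uses $\partial J / \partial \theta_{j} = \frac{1}{m}\sum_{i} (h(x^{(i)}) - y^{(i)})\, x^{(i)}_{j}$, which follows from the cost function above. The normal equation is solved as the linear system $(X^{T}X)\theta = X^{T}y$ rather than by forming the inverse explicitly.

```python
def predict(theta, x):
    """h(x) = theta^T x, where x already includes x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent on J(theta) = 1/(2m) * sum (h(x) - y)^2."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # dJ/dtheta_j = 1/m * sum_i error_i * x_ij
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update of every parameter.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (X^T X is not invertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed by solving (X^T X) theta = X^T y."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve(XtX, Xty)


# Made-up data: rows are [x_0 = 1, x_1]; targets follow y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta_gd = gradient_descent(X, y)
theta_ne = normal_equation(X, y)
print(theta_gd, theta_ne)
```

Both results should be close to $\theta = (1, 2)$ on this data. Passing two identical feature columns to `normal_equation` raises the `ValueError`, which illustrates the noninvertible case. In practice one would more likely reach for a library routine such as NumPy's `numpy.linalg.lstsq` or `numpy.linalg.pinv`, which also handle rank-deficient inputs.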
