# Linear regression with multiple variables

Also called **multivariate linear regression**.

The variables are also called **features**.

> **Notation**
>
> $m$ = number of training examples (feature vectors)
>
> $n$ = number of features in each example (dimension of feature vectors)
>
> $x^{(i)}$ = input of $i^{th}$ training example (feature vector)
>
> $x^{(i)}_{j}$ = value of feature $j$ in $i^{th}$ training example (feature vector)

## Hypothesis (model) for multivariate linear regression

> **Hypothesis for linear regression**
>
> Univariate: $h_{\theta}(x) = \theta_{0} + \theta_{1}x$ ($n = 1$)
>
> Multivariate: $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \ldots + \theta_{n}x_{n}$ (arbitrary $n$)

Let us assume that $x^{(i)}_{0} = 1$ for all $i \in [1, m]$. This allows us to write $x$ as

$$x = \begin{bmatrix}x_{0} \\ x_{1} \\ \vdots \\ x_{n}\end{bmatrix} = \begin{bmatrix}1 \\ x_{1} \\ \vdots \\ x_{n}\end{bmatrix} \in \mathcal{R}^{n+1}$$

In other words, our training data is a set of $(n+1)$-dimensional vectors. Writing our hypothesis parameters as the vector

$$\theta = \begin{bmatrix}\theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n}\end{bmatrix} \in \mathcal{R}^{n+1}$$

allows us to write our hypothesis function compactly as the inner product

$$h_{\theta}(x) = \theta^{T}x$$

This product expands to the linear combination

$$h_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + \ldots + \theta_{n}x_{n}$$

We can think of $\theta_{0}$ as the "base value", and of each remaining $\theta_{j}$ as the weight that feature $x_{j}$ contributes to the final output value.
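As a minimal sketch of the vectorized hypothesis, here is $\theta^{T}x$ in plain Python (standard library only). The function name `hypothesis` and all numeric values are illustrative assumptions, not part of the original notes.

```python
def hypothesis(theta, x):
    """Return h_theta(x) = theta^T x for a feature vector x that already includes x_0 = 1."""
    if len(theta) != len(x):
        raise ValueError("theta and x must both have n + 1 entries")
    return sum(t * xj for t, xj in zip(theta, x))


# Illustrative values only: n = 2 features, so theta and x have 3 entries each.
theta = [1.0, 2.0, 3.0]   # theta_0 (base value), theta_1, theta_2
features = [4.0, 5.0]     # x_1, x_2 for one example
x = [1.0] + features      # prepend x_0 = 1

prediction = hypothesis(theta, x)  # 1*1 + 2*4 + 3*5
```

Prepending $x_{0} = 1$ is what lets the base value $\theta_{0}$ be handled by the same sum as every other parameter.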

## Gradient descent for multivariate linear regression


Recall the **mean squared error** cost function

$$J(\theta_{0}, \theta_{1}, \ldots, \theta_{n}) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$$

which we can simplify as

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$$

where $\theta \in \mathcal{R}^{n+1}$.

Recall also the algorithm for gradient descent, which is to *repeat the following until convergence*, updating all $\theta_{j}$ simultaneously for $j \in [0, n]$

$$\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)$$

The *partial derivative term* becomes

$$\frac{\partial}{\partial\theta_{j}}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x^{(i)}_{j}$$

This gives us the **gradient descent algorithm for multivariate linear regression**: *repeat the following until convergence*, updating all $\theta_{j}$ simultaneously for $j \in [0, n]$

$$\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x^{(i)}_{j}$$
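A minimal sketch of this update rule in plain Python (standard library only). The function names, the toy data, and the choices `alpha=0.1` and `iterations=2000` are illustrative assumptions; a fixed iteration count stands in for a real convergence test.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x already includes x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared errors."""
    m = len(X)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(X)
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    # Every partial derivative is computed from the old theta before any update.
    gradient = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, gradient)]


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha)
    return theta


# Toy data (an assumption for illustration), generated from y = 1 + 2*x1 + 3*x2.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]
theta = gradient_descent(X, y)  # should approach [1, 2, 3]
```

Computing the whole gradient from the old $\theta$ before applying it is what makes the update simultaneous.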

## Gradient descent in practice

There are several practical ways to make gradient descent converge faster and more reliably.

### Feature scaling

Gradient descent converges more quickly when features are on a similar scale. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, therefore oscillating inefficiently down to an optimum when the ranges are very uneven.

For this reason we can **scale features** to take on similar value ranges. Ideally we want features roughly in the range

$$-1 \le x_{j} \le 1$$

### Mean normalization

Replace each feature $x_{j}$ with $x_{j} - \mu_{j}$ (except $x_{0}$) so that features have approximately zero mean.

In general, combining **feature scaling** and **mean normalization** means we transform features as

$$x_{j} \leftarrow \frac{x_{j} - \mu_{j}}{s_{j}}$$

where
- $\mu_{j}$ is the average of feature $x_{j}$ over the training set
- $s_{j}$ scales the feature $j$ to a range close to $-1 \le x_{j} \le 1$

$s_{j}$ is often the range $\max(x_{j}) - \min(x_{j})$ or the *standard deviation* of $x_{j}$.
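The transformation above can be sketched in plain Python as follows. This sketch assumes the standard deviation is used for $s_{j}$; the function name `normalize_features` and the data values are made up for illustration.

```python
from statistics import mean, pstdev


def normalize_features(rows):
    """Scale raw features x_1..x_n (no x_0 column) to zero mean and unit spread.

    Returns the scaled rows together with the per-feature mu_j and s_j.
    """
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # Assumption: s_j is the population standard deviation. A constant
    # feature would give s_j = 0, so fall back to 1.0 to avoid dividing by zero.
    scales = [pstdev(col) or 1.0 for col in columns]
    scaled = [[(v - mu) / s for v, mu, s in zip(row, mus, scales)]
              for row in rows]
    return scaled, mus, scales


# Made-up examples with two features on very different scales.
rows = [[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [1416.0, 2.0]]
scaled, mus, scales = normalize_features(rows)

# x_0 = 1 is added after scaling, so it is left untouched.
X = [[1.0] + row for row in scaled]
```

The returned `mus` and `scales` need to be kept: any new input must be transformed with the same values before it is passed to the trained hypothesis.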

### "Debugging" gradient descent

Recall that we seek $\min_{\theta} J(\theta)$.

It is often useful to plot the values of $J(\theta)$ against the number of iterations of the gradient descent algorithm. 

The idea is that we should see the value of $J(\theta)$ decreasing as the number of iterations increases. The rate of decrease is a good indication of the efficiency of the algorithm.

We can use an automatic convergence test that declares convergence once $J(\theta)$ decreases by less than a given threshold in one iteration. This threshold, however, can be difficult to choose, and a plot usually gives a better idea of whether the algorithm is converging.

If $J(\theta)$ is *increasing* as the iterations increase (i.e. *diverging*), or it goes up and down in oscillations, it's usually a sign to choose a smaller learning rate $\alpha$.
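As a rough sketch of this kind of debugging, the following plain Python records $J(\theta)$ after every iteration so that the values can be plotted or inspected. The function names, the toy data, and the two learning rates are assumptions chosen to show one converging and one diverging run.

```python
def predict(theta, x):
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    m = len(X)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def cost_history(X, y, alpha, iterations):
    """Run gradient descent and return J(theta) before and after every iteration."""
    m = len(X)
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(iterations):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        gradient = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                    for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
        history.append(cost(theta, X, y))
    return history


def is_decreasing(history):
    """True if J(theta) dropped on every single iteration."""
    return all(later < earlier for earlier, later in zip(history, history[1:]))


# Toy data (an assumption for illustration), generated from y = 1 + 2*x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

good = cost_history(X, y, alpha=0.1, iterations=50)  # small enough alpha
bad = cost_history(X, y, alpha=1.0, iterations=50)   # alpha too large for this data
```

For this data, `is_decreasing(good)` holds while the values in `bad` grow on every iteration, which is the signal to reduce $\alpha$.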

> For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration (eventually converging)
>
> But choosing an $\alpha$ too small results in slow convergence

### Polynomial regression

Sometimes we can imagine that a polynomial function might fit the training data better. We can map features from a polynomial hypothesis (model) to a linear one.

If for example $h_{\theta}(x) = \theta_{0} + \theta_{1}x + \theta_{2}x^{2} + \theta_{3}x^{3}$, we can let
- $x_{1} = x$
- $x_{2} = x^{2}$
- $x_{3} = x^{3}$

Which gives us a linear hypothesis $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3}$.

Feature scaling becomes especially important here, because raising a feature to a power also raises its range: if $x$ ranges over $[1, 1000]$, then $x^{2}$ ranges over $[1, 10^{6}]$ and $x^{3}$ over $[1, 10^{9}]$.

The exponents don't need to be consecutive positive integers. We can also use roots (for example $\sqrt{x}$) and other transformations of the features to better fit the training data.

We can also *combine* features to create new features. For example, given two features $x_{1}$ and $x_{2}$, we can combine them into a third feature such as the product $x_{3} = x_{1}x_{2}$.
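A small sketch of the polynomial feature mapping in plain Python. The helper name `polynomial_features` and the input values are illustrative assumptions.

```python
def polynomial_features(x, degree):
    """Map a single raw input x to the feature vector [1, x, x^2, ..., x^degree]."""
    return [x ** power for power in range(degree + 1)]


# Three made-up raw inputs mapped to cubic features (x_0 = 1, x_1, x_2, x_3).
raw_inputs = [1.0, 2.0, 3.0]
X = [polynomial_features(x, 3) for x in raw_inputs]

# The ranges grow quickly with the exponent, which is why scaling matters.
widest = polynomial_features(1000.0, 3)
```

Each row of `X` can then be fed to ordinary linear regression, since the hypothesis is still linear in $\theta$.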

# Normal equation

> **Gradient descent**
>
> We have a model $h_{\theta}$ with parameters $\theta$
>
> The model has a cost function $J(\theta)$ that we seek to minimize
>
> Gradient descent is an **iterative** algorithm for finding a $\theta$ that minimizes $J(\theta)$

The **normal equation** is a method for solving for $\theta$ **analytically**, in a single step rather than iteratively.


## Intuition

The cost function $J(\theta)$ has its minimum where every one of its partial derivatives is zero. We can find this minimum by solving for $\theta$ in

$$\frac{\partial}{\partial\theta_{j}}J(\theta) = 0 \quad \text{for every } j \in [0, n]$$
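As a sketch of where this leads, write the cost in matrix form, using the design matrix $X$ and the output vector $y$ constructed in the Algorithm section:

$$J(\theta) = \frac{1}{2m}(X\theta - y)^{T}(X\theta - y)$$

Collecting all the partial derivatives into one vector and setting it to zero gives

$$\frac{1}{m}X^{T}(X\theta - y) = 0$$

which rearranges to

$$X^{T}X\theta = X^{T}y$$

Assuming $X^{T}X$ is invertible, multiplying both sides by $(X^{T}X)^{-1}$ yields the closed-form solution for $\theta$.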

## Algorithm

We construct an $m \times (n + 1)$ **design matrix $X$** out of the training set, with one training example per row

$$X = \begin{bmatrix}(x^{(1)})^{T} \\ (x^{(2)})^{T} \\ \vdots \\ (x^{(m)})^{T}\end{bmatrix} \in \mathcal{R}^{m \times (n + 1)}$$

where each of the $m$ rows is the transpose of an $(n + 1)$-dimensional feature vector

$$(x^{(i)})^{T} = \begin{bmatrix}1 & x^{(i)}_{1} & x^{(i)}_{2} & \ldots & x^{(i)}_{n}\end{bmatrix}$$

We also construct an $m$-dimensional vector $y$ out of the training set outputs

$$y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)}\end{bmatrix} \in \mathcal{R}^{m}$$

The **solution $\theta$** is given by

$$\theta = (X^{T}X)^{-1}X^{T}y$$

Or in Octave:
```octave
# `pinv` calculates the (numerical) pseudo-inverse of any matrix
# `inv` calculates the actual inverse, but only if the matrix is invertible
theta = pinv(X' * X) * X' * y;
```
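For comparison, here is a dependency-free sketch of the same computation in plain Python. It does not compute a pseudo-inverse. Instead it solves the linear system $X^{T}X\theta = X^{T}y$ by Gaussian elimination, which is a swapped-in technique that only works when $X^{T}X$ is invertible. All function names and the toy data are illustrative assumptions.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    columns = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x


def normal_equation(X, y):
    """Return the theta that satisfies (X^T X) theta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xij * yi for xij, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Toy data (an assumption for illustration), generated from y = 1 + 2*x1 + 3*x2.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]
theta = normal_equation(X, y)  # close to [1, 2, 3]
```

No learning rate, iteration count, or feature scaling is needed here, but the elimination step is where the cubic cost in the number of features comes from.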

## Comparison with gradient descent

Unlike *gradient descent*, there is no need for **feature scaling** with the *normal equation*.

Advantages of **gradient descent**:
- Works well even when $n$ is large (many features)

Disadvantages of **gradient descent**:
- Need to choose learning rate $\alpha$
- Requires many iterations

Advantages of **normal equation**:
- Easy to implement

Disadvantages of **normal equation**:
- Computing $(X^{T}X)^{-1} \in \mathcal{R}^{(n + 1) \times (n + 1)}$ is slow if $n$ is large (matrix inversion is roughly cubic in $n$, i.e. $O(n^{3})$)

> **Conclusion**: use *normal equation* if $n$ is reasonably small ($\le$ 10 000), use *gradient descent* when $n$ is large

## The normal equation and non-invertibility

If $X^{T}X$ is non-invertible (singular / degenerate), we can still solve for $\theta$ by numerically computing the pseudo-inverse $(X^{T}X)^{\dagger}$ instead.

See `inv` vs `pinv` in Octave.

If $X^{T}X$ is non-invertible, it's likely that
- Some features are redundant (the columns of $X$ are **linearly dependent**)
- There are too many features ($m \le n$)

Options: either *remove features* or use *regularization*.
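To illustrate the redundant-feature case, the following plain Python sketch builds a tiny design matrix in which one feature is exactly twice another and checks that $X^{T}X$ has determinant zero. The helper names and the data are assumptions made up for this example.

```python
def gram(X):
    """Return X^T X for a design matrix given as a list of rows."""
    columns = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def det(A):
    """Determinant by cofactor expansion along the first row (fine for tiny matrices)."""
    if len(A) == 1:
        return A[0][0]
    total = 0.0
    for j, a in enumerate(A[0]):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * a * det(minor)
    return total


# Made-up data: the third column is exactly 2 * the second, so it is redundant.
X_redundant = [[1.0, 1.0, 2.0], [1.0, 2.0, 4.0], [1.0, 3.0, 6.0]]
# The same data with the redundant feature removed.
X_reduced = [row[:2] for row in X_redundant]

singular_det = det(gram(X_redundant))  # zero: X^T X is not invertible
reduced_det = det(gram(X_reduced))     # non-zero: X^T X is invertible again
```

With real floating-point data the determinant is rarely exactly zero, so in practice a near-zero value (or a very poorly conditioned matrix) is the warning sign.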