## Linear Regression for Multiple Features
- $X$ is an $m \times n$ matrix
  - $m$ is the number of training examples
  - $n$ is the number of features
  - $X_j^{(i)}$ denotes the value of the $j^{th}$ feature for the $i^{th}$ example

- $Y$ is an $m \times 1$ vector
  - one output per example, so $m$ outputs in total
  - $Y^{(i)} = f(X^{(i)})$


### Goal
- We want to train a model on the $n$ features by assigning a weight to each of them, so that the resulting prediction line (a hyperplane when $n > 1$) fits the training examples as closely as possible and gives accurate predictions

### Hypothesis
- We extend the hypothesis so that it covers all $n$ features
$$\hat{y} = h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n$$
- This is a weighted sum in which $\theta_i$ denotes the weight of feature $x_i$
$$\hat{y} = \sum_{i=0}^n\theta_ix_i$$
> $x_0$ is a dummy feature which is always $1$
>
> Hence the modified matrix $X$ is $m \times (n+1)$, with a first column of ones for $x_0$

Treating $\theta$ and $x$ as column vectors of length $n+1$, we can write
$$h_\theta(x) = \theta^Tx$$
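
As a minimal sketch of this hypothesis in plain Python: the helper names `add_bias` and `predict` and the sample numbers are illustrative assumptions, not part of these notes. Prepending $x_0 = 1$ lets the intercept $\theta_0$ be handled by the same dot product as the other weights.

```python
def add_bias(x):
    """Prepend the dummy feature x_0 = 1 to a list of n feature values."""
    return [1.0] + list(x)


def predict(theta, x):
    """Compute h_theta(x) = theta^T x for an x that already includes x_0."""
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical weights for n = 2 features: theta_0 = 1, theta_1 = 2, theta_2 = 3
theta = [1.0, 2.0, 3.0]
x = add_bias([4.0, 5.0])   # [1.0, 4.0, 5.0]
y_hat = predict(theta, x)  # 1*1 + 2*4 + 3*5 = 24.0
```

In practice a numerical library would usually compute this as a matrix-vector product over all $m$ rows at once; the loop form is kept here only to mirror the sum.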

#### Modified error function
$$J(\theta) = \frac{1}{m}\sum_{i=1}^m[y^{(i)} - \hat{y}^{(i)}]^2$$
Replacing $\hat{y}^{(i)}$ with $\theta^Tx^{(i)}$
$$J(\theta) = \frac{1}{m}\sum_{i=1}^m[y^{(i)} - \theta^Tx^{(i)}]^2$$
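
The error function is the mean of the squared residuals, which could be computed along these lines. The function name `cost` and the tiny data set are assumptions made up for illustration; each row of `X` is assumed to already contain $x_0 = 1$.

```python
def cost(theta, X, y):
    """Mean squared error J(theta) = (1/m) * sum((y_i - theta^T x_i)^2).

    X is a list of m rows, each already including the dummy feature x_0 = 1.
    """
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        y_hat = sum(t * xj for t, xj in zip(theta, x_i))
        total += (y_i - y_hat) ** 2
    return total / m


# Hypothetical data: m = 3 examples, one real feature plus x_0
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]                  # generated by y = 1 + 2x

perfect = cost([1.0, 2.0], X, y)     # exact weights, so J = 0.0
off = cost([0.0, 2.0], X, y)         # every residual is 1, so J = 1.0
```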

#### Calculating gradient
- To minimise the error, we repeatedly update $\theta$ by stepping against the gradient of $J$, where $\eta$ is the learning rate
$$\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta)$$
For a generic $\theta_j$ and a single example,
$$\frac{\partial}{\partial\theta_j}(h_\theta(x) - y)^2 = 2(h_\theta(x) - y)\frac{\partial}{\partial\theta_j}\left[\sum_{k=0}^n\theta_kx_k - y\right]$$
Every term with $k \neq j$ is independent of $\theta_j$, so its derivative vanishes and only $x_j$ remains
$$\frac{\partial}{\partial\theta_j}(h_\theta(x) - y)^2 = 2(\hat{y} - y)x_j$$
Averaging over all $m$ examples, as in $J(\theta)$,
$$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{2}{m}\sum_{i=1}^m(\hat{y}^{(i)} - y^{(i)})x^{(i)}_j$$
Finally,
$$\theta_j = \theta_j - \eta \cdot \frac{2}{m}\sum_{i=1}^m(\hat{y}^{(i)} - y^{(i)})x^{(i)}_j$$
> The constant factor $\frac{2}{m}$ is often absorbed into the learning rate $\eta$
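
The update rule can be turned into a small batch gradient descent loop. This is a sketch under stated assumptions rather than a reference implementation: the function name `gradient_descent`, the zero initialisation, the values of `eta` and `epochs`, and the data are all made up for illustration. The gradient is taken from the mean squared error, so it carries a constant factor $\frac{2}{m}$; dropping that factor only rescales the learning rate.

```python
def gradient_descent(X, y, eta=0.1, epochs=1000):
    """Batch gradient descent for linear regression.

    X: list of m rows, each already including the dummy feature x_0 = 1.
    y: list of m targets.
    Returns the learned weights theta as a list of length n + 1.
    """
    m, n_plus_1 = len(X), len(X[0])
    theta = [0.0] * n_plus_1          # assumed starting point: all zeros
    for _ in range(epochs):
        # Residuals (y_hat - y) for every example under the current theta
        errors = [
            sum(t * xj for t, xj in zip(theta, x_i)) - y_i
            for x_i, y_i in zip(X, y)
        ]
        # Partial derivative of J with respect to each theta_j
        grad = [
            (2.0 / m) * sum(e * x_i[j] for e, x_i in zip(errors, X))
            for j in range(n_plus_1)
        ]
        # Update all weights simultaneously from the same gradient
        theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical data generated by y = 1 + 2x
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]
theta = gradient_descent(X, y, eta=0.05, epochs=5000)
```

On this data the weights should approach $\theta_0 = 1$ and $\theta_1 = 2$. A learning rate that is too large makes the updates diverge, so $\eta$ usually needs tuning per data set.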