# Linear regression IV

## With multiple variables (Multivariate linear regression)

This version is more powerful because it works with multiple features (variables). Recall that linear regression models the relationship <code>**between a dependent variable (y) and one or more independent variables (x)**</code>.

In [4]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
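# just for the nicer style ('seaborn-whitegrid' was renamed to
# 'seaborn-v0_8-whitegrid' in Matplotlib 3.6)
plt.style.use('seaborn-whitegrid')
from pandas import Series, DataFrame
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston

data_set = load_boston()
boston_df = DataFrame(data_set.data)
boston_df.columns = data_set.feature_names
# "Kaina tūks. $" is Lithuanian for "Price, thousands of $"
boston_df['Kaina tūks. $'] = data_set.target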
sample = boston_df[["RM", "Kaina tūks. $"]]
sample.head()

|   | RM | Kaina tūks. $ |
|---|---|---|
| 0 | 6.575 | 24.0 |
| 1 | 6.421 | 21.6 |
| 2 | 7.185 | 34.7 |
| 3 | 6.998 | 33.4 |
| 4 | 7.147 | 36.2 |


So we used a single feature x (RM) to predict y (the price).
And this was the form of our hypothesis:

$$ h_{\theta}(x) = \theta_{0} + \theta_{1}x$$

But what if we had more than one feature, such as age, tax, and crime rate?

In [8]:
sample = boston_df
sample.head()
#print(data_set.DESCR)

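|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Kaina tūks. $ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |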


This gives us a lot more information for predicting the price. For these features (variables) we will use the notation <code>**x1, x2, x3, etc**</code>. For the target (output variable) we will continue to use <code>**y**</code>.

<code>**Notation**</code>:
 * <code>**n**</code> = number of features;
 * <code>**m**</code> = number of training examples;
 * <code>**x(i)**</code> = input variables (features) of the i-th training example;
 * <code>**x(i)(j)**</code> = value of feature j in the i-th training example.
 
Let's say x(2) is the vector of features for our second training example:

\begin{equation*}x^{(2)} = 
\begin{bmatrix}
0.02731 \\
0.0  \\
7.07 \\
0.0  \\
0.469\\
\vdots
\end{bmatrix}
\in \mathbb{R}^{n}
\end{equation*}

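Put plainly, the superscript is just an index into the training set, and the subscript refers to a feature within that training example. For instance, the following is the value of feature 3 (INDUS) in the second training example, which is 7.07 in the table above: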

$$ x^{(2)}_3$$
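
To make the notation concrete, here is a minimal sketch of how it could map onto pandas indexing. The rows are copied from the table above; the helper name `feature_value` is hypothetical and only exists for this illustration. Note that the math notation is 1-based while pandas positions are 0-based.

```python
import pandas as pd

# First five training examples from the table above (target column omitted).
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
rows = [
    [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.0900, 1.0, 296.0, 15.3, 396.90, 4.98],
    [0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.90, 9.14],
    [0.02729, 0.0, 7.07, 0.0, 0.469, 7.185, 61.1, 4.9671, 2.0, 242.0, 17.8, 392.83, 4.03],
    [0.03237, 0.0, 2.18, 0.0, 0.458, 6.998, 45.8, 6.0622, 3.0, 222.0, 18.7, 394.63, 2.94],
    [0.06905, 0.0, 2.18, 0.0, 0.458, 7.147, 54.2, 6.0622, 3.0, 222.0, 18.7, 396.90, 5.33],
]
X = pd.DataFrame(rows, columns=columns)

# m = number of training examples (here only the 5 rows shown), n = number of features.
m, n = X.shape


def feature_value(X, i, j):
    """Return x^(i)_j, using the 1-based indices of the math notation."""
    return X.iloc[i - 1, j - 1]


x_2 = X.iloc[2 - 1].to_numpy()    # feature vector of the 2nd training example
x_2_3 = feature_value(X, 2, 3)    # feature 3 (INDUS) of the 2nd training example
print(m, n, x_2_3)
```

The same helper can be used to check your answer to the practice exercise that follows.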

## Practice exercise
In our training set above, what is:

$$ x^{(4)}_6$$

Now that we have multiple features to deal with, what should our hypothesis look like?
Previously, with a single feature <code>**x**</code>, it was:
$$ h_{\theta}(x) = \theta_{0} + \theta_{1}x$$

Now it's going to be this:

$$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$$

An arbitrary example could look like:

$$h_{\theta}(x) = 80 + 2.2x_{1} + 0.05x_{2} + 3x_{3} + 2.2x_{4} - 5.05x_{5}$$

<code>**Recap**</code>: the hypothesis tries to predict the output value <code>**y**</code> from the given input values <code>**x**</code>.

Let's simplify this equation:
$$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$$

For convenience of notation, define a zeroth feature that always takes the value 1.
$$ x^{(i)}_0 = 1$$

Now our feature vector x becomes an (n+1)-dimensional vector that is zero-indexed.

\begin{equation*}
x = 
\begin{bmatrix}
x_{0} \\
x_{1} \\
x_{2} \\ 
\vdots \\
x_{n} \\
\end{bmatrix}
\in
\mathbb{R}^{n+1}
\end{equation*}

And our parameters can be a vector:
\begin{equation*}
\theta = 
\begin{bmatrix}
\theta_{0} \\
\theta_{1} \\
\theta_{2} \\ 
\vdots \\
\theta_{n} \\
\end{bmatrix}
\in
\mathbb{R}^{n+1}
\end{equation*}

So our hypothesis can be rewritten as:
$$h_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n} = \theta^\top x$$

\begin{equation*}
\begin{bmatrix}
\theta_{0} & \theta_{1} & \theta_{2} & \cdots & \theta_{n}
\end{bmatrix}
\begin{bmatrix}
x_{0} \\
x_{1} \\
x_{2} \\ 
\vdots \\
x_{n} \\
\end{bmatrix}
= \theta^\top x
\end{equation*}

It's just a convenient way to write the hypothesis as the inner product between our parameter vector theta and our feature vector x. This small piece of notation is what allows us to do it:
$$ x^{(i)}_0 = 1$$
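
As a rough illustration of the inner-product form, the sketch below prepends the constant feature x0 = 1 and evaluates the hypothesis for several examples at once with NumPy. The function names `add_intercept` and `hypothesis`, the parameter values, and the two input rows are all made up for this example.

```python
import numpy as np


def add_intercept(X):
    """Prepend the constant feature x_0 = 1 to every training example."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])


def hypothesis(theta, X):
    """Compute h_theta(x) = theta^T x for every row of X (x_0 column included)."""
    return X @ theta


# Hypothetical parameters (theta_0, theta_1, theta_2) and two made-up examples, n = 2.
theta = np.array([80.0, 2.2, 0.05])
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

X = add_intercept(X_raw)            # shape (m, n + 1)
predictions = hypothesis(theta, X)  # one prediction per training example
print(predictions)
```

Stacking the examples as rows means a single matrix-vector product replaces a loop over every example and every feature.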


## Gradient descent for multiple variables

How do we fit the parameters of that hypothesis? In other words, how do we use gradient descent (GD) with multiple features?
## Outline

<code>**Hypothesis**</code>:
\begin{multline*}
h_{\theta}(x) = \theta^\top x =  \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} +  ... +  \theta_{n}x_{n}
\end{multline*}
\begin{multline*}x_{0} = 1 \end{multline*}
<code>**Parameters**</code>:
\begin{multline*}
\vec{\theta} , \text{ an } (n+1)\text{-dimensional vector}
\end{multline*}

<code>**Cost function**</code>:
\begin{multline*}
J(\vec{\theta}) = \dfrac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)} ) -y^{(i)})^2
\end{multline*}

<code>**Goal**</code>:
\begin{multline*}
\min_{\theta_{0}, \theta_{1}, \dots, \theta_{n}} J(\theta_{0}, \theta_{1}, \dots, \theta_{n})
\end{multline*}
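
The outline above does not spell out the update step, so the sketch below assumes the standard batch gradient descent rule for this cost function: every parameter is updated simultaneously as theta_j := theta_j - alpha * (1/m) * sum((h(x(i)) - y(i)) * x(i)_j). The learning rate `alpha`, the iteration count, the function names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h_theta(x^(i)) - y^(i))^2)."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)


def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent; X must already contain the x_0 = 1 column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    history = []
    for _ in range(iterations):
        # Gradient of J with respect to every theta_j, computed at once.
        gradient = X.T @ (X @ theta - y) / m
        # Simultaneous update of all parameters.
        theta = theta - alpha * gradient
        history.append(cost(theta, X, y))
    return theta, history


# Synthetic, noise-free data generated from y = 1 + 2*x1 + 3*x2 (made up for the demo).
rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=(50, 2))
X = np.hstack([np.ones((50, 1)), features])
y = X @ np.array([1.0, 2.0, 3.0])

theta, history = gradient_descent(X, y, alpha=0.5, iterations=2000)
print(theta)
```

Recording the cost at each iteration is a simple way to check that the chosen learning rate makes J decrease.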

## Feature Scaling

Make sure that the features are on a similar scale:
 * x1 = size (0-250 m²)
 * x2 = number of rooms (1-5)
 
Scale them:
 * x1 = size in m² / 250
 * x2 = number of rooms / 5
 
Try to get every feature into approximately this range:

$$ -1 \leq x_{i} \leq 1 $$

It's fine if you only get close enough; as a rule of thumb, anything within a range like <code>**-3 to 3**</code> is acceptable, for example:

$$ 0 \leq x_{1} \leq 3 $$

$$ -2 \leq x_{2} \leq 0.5 $$
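
A minimal sketch of the scaling described above, dividing each feature by its largest absolute value. The second function, mean normalization, is not covered in the text; it is included only as a commonly used alternative that also centers each feature around zero. The function names and the three sample rows (size in m², number of rooms) are made up for this example.

```python
import numpy as np


def scale_by_max(X):
    """Divide every feature (column) by its largest absolute value."""
    X = np.asarray(X, dtype=float)
    return X / np.abs(X).max(axis=0)


def mean_normalize(X):
    """Assumed alternative: subtract the column mean, divide by the column range.

    A constant column has a range of 0 and would need special handling.
    """
    X = np.asarray(X, dtype=float)
    value_range = X.max(axis=0) - X.min(axis=0)
    return (X - X.mean(axis=0)) / value_range


# Made-up examples: [size in m², number of rooms].
X = np.array([[250.0, 5.0],
              [125.0, 3.0],
              [50.0, 1.0]])

scaled = scale_by_max(X)        # every value ends up in [0, 1] for this data
normalized = mean_normalize(X)  # every value ends up in [-1, 1], column means are 0
print(scaled)
print(normalized)
```

Whichever variant is used, the scaling factors computed on the training data should be reused when scaling new inputs for prediction.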