## Checkout 
<a href="./maths_for_ml.md"> Model selection and statistics for ML</a>
# Multiple Linear Regression
### A statistical technique that uses two or more independent variables to predict the outcome of a dependent variable.

## $$ y = b_0 + b_1X_1 + b_2X_2 + b_3X_3 + ... + b_8D_1 + b_9D_2 $$
![alt text](https://blogs.sas.com/content/iml/files/2018/01/reglabels1.png)

Here,
- **y** is the dependent variable
- **b<sub>0</sub>** is the y-intercept (constant)
- **b<sub>1</sub>** and **X<sub>1</sub>** are the slope coefficient and the observations for the first independent variable
- **b<sub>2</sub>** and **X<sub>2</sub>** are the slope coefficient and the observations for the second independent variable, and likewise for the others
- **b<sub>8</sub>** is the slope coefficient for the dummy variable **D<sub>1</sub>**
- **D<sub>1</sub>** holds the dummy variable observations (0 or 1)
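
As a quick illustration of how the equation is evaluated, the sketch below plugs one observation into it. The coefficient and input values are hypothetical, chosen only for readability; they are not fitted to any dataset.

```python
import numpy as np

# Hypothetical coefficients, for illustration only (not fitted values).
b0 = 50000.0                        # y-intercept
b = np.array([0.8, 0.05])           # slope coefficients for X1 and X2
b_dummy = np.array([1500.0])        # slope coefficient for the dummy variable D1

x = np.array([100000.0, 20000.0])   # one observation of X1 and X2
d = np.array([1.0])                 # D1 = 1: the observation belongs to the category

# y = b0 + b1*X1 + b2*X2 + b8*D1
y = b0 + b @ x + b_dummy @ d
print(y)
```

With `d` set to 0 the dummy term vanishes, so the dummy coefficient acts as a shift of the intercept for that category.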

### Note:
- Dummy variables are categorical independent variables (text labels) that have been converted to numeric form through encoding.
- Always include only **(n-1)** dummy variables from a single set of n. For example, if a category produces 2 dummy columns, include only 1; for 9 dummy columns, include (9-1)=8.
- For each additional categorical variable, include another (n-1) dummy columns from its own set.
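
The reason for the (n-1) rule can be sketched with NumPy alone: when all n dummy columns are kept alongside an intercept, they add up to the intercept column, so one column is redundant. The five labels below are an illustrative sample, and the variable names are assumptions made for this sketch.

```python
import numpy as np

states = np.array(["New York", "California", "Florida", "New York", "Florida"])

# One-hot encode by hand: one column per category, in sorted order.
categories, codes = np.unique(states, return_inverse=True)
dummies = np.eye(len(categories))[codes]            # shape (5, 3)

intercept = np.ones((len(states), 1))

# With all n dummy columns, the dummies sum to the intercept column,
# so the design matrix has fewer independent columns than it has columns.
full = np.hstack([intercept, dummies])

# Dropping one dummy column (n - 1 = 2 remain) removes the redundancy.
reduced = np.hstack([intercept, dummies[:, 1:]])

print(full.shape[1], np.linalg.matrix_rank(full))        # 4 columns, rank 3
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 columns, rank 3
```

The dropped category becomes the baseline: its effect is absorbed into the intercept, and the remaining dummy coefficients are measured relative to it.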

### Importing the libraries.

In [50]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 


### Importing the dataset


In [51]:
dataset = pd.read_csv("./reg_dataset/50_Startups.csv")

X = dataset.iloc[:, :-1].values 
Y = dataset.iloc[:, -1].values 

In [52]:
for i in range(5):
    print(f'{X[i]}  --> {Y[i]}')

[165349.2 136897.8 471784.1 'New York']  --> 192261.83
[162597.7 151377.59 443898.53 'California']  --> 191792.06
[153441.51 101145.55 407934.54 'Florida']  --> 191050.39
[144372.41 118671.85 383199.62 'New York']  --> 182901.99
[142107.34 91391.77 366168.42 'Florida']  --> 166187.94


### Transforming the categorical data (city) using one-hot encoding


In [53]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[-1])], remainder='passthrough') 
# [-1] is column index with categorical data.
X = np.array(ct.fit_transform(X))

In [54]:
for i in range(5):
    print(f'{X[i]}')
# The first 3 columns are the 'city' column one-hot encoded as 0/1 (order: California, Florida, New York).

[0.0 0.0 1.0 165349.2 136897.8 471784.1]
[1.0 0.0 0.0 162597.7 151377.59 443898.53]
[0.0 1.0 0.0 153441.51 101145.55 407934.54]
[0.0 0.0 1.0 144372.41 118671.85 383199.62]
[0.0 1.0 0.0 142107.34 91391.77 366168.42]


### Note:
In multiple linear regression there is no need to apply feature scaling, because each coefficient adapts to the scale of its feature: the coefficient **b<sub>1</sub>** in **b<sub>1</sub>X<sub>1</sub>** compensates for extreme values of **X<sub>1</sub>**.
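
A minimal sketch of this point, assuming synthetic data and a plain least-squares fit via `np.linalg.lstsq` (the helper `fit_predict` is a hypothetical name used only here): the coefficients change when the features are standardized, but the fitted values do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative data, not the startup dataset).
X = np.column_stack([rng.uniform(0, 1, 40), rng.uniform(0, 1e6, 40)])
y = 3.0 + 2.0 * X[:, 0] + 0.5e-3 * X[:, 1] + rng.normal(0, 0.1, 40)

def fit_predict(features, target):
    """Ordinary least squares with an intercept; returns coefficients and fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design @ coef

# Fit once on the raw features and once on standardized features.
coef_raw, pred_raw = fit_predict(X, y)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
coef_scaled, pred_scaled = fit_predict(X_scaled, y)

# The coefficients differ, but the fitted values agree up to floating-point error.
print(np.allclose(pred_raw, pred_scaled))
```

This holds for ordinary least squares; models with a penalty on the coefficients or with gradient-based training are a different case and are outside the scope of this note.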

### Splitting the dataset into training set and test set

In [55]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y, test_size=0.2, random_state=0)

### Training the model and predicting the test set

In [56]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
results = regressor.predict(x_test)
for i in range(10):
    diff = round(abs(results[i] - y_test[i]),3)  
    print(f'{round(results[i],3)}  for  {y_test[i]} \t diff ---> {diff}')

103015.202  for  103282.38 	 diff ---> 267.178
132582.278  for  144259.4 	 diff ---> 11677.122
132447.738  for  146121.95 	 diff ---> 13674.212
71976.099  for  77798.83 	 diff ---> 5822.731
178537.482  for  191050.39 	 diff ---> 12512.908
116161.242  for  105008.31 	 diff ---> 11152.932
67851.692  for  81229.06 	 diff ---> 13377.368
98791.734  for  97483.56 	 diff ---> 1308.174
113969.435  for  110352.25 	 diff ---> 3617.185
167921.066  for  166187.94 	 diff ---> 1733.126
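
The ten per-row differences can be condensed into summary numbers. The sketch below copies the predicted and actual values from the output above and computes the mean absolute error and R² by hand with NumPy; the variable names are assumptions made for this example.

```python
import numpy as np

# Predicted and actual values copied from the printed output above.
predicted = np.array([103015.202, 132582.278, 132447.738, 71976.099, 178537.482,
                      116161.242, 67851.692, 98791.734, 113969.435, 167921.066])
actual = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

# Mean absolute error: the average of the "diff" column.
mae = np.mean(np.abs(predicted - actual))

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.4f}")
```

An R² close to 1 means the model explains most of the variation in the test targets; with only ten test rows, the figure should be read as a rough indication rather than a precise estimate.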


### Visualising the results
We can't plot a multiple regression on a plane, because there are more than two variables.
<br>Instead, here are some NumPy methods for printing the predictions next to the actual values.

In [57]:
# Reshape each 1-D array into a column (horizontal --> vertical), then concatenate the columns side by side (axis=1).
np.set_printoptions(precision=2)
print(np.concatenate((results.reshape(len(results), 1), y_test.reshape(len(y_test), 1)), axis=1))


[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
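
The reshape-and-concatenate call above can be unpacked on a smaller example. The sketch uses the first three rows of the output and also shows `np.column_stack`, which gives the same result for 1-D inputs; the variable names are assumptions made for this example.

```python
import numpy as np

predicted = np.array([103015.2, 132582.28, 132447.74])
actual = np.array([103282.38, 144259.4, 146121.95])

# reshape(-1, 1) turns each 1-D array into a single column;
# axis=1 then joins the two columns side by side.
side_by_side = np.concatenate((predicted.reshape(-1, 1), actual.reshape(-1, 1)), axis=1)

# np.column_stack does the same for 1-D inputs in a single call.
stacked = np.column_stack((predicted, actual))

print(side_by_side.shape)                     # (3, 2)
print(np.array_equal(side_by_side, stacked))  # True
```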


## Note:
- `LinearRegression` from `sklearn.linear_model` fits every feature column it is given, so no manual model selection step is needed to train it and get predictions. Note that it does not remove statistically insignificant features on its own.
- When using other libraries or languages (such as R), one has to take the significance level and p-values into consideration <br> and decide manually which model to use.
- Optional: learn about <a href='./backwards_elimination.ipynb'>backward elimination</a> here.