## Multiple Linear Regression

$$ y_{predict} = b_{0} + b_{1}X_{1} + b_{2}X_{2} + ... + b_{n}X_{n} $$

1. $ y_{predict} $ = predicted value of the dependent variable
2. $ b_{0} $ = y-intercept (constant)
3. $ b_{1}, \dots, b_{n} $ = slope coefficients, one per independent variable
4. $ X_{1}, \dots, X_{n} $ = independent variables (predictors)
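As a small numeric illustration of the formula above, the sketch below evaluates $ y_{predict} $ for one observation. The coefficients and inputs are made-up numbers chosen for easy arithmetic, not values from the startups dataset used later.

```python
import numpy as np

# Hypothetical coefficients: intercept b0 and slopes b1..b3
b0 = 2.0
b = np.array([0.5, -1.0, 3.0])

# One observation with three independent variables X1..X3
x = np.array([4.0, 1.0, 2.0])

# y_predict = b0 + b1*X1 + b2*X2 + b3*X3
y_predict = b0 + np.dot(b, x)
print(y_predict)  # 2.0 + 2.0 - 1.0 + 6.0 = 9.0
```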

### Assumptions of linear regression

1. Linearity - Linear relationship between Y and each X
2. Homoscedasticity - Equal variance of the errors across all values of X
3. Multivariate Normality - Normality of error distribution
4. Independence - of observations. Includes "no autocorrelation"
5. Lack of Multicollinearity - Predictors are not correlated with each other
6. Outlier check - not an assumption as such, but extreme points can distort the fit
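One quick way to look for multicollinearity is to inspect the correlation matrix of the predictors. The sketch below uses synthetic data (an assumption for illustration only): `x3` is built as a near copy of `x1`, so that pair should show a correlation close to 1, while `x2` is independent of both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                          # independent of x1
x3 = 2.0 * x1 + rng.normal(scale=0.1, size=n)    # almost a rescaled copy of x1

# Columns are variables, rows are observations
X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

A pairwise check like this only catches collinearity between two predictors at a time; it is a first look, not a complete diagnostic.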

<img src="./assumption_of_linear_regression.png" height="400px">

### Statistical Significance

Consider a coin toss with two hypotheses:

1. Null Hypothesis $ H_{0} $ : This is a fair coin
2. Alternative Hypothesis $ H_{1} $ : This is not a fair coin

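Suppose we toss the coin repeatedly and it lands on tails every time. In a fair world, the probability of seeing that many tails in a row is
1. 1 tail in a row -> 50%
2. 2 tails in a row -> 25%
3. 3 tails in a row -> 12.5%
4. 4 tails in a row -> 6.25%
5. 5 tails in a row -> 3.125%
6. 6 tails in a row -> 1.5625%

The p-value keeps dropping: under $ H_{0} $ it is very unlikely that we get 6 tails in a row.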

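If we lived in the second universe of the alternative hypothesis, for example with a coin that has tails on both sides,
then the same run of tails would not be surprising at all: its probability would stay at 100% instead of dropping.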

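The significance level is $ \alpha = 0.05 $.

i.e. once the p-value falls below $ \alpha $ (here after the 5th tail in a row, 3.125% < 5%), the result is too unlikely for a fair world and we reject the null hypothesis. Note that $ \alpha $ is the chance of wrongly rejecting $ H_{0} $ when it is actually true, not the probability that either hypothesis is correct.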
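The coin-toss numbers above can be reproduced in a few lines. This is a minimal sketch of the one-sided reasoning used in this section (probability of $ n $ tails in a row under a fair coin, compared against $ \alpha $); the variable names are illustrative.

```python
alpha = 0.05  # significance level from the text

# Probability of n tails in a row if the coin is fair (H0)
p_values = [0.5 ** n for n in range(1, 7)]

for n, p in enumerate(p_values, start=1):
    verdict = "reject H0" if p < alpha else "cannot reject H0"
    print(f"{n} tails in a row: p = {p:.4%} -> {verdict}")

# First run length at which the p-value drops below alpha
first_rejected = next(n for n, p in enumerate(p_values, start=1) if p < alpha)
print(first_rejected)  # 5, since 0.03125 < 0.05 but 0.0625 is not
```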

### Building a model

There are 5 methods of building a model:
1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison

#### All-in

Throw in all variables. Use it when:
1. Prior knowledge of variables OR
2. You have to (necessity) OR
3. Preparing for Backward elimination


#### Backward Elimination

Step 1: Select a significance level to stay in the model (e.g. SL = 0.05) <br>
Step 2: Fit the full model with all possible predictors<br>
Step 3: Consider the predictor with the highest P-value. If P > SL, go to Step 4, otherwise go to FIN<br>
Step 4: Remove the predictor<br>
Step 5: Fit the model without this variable. Go to Step 3<br>

FIN: model is ready
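The steps above can be sketched in code. This is an illustrative implementation, not part of the original notebook: it assumes NumPy and SciPy are available, computes coefficient p-values from an ordinary least squares fit by hand, and runs on synthetic data in which `x2` has no real effect. The helper names `ols_p_values` and `backward_elimination` are hypothetical.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Two-sided t-test p-values for each coefficient of an OLS fit.

    X must already contain a leading column of ones for the intercept.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = n - k                                  # residual degrees of freedom
    sigma2 = residuals @ residuals / dof         # estimated error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # covariance of the estimates
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)


def backward_elimination(X, y, names, sl=0.05):
    """Return the names of the predictors that survive backward elimination."""
    keep = list(range(X.shape[1]))               # Step 2: start with all predictors
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        p = ols_p_values(design, y)[1:]          # ignore the intercept's p-value
        worst = int(np.argmax(p))                # Step 3: highest p-value
        if p[worst] <= sl:
            break                                # FIN: all remaining are significant
        del keep[worst]                          # Steps 4-5: remove, then refit
    return [names[i] for i in keep]


# Synthetic data: y depends on x0 and x1 only; x2 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

selected = backward_elimination(X, y, ["x0", "x1", "x2"])
print(selected)
```

With one-hot encoded columns, one level has to be dropped before using a routine like this, because a full set of dummy columns plus the intercept makes $ X^{T}X $ singular.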


#### Forward Selection

Step 1: Select a significance level to enter the model (e.g. SL = 0.05) <br>
Step 2: Fit all simple regression models y ~ $x_{n}$. Select the one with the lowest P-value<br>
Step 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have<br>
Step 4: Consider the newly added predictor with the lowest P-value. If P < SL, go to Step 3, otherwise go to FIN

FIN: Keep the previous model (the one from before the last, non-significant predictor was added)
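Forward selection can be sketched the same way. Again this is only an illustration under assumptions: NumPy and SciPy are available, p-values come from a hand-rolled OLS t-test, the data is synthetic, and the helper names `last_p_value` and `forward_selection` are hypothetical.

```python
import numpy as np
from scipy import stats


def last_p_value(X, y):
    """p-value of the last column's coefficient in an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - k
    sigma2 = residuals @ residuals / dof
    se = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[-1, -1])
    return 2 * stats.t.sf(abs(beta[-1] / se), dof)


def forward_selection(X, y, names, sl=0.05):
    """Return predictor names in the order forward selection adds them."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        # Steps 2-3: add each remaining predictor in turn to the current model
        p = {j: last_p_value(X[:, selected + [j]], y) for j in remaining}
        best = min(p, key=p.get)                 # Step 4: lowest p-value
        if p[best] >= sl:
            break                                # FIN: keep the previous model
        selected.append(best)
        remaining.remove(best)
    return [names[i] for i in selected]


# Synthetic data: y depends on x0 and x1 only; x2 is pure noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

selected = forward_selection(X, y, ["x0", "x1", "x2"])
print(selected)
```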


#### Bidirectional elimination

Step 1: Select a significance level to enter and to stay in the model, e.g. SLENTER = 0.05, SLSTAY = 0.05<br>
Step 2: Perform the next step of Forward Selection (new variables must have P < SLENTER to enter)<br>
Step 3: Perform all steps of Backward Elimination (old variables must have P < SLSTAY to stay)<br>
Step 4: Repeat Steps 2 and 3 until no new variables can enter and no old variables can exit

FIN: Your model is ready

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Importing dataset

In [2]:
dataset = pd.read_csv('./startups.csv')
X = dataset.drop(columns='Profit')
y = dataset['Profit']

In [3]:
X.head()

|   | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York |
| 1 | 162597.7 | 151377.59 | 443898.53 | California |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida |


In [4]:
y.head()

0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64

### Encoding categorical data

In [5]:
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# 
# ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), ['State'])], remainder='passthrough')
# 
# X = pd.DataFrame(ct.fit_transform(X))

X = pd.get_dummies(X, columns=['State'])

In [6]:
X.head()

|   | R&D Spend | Administration | Marketing Spend | State_California | State_Florida | State_New York |
|---|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | False | False | True |
| 1 | 162597.7 | 151377.59 | 443898.53 | True | False | False |
| 2 | 153441.51 | 101145.55 | 407934.54 | False | True | False |
| 3 | 144372.41 | 118671.85 | 383199.62 | False | False | True |
| 4 | 142107.34 | 91391.77 | 366168.42 | False | True | False |
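The three `State_*` columns always sum to 1, so together with an intercept they are perfectly collinear (the "dummy variable trap"), which touches the multicollinearity assumption listed earlier. The scikit-learn fit in this notebook still runs on such data, but p-value based procedures such as backward elimination usually need one level dropped. A minimal sketch using the `drop_first` option of `pd.get_dummies`, shown on a tiny hand-made frame (`demo`, a hypothetical name) rather than on `startups.csv`:

```python
import pandas as pd

# Tiny stand-in built from the first three rows shown above
demo = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51],
    "State": ["New York", "California", "Florida"],
})

# drop_first=True drops the first category in sorted order (California here),
# which then acts as the baseline level
encoded = pd.get_dummies(demo, columns=["State"], drop_first=True)
print(encoded.columns.tolist())
# ['R&D Spend', 'State_Florida', 'State_New York']
```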


### Splitting the dataset into Training set and Test set

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Training the multiple linear regression model on the training set

In [8]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
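After fitting, the estimated $ b_{0} $ and $ b_{1}, \dots, b_{n} $ are available as `intercept_` and `coef_`. The sketch below demonstrates this on noise-free synthetic data (an assumption made so the recovered values are easy to check); `X_demo`, `y_demo` and `demo_regressor` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 1 + 2*X1 - 3*X2 with no noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

demo_regressor = LinearRegression().fit(X_demo, y_demo)
print(demo_regressor.intercept_)  # estimate of b0
print(demo_regressor.coef_)       # estimates of b1 and b2
```

On the notebook's own model the same attributes would be read from `regressor`, with `coef_` ordered like the columns of `X_train`.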

### Predicting the Test set results

In [11]:
y_pred = regressor.predict(X_test)

np.set_printoptions(precision=2)
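# Side-by-side view of predicted vs. actual values:
# print(np.concatenate((y_pred.reshape(-1, 1), y_test.to_numpy().reshape(-1, 1)), axis=1))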
y_pred

array([103015.2 , 132582.28, 132447.74,  71976.1 , 178537.48, 116161.24,
        67851.69,  98791.73, 113969.44, 167921.07])

In [12]:
y_test

28    103282.38
11    144259.40
10    146121.95
41     77798.83
2     191050.39
27    105008.31
38     81229.06
31     97483.56
22    110352.25
4     166187.94
Name: Profit, dtype: float64
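To put a number on how close the predictions are, one common choice is the coefficient of determination $ R^{2} $ together with the mean absolute error. The notebook itself does not compute these; the sketch below does so with plain NumPy, using the ten predicted and actual values copied from the two outputs above.

```python
import numpy as np

# Values copied from the y_pred and y_test outputs above (same row order)
y_pred = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                   116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
mae = np.mean(np.abs(y_test - y_pred))           # mean absolute error

print(f"R^2 = {r2:.3f}, MAE = {mae:.2f}")
```

With only ten test rows these figures are a rough indication rather than a stable estimate of model quality.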