<a href="https://colab.research.google.com/github/thekkanathashish95/Projects/blob/master/Multiple_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Linear Regression

The dummy variable trap arises when every dummy variable of a categorical feature is included in the model, which gives rise to multicollinearity. Multicollinearity occurs when one independent variable can be predicted from one or more of the others. For instance, D1 = 1 - D2, where D1 and D2 are the dummy variables for France and England.

- Always omit one dummy variable per categorical feature
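
A minimal sketch of the trap, using NumPy only. The country labels and the five-row sample are made up for illustration: with an intercept column plus both dummies the design matrix loses a rank, and dropping one dummy restores full column rank.

```python
import numpy as np

# Hypothetical sample: the country of each row (illustrative data only)
countries = np.array(["France", "England", "France", "England", "France"])

# One dummy column per category
d_france = (countries == "France").astype(float)
d_england = (countries == "England").astype(float)
intercept = np.ones_like(d_france)

# D1 = 1 - D2 holds exactly, so the columns are linearly dependent
relation_holds = np.allclose(d_france, 1 - d_england)

# Intercept + both dummies: 3 columns but only rank 2 (the trap)
X_trap = np.column_stack([intercept, d_france, d_england])
rank_trap = np.linalg.matrix_rank(X_trap)

# Intercept + one dummy: 2 columns and rank 2 (full column rank)
X_ok = np.column_stack([intercept, d_france])
rank_ok = np.linalg.matrix_rank(X_ok)

print(relation_holds)
print(X_trap.shape[1], rank_trap)
print(X_ok.shape[1], rank_ok)
```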

**Statistical Significance**

If the p-value is less than 5% (or any other chosen significance level), we reject the null hypothesis in favor of the alternative hypothesis.
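
As a small illustration of this decision rule, the sketch below converts a test statistic into a two-sided p-value and compares it with the significance level. The function names and the example statistics are hypothetical, and the standard normal distribution is an assumption of the sketch; regression software typically uses a t distribution instead.

```python
import math

def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a test statistic z under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def reject_null(p_value: float, significance_level: float = 0.05) -> bool:
    """Reject the null hypothesis when the p-value is below the chosen level."""
    return p_value < significance_level

p_strong = two_sided_p_value(2.5)  # statistic far from zero
p_weak = two_sided_p_value(0.8)    # statistic close to zero

print(reject_null(p_strong))
print(reject_null(p_weak))
```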


**5 Methods of Building a Model**

1. All-in - include every variable; we do this when prior knowledge tells us that all of them are relevant
2. Backward Elimination 
3. Forward Selection
4. Bi-directional Elimination
5. Score Comparison

**Backward Elimination Steps**

1. Select a significance level (SL), e.g. 0.05
2. Fit the model with all possible predictors
3. Consider the predictor with the highest p-value. If P > SL, go to step 4; otherwise finish
4. Remove the predictor
5. Fit the model without the removed variable, then return to step 3
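
The steps above can be sketched with NumPy alone. Everything here is illustrative: `ols_p_values` and `backward_elimination` are hypothetical helper names, the data is synthetic, column 0 is treated as an intercept that is never removed, and the p-values use a normal approximation to the t distribution (a simplification that is reasonable for large samples but not what a statistics package would report).

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS (X already contains an intercept column) and return
    approximate two-sided p-values, one per column of X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance
    z = beta / np.sqrt(np.diag(cov))
    # Assumption: normal approximation instead of the exact t distribution
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def backward_elimination(X, y, sl=0.05):
    """Return the column indices of X that survive backward elimination."""
    kept = list(range(X.shape[1]))             # start with all predictors
    while len(kept) > 1:
        p = ols_p_values(X[:, kept], y)        # fit the current model
        worst = 1 + int(np.argmax(p[1:]))      # highest p-value, skipping the intercept
        if p[worst] <= sl:
            break                              # every predictor is significant
        kept.pop(worst)                        # remove the predictor and refit
    return kept

# Synthetic data: y depends on columns 1 and 2 but not on column 3
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 3))
y = 3.0 + 2.0 * features[:, 0] - 1.5 * features[:, 1] + rng.normal(size=n)
X = np.column_stack([np.ones(n), features])    # column 0 is the intercept

kept = backward_elimination(X, y)
print(kept)
```

The two informative columns always survive; whether the noise column is dropped depends on its p-value in the particular sample.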

**Steps in forward selection**

1. Select significance level
2. Fit all simple regression models and select the one with the lowest p-value
3. Keep this variable, and fit all possible models with one extra predictor added to the ones you already have
4. Consider the predictor with the lowest p-value. If P < SL, go to step 3; otherwise finish and keep the previous model
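
A hedged sketch of forward selection, again using only NumPy. The helper names `ols_p_values` and `forward_selection` are hypothetical, the data is synthetic, the intercept (column 0) is always included, and the p-values rely on a normal approximation rather than the exact t distribution.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Approximate two-sided p-values for an OLS fit; X includes the intercept."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    z = beta / np.sqrt(np.diag(cov))
    # Assumption: normal approximation instead of the exact t distribution
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def forward_selection(X, y, sl=0.05):
    """Return the selected column indices, starting from the intercept only."""
    selected = [0]
    remaining = list(range(1, X.shape[1]))
    while remaining:
        # Fit one model per candidate and record the candidate's own p-value
        trials = []
        for j in remaining:
            p = ols_p_values(X[:, selected + [j]], y)
            trials.append((p[-1], j))
        best_p, best_j = min(trials)
        if best_p >= sl:
            break                      # no candidate is significant: keep the previous model
        selected.append(best_j)        # keep this variable and try to add another
        remaining.remove(best_j)
    return selected

# Synthetic data: y depends on columns 1 and 2 but not on column 3
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 3))
y = 3.0 + 2.0 * features[:, 0] - 1.5 * features[:, 1] + rng.normal(size=n)
X = np.column_stack([np.ones(n), features])

selected = forward_selection(X, y)
print(selected)
```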

**Bi-directional elimination**

1. Select a significance level to enter and a significance level to stay in the model, e.g. SLENTER = 0.05, SLSTAY = 0.05
2. Perform the next step of forward selection (a new variable must have P < SLENTER to enter)
3. Perform all the steps of backward elimination (old variables must have P < SLSTAY to stay)
4. Repeat steps 2 and 3 until no new variable can enter and no old variable can exit
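
The loop above could look roughly like the following. This is a sketch under stated assumptions, not a reference implementation: `ols_p_values` and `stepwise_selection` are hypothetical names, the p-values use a normal approximation, the intercept in column 0 is never removed, and an iteration cap is added as a safeguard against cycling.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Approximate two-sided p-values for an OLS fit; X includes the intercept."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    z = beta / np.sqrt(np.diag(cov))
    # Assumption: normal approximation instead of the exact t distribution
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def stepwise_selection(X, y, sl_enter=0.05, sl_stay=0.05):
    """Alternate one forward step with a full backward pass until nothing changes."""
    k = X.shape[1]
    selected = [0]
    for _ in range(10 * k):                    # safeguard against cycling
        changed = False
        # One forward-selection step: the best candidate enters if P < SLENTER
        remaining = [j for j in range(1, k) if j not in selected]
        trials = [(ols_p_values(X[:, selected + [j]], y)[-1], j) for j in remaining]
        if trials:
            best_p, best_j = min(trials)
            if best_p < sl_enter:
                selected.append(best_j)
                changed = True
        # Full backward pass: a variable stays only while P < SLSTAY
        while len(selected) > 1:
            p = ols_p_values(X[:, selected], y)
            worst = 1 + int(np.argmax(p[1:]))
            if p[worst] < sl_stay:
                break
            selected.pop(worst)
            changed = True
        if not changed:                        # nothing entered and nothing exited
            break
    return selected

# Synthetic data: y depends on columns 1 and 2 but not on column 3
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 3))
y = 3.0 + 2.0 * features[:, 0] - 1.5 * features[:, 1] + rng.normal(size=n)
X = np.column_stack([np.ones(n), features])

selected = stepwise_selection(X, y)
print(selected)
```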


**Feature scaling is not needed for multiple linear regression: each coefficient multiplies its feature, so the coefficients compensate for features on different scales**
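
A quick way to check this claim is to fit the same least-squares model on raw and on standardized features. The sketch below uses synthetic data and a hypothetical `fit_predict` helper built on `np.linalg.lstsq`; the predictions match, and each slope is simply rescaled by the standard deviation of its feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Hypothetical features on very different scales
X = np.column_stack([rng.uniform(0, 1, n), rng.uniform(0, 100_000, n)])
y = 5.0 + 3.0 * X[:, 0] + 0.0002 * X[:, 1] + rng.normal(0, 0.1, n)

def fit_predict(features, target):
    """Ordinary least squares with an intercept; returns coefficients and fitted values."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef, A @ coef

# Fit on the raw features
coef_raw, pred_raw = fit_predict(X, y)

# Fit on standardized features (zero mean, unit variance)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
coef_scaled, pred_scaled = fit_predict(X_scaled, y)

# Same predictions; the slopes absorb the change of scale
print(np.allclose(pred_raw, pred_scaled))
print(np.allclose(coef_scaled[1:], coef_raw[1:] * X.std(axis=0)))
```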

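**We don't have to manually remove a dummy variable here: scikit-learn's `LinearRegression` class still produces valid predictions when all dummy columns are kept, because its least-squares solver tolerates the redundant column. The individual dummy coefficients are then not uniquely determined, so omit one dummy whenever the coefficients or p-values need to be interpreted**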
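
To see why keeping every dummy column does not hurt the predictions, the sketch below fits a least-squares model with and without one dummy and compares the fitted values. It uses `np.linalg.lstsq` on synthetic data as a stand-in; treating this as representative of what the regression class does internally is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
state = rng.integers(0, 3, n)            # three hypothetical categories
dummies = np.eye(3)[state]               # one-hot encoding with all 3 columns
spend = rng.uniform(0, 100, n)
y = 10 + 2 * spend + np.array([0.0, 5.0, -3.0])[state] + rng.normal(0, 1, n)

ones = np.ones(n)
X_all = np.column_stack([ones, dummies, spend])          # all dummies: rank deficient
X_drop = np.column_stack([ones, dummies[:, 1:], spend])  # one dummy dropped

# For a rank-deficient matrix, lstsq returns the minimum-norm solution;
# rcond is set explicitly so the redundant direction is treated as zero
beta_all, *_ = np.linalg.lstsq(X_all, y, rcond=1e-10)
beta_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=1e-10)

same_predictions = np.allclose(X_all @ beta_all, X_drop @ beta_drop)
print(np.linalg.matrix_rank(X_all), X_all.shape[1])
print(same_predictions)
```

The coefficients of the two fits differ, since the redundant column lets the dummy effects be shifted into the intercept, but the fitted values are the same.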



## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
X

array([[165349.2, 136897.8, 471784.1, 'New York'],
       [162597.7, 151377.59, 443898.53, 'California'],
       [153441.51, 101145.55, 407934.54, 'Florida'],
       [144372.41, 118671.85, 383199.62, 'New York'],
       [142107.34, 91391.77, 366168.42, 'Florida'],
       [131876.9, 99814.71, 362861.36, 'New York'],
       [134615.46, 147198.87, 127716.82, 'California'],
       [130298.13, 145530.06, 323876.68, 'Florida'],
       [120542.52, 148718.95, 311613.29, 'New York'],
       [123334.88, 108679.17, 304981.62, 'California'],
       [101913.08, 110594.11, 229160.95, 'Florida'],
       [100671.96, 91790.61, 249744.55, 'California'],
       [93863.75, 127320.38, 249839.44, 'Florida'],
       [91992.39, 135495.07, 252664.93, 'California'],
       [119943.24, 156547.42, 256512.92, 'Florida'],
       [114523.61, 122616.84, 261776.23, 'New York'],
       [78013.11, 121597.55, 264346.06, 'California'],
       [94657.16, 145077.58, 282574.31, 'New York'],
       [91749.16, 114175.79, 29491

## Encoding categorical data

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X

array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [1.0, 0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [0.0, 1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [1.0, 0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [0.0, 1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [1.0, 0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [0.0, 1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [1.0, 0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [0.0, 1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [1.0, 0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 0.0, 1.0, 94657.16, 145077.58

## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training the Multiple Linear Regression model on the Training set

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
regressor.predict(X_test)

array([103015.20159795, 132582.27760816, 132447.73845175,  71976.09851258,
       178537.48221057, 116161.24230167,  67851.69209676,  98791.73374687,
       113969.43533014, 167921.06569552])

## Predicting the Test set results

In [None]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)  # display floats with two decimal places
# reshape(rows, columns) turns each 1-D array into a single column;
# axis=1 then joins the two columns side by side (predicted vs. actual)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]


In [None]:

print(np.concatenate((y_pred, y_test)))

[103015.2  132582.28 132447.74  71976.1  178537.48 116161.24  67851.69
  98791.73 113969.44 167921.07 103282.38 144259.4  146121.95  77798.83
 191050.39 105008.31  81229.06  97483.56 110352.25 166187.94]
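
The two printouts differ only in how the arrays are joined. The following sketch reproduces the difference on small made-up arrays standing in for `y_pred` and `y_test`, and also shows `np.column_stack` as a shorter way to get the side-by-side layout.

```python
import numpy as np

# Small hypothetical arrays standing in for y_pred and y_test
pred = np.array([1.0, 2.0, 3.0])
actual = np.array([1.5, 2.5, 3.5])

# Without reshaping, 1-D arrays are joined end to end
flat = np.concatenate((pred, actual))

# Reshaping to (n, 1) and joining along axis 1 puts them side by side
side_by_side = np.concatenate((pred.reshape(-1, 1), actual.reshape(-1, 1)), axis=1)

# np.column_stack builds the same two-column array in one call
stacked = np.column_stack((pred, actual))

print(flat.shape)          # (6,)
print(side_by_side.shape)  # (3, 2)
```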
