<a href="https://colab.research.google.com/github/NicoEssi/Machine_Learning_scikit-learn/blob/master/Multiple_Linear_Regression_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Linear Regression - Demo


---

Multiple linear regression models the relationship between a feature vector (X) and its label (y) as a linear combination of the features.

(y = a * X1 + b * X2 + ... + C, where a, b, ... are the slopes for the features X1, X2, ... and C is the constant, i.e. the intercept)
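
As a minimal sketch of the formula above, the snippet below evaluates it with NumPy for two features. The slopes, the constant, and the feature values are made-up numbers chosen only for illustration; they are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical slopes (a, b) and constant (C), chosen only for illustration
slopes = np.array([2.0, -1.0])
constant = 0.5

# Two made-up observations with two features each (X1, X2)
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# y = a * X1 + b * X2 + C, evaluated for every row at once
y_demo = X_demo @ slopes + constant
print(y_demo)  # [0.5 2.5]
```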

## Pros

*   Works on any size of dataset
*   Gives information about the relevance of each feature

## Cons

*   Difficult to visualize above three dimensions

It also assumes the following:
*   Linear relationship
*   Multivariate normality
*   Little to no multicollinearity
*   No autocorrelation
*   Homoscedasticity
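
One of these assumptions, little to no multicollinearity, can be checked numerically. The sketch below computes variance inflation factors (VIF) from the diagonal of the inverse correlation matrix. The helper name `variance_inflation_factors` and the synthetic columns are assumptions made for illustration. A VIF near 1 suggests a feature is close to uncorrelated with the others, while values above roughly 10 are commonly read as a warning sign.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (hypothetical helper).

    The VIFs are the diagonal entries of the inverse of the
    feature correlation matrix.
    """
    correlation = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(correlation))


# Synthetic data: `b` is almost a copy of `a`, while `c` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)

vif = variance_inflation_factors(np.column_stack([a, b, c]))
print(vif)  # large values for a and b, close to 1 for c
```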

---

## 1. Import dependencies and data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

!wget https://raw.githubusercontent.com/NicoEssi/Machine_Learning_scikit-learn/master/50_Startups.csv

--2019-08-23 19:09:03--  https://raw.githubusercontent.com/NicoEssi/Machine_Learning_scikit-learn/master/50_Startups.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2436 (2.4K) [text/plain]
Saving to: ‘50_Startups.csv’


2019-08-23 19:09:03 (40.2 MB/s) - ‘50_Startups.csv’ saved [2436/2436]



## 2. Read CSV and inspect

In [2]:
data = pd.read_csv("./50_Startups.csv")

X = data.iloc[:, :-1]
y = data.iloc[:, 4]

X = X.values # Convert the DataFrame to a NumPy array so the columns can be indexed and reassigned by position below

data.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


## 3. Encode categorical data

In [0]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X = LabelEncoder()
onehotencoder = OneHotEncoder(categorical_features = [3]) # the categorical_features keyword is deprecated and no longer accepted from v0.22; use ColumnTransformer instead
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
X = onehotencoder.fit_transform(X).toarray()

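# Address the dummy variable trap by dropping the first dummy column
X = X[:, 1:]
# LinearRegression would still fit with all dummy columns kept, but dropping the redundant one keeps the coefficients and the p-values in step 8 interpretable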
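
On scikit-learn 0.22 and later, the same encoding can be sketched with `ColumnTransformer`. The example below is an illustration rather than the notebook's original code. It uses the first five rows shown in step 2 (without the Profit column), and assumes `drop="first"` as a stand-in for the manual `X[:, 1:]` slice and `sparse_threshold=0` to force a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First five rows of the dataset: R&D Spend, Administration, Marketing Spend, State
X_demo = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
    [144372.41, 118671.85, 383199.62, "New York"],
    [142107.34, 91391.77, 366168.42, "Florida"],
], dtype=object)

transformer = ColumnTransformer(
    # One-hot encode column 3 (State); drop="first" removes the first
    # category (California) to avoid the dummy variable trap
    transformers=[("state", OneHotEncoder(drop="first"), [3])],
    remainder="passthrough",  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense array
)

# Encoded columns come first (Florida, New York), then the numeric columns
X_encoded = transformer.fit_transform(X_demo).astype(float)
print(X_encoded[:, :2])
```

The column order matches the notebook's approach: dummy columns first, followed by R&D Spend, Administration, and Marketing Spend.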

## 4. Feature scaling

In [0]:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X[:, 2:5] = sc_X.fit_transform(X[:, 2:5])

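# Not strictly necessary: LinearRegression does not scale features itself (normalize=False by default), but the predictions of ordinary least squares do not depend on the scale of the features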
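
The sketch below illustrates why scaling is optional for ordinary least squares: fitting on raw and on standardized features gives the same predictions, and only the coefficients change. The data is synthetic and the helper `ols_predictions` is a hypothetical name introduced for this example. It uses plain NumPy least squares rather than scikit-learn.

```python
import numpy as np


def ols_predictions(X, y):
    """Fit least squares with an intercept and return the in-sample predictions."""
    design = np.column_stack([np.ones(len(X)), X])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coefficients


# Synthetic features on a large scale, loosely resembling spend amounts
rng = np.random.default_rng(0)
X_raw = rng.uniform(1_000, 100_000, size=(50, 3))
y_demo = 50_000 + X_raw @ np.array([0.8, -0.03, 0.05]) + rng.normal(scale=1_000, size=50)

# Standardize each column to zero mean and unit variance
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

pred_raw = ols_predictions(X_raw, y_demo)
pred_scaled = ols_predictions(X_scaled, y_demo)
print(np.allclose(pred_raw, pred_scaled))
```

Scaling still matters for reading the coefficients: after standardization each slope describes the effect of a one-standard-deviation change in that feature.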

## 5. Split the dataset into training and test set

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## 6. Initialize and fit the multiple regression model

In [6]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

## 7. Predict with the trained model

In [0]:
y_predictions = regressor.predict(X_test)
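
The predictions are not evaluated in this notebook. As a minimal sketch of how they could be, the helpers below compute R-squared and the root mean squared error with NumPy. The function names and the example values are assumptions for illustration; in the notebook they would be called with `y_test` and `y_predictions`.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the label."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values, only to show the calls
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.0]
print(r_squared(y_true_demo, y_pred_demo))
print(rmse(y_true_demo, y_pred_demo))
```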

## 8. Optimize with backward elimination (significance level = 0.05)

In [0]:
import statsmodels.api as sm

# Add an intercept column of ones, since statsmodels' OLS does not add one automatically
X = np.append(arr = np.ones((X.shape[0], 1)).astype(int), values = X, axis = 1)

In [9]:
X_optimal = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Fri, 23 Aug 2019 | Prob (F-statistic): | 1.34e-27 |
| Time: | 19:09:30 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063.0 |
| Df Residuals: | 44 | BIC: | 1074.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.12e+05 | 2312.602 | 48.414 | 0.000 | 1.07e+05 | 1.17e+05 |
| x1 | 198.7888 | 3371.007 | 0.059 | 0.953 | -6595.030 | 6992.607 |
| x2 | -41.8870 | 3256.039 | -0.013 | 0.990 | -6604.003 | 6520.229 |
| x3 | 3.663e+04 | 2108.775 | 17.369 | 0.000 | 3.24e+04 | 4.09e+04 |
| x4 | -748.9975 | 1448.705 | -0.517 | 0.608 | -3668.671 | 2170.676 |
| x5 | 3266.2152 | 2075.251 | 1.574 | 0.123 | -916.178 | 7448.608 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.782 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.266 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 4.49 |


In [10]:
# remove x2 (column 2 of X, the New York dummy) as P>|t| = 0.990

X_optimal = X[:, [0, 1, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.946 |
| Method: | Least Squares | F-statistic: | 217.2 |
| Date: | Fri, 23 Aug 2019 | Prob (F-statistic): | 8.49e-29 |
| Time: | 19:09:33 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1061.0 |
| Df Residuals: | 45 | BIC: | 1070.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.119e+05 | 1613.655 | 69.372 | 0.000 | 1.09e+05 | 1.15e+05 |
| x1 | 220.1585 | 2900.536 | 0.076 | 0.940 | -5621.821 | 6062.138 |
| x2 | 3.662e+04 | 2080.207 | 17.606 | 0.000 | 3.24e+04 | 4.08e+04 |
| x3 | -748.7499 | 1432.394 | -0.523 | 0.604 | -3633.740 | 2136.240 |
| x4 | 3266.2019 | 2052.066 | 1.592 | 0.118 | -866.872 | 7399.276 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.758 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.172 |
| Skew: | -0.948 | Prob(JB): | 2.53e-05 |
| Kurtosis: | 5.563 | Cond. No. | 3.2 |


In [11]:
# remove x1 (column 1 of X, the Florida dummy) as P>|t| = 0.940

X_optimal = X[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 296.0 |
| Date: | Fri, 23 Aug 2019 | Prob (F-statistic): | 4.53e-30 |
| Time: | 19:09:35 | Log-Likelihood: | -525.39 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 46 | BIC: | 1066.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.12e+05 | 1305.649 | 85.791 | 0.000 | 1.09e+05 | 1.15e+05 |
| x1 | 3.661e+04 | 2051.533 | 17.846 | 0.000 | 3.25e+04 | 4.07e+04 |
| x2 | -743.7733 | 1415.345 | -0.526 | 0.602 | -3592.715 | 2105.168 |
| x3 | 3296.2630 | 1991.607 | 1.655 | 0.105 | -712.633 | 7305.159 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.838 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.442 |
| Skew: | -0.949 | Prob(JB): | 2.21e-05 |
| Kurtosis: | 5.586 | Cond. No. | 2.78 |


In [12]:
# remove x2 (column 4 of X, Administration) as P>|t| = 0.602

X_optimal = X[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.950 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 450.8 |
| Date: | Fri, 23 Aug 2019 | Prob (F-statistic): | 2.16e-31 |
| Time: | 19:09:38 | Log-Likelihood: | -525.54 |
| No. Observations: | 50 | AIC: | 1057.0 |
| Df Residuals: | 47 | BIC: | 1063.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.12e+05 | 1295.556 | 86.459 | 0.000 | 1.09e+05 | 1.15e+05 |
| x1 | 3.62e+04 | 1878.872 | 19.266 | 0.000 | 3.24e+04 | 4e+04 |
| x2 | 3620.6842 | 1878.872 | 1.927 | 0.060 | -159.118 | 7400.487 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.677 | Durbin-Watson: | 1.257 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.161 |
| Skew: | -0.939 | Prob(JB): | 2.54e-05 |
| Kurtosis: | 5.575 | Cond. No. | 2.5 |


In [13]:
# remove x2 (column 5 of X, Marketing Spend) as P>|t| = 0.060 is above the 0.05 significance level

X_optimal = X[:, [0, 3]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 849.8 |
| Date: | Fri, 23 Aug 2019 | Prob (F-statistic): | 3.50e-32 |
| Time: | 19:09:40 | Log-Likelihood: | -527.44 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 48 | BIC: | 1063.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.12e+05 | 1331.673 | 84.114 | 0.000 | 1.09e+05 | 1.15e+05 |
| x1 | 3.882e+04 | 1331.673 | 29.151 | 0.000 | 3.61e+04 | 4.15e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 13.727 | Durbin-Watson: | 1.116 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 18.536 |
| Skew: | -0.911 | Prob(JB): | 9.44e-05 |
| Kurtosis: | 5.361 | Cond. No. | 1.0 |


## End. Model now optimized