In [46]:
import pandas as pd
from pandas import Series, DataFrame
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

In [47]:
%matplotlib inline

In [48]:
# In this multiple regression exercise, we are given a dataset of 50 startup companies. It lists each company's
# profit for a financial year along with its spending pattern. A venture capital firm has hired you as a data
# scientist to help decide whether to invest in an unseen company, given its spending pattern.
# Remember, they want to maximize profit. This is not a yes/no classification: you predict the profit first
# and the decision is taken from that prediction, so it is a regression problem.
# Let's import our dataset
dataset = pd.read_csv('/home/rajatgirotra/study/machine_learning/course/MachineLearningA-ZTemplateFolder/Part2_Regression/Section5_MultipleLinearRegression/50_Startups.csv')

In [49]:
# we have 50 observations in our dataset
dataset

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94
5,131876.9,99814.71,362861.36,New York,156991.12
6,134615.46,147198.87,127716.82,California,156122.51
7,130298.13,145530.06,323876.68,Florida,155752.6
8,120542.52,148718.95,311613.29,New York,152211.77
9,123334.88,108679.17,304981.62,California,149759.96


In [50]:
# Just as the simple linear regression formula is y = mx + b, the multiple linear regression formula is
# y = b0 + b1x1 + b2x2 + b3x3 + ... (where b0 is the y-intercept and b1, b2, b3, etc. are the coefficients)
# Independent variables: x1, x2, x3, etc.
# Dependent variable: y
# Constant: b0
# Coefficients: b1, b2, b3, etc.
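As a small worked example of the formula above, the sketch below evaluates y = b0 + b1x1 + b2x2 + b3x3 for two rows with NumPy. The coefficient and feature values are made up for illustration and are not fitted from the startup data.

```python
import numpy as np

# Hypothetical coefficients (not fitted from the dataset)
b0 = 50000.0                      # constant (y-intercept)
b = np.array([0.8, -0.03, 0.03])  # b1, b2, b3

# Two hypothetical rows of features x1, x2, x3
X_demo = np.array([
    [100000.0, 120000.0, 300000.0],
    [0.0, 0.0, 0.0],
])

# y = b0 + b1*x1 + b2*x2 + b3*x3, computed for every row at once
y_demo = b0 + X_demo @ b
print(y_demo)
```

When every feature is zero, the prediction is just the constant b0, which is why b0 is called the y-intercept.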


In [51]:
# separate features and labels
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [52]:
# Change categorical data State to quantitative data
# VERY VERY VERY VERY IMPORTANT INFO TO FOLLOW
##########################################################################################
""" 
Converting categorical data to quantitative data is done by adding DUMMY VARIABLES. Assume your multiple linear
regression equation is

y = b0 + b1x1 + b2x2 + b3x3 + ....

where one categorical column (x3) is StateName with value either NewYork or California. You introduce dummy columns
NewYork and California, each with value 0 or 1 (i.e. like a switch). Each row has a 1 in exactly one of them.

So the equation now becomes

y = b0 + b1x1 + b2x2 + b4D4 + b5D5 (where x3, i.e. the state column, is split into dummy columns D4 = NewYork and D5 = California).

This equation is actually wrong, because the dummy columns behave like a switch: if D4 is 0, then D5 must be 1,
i.e. D5 = 1 - D4. D5 carries no information that D4 and the constant do not already provide (D4 + D5 always equals 1,
the same as the constant column), so the model cannot tell their coefficients apart. The equation should really be

y = b0 + b1x1 + b2x2 + b4D4, and when D4 is 0 it becomes y = b0 + b1x1 + b2x2 for California. We say that
the effect of California is absorbed into the constant b0. This redundancy is the dummy variable trap; never fall into it.

Always use n-1 dummy variables, where n is the number of categories in the column.




You have also read before about choosing good features. Don't use features that are useless. Using too many features
also makes the model hard to present to a large audience and hard to justify.
There are five methods for building a model:
1) All-in
2) Backward Elimination
3) Forward Selection
4) Bi-directional Elimination
5) Score comparison

Methods 2, 3 and 4 are together referred to as stepwise regression.

1) All-in means you use all the features, either because you already know they are your best predictors or as a starting point:
   a) You know that from domain knowledge
   b) or from your experience (you have built such a model before)
   c) or someone gave you those predictors and asked you to use them
   d) or you are preparing for Backward Elimination
   
2) Backward Elimination: This method has the following steps.
   a) Step 1: Select a Significance Level (SL), e.g. SL = 0.05 (i.e. 5%)
   b) Step 2: Fit the full model with all possible predictors (All-in)
   c) Step 3: Consider the predictor with the highest P-value. If P > SL, go to Step 4; otherwise go to FIN (i.e. finished, your model is ready)
   d) Step 4: Remove the predictor
   e) Step 5: Fit the model again without the predictor
   f) Step 6: Go back to Step 3

3) Forward Selection: More complex than BE.
   a) Step 1: Select a Significance Level (SL), e.g. SL = 0.05 (i.e. 5%)
   b) Step 2: For each predictor, fit a simple linear regression model. Then select the model with the lowest P-value.
   Example: say you have 4 predictors F1, F2, F3, F4. You fit 4 simple linear regression models, one each for F1, F2, F3, F4.
   Say F3 had the lowest P-value.
   c) Step 3: Keep this variable (F3), and fit all possible models with one extra predictor added to the ones you already have,
   i.e. we now create linear regression models with two variables where one variable is always F3, so the options are
   F1F3, F2F3, F4F3.
   d) Step 4: Consider the newly added predictor with the lowest P-value. If P < SL, go to step c); otherwise go to FIN and keep
   the previous model (the one without the last added predictor).
   Say this was F2F3 and the P-value was less than 0.05. So we repeat step c) and fit all possible models with one extra
   predictor added to the ones we already have, i.e. F1F2F3, F4F2F3. That is, we create linear regression models with 3 variables.
   Say this time the lowest P-value is for F1F2F3 and it is > 0.05. So we stop and keep F2F3.

4) Bi-directional Elimination: A combination of 2) and 3)
   a) Step 1: Select an SL to stay and an SL to enter, e.g. SLSTAY = 0.05 and SLENTER = 0.05
   b) Step 2: Perform the next step of Forward Selection (i.e. new variables must have P < SLENTER to enter)
   Same example as above: F3 is the predictor with the lowest P-value. Say P-value(F3) = 0.02
   c) Step 3: Perform ALL steps of BE (i.e. old variables must have P < SLSTAY to stay)
   At the first iteration P-value(F3) = 0.02 is < 0.05, so we go to FIN in BE
   d) Step 4: Go to step b). Say we fit F1F3, F2F3, F4F3 and F2F3 has the lowest P-value, 0.04
   e) Again at step c), P-value(F2F3) = 0.04 < 0.05, so we go to FIN in BE
   f) Again at step b), you add F1 to get F1F2F3; then at step c), say F2's P-value now exceeds SLSTAY, so you eliminate F2 and are left with F1F3.
   g) When no new variable can enter and no old variable can be removed, you are done.
   
5) Score Comparison: i.e. the brute-force approach.
   a) Select a criterion for your model, e.g. r-squared.
   b) Calculate the criterion for all possible combinations of predictors: 2^n - 1 models for n predictors
   c) Choose the model with the best criterion value.
   Not a good approach, as the number of models grows exponentially and fitting them all is very resource-consuming.

We will use backward elimination in our study as it is the fastest.
A word on p-value: Fill this section when you read more about the difference between experimental and theoretical probability.
"""

###########################################################################################
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
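The dummy variable trap described above can be checked numerically. In this sketch (the five state labels are a made-up sample), the constant column plus one dummy per category is rank-deficient, because the dummies always sum to the constant column; dropping one dummy restores full column rank.

```python
import numpy as np

states = np.array(["New York", "California", "Florida", "New York", "Florida"])
categories = np.unique(states)  # sorted: California, Florida, New York

# One dummy column per category (n dummies), plus the constant column x0 = 1
dummies = (states[:, None] == categories[None, :]).astype(float)
ones = np.ones((len(states), 1))
X_trap = np.hstack([ones, dummies])        # constant + 3 dummies = 4 columns
X_ok = np.hstack([ones, dummies[:, 1:]])   # drop the first dummy = 3 columns

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(X_trap.shape[1], rank_trap)  # more columns than rank: coefficients are not unique
print(X_ok.shape[1], rank_ok)      # rank equals the number of columns
```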

In [53]:
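le_X = LabelEncoder()
# NOTE: the categorical_features argument was removed in scikit-learn 0.22, so this cell needs an older release
ohe = OneHotEncoder(categorical_features=[3])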

In [54]:
# Column 3 is the state column
X[:, 3] = le_X.fit_transform(X[:, 3])

In [55]:
# Convert this categorical column by adding dummy variables.
X = ohe.fit_transform(X).toarray()

In [56]:
# Avoid the Dummy Variable Trap
X = X[:, 1:]
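On recent scikit-learn releases, where `OneHotEncoder` no longer accepts `categorical_features`, the same encoding can be sketched with `ColumnTransformer`. The three-row frame below copies the first rows of the dataset shown earlier; `drop="first"` is used here in place of the manual `X[:, 1:]` slice, and the names `sample` and `encoder` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

sample = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51],
    "Administration": [136897.8, 151377.59, 101145.55],
    "Marketing Spend": [471784.1, 443898.53, 407934.54],
    "State": ["New York", "California", "Florida"],
})

# One-hot encode State, dropping the first category (California) to avoid the
# dummy variable trap. The remaining columns pass through unchanged and are
# placed after the dummy columns.
encoder = ColumnTransformer(
    transformers=[("state", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
)
X_encoded = np.asarray(encoder.fit_transform(sample), dtype=float)
print(X_encoded)
```

The resulting column order is Florida, New York, R&D Spend, Administration, Marketing Spend.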

In [57]:
# separate training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [58]:
# Fit the Multiple Linear Regression model to our data set
from sklearn.linear_model import LinearRegression

In [59]:
regressor = LinearRegression()

In [60]:
regressor = regressor.fit(X_train, y_train)
regressor

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
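After fitting, the constant b0 and the coefficients b1, b2, ... are available as `intercept_` and `coef_`. The sketch below uses synthetic data (not the startup dataset) to check that `predict` is just the regression equation applied to each row.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 3))
# Synthetic target built from known coefficients plus a little noise
y_demo = 4.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=30)

model = LinearRegression().fit(X_demo, y_demo)

# predict() is b0 + b1*x1 + b2*x2 + b3*x3 for every row
manual = model.intercept_ + X_demo @ model.coef_
print(model.intercept_, model.coef_)
print(np.allclose(manual, model.predict(X_demo)))
```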

In [61]:
# Predict now. We cannot plot the result as we did in Simple Linear Regression, because we have four predictors
# (five columns after dummy encoding), and we cannot plot that many dimensions plus the label.
y_pred = regressor.predict(X_test)

In [62]:
print('y_test \n%s' % y_test)
print('y_pred \n%s' % y_pred)

y_test 
[ 103282.38  144259.4   146121.95   77798.83  191050.39  105008.31
   81229.06   97483.56  110352.25  166187.94]
y_pred 
[ 103015.20159796  132582.27760816  132447.73845175   71976.09851259
  178537.48221054  116161.24230163   67851.69209676   98791.73374688
  113969.43533012  167921.0656955 ]


In [63]:
# But what about Backward Elimination? Where did we use it?
# In the model we just built, we used all the independent variables. Some of them may be statistically
# significant (i.e. they have a real impact on the profit), while others may be statistically insignificant.
# We could drop the insignificant variables, and maybe our model results would improve.

# So the model above may not be optimal. Let's build the optimal model using Backward Elimination
import statsmodels.api as smf  # OLS lives in statsmodels.api in current releases; alias kept so later cells run unchanged
# The MLR equation is y = b0 + b1x1 + b2x2 + b3x3. statsmodels' OLS does not add the constant b0 by itself,
# so we have to change the equation to b0x0 + b1x1 + b2x2 + b3x3 ... with x0 = 1,
# i.e. we need to add a column of ones to X

In [64]:
X = np.append(arr = np.ones((50,1)).astype(int), values = X, axis = 1)

In [65]:
X

array([[  1.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          1.65349200e+05,   1.36897800e+05,   4.71784100e+05],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.62597700e+05,   1.51377590e+05,   4.43898530e+05],
       [  1.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          1.53441510e+05,   1.01145550e+05,   4.07934540e+05],
       [  1.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          1.44372410e+05,   1.18671850e+05,   3.83199620e+05],
       [  1.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          1.42107340e+05,   9.13917700e+04,   3.66168420e+05],
       [  1.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          1.31876900e+05,   9.98147100e+04,   3.62861360e+05],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.34615460e+05,   1.47198870e+05,   1.27716820e+05],
       [  1.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          1.30298130e+05,   1.45530060e+05,   3.23876680e+05],


In [66]:
# Start Backward Elimination
# We create a variable X_opt which will eventually contain only the features (predictors) that are statistically
# significant for the dependent variable, profit

# start with all predictors (Read the BE technique above)
X_opt = X[:, [0, 1, 2, 3, 4, 5]]

# Fit the model on X_opt. Note that we need a new regressor from the statsmodels library:
# OLS below stands for ordinary least squares

In [67]:
# See the smf.OLS? help and read the exog argument. An intercept is not included by default
# and should be added by the user. That's why we added the x0 = 1 to the equation above
regressor_ols = smf.OLS(endog = y, exog = X_opt).fit()

In [68]:
# Step 3. Look for the predictor with the highest P value.
regressor_ols.summary()

Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,169.9
Date:,"Wed, 22 Nov 2017",Prob (F-statistic):,1.34e-27
Time:,06:42:59,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1063.0
Df Residuals:,44,BIC:,1074.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.013e+04,6884.820,7.281,0.000,3.62e+04,6.4e+04
x1,198.7888,3371.007,0.059,0.953,-6595.030,6992.607
x2,-41.8870,3256.039,-0.013,0.990,-6604.003,6520.229
x3,0.8060,0.046,17.369,0.000,0.712,0.900
x4,-0.0270,0.052,-0.517,0.608,-0.132,0.078
x5,0.0270,0.017,1.574,0.123,-0.008,0.062

Omnibus:,14.782,Durbin-Watson:,1.283
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.266
Skew:,-0.948,Prob(JB):,2.41e-05
Kurtosis:,5.572,Cond. No.,1450000.0


In [69]:
# So x2 has the highest p-value, 0.990; remove column 2 from X_opt and refit.
# Also note that a p-value shown as 0.000 (e.g. for x3) is not exactly zero; it is just very close to zero.
X_opt = X[:, [0, 1, 3, 4, 5]]
regressor_ols = smf.OLS(endog = y, exog = X_opt).fit()
regressor_ols.summary()

Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.946
Method:,Least Squares,F-statistic:,217.2
Date:,"Wed, 22 Nov 2017",Prob (F-statistic):,8.49e-29
Time:,06:42:59,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1061.0
Df Residuals:,45,BIC:,1070.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.011e+04,6647.870,7.537,0.000,3.67e+04,6.35e+04
x1,220.1585,2900.536,0.076,0.940,-5621.821,6062.138
x2,0.8060,0.046,17.606,0.000,0.714,0.898
x3,-0.0270,0.052,-0.523,0.604,-0.131,0.077
x4,0.0270,0.017,1.592,0.118,-0.007,0.061

Omnibus:,14.758,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.172
Skew:,-0.948,Prob(JB):,2.53e-05
Kurtosis:,5.563,Cond. No.,1400000.0


In [70]:
# So x1 has the highest p-value, 0.940; remove column 1 from X_opt and refit
X_opt = X[:, [0, 3, 4, 5]]
regressor_ols = smf.OLS(endog = y, exog = X_opt).fit()
regressor_ols.summary()

Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,296.0
Date:,"Wed, 22 Nov 2017",Prob (F-statistic):,4.53e-30
Time:,06:42:59,Log-Likelihood:,-525.39
No. Observations:,50,AIC:,1059.0
Df Residuals:,46,BIC:,1066.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.012e+04,6572.353,7.626,0.000,3.69e+04,6.34e+04
x1,0.8057,0.045,17.846,0.000,0.715,0.897
x2,-0.0268,0.051,-0.526,0.602,-0.130,0.076
x3,0.0272,0.016,1.655,0.105,-0.006,0.060

Omnibus:,14.838,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.442
Skew:,-0.949,Prob(JB):,2.21e-05
Kurtosis:,5.586,Cond. No.,1400000.0


In [71]:
# Now x2 (column 4 of X, Administration) has the highest p-value, 0.602; remove column 4 from X_opt and refit
X_opt = X[:, [0, 3, 5]]
regressor_ols = smf.OLS(endog = y, exog = X_opt).fit()
regressor_ols.summary()

Dep. Variable:,y,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,450.8
Date:,"Wed, 22 Nov 2017",Prob (F-statistic):,2.16e-31
Time:,06:42:59,Log-Likelihood:,-525.54
No. Observations:,50,AIC:,1057.0
Df Residuals:,47,BIC:,1063.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,4.698e+04,2689.933,17.464,0.000,4.16e+04,5.24e+04
x1,0.7966,0.041,19.266,0.000,0.713,0.880
x2,0.0299,0.016,1.927,0.060,-0.001,0.061

Omnibus:,14.677,Durbin-Watson:,1.257
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.161
Skew:,-0.939,Prob(JB):,2.54e-05
Kurtosis:,5.575,Cond. No.,532000.0


In [72]:
# Now x2 (column 5 of X, Marketing Spend) has the highest p-value, 0.060, which is still above SL = 0.05; remove column 5 from X_opt and refit
X_opt = X[:, [0, 3]]
regressor_ols = smf.OLS(endog = y, exog = X_opt).fit()
regressor_ols.summary()

Dep. Variable:,y,R-squared:,0.947
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,849.8
Date:,"Wed, 22 Nov 2017",Prob (F-statistic):,3.50e-32
Time:,06:42:59,Log-Likelihood:,-527.44
No. Observations:,50,AIC:,1059.0
Df Residuals:,48,BIC:,1063.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,4.903e+04,2537.897,19.320,0.000,4.39e+04,5.41e+04
x1,0.8543,0.029,29.151,0.000,0.795,0.913

Omnibus:,13.727,Durbin-Watson:,1.116
Prob(Omnibus):,0.001,Jarque-Bera (JB):,18.536
Skew:,-0.911,Prob(JB):,9.44e-05
Kurtosis:,5.361,Cond. No.,165000.0


In [73]:
# So only column 3 of X (R&D Spend) is statistically significant for predicting profit.
# Let's make predictions using just that feature and compare them to the previous results
X_temp = dataset.loc[:, ['R&D Spend']].values
y_temp = dataset.iloc[:, -1].values

In [74]:
X_temp.shape

(50, 1)

In [75]:
X_opt_train, X_opt_test, y_opt_train, y_opt_test = train_test_split(X_temp, y_temp, test_size=0.2, random_state=0)
regressor_slr = LinearRegression()
regressor_slr.fit(X_opt_train, y_opt_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [76]:
y_opt_pred = regressor_slr.predict(X_opt_test)

In [77]:
print('actual profits\n%s' % y_opt_test)
print('profits predicted with MLR\n%s' % y_pred)
print('profits predicted with MLR with BE\n%s' % y_opt_pred)

actual profits
[ 103282.38  144259.4   146121.95   77798.83  191050.39  105008.31
   81229.06   97483.56  110352.25  166187.94]
profits predicted with MLR
[ 103015.20159796  132582.27760816  132447.73845175   71976.09851259
  178537.48221054  116161.24230163   67851.69209676   98791.73374688
  113969.43533012  167921.0656955 ]
profits predicted with MLR with BE
[ 104667.27805998  134150.83410578  135207.80019517   72170.54428856
  179090.58602508  109824.77386586   65644.27773757  100481.43277139
  111431.75202432  169438.14843539]
