## Assumptions of linear regression
1. Linearity
2. Homoscedasticity (constant variance of the errors)
3. Multivariate normality
4. Independence of errors
5. Lack of multicollinearity (i.e., no predictor is a linear combination of the other predictors)
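
One common way to check the multicollinearity assumption is the variance inflation factor (VIF): regress each predictor on the others and compute $1 / (1 - R^2)$. The sketch below is a minimal NumPy version on made-up synthetic data; the helper name `vif` and the data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic data (made up for illustration): x2 is nearly x0 + x1.
rng = np.random.default_rng(0)
x0, x1, x3 = rng.normal(size=(3, 500))
x2 = x0 + x1 + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([x0, x1, x2, x3]))
print(vifs.round(1))
```

A VIF near 1 suggests a predictor is roughly independent of the others; a frequently quoted rule of thumb treats values above about 10 as a sign of problematic collinearity.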

# Building a Model

## Backward Elimination

1. Select a significance level to stay in the model (sig = .05)
2. Fit the full model with all possible predictors
3. Consider the predictor with the highest p-value. If p > .05, go to step 4; otherwise, go to step 7.
4. Remove the predictor
5. Refit the model without that variable (the remaining coefficients will change)
6. Repeat steps 3-5
7. When the highest p-value is below .05, the model is ready.

## Forward Selection

1. Select a significance level to enter the model (sig = .05)
2. Fit all simple regression models y ~ $x_{n}$ and select the one with the lowest p-value
3. Keep this variable and fit all possible models with one extra predictor added
4. Consider the added predictor with the lowest p-value. If p < .05, repeat steps 3-4
5. When p > .05, discard the last candidate and keep the previous model; your model is ready

## Bidirectional Elimination (Step-wise Regression)

1. Select a significance level to enter and to stay in the model (e.g., sig_enter = .05, sig_stay = .05)
2. Perform the next step of Forward Selection (a new variable must have p < sig_enter to enter)
3. Perform ALL steps of Backward Elimination (an old variable must have p < sig_stay to stay)
4. Repeat steps 2-3
5. When no new variables can enter and no old variables can exit, your model is ready

## All Possible Models

1. Select a criterion of goodness of fit (e.g., the Akaike information criterion)
2. Construct all possible regression models: $2^{N} - 1$ total combinations for $N$ predictors
3. Select the one with the best criterion
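
As a rough illustration of the exhaustive approach, the sketch below enumerates every non-empty subset of predictors and scores each with a Gaussian AIC computed with NumPy. The helper names `gaussian_aic` and `all_subsets` and the synthetic data are assumptions; the AIC here omits an additive constant that is identical for every model fitted to the same data, so the ranking is unaffected.

```python
from itertools import combinations
import numpy as np

def gaussian_aic(X, y):
    """AIC of an OLS fit with intercept, up to an additive constant."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    k = design.shape[1]                           # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

def all_subsets(X, y):
    """Map each non-empty tuple of column indices to its AIC."""
    n_features = X.shape[1]
    scores = {}
    for size in range(1, n_features + 1):
        for cols in combinations(range(n_features), size):
            scores[cols] = gaussian_aic(X[:, list(cols)], y)
    return scores

# Synthetic data (made up): only columns 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 + 2 * X[:, 0] - 4 * X[:, 1] + rng.normal(size=200)

scores = all_subsets(X, y)
best = min(scores, key=scores.get)                # lowest AIC wins
print(len(scores), best)
```

With 5 predictors this fits 31 models; the count doubles with each extra predictor, which is why the approach is only practical for a small number of predictors.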

In [17]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [18]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [19]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
# label the selected column numerically
X[:, -1] = labelencoder_X.fit_transform(X[:, -1])
X

array([[165349.2, 136897.8, 471784.1, 2],
       [162597.7, 151377.59, 443898.53, 0],
       [153441.51, 101145.55, 407934.54, 1],
       [144372.41, 118671.85, 383199.62, 2],
       [142107.34, 91391.77, 366168.42, 1],
       [131876.9, 99814.71, 362861.36, 2],
       [134615.46, 147198.87, 127716.82, 0],
       [130298.13, 145530.06, 323876.68, 1],
       [120542.52, 148718.95, 311613.29, 2],
       [123334.88, 108679.17, 304981.62, 0],
       [101913.08, 110594.11, 229160.95, 1],
       [100671.96, 91790.61, 249744.55, 0],
       [93863.75, 127320.38, 249839.44, 1],
       [91992.39, 135495.07, 252664.93, 0],
       [119943.24, 156547.42, 256512.92, 1],
       [114523.61, 122616.84, 261776.23, 2],
       [78013.11, 121597.55, 264346.06, 0],
       [94657.16, 145077.58, 282574.31, 2],
       [91749.16, 114175.79, 294919.57, 1],
       [86419.7, 153514.11, 0.0, 2],
       [76253.86, 113867.3, 298664.47, 0],
       [78389.47, 153773.43, 299737.29, 2],
       [73994.56, 122782.75, 30331

In [20]:
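# NOTE: `categorical_features` was deprecated in scikit-learn 0.20 and removed in 0.22;
# on newer versions, wrap OneHotEncoder in a ColumnTransformer instead.
onehotencoder = OneHotEncoder(categorical_features = [-1])
# dummy code the selected column
X = onehotencoder.fit_transform(X).toarray()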
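
On recent scikit-learn versions the `categorical_features` argument is no longer available. A possible equivalent, assuming `ColumnTransformer` is used to one-hot encode only the State column, is sketched below. The three rows are copied from `dataset.head()`; the variable names `X_raw`, `ct`, and `X_encoded` are made up for the example. With this approach the separate `LabelEncoder` step is not needed, because `OneHotEncoder` accepts string categories directly.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows copied from dataset.head(); State is still a string here.
X_raw = np.array([
    [165349.2, 136897.8, 471784.1, 'New York'],
    [162597.7, 151377.59, 443898.53, 'California'],
    [153441.51, 101145.55, 407934.54, 'Florida'],
], dtype=object)

ct = ColumnTransformer(
    [('state', OneHotEncoder(), [3])],   # one-hot encode column 3 (State)
    remainder='passthrough',             # keep the numeric columns unchanged
    sparse_threshold=0,                  # always return a dense array
)
# Encoded dummy columns come first, followed by the passthrough columns.
X_encoded = ct.fit_transform(X_raw).astype(float)
print(X_encoded)
```

As in the original cell, the dummy columns end up at the front of the array, so the later `X[:, 1:]` slice would still drop the first dummy.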

In [21]:
# Avoid the dummy variable trap by dropping the first dummy column.
# (The regression library we use would take care of this, but we do it explicitly.)
X = X[:, 1:]

# Format numpy output, suppress scientific notation
# np.set_printoptions(precision=4,
#                        threshold=10000,
#                        linewidth=150,
#                         suppress=True)

X

array([[     0.  ,      1.  , 165349.2 , 136897.8 , 471784.1 ],
       [     0.  ,      0.  , 162597.7 , 151377.59, 443898.53],
       [     1.  ,      0.  , 153441.51, 101145.55, 407934.54],
       [     0.  ,      1.  , 144372.41, 118671.85, 383199.62],
       [     1.  ,      0.  , 142107.34,  91391.77, 366168.42],
       [     0.  ,      1.  , 131876.9 ,  99814.71, 362861.36],
       [     0.  ,      0.  , 134615.46, 147198.87, 127716.82],
       [     1.  ,      0.  , 130298.13, 145530.06, 323876.68],
       [     0.  ,      1.  , 120542.52, 148718.95, 311613.29],
       [     0.  ,      0.  , 123334.88, 108679.17, 304981.62],
       [     1.  ,      0.  , 101913.08, 110594.11, 229160.95],
       [     0.  ,      0.  , 100671.96,  91790.61, 249744.55],
       [     1.  ,      0.  ,  93863.75, 127320.38, 249839.44],
       [     0.  ,      0.  ,  91992.39, 135495.07, 252664.93],
       [     1.  ,      0.  , 119943.24, 156547.42, 256512.92],
       [     0.  ,      1.  , 114523.61,

In [22]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [23]:
# fit multiple regression model to the training set
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [24]:
# predict the test set results
y_pred= regressor.predict(X_test)
y_pred

array([103015.2016, 132582.2776, 132447.7385,  71976.0985, 178537.4822, 116161.2423,  67851.6921,  98791.7337, 113969.4353, 167921.0657])

In [25]:
y_test

array([103282.38, 144259.4 , 146121.95,  77798.83, 191050.39, 105008.31,  81229.06,  97483.56, 110352.25, 166187.94])
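
To put a number on how close the predictions are, one option is to compute the root mean squared error and $R^2$ on the test set. The sketch below copies the `y_pred` and `y_test` values printed above; the variable names `residuals`, `rmse`, and `r2` are assumptions for illustration.

```python
import numpy as np

# Values copied from the y_pred and y_test outputs above.
y_pred = np.array([103015.2016, 132582.2776, 132447.7385, 71976.0985, 178537.4822,
                   116161.2423, 67851.6921, 98791.7337, 113969.4353, 167921.0657])
y_test = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

residuals = y_test - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))
# R^2: share of the variance in y_test explained by the predictions
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(f"RMSE: {rmse:.0f}, R^2: {r2:.3f}")
```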

# Backward Elimination

In [26]:
# OLS lives in statsmodels.api (statsmodels.formula.api no longer exposes it in recent versions)
import statsmodels.api as sm

# statsmodels requires users to add a column x0 = 1 for the intercept
X = np.append(arr=np.ones((X.shape[0], 1)), values=X, axis=1)
X

array([[     1.  ,      0.  ,      1.  , 165349.2 , 136897.8 , 471784.1 ],
       [     1.  ,      0.  ,      0.  , 162597.7 , 151377.59, 443898.53],
       [     1.  ,      1.  ,      0.  , 153441.51, 101145.55, 407934.54],
       [     1.  ,      0.  ,      1.  , 144372.41, 118671.85, 383199.62],
       [     1.  ,      1.  ,      0.  , 142107.34,  91391.77, 366168.42],
       [     1.  ,      0.  ,      1.  , 131876.9 ,  99814.71, 362861.36],
       [     1.  ,      0.  ,      0.  , 134615.46, 147198.87, 127716.82],
       [     1.  ,      1.  ,      0.  , 130298.13, 145530.06, 323876.68],
       [     1.  ,      0.  ,      1.  , 120542.52, 148718.95, 311613.29],
       [     1.  ,      0.  ,      0.  , 123334.88, 108679.17, 304981.62],
       [     1.  ,      1.  ,      0.  , 101913.08, 110594.11, 229160.95],
       [     1.  ,      0.  ,      0.  , 100671.96,  91790.61, 249744.55],
       [     1.  ,      1.  ,      0.  ,  93863.75, 127320.38, 249839.44],
       [     1.  ,      0

In [27]:
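# start from the full model: intercept (0), two State dummies (1, 2),
# R&D Spend (3), Administration (4), Marketing Spend (5)
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
# create an object of the OLS (ordinary least squares) class and fit it
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()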

In [28]:
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Wed, 20 Feb 2019 | Prob (F-statistic): | 1.34e-27 |
| Time: | 12:01:46 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063 |
| Df Residuals: | 44 | BIC: | 1074 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.013e+04 | 6884.820 | 7.281 | 0.000 | 3.62e+04 | 6.4e+04 |
| x1 | 198.7888 | 3371.007 | 0.059 | 0.953 | -6595.030 | 6992.607 |
| x2 | -41.8870 | 3256.039 | -0.013 | 0.990 | -6604.003 | 6520.229 |
| x3 | 0.8060 | 0.046 | 17.369 | 0.000 | 0.712 | 0.900 |
| x4 | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| x5 | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.782 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.266 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 1.45e+06 |


In [29]:
# keep only the intercept (0), R&D Spend (3), and Marketing Spend (5)
X_opt = X[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.95 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 450.8 |
| Date: | Wed, 20 Feb 2019 | Prob (F-statistic): | 2.16e-31 |
| Time: | 12:01:55 | Log-Likelihood: | -525.54 |
| No. Observations: | 50 | AIC: | 1057 |
| Df Residuals: | 47 | BIC: | 1063 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.698e+04 | 2689.933 | 17.464 | 0.000 | 4.16e+04 | 5.24e+04 |
| x1 | 0.7966 | 0.041 | 19.266 | 0.000 | 0.713 | 0.880 |
| x2 | 0.0299 | 0.016 | 1.927 | 0.060 | -0.001 | 0.061 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.677 | Durbin-Watson: | 1.257 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.161 |
| Skew: | -0.939 | Prob(JB): | 2.54e-05 |
| Kurtosis: | 5.575 | Cond. No. | 5.32e+05 |
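
The adjusted R-squared and AIC figures in the two summaries can be cross-checked by hand from the reported R-squared, log-likelihood, and model sizes. The sketch below uses the standard formulas $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ and $\mathrm{AIC} = 2k - 2\ln L$; the function names are made up for the example, and because the inputs are the rounded values from the summaries, the results only match to about three significant digits.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def aic(log_likelihood, n_params):
    """Akaike information criterion; n_params counts the intercept too."""
    return 2 * n_params - 2 * log_likelihood

# Full model: 5 predictors plus intercept, values read from the first summary.
full = adjusted_r2(0.951, 50, 5), aic(-525.38, 6)
# Reduced model: 2 predictors plus intercept, values read from the second summary.
reduced = adjusted_r2(0.950, 50, 2), aic(-525.54, 3)

print(full)
print(reduced)
```

The reduced model has the higher adjusted R-squared and the lower AIC, which is consistent with dropping the three predictors that had large p-values.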


In [30]:
X_opt

array([[     1.  , 165349.2 , 471784.1 ],
       [     1.  , 162597.7 , 443898.53],
       [     1.  , 153441.51, 407934.54],
       [     1.  , 144372.41, 383199.62],
       [     1.  , 142107.34, 366168.42],
       [     1.  , 131876.9 , 362861.36],
       [     1.  , 134615.46, 127716.82],
       [     1.  , 130298.13, 323876.68],
       [     1.  , 120542.52, 311613.29],
       [     1.  , 123334.88, 304981.62],
       [     1.  , 101913.08, 229160.95],
       [     1.  , 100671.96, 249744.55],
       [     1.  ,  93863.75, 249839.44],
       [     1.  ,  91992.39, 252664.93],
       [     1.  , 119943.24, 256512.92],
       [     1.  , 114523.61, 261776.23],
       [     1.  ,  78013.11, 264346.06],
       [     1.  ,  94657.16, 282574.31],
       [     1.  ,  91749.16, 294919.57],
       [     1.  ,  86419.7 ,      0.  ],
       [     1.  ,  76253.86, 298664.47],
       [     1.  ,  78389.47, 299737.29],
       [     1.  ,  73994.56, 303319.26],
       [     1.  ,  67532.53, 3047