<img src="../logo.png" alt="Drawing" style="width: 200px;"/>

### Assumptions in Linear Regression:

#### 1. Normally distributed Residuals
Residuals should be approximately normally distributed. This can be checked with a histogram of the residuals.
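
A minimal sketch of this check. The toy dataset, the text-based histogram, and the use of SciPy's Shapiro-Wilk test as a numeric complement are all assumptions made for illustration; none of them come from the notebook cells further down.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Assumed toy data; random_state is fixed only so the sketch is repeatable
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression().fit(X_demo, y_demo)
residuals = y_demo - model.predict(X_demo)

# Text histogram of the residuals: roughly bell-shaped if they are normal
counts, bin_edges = np.histogram(residuals, bins=10)
for count, left_edge in zip(counts, bin_edges[:-1]):
    print("{:8.2f} | {}".format(left_edge, "#" * int(count)))

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
statistic, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk statistic: {:.3f}, p-value: {:.3f}".format(statistic, p_value))
```

With a plotting library available, the same `residuals` array could be passed to a histogram plot instead of the text rendering.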

#### 2. Little to no Multicollinearity 
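Multiple regression assumes that the independent variables are not highly correlated with each other. This assumption is commonly checked with Variance Inflation Factor (VIF) values. One way to reduce multicollinearity is to center the variables by subtracting their means, which helps mainly when the correlation comes from interaction or polynomial terms built from the same variables.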
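
As a rough sketch, the VIF of a feature can be computed by regressing it on all the other features and taking 1 / (1 - R²). The helper name `variance_inflation_factors` and the toy columns `a`, `b`, `c` are hypothetical, introduced only for this example. Which VIF value counts as "too high" is a rule of thumb (5 or 10 are commonly quoted), not something the notebook specifies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R^2_j) for each column j of X.

    R^2_j is the R^2 from regressing column j on all the other columns.
    A perfectly collinear column gives R^2_j = 1 and an infinite VIF.
    """
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        r_squared = LinearRegression().fit(others, target).score(others, target)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Hypothetical toy data: c is almost a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
X_vif = np.column_stack([a, b, c])

vif_values = variance_inflation_factors(X_vif)
print(vif_values)  # large for a and c, close to 1 for b
```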

#### 3. Homoscedasticity
This assumption states that the variance of the error terms is similar across the values of the independent variables. A plot of standardized residuals versus predicted values shows whether the points are spread evenly across the range of predicted values.
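
The plot described above needs a plotting library, so the sketch below swaps in a crude numeric stand-in: it compares the spread of the residuals in the lower and upper halves of the predicted values. The helper `residual_spread_ratio` and both toy targets are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def residual_spread_ratio(X, y):
    """Residual std in the upper half of predictions divided by the lower half."""
    model = LinearRegression().fit(X, y)
    predicted = model.predict(X)
    residuals = y - predicted
    order = np.argsort(predicted)
    half = len(order) // 2
    low_std = residuals[order[:half]].std()
    high_std = residuals[order[half:]].std()
    return high_std / low_std

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
X_spread = x.reshape(-1, 1)

# Hypothetical targets: one with constant noise, one whose noise grows with x
y_constant = 3 * x + rng.normal(scale=1.0, size=500)
y_growing = 3 * x + rng.normal(scale=0.5 * x, size=500)

ratio_constant = residual_spread_ratio(X_spread, y_constant)
ratio_growing = residual_spread_ratio(X_spread, y_growing)
print("Constant noise ratio: {:.2f}".format(ratio_constant))
print("Growing noise ratio:  {:.2f}".format(ratio_growing))
```

A ratio near 1 is consistent with homoscedasticity, while a ratio well above or below 1 suggests the residual variance changes with the predicted value.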

### Dummy variable trap
This occurs when one-hot encoding (for example with OneHotEncoder) introduces redundant information. E.g., if there are two cities, New York and California, then a single City_New_York column with value 0 or 1 is enough to preserve the information. If you create two columns, City_New_York and City_California, both carry the same information with opposite values. This introduces multicollinearity. When there are many unrelated features, the model can learn a lot from them. But when there are fewer features, the redundancy makes the model unstable: the fitted coefficients can change drastically with a small change in the input values.

#### The dummy variable trap can be avoided by dropping one column from each set of dummy variables.
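
A small sketch of the idea, using `pandas.get_dummies` with `drop_first=True` rather than OneHotEncoder. The four-row `cities` table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the two categories from the example above
cities = pd.DataFrame({"City": ["New_York", "California", "New_York", "California"]})

# Full one-hot encoding: the two columns always sum to 1, so one is redundant
full = pd.get_dummies(cities, columns=["City"], dtype=int)
print(full)

# drop_first=True drops the first category (alphabetically, California)
reduced = pd.get_dummies(cities, columns=["City"], drop_first=True, dtype=int)
print(reduced)

# With an intercept column, the full encoding is rank deficient (rank < columns)
design_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)
print("full:    rank {} of {} columns".format(rank_full, design_full.shape[1]))
print("reduced: rank {} of {} columns".format(rank_reduced, design_reduced.shape[1]))
```

scikit-learn's `OneHotEncoder` offers a comparable `drop="first"` option for use inside a pipeline.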

In [5]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create made up regression dataset
X, y = make_regression(n_samples=100, n_features=20, noise=0.95)
# Create a table to view it
df = pd.DataFrame(X)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,1.202706,0.735514,-0.047838,-2.187794,0.839496,-0.304576,-0.896021,-0.077566,0.189385,-0.915344,-0.33609,0.641887,-2.130446,0.413751,0.703817,-0.417254,-1.524003,0.852331,-0.601299,-0.633775
1,-1.010311,0.593582,1.288254,-0.064443,0.739357,-0.297783,1.909916,1.160308,0.548023,-1.319249,0.466051,-0.035059,-0.99732,-1.563429,1.494439,-1.278499,-1.363314,1.736244,1.114452,0.320424
2,0.484955,-0.779991,-0.126014,-0.777282,0.258963,1.402295,-0.363904,-0.177925,-0.054856,-0.727496,1.672051,-0.261492,0.573812,-0.52603,0.099938,-1.606118,1.058471,-0.904352,0.449558,-0.800786
3,-1.2089,0.236642,-1.223117,1.135559,-1.091638,-0.091871,-1.461505,1.239335,-0.627592,0.610811,-1.410274,-0.248248,-0.661891,0.870438,-0.41842,-0.024334,0.442011,0.095342,1.930432,0.220077
4,0.264262,1.1781,0.05936,-0.403637,-0.146125,0.308881,-1.685824,0.951776,-0.315455,-1.347757,0.222088,-1.343091,0.387767,-0.759524,-0.140715,-0.141428,1.368421,0.172622,1.982343,1.073556


### Results on a dataset with no multicollinearity

In [6]:
# Cross-validation fits the regressor once per fold (10 times here)
# and returns the R^2 score of each fold
cv = cross_val_score(LinearRegression(), X, y, cv=10)
print("Mean: {}".format(cv.mean()))
print("Values: {}".format(cv))

Mean: 0.9999355901283936
Values: [0.99996062 0.99990841 0.99994439 0.99978415 0.99994816 0.99996504
 0.99994635 0.99996064 0.99997748 0.99996068]


In [7]:
# Create the dataset once again. This time, introduce high
# multicollinearity by giving the input matrix a low effective rank
X, y = make_regression(n_samples=100, n_features=20, noise=0.95, effective_rank=1)
df = pd.DataFrame(X)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,-0.023856,0.001896,-0.035178,0.010965,-0.041505,-0.005586,0.028241,0.04542,-0.029781,0.003428,0.009678,-0.007403,0.044065,0.015491,0.002831,0.016153,-0.032155,-0.003927,-0.016608,-0.044117
1,-0.003616,0.005709,-0.033838,-0.007398,0.007179,-0.000807,0.020536,0.032949,-0.002037,0.014506,0.046745,0.006286,0.014733,-0.003404,-0.008365,0.040983,0.016354,-0.020426,-0.014689,-0.031964
2,-0.041412,0.010144,-0.022963,-0.029249,-0.009652,-0.069712,0.047599,0.058922,-0.036025,0.011588,0.018672,0.012359,-0.049443,-0.004176,0.042588,-0.068628,0.011878,0.026099,0.00561,-0.054352
3,-0.009332,-0.011395,-0.036991,-0.023522,0.030445,0.047535,-0.001719,-0.014462,0.065454,0.000787,0.014055,-0.026952,0.040762,0.010221,0.081079,0.037464,-0.050339,0.056508,-0.002087,-0.030048
4,-0.085392,0.030378,0.013877,-0.00827,-0.051678,-0.087146,0.039542,0.067297,-0.045545,0.012721,-0.033489,0.006855,-0.059881,0.01891,0.020875,-0.035731,0.005325,0.029263,0.038071,-0.069311


### Results on dataset with high multicollinearity 

In [8]:
# Cross-validation fits the regressor once per fold (10 times here)
# and returns the R^2 score of each fold
cv = cross_val_score(LinearRegression(), X, y, cv=10)
print("Mean: {}".format(cv.mean()))
print("Values: {}".format(cv))


Mean: 0.9508621921561105
Values: [0.92019811 0.94449363 0.98850286 0.96867241 0.95279599 0.94459741
 0.93774898 0.92667373 0.94236117 0.98257764]
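
One hedged way to put a number on how collinear each input matrix is: compare condition numbers (largest singular value divided by the smallest). The `random_state=0` argument and the variable names below are assumptions added for repeatability; the cells above draw fresh random data on every run.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same settings as the cells above, with an assumed fixed seed
X_full, _ = make_regression(n_samples=100, n_features=20, noise=0.95,
                            random_state=0)
X_low, _ = make_regression(n_samples=100, n_features=20, noise=0.95,
                           effective_rank=1, random_state=0)

# A larger condition number means the columns are closer to linearly dependent
cond_full = np.linalg.cond(X_full)
cond_low = np.linalg.cond(X_low)
print("Condition number, default data:     {:.2f}".format(cond_full))
print("Condition number, effective_rank=1: {:.2f}".format(cond_low))
```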


<p>&copy; 2018 Stacklabs</p>
