# Linear Regression
Topics covered in this chapter of the book:

* 3.1 Simple Linear Regression
  * 3.1.1 Estimating the Coefficients
  * 3.1.2 Assessing the Accuracy of the Coefficient Estimates
  * 3.1.3 Assessing the Accuracy of the Model
* 3.2 Multiple Linear Regression
  * 3.2.1 Estimating the Regression Coefficients
  * 3.2.2 Some Important Questions
* 3.3 Other Considerations in the Regression Model
  * 3.3.1 Qualitative Predictors
  * 3.3.2 Extensions of the Linear Model
  * 3.3.3 Potential Problems
* 3.4 The Marketing Plan
* 3.5 Comparison of Linear Regression with K-Nearest Neighbors

**The following is a summary of the concepts, along with data and Python code.**


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import math

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import *
from sklearn import datasets, linear_model

### Understand Linear Regression
**Linear regression (LR)** is an approach for predicting a quantitative response Y from one or more predictor variables X1, ..., Xp, assuming a linear relationship between the predictors and Y. Mathematically, we can write this linear relationship as

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

$\beta_0, \beta_1, \dots, \beta_p$ are known as the model coefficients or parameters.

The ordinary least squares (OLS) approach chooses $\beta_0, \beta_1, \dots, \beta_p$ to minimize the residual sum of squares (RSS), that is, the sum of squared differences between the actual and predicted values of Y.
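
As a minimal sketch of what OLS does, the snippet below fits the coefficients with NumPy and evaluates the RSS. The data are synthetic, and the names `X`, `y`, `beta_hat`, and `rss` are chosen for illustration only; they are not part of the chapter's datasets.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution: the coefficients that minimize the RSS
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)


def rss(beta):
    """Residual sum of squares for a given coefficient vector."""
    residuals = y - X @ beta
    return float(residuals @ residuals)


print("estimated coefficients:", beta_hat)
print("RSS at the OLS estimate:", rss(beta_hat))
print("RSS after nudging one coefficient:", rss(beta_hat + np.array([0.0, 0.1, 0.0])))
```

Any change to the fitted coefficients increases the RSS, which is what "least squares" means in practice.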

Some important questions in linear regression:

* *Is there a relationship between the response and the predictors?*
The F-statistic addresses this question. It is computed as $F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$, where TSS is the total sum of squares, n is the number of observations, and p is the number of predictors. A value well above 1 is evidence that at least one predictor is related to the response.

* *Deciding on important variables, also known as variable selection.*
The p-value of each variable is a good indicator but not the only one. When the number of predictors p is large, some individual p-values will be small purely by chance, so we are likely to make some false discoveries. There are three classical approaches to this task:
  * **Forward selection**: Start from the null model (intercept only) and repeatedly add the variable that gives the lowest RSS, until a stopping rule is satisfied.
  * **Backward selection**: Start with all variables and repeatedly remove the variable with the largest p-value, refitting after each removal, until a stopping rule is reached (for example, all remaining variables have a p-value below some threshold).
  * **Mixed selection**: A combination of the two. Start with the null model and add variables as in forward selection; if the p-value of any variable in the model rises above a threshold, remove that variable. Continue these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.

* *Model fit.*
Two of the most common numerical measures of model fit are the residual standard error (RSE) and $R^2$, the fraction of variance explained. An $R^2$ value close to 1 indicates that the model explains a large portion of the variance in the response variable.
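
The F-statistic, RSE, and R² discussed above can all be computed directly from TSS and RSS. The following is a small sketch on synthetic data (the data and variable names are assumptions for illustration); `lm.summary()` in statsmodels reports comparable quantities for a fitted model.

```python
import numpy as np

# Synthetic data (assumed for illustration): only x1 truly affects y
rng = np.random.default_rng(1)
n, p = 100, 2
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

tss = float(np.sum((y - y.mean()) ** 2))  # total sum of squares
rss = float(np.sum((y - fitted) ** 2))    # residual sum of squares

f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
rse = float(np.sqrt(rss / (n - p - 1)))   # residual standard error
r2 = 1.0 - rss / tss                      # fraction of variance explained

print(f"F = {f_stat:.1f}, RSE = {rse:.3f}, R^2 = {r2:.3f}")
```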

In [None]:
### Multiple regression with 2 predictors: assess the quality of the model using statistics such as p-values and R-squared.
boston = pd.read_csv('../data/Boston.csv', header=0)
print(boston.shape)  # (rows, columns); printed explicitly because it is not the last expression in the cell
lm = smf.ols('medv ~ lstat + age', data=boston).fit()
print(lm.summary())

In [None]:
### Multiple regression with all predictors in the data: assess the quality of the model using statistics such as p-values and R-squared.
formula = "medv~" + "+".join(boston.columns.drop(["medv"]))
lm = smf.ols(formula, data=boston).fit()
lm.summary()
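
Rather than fitting all predictors at once, the forward selection procedure described earlier can be sketched as a greedy loop. This version uses NumPy on synthetic data and stops after a fixed number of variables; the `max_vars` stopping rule and the helper names are simplifying assumptions, not the book's procedure, and in practice the stopping point would be chosen with a criterion such as p-values or cross-validation.

```python
import numpy as np


def rss_of(X, y):
    """RSS of the least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def forward_selection(X, y, max_vars):
    """Greedy forward selection: at each step add the column that lowers the RSS most."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    while remaining and len(selected) < max_vars:
        scores = {
            j: rss_of(np.hstack([intercept, X[:, selected + [j]]]), y)
            for j in remaining
        }
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data (assumed): only columns 0 and 3 influence the response
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=300)

print(forward_selection(X, y, max_vars=2))
```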

________

### Qualitative Predictors

These are predictors that take one of two or more fixed categories of values; they are also known as factors. A factor with only two levels can be encoded with a single dummy variable. Examples include gender and ethnicity.

With such variables, each level except one (the baseline) is represented by a separate dummy predictor. For example, if our ethnicity variable has 3 categories (Asian, Caucasian, and African American), then the model with only the ethnicity variable would be

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$, where
* $x_{i1} = 1$ if the ith person is Asian, else 0
* $x_{i2} = 1$ if the ith person is Caucasian, else 0
* $x_{i1} = 0$ and $x_{i2} = 0$ if the ith person is African American (the baseline)

making $y_i$ equal to
* $\beta_0 + \beta_1 + \epsilon_i$ if the ith person is Asian
* $\beta_0 + \beta_2 + \epsilon_i$ if the ith person is Caucasian
* $\beta_0 + \epsilon_i$ if the ith person is African American
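
To make the dummy-variable encoding concrete, here is a small sketch with made-up responses for the three groups (the values and names are illustrative assumptions). It shows that the intercept equals the baseline group mean, and that the intercept plus each dummy coefficient equals the corresponding group mean.

```python
import numpy as np

# Made-up responses for three groups (illustrative values only)
groups = np.array(["Asian", "Caucasian", "African American"] * 4)
rng = np.random.default_rng(3)
y = rng.normal(loc=500.0, scale=50.0, size=groups.size)

# Two dummy columns for three levels; African American is the baseline
x1 = (groups == "Asian").astype(float)
x2 = (groups == "Caucasian").astype(float)
X = np.column_stack([np.ones(groups.size), x1, x2])

b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Each pair of printed numbers should agree
print(b0, y[groups == "African American"].mean())
print(b0 + b1, y[groups == "Asian"].mean())
print(b0 + b2, y[groups == "Caucasian"].mean())
```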


In [None]:
# Experiment with qualitative predictors in the following data
carseats = pd.read_csv('../data/Carseats.csv', header=0)
carseats.head()

In [None]:
lm_carseats_dummy = smf.ols('Sales ~ Income + Advertising + Price + Age + C(ShelveLoc)', 
                            data = carseats).fit()
lm_carseats_dummy.summary()
# ShelveLoc has 3 levels and is treated as categorical, so the summary shows 2 coefficients for it


______

### Non-linear Transformations of the Predictors 
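Two of the most important assumptions in LR are that the relationship between the predictors and the response is additive and linear. We can extend linear regression by relaxing each of them:
* Removing the additive assumption

Adding an interaction term lets the effect of $X_1$ on Y depend on the value of $X_2$:

$$Y = \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 + \epsilon$$

* Non-linear relationships

Non-linear relationships can be accommodated by adding polynomial or logarithmic terms of the predictors:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \epsilon$$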
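
Both extensions remain linear in the coefficients, so ordinary least squares still applies once the extra columns are added to the design matrix. The sketch below illustrates this with noise-free synthetic data (the coefficient values and names are assumptions for illustration), so the fitted coefficients should match the ones used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.uniform(-2, 2, size=n)
x2 = rng.uniform(-2, 2, size=n)

# Interaction model: Y = b0 + b1*X1 + b2*X2 + b3*X1*X2 (noise-free for clarity)
y_inter = 1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2
X_inter = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_inter = np.linalg.lstsq(X_inter, y_inter, rcond=None)[0]

# Quadratic model: Y = b0 + b1*X1 + b2*X1^2
y_quad = 3.0 - 1.0 * x1 + 0.8 * x1 ** 2
X_quad = np.column_stack([np.ones(n), x1, x1 ** 2])
beta_quad = np.linalg.lstsq(X_quad, y_quad, rcond=None)[0]

print("interaction model coefficients:", np.round(beta_inter, 3))
print("quadratic model coefficients:", np.round(beta_quad, 3))
```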


In [None]:
## Let's try a non-additive model: 'lstat * age' expands to lstat + age + lstat:age
lm = smf.ols('medv~lstat * age', data=boston).fit()
print(lm.summary())

In [None]:
## Let's try a non-linear relationship and compare it with the linear model
lm_order1 = smf.ols('medv~ lstat', data=boston).fit()
lm_order2 = smf.ols('medv~ lstat+ I(lstat ** 2.0)', data=boston).fit()
print(lm_order2.summary())


**The near-zero p-value associated with the quadratic term suggests that it leads to an improved model.
We use `sm.stats.anova_lm()` (the statsmodels counterpart of R's `anova()` function) to further quantify the extent to which the quadratic fit is superior to the linear fit.**


In [None]:
# Compare the linear and quadratic models using ANOVA
table = sm.stats.anova_lm(lm_order1, lm_order2)
print(table)


**The F-statistic is 135 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat² is far superior to the model that only contains the predictor lstat.**
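
The F-statistic for comparing two nested models can also be reproduced by hand from their RSS values and residual degrees of freedom. The following is a sketch of the standard nested-model F-statistic on synthetic data (an assumption for illustration, so the value will not match the Boston result above); it omits the p-value, which `anova_lm` reports as well.

```python
import numpy as np


def fit_rss(X, y):
    """RSS of the least-squares fit of y on the columns of X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return float(resid @ resid)


# Synthetic data with a genuine quadratic component
rng = np.random.default_rng(5)
n = 300
x = rng.uniform(0, 10, size=n)
y = 30.0 - 2.0 * x + 0.1 * x ** 2 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])          # linear model
X_big = np.column_stack([np.ones(n), x, x ** 2])    # linear + quadratic term

rss_small, rss_big = fit_rss(X_small, y), fit_rss(X_big, y)
df_small = n - X_small.shape[1]   # residual degrees of freedom
df_big = n - X_big.shape[1]

f_stat = ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
print(f"F = {f_stat:.1f}")
```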


______

### Potential Problems

When we fit a linear regression model to a particular data set, many problems may occur. Most common among these are the following:

**1. Non-linearity of the response-predictor relationships**

Residual plots are a useful graphical tool for identifying non-linearity. If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as $\log X$, $\sqrt{X}$, and $X^2$, in the regression model.

**2. Correlation of error terms**

An important assumption of the linear regression model is that the error terms, $\epsilon_1, \epsilon_2, \dots, \epsilon_n$, are uncorrelated. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard errors, so we may have an unwarranted sense of confidence in our model. Plotting the residuals in observation order can help identify the correlation. Such correlations frequently occur in the context of time series data.
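
Besides plotting, one simple numerical check is the correlation between each residual and the one before it. The sketch below compares simulated independent errors with simulated AR(1) errors; the simulation settings and the `lag1_autocorr` helper are assumptions for illustration, and this lag-1 check is only one of several possible diagnostics.

```python
import numpy as np


def lag1_autocorr(residuals):
    """Correlation between each residual and the one before it."""
    return float(np.corrcoef(residuals[:-1], residuals[1:])[0, 1])


rng = np.random.default_rng(6)
n = 500
t = np.arange(n, dtype=float)
X = np.column_stack([np.ones(n), t])

# Independent errors versus AR(1) errors with coefficient 0.9 (both simulated)
eps_iid = rng.normal(size=n)
eps_ar = np.zeros(n)
for i in range(1, n):
    eps_ar[i] = 0.9 * eps_ar[i - 1] + rng.normal()

results = {}
for label, eps in [("independent", eps_iid), ("AR(1)", eps_ar)]:
    y = 5.0 + 0.02 * t + eps
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    results[label] = lag1_autocorr(resid)
    print(label, round(results[label], 3))
```

A value near zero is consistent with uncorrelated errors, while a value far from zero suggests that adjacent error terms are correlated.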

**3. Non-constant variance of error terms**

Another assumption behind OLS in LR is that the error terms have a constant variance, $\mathrm{Var}(\epsilon_i) = \sigma^2$, known as homoscedasticity. When this assumption is violated (heteroscedasticity), one possible solution is to transform the response Y using a concave function such as $\log Y$ or $\sqrt{Y}$. Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.

**4. Outliers**

An outlier is a point for which yi is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection. Residual plots can be used to identify outliers. If we believe that an outlier has occurred due to an error in data collection or recording, then one solution is to simply remove the observation.

**5. High-leverage points**

These are observations with an unusual value of the predictor $x_i$. One can use the usual outlier detection methods (plots, z-scores, IQR, etc.) and remove or impute such values, where justified, before adding the predictor to the model. To quantify leverage specifically, one can compute the leverage statistic for each observation. It always lies between 1/n and 1, and its average over all observations equals (p + 1)/n, so a value well above (p + 1)/n indicates an observation with high leverage.
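
One common way to compute the leverage statistic is as the diagonal of the hat matrix $H = X (X^T X)^{-1} X^T$. The sketch below does this with NumPy on synthetic data in which one observation is deliberately given an unusual predictor value (the data and names are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.normal(size=n)
x[0] = 8.0  # one observation with an unusual predictor value

X = np.column_stack([np.ones(n), x])

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverage values
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print("highest-leverage index:", int(np.argmax(leverage)))
print("average leverage:", leverage.mean())
print("(p + 1) / n:", X.shape[1] / n)
```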

**6. Collinearity**

Collinearity refers to the situation in which two or more predictor variables are closely related to one another.
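A simple way to detect collinearity is to look at the correlation matrix of the predictors. However, collinearity can exist between three or more variables even if no pair of variables has a particularly high correlation (multicollinearity). In that case, a better approach is to compute the variance inflation factor (VIF). The smallest possible VIF is 1; as a rule of thumb, a value above 5 or 10 indicates a problematic amount of collinearity.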
When faced with the problem of collinearity, there are two simple solutions. 
The first is to drop one of the problematic variables from the regression. 
The second solution is to combine the collinear variables together into a single predictor by either taking average of their standardized versions or using dimension reduction methods like PCA.
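
The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. Here is a minimal NumPy sketch on synthetic data (the `vif` helper and the data are assumptions for illustration); statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.

```python
import numpy as np


def vif(X):
    """VIF of each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, target, rcond=None)[0]
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


rng = np.random.default_rng(8)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + rng.normal(scale=0.1, size=n)  # nearly a linear combination of a and b
d = rng.normal(size=n)                     # unrelated to the others

vifs = vif(np.column_stack([a, b, c, d]))
print(np.round(vifs, 1))
```

In this example the pairwise correlations involving `c` are only moderate, yet the VIFs of `a`, `b`, and `c` are large, which is the situation the correlation matrix alone can miss.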