# Linear Regression
Topics covered in this chapter of the book:

* 3.1 Simple Linear Regression
  * 3.1.1 Estimating the Coefficients
  * 3.1.2 Assessing the Accuracy of the Coefficient Estimates
  * 3.1.3 Assessing the Accuracy of the Model
* 3.2 Multiple Linear Regression
  * 3.2.1 Estimating the Regression Coefficients
  * 3.2.2 Some Important Questions
* 3.3 Other Considerations in the Regression Model
  * 3.3.1 Qualitative Predictors
  * 3.3.2 Extensions of the Linear Model
  * 3.3.3 Potential Problems
* 3.4 The Marketing Plan
* 3.5 Comparison of Linear Regression with K-Nearest Neighbors

**The following is a summary of the concepts, along with data and Python code.**

**Linear regression** is an approach for predicting a quantitative response Y from one or more predictor variables X1, X2, ..., Xp, assuming a linear relationship between the predictors and Y. Mathematically, we can write this relationship as

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

where $\beta_0, \beta_1, \ldots, \beta_p$ are known as the model coefficients or parameters.

The ordinary least squares (OLS) approach chooses $\beta_0, \beta_1, \ldots, \beta_p$ to minimize the RSS (residual sum of squares), the sum of squared differences between the actual and predicted values of Y: $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
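To make the least squares criterion concrete, here is a minimal sketch on synthetic data (the data, the coefficient values, and the helper name `rss` are assumptions for illustration, not part of the Boston analysis). It fits the coefficients with `numpy.linalg.lstsq` and checks that nudging the estimate away from the OLS solution increases the RSS.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 1 + 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept beta_0
A = np.column_stack([np.ones(n), X])

# OLS estimate: the coefficient vector that minimizes ||y - A @ beta||^2
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)


def rss(beta):
    """Residual sum of squares for a given coefficient vector."""
    resid = y - A @ beta
    return float(resid @ resid)


print("estimated coefficients:", beta_hat)
print("RSS at the OLS estimate:", rss(beta_hat))

# Any other coefficient vector gives a larger RSS
perturbed = beta_hat + np.array([0.05, 0.0, 0.0])
print("RSS after perturbing the intercept:", rss(perturbed))
```

In practice `smf.ols(...).fit()` performs the same minimization and also reports standard errors and test statistics.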

Some important questions of linear regression-

* *How good is the relationship between the response and predictors?*
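The F-statistic helps answer this. It is computed as $F = \dfrac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$, where TSS is the total sum of squares, n is the number of observations, and p is the number of predictors. A value well above 1 is evidence that at least one predictor is related to the response.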

* *Deciding on important variables, also known as variable selection.*
The p-value of each variable is a good indicator, but not the only one. When the number of predictors p is large, some individual p-values will be small purely by chance, so we are likely to make some false discoveries. There are three classical approaches for this task:
  * **Forward selection**: Start from the null model (intercept only) and, at each step, add the variable whose inclusion gives the lowest RSS. Continue until a stopping rule is satisfied.
  * **Backward selection**: Start with all variables and repeatedly remove the variable with the largest p-value, refitting after each removal, until a stopping rule is reached (for example, all remaining variables have a p-value below some threshold).
  * **Mixed selection**: A combination of the two. Start with the null model and add variables as in forward selection. If the p-value of a variable already in the model rises above a threshold as new variables are added, remove that variable. Continue these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.

* *Model fit.*
Two of the most common numerical measures of model fit are the RSE (residual standard error) and $R^2$, the fraction of variance explained. An $R^2$ value close to 1 indicates that the model explains a large portion of the variance in the response variable.
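The three questions above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the notebook's method: the data are synthetic, the helper names `fit_rss`, `fit_measures`, and `forward_selection` are hypothetical, and the stopping rule (a fixed `max_vars`) is just one simple choice among many.

```python
import numpy as np


def fit_rss(X, y):
    """Fit OLS with an intercept and return the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def fit_measures(X, y):
    """Return R^2, RSE and the F-statistic for an OLS fit of y on X."""
    n, p = X.shape
    rss = fit_rss(X, y)
    tss = float(((y - y.mean()) ** 2).sum())
    return {
        "R2": 1 - rss / tss,
        "RSE": float(np.sqrt(rss / (n - p - 1))),
        "F": ((tss - rss) / p) / (rss / (n - p - 1)),
    }


def forward_selection(X, y, max_vars):
    """Greedy forward selection: add the column that lowers RSS the most.

    Stops after max_vars columns (an assumed, simplistic stopping rule).
    """
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: fit_rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data (assumed): only columns 0 and 3 actually drive the response
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

selected = forward_selection(X, y, max_vars=2)
measures = fit_measures(X[:, selected], y)
print("selected columns:", selected)
print("fit measures:", measures)
```

Backward and mixed selection follow the same loop structure but rank variables by p-value, which a fitted statsmodels result exposes as `pvalues`.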

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import math

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import *
from sklearn import datasets, linear_model

In [2]:
Boston = pd.read_csv('/Users/shilpa/Documents/blog/Sharing_ISL_python/data/Boston.csv', header=0)
Boston.shape
lm = smf.ols('medv~lstat+age', data=Boston).fit()
print(lm.summary())

                            OLS Regression Results                            
Dep. Variable:                   medv   R-squared:                       0.551
Model:                            OLS   Adj. R-squared:                  0.549
Method:                 Least Squares   F-statistic:                     309.0
Date:                Sun, 11 Oct 2020   Prob (F-statistic):           2.98e-88
Time:                        03:00:12   Log-Likelihood:                -1637.5
No. Observations:                 506   AIC:                             3281.
Df Residuals:                     503   BIC:                             3294.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     33.2228      0.731     45.458      0.0

In [3]:
formula = "medv~" + "+".join(Boston.columns.drop(["medv"]))
lm = smf.ols(formula, data=Boston).fit()
lm.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | medv | R-squared: | 0.741 |
| Model: | OLS | Adj. R-squared: | 0.734 |
| Method: | Least Squares | F-statistic: | 108.1 |
| Date: | Sun, 11 Oct 2020 | Prob (F-statistic): | 6.72e-135 |
| Time: | 03:00:16 | Log-Likelihood: | -1498.8 |
| No. Observations: | 506 | AIC: | 3026.0 |
| Df Residuals: | 492 | BIC: | 3085.0 |
| Df Model: | 13 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 36.4595 | 5.103 | 7.144 | 0.000 | 26.432 | 46.487 |
| crim | -0.1080 | 0.033 | -3.287 | 0.001 | -0.173 | -0.043 |
| zn | 0.0464 | 0.014 | 3.382 | 0.001 | 0.019 | 0.073 |
| indus | 0.0206 | 0.061 | 0.334 | 0.738 | -0.100 | 0.141 |
| chas | 2.6867 | 0.862 | 3.118 | 0.002 | 0.994 | 4.380 |
| nox | -17.7666 | 3.820 | -4.651 | 0.000 | -25.272 | -10.262 |
| rm | 3.8099 | 0.418 | 9.116 | 0.000 | 2.989 | 4.631 |
| age | 0.0007 | 0.013 | 0.052 | 0.958 | -0.025 | 0.027 |
| dis | -1.4756 | 0.199 | -7.398 | 0.000 | -1.867 | -1.084 |

| | | | |
|---|---|---|---|
| Omnibus: | 178.041 | Durbin-Watson: | 1.078 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 783.126 |
| Skew: | 1.521 | Prob(JB): | 8.84e-171 |
| Kurtosis: | 8.281 | Cond. No. | 15100.0 |
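As a rough sanity check, the F-statistic and adjusted R-squared in the summary above can be approximately recovered from the reported R-squared alone. The sketch below assumes only the printed values (R-squared 0.741, 506 observations, 13 predictors) and uses the identity $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$ to rewrite the F-statistic in terms of $R^2$.

```python
# Values copied from the regression summary above (R-squared is rounded to 3 decimals)
r2 = 0.741
n = 506  # No. Observations
p = 13   # Df Model (number of predictors)

# Dividing the numerator and denominator of the F-statistic by TSS gives
# F = (R^2 / p) / ((1 - R^2) / (n - p - 1)), because R^2 = 1 - RSS/TSS.
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

# Adjusted R^2 penalizes R^2 for the number of predictors in the model.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"F-statistic (from rounded R^2): {f_stat:.1f}")
print(f"Adjusted R^2: {adj_r2:.3f}")
```

The recomputed F-statistic differs slightly from the reported 108.1 because the R-squared used here is rounded; the adjusted R-squared agrees with the reported 0.734 to three decimals.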


## Non-linear Transformations of the Predictors 

In [4]:
lm_order1 = smf.ols('medv~ lstat', data=Boston).fit()
lm_order2 = smf.ols('medv~ lstat+ I(lstat ** 2.0)', data=Boston).fit()
print(lm_order2.summary())

                            OLS Regression Results                            
Dep. Variable:                   medv   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     448.5
Date:                Sun, 11 Oct 2020   Prob (F-statistic):          1.56e-112
Time:                        03:01:05   Log-Likelihood:                -1581.3
No. Observations:                 506   AIC:                             3169.
Df Residuals:                     503   BIC:                             3181.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          42.8620      0.872     

In [6]:
table = sm.stats.anova_lm(lm_order1, lm_order2)
print(table)

   df_resid           ssr  df_diff     ss_diff           F        Pr(>F)
0     504.0  19472.381418      0.0         NaN         NaN           NaN
1     503.0  15347.243158      1.0  4125.13826  135.199822  7.630116e-28
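The ANOVA table compares the linear fit (row 0) with the quadratic fit (row 1). Its F value can be reproduced by hand from the two residual sums of squares; the short sketch below uses only the numbers printed in the table, with variable names chosen here for illustration.

```python
# Values copied from the ANOVA table above
rss_linear = 19472.381418     # ssr of model 0: medv ~ lstat
rss_quadratic = 15347.243158  # ssr of model 1: medv ~ lstat + lstat^2
df_resid_quadratic = 503.0    # residual degrees of freedom of model 1
df_diff = 1.0                 # one extra parameter in model 1

# Reduction in RSS from adding the quadratic term
ss_diff = rss_linear - rss_quadratic

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat = (ss_diff / df_diff) / (rss_quadratic / df_resid_quadratic)

print(f"ss_diff = {ss_diff:.5f}")
print(f"F = {f_stat:.4f}")
```

The null hypothesis is that both models fit the data equally well. An F value this large, with the tiny `Pr(>F)` shown in the table, indicates that the model including the squared `lstat` term fits substantially better.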




In [7]:
sm.stats.anova_lm?