# Linear Regression

The purpose of linear regression is to predict the value of Y for new observations Xi whose Y value is unknown. We also want to figure out which feature variable contributes the most information for predicting Y. With linear regression, we can model the strength of the relationship between each independent variable and Y. It is essentially modelling the relationship between the independent feature variables and the dependent variable Y.

**What's the goal?**
- The goal is to estimate the coefficients b0 (intercept) and b1 (slope) in Y = b0 + b1X + e, where e is random noise.
- Once we have estimated the coefficients, we can use them to predict new values of Y.
- How do we estimate these coefficients? The most common method is called **least squares**.
    - Least squares regression estimates the coefficients of a linear model by minimising the sum of the squared differences between the actual Y values and the predicted Y values, also known as the sum of squared residuals.
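
To make the least squares idea concrete, here is a minimal sketch for the one-feature case Y = b0 + b1X + e, using the standard closed-form solution. The function names and the toy data are made up for illustration and are not part of the notebook.

```python
def fit_simple_least_squares(xs, ys):
    """Return (b0, b1) minimising the sum of squared residuals for y = b0 + b1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared differences between actual and predicted y values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up toy data (hypothetical), roughly following y = 2 + 3x plus noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.1, 7.9, 11.2, 13.8, 17.0]

b0, b1 = fit_simple_least_squares(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"SSR at the estimate: {sum_squared_residuals(xs, ys, b0, b1):.4f}")
print(f"SSR with a nudged slope: {sum_squared_residuals(xs, ys, b0, b1 + 0.1):.4f}")
```

Nudging either coefficient away from the estimate should only increase the sum of squared residuals, which is what "least squares" refers to.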

In [5]:
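# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version
from sklearn.datasets import load_boston
boston = load_boston()
# boston is a dict-like Bunch object
print(boston.keys())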

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


In [7]:
boston.data.shape

(506, 13)

In [8]:
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [10]:
import pandas as pd


In [14]:
df = pd.DataFrame(boston.data)
df.head()

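| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |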


In [17]:
cols = boston.feature_names
cols

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [19]:
df = pd.DataFrame(boston.data, columns= cols)
df.head()

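| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |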


In [22]:
df['PRICE'] = boston.target
df['PRICE'].head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: PRICE, dtype: float64

In [23]:
df.head()

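| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |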


In [29]:
%%bash
pip3 install statsmodels

Collecting statsmodels
  Downloading https://files.pythonhosted.org/packages/c6/c8/f620ee78110170e2c4d014eed15d6436984f25f7ff71435820f5d89f478b/statsmodels-0.11.0-cp38-cp38-macosx_10_9_x86_64.whl (8.5MB)
Collecting patsy>=0.5 (from statsmodels)
  Downloading https://files.pythonhosted.org/packages/ea/0c/5f61f1a3d4385d6bf83b83ea495068857ff8dfb89e74824c6e9eb63286d8/patsy-0.5.1-py2.py3-none-any.whl (231kB)
Collecting scipy>=1.0 (from statsmodels)
  Downloading https://files.pythonhosted.org/packages/90/d2/44b70a930ad28da8f65d8c294ac88b20f561e5d650b85efea80381566db1/scipy-1.4.1-cp38-cp38-macosx_10_9_x86_64.whl (28.8MB)
Installing collected packages: patsy, scipy, statsmodels
Successfully installed patsy-0.5.1 scipy-1.4.1 statsmodels-0.11.0



In [33]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

In [43]:
reg = ols('PRICE ~ RM', data = df).fit()

In [44]:
reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | PRICE | R-squared: | 0.484 |
| Model: | OLS | Adj. R-squared: | 0.483 |
| Method: | Least Squares | F-statistic: | 471.8 |
| Date: | Tue, 04 Feb 2020 | Prob (F-statistic): | 2.49e-74 |
| Time: | 23:47:52 | Log-Likelihood: | -1673.1 |
| No. Observations: | 506 | AIC: | 3350.0 |
| Df Residuals: | 504 | BIC: | 3359.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -34.6706 | 2.650 | -13.084 | 0.000 | -39.877 | -29.465 |
| RM | 9.1021 | 0.419 | 21.722 | 0.000 | 8.279 | 9.925 |

| | | | |
|---|---|---|---|
| Omnibus: | 102.585 | Durbin-Watson: | 0.684 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 612.449 |
| Skew: | 0.726 | Prob(JB): | 1.02e-133 |
| Kurtosis: | 8.19 | Cond. No. | 58.4 |


In [45]:
# The coefficient of RM is about 9.1.
# RM is the average number of rooms per dwelling and PRICE is in $1000s.
# So if we compare a house with 4 rooms to a house with 5 rooms,
# the model predicts a price difference of about 9.1K for the extra room.
# The 95% confidence interval for this coefficient is roughly 8.3K to 9.9K.
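
As a quick check on this interpretation, the sketch below plugs the intercept and RM coefficient from the summary table into the fitted line and compares the predictions for RM = 4 and RM = 5. The helper `predict_price` is a hypothetical name used only for illustration; the numbers are copied from the summary output.

```python
# Values copied from the reg.summary() output
intercept = -34.6706
rm_coef = 9.1021
ci_low, ci_high = 8.279, 9.925  # 95% confidence interval for the RM coefficient


def predict_price(rm):
    """Predicted PRICE (in $1000s) from the fitted line PRICE = intercept + rm_coef * RM."""
    return intercept + rm_coef * rm


price_4 = predict_price(4)
price_5 = predict_price(5)
diff = price_5 - price_4

print(f"Predicted PRICE at RM=4: {price_4:.2f}")
print(f"Predicted PRICE at RM=5: {price_5:.2f}")
print(f"Difference for one extra room: {diff:.2f}")
print(f"Plausible range for that difference: {ci_low} to {ci_high}")
```

The difference between the two predictions equals the RM coefficient regardless of which two consecutive RM values are compared, because the model is a straight line.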

In [58]:
from sklearn.linear_model import LinearRegression
X = df.drop("PRICE", axis=1)  # feature matrix
y = df['PRICE']  # target vector
y.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: PRICE, dtype: float64

In [59]:
linReg = LinearRegression()
linReg.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [60]:
print(f"Estimated intercept is {linReg.intercept_}")
print(f"Estimated coefficients are {linReg.coef_}")

Estimated intercept is 36.45948838509015
Estimated coefficients are [-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]
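
To show how these numbers turn into a prediction, here is a minimal sketch that combines the printed intercept and coefficients with the first row of the DataFrame shown earlier. It reproduces by hand what a linear model computes: the intercept plus the sum of each coefficient times its feature value. The variable names are chosen for illustration only.

```python
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Intercept and coefficients copied from the printed output
intercept = 36.45948838509015
coefs = [-1.08011358e-01, 4.64204584e-02, 2.05586264e-02, 2.68673382e+00,
         -1.77666112e+01, 3.80986521e+00, 6.92224640e-04, -1.47556685e+00,
         3.06049479e-01, -1.23345939e-02, -9.52747232e-01, 9.31168327e-03,
         -5.24758378e-01]

# First row of df (index 0) and its actual PRICE, copied from df.head()
first_row = [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2,
             4.09, 1.0, 296.0, 15.3, 396.9, 4.98]
actual_price = 24.0

# Prediction = intercept + sum of (coefficient * feature value)
predicted_price = intercept + sum(c * x for c, x in zip(coefs, first_row))
residual = actual_price - predicted_price

# Show each feature's contribution to the prediction
for name, c, x in zip(feature_names, coefs, first_row):
    print(f"{name:>8}: {c * x:10.4f}")
print(f"Predicted PRICE: {predicted_price:.2f}, actual: {actual_price}, residual: {residual:.2f}")
```

The residual printed at the end is one of the differences whose squares the least squares fit adds up and minimises across all 506 rows.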


In [65]:
# Create a DataFrame pairing each feature column with its estimated coefficient.
# X.columns is used (rather than df.columns) so that the target column PRICE is excluded.
pd.DataFrame(zip(X.columns, linReg.coef_), columns=['features', 'estimated_coeffs'])

Unnamed: 0,features,estimated_coeffs
0,CRIM,-0.108011
1,ZN,0.04642
2,INDUS,0.020559
3,CHAS,2.686734
4,NOX,-17.766611
5,RM,3.809865
6,AGE,0.000692
7,DIS,-1.475567
8,RAD,0.306049
9,TAX,-0.012335
