# Subject: Classical Data Analysis

## Session 1 - Regression

### Exercise 1 



Considering the OLS model presented in Demo 2, develop a new regression analysis based on the independent variable “LSTAT” (percentage of lower status of the population).

- Interpret and discuss the OLS Regression Results. 
- Commit scripts in your GitHub account. You should export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.


The following tasks should be completed and synchronized with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded; however, it is important that you understand and complete them correctly so that you do not run into problems with later assignments.



# Linear Regression in Statsmodels

### Regression model with Statsmodels and without a constant:

In [6]:
import statsmodels.api as sm
import numpy as np
import pandas as pd

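from sklearn import datasets
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
data = datasets.load_boston()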

df_x = pd.DataFrame(data.data, columns=data.feature_names)
df_y = pd.DataFrame(data.target, columns=["MEDV"])

In [5]:
print(data.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [7]:
df_x.head()

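|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |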


In [11]:
import statsmodels.api as sm

X = df_x["LSTAT"]
y = df_y["MEDV"]

model = sm.OLS(y,X).fit()
predictions = model.predict()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.449 |
| Model: | OLS | Adj. R-squared: | 0.448 |
| Method: | Least Squares | F-statistic: | 410.9 |
| Date: | Fri, 13 Oct 2017 | Prob (F-statistic): | 2.71e-67 |
| Time: | 16:43:42 | Log-Likelihood: | -2182.4 |
| No. Observations: | 506 | AIC: | 4367.0 |
| Df Residuals: | 505 | BIC: | 4371.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| LSTAT | 1.1221 | 0.055 | 20.271 | 0.000 | 1.013 | 1.231 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.113 | Durbin-Watson: | 0.369 |
| Prob(Omnibus): | 0.573 | Jarque-Bera (JB): | 1.051 |
| Skew: | 0.112 | Prob(JB): | 0.591 |
| Kurtosis: | 3.009 | Cond. No. | 1.0 |


> **Hypothesis:**

> There is a negative correlation between housing price and LSTAT, the percentage of lower-status population

> the higher the share of lower-status residents (that is, the lower the socio-economic status of an area), the lower the housing price

### Interpreting the Table 

> The dependent variable ("Y") is MEDV, the median house value, which this dataset records in thousands of US dollars

> Covariance type is nonrobust: the standard errors are computed under the classical OLS assumption of constant error variance (homoscedasticity), with no heteroscedasticity-robust correction

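> R-squared: (0.449)
roughly 44.9% of the variation in housing price is accounted for by LSTAT, and 55.1% is not. Because this model has no constant, Statsmodels reports the uncentered R-squared, which measures variation around zero rather than around the mean, so it is not directly comparable with the R-squared of a model that includes a constant.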
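
The difference between the uncentered and the centered R-squared can be illustrated with a minimal sketch. The data below are made up for illustration (they are not the Boston data) and the variable names are assumptions; the slope formula is the standard least-squares solution for a line forced through the origin.

```python
import numpy as np

# Made-up data that roughly follows a decreasing line (not the Boston data).
x = np.array([2.0, 5.0, 10.0, 15.0, 20.0, 30.0])
y = np.array([33.0, 29.5, 25.5, 20.0, 15.5, 6.0])

# Least-squares slope for a line forced through the origin: sum(x*y) / sum(x*x).
slope = np.sum(x * y) / np.sum(x * x)
residuals = y - slope * x
ssr = np.sum(residuals ** 2)

# Uncentered R-squared: total variation is measured around zero.
r2_uncentered = 1.0 - ssr / np.sum(y ** 2)
# Centered R-squared: total variation is measured around the mean of y.
r2_centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)

print(f"slope through origin: {slope:.3f}")
print(f"uncentered R^2: {r2_uncentered:.3f}")
print(f"centered R^2:   {r2_centered:.3f}")
```

Even though y falls as x rises in this toy data, the slope through the origin comes out positive, and the uncentered R-squared is much more flattering than the centered one.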



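> *LSTAT coefficient*:
(1.1221) is the slope: for each one-percentage-point increase in LSTAT (X), the predicted MEDV (Y) rises by 1.1221, i.e. about 1,122 USD since MEDV is measured in thousands of dollars. The positive sign is an artifact of forcing the line through the origin, because this model has no constant.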

> *Std Error*
(0.055) measures the sampling uncertainty of the estimated coefficient. The 95% confidence interval is [1.013, 1.231], and the p-value below 0.001 indicates that the coefficient differs significantly from zero.

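> The residuals are close to symmetric, with only a slight right skew (Skew = 0.112); the Omnibus and Jarque-Bera tests (p = 0.573 and 0.591) do not reject normality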

### Regression model with Statsmodels and with a constant:


In [15]:
X = sm.add_constant(X)
model = sm.OLS(y,X).fit()
predictions = model.predict()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.544 |
| Model: | OLS | Adj. R-squared: | 0.543 |
| Method: | Least Squares | F-statistic: | 601.6 |
| Date: | Fri, 13 Oct 2017 | Prob (F-statistic): | 5.08e-88 |
| Time: | 17:38:28 | Log-Likelihood: | -1641.5 |
| No. Observations: | 506 | AIC: | 3287.0 |
| Df Residuals: | 504 | BIC: | 3295.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 34.5538 | 0.563 | 61.415 | 0.000 | 33.448 | 35.659 |
| LSTAT | -0.9500 | 0.039 | -24.528 | 0.000 | -1.026 | -0.874 |

| | | | |
|---|---|---|---|
| Omnibus: | 137.043 | Durbin-Watson: | 0.892 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 291.373 |
| Skew: | 1.453 | Prob(JB): | 5.36e-64 |
| Kurtosis: | 5.319 | Cond. No. | 29.7 |
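
As a rough sketch of what adding a constant does, the snippet below builds the design matrix by hand with NumPy: a column of ones is placed in front of the predictor, so the first fitted coefficient plays the role of the intercept. The data are made up (generated exactly from a known line, not taken from the Boston dataset), and solving with `np.linalg.lstsq` is an illustrative choice, not necessarily how Statsmodels solves the problem internally.

```python
import numpy as np

# Made-up data generated exactly from y = 34.5 - 0.95 * x (not the Boston data).
x = np.array([2.0, 5.0, 10.0, 15.0, 20.0, 30.0])
y = 34.5 - 0.95 * x

# Prepend a column of ones so the first coefficient acts as the intercept.
X = np.column_stack([np.ones_like(x), x])

# Solve the least-squares problem: minimize ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

Because the toy data lie exactly on a line, the fit recovers the intercept and slope that generated them; without the column of ones, the line would be forced through the origin.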


### Interpreting the Table 


> R-squared (0.544): roughly 54.4% of the variance in house price (MEDV) is explained by LSTAT

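> The coefficient of LSTAT (-0.9500) shows that for every one-percentage-point increase in LSTAT, the predicted house price decreases by 0.95, i.e. about 950 USD since MEDV is measured in thousands of dollars. A 10-point increase therefore corresponds to a decrease of about 9,500 USD. The constant (34.5538) is the predicted MEDV when LSTAT is 0.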

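> Standard error (0.039): it measures the sampling uncertainty of the LSTAT coefficient. The 95% confidence interval is [-1.026, -0.874], and the p-value below 0.001 indicates that the coefficient differs significantly from zero.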
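
The link between the coefficient, its standard error, the t statistic, and the confidence interval can be checked by hand. This is a minimal sketch using the rounded values from the summary table; it assumes the normal quantile 1.96 is a good enough stand-in for the t critical value with 504 degrees of freedom.

```python
# Values copied from the summary table for the LSTAT coefficient.
coef = -0.9500
std_err = 0.039

# Assumption: 1.96 (normal quantile) approximates the t critical value for 504 df.
critical = 1.96

# t statistic: how many standard errors the estimate lies from zero.
t_stat = coef / std_err

# 95% confidence interval: estimate plus or minus critical value times standard error.
ci_low = coef - critical * std_err
ci_high = coef + critical * std_err

print(f"t = {t_stat:.2f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

The interval agrees with the table to within rounding; the t statistic differs slightly from the reported -24.528 because the inputs here are already rounded.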

# Linear Regression in SKLearn 


In [18]:
from sklearn import linear_model, datasets

data = datasets.load_boston()

df = pd.DataFrame(data.data, columns=data.feature_names)

# Keep only the LSTAT column, as a one-column DataFrame
df2 = df[["LSTAT"]]
target = pd.DataFrame(data.target, columns=["MEDV"])

X = df2
y = target["MEDV"]

lm = linear_model.LinearRegression()
model = lm.fit(X,y)
predictions = lm.predict(X)

In [20]:
print(predictions[:4])

[ 29.8225951   25.87038979  30.72514198  31.76069578]


In [21]:
lm.score(X,y)

0.54414629758647992

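The score, or R-squared, is 0.544, which means that LSTAT (X) explains 54.4% of the variance in house price (Y), while the remaining 45.6% is left to other factors. This matches the R-squared of the Statsmodels model with a constant, because LinearRegression fits an intercept by default.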
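
For reference, the R-squared returned by `score` is defined as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes it by hand; the observations and predictions are made up for illustration and are not values from the Boston data.

```python
# Made-up observations and predictions (not the Boston data).
y_true = [10.0, 12.0, 15.0, 19.0, 24.0]
y_pred = [9.0, 13.0, 16.0, 18.0, 24.0]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: variation left unexplained by the predictions.
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
# Total sum of squares: variation of the observations around their mean.
ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)

r_squared = 1.0 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```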

In [22]:
lm.coef_

array([-0.95004935])

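There is a negative relationship between house price and LSTAT. For every one-percentage-point increase in LSTAT, the predicted house price decreases by about 0.95, i.e. roughly 950 USD since MEDV is measured in thousands of dollars.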

In [23]:
lm.intercept_

34.55384087938311
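
As a sanity check, the first predictions printed earlier can be reproduced by hand from the intercept and the coefficient, since the fitted model is simply intercept + coefficient × LSTAT. The sketch below uses the values as printed in the notebook (the coefficient is rounded) together with the LSTAT values of the first four rows shown by `df_x.head()`.

```python
# Values copied from the notebook outputs.
intercept = 34.55384087938311      # lm.intercept_
slope = -0.95004935                # lm.coef_[0], as printed (rounded)
lstat = [4.98, 9.14, 4.03, 2.94]   # LSTAT for the first four rows

# Predicted MEDV for each row: intercept + slope * LSTAT.
manual = [intercept + slope * value for value in lstat]

print([round(p, 4) for p in manual])
```

The results agree with `predictions[:4]` up to the rounding of the printed coefficient.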