# DS-SF-23 | Codealong 05 | Inferential Statistics for Model Fit

## Inferential Statistics | Motivating Example

In [26]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

%matplotlib inline
plt.style.use('ggplot')

In [27]:
df = pd.read_csv(os.path.join('..', 'datasets', 'zillow-05-start.csv'), index_col = 'ID')

We are using our usual SF housing dataset, with two new variables added to it: `M1` and `M2`.

In [5]:
df

ID,Address,DateOfSale,SalePrice,IsAStudio,BedCount,...,Size,LotSize,BuiltInYear,M1,M2
15063471,"55 Vandewater St APT 9, San Francisco, CA",12/4/15,710000.0,0.0,1.0,...,550.0,,1980.0,1.099658,0.097627
15063505,"740 Francisco St, San Francisco, CA",11/30/15,2150000.0,0.0,,...,1430.0,2435.0,1948.0,3.687657,0.430379
15063609,"819 Francisco St, San Francisco, CA",11/12/15,5600000.0,0.0,2.0,...,2040.0,3920.0,1976.0,8.975475,0.205527
15064044,"199 Chestnut St APT 5, San Francisco, CA",12/11/15,1500000.0,0.0,1.0,...,1060.0,,1930.0,2.317325,0.089766
15064257,"111 Chestnut St APT 403, San Francisco, CA",1/15/16,970000.0,0.0,2.0,...,1299.0,,1993.0,1.380945,-0.152690
...,...,...,...,...,...,...,...,...,...,...,...
2124214951,"412 Green St APT A, San Francisco, CA",1/15/16,390000.0,1.0,,...,264.0,,2012.0,0.428094,-0.804647
2126960082,"355 1st St UNIT 1905, San Francisco, CA",11/20/15,860000.0,0.0,1.0,...,691.0,,2004.0,1.302833,0.029844
2128308939,"33 Santa Cruz Ave, San Francisco, CA",12/10/15,830000.0,0.0,3.0,...,1738.0,2299.0,1976.0,1.608882,0.876824
2131957929,"1821 Grant Ave, San Francisco, CA",12/15/15,835000.0,0.0,2.0,...,1048.0,,1975.0,1.025920,-0.542707


### Exploratory Analysis on `M1` and `M2` and how they relate to `SalePrice`

In [None]:
# Restrict the dataframe to the relevant columns and compute the correlation between each pair

In [11]:
df[['SalePrice', 'M1', 'M2']].corr()

,SalePrice,M1,M2
SalePrice,1.0,0.970612,0.022003
M1,0.970612,1.0,0.166624
M2,0.022003,0.166624,1.0
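
The matrix above holds Pearson correlation coefficients, which is what `.corr()` computes by default. As a minimal sketch of that calculation for one pair of columns, the snippet below uses plain NumPy. The helper name `pearson_r` is made up for this example, and the arrays are rounded copies of the first five rows displayed earlier (prices in thousands), so five points are far too few to reproduce the values in the matrix.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

# Illustrative inputs only: first five displayed rows, rounded
price = np.array([710.0, 2150.0, 5600.0, 1500.0, 970.0])
m1 = np.array([1.10, 3.69, 8.98, 2.32, 1.38])
m2 = np.array([0.10, 0.43, 0.21, 0.09, -0.15])

r_m1 = pearson_r(price, m1)
r_m2 = pearson_r(price, m2)
print(r_m1, r_m2)
```

The result always lies between -1 and 1, and the correlation of a column with itself is 1, which is why the diagonal of the matrix is all ones.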


### Your first Machine Learning Models!

#### Machine Learning Model #1 | `SalePrice` as a function of `M1`

In [28]:
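# Recent statsmodels releases no longer expose `OLS` via `statsmodels.formula.api`
import statsmodels.api as sm

X = df[['M1']]  # single regressor, no constant column
y = df.SalePrice

model = sm.OLS(y, X).fit()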

In [29]:
model.summary()

Dep. Variable:,SalePrice,R-squared:,0.963
Model:,OLS,Adj. R-squared:,0.963
Method:,Least Squares,F-statistic:,25670.0
Date:,"Thu, 19 May 2016",Prob (F-statistic):,0.0
Time:,20:27:10,Log-Likelihood:,-14393.0
No. Observations:,1000,AIC:,28790.0
Df Residuals:,999,BIC:,28790.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
M1,6.241e+05,3894.990,160.228,0.000,6.16e+05 6.32e+05

Omnibus:,1044.296,Durbin-Watson:,1.921
Prob(Omnibus):,0.0,Jarque-Bera (JB):,901486.247
Skew:,3.948,Prob(JB):,0.0
Kurtosis:,149.879,Cond. No.,1.0
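
Because the design matrix holds only `M1` and no constant column, the fit above is a regression through the origin (note `Df Model: 1` and `Df Residuals: 999`). The sketch below recomputes the quantities in the coefficient table for that case with plain NumPy. The function name `ols_through_origin` and the toy arrays `x` and `y` are hypothetical. It assumes the standard closed-form expressions for a one-regressor model without an intercept, and it uses the uncentered R-squared, which is the definition statsmodels applies when the model has no constant.

```python
import numpy as np

def ols_through_origin(x, y):
    """Fit y = beta * x (no intercept) and return basic summary statistics."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    beta = (x * y).sum() / (x ** 2).sum()        # least-squares slope
    resid = y - beta * x
    ssr = (resid ** 2).sum()                     # sum of squared residuals
    sigma2 = ssr / (n - 1)                       # one estimated parameter
    std_err = np.sqrt(sigma2 / (x ** 2).sum())   # standard error of beta
    t_stat = beta / std_err
    r2_uncentered = 1.0 - ssr / (y ** 2).sum()   # total sum of squares is not mean-adjusted
    return beta, std_err, t_stat, r2_uncentered

# Hypothetical toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta, std_err, t_stat, r2 = ols_through_origin(x, y)
print(beta, std_err, t_stat, r2)
```

The uncentered definition is one reason the R-squared in the summary need not equal the squared correlation between `SalePrice` and `M1`.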


#### Machine Learning Model #2 | `SalePrice` as a function of `M2`

In [31]:
import statsmodels.api as sm  # array-based OLS interface

x1 = df[['M2']]  # single regressor, no constant column
y1 = df.SalePrice

model2 = sm.OLS(y1, x1).fit()

In [32]:
model2.summary()

Dep. Variable:,SalePrice,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.001
Method:,Least Squares,F-statistic:,0.06941
Date:,"Thu, 19 May 2016",Prob (F-statistic):,0.792
Time:,20:29:58,Log-Likelihood:,-16036.0
No. Observations:,1000,AIC:,32070.0
Df Residuals:,999,BIC:,32080.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
M2,3.195e+04,1.21e+05,0.263,0.792,-2.06e+05 2.7e+05

Omnibus:,1664.6,Durbin-Watson:,0.971
Prob(Omnibus):,0.0,Jarque-Bera (JB):,986904.813
Skew:,10.532,Prob(JB):,0.0
Kurtosis:,155.453,Cond. No.,1.0
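
The `t` and confidence-interval columns in both summaries are derived from `coef` and `std err`. As a rough check, the sketch below recomputes them from the rounded values printed in the two coefficient tables. The critical value 1.962 is an assumption: it approximates the 97.5th percentile of a t distribution with 999 degrees of freedom. The helper name `t_and_interval` is made up for this example.

```python
def t_and_interval(coef, std_err, t_crit=1.962):
    """Return the t statistic and an approximate 95% confidence interval."""
    t_stat = coef / std_err
    half_width = t_crit * std_err
    return t_stat, (coef - half_width, coef + half_width)

# Rounded values copied from the two coefficient tables
t_m1, ci_m1 = t_and_interval(6.241e+05, 3894.990)
t_m2, ci_m2 = t_and_interval(3.195e+04, 1.21e+05)

print(t_m1, ci_m1)
print(t_m2, ci_m2)

# An interval that contains zero means the coefficient cannot be
# distinguished from zero at the 5% level.
m1_interval_contains_zero = ci_m1[0] <= 0.0 <= ci_m1[1]
m2_interval_contains_zero = ci_m2[0] <= 0.0 <= ci_m2[1]
print(m1_interval_contains_zero, m2_interval_contains_zero)
```

Small differences from the printed tables come from rounding in the displayed inputs. The interval for `M1` lies well away from zero, while the interval for `M2` straddles it, which matches the `P>|t|` values of 0.000 and 0.792.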
