###  2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

In [9]:
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as stats
from sqlalchemy import create_engine
from sklearn import linear_model
import statsmodels.api as sm
import warnings 

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

warnings.filterwarnings('ignore')



## Load the house price data set

In [5]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

# create connection to database based on credentials
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

# create a dataframe from the imported data
house_price_data = pd.read_sql_query('select * from houseprices',con=engine)

# dispose of the connection
engine.dispose()

In [6]:
# U is the target variable
U = house_price_data['saleprice']
# T is the feature set: lot area, above-ground living area,
# year built, and garage area
T = house_price_data[['lotarea','grlivarea','yearbuilt','garagearea']]

T = sm.add_constant(T)

results = sm.OLS(U,T).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.691
Model:,OLS,Adj. R-squared:,0.69
Method:,Least Squares,F-statistic:,812.4
Date:,"Mon, 27 Jan 2020",Prob (F-statistic):,0.0
Time:,15:34:33,Log-Likelihood:,-17687.0
No. Observations:,1460,AIC:,35380.0
Df Residuals:,1455,BIC:,35410.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.623e+06,8.52e+04,-19.053,0.000,-1.79e+06,-1.46e+06
lotarea,0.6622,0.121,5.477,0.000,0.425,0.899
grlivarea,79.3633,2.550,31.118,0.000,74.360,84.366
yearbuilt,832.0119,43.844,18.976,0.000,746.007,918.017
garagearea,78.2947,6.882,11.376,0.000,64.794,91.795

0,1,2,3
Omnibus:,375.654,Durbin-Watson:,2.016
Prob(Omnibus):,0.0,Jarque-Bera (JB):,20536.558
Skew:,0.262,Prob(JB):,0.0
Kurtosis:,21.366,Cond. No.,1080000.0


In [7]:
house_price_data[['lotarea','grlivarea','yearbuilt','garagearea']].corr()


Unnamed: 0,lotarea,grlivarea,yearbuilt,garagearea
lotarea,1.0,0.263116,0.014228,0.180403
grlivarea,0.263116,1.0,0.19901,0.468997
yearbuilt,0.014228,0.19901,1.0,0.478954
garagearea,0.180403,0.468997,0.478954,1.0


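In the summaries above, the model has an R-squared of 0.691 (adjusted 0.690), a decent score that is unlikely to reflect overfitting, since the adjusted value is nearly identical with only four predictors. The p-value of the F-statistic is effectively zero, and the p-values of all coefficients are reported as 0.000. The correlation matrix shows no pairwise correlation among the features above 0.48. AIC and BIC are both lower than those of the weather model for apparent temperature difference, although AIC and BIC are only meaningfully comparable between models fit to the same data.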
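As a sanity check on the summary, the AIC and BIC can be reproduced by hand from the reported log-likelihood. The sketch below assumes the definitions AIC = -2·llf + 2k and BIC = -2·llf + k·ln(n), where k counts the estimated coefficients including the constant. The function name `aic_bic` is illustrative, and the inputs are the rounded values printed in the summary, so the results match the reported figures only to the displayed precision.

```python
import math

def aic_bic(llf, n_obs, n_params):
    """Return (AIC, BIC) from a log-likelihood, sample size, and parameter count."""
    aic = -2.0 * llf + 2.0 * n_params
    bic = -2.0 * llf + n_params * math.log(n_obs)
    return aic, bic

# Values read from the first summary: Log-Likelihood, No. Observations,
# and Df Model + 1 for the constant (4 predictors + intercept = 5).
aic, bic = aic_bic(llf=-17687.0, n_obs=1460, n_params=5)
print(f"AIC ~ {aic:.0f}, BIC ~ {bic:.0f}")
```

Both criteria start from the same -2·llf term and differ only in the penalty per parameter, which is why BIC is the larger of the two for a sample of this size.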

In [30]:
#house_price_data['poolarea'].value_counts()
house_price_data.columns

Index(['id', 'mssubclass', 'mszoning', 'lotfrontage', 'lotarea', 'street', 'alley', 'lotshape', 'landcontour', 'utilities', 'lotconfig', 'landslope', 'neighborhood', 'condition1', 'condition2', 'bldgtype', 'housestyle', 'overallqual', 'overallcond', 'yearbuilt', 'yearremodadd', 'roofstyle', 'roofmatl', 'exterior1st', 'exterior2nd', 'masvnrtype', 'masvnrarea', 'exterqual', 'extercond', 'foundation', 'bsmtqual', 'bsmtcond', 'bsmtexposure', 'bsmtfintype1', 'bsmtfinsf1', 'bsmtfintype2', 'bsmtfinsf2', 'bsmtunfsf', 'totalbsmtsf', 'heating', 'heatingqc', 'centralair', 'electrical', 'firstflrsf', 'secondflrsf', 'lowqualfinsf', 'grlivarea', 'bsmtfullbath', 'bsmthalfbath', 'fullbath', 'halfbath', 'bedroomabvgr', 'kitchenabvgr', 'kitchenqual', 'totrmsabvgrd', 'functional', 'fireplaces', 'fireplacequ', 'garagetype', 'garageyrblt', 'garagefinish', 'garagecars', 'garagearea', 'garagequal', 'garagecond', 'paveddrive', 'wooddecksf', 'openporchsf', 'enclosedporch', 'threessnporch', 'screenporch',
     

In [33]:
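# encode centralair as a single 0/1 dummy column (the first category is dropped)
house_price_data['has_central_air'] = pd.get_dummies(house_price_data['centralair'], drop_first=True).iloc[:, 0].astype(int)

# total bathrooms: full baths count as 1 and half baths as 0.5, basement included
house_price_data['total_baths'] = (house_price_data.fullbath + house_price_data.bsmtfullbath
                                   + 0.5 * house_price_data.halfbath + 0.5 * house_price_data.bsmthalfbath)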

house_price_data.head()


Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,lotconfig,landslope,neighborhood,condition1,condition2,bldgtype,housestyle,overallqual,overallcond,yearbuilt,yearremodadd,roofstyle,roofmatl,exterior1st,exterior2nd,masvnrtype,masvnrarea,exterqual,extercond,foundation,bsmtqual,bsmtcond,bsmtexposure,bsmtfintype1,bsmtfinsf1,bsmtfintype2,bsmtfinsf2,bsmtunfsf,totalbsmtsf,heating,heatingqc,centralair,electrical,firstflrsf,secondflrsf,lowqualfinsf,grlivarea,bsmtfullbath,bsmthalfbath,fullbath,halfbath,bedroomabvgr,kitchenabvgr,kitchenqual,totrmsabvgrd,functional,fireplaces,fireplacequ,garagetype,garageyrblt,garagefinish,garagecars,garagearea,garagequal,garagecond,paveddrive,wooddecksf,openporchsf,enclosedporch,threessnporch,screenporch,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice,has_central_air,total baths,total_baths
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500,1,3.5,3.5
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500,1,2.5,2.5
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500,1,3.5,3.5
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000,1,2.0,2.0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000,1,3.5,3.5


In [36]:
# Y is the target variable
Y = house_price_data['saleprice']
# X is the feature set: the four original features plus overall condition,
# the central air dummy, and total bathrooms
X = house_price_data[['lotarea','grlivarea','yearbuilt','garagearea','overallcond','has_central_air','total_baths']]

X = sm.add_constant(X)

results = sm.OLS(Y,X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.709
Model:,OLS,Adj. R-squared:,0.708
Method:,Least Squares,F-statistic:,506.2
Date:,"Mon, 27 Jan 2020",Prob (F-statistic):,0.0
Time:,15:56:00,Log-Likelihood:,-17642.0
No. Observations:,1460,AIC:,35300.0
Df Residuals:,1452,BIC:,35340.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.793e+06,1.08e+05,-16.586,0.000,-2.01e+06,-1.58e+06
lotarea,0.6231,0.118,5.280,0.000,0.392,0.855
grlivarea,73.8663,2.911,25.373,0.000,68.156,79.577
yearbuilt,887.9175,55.266,16.066,0.000,779.508,996.327
garagearea,75.7833,6.691,11.327,0.000,62.659,88.908
overallcond,9574.6030,1147.089,8.347,0.000,7324.473,1.18e+04
has_central_air,-2082.6479,5192.170,-0.401,0.688,-1.23e+04,8102.308
total_baths,8138.3280,2090.669,3.893,0.000,4037.274,1.22e+04

0,1,2,3
Omnibus:,408.126,Durbin-Watson:,1.993
Prob(Omnibus):,0.0,Jarque-Bera (JB):,23510.665
Skew:,0.411,Prob(JB):,0.0
Kurtosis:,22.642,Cond. No.,1410000.0


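Adding the extra variables raised R-squared from 0.691 to 0.709 and adjusted R-squared from 0.690 to 0.708, an increase of about two percentage points. The new coefficients are significant (p-values reported as 0.000), with the exception of has_central_air (p = 0.688). AIC fell from 35380 to 35300 and BIC from 35410 to 35340, a small change. The F-statistic dropped by more than 300, from 812.4 to 506.2, but its p-value is still effectively zero.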
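A drop in the F-statistic is expected when predictors are added for only a small gain in R-squared. The rough sketch below assumes the standard OLS identities F = (R² / k) / ((1 - R²) / (n - k - 1)) and adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), with k the number of predictors excluding the constant. The helper names are illustrative, and the inputs are the rounded R-squared values from the two summaries, so the results only approximate the reported statistics.

```python
def f_statistic(r2, n_obs, k):
    """Overall F-statistic for an OLS model with an intercept and k predictors."""
    return (r2 / k) / ((1.0 - r2) / (n_obs - k - 1))

def adjusted_r2(r2, n_obs, k):
    """Adjusted R-squared, which penalizes each additional predictor."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - k - 1)

n_obs = 1460
f_four = f_statistic(0.691, n_obs, 4)    # first model, 4 predictors
f_seven = f_statistic(0.709, n_obs, 7)   # second model, 7 predictors
adj_four = adjusted_r2(0.691, n_obs, 4)
adj_seven = adjusted_r2(0.709, n_obs, 7)

print(f"F: {f_four:.1f} -> {f_seven:.1f}")
print(f"adjusted R-squared: {adj_four:.3f} -> {adj_seven:.3f}")
```

The numerator divides R-squared by k, so moving from 4 to 7 predictors with nearly the same R-squared shrinks F even though the model as a whole remains highly significant.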


In [38]:
# Y is the target variable
Y = house_price_data['saleprice']
# X is the feature set: the same as the previous model
# without the central air dummy
X = house_price_data[['lotarea','grlivarea','yearbuilt','garagearea','overallcond','total_baths']]

X = sm.add_constant(X)

results = sm.OLS(Y,X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.709
Model:,OLS,Adj. R-squared:,0.708
Method:,Least Squares,F-statistic:,590.8
Date:,"Mon, 27 Jan 2020",Prob (F-statistic):,0.0
Time:,16:01:17,Log-Likelihood:,-17642.0
No. Observations:,1460,AIC:,35300.0
Df Residuals:,1453,BIC:,35340.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.778e+06,1.01e+05,-17.592,0.000,-1.98e+06,-1.58e+06
lotarea,0.6210,0.118,5.269,0.000,0.390,0.852
grlivarea,73.8559,2.910,25.378,0.000,68.147,79.565
yearbuilt,879.5451,51.158,17.193,0.000,779.194,979.896
garagearea,75.6716,6.683,11.323,0.000,62.562,88.781
overallcond,9433.6116,1091.592,8.642,0.000,7292.347,1.16e+04
total_baths,8163.8421,2089.098,3.908,0.000,4065.872,1.23e+04

0,1,2,3
Omnibus:,408.829,Durbin-Watson:,1.993
Prob(Omnibus):,0.0,Jarque-Bera (JB):,23462.772
Skew:,0.416,Prob(JB):,0.0
Kurtosis:,22.621,Cond. No.,1320000.0


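Removing the central air term raised the F-statistic from 506.2 to 590.8 while leaving R-squared, adjusted R-squared, AIC, and BIC unchanged at the reported precision. With the same fit from one fewer variable, this is the best model so far.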
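To compare the three specifications side by side, the reported metrics can be gathered into one table. The helper below is a sketch: `collect_fit_metrics` is a hypothetical name, and it reads the `rsquared`, `rsquared_adj`, `fvalue`, `aic`, `bic`, and `df_model` attributes that a fitted statsmodels OLS results object exposes. So that the example runs without the database, `SimpleNamespace` stand-ins carry the rounded values from the three summaries; in the notebook itself you would pass each fitted `results` object instead.

```python
from types import SimpleNamespace

METRICS = ("rsquared", "rsquared_adj", "fvalue", "aic", "bic", "df_model")

def collect_fit_metrics(results):
    """Read goodness-of-fit attributes from a fitted OLS results object."""
    return {name: getattr(results, name) for name in METRICS}

# Stand-ins holding the rounded values printed in the summaries (not refitted).
fitted = {
    "4 predictors": SimpleNamespace(rsquared=0.691, rsquared_adj=0.690, fvalue=812.4,
                                    aic=35380.0, bic=35410.0, df_model=4),
    "7 predictors": SimpleNamespace(rsquared=0.709, rsquared_adj=0.708, fvalue=506.2,
                                    aic=35300.0, bic=35340.0, df_model=7),
    "6 predictors": SimpleNamespace(rsquared=0.709, rsquared_adj=0.708, fvalue=590.8,
                                    aic=35300.0, bic=35340.0, df_model=6),
}

table = {label: collect_fit_metrics(res) for label, res in fitted.items()}

# Prefer lower AIC, then lower BIC; break ties with the more parsimonious model.
best = min(table, key=lambda label: (table[label]["aic"],
                                     table[label]["bic"],
                                     table[label]["df_model"]))

for label, row in table.items():
    print(label, row)
print("preferred:", best)
```

On these rounded figures the 7- and 6-predictor models tie on AIC and BIC, so the tie-break on `df_model` selects the smaller model. With unrounded values from the fitted objects, the ranking could be read directly from AIC and BIC.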