In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sqlalchemy import create_engine

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

In [2]:
#load data
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
house_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [3]:
#select categorical variables for the model
categories2 = ['mszoning', 'street','centralair', 'kitchenqual']
#create dummy variables once and reuse them
#(dtype=int keeps the columns numeric; newer pandas versions default to bool)
zoning_dummies = pd.get_dummies(house_df.mszoning, prefix='mszoning', drop_first=True, dtype=int)
kitchen_dummies = pd.get_dummies(house_df.kitchenqual, prefix='kitchenqual', drop_first=True, dtype=int)
house_df = pd.concat([house_df, zoning_dummies, kitchen_dummies], axis=1)
zoning_column_names = list(zoning_dummies.columns)
kitchen_column_names = list(kitchen_dummies.columns)
#two-level categories each reduce to a single 0/1 column
house_df['street_access'] = pd.get_dummies(house_df.street, drop_first=True, dtype=int).iloc[:, 0]
house_df['has_AC'] = pd.get_dummies(house_df.centralair, drop_first=True, dtype=int).iloc[:, 0]

In [4]:
#target variable
Y = house_df['saleprice']
#feature set
X = house_df[['overallqual', 'totalbsmtsf', 'firstflrsf','grlivarea', 'garagecars', 'garagearea', 
             'street_access', 'has_AC'] + zoning_column_names + kitchen_column_names]

#define linear model
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

Dep. Variable:,saleprice,R-squared:,0.794
Model:,OLS,Adj. R-squared:,0.792
Method:,Least Squares,F-statistic:,371.8
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,0.0
Time:,13:48:13,Log-Likelihood:,-17390.0
No. Observations:,1460,AIC:,34810.0
Df Residuals:,1444,BIC:,34900.0
Df Model:,15,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-2.691e+04,1.89e+04,-1.426,0.154,-6.39e+04,1.01e+04
overallqual,1.777e+04,1163.014,15.280,0.000,1.55e+04,2.01e+04
totalbsmtsf,20.6404,4.047,5.100,0.000,12.702,28.579
firstflrsf,4.3663,4.793,0.911,0.362,-5.035,13.767
grlivarea,44.4075,2.521,17.617,0.000,39.463,49.352
garagecars,1.419e+04,2843.841,4.990,0.000,8613.025,1.98e+04
garagearea,7.2384,9.889,0.732,0.464,-12.161,26.638
street_access,-5206.4437,1.54e+04,-0.337,0.736,-3.55e+04,2.51e+04
has_AC,1.016e+04,4306.673,2.358,0.018,1708.935,1.86e+04

Omnibus:,500.709,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,56929.274
Skew:,-0.544,Prob(JB):,0.0
Kurtosis:,33.572,Cond. No.,65400.0


* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.

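The F-statistic for this model is 371.8 and its p-value is reported as 0.00, i.e. smaller than the displayed precision. We can therefore reject the null hypothesis that all slope coefficients are zero: the model explains significantly more than a reduced model containing only the constant. The R-squared value is 0.794 and the adjusted R-squared value is 0.792, so the model explains roughly 79% of the variance in sale price, which still leaves some room for improvement. The AIC is 3.481e+04 and the BIC is 3.490e+04; these are only meaningful relative to other models fitted to the same target, where lower is better.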
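
As a sanity check, the goodness-of-fit numbers in the summary can be recomputed from a handful of its entries. The sketch below uses only the rounded values printed above (R-squared, number of observations, Df Model and log-likelihood), so it should agree with the summary only up to rounding. The variable names are illustrative and not part of the notebook.

```python
import math

# Values copied from the first model's summary (rounded as printed).
n_obs = 1460          # No. Observations
k = 15                # Df Model (regressors, excluding the constant)
r2 = 0.794            # R-squared
log_lik = -17390.0    # Log-Likelihood

df_resid = n_obs - k - 1   # residual degrees of freedom (Df Residuals)

# Adjusted R-squared penalizes R-squared for the number of regressors.
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid

# Overall F-statistic: explained variance per regressor over
# unexplained variance per residual degree of freedom.
f_stat = (r2 / k) / ((1 - r2) / df_resid)

# The information criteria count the constant as a parameter too (k + 1).
aic = 2 * (k + 1) - 2 * log_lik
bic = math.log(n_obs) * (k + 1) - 2 * log_lik

print(f"adj. R-squared: {adj_r2:.3f}")
print(f"F-statistic:    {f_stat:.1f}")
print(f"AIC: {aic:.0f}  BIC: {bic:.0f}")
```

The recomputed values land close to the reported 0.792, 371.8, 3.481e+04 and 3.490e+04. In the notebook itself the exact figures are available as `results.rsquared_adj`, `results.fvalue`, `results.f_pvalue`, `results.aic` and `results.bic`.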

Do you think your model is satisfactory? If so, why?
I think this model is satisfactory based on the R-squared and adjusted R-squared values, but I would expect some tweaking to improve it: firstflrsf, garagearea and street_access all have p-values well above 0.05 in the coefficient table.

In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables.

In [5]:
#model 2 - remove zoning variables and variables with overlap
#feature set
X2 = house_df[['overallqual', 'grlivarea', 'garagearea', 
             'street_access', 'has_AC'] + kitchen_column_names]

#define linear model
X2 = sm.add_constant(X2)

results2 = sm.OLS(Y, X2).fit()

results2.summary()

Dep. Variable:,saleprice,R-squared:,0.771
Model:,OLS,Adj. R-squared:,0.77
Method:,Least Squares,F-statistic:,612.3
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,0.0
Time:,13:58:35,Log-Likelihood:,-17466.0
No. Observations:,1460,AIC:,34950.0
Df Residuals:,1451,BIC:,35000.0
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.02e+04,1.82e+04,-0.562,0.574,-4.58e+04,2.54e+04
overallqual,2.076e+04,1151.792,18.027,0.000,1.85e+04,2.3e+04
grlivarea,49.0067,2.429,20.178,0.000,44.243,53.771
garagearea,61.6688,5.930,10.399,0.000,50.036,73.301
street_access,1789.3265,1.58e+04,0.114,0.910,-2.91e+04,3.27e+04
has_AC,1.85e+04,4401.443,4.204,0.000,9870.966,2.71e+04
kitchenqual_Fa,-6.287e+04,8346.717,-7.533,0.000,-7.92e+04,-4.65e+04
kitchenqual_Gd,-5.404e+04,4394.815,-12.297,0.000,-6.27e+04,-4.54e+04
kitchenqual_TA,-6.885e+04,5024.271,-13.704,0.000,-7.87e+04,-5.9e+04

Omnibus:,374.373,Durbin-Watson:,1.998
Prob(Omnibus):,0.0,Jarque-Bera (JB):,24942.075
Skew:,0.017,Prob(JB):,0.0
Kurtosis:,23.249,Cond. No.,39200.0


The F-statistic for this model is 612.3, which is higher than the 371.8 of the first model, but the increase reflects the smaller number of regressors (8 instead of 15), not a better fit. The R-squared value is 0.771 and the adjusted R-squared value is 0.770, both lower than for the first model. The AIC is 3.495e+04 and the BIC is 3.500e+04, both higher than the corresponding values for the first model. By adjusted R-squared, AIC and BIC, this model fits worse than the first.
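
Because the second model's regressors are a subset of the first model's, the two can also be compared with a nested-model (partial) F-test. The following is a minimal sketch that uses only the rounded R-squared values printed in the two summaries, so the statistic is approximate. The variable names are illustrative.

```python
# Rounded values copied from the first two summaries.
n_obs = 1460
r2_full, k_full = 0.794, 15              # first model
r2_restricted, k_restricted = 0.771, 8   # second model (nested in the first)

q = k_full - k_restricted            # number of regressors dropped
df_resid_full = n_obs - k_full - 1   # residual degrees of freedom, full model

# Partial F: fit lost per dropped regressor, relative to the full
# model's unexplained variance per residual degree of freedom.
partial_f = ((r2_full - r2_restricted) / q) / ((1 - r2_full) / df_resid_full)

print(f"partial F({q}, {df_resid_full}) = {partial_f:.1f}")
```

A value far above 1 suggests that the dropped variables jointly carry information, which agrees with the AIC and BIC comparison. In the notebook, `results.compare_f_test(results2)` should give the exact statistic together with its p-value.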

In [7]:
#define model 3
#feature set - add year built
X3 = house_df[['overallqual', 'totalbsmtsf', 'firstflrsf','grlivarea', 'garagecars', 'garagearea', 
             'street_access', 'has_AC', 'yearbuilt'] + zoning_column_names + kitchen_column_names]

#define linear model
X3 = sm.add_constant(X3)

results3 = sm.OLS(Y, X3).fit()

results3.summary()

Dep. Variable:,saleprice,R-squared:,0.797
Model:,OLS,Adj. R-squared:,0.795
Method:,Least Squares,F-statistic:,354.1
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,0.0
Time:,14:10:54,Log-Likelihood:,-17380.0
No. Observations:,1460,AIC:,34790.0
Df Residuals:,1443,BIC:,34880.0
Df Model:,16,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-4.252e+05,9.32e+04,-4.563,0.000,-6.08e+05,-2.42e+05
overallqual,1.641e+04,1197.021,13.711,0.000,1.41e+04,1.88e+04
totalbsmtsf,18.2488,4.059,4.496,0.000,10.286,26.211
firstflrsf,5.9757,4.777,1.251,0.211,-3.395,15.346
grlivarea,47.1761,2.584,18.256,0.000,42.107,52.245
garagecars,1.155e+04,2890.242,3.997,0.000,5882.533,1.72e+04
garagearea,9.3790,9.840,0.953,0.341,-9.924,28.682
street_access,-3048.6771,1.53e+04,-0.199,0.842,-3.31e+04,2.7e+04
has_AC,6588.9260,4357.421,1.512,0.131,-1958.631,1.51e+04

Omnibus:,536.991,Durbin-Watson:,1.991
Prob(Omnibus):,0.0,Jarque-Bera (JB):,65190.848
Skew:,-0.662,Prob(JB):,0.0
Kurtosis:,35.709,Cond. No.,299000.0


The F-statistic for this model is 354.1, which is lower than the 371.8 of the original model. The R-squared value is 0.797 and the adjusted R-squared value is 0.795, both slightly higher than for the original model (0.794 and 0.792). The AIC is 3.479e+04 and the BIC is 3.488e+04, both slightly lower than for the first model, so adding yearbuilt gives a small improvement in fit.
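
AIC and BIC both reward a higher log-likelihood but charge for each extra parameter: 2 per parameter for AIC and ln(n) per parameter for BIC. The sketch below checks, using the rounded log-likelihoods printed in the first and third summaries, that the gain from adding yearbuilt outweighs both charges. Because of the rounding the gain is only approximate, and the variable names are illustrative.

```python
import math

n_obs = 1460
log_lik_model1 = -17390.0   # first model, as printed (rounded)
log_lik_model3 = -17380.0   # third model, as printed (rounded)
extra_params = 1            # yearbuilt

# Improvement in the 2*log-likelihood term shared by AIC and BIC.
gain = 2 * (log_lik_model3 - log_lik_model1)

# Penalty each criterion charges for the extra parameter.
aic_penalty = 2 * extra_params
bic_penalty = math.log(n_obs) * extra_params

print(f"gain in 2*logL: {gain:.0f}")
print(f"AIC penalty: {aic_penalty}, BIC penalty: {bic_penalty:.2f}")
print("AIC improves:", gain > aic_penalty)
print("BIC improves:", gain > bic_penalty)
```

BIC's per-parameter charge grows with the sample size, so with 1,460 observations it is the stricter of the two criteria here.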

In [8]:
#define model 4
#define interaction of overall quality and above-ground living area
house_df['qual_area'] = house_df['overallqual'] * house_df['grlivarea']
#add interaction term to feature set
X4 = house_df[['overallqual', 'totalbsmtsf', 'firstflrsf','grlivarea', 'garagecars', 'garagearea', 
             'street_access', 'has_AC', 'qual_area'] + zoning_column_names + kitchen_column_names]

#define linear model
X4 = sm.add_constant(X4)

results4 = sm.OLS(Y, X4).fit()

results4.summary()

Dep. Variable:,saleprice,R-squared:,0.804
Model:,OLS,Adj. R-squared:,0.802
Method:,Least Squares,F-statistic:,370.9
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,0.0
Time:,14:22:27,Log-Likelihood:,-17353.0
No. Observations:,1460,AIC:,34740.0
Df Residuals:,1443,BIC:,34830.0
Df Model:,16,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,4.522e+04,2.02e+04,2.235,0.026,5537.630,8.49e+04
overallqual,4693.6763,1895.515,2.476,0.013,975.416,8411.937
totalbsmtsf,15.6771,3.990,3.929,0.000,7.850,23.504
firstflrsf,7.7097,4.692,1.643,0.101,-1.493,16.913
grlivarea,-16.7536,7.515,-2.229,0.026,-31.495,-2.012
garagecars,1.757e+04,2801.934,6.269,0.000,1.21e+04,2.31e+04
garagearea,-4.1320,9.738,-0.424,0.671,-23.234,14.970
street_access,-6632.6083,1.51e+04,-0.441,0.660,-3.62e+04,2.29e+04
has_AC,1.269e+04,4211.834,3.014,0.003,4430.639,2.1e+04

Omnibus:,1124.62,Durbin-Watson:,2.014
Prob(Omnibus):,0.0,Jarque-Bera (JB):,201716.427
Skew:,-2.678,Prob(JB):,0.0
Kurtosis:,60.334,Cond. No.,316000.0


The F-statistic for this model is 370.9, which is marginally lower than the 371.8 of the first model. The R-squared value is 0.804 and the adjusted R-squared value is 0.802, the highest of the four models. The AIC is 3.474e+04 and the BIC is 3.483e+04, the lowest of the four models. By adjusted R-squared, AIC and BIC, this is the best of the four models.
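
To make the final comparison explicit, the four models can be ranked side by side. The sketch below hard-codes the rounded figures from the four summaries and sorts by AIC. The dictionary and its labels are illustrative and not part of the notebook.

```python
# Figures copied from the four summaries (rounded as printed).
models = {
    "model 1 (baseline)":       {"adj_r2": 0.792, "aic": 34810.0, "bic": 34900.0},
    "model 2 (fewer features)": {"adj_r2": 0.770, "aic": 34950.0, "bic": 35000.0},
    "model 3 (+ yearbuilt)":    {"adj_r2": 0.795, "aic": 34790.0, "bic": 34880.0},
    "model 4 (+ qual_area)":    {"adj_r2": 0.802, "aic": 34740.0, "bic": 34830.0},
}

best_aic = min(m["aic"] for m in models.values())

# Rank by AIC; dAIC is the gap to the best model (0 for the winner).
ranking = sorted(models, key=lambda name: models[name]["aic"])
for name in ranking:
    stats = models[name]
    delta = stats["aic"] - best_aic
    print(f"{name:<26} adj R2={stats['adj_r2']:.3f}  "
          f"AIC={stats['aic']:.0f}  BIC={stats['bic']:.0f}  dAIC={delta:.0f}")
```

In the notebook, the same table could be built directly from `results`, `results2`, `results3` and `results4` through their `rsquared_adj`, `aic` and `bic` attributes, which avoids the rounding in the printed summaries.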