3. House prices model
In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

Load the houseprices data from Thinkful's database.
Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
Now, exclude the insignificant features from your model. Did anything change?
Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
Do the results sound reasonable to you? If not, try to explain the potential reasons.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import linear_model
from sqlalchemy import create_engine
import warnings

warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'
table_name = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

houses_df = pd.read_sql_query('select * from houseprices',con=engine)

engine.dispose()

In [3]:
Y = houses_df['saleprice']
X = houses_df[['overallqual','grlivarea', 'garagecars']]

X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.739
Model:                            OLS   Adj. R-squared:                  0.739
Method:                 Least Squares   F-statistic:                     1375.
Date:                Fri, 24 Jan 2020   Prob (F-statistic):               0.00
Time:                        14:58:10   Log-Likelihood:                -17563.
No. Observations:                1460   AIC:                         3.513e+04
Df Residuals:                    1456   BIC:                         3.516e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -9.883e+04   4842.897    -20.408      

In my previous model the constant is about -\\$98,830 (-9.883e+04 in the summary above), and for every one-point increase in overall quality there is a \\$2,700 increase in sale price, holding the other features fixed. Likewise, every additional square foot of above-ground living area adds \\$50, and each additional car space in the garage adds \\$2,130. All of these are statistically significant contributors to the sale price. The negative constant is not a real price; it is the extrapolated value for a house with zero quality, zero living area, and no garage, which does not occur in the data.

Each of these features makes sense as a driver of price. Below I run another model that includes more features to see what else I can suss out.

In [5]:
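# Assumes utilities and street each have exactly two levels, so drop_first leaves one column.
# .iloc[:, 0] keeps that column as a Series, and dtype=int keeps the design matrix numeric.
houses_df["type_utilities"] = pd.get_dummies(houses_df["utilities"], drop_first=True, dtype=int).iloc[:, 0]
houses_df["type_street"] = pd.get_dummies(houses_df["street"], drop_first=True, dtype=int).iloc[:, 0]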

Y = houses_df['saleprice']
X = houses_df[['overallqual','grlivarea', 'garagecars', 'type_utilities', 'type_street', 'lotarea']]

X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.749
Method:                 Least Squares   F-statistic:                     724.8
Date:                Fri, 24 Jan 2020   Prob (F-statistic):               0.00
Time:                        15:59:30   Log-Likelihood:                -17533.
No. Observations:                1460   AIC:                         3.508e+04
Df Residuals:                    1453   BIC:                         3.512e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const          -1.331e+05   1.72e+04     -7.

Ok, so in this model type_utilities has a large p-value, and type_street is on the cusp of not being significant. Let's see what happens if I drop both.

In [6]:
Y = houses_df['saleprice']
X = houses_df[['overallqual','grlivarea', 'garagecars', 'lotarea']]

X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.749
Model:                            OLS   Adj. R-squared:                  0.748
Method:                 Least Squares   F-statistic:                     1084.
Date:                Fri, 24 Jan 2020   Prob (F-statistic):               0.00
Time:                        16:00:59   Log-Likelihood:                -17536.
No. Observations:                1460   AIC:                         3.508e+04
Df Residuals:                    1455   BIC:                         3.511e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -1.032e+05   4788.762    -21.549      

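In this model lot area contributes \\$0.81 per extra square foot of lot. Compared with the six-feature model, the constant rose from -1.331e+05 to -1.032e+05, while R-squared barely moved (0.750 to 0.749), so dropping the two dummy features changed very little. With R-squared at 0.749, about a quarter of the variation in sale price is still unexplained.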