Challenge
In this module, you learned how to approach and solve regression problems using linear regression models. Throughout the module, you worked on a house price dataset from Kaggle. In this challenge, you will keep working on this dataset.

The scenario
The housing market is one of the most important parts of every country's economy. For many people, purchasing a home is one of the primary ways to build wealth and savings. Predicting house prices is therefore a central topic in economic and financial circles.

The house price dataset from Kaggle includes several features of the houses along with their prices at the time of sale. So far in this module, you have built and implemented some models using this dataset.

In this challenge, you are required to improve your model with respect to its prediction performance.

To complete this challenge, submit a Jupyter notebook containing your solutions to the following tasks.

Steps
Load the houseprices data from Thinkful's database.


In [134]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import pylab
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
from sqlalchemy import create_engine
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import seaborn as sns

In [135]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df1 = pd.read_sql_query('SELECT * FROM houseprices', con=engine)

engine.dispose()

Do data cleaning, exploratory data analysis, and feature engineering. You can use your previous work in this module. But make sure that your work is satisfactory.


In [136]:
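# Numeric predictors plus one-hot encoded zoning (drop_first=True keeps one level as the baseline).
numeric_cols = ['overallcond', 'grlivarea', 'garagecars', 'fullbath', 'halfbath', 'totalbsmtsf']
zoning_dummies = pd.get_dummies(df1['mszoning'], prefix='mszoning', drop_first=True)
X = pd.concat([df1[numeric_cols], zoning_dummies], axis=1)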
y = df1['saleprice']


Now, split your data into train and test sets where 20% of the data resides in the test set.


In [137]:
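# Note: test_size is not set here, so scikit-learn's default of 25% applies.
# Pass test_size=0.2 to get the 20% test set the task asks for.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=40)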
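
A minimal sketch of an explicit 20% split. It runs on synthetic stand-in data rather than the house price table, so `X_demo`, `y_demo`, and the column names are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (not the real data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1460, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.normal(size=1460), name="target")

# test_size=0.2 reserves 20% of the rows for the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
```

Fixing `random_state` keeps the split reproducible between runs.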



Build several linear regression models including Lasso, Ridge, or ElasticNet and train them in the training set. 

In [138]:
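from sklearn.linear_model import LinearRegression
# Metric helpers used in the print statements below.
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse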
lrm = LinearRegression()
lrm.fit(X_train, y_train)
y_lrmpredict = lrm.predict(X_test)


print("R-squared of the model on the training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_lrmpredict)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_lrmpredict)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_lrmpredict)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_lrmpredict) / y_test)) * 100))


R-squared of the model on the training set is: 0.6965164646961184
-----Test set statistics-----
R-squared of the model on the test set is: 0.7389648850278968
Mean absolute error of the prediction is: 25011.104252306544
Mean squared error of the prediction is: 1404518000.3393893
Root mean squared error of the prediction is: 37476.89955611843
Mean absolute percentage error of the prediction is: 15.19866626940877


Use k-fold cross-validation to select the best hyperparameters if your models include one!


In [139]:
from sklearn.linear_model import Ridge
RG = Ridge()
pms = [{'alpha':[np.power(10.0,p) for p in np.arange(0,40,1)]}]

from sklearn.model_selection import GridSearchCV
RGcv = GridSearchCV(RG, param_grid=pms)

RGcv.fit(X_train, y_train)

print('Best lambda = ' + str(RGcv.best_estimator_.alpha))
y_test_predRG = RGcv.predict(X_test)



from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse

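# Refit Ridge with the alpha selected by cross-validation instead of a hard-coded value.
RGbest = Ridge(alpha=RGcv.best_params_['alpha'])
RGbest.fit(X_train, y_train)

y_test_predRG = RGbest.predict(X_test)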

print("Best R-squared of the model for Ridge on the training set is: {}".format(RGcv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the Ridge model on the test set is: {}".format(RGcv.score(X_test, y_test)))
print("Mean absolute error of the Ridge prediction is: {}".format(mean_absolute_error(y_test, y_test_predRG)))
print("Mean squared error of the Ridge prediction is: {}".format(mse(y_test, y_test_predRG)))
print("Root mean squared error of the Ridge prediction is: {}".format(rmse(y_test, y_test_predRG)))
print("Mean absolute percentage error of the Ridge prediction is: {}".format(np.mean(np.abs((y_test - y_test_predRG) / y_test)) * 100))


Best lambda = 1.0
Best R-squared of the model for Ridge on the training set is: 0.6962786176425262
-----Test set statistics-----
R-squared of the Ridge model on the test set is: 0.7381916312441654
Mean absolute error of the Ridge prediction is: 24956.28247687255
Mean squared error of the Ridge prediction is: 1408678547.314831
Root mean squared error of the Ridge prediction is: 37532.36666285289
Mean absolute percentage error of the Ridge prediction is: 15.126013113346207
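
The grid above starts at 10^0, and the selected value (1.0) sits on its lower edge, so smaller penalties were never tried. The sketch below shows one way to search a grid that extends below 1 with an explicit k-fold splitter. It uses synthetic data from `make_regression`; the grid bounds, the fold count, and the names `X_demo`, `y_demo`, and `folds` are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data standing in for X_train / y_train.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0
)

# Log-spaced candidates from 1e-3 to 1e3 (an illustrative range).
param_grid = {"alpha": np.logspace(-3, 3, 13)}

# Shuffled 5-fold splitter with a fixed seed so the folds are reproducible.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    Ridge(), param_grid=param_grid, cv=folds, scoring="neg_mean_squared_error"
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
print("Best alpha:", best_alpha)
```

If the selected value again lands on an edge of the grid, widening the grid in that direction is a reasonable next step.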


In [140]:
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
alphas = [np.power(10.0,p) for p in np.arange(0,40,1)]
lasso_cv = LassoCV(alphas=alphas, cv=5)

lasso_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)

print("Best alpha value is: {}".format(lasso_cv.alpha_))
print("R-squared of the Lasso model in training set is: {}".format(lasso_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lasso_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 100.0
R-squared of the Lasso model in training set is: 0.6958410716205874
-----Test set statistics-----
R-squared of the model in test set is: 0.7374029333273326
Mean absolute error of the prediction is: 24990.598794263475
Mean squared error of the prediction is: 1412922192.5467768
Root mean squared error of the prediction is: 37588.85729237824
Mean absolute percentage error of the prediction is: 15.141948317106404


In [141]:

alphas = [np.power(10.0,p) for p in np.arange(-3,40,1)]
EL_cv = ElasticNetCV(alphas=alphas, cv=10, max_iter=100000)

EL_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = EL_cv.predict(X_train)
y_preds_test = EL_cv.predict(X_test)

print("Best alpha value is: {}".format(EL_cv.alpha_))
print("R-squared of the Elastic model in training set is: {}".format(EL_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(EL_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 0.001
R-squared of the Elastic model in training set is: 0.6964035133647133
-----Test set statistics-----
R-squared of the model in test set is: 0.7386066243935697
Mean absolute error of the prediction is: 24970.353022990657
Mean squared error of the prediction is: 1406445647.1610792
Root mean squared error of the prediction is: 37502.60853808811
Mean absolute percentage error of the prediction is: 15.145751864607234
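
The L1 and L2 penalties act on coefficient size, so features measured on very different scales (living area in square feet versus a count of garage spaces, for example) are penalized unevenly. `StandardScaler` is imported at the top of the notebook but never applied. The sketch below, on synthetic data, shows one possible way to standardize inside a pipeline and to cross-validate `l1_ratio` as well, since `ElasticNetCV` otherwise keeps its default of 0.5. The `l1_ratio` candidates and the rescaled column are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data; one column is blown up to mimic mixed scales.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0
)
X_demo[:, 0] *= 1000.0

# The scaler is fit inside each CV fold, so no information leaks from
# the validation part of a fold into the scaling step.
l1_candidates = [0.1, 0.5, 0.9, 1.0]
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=l1_candidates, cv=5, max_iter=100000),
)
model.fit(X_demo, y_demo)

enet = model.named_steps["elasticnetcv"]
fit_r2 = model.score(X_demo, y_demo)
print("alpha:", enet.alpha_, "l1_ratio:", enet.l1_ratio_)
print("R-squared on the synthetic data:", fit_r2)
```

Whether scaling changes the ranking of the models on the house price data is something only a rerun on that data can show.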


Evaluate your best model on the test set.


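The best model is Ridge regression with a lambda (alpha) of 1. On the test set it has the lowest mean absolute percentage error (about 15.1%) and the lowest mean absolute error (about 24,956), and its R-squared values are close across the training and test sets (~0.70 and ~0.74 respectively). The margins are small, however: plain linear regression has a marginally higher test R-squared (0.739 versus 0.738) and a slightly lower RMSE, so the four models perform almost identically on this feature set.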
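
A side-by-side table can make this kind of comparison easier to read than separate print blocks. The sketch below collects the same metrics for several models into one DataFrame. It runs on synthetic data, and the hyperparameter values and the names `models`, `rows`, and `comparison` are illustrative assumptions, so its numbers say nothing about the house price results.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data; the target is shifted to be strictly positive so that
# the percentage error is well defined.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, noise=15.0, random_state=0
)
y_demo = y_demo - y_demo.min() + 100.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40
)

# Hyperparameters here are placeholders, not tuned values.
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=0.001, max_iter=100000),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        "model": name,
        "train_r2": model.score(X_tr, y_tr),
        "test_r2": r2_score(y_te, pred),
        "mae": mean_absolute_error(y_te, pred),
        "rmse": np.sqrt(mean_squared_error(y_te, pred)),
        "mape_pct": np.mean(np.abs((y_te - pred) / y_te)) * 100,
    })

# One row per model, sorted so the lowest percentage error comes first.
comparison = pd.DataFrame(rows).set_index("model").sort_values("mape_pct")
print(comparison.round(3))
```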

So far, you have only used the features in the dataset. However, house prices can be affected by many factors like economic activity and the interest rates at the time they are sold. So, try to find some useful factors that are not included in the dataset. Integrate these factors into your model and assess the prediction performance of your model. Discuss the implications of adding these external variables into your model.

In [142]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             1460 non-null   int64  
 1   mssubclass     1460 non-null   int64  
 2   mszoning       1460 non-null   object 
 3   lotfrontage    1201 non-null   float64
 4   lotarea        1460 non-null   int64  
 5   street         1460 non-null   object 
 6   alley          91 non-null     object 
 7   lotshape       1460 non-null   object 
 8   landcontour    1460 non-null   object 
 9   utilities      1460 non-null   object 
 10  lotconfig      1460 non-null   object 
 11  landslope      1460 non-null   object 
 12  neighborhood   1460 non-null   object 
 13  condition1     1460 non-null   object 
 14  condition2     1460 non-null   object 
 15  bldgtype       1460 non-null   object 
 16  housestyle     1460 non-null   object 
 17  overallqual    1460 non-null   int64  
 18  overallc

In [143]:
df1['yrsold'].value_counts()

2009    338
2007    329
2006    314
2008    304
2010    175
Name: yrsold, dtype: int64
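
Since `yrsold` spans 2006 to 2010, one way to bring in an external factor is to join a per-year table onto the houses by sale year. The sketch below shows only the mechanics. The `external` table, its `mortgage_rate` column, and every number in it are hypothetical placeholders, not real statistics, and the small `houses` frame merely stands in for `df1`.

```python
import pandas as pd

# Tiny stand-in for df1 with made-up rows (not taken from the dataset).
houses = pd.DataFrame({
    "yrsold": [2006, 2007, 2008, 2009, 2010, 2009],
    "grlivarea": [1500, 1700, 1200, 2000, 1600, 1400],
})

# HYPOTHETICAL external table: placeholder values, NOT real interest rates.
# Replace with figures from a trusted source before drawing any conclusion.
external = pd.DataFrame({
    "yrsold": [2006, 2007, 2008, 2009, 2010],
    "mortgage_rate": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Left join keeps every house; validate guards against duplicate years
# in the external table silently multiplying rows.
merged = houses.merge(external, on="yrsold", how="left", validate="many_to_one")

# Every sale year should have found a match.
missing = merged["mortgage_rate"].isna().sum()
print(merged)
print("rows without an external value:", missing)
```

A variable that changes only once per year takes just five distinct values across the 1,460 sales, so it carries no more information than a set of year dummies would. Joining on year and month (if a monthly external series is available) would give the model more variation to work with.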