# Version 1: A Simple Regression Model

This dataset has 81 features. For a quick first attempt, I'll select the ones that describe size and fit a simple regression model.

In [1]:
# Standard imports
import pandas as pd
import numpy as np

# Read datasets
train = pd.read_csv('data/train.csv')

train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


I'll select the following 16 features, all of which relate to home size in some way:

<ul>
    <li><code>LotArea</code>: Lot size in square feet</li>
    <li><code>BsmtFinSF1</code>: Type 1 finished basement square feet</li>
    <li><code>BsmtFinSF2</code>: Type 2 finished basement square feet</li>
    <li><code>BsmtUnfSF</code>: Unfinished square feet of basement area</li>
    <li><code>TotalBsmtSF</code>: Total square feet of basement area (<code>BsmtFinSF1</code> + <code>BsmtFinSF2</code> + <code>BsmtUnfSF</code>)</li>
    <li><code>1stFlrSF</code>: First Floor square feet</li>
    <li><code>2ndFlrSF</code>: Second floor square feet</li>
    <li><code>LowQualFinSF</code>: Low quality finished square feet (all floors)</li>
    <li><code>GrLivArea</code>: Above grade (ground) living area square feet (<code>1stFlrSF</code> + <code>2ndFlrSF</code> + <code>LowQualFinSF</code>)</li>
    <li><code>GarageArea</code>: Size of garage in square feet</li>
    <li><code>WoodDeckSF</code>: Wood deck area in square feet</li>
    <li><code>OpenPorchSF</code>: Open porch area in square feet</li>
    <li><code>EnclosedPorch</code>: Enclosed porch area in square feet</li>
    <li><code>3SsnPorch</code>: Three season porch area in square feet</li>
    <li><code>ScreenPorch</code>: Screen porch area in square feet</li>
    <li><code>PoolArea</code>: Pool area in square feet</li>
</ul>
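
Two of the features above are described as sums of others (`TotalBsmtSF` and `GrLivArea`). A minimal sketch of how those identities could be checked is below. The `toy` frame and its values are hypothetical and made up for illustration; they are not rows from `train.csv`.

```python
import pandas as pd

# Hypothetical rows for illustration only; these are not taken from train.csv
toy = pd.DataFrame({
    'BsmtFinSF1': [700, 0, 450],
    'BsmtFinSF2': [0, 0, 120],
    'BsmtUnfSF': [150, 900, 300],
    'TotalBsmtSF': [850, 900, 870],
    '1stFlrSF': [850, 1200, 900],
    '2ndFlrSF': [850, 0, 700],
    'LowQualFinSF': [0, 0, 80],
    'GrLivArea': [1700, 1200, 1680],
})

bsmt_parts = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF']
living_parts = ['1stFlrSF', '2ndFlrSF', 'LowQualFinSF']

# Row-wise sums of the component columns, compared against the stated totals
bsmt_ok = toy[bsmt_parts].sum(axis=1).eq(toy['TotalBsmtSF'])
living_ok = toy[living_parts].sum(axis=1).eq(toy['GrLivArea'])

print(bsmt_ok.all(), living_ok.all())
```

Swapping `toy` for `train` would run the same check on the real data. If the identities hold there, the two totals carry no information beyond their components, which is worth keeping in mind when reading feature importances later.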

In [2]:
# Select square footage features
sq_ft_features = [
    'LotArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
    '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea',
    'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea'
]

# Check for missing values
train[sq_ft_features].isna().sum()

LotArea          0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
GarageArea       0
WoodDeckSF       0
OpenPorchSF      0
EnclosedPorch    0
3SsnPorch        0
ScreenPorch      0
PoolArea         0
dtype: int64

The counts above confirm that none of these features has missing values.

Using these features, I'll fit a quick gradient-boosted tree model and see how it performs.

In [3]:
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Split training set into training and hold-out validation sets
X = train[sq_ft_features]
y = train['SalePrice']
X_train, X_cv, y_train, y_cv = train_test_split(X, y, random_state=1)

# Train model
model = XGBRegressor()
model.fit(X_train, y_train)
y_predict = model.predict(X_cv)

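# Evaluate model (taking np.sqrt of the MSE avoids the `squared=False` argument,
# which is deprecated in recent scikit-learn releases)
print("Root-Mean-Squared-Error: ${:,.2f}".format(np.sqrt(mean_squared_error(y_cv, y_predict))))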

Root-Mean-Squared-Error: $35,443.19
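
To make the metric concrete, here is a small sketch that computes RMSE by hand with NumPy and compares it with the mean absolute error (MAE). The prices and predictions are hypothetical values chosen for illustration, not outputs of the model above.

```python
import numpy as np

# Hypothetical sale prices and predictions, for illustration only
y_true = np.array([200_000.0, 150_000.0, 300_000.0, 120_000.0])
y_pred = np.array([210_000.0, 140_000.0, 240_000.0, 125_000.0])

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))  # root of the mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error, for comparison

print("RMSE: ${:,.2f}".format(rmse))
print("MAE:  ${:,.2f}".format(mae))
```

Because the errors are squared before averaging, the single \$60,000 miss pulls the RMSE well above the MAE. RMSE is therefore best read as a typical error size that penalizes large misses, not as a plain average of the misses.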


In [4]:
35000/163000

0.2147239263803681

The RMSE is roughly \$35,000, which is about 21% of the median home price of \$163,000. A good first effort, but we can do better.

Before moving on to the next notebook, though, I'll quickly recalculate this model's performance using the Kaggle competition evaluation metric, which is slightly different:

> Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
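
The parenthetical in that description can be illustrated with a toy case: the same relative miss on a cheap house and an expensive house gives the same error in log space. The prices below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical prices: one cheap house and one expensive house
actual = np.array([100_000.0, 1_000_000.0])
predicted = actual * 1.10  # both predictions are 10% too high

abs_errors = predicted - actual                  # dollar errors differ by a factor of 10
log_errors = np.log(predicted) - np.log(actual)  # both equal log(1.10)

log_rmse = np.sqrt(np.mean(log_errors ** 2))
print(abs_errors, log_errors, log_rmse)
```

In dollar terms the expensive house dominates the error, while in log terms both houses contribute equally. Note that `np.log` needs strictly positive inputs, so this metric assumes the model never predicts a zero or negative price.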

In [5]:
# Evaluate model using RMSE between log of predicted and actual target values
# (np.sqrt of the MSE avoids the deprecated `squared=False` argument)
np.sqrt(mean_squared_error(np.log(y_cv), np.log(y_predict)))

0.19676565733194548

Based on the current leaderboard, a score like this would put us somewhere around 3,000th place.