# Version 1: A Simple Regression Model

This dataset has 81 features. For a quick first attempt, I'll select the ones that measure square footage and fit a simple regression model.

In [26]:
# Standard imports
import pandas as pd
import numpy as np

# Read datasets
train = pd.read_csv('data/train.csv')

train.head()

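|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |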


In [38]:
# Select square footage features
sq_ft_features = ['LotArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
                  '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea',
                  'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea'
                 ]

# Count missing values per feature (all zeros means nothing to impute)
train[sq_ft_features].isna().sum()

LotArea          0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
GarageArea       0
WoodDeckSF       0
OpenPorchSF      0
EnclosedPorch    0
3SsnPorch        0
ScreenPorch      0
PoolArea         0
dtype: int64
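
The feature list above was typed out by hand. As a rough alternative, size-related columns could be picked out by name pattern. The sketch below uses a toy frame with a few column names borrowed from the dataset and illustrative values; the pattern `SF|Area|Porch` is my own assumption, and on the full dataset it may match columns that the hand-picked list leaves out, so the result should be reviewed before use.

```python
import pandas as pd

# Toy frame: column names borrowed from the dataset, values for illustration only.
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "1stFlrSF": [900, 1200],
    "ScreenPorch": [0, 0],
    "YrSold": [2008, 2007],
    "SalePrice": [208500, 181500],
})

# DataFrame.filter(regex=...) keeps columns whose label matches the pattern
# anywhere in the name (it uses a regex search, not a full match).
size_like = toy.filter(regex="SF|Area|Porch").columns.tolist()
print(size_like)
```

Column order is preserved, so the resulting list can be passed straight to `train[...]` in place of a hand-written one.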

In [44]:
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Split the data into training and hold-out validation sets (default 75/25 split)
X = train[sq_ft_features]
y = train['SalePrice']
X_train, X_cv, y_train, y_cv = train_test_split(X, y, random_state=1)

# Train model
model = XGBRegressor()
model.fit(X_train, y_train)
y_predict = model.predict(X_cv)

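# Evaluate model with RMSE; the `squared` argument is deprecated in recent
# scikit-learn releases, so take the square root of the MSE directly
np.sqrt(mean_squared_error(y_cv, y_predict))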

35443.192303608324
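
The number above is a root-mean-squared error, so it is in the same units as the target (dollars). To make the calculation concrete, here is a minimal sketch with NumPy on made-up prices; the `rmse` helper and the sample values are illustrative and not part of the notebook.

```python
import numpy as np

# Made-up sale prices and predictions, for illustration only.
actual = np.array([200_000.0, 150_000.0, 300_000.0])
predicted = np.array([210_000.0, 140_000.0, 330_000.0])

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between two arrays."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

error = rmse(actual, predicted)
print(error)
```

Because the differences are squared before averaging, the single $30,000 miss contributes more to the result than the two $10,000 misses combined.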

The competition scores submissions on the logs of the predicted and actual sale prices, so I'll re-evaluate the model the same way to get a sense of where it would land on the leaderboard.

> Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
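
To see why taking logs treats cheap and expensive houses equally, consider a small sketch with two made-up houses that are each overpredicted by 10%. The prices are illustrative only.

```python
import numpy as np

# Two made-up houses, each overpredicted by 10%.
actual = np.array([100_000.0, 500_000.0])
predicted = actual * 1.10

# In dollars, the miss on the expensive house is five times larger...
dollar_errors = predicted - actual

# ...but in log space both misses are identical, because
# log(1.1 * a) - log(a) = log(1.1) regardless of the price a.
log_errors = np.log(predicted) - np.log(actual)

rmse_log = np.sqrt(np.mean(log_errors ** 2))
print(dollar_errors, log_errors, rmse_log)
```

In other words, the log-based metric penalizes relative error rather than absolute error, so a 10% miss counts the same at any price.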

In [45]:
# Evaluate model using RMSE between the logs of the actual and predicted target values
np.sqrt(mean_squared_error(np.log(y_cv), np.log(y_predict)))

0.19676565733194548