# Version 02: Quick and Dirty

Curious to see how well we can do with a quick model that uses every feature that has no null values.

In [1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load training set
train = pd.read_csv('data/train.csv')

train.head()

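| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |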


In [2]:
# Count null values per feature
na_cols = train.isna().sum()

# Display features we're going to drop
na_cols[na_cols > 0]

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

Looks like there are 19 features with null values. For the purposes of this quick and dirty model, we're gonna drop 'em all! Curious to see how much better a model we get than last time.

In [3]:
# Select non-null features and separate target set
X = train.dropna(axis=1).drop(columns='Id')
y = X.pop('SalePrice')

print("New dataset consists of {} features".format(len(X.columns)))

New dataset consists of 60 features
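
To make the column-dropping step concrete, here's a minimal sketch on a toy frame. The three-row `toy` DataFrame is made up for illustration (it only borrows a few column names and values from the real data); the pandas calls are the same ones used above.

```python
import pandas as pd

# Toy stand-in for the training set (hypothetical, for illustration only)
toy = pd.DataFrame({
    'Id': [1, 2, 3],
    'LotArea': [8450, 9600, 11250],
    'Alley': [None, 'Grvl', None],   # contains nulls, so it will be dropped
    'SalePrice': [208500, 181500, 223500],
})

# Count nulls per column, as in cell 2
na_counts = toy.isna().sum()

# Drop every column that has at least one null, then drop the Id column
X_toy = toy.dropna(axis=1).drop(columns='Id')

# pop() removes the target from X_toy and returns it as a Series
y_toy = X_toy.pop('SalePrice')

print(list(X_toy.columns))
```

Note that `dropna(axis=1)` removes a column even if only a single row is null, which is why `Electrical` (1 null) gets dropped alongside `PoolQC` (1,453 nulls).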


In [4]:
# One-hot encode the `object` (string) features
cat_features = X.select_dtypes(include='object').columns.tolist()
X = pd.get_dummies(X, columns=cat_features)
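
As a quick illustration of what `pd.get_dummies` does to a single categorical column, here's a small sketch. The `toy` frame is hypothetical; only the `Street` values (`Pave`, `Grvl`) are taken from the real dataset's categories.

```python
import pandas as pd

# Toy frame with one numeric and one categorical feature (illustrative only)
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'Street': ['Pave', 'Pave', 'Grvl'],
})

# Each category becomes its own indicator column named <feature>_<category>;
# the original 'Street' column is removed and numeric columns pass through
encoded = pd.get_dummies(toy, columns=['Street'])

print(list(encoded.columns))
```

Each row ends up with exactly one "hot" indicator per original categorical feature, so the feature count grows with the number of distinct categories.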

In [5]:
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, mean_squared_log_error, make_scorer
from sklearn.model_selection import ShuffleSplit

# Create model
model = XGBRegressor(random_state=1)

# Run and evaluate model
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=1)
# Built-in scorer that returns the negated RMSE (same metric as
# mean_squared_error with squared=False, which newer scikit-learn no longer accepts)
scores = cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=cv)

print('Mean RMSE score: ${:,.2f} (${:,.2f})'.format(abs(scores).mean(), scores.std()))

Mean RMSE score: $33,208.80 ($6,901.83)


Last time our mean RMSE was \$42,194.37, so this is a substantial improvement of about 21%. With a median home price of \$163,000, our typical error works out to roughly +/- 20% of the median price. Big improvement with zero feature engineering!
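
For anyone who wants to check the arithmetic, here's a quick sketch using the three figures quoted above (the variable names are just for illustration):

```python
# Figures quoted in the text above
prev_rmse = 42194.37     # mean RMSE from the previous version
curr_rmse = 33208.80     # mean RMSE from this version
median_price = 163000    # median home price

# Relative reduction in RMSE compared with last time
improvement = (prev_rmse - curr_rmse) / prev_rmse

# RMSE expressed as a fraction of the median home price
relative_error = curr_rmse / median_price

print(f"Relative improvement: {improvement:.1%}")              # 21.3%
print(f"RMSE as share of median price: {relative_error:.1%}")  # 20.4%
```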

Before closing out, just want to re-evaluate our model using the particular metric Kaggle will use, so we can estimate our competitiveness:

In [6]:
# Evaluate model using RMSE between the logs of predicted and actual target values
# (note: `squared=False` is only accepted by older scikit-learn releases)
scorer = make_scorer(mean_squared_log_error, greater_is_better=False, squared=False)
scores = cross_val_score(model, X, y, scoring=scorer, cv=cv)

print('Mean RMSLE score: {:.6F} ({:.6F})'.format(abs(scores).mean(), scores.std()))

Mean RMSLE score: 0.151047 (0.012416)


Huge jump. This puts us somewhere around 2,200th place at the time of writing.
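
For reference, the RMSLE metric used above can be sketched in plain Python. This follows scikit-learn's `mean_squared_log_error` convention of taking `log(1 + x)`; my understanding is that Kaggle describes the metric as RMSE on the plain logs, which is practically identical at house-price scale. The `rmsle` helper and the `predicted` values are hypothetical; the `actual` values are the first three sale prices from the training set.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared error between log(1 + actual) and log(1 + predicted)."""
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

actual = [208500, 181500, 223500]     # first three SalePrice values in train
predicted = [200000, 190000, 223500]  # made-up predictions, for illustration

score = rmsle(actual, predicted)
print(f"RMSLE: {score:.6f}")
```

Because the errors are measured on a log scale, being off by 10% on a cheap house costs about as much as being off by 10% on an expensive one, which plain RMSE doesn't give us.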