This is the initial notebook to explore the data and run the first models. 

In [1]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

from fastai.imports import *
from fastai.tabular.all import *

from numpy import random
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score

import eli5
from eli5.sklearn import PermutationImportance

from scipy.stats import skew, norm
from catboost import CatBoostRegressor

In [2]:
test = pd.read_csv('test.csv')
sample = pd.read_csv('sample_submission.csv')
train = pd.read_csv('train.csv')

Based on initial analysis the train and test datasets have similar characteristics, so it will be easier to combine them for imputation and data analysis work. 

In [3]:
train_test = pd.concat([train, test], ignore_index=True)
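
A minimal sketch of the combine-then-split pattern on toy frames (the values are made up for illustration). Because the test rows carry no SalePrice, the null mask on the target should be enough to recover the original split later.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv; the values are illustrative only.
toy_train = pd.DataFrame({"Id": [1, 2, 3], "LotArea": [8450, 9600, 11250],
                          "SalePrice": [208500, 181500, 223500]})
toy_test = pd.DataFrame({"Id": [4, 5], "LotArea": [11622, 14267]})

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Test rows lack SalePrice, so the column is NaN there after the concat.
is_train = combined["SalePrice"].notnull()
train_part = combined[is_train]
test_part = combined[~is_train].drop(columns="SalePrice")

print(len(combined), len(train_part), len(test_part))
```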

In [4]:
pd.isnull(train_test).sum()[pd.isnull(train_test).sum() > 0]

MSZoning           4
LotFrontage      486
Alley           2721
Utilities          2
Exterior1st        1
Exterior2nd        1
MasVnrType        24
MasVnrArea        23
BsmtQual          81
BsmtCond          82
BsmtExposure      82
BsmtFinType1      79
BsmtFinSF1         1
BsmtFinType2      80
BsmtFinSF2         1
BsmtUnfSF          1
TotalBsmtSF        1
Electrical         1
BsmtFullBath       2
BsmtHalfBath       2
KitchenQual        1
Functional         2
FireplaceQu     1420
GarageType       157
GarageYrBlt      159
GarageFinish     159
GarageCars         1
GarageArea         1
GarageQual       159
GarageCond       159
PoolQC          2909
Fence           2348
MiscFeature     2814
SaleType           1
SalePrice       1459
dtype: int64

It looks like Alley, FireplaceQu, PoolQC, Fence and MiscFeature have a significant number of missing values, so it will be best to drop those columns. A number of columns have fewer than 5 missing values. Some of these are categorical and some are continuous, but in both cases the missing values will be replaced with the most frequent value.
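
The grouping described above can also be derived from the missing fraction per column rather than read off the counts by eye. This is a sketch on a toy frame; the 0.4 cutoff is an assumption for illustration, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the dataset, values are made up.
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "MSZoning": ["RL", np.nan, "RM", "RL"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_frac = df.isnull().mean()

# Assumed cutoff: columns above it are candidates for dropping.
threshold = 0.4
high_nan_cols = missing_frac[missing_frac > threshold].index.tolist()
low_nan_cols = missing_frac[(missing_frac > 0) & (missing_frac <= threshold)].index.tolist()

print(high_nan_cols, low_nan_cols)
```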

In [5]:
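# Drop the columns where a large share of the values is missing.
drop_high_nan=['Alley','FireplaceQu','PoolQC','Fence','MiscFeature']
train_test=train_test.drop(drop_high_nan,axis=1)
# Fill the columns that have only a few missing values with their most frequent value.
small_nan_cols = ['MSZoning', 'Utilities', 'Exterior1st', 'Exterior2nd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 
                  'TotalBsmtSF', 'Electrical', 'BsmtFullBath', 'BsmtHalfBath', 'KitchenQual', 'Functional', 'GarageCars', 
                  'GarageArea','SaleType', 'SaleCondition']
small_impute = SimpleImputer(strategy='most_frequent')
# Note: fit_transform returns a single object-typed array for this mix of text and numeric columns,
# so the numeric columns in the list come back with dtype object.
train_test[small_nan_cols] = pd.DataFrame(small_impute.fit_transform(train_test[small_nan_cols]),columns=small_nan_cols)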
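
Passing text and numeric columns through one SimpleImputer yields a single object-typed array, which appears to be why numeric columns such as BsmtFinSF1 show up in the object-column list further down. One possible alternative, sketched here on a toy frame with made-up values, is to fill with the per-column mode in pandas, which leaves each column's dtype alone.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numeric column; values are made up.
df = pd.DataFrame({
    "MSZoning": ["RL", "RL", np.nan, "RM"],
    "GarageCars": [2.0, 2.0, 1.0, np.nan],
})

# mode() returns the most frequent value per column; iloc[0] takes the
# first row, which also settles ties.
modes = df.mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name.
filled = df.fillna(modes)

print(filled.dtypes)
```

Whether this matters depends on the model: the later factorize step turns object columns into category codes, so a numeric column that has become object loses its magnitudes.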

The following columns each have one value that is far more common than the rest, so it would probably be best to use the mode, or most common value, to fill each NaN: MasVnrType, MasVnrArea, BsmtCond, BsmtExposure, BsmtFinType2, GarageType, GarageFinish, GarageQual, and GarageCond. That represents 9 of the 13 remaining columns.

BsmtQual has two values larger than the rest: Gd and TA. But it only has 2.8% NaNs, so simply using the mode might be good enough.  

GarageYrBlt has 159 NaNs out of 2919 rows, which is only about 5.4%. It has a dispersed set of values, so it might be easiest to give any NaN the same value as YearBuilt.

BsmtFinType1 has only 2.7% NaNs, and its two most common values are GLQ and Unf. It might be easiest to use the mode here.

LotFrontage has 486 NaNs out of 2919 rows, which is a fairly high 16.7%. It has a dispersed range of values, but judging by its characteristics from the describe function above, it seems to have a fairly even distribution with a mean of 10,168 and a median of 9,453. So using the mean to fill in the NaNs seems reasonable. If this column turns out to have a high impact on the prediction of SalePrice, it might be good to revisit this assumption.

In [6]:
mode_cols = ['MasVnrType', 'MasVnrArea', 'BsmtCond', 'BsmtExposure', 'BsmtFinType2', 'GarageType', 'GarageFinish', 
             'GarageQual','GarageCond', 'BsmtQual', 'BsmtFinType1']
mode_impute = SimpleImputer(strategy='most_frequent')
train_test[mode_cols] = pd.DataFrame(mode_impute.fit_transform(train_test[mode_cols]),columns=mode_cols)
train_test['LotFrontage'] = train_test['LotFrontage'].fillna(train_test['LotFrontage'].mean())
train_test['GarageYrBlt'] = train_test['GarageYrBlt'].fillna(train_test['YearBuilt'])
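
The two fills above behave differently: passing a Series to fillna fills row by row from the aligned value, while passing a scalar uses the same value everywhere. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Toy rows; the values are illustrative only.
df = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1950],
    "GarageYrBlt": [2003.0, np.nan, 1962.0],
    "LotFrontage": [65.0, np.nan, 80.0],
})

# Series argument: each missing GarageYrBlt takes YearBuilt from the same row.
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])

# Scalar argument: every missing LotFrontage takes the column mean.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())

print(df)
```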

In [7]:
pd.isnull(train_test).sum()[pd.isnull(train_test).sum() > 0]

SalePrice    1459
dtype: int64

In [8]:
train_test['BsmtQual'].unique()

array(['Gd', 'TA', 'Ex', 'Fa'], dtype=object)

In [9]:
train_test['BsmtFinType1'].unique()

array(['GLQ', 'ALQ', 'Unf', 'Rec', 'BLQ', 'LwQ'], dtype=object)

In [10]:
train_test.BsmtQual = train_test.BsmtQual.replace({"Ex": 110, "Gd": 95, "TA": 85, "Fa": 75, "Po": 60, "NA": 0})
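
A sketch of the same ordinal encoding using Series.map, shown on a toy Series. The difference from replace is that map leaves NaN for any label missing from the dictionary, which makes unexpected categories easy to spot; replace would leave such labels in place as strings.

```python
import pandas as pd

# Toy quality labels; the scale is the one used for BsmtQual above.
quality = pd.Series(["Gd", "TA", "Ex", "Fa", "Gd"])
scale = {"Ex": 110, "Gd": 95, "TA": 85, "Fa": 75, "Po": 60, "NA": 0}

encoded = quality.map(scale)

# Any label not covered by the dictionary shows up as NaN after map().
unmapped = quality[encoded.isnull()].unique().tolist()

print(encoded.tolist(), unmapped)
```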

In [11]:
train_test.BsmtFinType1 = train_test.BsmtFinType1.replace({"GLQ": 6, "ALQ": 5, "BLQ": 4, "Rec": 3, "LwQ": 2, "Unf": 1,
                                                         "NA": 0})

It will be necessary to identify all the columns that have non-numeric object values and then convert them to numeric values. 

In [12]:
obj_cols = list(train_test.select_dtypes(['object']).columns)
obj_cols

['MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'MasVnrArea',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinSF1',
 'BsmtFinType2',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'BsmtFullBath',
 'BsmtHalfBath',
 'KitchenQual',
 'Functional',
 'GarageType',
 'GarageFinish',
 'GarageCars',
 'GarageArea',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'SaleType',
 'SaleCondition']

In [13]:
# Replace each object column with integer codes, assigned in sorted order of its values.
for column in obj_cols:
    train_test[column] = pd.factorize(train_test[column], sort=True)[0]
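
For reference, a small sketch of what pd.factorize returns with sort=True, on made-up labels. Codes follow the sorted order of the unique values, and a missing value gets the code -1.

```python
import numpy as np
import pandas as pd

# Toy labels, including one missing value.
values = pd.Series(["RM", "RL", np.nan, "FV", "RL"])

# codes: one integer per row; uniques: the sorted distinct labels.
codes, uniques = pd.factorize(values, sort=True)

print(codes, list(uniques))
```

The codes impose an alphabetical order that carries no meaning for nominal categories. Tree-based models usually tolerate this, but it is an assumption worth keeping in mind when reading feature effects.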

Next, create some new columns that might compound the effects of the existing columns that have a higher impact on the xgb score.

In [14]:
train_test['QualCondSum'] = train_test['OverallQual'] + train_test['OverallCond']
train_test['RemodTime'] = train_test['YearRemodAdd'] - train_test['YearBuilt']
train_test['BsmtFinTypeSF1'] = train_test['BsmtFinType1'] * train_test['BsmtFinSF1']
train_test['TotalFlrSF'] = train_test['1stFlrSF'] + train_test['2ndFlrSF']
train_test['TotalFinSF'] = train_test['GrLivArea'] + train_test['BsmtFinSF1']
train_test['GarageCarArea'] = train_test['GarageArea'] * train_test['GarageCars']
train_test['TotalSF'] = train_test['1stFlrSF'] + train_test['2ndFlrSF'] + train_test['TotalBsmtSF']

Now eliminate the columns that have negative or zero influence on the xgb score.

In [15]:
train_test.drop(['Utilities','Street','TotRmsAbvGrd','3SsnPorch','YrSold','Exterior2nd','LotConfig',
                              'HouseStyle','EnclosedPorch','WoodDeckSF','Foundation','RoofMatl','Electrical',
                              'GarageType','LotFrontage','SaleType','MoSold','BsmtExposure','1stFlrSF',
                              'BsmtFinSF1','Exterior1st','KitchenQual','TotalFlrSF'], axis=1, inplace=True)
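
The notebook does not show how each column's influence was measured. One common approach, and the one the eli5 import at the top hints at, is permutation importance: shuffle one column in the validation set and see how much the error grows. The sketch below uses synthetic data and a plain least-squares fit as a stand-in model; both are assumptions for illustration, not the setup used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends strongly on column 0, weakly on column 1,
# and not at all on column 2.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fit on the first half, measure on the second half.
X_fit, X_val = X[:250], X[250:]
y_fit, y_val = y[:250], y[250:]

# Stand-in model: ordinary least squares.
coef, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

baseline = rmse(y_val, X_val @ coef)

# Importance of column j = increase in validation RMSE after shuffling it.
importances = []
for j in range(X.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(rmse(y_val, X_perm @ coef) - baseline)

print(np.round(importances, 3))
```

Columns whose importance is near zero or negative are the ones a step like the drop above would remove.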

Next, look at how many features have a skew above 0.5, since high skew can be an issue in regression analysis.

In [16]:
skew_features = train_test.apply(lambda x: skew(x)).sort_values(ascending=False)

In [17]:
high_skew = skew_features[skew_features > 0.5]
skew_index = high_skew.index
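
With its default arguments, as used above, scipy.stats.skew computes the biased sample skewness: the third central moment divided by the second central moment raised to the power 1.5. A NumPy sketch of that formula on made-up arrays (the helper name sample_skew is hypothetical):

```python
import numpy as np

def sample_skew(x):
    """Biased sample skewness: m3 / m2**1.5, using central moments."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5

# A symmetric sample and one with a long right tail; values are made up.
symmetric = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = np.array([1.0, 1.0, 1.0, 2.0, 10.0])

print(sample_skew(symmetric), sample_skew(right_tailed))
```

The symmetric sample has a skew of zero, while the long right tail pushes the second sample well past the 0.5 cutoff.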

In [18]:
skew_index

Index(['MiscVal', 'PoolArea', 'LotArea', 'LowQualFinSF', 'Heating',
       'Condition2', 'LandSlope', 'KitchenAbvGr', 'ScreenPorch',
       'BsmtHalfBath', 'BsmtFinSF2', 'Condition1', 'OpenPorchSF', 'BldgType',
       'RemodTime', 'MasVnrArea', 'RoofStyle', 'MSSubClass', 'GarageCarArea',
       'GrLivArea', 'TotalFinSF', 'BsmtFinTypeSF1', 'TotalSF', '2ndFlrSF',
       'BsmtQual', 'Fireplaces', 'HalfBath', 'BsmtFullBath', 'OverallCond'],
      dtype='object')

In [19]:
print("There are {} numerical features with Skew > 0.5 :".format(high_skew.shape[0]))
skewness = pd.DataFrame({'Skew' :high_skew})
skew_features.head(30)

There are 29 numerical features with Skew > 0.5 :


MiscVal           21.947195
PoolArea          16.898328
LotArea           12.822431
LowQualFinSF      12.088761
Heating           12.078788
Condition2        12.060093
LandSlope          4.975157
KitchenAbvGr       4.302254
ScreenPorch        3.946694
BsmtHalfBath       3.931594
BsmtFinSF2         3.476562
Condition1         2.983114
OpenPorchSF        2.535114
BldgType           2.192261
RemodTime          2.063712
MasVnrArea         1.556310
RoofStyle          1.553307
MSSubClass         1.375457
GarageCarArea      1.296672
GrLivArea          1.269358
TotalFinSF         1.159630
BsmtFinTypeSF1     0.932490
TotalSF            0.922771
2ndFlrSF           0.861675
BsmtQual           0.796302
Fireplaces         0.733495
HalfBath           0.694566
BsmtFullBath       0.624832
OverallCond        0.570312
HeatingQC          0.486656
dtype: float64

Apply a log1p transform to the features with skew above 0.5 to reduce their skew.

In [20]:
for i in skew_index:
    train_test[i] = np.log1p(train_test[i])
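
A short sketch of how np.log1p behaves, on made-up values. It computes log(1 + x), so zero maps to zero and np.expm1 undoes it, but it is only defined for values above -1. Any column that can hold -1 or below, for example a difference of two years or the -1 code that factorize gives to missing values, would need a check before the transform.

```python
import numpy as np

# Non-negative values spanning several orders of magnitude.
x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
logged = np.log1p(x)          # log(1 + x); compresses the long right tail
roundtrip = np.expm1(logged)  # inverse transform

# Outside the domain: -1 gives -inf, anything below -1 gives NaN.
bad = np.array([-1.0, -2.0])
with np.errstate(divide="ignore", invalid="ignore"):
    out = np.log1p(bad)

print(logged, out)
```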

Create a column with the log of SalePrice to match the evaluation process in the contest.

In [21]:
train_test['LogSalePrice'] = train_test['SalePrice'].apply(np.log)
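
A sketch of why training on the log target lines up with a log-based metric, assuming the contest scores the RMSE between the logs of the predicted and actual prices. In log space the error depends on the ratio between prediction and truth, so a 10% miss costs the same on a cheap house as on an expensive one. The prices and predictions below are made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Made-up prices and hypothetical predictions that are each 10% too high.
prices = np.array([100000.0, 200000.0, 400000.0])
preds = prices * 1.10

# Error measured in log space equals log(1.10) for every row.
log_error = rmse(np.log(prices), np.log(preds))

# Predictions made in log space are mapped back to prices with np.exp.
recovered = np.exp(np.log(prices))

print(log_error)
```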

Separate the train_test dataset back into the train and test datasets, identify the independent and dependent columns, and create the validation split.

In [22]:
train = train_test[train_test['SalePrice'].notnull()].copy()
test = train_test[train_test['SalePrice'].isnull()].drop(['SalePrice','LogSalePrice'],axis=1)
X = train.drop(['SalePrice','LogSalePrice'],axis=1)
y = train.LogSalePrice

In [27]:
def rmse(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

In [24]:
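X_train,X_val,y_train,y_val = train_test_split(X,y,test_size = 0.5,random_state=42)
# Train on all rows. X_val remains a subset of them, so any score on X_val is measured on rows the model has seen.
X_train,y_train =X,y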
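
Because the model is fit on all rows, the validation rows are part of the training data, and a score on them will look better than it would on unseen houses. The sketch below illustrates the effect with synthetic data and a 1-nearest-neighbour predictor, a deliberately extreme stand-in for CatBoost that memorizes its training rows; none of it comes from this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data.
n = 200
X = rng.normal(size=(n, 2))
y = X[:, 0] + rng.normal(scale=0.5, size=n)

# 50/50 split by shuffled row index.
idx = rng.permutation(n)
train_idx, val_idx = idx[: n // 2], idx[n // 2:]

def predict_1nn(X_fit, y_fit, X_query):
    # Squared distance from every query row to every fitted row.
    d = ((X_query[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    return y_fit[d.argmin(axis=1)]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Fitted on all rows: each validation row finds itself, so the error is zero.
leaky = rmse(y[val_idx], predict_1nn(X, y, X[val_idx]))

# Fitted on the training half only: the error reflects unseen rows.
honest = rmse(y[val_idx], predict_1nn(X[train_idx], y[train_idx], X[val_idx]))

print(leaky, honest)
```

A gradient-boosted model will not memorize as completely as this, but the direction of the bias is the same, so a score computed this way is best read as a training-fit check rather than a validation score.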

In [25]:
cat = CatBoostRegressor()
cat_model = cat.fit(X_train,y_train,
                     eval_set = (X_val,y_val),
                     plot=True,
                     verbose = 0)

In [28]:
cat_pred = cat_model.predict(X_val)
cat_score = rmse(y_val, cat_pred)
cat_score

0.03240179521645853

In [29]:
submit = test[['Id']]
submit = submit.reset_index(drop=True)

In [30]:
cat.fit(X,y)

Learning rate set to 0.043466
0:	learn: 0.3871859	total: 4.13ms	remaining: 4.13s
1:	learn: 0.3755345	total: 7.37ms	remaining: 3.68s
2:	learn: 0.3645334	total: 10.4ms	remaining: 3.44s
3:	learn: 0.3545474	total: 13.5ms	remaining: 3.35s
4:	learn: 0.3447667	total: 17.1ms	remaining: 3.4s
5:	learn: 0.3356328	total: 19.9ms	remaining: 3.29s
6:	learn: 0.3264438	total: 22.9ms	remaining: 3.25s
7:	learn: 0.3179319	total: 26.4ms	remaining: 3.28s
8:	learn: 0.3098801	total: 29.1ms	remaining: 3.2s
9:	learn: 0.3020381	total: 31.7ms	remaining: 3.13s
10:	learn: 0.2945284	total: 34ms	remaining: 3.06s
11:	learn: 0.2867855	total: 36.3ms	remaining: 2.99s
12:	learn: 0.2801866	total: 39ms	remaining: 2.96s
13:	learn: 0.2737452	total: 41.9ms	remaining: 2.95s
14:	learn: 0.2670141	total: 44.8ms	remaining: 2.94s
15:	learn: 0.2606411	total: 48ms	remaining: 2.95s
16:	learn: 0.2547670	total: 51.1ms	remaining: 2.96s
17:	learn: 0.2495707	total: 54.2ms	remaining: 2.96s
18:	learn: 0.2440967	total: 56.9ms	remaining: 2.94s
...
227:	learn: 0.0886528	total: 583ms	remaining: 1.97s
228:	learn: 0.0885812	total: 586ms	remaining: 1.97s
229:	learn: 0.0884702	total: 588ms	remaining: 1.97s
230:	learn: 0.0882771	total: 591ms	remaining: 1.97s
231:	learn: 0.0881281	total: 594ms	remaining: 1.97s
232:	learn: 0.0879758	total: 596ms	remaining: 1.96s
233:	learn: 0.0879096	total: 599ms	remaining: 1.96s
234:	learn: 0.0878181	total: 602ms	remaining: 1.96s
235:	learn: 0.0876806	total: 605ms	remaining: 1.96s
236:	learn: 0.0874538	total: 608ms	remaining: 1.96s
237:	learn: 0.0873204	total: 611ms	remaining: 1.96s
238:	learn: 0.0871554	total: 614ms	remaining: 1.96s
239:	learn: 0.0870397	total: 617ms	remaining: 1.95s
240:	learn: 0.0869822	total: 621ms	remaining: 1.96s
241:	learn: 0.0869109	total: 624ms	remaining: 1.96s
242:	learn: 0.0867594	total: 628ms	remaining: 1.96s
243:	learn: 0.0866104	total: 631ms	remaining: 1.95s
244:	learn: 0.0865133	total: 633ms	remaining: 1.95s
245:	learn: 0.0864743	total: 636ms	remaining: 1.95s
...
419:	learn: 0.0694738	total: 1.18s	remaining: 1.63s
420:	learn: 0.0693975	total: 1.18s	remaining: 1.62s
421:	learn: 0.0693020	total: 1.18s	remaining: 1.62s
422:	learn: 0.0692111	total: 1.19s	remaining: 1.62s
423:	learn: 0.0691254	total: 1.19s	remaining: 1.61s
424:	learn: 0.0690616	total: 1.19s	remaining: 1.61s
425:	learn: 0.0689222	total: 1.19s	remaining: 1.61s
426:	learn: 0.0688388	total: 1.2s	remaining: 1.61s
427:	learn: 0.0686974	total: 1.2s	remaining: 1.6s
428:	learn: 0.0685781	total: 1.2s	remaining: 1.6s
429:	learn: 0.0685043	total: 1.21s	remaining: 1.6s
430:	learn: 0.0684682	total: 1.21s	remaining: 1.6s
431:	learn: 0.0683727	total: 1.21s	remaining: 1.59s
432:	learn: 0.0682755	total: 1.21s	remaining: 1.59s
433:	learn: 0.0682127	total: 1.22s	remaining: 1.59s
434:	learn: 0.0681956	total: 1.22s	remaining: 1.58s
435:	learn: 0.0681119	total: 1.22s	remaining: 1.58s
436:	learn: 0.0680419	total: 1.22s	remaining: 1.58s
437:	learn: 0.0679214	total: 1.23s	remaining: 1.57s
...
647:	learn: 0.0536829	total: 1.76s	remaining: 957ms
648:	learn: 0.0536499	total: 1.76s	remaining: 955ms
649:	learn: 0.0535743	total: 1.77s	remaining: 952ms
650:	learn: 0.0535279	total: 1.77s	remaining: 949ms
651:	learn: 0.0534726	total: 1.77s	remaining: 946ms
652:	learn: 0.0534669	total: 1.77s	remaining: 944ms
653:	learn: 0.0534400	total: 1.78s	remaining: 941ms
654:	learn: 0.0533221	total: 1.78s	remaining: 938ms
655:	learn: 0.0532593	total: 1.78s	remaining: 935ms
656:	learn: 0.0532162	total: 1.79s	remaining: 933ms
657:	learn: 0.0531723	total: 1.79s	remaining: 930ms
658:	learn: 0.0531061	total: 1.79s	remaining: 927ms
659:	learn: 0.0531021	total: 1.79s	remaining: 925ms
660:	learn: 0.0530520	total: 1.8s	remaining: 922ms
661:	learn: 0.0529684	total: 1.8s	remaining: 919ms
662:	learn: 0.0529321	total: 1.8s	remaining: 917ms
663:	learn: 0.0528719	total: 1.81s	remaining: 914ms
664:	learn: 0.0527908	total: 1.81s	remaining: 912ms
665:	learn: 0.0527481	total: 1.81s	remaining: 909ms
...
879:	learn: 0.0430795	total: 2.34s	remaining: 319ms
880:	learn: 0.0430645	total: 2.35s	remaining: 317ms
881:	learn: 0.0430201	total: 2.35s	remaining: 314ms
882:	learn: 0.0429768	total: 2.35s	remaining: 311ms
883:	learn: 0.0429303	total: 2.35s	remaining: 309ms
884:	learn: 0.0429022	total: 2.35s	remaining: 306ms
885:	learn: 0.0428804	total: 2.36s	remaining: 303ms
886:	learn: 0.0428538	total: 2.36s	remaining: 301ms
887:	learn: 0.0428040	total: 2.36s	remaining: 298ms
888:	learn: 0.0427741	total: 2.37s	remaining: 296ms
889:	learn: 0.0427564	total: 2.37s	remaining: 293ms
890:	learn: 0.0426898	total: 2.37s	remaining: 290ms
891:	learn: 0.0426533	total: 2.37s	remaining: 287ms
892:	learn: 0.0426167	total: 2.38s	remaining: 285ms
893:	learn: 0.0425739	total: 2.38s	remaining: 282ms
894:	learn: 0.0425350	total: 2.38s	remaining: 279ms
895:	learn: 0.0425058	total: 2.38s	remaining: 277ms
896:	learn: 0.0424500	total: 2.38s	remaining: 274ms
897:	learn: 0.0424198	total: 2.39s	remaining: 271ms
...

<catboost.core.CatBoostRegressor at 0x1ad13a83970>

In [31]:
submit_predict = cat.predict(test)
submit_predict = np.exp(submit_predict)

In [32]:
submit['SalePrice'] = submit_predict

In [33]:
submit.to_csv('submit_init_cat_boost.csv', index=False)

Reducing feature skewness actually slightly increased the Kaggle score. 