This is the initial notebook to explore the data and run the first models. 

In [1]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

from fastai.imports import *
from fastai.tabular.all import *

from numpy import random
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score

import eli5
from eli5.sklearn import PermutationImportance

from scipy.stats import skew, norm
from catboost import CatBoostRegressor

In [2]:
test = pd.read_csv('test.csv')
sample = pd.read_csv('sample_submission.csv')
train = pd.read_csv('train.csv')

Based on initial analysis, the train and test datasets have similar characteristics, so it is easier to combine them for the imputation and data analysis work.

In [3]:
train_test = pd.concat([train, test], ignore_index=True)
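
A minimal sketch of the combine-then-split pattern on made-up rows (the values and the names `toy_train` and `toy_test` are illustrative, not taken from the contest files). Test rows have no SalePrice, so they show up as NaN after the concat, which is what makes it possible to separate the two sets again later.

```python
import pandas as pd

# Illustrative rows only; the real data comes from train.csv and test.csv.
toy_train = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600],
                          "SalePrice": [208500, 181500]})
toy_test = pd.DataFrame({"Id": [3], "LotArea": [11250]})

# Stack the two frames; the missing SalePrice of the test row becomes NaN.
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Split back using the presence of the target.
train_part = combined[combined["SalePrice"].notnull()].copy()
test_part = combined[combined["SalePrice"].isnull()].drop(columns="SalePrice")

print(len(train_part), len(test_part))
```

One caveat to keep in mind: any statistic computed on the combined frame (a mean or a mode used for imputation) also draws on the test rows.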

In [4]:
pd.isnull(train_test).sum()[pd.isnull(train_test).sum() > 0]

MSZoning           4
LotFrontage      486
Alley           2721
Utilities          2
Exterior1st        1
Exterior2nd        1
MasVnrType        24
MasVnrArea        23
BsmtQual          81
BsmtCond          82
BsmtExposure      82
BsmtFinType1      79
BsmtFinSF1         1
BsmtFinType2      80
BsmtFinSF2         1
BsmtUnfSF          1
TotalBsmtSF        1
Electrical         1
BsmtFullBath       2
BsmtHalfBath       2
KitchenQual        1
Functional         2
FireplaceQu     1420
GarageType       157
GarageYrBlt      159
GarageFinish     159
GarageCars         1
GarageArea         1
GarageQual       159
GarageCond       159
PoolQC          2909
Fence           2348
MiscFeature     2814
SaleType           1
SalePrice       1459
dtype: int64

It looks like Alley, FireplaceQu, PoolQC, Fence and MiscFeature have large amounts of missing data, so it will be best to drop those columns. A number of other columns have fewer than 5 missing values each. Since some of these are categorical and some are continuous, their missing values will be replaced with the most frequent value.

In [5]:
drop_high_nan=['Alley','FireplaceQu','PoolQC','Fence','MiscFeature']
train_test=train_test.drop(drop_high_nan,axis=1)
small_nan_cols = ['MSZoning', 'Utilities', 'Exterior1st', 'Exterior2nd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 
                  'TotalBsmtSF', 'Electrical', 'BsmtFullBath', 'BsmtHalfBath', 'KitchenQual', 'Functional', 'GarageCars', 
                  'GarageArea','SaleType', 'SaleCondition']
small_impute = SimpleImputer(strategy='most_frequent')
train_test[small_nan_cols] = pd.DataFrame(small_impute.fit_transform(train_test[small_nan_cols]),columns=small_nan_cols)
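
One side effect worth checking: SimpleImputer returns a single array, so when text and numeric columns are imputed together, the numeric ones come back with object dtype (columns such as BsmtFinSF1 and GarageArea appear in the object-column list further down). The sketch below is one possible alternative, not the approach used in this notebook: a hypothetical helper, `fill_with_mode`, fills each column with its own mode and keeps the original dtype. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column and one numeric column (made-up values).
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", None, "RM"],
    "GarageCars": [2.0, np.nan, 2.0, 1.0],
})

def fill_with_mode(df, cols):
    """Hypothetical helper: fill NaNs in each column with that column's mode."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

filled = fill_with_mode(toy, ["MSZoning", "GarageCars"])
print(filled.dtypes)  # GarageCars stays numeric
```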

The following columns each have one value that is far more common than the rest, so it is probably best to use the mode (the most common value) to fill each NaN: MasVnrType, MasVnrArea, BsmtCond, BsmtExposure, BsmtFinType2, GarageType, GarageFinish, GarageQual, and GarageCond. That represents 9 of the 13 remaining columns with missing data.

BsmtQual has two values larger than the rest: Gd and TA. But it only has 2.8% NaNs, so simply using the mode might be good enough.  

GarageYrBlt has 159 NaNs out of 2919 rows, which is only 5.4%. Its values are widely dispersed, so it might be easiest to fill each NaN with the YearBuilt value from the same row.

BsmtFinType1 has only 2.7% NaNs, and its two most common values are GLQ and Unf. It might be easiest to use the mode here.

LotFrontage has 486 NaNs out of 2919 rows, which is a fairly high 16.6%. It has a dispersed range of values, but looking at its characteristics from the describe function above, it seems to have a pretty even distribution with a mean of 10,168 and a median of 9,453. So using the mean to fill in the NaNs seems reasonable. If it turns out this value has a high impact on the predictability of the SalePrice, then it might be good to revisit this assumption.
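
The missing-value percentages quoted in these notes can be rechecked with a few lines. This sketch simply takes the NaN counts from the output of cell 4 and divides by the 2919 rows; the variable names are illustrative.

```python
import pandas as pd

# NaN counts copied from the output of cell 4.
missing = pd.Series({
    "LotFrontage": 486,
    "GarageYrBlt": 159,
    "BsmtQual": 81,
    "BsmtFinType1": 79,
})
n_rows = 2919

# Share of missing values per column, in percent.
pct_missing = (missing / n_rows * 100).round(1)
print(pct_missing)
```

On the real frame the same figures would come from `train_test.isnull().mean() * 100`.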

In [6]:
mode_cols = ['MasVnrType', 'MasVnrArea', 'BsmtCond', 'BsmtExposure', 'BsmtFinType2', 'GarageType', 'GarageFinish', 
             'GarageQual','GarageCond', 'BsmtQual', 'BsmtFinType1']
mode_impute = SimpleImputer(strategy='most_frequent')
train_test[mode_cols] = pd.DataFrame(mode_impute.fit_transform(train_test[mode_cols]),columns=mode_cols)
train_test['LotFrontage'] = train_test['LotFrontage'].fillna(train_test['LotFrontage'].mean())
train_test['GarageYrBlt'] = train_test['GarageYrBlt'].fillna(train_test['YearBuilt'])

In [7]:
pd.isnull(train_test).sum()[pd.isnull(train_test).sum() > 0]

SalePrice    1459
dtype: int64

In [8]:
train_test['BsmtQual'].unique()

array(['Gd', 'TA', 'Ex', 'Fa'], dtype=object)

In [9]:
train_test['BsmtFinType1'].unique()

array(['GLQ', 'ALQ', 'Unf', 'Rec', 'BLQ', 'LwQ'], dtype=object)
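
The labels listed above are ordered categories, so they can be mapped to numbers that keep their order. The sketch below uses the BsmtQual scale from this notebook on a toy Series. It swaps in `Series.map` rather than `replace` as an illustration: `map` turns any label missing from the dictionary into NaN, which makes an incomplete mapping easy to spot, whereas `replace` leaves unknown labels unchanged.

```python
import pandas as pd

# Same scale as the notebook's BsmtQual mapping.
qual_scale = {"Ex": 110, "Gd": 95, "TA": 85, "Fa": 75, "Po": 60, "NA": 0}

labels = pd.Series(["Gd", "TA", "Ex", "Fa"])  # toy values
encoded = labels.map(qual_scale)

# Labels with no entry in the dictionary would show up here.
unmapped = labels[encoded.isna()].unique()
print(encoded.tolist(), list(unmapped))
```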

In [10]:
train_test.BsmtQual = train_test.BsmtQual.replace({"Ex": 110, "Gd": 95, "TA": 85, "Fa": 75, "Po": 60, "NA": 0})

In [11]:
train_test.BsmtFinType1 = train_test.BsmtFinType1.replace({"GLQ": 6, "ALQ": 5, "BLQ": 4, "Rec": 3, "LwQ": 2, "Unf": 1,
                                                         "NA": 0})

It will be necessary to identify all the columns that have non-numeric object values and then convert them to numeric values. 

In [12]:
obj_cols = list(train_test.select_dtypes(['object']).columns)
obj_cols

['MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'MasVnrArea',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinSF1',
 'BsmtFinType2',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'BsmtFullBath',
 'BsmtHalfBath',
 'KitchenQual',
 'Functional',
 'GarageType',
 'GarageFinish',
 'GarageCars',
 'GarageArea',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'SaleType',
 'SaleCondition']

In [13]:
for column in obj_cols:
    train_test[column] = pd.factorize(train_test[column], sort=True)[0]
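
A small illustration of what `pd.factorize` does with `sort=True`, on toy labels: the integer codes follow the sorted order of the unique values, so they carry no quality or size meaning of their own.

```python
import pandas as pd

zoning = pd.Series(["RL", "RM", "FV", "RL"])  # toy labels

# codes index into uniques; with sort=True the uniques are sorted first.
codes, uniques = pd.factorize(zoning, sort=True)

print(list(codes))    # [1, 2, 0, 1]
print(list(uniques))  # ['FV', 'RL', 'RM']
```

Tree-based models can usually work with such arbitrary codes, while linear models tend to read them as an ordering, which is one reason one-hot encoding is a common alternative.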

Next, create some new columns that combine existing columns with a higher impact on the xgb score, which might compound their effects.

In [14]:
train_test['QualCondSum'] = train_test['OverallQual'] + train_test['OverallCond']
train_test['RemodTime'] = train_test['YearRemodAdd'] - train_test['YearBuilt']
train_test['BsmtFinTypeSF1'] = train_test['BsmtFinType1'] * train_test['BsmtFinSF1']
train_test['TotalFlrSF'] = train_test['1stFlrSF'] + train_test['2ndFlrSF']
train_test['TotalFinSF'] = train_test['GrLivArea'] + train_test['BsmtFinSF1']
train_test['GarageCarArea'] = train_test['GarageArea'] * train_test['GarageCars']
train_test['TotalSF'] = train_test['1stFlrSF'] + train_test['2ndFlrSF'] + train_test['TotalBsmtSF']

Next, drop the columns that have negative or zero influence on the xgb score.

In [15]:
train_test.drop(['Utilities','Street','TotRmsAbvGrd','3SsnPorch','YrSold','Exterior2nd','LotConfig',
                              'HouseStyle','EnclosedPorch','WoodDeckSF','Foundation','RoofMatl','Electrical',
                              'GarageType','LotFrontage','SaleType','MoSold','BsmtExposure','1stFlrSF',
                              'BsmtFinSF1','Exterior1st','KitchenQual','TotalFlrSF'], axis=1, inplace=True)
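
The notebook does not show how each column's influence on the score was measured. One common approach is permutation importance (the imports include eli5's PermutationImportance). The sketch below uses scikit-learn's `permutation_importance` on synthetic data as a stand-in, so the model, the data, and the 0.01 cutoff are all assumptions rather than the procedure actually used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: only the first of three features drives the target.
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Shuffle each feature in turn and record how much the score drops.
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=0)

# Features whose mean importance is near zero are candidates for removal.
candidates = [i for i, m in enumerate(result.importances_mean) if m <= 0.01]
print(result.importances_mean, candidates)
```

In practice the importances would be computed on held-out data, since importances measured on the training rows can flatter features the model has overfit.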

Next, check how many features have a skew above 0.5, since high skew can be a problem in regression analysis.

In [16]:
skew_features = train_test.apply(lambda x: skew(x)).sort_values(ascending=False)

In [17]:
high_skew = skew_features[skew_features > 0.5]
skew_index = high_skew.index

In [18]:
skew_index

Index(['MiscVal', 'PoolArea', 'LotArea', 'LowQualFinSF', 'Heating',
       'Condition2', 'LandSlope', 'KitchenAbvGr', 'ScreenPorch',
       'BsmtHalfBath', 'BsmtFinSF2', 'Condition1', 'OpenPorchSF', 'BldgType',
       'RemodTime', 'MasVnrArea', 'RoofStyle', 'MSSubClass', 'GarageCarArea',
       'GrLivArea', 'TotalFinSF', 'BsmtFinTypeSF1', 'TotalSF', '2ndFlrSF',
       'BsmtQual', 'Fireplaces', 'HalfBath', 'BsmtFullBath', 'OverallCond'],
      dtype='object')

In [19]:
print("There are {} numerical features with Skew > 0.5 :".format(high_skew.shape[0]))
skewness = pd.DataFrame({'Skew' :high_skew})
skew_features.head(30)

There are 29 numerical features with Skew > 0.5 :


MiscVal           21.947195
PoolArea          16.898328
LotArea           12.822431
LowQualFinSF      12.088761
Heating           12.078788
Condition2        12.060093
LandSlope          4.975157
KitchenAbvGr       4.302254
ScreenPorch        3.946694
BsmtHalfBath       3.931594
BsmtFinSF2         3.476562
Condition1         2.983114
OpenPorchSF        2.535114
BldgType           2.192261
RemodTime          2.063712
MasVnrArea         1.556310
RoofStyle          1.553307
MSSubClass         1.375457
GarageCarArea      1.296672
GrLivArea          1.269358
TotalFinSF         1.159630
BsmtFinTypeSF1     0.932490
TotalSF            0.922771
2ndFlrSF           0.861675
BsmtQual           0.796302
Fireplaces         0.733495
HalfBath           0.694566
BsmtFullBath       0.624832
OverallCond        0.570312
HeatingQC          0.486656
dtype: float64

Apply a log1p transform to the features with skew above 0.5 to reduce their skew.

In [20]:
for i in skew_index:
    train_test[i] = np.log1p(train_test[i])
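
A short sketch of the effect of `log1p` on a right-skewed sample, using synthetic data rather than the contest columns. It also shows one thing to watch for: `log1p` returns -inf at -1, so a column that can hold negative values (a difference of two years might, which is an assumption about the data) needs handling first. Clipping at zero is just one possible guard.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # synthetic right-skewed sample

skew_before = skew(x)
skew_after = skew(np.log1p(x))
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

values = np.array([-1.0, 0.0, 10.0])  # toy values including a negative

# log1p(-1) evaluates to -inf (NumPy normally emits a RuntimeWarning here).
with np.errstate(divide="ignore"):
    unguarded = np.log1p(values)

# One possible guard: clip negatives to zero before transforming.
guarded = np.log1p(np.clip(values, 0, None))
```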

Create a column with the log of SalePrice to match the evaluation metric used in the contest.

In [21]:
train_test['LogSalePrice'] = np.log(train_test['SalePrice'])
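
To make the link to the metric concrete: training on the log of the price means that an ordinary RMSE on the model's output is an RMSE between log prices, and predictions have to be passed through `np.exp` before submission. The sketch below uses made-up prices and predictions.

```python
import numpy as np

actual = np.array([208500.0, 181500.0, 223500.0])     # illustrative prices
predicted = np.array([200000.0, 190000.0, 220000.0])  # illustrative predictions

# RMSE computed on the log scale.
log_actual = np.log(actual)
log_pred = np.log(predicted)
rmse_log = np.sqrt(np.mean((log_pred - log_actual) ** 2))
print(round(rmse_log, 4))

# Back-transform log-scale predictions to prices.
recovered = np.exp(log_pred)
```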

Next, separate the train_test dataset back into the train and test datasets, identify the independent and dependent columns, and create the validation split.

In [22]:
train = train_test[train_test['SalePrice'].notnull()].copy()
test = train_test[train_test['SalePrice'].isnull()].drop(['SalePrice','LogSalePrice'],axis=1)
X = train.drop(['SalePrice','LogSalePrice'],axis=1)
y = train.LogSalePrice

In [23]:
X,y = shuffle(X,y, random_state=42)
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

In [24]:
def rmse_cv(model,X,y):
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=6))
    return rmse
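
A usage sketch for a helper like `rmse_cv`, run on synthetic data with LinearRegression standing in for the real model. Passing a `KFold` with `shuffle=True` is an optional variation shown for illustration; the notebook instead shuffles the rows beforehand and passes `cv=6`. The `cv` parameter and the `_demo` names are additions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def rmse_cv(model, X, y, cv=6):
    # scikit-learn returns negated MSE so that higher is better; flip the sign.
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-neg_mse)

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

folds = KFold(n_splits=6, shuffle=True, random_state=42)
scores = rmse_cv(LinearRegression(), X_demo, y_demo, cv=folds)
print(scores.mean(), scores.std())
```

Looking at the spread of the per-fold scores as well as the mean gives a rough sense of how stable the estimate is.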

In [None]:
cat_model = CatBoostRegressor()

In [None]:
print(rmse_cv(cat_model,X,y).mean())

In [27]:
submit = test[['Id']]
submit = submit.reset_index(drop=True)

In [28]:
cat_model.fit(X,y)

Learning rate set to 0.043466
0:	learn: 0.3874866	total: 3.71ms	remaining: 3.71s
1:	learn: 0.3760215	total: 7.14ms	remaining: 3.56s
2:	learn: 0.3651214	total: 10.1ms	remaining: 3.34s
3:	learn: 0.3551511	total: 13.2ms	remaining: 3.29s
4:	learn: 0.3454911	total: 16.3ms	remaining: 3.25s
5:	learn: 0.3363021	total: 18.8ms	remaining: 3.12s
6:	learn: 0.3267778	total: 22ms	remaining: 3.12s
7:	learn: 0.3192633	total: 23.5ms	remaining: 2.91s
8:	learn: 0.3108241	total: 26.5ms	remaining: 2.91s
9:	learn: 0.3027483	total: 28.8ms	remaining: 2.85s
10:	learn: 0.2951138	total: 31.2ms	remaining: 2.81s
11:	learn: 0.2884878	total: 33.7ms	remaining: 2.77s
12:	learn: 0.2814578	total: 36ms	remaining: 2.74s
13:	learn: 0.2748698	total: 39.2ms	remaining: 2.76s
14:	learn: 0.2682594	total: 41.9ms	remaining: 2.75s
15:	learn: 0.2622846	total: 44.3ms	remaining: 2.72s
16:	learn: 0.2558103	total: 46.7ms	remaining: 2.7s
17:	learn: 0.2496203	total: 49.2ms	remaining: 2.68s
18:	learn: 0.2442277	total: 52ms	remaining: 2.68s

229:	learn: 0.0889946	total: 590ms	remaining: 1.97s
230:	learn: 0.0888912	total: 593ms	remaining: 1.97s
231:	learn: 0.0888150	total: 595ms	remaining: 1.97s
232:	learn: 0.0886384	total: 598ms	remaining: 1.97s
233:	learn: 0.0885659	total: 600ms	remaining: 1.96s
234:	learn: 0.0883733	total: 603ms	remaining: 1.96s
235:	learn: 0.0882849	total: 606ms	remaining: 1.96s
236:	learn: 0.0882344	total: 608ms	remaining: 1.96s
237:	learn: 0.0880890	total: 611ms	remaining: 1.96s
238:	learn: 0.0878946	total: 614ms	remaining: 1.96s
239:	learn: 0.0877812	total: 617ms	remaining: 1.95s
240:	learn: 0.0876497	total: 621ms	remaining: 1.96s
241:	learn: 0.0875472	total: 624ms	remaining: 1.95s
242:	learn: 0.0873752	total: 627ms	remaining: 1.95s
243:	learn: 0.0872351	total: 630ms	remaining: 1.95s
244:	learn: 0.0871635	total: 634ms	remaining: 1.95s
245:	learn: 0.0870120	total: 636ms	remaining: 1.95s
246:	learn: 0.0868888	total: 639ms	remaining: 1.95s
247:	learn: 0.0867273	total: 641ms	remaining: 1.94s
248:	learn: 

455:	learn: 0.0665091	total: 1.19s	remaining: 1.42s
456:	learn: 0.0664546	total: 1.19s	remaining: 1.42s
457:	learn: 0.0664303	total: 1.19s	remaining: 1.41s
458:	learn: 0.0663516	total: 1.2s	remaining: 1.41s
459:	learn: 0.0662906	total: 1.2s	remaining: 1.41s
460:	learn: 0.0662159	total: 1.2s	remaining: 1.4s
461:	learn: 0.0661161	total: 1.2s	remaining: 1.4s
462:	learn: 0.0660193	total: 1.21s	remaining: 1.4s
463:	learn: 0.0659271	total: 1.21s	remaining: 1.4s
464:	learn: 0.0658411	total: 1.21s	remaining: 1.4s
465:	learn: 0.0657016	total: 1.22s	remaining: 1.39s
466:	learn: 0.0656004	total: 1.22s	remaining: 1.39s
467:	learn: 0.0654975	total: 1.22s	remaining: 1.39s
468:	learn: 0.0654291	total: 1.23s	remaining: 1.39s
469:	learn: 0.0653356	total: 1.23s	remaining: 1.38s
470:	learn: 0.0652374	total: 1.23s	remaining: 1.38s
471:	learn: 0.0651483	total: 1.23s	remaining: 1.38s
472:	learn: 0.0650538	total: 1.24s	remaining: 1.38s
473:	learn: 0.0650026	total: 1.24s	remaining: 1.38s
474:	learn: 0.0649461

680:	learn: 0.0524840	total: 1.77s	remaining: 829ms
681:	learn: 0.0524202	total: 1.77s	remaining: 826ms
682:	learn: 0.0523394	total: 1.77s	remaining: 824ms
683:	learn: 0.0523360	total: 1.78s	remaining: 821ms
684:	learn: 0.0522928	total: 1.78s	remaining: 818ms
685:	learn: 0.0522406	total: 1.78s	remaining: 816ms
686:	learn: 0.0521751	total: 1.78s	remaining: 813ms
687:	learn: 0.0521461	total: 1.79s	remaining: 811ms
688:	learn: 0.0520531	total: 1.79s	remaining: 809ms
689:	learn: 0.0519729	total: 1.79s	remaining: 806ms
690:	learn: 0.0519298	total: 1.8s	remaining: 804ms
691:	learn: 0.0518871	total: 1.8s	remaining: 801ms
692:	learn: 0.0518271	total: 1.8s	remaining: 799ms
693:	learn: 0.0517768	total: 1.8s	remaining: 796ms
694:	learn: 0.0517319	total: 1.81s	remaining: 794ms
695:	learn: 0.0517285	total: 1.81s	remaining: 791ms
696:	learn: 0.0516614	total: 1.81s	remaining: 789ms
697:	learn: 0.0516083	total: 1.82s	remaining: 786ms
698:	learn: 0.0515600	total: 1.82s	remaining: 783ms
699:	learn: 0.05

906:	learn: 0.0427118	total: 2.35s	remaining: 241ms
907:	learn: 0.0426827	total: 2.35s	remaining: 238ms
908:	learn: 0.0426346	total: 2.35s	remaining: 236ms
909:	learn: 0.0426149	total: 2.36s	remaining: 233ms
910:	learn: 0.0426124	total: 2.36s	remaining: 231ms
911:	learn: 0.0425900	total: 2.36s	remaining: 228ms
912:	learn: 0.0425237	total: 2.37s	remaining: 225ms
913:	learn: 0.0424885	total: 2.37s	remaining: 223ms
914:	learn: 0.0424872	total: 2.37s	remaining: 220ms
915:	learn: 0.0424108	total: 2.37s	remaining: 218ms
916:	learn: 0.0423920	total: 2.38s	remaining: 215ms
917:	learn: 0.0423551	total: 2.38s	remaining: 212ms
918:	learn: 0.0423090	total: 2.38s	remaining: 210ms
919:	learn: 0.0423076	total: 2.38s	remaining: 207ms
920:	learn: 0.0422545	total: 2.38s	remaining: 205ms
921:	learn: 0.0421887	total: 2.39s	remaining: 202ms
922:	learn: 0.0421368	total: 2.39s	remaining: 199ms
923:	learn: 0.0420709	total: 2.39s	remaining: 197ms
924:	learn: 0.0420304	total: 2.39s	remaining: 194ms
925:	learn: 

<catboost.core.CatBoostRegressor at 0x284ae0e44f0>

In [29]:
submit_predict = cat_model.predict(test)
submit_predict = np.exp(submit_predict)

In [30]:
submit['SalePrice'] = submit_predict

In [31]:
submit.to_csv('submit_cv_cat_boost.csv', index=False)

Reducing feature skewness actually slightly increased the Kaggle score. 