*Mei-Cheng Shih, 2016*

This kernel is inspired by the post of **JMT5802**. The aim of this kernel is to use XGBoost in place of the random forest (RF) that serves as the core of the Boruta package. Since XGBoost generates better-quality predictions than RF in this case, the output of this kernel is expected to be more representative. The code also includes the data cleaning process I used to build my model.

**First, import packages for data cleaning and read the data**

In [1]:
from scipy.stats.mstats import mode
import pandas as pd
import numpy as np
import time
from sklearn.preprocessing import LabelEncoder

"""
Read Data
"""
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
target = train['SalePrice']
train = train.drop(['SalePrice'],axis=1)
trainlen = train.shape[0]

**Combine the train and test sets for cleaning**

In [2]:
df1 = train.head()
df2 = test.head()
pd.concat([df1, df2], axis=0, ignore_index=True)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal
5,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
6,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
7,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
8,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
9,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal
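The combine-then-split pattern used for cleaning can be sketched on toy data. The frames and values below are made up for illustration (`toy_train`, `toy_test` and `n_train` are hypothetical names); the only idea carried over from the kernel is remembering the training length so the combined frame can be cut apart again.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values)
toy_train = pd.DataFrame({"Id": [1, 2, 3], "LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"Id": [1461, 1462], "LotArea": [11622, 14267]})
n_train = toy_train.shape[0]

# Stack the rows; ignore_index gives one continuous 0..n-1 index
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

# ... shared cleaning would happen here ...

# Cut the combined frame back into its two parts by position
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
print(train_part.shape, test_part.shape)
```

Splitting by position with `iloc` avoids any dependence on what the index labels happen to be after cleaning.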


In [3]:
alldata = pd.concat([train, test], axis=0, join='outer', ignore_index=True)
alldata = alldata.drop(['Id','Utilities'], axis=1)
alldata.dtypes

MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
BsmtQual          object
BsmtCond          object
                  ...   
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object


**Deal with the NA values in the variables. Based on the txt descriptions, some of them are set to 0 and some to the median**

In [6]:
fMedlist=['LotFrontage']
fArealist=['MasVnrArea','TotalBsmtSF','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','BsmtFullBath', 'BsmtHalfBath','Fireplaces','GarageArea','GarageYrBlt','GarageCars']

# NA means the feature is absent, so fill with 0
for i in fArealist:
    alldata.loc[pd.isnull(alldata[i]), i] = 0

# NA means the value is unknown, so fill with the column median
for i in fMedlist:
    alldata.loc[pd.isnull(alldata[i]), i] = np.nanmedian(alldata[i])
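The two fill rules can be seen more easily on a small frame. This is only a sketch with made-up values, using `fillna` as a shorthand for the masked assignment above; the column names are borrowed from the data set, the numbers are not.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "GarageArea": [548.0, np.nan, 608.0, 0.0],
})

# "Absent" features: a missing area means there is none, so use 0
toy["GarageArea"] = toy["GarageArea"].fillna(0)

# Genuinely unknown measurements: fall back to the column median
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

print(toy)
```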

**Transforming Data**
Use integers to encode categorical data.

**Convert all ints to floats for XGBoost**

In [4]:
# Cast every int64 column except MSSubClass (kept as a category code) to float64
int_cols = alldata.columns[(alldata.dtypes=='int64') & (alldata.columns != 'MSSubClass')]
alldata[int_cols] = alldata[int_cols].astype('float64')

In [5]:
alldata['MSSubClass']

0        60
1        20
2        60
3        70
4        60
5        50
6        20
7        60
8        50
9       190
10       20
11       60
12       20
13       20
14       20
15       45
16       20
17       90
18       20
19       20
20       60
21       45
22       20
23      120
24       20
25       20
26       20
27       20
28       20
29       30
       ... 
2889     30
2890     50
2891     30
2892    190
2893     50
2894    120
2895    120
2896     20
2897     90
2898     20
2899     80
2900     20
2901     20
2902     20
2903     20
2904     20
2905     90
2906    160
2907     20
2908     90
2909    180
2910    160
2911     20
2912    160
2913    160
2914    160
2915    160
2916     20
2917     85
2918     60
Name: MSSubClass, dtype: int64

In [10]:
alldata.head(20)

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,LotConfig,LandSlope,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,5,3,65.0,8450.0,1,0,3,3,4,0,...,0.0,0.0,0,0,0,0.0,2.0,2008.0,8,4
1,0,3,80.0,9600.0,1,0,3,3,2,0,...,0.0,0.0,0,0,0,0.0,5.0,2007.0,8,4
2,5,3,68.0,11250.0,1,0,0,3,4,0,...,0.0,0.0,0,0,0,0.0,9.0,2008.0,8,4
3,6,3,60.0,9550.0,1,0,0,3,0,0,...,0.0,0.0,0,0,0,0.0,2.0,2006.0,8,0
4,5,3,84.0,14260.0,1,0,0,3,2,0,...,0.0,0.0,0,0,0,0.0,12.0,2008.0,8,4
5,4,3,85.0,14115.0,1,0,0,3,4,0,...,0.0,0.0,0,3,3,700.0,10.0,2009.0,8,4
6,0,3,75.0,10084.0,1,0,3,3,4,0,...,0.0,0.0,0,0,0,0.0,8.0,2007.0,8,4
7,5,3,68.0,10382.0,1,0,0,3,0,0,...,0.0,0.0,0,0,3,350.0,11.0,2009.0,8,4
8,4,4,51.0,6120.0,1,0,3,3,4,0,...,0.0,0.0,0,0,0,0.0,4.0,2008.0,8,0
9,15,3,50.0,7420.0,1,0,3,3,0,0,...,0.0,0.0,0,0,0,0.0,1.0,2008.0,8,4


In [8]:
le = LabelEncoder()
nacount_category = np.array(alldata.columns[((alldata.dtypes=='int64') | (alldata.dtypes=='object')) & (pd.isnull(alldata).sum()>0)])
category = np.array(alldata.columns[((alldata.dtypes=='int64') | (alldata.dtypes=='object'))])
Bsmtset = set(['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2'])
MasVnrset = set(['MasVnrType'])
Garageset = set(['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond'])
Fireplaceset = set(['FireplaceQu'])
Poolset = set(['PoolQC'])
NAset = set(['Fence','MiscFeature','Alley'])

# Put 0 and null values in the same category
for i in nacount_category:
    if i in Bsmtset:
        alldata.loc[pd.isnull(alldata[i]) & (alldata['TotalBsmtSF']==0), i] = 'Empty'
        alldata.loc[pd.isnull(alldata[i]), i] = alldata[i].value_counts().index[0]
    elif i in MasVnrset:
        alldata.loc[pd.isnull(alldata[i]) & (alldata['MasVnrArea']==0), i] = 'Empty'
        alldata.loc[pd.isnull(alldata[i]), i] = alldata[i].value_counts().index[0]
    elif i in Garageset:
        alldata.loc[pd.isnull(alldata[i]) & (alldata['GarageArea']==0), i] = 'Empty'
        alldata.loc[pd.isnull(alldata[i]), i] = alldata[i].value_counts().index[0]
    elif i in Fireplaceset:
        alldata.loc[pd.isnull(alldata[i]) & (alldata['Fireplaces']==0), i] = 'Empty'
        alldata.loc[pd.isnull(alldata[i]), i] = alldata[i].value_counts().index[0]
    elif i in Poolset:
        alldata.loc[pd.isnull(alldata[i]) & (alldata['PoolArea']==0), i] = 'Empty'
        alldata.loc[pd.isnull(alldata[i]), i] = alldata[i].value_counts().index[0]
    elif i in NAset:
        alldata.loc[pd.isnull(alldata[i]), i] = 'Empty'
    else:
        # Remaining NAs: fill with the most frequent level
        alldata.loc[pd.isnull(alldata[i]), i] = alldata[i].value_counts().index[0]

# Encode every categorical column as integers
for i in category:
    alldata[i] = le.fit_transform(alldata[i])

# Split back by position: the first trainlen rows are the training set
train = alldata.iloc[:trainlen, :]
test = alldata.iloc[trainlen:, :]
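The rule applied to each group above is: a missing category becomes `'Empty'` when the matching area or count is 0, and otherwise it takes the most frequent level. Here is a minimal sketch of that rule on made-up values, followed by an integer encoding. It uses pandas category codes rather than `LabelEncoder` to stay dependency-free; both assign codes in sorted label order.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GarageArea": [548.0, 0.0, 460.0, 300.0],
    "GarageQual": ["TA", np.nan, np.nan, "TA"],
})

missing = toy["GarageQual"].isnull()

# No garage at all: the missing quality gets its own 'Empty' level
toy.loc[missing & (toy["GarageArea"] == 0), "GarageQual"] = "Empty"

# Garage exists but quality is unknown: use the most frequent level
most_common = toy["GarageQual"].value_counts().index[0]
toy["GarageQual"] = toy["GarageQual"].fillna(most_common)

# Integer codes in sorted label order ('Empty' -> 0, 'TA' -> 1)
toy["GarageQual_code"] = toy["GarageQual"].astype("category").cat.codes

print(toy)
```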

In [9]:
alldata.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,LotConfig,LandSlope,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,5,3,65.0,8450.0,1,0,3,3,4,0,...,0.0,0.0,0,0,0,0.0,2.0,2008.0,8,4
1,0,3,80.0,9600.0,1,0,3,3,2,0,...,0.0,0.0,0,0,0,0.0,5.0,2007.0,8,4
2,5,3,68.0,11250.0,1,0,0,3,4,0,...,0.0,0.0,0,0,0,0.0,9.0,2008.0,8,4
3,6,3,60.0,9550.0,1,0,0,3,0,0,...,0.0,0.0,0,0,0,0.0,2.0,2006.0,8,0
4,5,3,84.0,14260.0,1,0,0,3,2,0,...,0.0,0.0,0,0,0,0.0,12.0,2008.0,8,4


**Import required packages for Feature Selection Process**

In [9]:
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error
from sklearn.utils import shuffle

**Drop some outliers. The outliers were detected with the statsmodels package in Python; the details are skipped here**

Learn how to do this!

In [6]:
o=[30, 462, 523, 632, 968, 970, 1298, 1324]

train=train.drop(o,axis=0)
target=target.drop(o,axis=0)

train.index=range(train.shape[0])
target.index=range(train.shape[0])
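The kernel does not show how the outlier rows were found. One common way to flag such rows is to fit an ordinary least squares model and look for large studentized residuals. The sketch below does this with plain NumPy on synthetic data. It is not the author's procedure (which used statsmodels and is not shown), and the cut-off of 3 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus small noise
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + rng.normal(0, 1.0, size=50)
y[7] += 40.0  # plant one obvious outlier at row 7

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage values: the diagonal of the hat matrix X (X'X)^-1 X'
hat_diag = np.einsum("ij,ji->i", X, np.linalg.pinv(X))

# Internally studentized residuals
dof = X.shape[0] - X.shape[1]
sigma = np.sqrt(resid @ resid / dof)
student = resid / (sigma * np.sqrt(1.0 - hat_diag))

# Flag rows whose studentized residual exceeds the (assumed) cut-off
outlier_rows = np.flatnonzero(np.abs(student) > 3.0)
print(outlier_rows)
```

The flagged row positions play the same role as the list `o` above: they are passed to `drop` and the index is rebuilt afterwards.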

**Set XGB model, the parameters were obtained from CV based on a Bayesian Optimization Process**   

In [7]:
est = xgb.XGBRegressor(colsample_bytree=0.4,
                       gamma=0.045,
                       learning_rate=0.07,
                       max_depth=20,
                       min_child_weight=1.5,
                       n_estimators=300,
                       reg_alpha=0.65,
                       reg_lambda=0.45,
                       subsample=0.95)

**Start the test process. The basic idea is to randomly permute the order of the elements in each column, one column at a time, and see the impact of the permutation on the prediction error**

**For the evaluation metric of feature importance, I used ((MSE of original data)-(MSE of permuted data))/(MSE of original data), so the more negative the score, the more the permutation hurt the model**

In [8]:
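from sklearn.model_selection import ShuffleSplit  # ShuffleSplit with the n_splits/.split API

n=200

scores=pd.DataFrame(np.zeros([n, train.shape[1]]))
scores.columns=train.columns

# n random 75/25 train/test splits of the training rows
splitter = ShuffleSplit(n_splits=n, test_size=.25)
for ct, (train_idx, test_idx) in enumerate(splitter.split(train)):
    X_train, X_test = train.iloc[train_idx,:], train.iloc[test_idx,:]
    Y_train, Y_test = target.iloc[train_idx], target.iloc[test_idx]
    est.fit(X_train, Y_train)
    # Baseline MSE on the untouched hold-out rows
    acc = mean_squared_error(Y_test, est.predict(X_test))
    for i in range(train.shape[1]):
        # Permute one column at a time and re-score
        X_t = X_test.copy()
        X_t.iloc[:,i]=shuffle(np.array(X_t.iloc[:, i]))
        shuff_acc =  mean_squared_error(Y_test, est.predict(X_t))
        scores.iloc[ct,i]=((acc-shuff_acc)/acc)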
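To see what the score looks like for a clearly useful feature versus a useless one, here is a small self-contained sketch of the same permutation idea. It uses synthetic data and a least-squares fit as a stand-in model, not XGBoost, so the numbers say nothing about the housing data; only the scoring formula matches the loop above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: column 0 drives the target, column 1 is pure noise
X = rng.normal(size=(400, 2))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Stand-in model: least squares with an intercept (an assumption, not XGBoost)
def add_intercept(a):
    return np.column_stack([np.ones(len(a)), a])

coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

def mse(features):
    pred = add_intercept(features) @ coef
    return np.mean((y_test - pred) ** 2)

base_mse = mse(X_test)

scores = []
for col in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, col] = rng.permutation(X_perm[:, col])
    # Same sign convention as the kernel: (original - permuted) / original
    scores.append((base_mse - mse(X_perm)) / base_mse)

print(scores)
```

The informative column gets a strongly negative score, while the noise column stays close to zero, which is the pattern to look for in the results table.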

Generate the output: the mean, median, min and max of the score fluctuations.

In [10]:
# One row per feature; each column is named after the statistic it actually holds
fin_score = pd.DataFrame({
    'Mean': scores.mean(),
    'Median': scores.median(),
    'Min': scores.min(),
    'Max': scores.max(),
})

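See the importances of the features. The scores are relative changes in MSE with the permuted MSE subtracted, so the more negative the value, the more important the feature; the higher the value (near zero or positive), the less important the feature.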

In [11]:
pd.set_option('display.max_rows', None)
fin_score.sort_values('Mean',axis=0)

Feature,Mean,Median,Min,Max
OverallQual,-1.745853,-1.744054,-3.160713,-0.908557
GrLivArea,-0.718646,-0.7081993,-1.407366,-0.269067
TotalBsmtSF,-0.691805,-0.6770063,-1.258033,-0.232703
GarageCars,-0.229122,-0.2223006,-0.499151,0.068078
2ndFlrSF,-0.22049,-0.2165775,-0.417533,-0.079036
ExterQual,-0.126791,-0.1214447,-0.376415,0.028661
TotRmsAbvGrd,-0.126063,-0.1173449,-0.345731,0.032275
1stFlrSF,-0.117124,-0.1102345,-0.354299,0.048251
BsmtFinSF1,-0.111459,-0.1075463,-0.228526,-0.002136
LotArea,-0.098098,-0.09367906,-0.199207,0.022418


**The result is a little different from what JMT5802 got, but in general the two are similar. For example, OverallQual and GrLivArea are important in both cases, and PoolArea and PoolQC are unimportant in both cases. Also, based on the test conducted in the link below, it is reasonable to say that the differences between the two cases are not pronounced**

The main code was modified from the example in the link below. Special thanks to the author of the blog.

http://blog.datadive.net/selecting-good-features-part-iii-random-forests/

**Updates:**

After several tests, I removed the variables in the list below, and this action did improve my score a little bit.
['Exterior2nd', 'EnclosedPorch', 'RoofMatl', 'PoolQC', 'BsmtHalfBath',
  'RoofStyle', 'PoolArea', 'MoSold', 'Alley', 'Fence', 'LandContour',
  'MasVnrType', '3SsnPorch', 'LandSlope']
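Removing the listed variables is a single `drop` call. The sketch below applies it to a made-up frame (`toy` and `reduced` are hypothetical names); on the real data the same call would be applied to both `train` and `test` before refitting.

```python
import pandas as pd

# Features the kernel reports removing
to_drop = ['Exterior2nd', 'EnclosedPorch', 'RoofMatl', 'PoolQC', 'BsmtHalfBath',
           'RoofStyle', 'PoolArea', 'MoSold', 'Alley', 'Fence', 'LandContour',
           'MasVnrType', '3SsnPorch', 'LandSlope']

# Toy frame with two of those columns and two that are kept (made-up values)
toy = pd.DataFrame({
    "OverallQual": [7, 6],
    "GrLivArea": [1710, 1262],
    "PoolArea": [0, 0],
    "MoSold": [2, 5],
})

# errors='ignore' skips listed names that are not present in the frame
reduced = toy.drop(columns=to_drop, errors="ignore")
print(list(reduced.columns))
```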


In [12]:
est

XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.4,
       gamma=0.045, learning_rate=0.07, max_delta_step=0, max_depth=20,
       min_child_weight=1.5, missing=None, n_estimators=300, nthread=-1,
       objective='reg:linear', reg_alpha=0.65, reg_lambda=0.45,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.95)

In [21]:
test.shape[0]

1459

In [22]:
result = pd.Series(est.predict(test))

In [23]:
result.index

RangeIndex(start=0, stop=1459, step=1)

In [28]:
submission = pd.DataFrame({
        "Id": result.index + 1461,
        "SalePrice": result.values
    })

In [29]:
submission.to_csv('submission-xgboost.csv', index=False)
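The submission above rebuilds `Id` by adding 1461 to the prediction index, which works as long as the test Ids are consecutive. A slightly more defensive variant pairs each prediction with an Id column kept aside from the raw test file. This is a sketch on made-up numbers; the kernel itself does not keep the Ids, and `test_ids` and `predictions` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: raw test Ids and model predictions (made-up numbers)
test_ids = pd.Series([1461, 1462, 1463], name="Id")
predictions = np.array([120000.0, 155000.0, 180000.0])

# Pair each prediction with its own Id instead of rebuilding Ids by offset
toy_submission = pd.DataFrame({"Id": test_ids.values, "SalePrice": predictions})

# Render as CSV text; pass a file path to to_csv to write to disk instead
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```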