In [28]:
import pandas as pd
import numpy as np

from statistics import mode

In [29]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

## Remove outliers

In the EDA notebook, we found several data points that behave as outliers. In this section, we first remove them from the training data.

In [30]:
#drop outliers
#outlier index for "MasVnrArea"
outlier_loc = []
outlier_loc.append(train[train['MasVnrArea'] == train['MasVnrArea'].max()].index.values)

#outlier index for "TotalBsmtSF"
outlier_loc.append(train[train['TotalBsmtSF'] == train['TotalBsmtSF'].max()].index.values)

#outlier index for "1stFlrSF"
outlier_loc.append(train[train['1stFlrSF'] == train['1stFlrSF'].max()].index.values)

#outlier index for "GrLivArea"
outlier_loc.append(train[train['GrLivArea'] > 4500].index.values)

outlier_loc = list(set(list(np.concatenate(outlier_loc))))

train.drop(outlier_loc, inplace=True)
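
To make the pattern above easier to follow, here is a minimal sketch on a small made-up frame. The toy values are invented for illustration and are not taken from `train.csv`; `np.unique` is used here as one possible way to deduplicate the collected labels.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not taken from train.csv.
toy = pd.DataFrame({
    "GrLivArea": [1500, 4700, 1800, 2000],
    "TotalBsmtSF": [800, 6000, 900, 3000],
})

# Collect the row labels flagged by each rule.
flagged = [
    toy[toy["TotalBsmtSF"] == toy["TotalBsmtSF"].max()].index.values,
    toy[toy["GrLivArea"] > 4500].index.values,
]

# A row can be flagged by several rules, so deduplicate before dropping.
outlier_labels = np.unique(np.concatenate(flagged))
cleaned = toy.drop(outlier_labels)
```

Because the same row can be the maximum of several columns, deduplicating first keeps `drop` from receiving repeated labels.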

After removing the outliers, we apply a logarithmic transform (`log1p`) to the target variable and set it aside. We also remove the Ids, then combine the training and test sets so that variables showing multicollinearity can be dropped from both at once.

In [31]:
train_y = np.log1p(train['SalePrice'])
train.drop(['SalePrice'], axis=1, inplace=True)

train_id = train['Id']
train.drop(['Id'], axis=1, inplace=True)

test_id = test['Id']
test.drop(['Id'], axis=1, inplace=True)

df = pd.concat([train, test], axis= 0)
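
Since the target is stored as `log1p(SalePrice)`, any prediction made on that scale would need to be mapped back with the inverse function, `expm1`, before it is read as a price. A minimal sketch of the round trip, using hypothetical prices rather than values from the data set:

```python
import numpy as np

# Hypothetical sale prices, used only to show the round trip.
prices = np.array([100000.0, 250000.0, 755000.0])

log_prices = np.log1p(prices)      # log(1 + x), the scale train_y is stored on
recovered = np.expm1(log_prices)   # exp(x) - 1, the inverse transform

print(np.allclose(recovered, prices))
```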

## Remove variables that showed multicollinearity

In [32]:
#remove "LotFrontage"
df = df.drop('LotFrontage', axis = 1)

#remove "GarageYrBlt", "TotalBsmtSF", "GarageArea"
df = df.drop(["GarageYrBlt", "TotalBsmtSF", "GarageArea"], axis = 1)

#set the remodel year to 0 for houses whose remodel year equals the build year
df.loc[df['YearBuilt'] == df['YearRemodAdd'], 'YearRemodAdd'] = 0
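
Assigning through a single `.loc` call is the form pandas recommends for this kind of conditional update, because chained indexing such as `df.col[mask] = value` may write to a temporary copy instead of the original frame. A minimal sketch on a made-up frame (the values are assumptions for illustration):

```python
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "YearBuilt": [1990, 2000, 2005],
    "YearRemodAdd": [1990, 2010, 2005],
})

# Boolean mask: houses whose remodel year equals the build year.
never_remodeled = toy["YearBuilt"] == toy["YearRemodAdd"]

# A single .loc call selects rows and column together, so the write
# goes to toy itself rather than to a temporary copy.
toy.loc[never_remodeled, "YearRemodAdd"] = 0
```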
  


Next, we will handle the numerical and categorical data individually.

In [33]:
#splitting object dataframe and numeric dataframe
df_num = df.select_dtypes(exclude= ['object'])
df_cat = df.select_dtypes(include=['object']).copy()

As discussed in the EDA, "MSSubClass" was mistakenly recognized as a numerical variable, so here we add it to the categorical frame and remove it from the numerical one.

In [34]:
df_cat['MSSubClass'] = df_num['MSSubClass']
df_num = df_num.drop('MSSubClass', axis = 1)

## Categorical Variables

In [35]:
df_cat.isnull().sum()

MSZoning            4
Street              0
Alley            2719
LotShape            0
LandContour         0
Utilities           2
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
RoofStyle           0
RoofMatl            0
Exterior1st         1
Exterior2nd         1
MasVnrType         24
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           81
BsmtCond           82
BsmtExposure       82
BsmtFinType1       79
BsmtFinType2       80
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
KitchenQual         1
Functional          2
FireplaceQu      1420
GarageType        157
GarageFinish      159
GarageQual        159
GarageCond        159
PavedDrive          0
PoolQC           2907
Fence            2345
MiscFeature      2811
SaleType            1
SaleCondition       0
MSSubClass          0
dtype: int64

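We see more missing data here than before because the training and test sets are now joined. However, we can still apply the threshold we set (70 missing values): columns at or above it are filled with 'None', and the remaining columns are filled with their most frequent value.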
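
As a rough sketch of how such a threshold rule can work: columns with at least `threshold` missing values get the placeholder 'None', and the others get their most frequent value. The function name `impute_categorical`, its `threshold` parameter, and the toy data are hypothetical, and `Series.mode()` is used here as a stand-in for `statistics.mode`.

```python
import numpy as np
import pandas as pd

def impute_categorical(frame, threshold=70):
    """Fill NaN with 'None' when a column has at least `threshold`
    missing values, otherwise with the column's most frequent value."""
    out = frame.copy()
    for col in out.columns:
        n_missing = out[col].isnull().sum()
        if n_missing == 0:
            continue
        if n_missing >= threshold:
            out[col] = out[col].fillna("None")
        else:
            # Series.mode() ignores NaN by default; take the first mode.
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy frame with a small threshold so both branches are exercised.
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan, np.nan],
    "MSZoning": ["RL", "RL", np.nan, "RM"],
})
filled = impute_categorical(toy, threshold=3)
```

Working on a copy leaves the input frame untouched, which makes the rule easy to try with different thresholds.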

In [36]:
cat_missing = pd.DataFrame(df_cat.isnull().sum()).reset_index().sort_values(by = 0, ascending=False).head(34)
cat_missing_index = cat_missing.iloc[:,0]

In [37]:
for col, n_missing in zip(cat_missing_index.values, cat_missing.iloc[:,1].values):
    if n_missing >= 70:
        #many missing values: fill with the placeholder category 'None'
        df_cat[col] = df_cat[col].fillna('None')
    else:
        #few missing values: fill with the most frequent non-null value
        df_cat[col] = df_cat[col].fillna(mode(df_cat[col].dropna()))


We also need to factorize the categorical data so that each category is encoded as an integer.

In [38]:
for i in df_cat:
    df_cat[i] = pd.factorize(df_cat[i])[0]
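
For reference, `pd.factorize` returns a pair: the integer codes and the distinct values they stand for. The short sketch below uses a made-up series to show how codes are assigned; the values are illustrative only.

```python
import numpy as np
import pandas as pd

quality = pd.Series(["Gd", "TA", "Gd", np.nan, "Ex"])

codes, uniques = pd.factorize(quality)
# codes: one integer label per row, numbered by order of first appearance;
#        missing values receive -1.
# uniques: the distinct values, in the same order as the codes.
print(codes)
print(list(uniques))
```

The codes follow order of appearance, so they carry no ordinal meaning. Because missing values were filled in the previous step, no -1 codes should appear in `df_cat`.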

## Numerical Variables

In [39]:
pd.DataFrame(df_num.isnull().sum()).reset_index().sort_values(by = 0, ascending=False).head(20)

            index   0
5      MasVnrArea  23
14   BsmtHalfBath   2
13   BsmtFullBath   2
21     GarageCars   1
6      BsmtFinSF1   1
7      BsmtFinSF2   1
8       BsmtUnfSF   1
0         LotArea   0
24  EnclosedPorch   0
22     WoodDeckSF   0


The EDA did not cover how to handle these missing values. Since the amount of missing data is relatively low here, we simply replace the missing values with 0.

In [40]:
df_num = df_num.fillna(0)

## Split train and test data

In [41]:
df = pd.concat([df_num, df_cat], axis=1)
pos = train.shape[0]

#copy the slices so that adding the "Id" column modifies independent frames
train = df[:pos].copy()
test = df[pos:].copy()

train['Id'] = train_id
test['Id'] = test_id
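
The split relies on row position rather than on index labels, which matters because the combined frame contains repeated labels (both the training and test rows carry their original indices). A minimal sketch with toy stand-ins for the two frames; the names `toy_train`, `toy_test`, `train_part`, and `test_part` and all values are hypothetical:

```python
import pandas as pd

# Toy stand-ins for the cleaned train and test frames.
toy_train = pd.DataFrame({"LotArea": [8450, 9600, 11250]}, index=[0, 2, 3])
toy_test = pd.DataFrame({"LotArea": [11622, 14267]}, index=[0, 1])

combined = pd.concat([toy_train, toy_test], axis=0)
pos = toy_train.shape[0]

# Positional slicing with .iloc is unaffected by repeated index labels;
# .copy() makes each half an independent frame that is safe to modify.
train_part = combined.iloc[:pos].copy()
test_part = combined.iloc[pos:].copy()
train_part["Id"] = [1, 3, 4]
```

Adding a column to the copied halves leaves `combined` unchanged and does not trigger a SettingWithCopyWarning.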
  


In [42]:
train.to_csv("data/train_preProcess.csv", index= False, encoding='utf-8')
test.to_csv('data/test_preProcess.csv', index= False, encoding='utf-8')