# Creating and Cleaning Features: Treat Missing Values in the Data

The primary methods to treat missing values are:
1. Impute the median or mean of the feature
2. Model the feature to predict the missing values
3. Assign a default value (such as "Other" or -999)
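
The three approaches above can be sketched on a small made-up DataFrame. The column names (`area`, `frontage`, `quality`) and the values are hypothetical and are not taken from the housing data. The straight-line fit with `np.polyfit` stands in for "model that feature" only as an illustration; a real model would likely use more predictors.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'area' is complete, 'frontage' and 'quality' have gaps.
toy = pd.DataFrame({
    'area': [1000.0, 2000.0, 3000.0, 4000.0],
    'frontage': [50.0, 70.0, np.nan, 110.0],
    'quality': ['Gd', None, 'TA', None],
})

# 1. Impute with the median of the observed values.
median_filled = toy['frontage'].fillna(toy['frontage'].median())

# 2. Model the feature: fit frontage ~ area on the complete rows,
#    then predict the missing rows (a straight-line fit, for illustration only).
known = toy['frontage'].notnull()
slope, intercept = np.polyfit(toy.loc[known, 'area'], toy.loc[known, 'frontage'], deg=1)
model_filled = toy['frontage'].fillna(slope * toy['area'] + intercept)

# 3. Assign a default label (or a sentinel such as -999 for numeric features).
default_filled = toy['quality'].fillna('Other')
```

Which option fits best depends on why the value is missing; the rest of this notebook mostly uses the third option.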

## Read in Data

In [3]:
import pandas as pd

df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [16]:
# Check where the missing values are
df.isnull().sum().sort_values(ascending=False)[:19]

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtFinType2      38
BsmtExposure      38
BsmtQual          37
BsmtCond          37
BsmtFinType1      37
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64
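
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up four-row frame (the values are hypothetical); the same two calls, `isnull().sum()` and `isnull().mean()`, should work on `df` as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two of three columns.
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Count and fraction of missing values per column.
missing = pd.DataFrame({
    'count': toy.isnull().sum(),
    'share': toy.isnull().mean(),
})

# Keep only columns that actually have gaps, largest first.
missing = missing[missing['count'] > 0].sort_values('count', ascending=False)
print(missing)
```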

__Note__: Some values are missing because the house does not have that feature; for instance, many houses have no pool. In that case, the missing PoolQC value can be replaced with NA, meaning "No Pool".
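
One likely reason these categories show up as missing at all: by default, `pd.read_csv` parses the literal string `NA` as a missing value. The sketch below shows this on a tiny in-memory CSV (the two rows are made up). Passing `keep_default_na=False` keeps the string as written, with the side effect that genuinely empty cells are then read as empty strings rather than NaN.

```python
import io
import pandas as pd

# Hypothetical two-row CSV in which "NA" is meant as a category label.
csv_text = "Id,PoolQC\n1,NA\n2,Gd\n"

# Default parsing: the string "NA" becomes NaN.
default_parse = pd.read_csv(io.StringIO(csv_text))

# With the default NA markers switched off, "NA" stays a plain string.
literal_parse = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_parse['PoolQC'].isnull().sum())
print(literal_parse['PoolQC'].tolist())
```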

In [19]:
# Fill in missing values for PoolQC with NA
df['PoolQC'] = df['PoolQC'].fillna('NA')


In [23]:
# 0 null values for PoolQC now
df['PoolQC'].isnull().sum()

0

In [24]:
# Fill in missing values for MiscFeature with NA for None
df['MiscFeature'] = df['MiscFeature'].fillna('NA')
df['MiscFeature'].isnull().sum()

0

In [26]:
# Fill in missing values for Alley with NA for No alley access
df['Alley'] = df['Alley'].fillna('NA')
df['Alley'].isnull().sum()

0

In [27]:
# Fill in missing values for Fence with NA for No Fence
df['Fence'] = df['Fence'].fillna('NA')
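
The column-by-column fills above all follow the same pattern, so they could also be written as one step over a list of columns. This is a sketch on a made-up three-row frame; the list name `absent_means_none` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: a gap means the house lacks the feature.
toy = pd.DataFrame({
    'PoolQC': [np.nan, 'Gd', np.nan],
    'Alley': ['Grvl', np.nan, np.nan],
    'Fence': [np.nan, np.nan, 'MnPrv'],
    'LotArea': [8450, 9600, 11250],
})

# Columns where a missing value should become the label 'NA'.
absent_means_none = ['PoolQC', 'Alley', 'Fence']
toy[absent_means_none] = toy[absent_means_none].fillna('NA')

# Confirm that none of the listed columns has gaps left.
remaining = toy[absent_means_none].isnull().sum()
print(remaining)
```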

In [51]:
# Count how many houses have 0 fireplaces
no_fireplace_count = (df['Fireplaces'] == 0).sum()
no_fireplace_count
# 690 houses have 0 fireplaces, matching the 690 missing FireplaceQu values, so we can set those to NA for no fireplace

690
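
Matching counts suggest, but do not prove, that the same rows are involved. A stricter check compares the two conditions row by row. The sketch below uses a made-up four-row frame; on the real data the same comparison would be run against `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring the two columns compared above.
toy = pd.DataFrame({
    'Fireplaces': [0, 1, 0, 2],
    'FireplaceQu': [np.nan, 'Gd', np.nan, 'TA'],
})

no_fireplace = toy['Fireplaces'] == 0
quality_missing = toy['FireplaceQu'].isnull()

# True only if every row agrees: no fireplace <=> quality is missing.
same_rows = bool((no_fireplace == quality_missing).all())

# Rows where the two conditions disagree (empty when same_rows is True).
mismatches = toy[no_fireplace != quality_missing]
print(same_rows, len(mismatches))
```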

In [53]:
# Fill in missing values for FireplaceQu with NA
df['FireplaceQu'] = df['FireplaceQu'].fillna('NA')

In [56]:
# Fill in missing values for LotFrontage with 0

df['LotFrontage'] = df['LotFrontage'].fillna(0)
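
Filling with 0 treats a missing frontage as "no frontage". If the value is instead unknown, the median imputation listed at the top of this notebook is an alternative. The sketch below imputes the median within a group and keeps a flag for which rows were filled. The grouping column `Neighborhood`, the flag column `LotFrontage_missing`, and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the grouping column is an assumption.
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'LotFrontage': [60.0, np.nan, 80.0, 40.0, 50.0, np.nan],
})

# Remember which rows were originally missing.
toy['LotFrontage_missing'] = toy['LotFrontage'].isnull()

# Median of the observed values within each group, aligned to every row.
group_median = toy.groupby('Neighborhood')['LotFrontage'].transform('median')

# Fill each gap with the median of its own group.
toy['LotFrontage'] = toy['LotFrontage'].fillna(group_median)
print(toy)
```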

In [57]:

df.isnull().sum().sort_values(ascending=False)

GarageFinish    81
GarageType      81
GarageQual      81
GarageCond      81
GarageYrBlt     81
                ..
Exterior2nd      0
Exterior1st      0
RoofMatl         0
RoofStyle        0
FireplaceQu      0
Length: 82, dtype: int64

In [69]:
# Fill in missing values for the garage, basement, masonry veneer, and electrical features
fill_values = {
    'GarageFinish': 'NA', 'GarageType': 'NA', 'GarageQual': 'NA', 'GarageCond': 'NA',
    'GarageYrBlt': df['GarageYrBlt'].mean(),  # numeric feature: use the mean build year
    'BsmtExposure': 'NA', 'BsmtFinType2': 'NA', 'BsmtQual': 'NA', 'BsmtCond': 'NA', 'BsmtFinType1': 'NA',
    'MasVnrType': 'None', 'MasVnrArea': 0, 'Electrical': 'NA',
}
df.fillna(fill_values, inplace=True)

In [70]:
df.isnull().sum().sort_values(ascending=False)[:15]

Id              0
GarageCars      0
GarageYrBlt     0
GarageType      0
FireplaceQu     0
Fireplaces      0
Functional      0
TotRmsAbvGrd    0
KitchenQual     0
KitchenAbvGr    0
BedroomAbvGr    0
HalfBath        0
FullBath        0
BsmtHalfBath    0
BsmtFullBath    0
dtype: int64

In [71]:
# Saving our cleaned dataframe to csv file

df.to_csv('data/train_clean.csv', index=False)
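
One caveat when the saved file is loaded again: with default settings, `pd.read_csv` turns the string `NA` back into a missing value, so the fills above would appear undone. The sketch below demonstrates this with an in-memory buffer instead of a file. The frame is made up, and `NoFence` is a hypothetical label chosen only because it is not one of the default NA markers.

```python
import io
import pandas as pd

# Hypothetical cleaned frame: one column uses 'NA', the other a custom label.
cleaned = pd.DataFrame({
    'Id': [1, 2, 3],
    'PoolQC': ['NA', 'Gd', 'NA'],
    'Fence': ['NoFence', 'MnPrv', 'NoFence'],
})

# Write to an in-memory CSV and read it back with default settings.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# 'NA' comes back as NaN; the custom label survives the round trip.
nulls_after_reload = reloaded.isnull().sum()
print(nulls_after_reload)
```

Reading the file with `keep_default_na=False` is another way to preserve the `NA` labels, at the cost of empty cells being read as empty strings.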