# House Prices Analysis: A Kaggle Competition

In [220]:
import pandas as pd
from sklearn.preprocessing import Imputer, LabelEncoder

## Data Processing

In [221]:
# load data
train = pd.read_csv('/Users/Tomas/Desktop/Kaggle-House-Prices-Challenge/data/train.csv')
print(train.shape)

(1460, 81)


In [222]:
# columns with missing data
def summarize_missing(df):
    return df.isnull().sum()[df.isnull().any()]

summarize_missing(train)

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

Some features are missing far more data than others: Alley, PoolQC, and MiscFeature are missing for almost the entire dataset. It makes sense to drop such features, since imputing them would mean filling in most of their values. We use a 20% cutoff to decide which columns to impute and which to discard: a column missing more than 20% of its values is discarded.

In [223]:
def percentage_missing(df):
    # Filter on the df argument, not the global train, so the function works for any frame
    return df.isnull().sum().divide(df.shape[0]).multiply(100)[df.isnull().any()]

percentage_missing(train)

LotFrontage     17.739726
Alley           93.767123
MasVnrType       0.547945
MasVnrArea       0.547945
BsmtQual         2.534247
BsmtCond         2.534247
BsmtExposure     2.602740
BsmtFinType1     2.534247
BsmtFinType2     2.602740
Electrical       0.068493
FireplaceQu     47.260274
GarageType       5.547945
GarageYrBlt      5.547945
GarageFinish     5.547945
GarageQual       5.547945
GarageCond       5.547945
PoolQC          99.520548
Fence           80.753425
MiscFeature     96.301370
dtype: float64

Alley, FireplaceQu, PoolQC, Fence, and MiscFeature are each missing more than 20% of their values, so these columns are dropped from the training set. NOTE: Remember to drop the same columns when processing the test data for model building.

In [224]:
train = train.drop(['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis=1)
train.shape


(1460, 76)
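
The hard-coded list in the cell above could also be derived from the 20% cutoff, and the same list reused for the test set as the note suggests. The following is a minimal sketch on made-up toy data; the helper name `columns_above_threshold` and the frames `toy_train` and `toy_test` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def columns_above_threshold(df, threshold=20.0):
    """Return names of columns whose share of missing values exceeds `threshold` percent."""
    pct_missing = df.isnull().mean() * 100
    return pct_missing[pct_missing > threshold].index.tolist()

# Toy stand-ins for the training and test data (values are illustrative only)
toy_train = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115],
    'Alley': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Pave'],   # ~83% missing
    'LotFrontage': [65.0, 80.0, np.nan, 60.0, 84.0, 85.0],       # ~17% missing
})
toy_test = pd.DataFrame({
    'LotArea': [11622, 14267],
    'Alley': [np.nan, np.nan],
    'LotFrontage': [80.0, 81.0],
})

# Decide which columns to drop from the training data only,
# then remove the same columns from both frames.
to_drop = columns_above_threshold(toy_train, threshold=20.0)
toy_train = toy_train.drop(columns=to_drop)
toy_test = toy_test.drop(columns=to_drop)
```

Deriving the list from the training data alone keeps the two frames aligned even when the test set has a different pattern of missing values.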

In [225]:
# Most remaining columns are missing only a small share of their values.
# Drop the rows (not columns) with missing values in those columns; LotFrontage is excluded and imputed later.
train = train.dropna(axis=0, how='any', subset=['MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
                                                'Electrical', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond'])
summarize_missing(train)

LotFrontage    244
dtype: int64

In [226]:
train.shape

(1338, 76)

Of the original 1,460 observations, 1,338 remain (about 92%), so most of the data is retained and we continue.
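
To make the row-dropping step concrete, here is a small sketch on made-up data showing how `dropna` with `subset` removes only rows that are missing a value in the listed columns, and how the retained share can be computed. The frame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'BsmtQual': ['Gd', 'TA', np.nan, 'Gd'],
    'GarageType': ['Attchd', 'Detchd', 'Attchd', 'Attchd'],
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
})

rows_before = len(toy)
# A row is dropped only if it is missing a value in one of the listed columns.
# LotFrontage is left out of `subset`, so its NaN does not remove a row.
toy_clean = toy.dropna(axis=0, how='any', subset=['BsmtQual', 'GarageType'])
rows_after = len(toy_clean)

retained_pct = 100 * rows_after / rows_before
print(f"kept {rows_after} of {rows_before} rows ({retained_pct:.1f}%)")
```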

LotFrontage is the only column still missing values, so we impute it with the column mean.

In [227]:
# Impute missing data
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
train['LotFrontage'] = imputer.fit_transform(train['LotFrontage'].values.reshape(-1,1))
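
Note that `Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22. On newer versions, the same mean imputation can be sketched with `SimpleImputer` from `sklearn.impute`. The example below runs on a made-up `toy` frame, not the competition data, and is an assumed equivalent rather than the code this notebook was run with.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value (illustrative values only)
toy = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0, 70.0]})

# SimpleImputer has no `axis` argument; it always imputes column-wise.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform expects 2-D input and returns a 2-D array, so select with
# double brackets and flatten the result before assigning it back.
toy['LotFrontage'] = imputer.fit_transform(toy[['LotFrontage']]).ravel()
```

The fitted mean is stored in `imputer.statistics_`, so the same fitted imputer could later be applied to other data with `transform`.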

In [228]:
# Check to make sure no more missing data
summarize_missing(train)

Series([], dtype: int64)

In [230]:
# Save data to file for later use
train.to_csv('/Users/Tomas/Desktop/Kaggle-House-Prices-Challenge/data/train_modified.csv')
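
One detail worth keeping in mind when reloading this file: `to_csv` writes the DataFrame index as an unnamed first column by default, and after dropping rows that index is no longer contiguous. The sketch below uses an in-memory buffer and a made-up `toy` frame to show the round trip; passing `index_col=0` when reading is one way to restore the index.

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, as left behind after dropping rows
toy = pd.DataFrame({'Id': [1, 3], 'SalePrice': [208500, 223500]}, index=[0, 2])

buffer = io.StringIO()      # in-memory stand-in for the CSV file
toy.to_csv(buffer)          # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)                   # index comes back as an extra column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # first column is used as the index
```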

## Modeling