# House Price Prediction: Data Pre-Processing

This notebook applies pre-processing techniques to Kaggle's housing dataset. Based on the exploratory data analysis (EDA) performed on the data, the following pre-processing actions are taken:
1) Handle missing values
2) Encode categorical variables with one-hot or ordinal encoding
3) Standardize numerical values
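
The three steps listed above can be combined in a single scikit-learn `ColumnTransformer`. The block below is only a minimal sketch on a toy frame, not the notebook's actual pipeline: the column choices, the median imputation, and the `ExterQual` category order are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy frame standing in for the housing data (values are illustrative only).
toy_df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "ExterQual": ["Gd", "TA", "Ex", "TA"],
})

# Numerical columns: fill gaps (assumed: median), then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, ["LotArea"]),
        # Nominal categories: one indicator column per level.
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["Street"]),
        # Ordered categories: integer codes following an assumed worst-to-best order.
        ("ordinal",
         OrdinalEncoder(categories=[["Po", "Fa", "TA", "Gd", "Ex"]]),
         ["ExterQual"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocessor.fit_transform(toy_df)
print(features.shape)
```

The output columns follow the order of the transformer list: the scaled numeric column first, then the one-hot columns, then the ordinal code.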

In [16]:
# imports

import pandas as pd
import numpy as np

from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

from utils import sep_columns_from_desc
# import utils as ut


## Load Data

In [4]:
# Load the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print("Dimensions of train: {}".format(train_df.shape))
print("Dimensions of test: {}".format(test_df.shape))

Dimensions of train: (1460, 81)
Dimensions of test: (1459, 80)


In [17]:
cat_cols, num_cols = sep_columns_from_desc(filename='data_description.txt',
                                           data_cols=train_df.columns)

print(f"Total Columns:{len(cat_cols)+len(num_cols)}")

Total Columns:79


The two columns missing from the list are `Id` and `SalePrice`. `Id` carries no predictive information, and `SalePrice` is the target value.
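
One way to confirm which columns fall outside the two lists, and to split features from the target, is sketched below. The toy frame and the hard-coded `cat_cols` / `num_cols` are hypothetical stand-ins for `train_df` and the output of `sep_columns_from_desc`, whose implementation is not shown in this notebook.

```python
import pandas as pd

# Toy frame standing in for train_df (values are illustrative only).
toy_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})

# Hypothetical stand-ins for the lists returned by sep_columns_from_desc.
cat_cols = ["MSZoning"]
num_cols = ["LotArea"]

# Columns that are in the data but in neither feature list.
leftover = [c for c in toy_df.columns if c not in cat_cols + num_cols]
print(leftover)

# Keep only the described features; hold the target separately.
X = toy_df[cat_cols + num_cols]
y = toy_df["SalePrice"]
```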

## Missing values

The EDA results show that 18 columns have missing values. Of these:
- In 14 of the 18 columns, a missing value has a meaning, such as the absence of the feature from the house.
- For the remaining 4 columns (`GarageYrBlt`, `Electrical`, `MasVnrArea` and `LotFrontage`), we need a strategy to handle missing values.
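
To make the two cases concrete, here is a minimal sketch on toy data. The fill choices (an explicit `"NA"` category, the median, zero, and the most frequent value) are assumptions for illustration, not the strategy this notebook settles on, and `GarageYrBlt` is left out because its treatment depends on how garage-less houses are modelled.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the housing data (illustrative values).
toy_df = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
})

filled = toy_df.copy()

# Case 1: the gap has a meaning (no pool), so make it an explicit category.
filled["PoolQC"] = filled["PoolQC"].fillna("NA")

# Case 2: the gap is a true unknown; the fills below are assumed choices.
filled["LotFrontage"] = filled["LotFrontage"].fillna(filled["LotFrontage"].median())
filled["MasVnrArea"] = filled["MasVnrArea"].fillna(0.0)
filled["Electrical"] = filled["Electrical"].fillna(filled["Electrical"].mode()[0])

print(filled.isna().sum().sum())
```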


In [42]:
# Identify columns with missing values and compute the percentage missing
miss_col_df = pd.DataFrame(train_df.isna().sum()).reset_index()\
    .rename(columns={0: 'missing_values'})\
    .sort_values(by='missing_values', ascending=False)\
    .query('missing_values > 0')\
    .pipe(lambda x: x.assign(percentage_missing = x.missing_values / train_df.shape[0] * 100))\
    .reset_index()

# miss_col_df


| | GarageArea | GarageYrBlt |
|---|---|---|
| 39 | 0 | |
| 48 | 0 | |
| 78 | 0 | |
| 88 | 0 | |
| 89 | 0 | |
| ... | ... | ... |
| 1349 | 0 | |
| 1407 | 0 | |
| 1449 | 0 | |
| 1450 | 0 | |
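
The output above lists rows where `GarageArea` is 0 and `GarageYrBlt` has no value. The cell that produced it is not shown, so the following is only a sketch, on toy data, of a query that yields a table of this shape and checks that every row lacking a garage year also has zero garage area.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are illustrative only).
toy_df = pd.DataFrame({
    "GarageArea": [548, 0, 460, 0],
    "GarageYrBlt": [2003.0, np.nan, 1976.0, np.nan],
})

# Rows with no garage year, showing the garage area alongside.
no_garage = toy_df.loc[toy_df["GarageYrBlt"].isna(), ["GarageArea", "GarageYrBlt"]]
print(no_garage)

# True when every row with a missing year also has zero garage area.
all_zero_area = bool((no_garage["GarageArea"] == 0).all())
print(all_zero_area)
```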


In [47]:
miss_col_df['index'].values

array(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType',
       'FireplaceQu', 'LotFrontage', 'GarageYrBlt', 'GarageCond',
       'GarageType', 'GarageFinish', 'GarageQual', 'BsmtFinType2',
       'BsmtExposure', 'BsmtQual', 'BsmtCond', 'BsmtFinType1',
       'MasVnrArea', 'Electrical'], dtype=object)