# House Price Prediction: Data Pre-Processing

This notebook applies pre-processing techniques to Kaggle's housing dataset. Based on the exploratory data analysis (EDA) performed on the data, the following pre-processing actions are taken:
1) Handle missing values
2) Encode categorical variables with OneHot or Ordinal encoding
3) Standardize numerical values

In [34]:
# imports

from typing import List, Tuple

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from utils import sep_columns_from_desc, missing_values_by_col
# import utils as ut


## Load Data

In [2]:
# Load the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print("Dimensions of train: {}".format(train_df.shape))
print("Dimensions of test: {}".format(test_df.shape))

Dimensions of train: (1460, 81)
Dimensions of test: (1459, 80)


In [3]:
cat_cols, num_cols= sep_columns_from_desc(filename='data_description.txt',
                                          data_cols=train_df.columns)

print(f"Total Columns:{len(cat_cols)+len(num_cols)}")

Total Columns:79


The two columns missing from the list are `Id` and `SalePrice`. `Id` does not carry any information and `SalePrice` is the target value.

## Missing values

EDA results show that 18 columns have missing values:
- 14 of the 18 columns are categorical variables where NaN has a meaning, i.e. the absence of the category
- The remaining 4 columns (`GarageYrBlt`, `Electrical`, `MasVnrArea` and `LotFrontage`) are handled individually:
    - `GarageYrBlt` - The column can be dropped, as it does not add value to the prediction of `SalePrice`
    - `Electrical` - There is 1 missing value in this column, so we can drop that row
    - `MasVnrArea` - Missing values are replaced by 0
    - `LotFrontage` - Missing values are replaced by the column's mean/median
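
The `missing_values_by_col` helper imported from `utils` is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, assuming it returns one row per column with missing values and stores the column name in a column called `index` (which is how the notebook accesses it). The function name `missing_values_by_col_sketch` and the `missing_count`/`missing_pct` column names are hypothetical.

```python
import pandas as pd


def missing_values_by_col_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.missing_values_by_col (assumed behaviour)."""
    # Count NaNs per column and keep only the columns that have any
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)

    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": 100 * counts / len(df),
    })
    # reset_index() moves the column names into a column called 'index'
    return summary.reset_index()


# Toy data, not the Kaggle dataset
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0, None],
    "Alley": [None, None, None, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})
toy_summary = missing_values_by_col_sketch(toy)
print(toy_summary)
```

Columns without missing values (`LotArea` in the toy data) are left out of the summary.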


In [12]:
# Identify missing value columns and their percentage
miss_col_df = missing_values_by_col(train_df)

# columns with missing values that are handled individually
# (note: Electrical is categorical, but its NaN does not mean "absent")
num_miss_col = ['GarageYrBlt', 'Electrical', 'MasVnrArea', 'LotFrontage']

# categorical columns where NaN means the category is absent
cat_miss_col = [c for c in miss_col_df['index'] if c not in num_miss_col]

# cat_miss_col, num_miss_col

Strategy for handling missing values in the training data:

In [29]:
# Handle missing values of numerical columns
proc_df = train_df.copy()
# Drop GarageYrBlt column (operate on the copy, not on train_df)
proc_df = proc_df.drop(columns=['GarageYrBlt'])
# Fill missing values in MasVnrArea with 0
proc_df['MasVnrArea'] = proc_df['MasVnrArea'].fillna(0)
# Fill missing values in LotFrontage with median
proc_df['LotFrontage'] = proc_df['LotFrontage'].fillna(proc_df['LotFrontage'].median())
# Drop row with missing values in Electrical
proc_df = proc_df.dropna(subset=['Electrical'])

# Handle missing values of categorical columns
# Fill missing values in cat_miss_col with 'None'
proc_df[cat_miss_col] = proc_df[cat_miss_col].fillna('None')

print("Number of Missing values in processed data: {}".format(proc_df.isnull().sum().sum()))

Number of Missing values in processed data: 0


Missing values could appear in any column of the test data, so we need to devise a general strategy for handling them:

1) Use `SimpleImputer` to impute missing Numerical values
2) Fill "None" for missing categorical values.
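
A minimal sketch of this general strategy on toy data (not the Kaggle files): the imputers are fitted on the training frame only and then applied to the test frame, so the fill value for the numerical column comes from the training data. The frames `train_toy` and `test_toy` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for the train and test data
train_toy = pd.DataFrame({
    "LotFrontage": [60.0, 80.0, np.nan, 70.0],
    "Alley": ["Grvl", np.nan, "Pave", np.nan],
})
test_toy = pd.DataFrame({
    "LotFrontage": [np.nan, 90.0],
    "Alley": [np.nan, "Grvl"],
})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="constant", fill_value="None")

# Learn the fill values from the training data only
num_imputer.fit(train_toy[["LotFrontage"]])
cat_imputer.fit(train_toy[["Alley"]])

# Apply the fitted imputers to the test data
test_filled = test_toy.copy()
test_filled["LotFrontage"] = num_imputer.transform(test_toy[["LotFrontage"]]).ravel()
test_filled["Alley"] = cat_imputer.transform(test_toy[["Alley"]]).ravel()

print(test_filled)
```

The missing `LotFrontage` in the test frame is filled with the training median, and the missing `Alley` with the string "None".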

In [None]:
# def handle_missing_values(df:pd.DataFrame, cat_cols:List, num_cols:List )->pd.DataFrame:
#     """
#     Handle missing values in the dataframe

#     Args:
#     df: input dataframe
#     cat_cols: list of categorical columns
#     num_cols: list of numerical columns

#     Returns:
#     processed dataframe
#     """
#     # Make a copy of the dataframe
#     _df = df.copy()

#     # Identify missing value columns and their percentage
#     miss_col_df = missing_values_by_col(df)

#     # Iterate through rows of the miss_col_df
#     for i, row in miss_col_df.iterrows():
#         if row['index'] in num_cols:
#             # Fill missing values in numerical columns with median
#             _df[row['index']] = _df[row['index']].fillna(_df[row['index']].median())
#         else:
#             # Fill missing values in categorical columns with 'None'
#             _df[row['index']] = _df[row['index']].fillna('None')

#     return _df

## Handling Categorical variables

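Use `OrdinalEncoder` to encode categorical values, as these have some sense of ordering. Note that by default `OrdinalEncoder` assigns codes in sorted (alphabetical) order of the category labels, so an intended ordering has to be passed explicitly through its `categories` parameter.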
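
As an illustration, the sketch below compares the default encoding with an explicitly ordered one on a toy quality column. The ordering in `quality_levels` (with "None" as the lowest level) is an assumption made for this example, and `handle_unknown="use_encoded_value"` is one possible way to cope with labels that appear only in the test data; neither is taken from the notebook's pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering from worst to best; "None" marks an absent category
quality_levels = ["None", "Po", "Fa", "TA", "Gd", "Ex"]

train_toy = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "Fa"]})
test_toy = pd.DataFrame({"ExterQual": ["Gd", "Po", "Unknown"]})

# Default: categories are sorted alphabetically (Ex, Fa, Gd, TA)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(train_toy).ravel()

# Explicit ordering, with unseen labels mapped to -1 instead of raising
ordered_encoder = OrdinalEncoder(
    categories=[quality_levels],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ordered_codes = ordered_encoder.fit_transform(train_toy).ravel()
test_codes = ordered_encoder.transform(test_toy).ravel()

print("default:", default_codes)
print("ordered:", ordered_codes)
print("test   :", test_codes)
```

With the default settings "Ex" receives the lowest code, which does not reflect the quality ranking; the explicit list restores it.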

## Preprocessing pipeline

Build the preprocessing pipeline with sklearn's `Pipeline` class.

Define the pre-processing steps for numerical and categorical columns separately, then combine them with `ColumnTransformer`.

In [54]:
# preprocessing for categorical and numerical data
def preprocess_data(cat_cols:List, num_cols:List)->ColumnTransformer:
    """
    Build an unfitted preprocessor for categorical and numerical columns

    Args:
    cat_cols: list of categorical columns
    num_cols: list of numerical columns

    Returns:
    ColumnTransformer object
    """

    # categorical columns: fill NaN with 'None', then encode as integers
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
        ('ordinal', OrdinalEncoder())
    ])

    # numerical columns: fill NaN with the median, then standardize
    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    # combine preprocessing steps
    preprocessor = ColumnTransformer(transformers=[
            ('num', numerical_transformer, num_cols),
            ('cat', categorical_transformer, cat_cols)
        ])
    
    return preprocessor

In [55]:
preproc = preprocess_data(cat_cols, num_cols)

In [58]:
preproc.fit(train_df)
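
Once fitted, the same transformer can be applied to unseen data with `transform`. The sketch below mirrors the two sub-pipelines on a small toy frame (the columns and values are made up, and `toy_preproc` is a hypothetical name), fitting on the toy training rows and then transforming the toy test rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frames with one numerical and one categorical column
toy_train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "Street": ["Pave", "Grvl", "Pave", np.nan],
})
toy_test = pd.DataFrame({
    "LotArea": [np.nan, 10000.0],
    "Street": ["Grvl", np.nan],
})

toy_preproc = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["LotArea"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="None")),
        ("ordinal", OrdinalEncoder()),
    ]), ["Street"]),
])

# Fit on the training rows only, then reuse the fitted transformer
train_arr = toy_preproc.fit_transform(toy_train)
test_arr = toy_preproc.transform(toy_test)

print(train_arr)
print(test_arr)
```

The output is a NumPy array with the numerical columns first and the categorical columns second, in the order the transformers are listed. With the default `OrdinalEncoder` settings, a label that appears only in the test data would raise an error at `transform` time.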