### Tell me about your *house*, I will tell you its *worth*.
![](https://www.key-universal.com/wp-content/uploads/2020/07/real-estate-agents-and-buyer-difference.png)

# Introduction

Predicting the sale price of a house from its features is probably one of the best-known and most common examples in ML study. Together with the Titanic: Machine Learning from Disaster dataset, the house prices dataset is perhaps the most explored dataset here on Kaggle. They are often the first datasets beginners use to start their DS and ML journey. So, like many beginning ML students, I will take the chance to practice my data science skills using the Ames, Iowa housing dataset.

I will break down my work into three parts:

- Part-1: **Exploratory data analysis (EDA)**: where we explore and study the data at hand
- Part-2: **Pre-processing and Feature Engineering**: where we deal with missing values, drop trivial features (if any) and engineer additional, potentially useful features
- Part-3: **Modeling and Prediction**: where the magic happens: we train an ML model and predict house sale prices like a real estate agent would normally do :)


# Part 1: Exploratory Data Analysis (EDA)

## Setup

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import warnings
warnings.filterwarnings('ignore')

## 1.1 Load the data

In [None]:
train = pd.read_csv(r'/kaggle/input/house-prices-advanced-regression-techniques/train.csv', index_col='Id')
test = pd.read_csv(r'/kaggle/input/house-prices-advanced-regression-techniques/test.csv', index_col='Id')
submission = pd.read_csv(r'/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv', index_col='Id')

## 1.2 Explore the data (EDA)

In [None]:
display('Train data shape: {}'.format(train.shape))
display('Test data shape: {}'.format(test.shape))

In [None]:
df_train = train.copy()
df_test = test.copy()

y = df_train.SalePrice              
df_train.drop(['SalePrice'], axis=1, inplace=True)
df_train.head()

In [None]:
df_train.info()

In [None]:
df_train.describe().style.format("{:.2f}")

In [None]:
## source for this snippet is @https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html

def highlight_max(data, color='red'):
    '''
    highlight the maximum in a Series or DataFrame
    '''
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else:  # from .apply(axis=None)
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''),
                            index=data.index, columns=data.columns)

In [None]:
# zoom in on a few interesting features only
# train data
interesting_features =['LotArea', 'OverallQual','TotRmsAbvGrd', 'GarageCars', 'YearRemodAdd', 'YearBuilt','GarageYrBlt', 'SalePrice']
DF = train.describe(include=np.number)[interesting_features].style.format("{:.1f}")
DF.format({"YearRemodAdd": "{:.0f}","YearBuilt": "{:.0f}","GarageYrBlt": "{:.0f}"})

In [None]:
# test data
interesting_features =['LotArea', 'OverallQual','TotRmsAbvGrd', 'GarageCars', 'YearRemodAdd', 'YearBuilt','GarageYrBlt']
DF = test.describe(include=np.number)[interesting_features].style.format("{:.1f}").apply(highlight_max, subset=['GarageYrBlt'], color='red')
DF.format({"YearRemodAdd": "{:.0f}","YearBuilt": "{:.0f}", "GarageYrBlt": "{:.0f}"})

In [None]:
test[test['GarageYrBlt'] > 2010]['GarageYrBlt']

Observation:
- The average lot area is 10516.8 square feet
- The average number of rooms above grade (excluding the basement) is 6.5 (~7 rooms)
- On average, houses were remodeled in the mid 80's (1985 avg.), the latest being 2010 (the year of data collection). The oldest house was built in 1872
- The mean sale price of a house is 180921 USD
- A typical garage holds 2 cars
<div class="alert alert-block alert-danger">  
>>> Test data has an error in data entry! The highlighted year (in the table above) is probably 2007, surely not 2207! 
</div>
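A check like the one above can be generalized into a small sanity filter. The sketch below is only an illustration on a made-up toy frame (the values and the `LATEST_YEAR` constant are assumptions; 2010 is taken from the data collection year mentioned above). It flags any year later than the collection year without trying to guess the correct value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the test set; these values are made up for illustration.
toy = pd.DataFrame(
    {"GarageYrBlt": [2005.0, 2207.0, np.nan, 1998.0]},
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

LATEST_YEAR = 2010  # assumption: no record should be dated after data collection

# Comparisons against NaN evaluate to False, so missing years are not flagged.
suspicious = toy[toy["GarageYrBlt"] > LATEST_YEAR]
print(suspicious)
```

How to repair a flagged value (replace, set to missing, or drop) is a separate decision for the pre-processing phase.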


## 1.3 Explore NaN values

- 19 out of 79 features in the train data have at least one missing value. In the test data, even more features (33) have at least one missing value
- Four columns (**PoolQC**, **MiscFeature**, **Alley** and **Fence**) have more than 80% missing values
- **FireplaceQu** has 47.26% missing values
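As a compact illustration of how such counts and percentages can be computed, here is a minimal vectorized sketch on a made-up toy frame (the data and the `summary` name are assumptions, not the competition data). It relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, only to show the mechanics.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],
    "FireplaceQu": ["TA", np.nan, np.nan, "Gd", "TA"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

na_count = toy.isna().sum()                  # missing values per column
na_pct = (100 * toy.isna().mean()).round(2)  # mean of a boolean mask = fraction missing

summary = pd.DataFrame({"NA (count)": na_count, "NA (%)": na_pct})
summary = summary[summary["NA (count)"] > 0].sort_values("NA (count)", ascending=False)
print(summary)
```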

In [None]:
display('Null values in each column: train data')
df_train.isnull().sum().sort_values(ascending=False).head()

In [None]:
display('Null values in each column: test data')
df_test.isnull().sum().sort_values(ascending=False).head()

In [None]:
# train_data
null_values_train = []
for col in df_train.columns:
    if df_train[col].isna().sum() != 0:
        pct_na = np.round((100 * (df_train[col].isna().sum())/len(df_train)), 2)            
        dict2 ={
            'Features' : col,
            'NA_train (count)': df_train[col].isna().sum(),
            'NA_train (%)': pct_na
        }
        null_values_train.append(dict2)
DF1 = pd.DataFrame(null_values_train, index=None).sort_values(by='NA_train (count)',ascending=False)


# test_data
null_values_test = []
for col in df_test.columns:
    if df_test[col].isna().sum() != 0:
        pct_na = np.round((100 * (df_test[col].isna().sum())/len(df_test)), 2)            
        dict1 ={
            'Features' : col,
            'NA_test (count)': df_test[col].isna().sum(),
            'NA_test (%)': pct_na
        }
        null_values_test.append(dict1)
DF2 = pd.DataFrame(null_values_test, index=None).sort_values(by='NA_test (count)',ascending=False)

pd.merge(DF1, DF2, how='outer', on=['Features']).style.format(None, na_rep="-")

In [None]:
def null_value_percentage_plot(data):
    '''Given a dataframe, this function calculates and plots the number of 
    null values in each column as a percentage
    
    input: data
    output: seaborn horizontal barplot
    
    '''
    null_values = []
    for col in data.columns:
        if data[col].isna().sum() != 0:
            pct_na = np.round((100 * (data[col].isna().sum())/len(data)), 2)            
            # named `row` to avoid shadowing the built-in `dict`
            row ={
                'Column' : col,
                'Null value': data[col].isna().sum(),
                'Percent null value': pct_na
            }
            null_values.append(row)

    z = pd.DataFrame(null_values, index=None)    
    fig = plt.figure(figsize=(12,8))
    #fig.subplots_adjust(top=0.89)
    #sns.set_style("dark")
    ax = sns.barplot(y="Column", x="Percent null value", 
                     data=z.sort_values('Percent null value', ascending=False).head(15),
                     palette='gist_earth',
                     orient='h')
    ax.set_title("Percent null value in columns (top 15 displayed)", fontsize=20, y=1.05)
    return ax

In [None]:
null_value_percentage_plot(df_train)

## 1.3 Categorical Features 
I grouped the columns into three groups according to their cardinality. For now, this is purely for plotting convenience.
1. Low cardinals: cols with < 6 categories
1. Medium cardinals: cols with 6 to 8 categories
1. High cardinals: cols with >= 9 categories
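The same three-way split can be sketched with `pd.cut` on the per-column unique counts. This is a minimal illustration on a made-up toy frame (the column names `two_levels`, `seven_levels` and `ten_levels` are hypothetical); the bin edges follow the thresholds listed above.

```python
import pandas as pd

# Toy frame: three made-up categorical columns with 2, 7 and 10 unique values.
toy = pd.DataFrame({
    "two_levels": ["Y", "N"] * 5,
    "seven_levels": list("ABCDEFG") + ["A", "B", "C"],
    "ten_levels": list("ABCDEFGHIJ"),
})

n_unique = toy.nunique()

# Right-closed bins: (0, 5] -> low, (5, 8] -> medium, (8, inf) -> high
groups = pd.cut(n_unique, bins=[0, 5, 8, float("inf")],
                labels=["low", "medium", "high"])
print(groups)
```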

In [None]:
# separate the columns into numerical and categorical

cat_features =[]
num_features =[]

for col in df_train.columns:
    if df_train[col].dtype=='object':
        cat_features.append(col)
    else:
        num_features.append(col)

# group columns according to cardinality

low_cardinal_cols = []
med_cardinal_cols = []
high_cardinal_cols = []

for col in cat_features:
    if df_train[col].nunique() < 6:
        low_cardinal_cols.append(col)
    elif df_train[col].nunique() < 9:
        med_cardinal_cols.append(col)
    else:
        high_cardinal_cols.append(col)

# display the values

display("low_cardinal_cols")
display(low_cardinal_cols)
display("med_cardinal_cols")
display(med_cardinal_cols)
display("high_cardinal_cols")
display(high_cardinal_cols)

In [None]:
def count_plot_pct(df_train, df_test, cols, titleText, figsize=(24,26)):
    '''Overlay the train and test category counts for every column in cols.'''
    L = len(cols)
    ncol= 3
    nrow= int(np.ceil(L/ncol))
    
    fig, axes = plt.subplots(nrow, ncol, figsize=figsize, facecolor=None, sharex=True, squeeze=False)
    fig.subplots_adjust(top=0.97)
    # hide the unused axes at the end of the grid (if any); indexing with
    # -remove_last would hide the first plot whenever the grid is exactly full
    for unused_ax in axes.ravel()[L:]:
        unused_ax.set_visible(False)
    for ax, col in zip(axes.ravel(), cols):
        sns.countplot(y=col, color="#5F6664", data=df_train, alpha=0.5, label='train', ax=ax)
        sns.countplot(y=col, color="#600000", data=df_test, alpha=0.5, label='test', ax=ax)
        ax.set_xlabel('') 
        sns.despine(ax=ax, top=True, right=True, left=False, bottom=False, offset=5, trim=False)
        ax.legend()
    plt.suptitle(titleText ,fontsize = 24, y=1.002)
    fig.text(0.5, 0.085, 'counts', ha='center')
    plt.show()    
    
    

In [None]:
count_plot_pct(df_train, df_test, low_cardinal_cols, "Unique categories in low cardinal columns")

In [None]:
count_plot_pct(df_train, df_test, med_cardinal_cols, "Unique categories in medium cardinal columns", figsize=(24, 20))

In [None]:
count_plot_pct(df_train, df_test, high_cardinal_cols, "Unique categories in high cardinal columns", figsize=(24, 16))

### 1.3.1 Similarity of unique entries in categorical features (train vs test)

<div class="alert alert-block alert-danger">  
>>> There are 11 categorical features where the unique values in train & test datasets aren't identical. We will deal with this issue in the data pre-processing phase.
</div>
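Comparing the number of unique values per feature shows that a mismatch exists, but not which categories differ. One way to list them is a set difference per column. The sketch below uses made-up toy frames (`toy_train`, `toy_test` and `mismatches` are hypothetical names) and ignores missing values.

```python
import pandas as pd

# Toy frames with made-up values, only to show the mechanics.
toy_train = pd.DataFrame({"Utilities": ["AllPub", "NoSeWa", "AllPub"],
                          "Street": ["Pave", "Grvl", "Pave"]})
toy_test = pd.DataFrame({"Utilities": ["AllPub", "AllPub", None],
                         "Street": ["Pave", "Grvl", "Grvl"]})

mismatches = {}
for col in toy_train.columns:
    train_cats = set(toy_train[col].dropna().unique())
    test_cats = set(toy_test[col].dropna().unique())
    if train_cats != test_cats:
        mismatches[col] = {
            "only_in_train": sorted(train_cats - test_cats),
            "only_in_test": sorted(test_cats - train_cats),
        }

print(mismatches)
```

Categories that appear only in the train set matter most for encoding, since an encoder fitted on train would otherwise create a column that is constant in test.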


In [None]:
# train_data
unique_cat_train = []
for col in df_train[cat_features]:
    unique_train = df_train[col].nunique()  
    dict1 ={
        'Features' : col,
        'Unique cats (train)': unique_train,        
    }
    unique_cat_train.append(dict1)
DF1 = pd.DataFrame(unique_cat_train, index=None).sort_values(by='Unique cats (train)',ascending=False)

# test_data
unique_cat_test = []
for col in df_test[cat_features]:
    unique_test = df_test[col].nunique()    
    dict2 ={
        'Features' : col,
        'Unique cats (test)': unique_test,        
    }
    unique_cat_test.append(dict2)
DF2 = pd.DataFrame(unique_cat_test, index=None).sort_values(by='Unique cats (test)',ascending=False)

pd.merge(DF1, DF2, how='outer', on=['Features']).style.format(None, na_rep="-")

## 1.4 Numerical Features

In [None]:
# plt.subplots already creates the figure, so no separate plt.figure() call is needed
fig, ax = plt.subplots(12, 3,figsize=(20, 46))
fig.subplots_adjust(top=0.96)
itr = 1
for feature in num_features:
    plt.subplot(12, 3, itr)
    ax = sns.histplot(df_train[feature], color="#ff9900", label='train')
    ax = sns.histplot(df_test[feature], color="#4da6ff", label='test')
    plt.xlabel(feature, fontsize=9)
    plt.legend()
    itr += 1
plt.suptitle('Numerical features', fontsize=20)
plt.show()

## 1.5 The target variable (Sale Price)

- Sale price is not **normally** distributed. It is **right skewed**.
- The average sale price is **180921** USD
- The highest and lowest sale prices are **755000** and **34900** USD respectively.
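Right skew can also be quantified rather than only read off a histogram. The sketch below uses a small made-up price series (only the minimum and maximum are taken from the bullets above) and compares the sample skewness before and after a `log1p` transform. The log transform is one common option for right-skewed targets, shown here as an assumption rather than as a step this notebook has committed to.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration; only the min and max match the figures above.
prices = pd.Series(
    [34900, 120000, 150000, 163000, 180000, 214000, 320000, 755000],
    name="SalePrice",
)

skew_raw = prices.skew()            # positive value means a long right tail
skew_log = np.log1p(prices).skew()  # skewness after the log(1 + x) transform

print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```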

In [None]:
median_ = train['SalePrice'].median()
mode_ = train['SalePrice'].mode()
mean_ = train['SalePrice'].mean()
display(mode_, median_, mean_)

In [None]:
train['SalePrice'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
fig.suptitle('Normality Check of SalePrice distribution', fontsize=16)

sns.histplot(y, bins=100,  kde=True, color='#214e69', ax=axes[0])
axes[0].set_title('')

sm.qqplot(y, line='s', ax=axes[1], color='#214e69')
axes[1].set_title('')

### 1.5.1 Which features drive the sale price?

In [None]:
correlation_table = []
y = train['SalePrice']
for col in num_features:
    # Series.corr skips rows with NaN, whereas np.corrcoef returns NaN
    # for features with missing values (e.g. LotFrontage)
    corr = df_train[col].corr(y)
    row ={
        'Features': col,
        'Correlation coefficient' : corr,
        'Feat_type': 'numerical'
    }
    correlation_table.append(row)
dF1 = pd.DataFrame(correlation_table)
fig = plt.figure(figsize=(12,8))
ax = sns.barplot(x="Correlation coefficient", y="Features", 
                     data=dF1.sort_values("Correlation coefficient", ascending=False),
                     palette='Blues_r', alpha=0.75)
ax.set_title("Correlation of numerical features with SalePrice", fontsize=20, y=1.05)

In [None]:
correlation_table= []
for cols in cat_features:
    y = train['SalePrice']
    X = train[cols]
    corr = pd.concat((X, y), axis=1).apply(lambda x : pd.factorize(x)[0]).corr()
    dict ={
        'Features': cols,
        'Correlation coefficient' : corr['SalePrice'][:].values[0],
        'Feat_type': 'categorical'
    }
    correlation_table.append(dict)
dF2 = pd.DataFrame(correlation_table)
fig = plt.figure(figsize=(12,8))
ax = sns.barplot(x="Correlation coefficient", y="Features", 
                     data=dF2.sort_values("Correlation coefficient", ascending=False),
                     palette='Blues_r', alpha=0.75)
ax.set_title("Correlation of categorical features with SalePrice", fontsize=20, y=1.05)
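A caveat on the cell above: `pd.factorize` assigns integer codes in order of first appearance, so for nominal features a Pearson coefficient on those codes depends on the row order. One alternative measure of association between a categorical feature and a numeric target is the correlation ratio (eta), which compares between-group variance to total variance. The sketch below is a swapped-in technique, not the method used in this notebook; `correlation_ratio` is a hypothetical helper and the data is made up.

```python
import numpy as np
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta): sqrt of between-group over total sum of squares."""
    frame = pd.DataFrame({"cat": categories, "val": values}).dropna()
    grand_mean = frame["val"].mean()
    grouped = frame.groupby("cat")["val"]
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((frame["val"] - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total)) if ss_total > 0 else 0.0

# Made-up data: `quality` separates the prices well, `coin` does not.
price = pd.Series([300, 320, 180, 200, 90, 110])
quality = pd.Series(["Ex", "Ex", "TA", "TA", "Fa", "Fa"])
coin = pd.Series(["A", "B", "A", "B", "A", "B"])

eta_quality = correlation_ratio(quality, price)
eta_coin = correlation_ratio(coin, price)
print(eta_quality, eta_coin)
```

Eta ranges from 0 (group means are identical) to 1 (the category fully determines the value) and does not depend on how the categories are coded.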

In [None]:
# # the above two correlation plots combined together

# dF3 = pd.concat((dF1, dF2), axis=0)
# fig = plt.figure(figsize=(12,20))
# ax = sns.barplot(x="Correlation coefficient", y="Features", 
#                      data=dF3.sort_values("Correlation coefficient", ascending=False),
#                      alpha=0.75, hue='Feat_type')
# ax.set_title("Correlation of all features with SalePrice", fontsize=20, y=1.05)

### 1.5.2 Which features are correlated with each other?

In [None]:
# correlate the numeric columns only; object (string) columns cannot be correlated directly
corr = train.select_dtypes(include=np.number).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle
f, ax = plt.subplots(figsize=(18, 12))
cmap = sns.color_palette("Spectral", as_cmap=True)
ax= sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.0, vmin=-1.0, center=0, annot=False,
            square=True, linewidths=.5, cbar_kws={"shrink": 1.0})
ax.set_title('Correlation heatmap: numerical features', fontsize=20, y= 1.05)
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([-0.75, 0, 0.75])
colorbar.set_ticklabels(['negative_corr','Little_to_no_corr','positive_corr'])

In [None]:
f, ax = plt.subplots(figsize=(18, 12))
corr = pd.concat((df_train[cat_features], y), axis=1).apply(lambda x : pd.factorize(x)[0]).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.color_palette("Spectral", as_cmap=True)
ax = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.0, vmin=-1, center=0, annot=False,
            square=True, linewidths=.5, cbar_kws={"shrink": 1.0})
ax.set_title('Correlation heatmap: categorical features', fontsize=20, y= 1.05)
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([-0.75, 0, 0.75])
colorbar.set_ticklabels(['negative_corr','Little_to_no_corr','positive_corr'])

## 1.6 Visualizing missing values (through the SalePrice lens)

Visualizing the null-values within each column relative to the non-null values gives a perspective on what kind of values could be missing. Later, in the data processing phase, this will help guide how we impute the missing values. 

Here we first identify the columns/features with null-values, then plot them against the sale prices and see where they stand.
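The same idea can be checked numerically before plotting: compare the typical sale price of rows where a feature is missing with rows where it is present. The sketch below uses a made-up toy frame (the values are assumptions) and groups by a "missing"/"present" label.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, only to show the mechanics.
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan, "Attchd", "Detchd"],
    "SalePrice": [208500, 98000, 140000, 102000, 250000, 135000],
})

# Label each row by whether the feature is missing, then compare median prices.
label = np.where(toy["GarageType"].isna(), "missing", "present")
median_by_missing = toy.groupby(label)["SalePrice"].median()
print(median_by_missing)
```

A clear gap between the two medians suggests the missing values carry information (for example, "no garage") rather than being random gaps.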

In [None]:
# make a list of columns with a null-value, which is a re-cap
col_with_null_values = []
for col in df_train:
    if df_train[col].isnull().any():
        col_with_null_values.append(col)
display('Recap features with missing values')
display(col_with_null_values)
# create a dataFrame of columns with null-values
df_train_nan = df_train[col_with_null_values].fillna('NaN')
df_train_nan

In [None]:
# the code for the customized color palette (my_pal) is adapted from 
# python-graph-gallery.com

def sns_boxplot(data, features, titleText='Title', ncol=4):
    L = len(features)    
    nrow= int(np.ceil(L/ncol))        
    
    fig, axes = plt.subplots(nrow, ncol, figsize=(28, 18), squeeze=False)
    fig.subplots_adjust(top=0.95) 
    # hide the unused axes at the end of the grid (if any); indexing with
    # -remove_last would hide the first plot whenever the grid is exactly full
    for unused_ax in axes.ravel()[L:]:
        unused_ax.set_visible(False)
    
    for ax, feature in zip(axes.ravel(), features):
        # highlight the 'NaN' category in orange, all other categories in blue
        my_pal = {cat: "#ff9900" if cat == "NaN" else '#32718E' for cat in data[feature].unique()}
        # y is the SalePrice series defined earlier in the notebook
        sns.boxplot(x=data[feature], y=y, data=data, palette=my_pal, ax=ax)
        ax.set_xlabel(feature, fontsize=10)
    plt.suptitle(titleText, fontsize=24,)
    plt.show()    


In [None]:
LisT0 = ['Alley',
 'MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']
LisT1= ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [None]:
sns_boxplot(df_train_nan, LisT0, titleText='The (Sale) Price of Missing Values', ncol=4)

In [None]:
def sns_scatter_bar_subplot(data, features, titleText='Title', ncol=3):
    '''Plot each feature against SalePrice; the plot type depends on the number of unique values.'''
    L = len(features)    
    nrow= int(np.ceil(L/ncol))        
    
    fig, axes = plt.subplots(nrow, ncol, figsize=(22, 18), squeeze=False)
    fig.subplots_adjust(top=0.95) 
    # hide the unused axes at the end of the grid (if any); indexing with
    # -remove_last would hide the first plot whenever the grid is exactly full
    for unused_ax in axes.ravel()[L:]:
        unused_ax.set_visible(False)
    
    for ax, feature in zip(axes.ravel(), features):
        if data[feature].nunique() <= 5:
            sns.boxplot(x=feature, y='SalePrice', data= data, color='#1eb169', ax=ax)
        
        elif data[feature].nunique() < 20:
            sns.barplot(x=feature, y='SalePrice', data= data, color='#600000', ax=ax)
            
        else:
            # no hue is given, so a palette would have no effect here
            sns.scatterplot(data=data, x=feature, y="SalePrice", ax=ax)
                    
        ax.set_xlabel(feature, fontsize=10)
        ax.grid()
    plt.suptitle(titleText, fontsize=24, y=1.002)    
    plt.show()    


## Highly SalePrice-Predictive Features

Below are the ten features most strongly (and positively) correlated with the sale price of the houses. Knowing these features of a house plays a vital role in how accurately one can estimate its sale price. 
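A ranking like this can be derived directly from the data instead of being typed by hand. The sketch below shows the mechanics on a made-up toy frame (the values are assumptions) using `DataFrame.corrwith`; on the real data the same pattern would be applied to the numeric columns of `train`.

```python
import pandas as pd

# Toy frame with made-up values, only to show the mechanics.
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 4],
    "GrLivArea": [1200, 1500, 1710, 2100, 2600, 900],
    "MoSold": [6, 1, 12, 3, 7, 9],
    "SalePrice": [130000, 165000, 208500, 260000, 340000, 95000],
})

# Pearson correlation of every feature column with the target.
corr_with_price = toy.drop(columns="SalePrice").corrwith(toy["SalePrice"])

# Keep the k most positively correlated features (k=2 for this toy example).
top_features = corr_with_price.sort_values(ascending=False).head(2).index.tolist()
print(top_features)
```

Keep in mind that Pearson correlation only captures linear association, so a feature with a strong non-linear relationship to the price can rank low here.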

In [None]:
highCorrFeats = ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF', '1stFlrSF',
                 'FullBath', 'TotRmsAbvGrd', 'YearBuilt','YearRemodAdd', 'SalePrice']
df_highCorr = train[highCorrFeats]
sns_scatter_bar_subplot(df_highCorr, highCorrFeats[:-1], titleText='Features highly correlated with saleprice', ncol=2)

# Part 2: Data pre-processing and Feature Engineering

# Part 3: Modeling & Predictions

<div class="alert alert-info">
  <strong>Notebook in progress....</strong> 
[publicly publishing a partially complete notebook helps me fight procrastination and finish it sooner]

</div>

## Thank you very much for reading this notebook!