# House Price Prediction: Systematic Exploratory Data Analysis

![](https://www.reno.gov/Home/ShowImage?id=7739&t=635620964226970000)

In this kernel, we focus on exploratory data analysis (EDA). The dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. To perform a systematic EDA, we cluster the features into groups, where each group consists of features that describe a similar set of characteristics of a house. Grouping makes EDA over many variables more manageable, because relationships are easier to spot among features that describe related characteristics. The following grouping is used:
- 1.0. [**Output feature**](#section 1.0.): feature that we are trying to predict
  - [**SalePrice**](#section 1.0.): the price of the house when sold.
  
  
- 2.0. [**Surroundings-related features**](#section 2.0.)
  - [**MSZoning**](#section 2.1.): identifies the general zoning classification of the sale
  - [**Neighborhood**](#section 2.2.): physical locations within Ames city limits
  - [**Condition1** & **Condition2**](#section 2.3.): proximity to various conditions
  
  
- 3.0. [**Land/lot-related features**](#section 3.0.)
  - [**LotFrontage**](#section 3.1.): linear feet of street connected to property
  - [**LotArea**](#section 3.1.): lot size in square feet
  - [**Street**](#section 3.2.): type of road access to property
  - [**Alley**](#section 3.2.): type of alley access to property
  - Lot characteristics, including:
    - [**LotShape**](#section 3.3.): general shape of property
    - [**LandContour**](#section 3.3.): flatness of the property
    - [**LotConfig**](#section 3.3.): lot configuration
    - [**LandSlope**](#section 3.3.): slope of property
    
    
- 4.0. [**Overall-building-related features**](#section 4.0.)
  - [**MSSubClass**](#section 4.1.): identifies the type of dwelling involved in the sale 
  - [**BldgType**](#section 4.1.): type of dwelling 
  - [**HouseStyle**](#section 4.1.): style of dwelling
  - [**OverallQual**](#section 4.2.): rates the overall material and finish of the house
  - [**OverallCond**](#section 4.2.): rates the overall condition of the house
  - [**Functional**](#section 4.2.): home functionality (Assume typical unless deductions are warranted)
  - [**YearBuilt**](#section 4.3.): original construction year
  - [**YearRemodAdd**](#section 4.3.): remodel year (same as construction year if no remodeling or additions)
  - [**YrSold**](#section 4.3.): year sold
  
  
- 5.0. [**External characteristics of the house**](#section 5.0.)
  - [**RoofStyle**](#section 5.1.): type of roof
  - [**RoofMatl**](#section 5.1.): roof material
  - [**Exterior1st** & **Exterior2nd**](#section 5.2.): exterior covering on house
  - [**MasVnrType**](#section 5.3.): masonry veneer type
  - [**MasVnrArea**](#section 5.3.): masonry veneer area in square feet
  - [**ExterQual**](#section 5.4.): evaluates the quality of the material on the exterior 
  - [**ExterCond**](#section 5.4.): evaluates the present condition of the material on the exterior
  - [**Foundation**](#section 5.5.): type of foundation
  
  
- 6.0. [**Basement-related features**](#section 6.0.)
  - [**BsmtQual**](#section 6.1.): evaluates the height of the basement
  - [**BsmtCond**](#section 6.1.): evaluates the general condition of the basement
  - [**BsmtExposure**](#section 6.1.): refers to walkout or garden level walls
  - [**BsmtFinType1** & **BsmtFinType2**](#section 6.2.): rating of basement finished area
  - [**BsmtFinSF1**](#section 6.3.): type 1 finished square feet
  - [**BsmtFinSF2**](#section 6.3.): type 2 finished square feet
  - [**BsmtUnfSF**](#section 6.3.): unfinished square feet of basement area
  - [**TotalBsmtSF**](#section 6.3.): total square feet of basement area
  
  
- 7.0. [**Utilities-related features**](#section 7.0.)
  - [**Heating**](#section 7.1.): type of heating
  - [**HeatingQC**](#section 7.1.): heating quality and condition
  - [**Utilities**](#section 7.2.): type of utilities available
  - [**CentralAir**](#section 7.2.): central air conditioning
  - [**Electrical**](#section 7.2.): electrical system
  
  
- 8.0. [**Living-area-related features**](#section 8.0.)
  - [**1stFlrSF**](#section 8.0.): first floor square feet
  - [**2ndFlrSF**](#section 8.0.): second floor square feet
  - [**LowQualFinSF**](#section 8.0.): low quality finished square feet (all floors)
  - [**GrLivArea**](#section 8.0.): above grade (ground) living area square feet
  

- 9.0. [**Bathroom- and bedroom-related features**](#section 9.0.)
  - [**BsmtFullBath**](#section 9.1.): basement full bathrooms
  - [**BsmtHalfBath**](#section 9.1.): basement half bathrooms
  - [**FullBath**](#section 9.1.): full bathrooms above grade
  - [**HalfBath**](#section 9.1.): half baths above grade
  - [**BedroomAbvGr**](#section 9.3.): bedrooms above grade (does NOT include basement bedrooms)
  

- 10.0. [**Kitchen-related features**](#section 10.0.)
  - [**KitchenAbvGr**](#section 10.0.): kitchens above grade
  - [**KitchenQual**](#section 10.0.): kitchen quality


- 11.0. [**Fireplace-related features**](#section 11.0.)
  - [**Fireplaces**](#section 11.0.): number of fireplaces
  - [**FireplaceQu**](#section 11.0.): fireplace quality


- 12.0. [**Garage-related features**](#section 12.0.)
  - [**GarageCars**](#section 12.1.): size of garage in car capacity
  - [**GarageQual**](#section 12.1.): garage quality
  - [**GarageCond**](#section 12.1.): garage condition
  - [**GarageType**](#section 12.1.): garage location
  - [**GarageArea**](#section 12.2.): size of garage in square feet
  - [**GarageYrBlt**](#section 12.3.): year garage was built
  

- 13.0. [**Porch-/deck- related features**](#section 13.0.)
  - [**WoodDeckSF**](#section 13.0.): wood deck area in square feet
  - [**OpenPorchSF**](#section 13.0.): open porch area in square feet
  - [**EnclosedPorch**](#section 13.0.): enclosed porch area in square feet
  - [**3SsnPorch**](#section 13.0.): three season porch area in square feet
  - [**ScreenPorch**](#section 13.0.): screen porch area in square feet


- 14.0. [**Pool-related features**](#section 14.0.)
  - [**PoolArea**](#section 14.1.): pool area in square feet
  - [**PoolQC**](#section 14.2.): pool quality
  
  
- 15.0. [**Miscellaneous features**](#section 15.0.)
  - [**Fence**](#section 15.1.): fence quality.
  - [**MiscFeature**](#section 15.2.): miscellaneous feature not covered in other categories.
  - [**MiscVal**](#section 15.3.): value of miscellaneous feature.
  - [**MoSold**](#section 15.4.): month Sold (MM).
  - [**SaleType**](#section 15.5.): type of sale.
  - [**SaleCondition**](#section 15.6.): condition of sale.

In [None]:
# Import data analysis packages
import numpy as np
import pandas as pd

# Import data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')

# Import stats packages
from scipy.stats import kurtosis, skew, pearsonr

# Miscellaneous
import time
import warnings
warnings.filterwarnings('ignore')
pd.set_option('max_colwidth',80)

The train and test datasets are available at this [link](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). We import them as follows:

In [None]:
# Import datasets
df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
# Show the first five rows of the training dataset
df_train.head()

In [None]:
# Set the 'Id' column as the index of the datasets
df_train.set_index('Id', inplace=True)
df_test.set_index('Id', inplace=True)

To make it easier and simpler to visualize the distribution and correlation among features, we define the following functions:

In [None]:
def create_boxen_count_2x1(first_feature, figsize):
    plt.figure(figsize=figsize)

    # Create boxen plot of first_feature and Log_SalePrice
    ax2 = plt.subplot(212)
    sns.boxenplot(x=first_feature, y='Log_SalePrice', data=df_train, color='tomato')
    plt.xticks(rotation='horizontal')

    # Create countplot of first_feature
    ax1 = plt.subplot(211, sharex=ax2)
    sns.countplot(x=first_feature, data=df_train, color='dimgrey')
    plt.setp(ax1.get_xticklabels(), visible=False)
    plt.xlabel('')

    # Adjusting the spaces between graphs
    plt.subplots_adjust(hspace = 0)

    plt.show()

In [None]:
def create_boxen_count_2x2(first_feature, second_feature, figsize):
    plt.figure(figsize=figsize)

    # Create boxen plot of first_feature and Log_SalePrice
    ax3 = plt.subplot(223)
    sns.boxenplot(x=first_feature, y='Log_SalePrice', data=df_train, color='dimgrey')

    # Create boxen plot of second_feature and Log_SalePrice
    ax4 = plt.subplot(224, sharey=ax3)
    sns.boxenplot(x=second_feature, y='Log_SalePrice', data=df_train, color='tomato')
    plt.setp(ax4.get_yticklabels(), visible=False)
    plt.ylabel('')

    #---------------------------------------------------------------------------------------

    # Create countplot of first_feature
    ax1 = plt.subplot(221, sharex=ax3)
    sns.countplot(x=first_feature, data=df_train, color='dimgrey')
    plt.setp(ax1.get_xticklabels(), visible=False)
    plt.xlabel('')

    # Create countplot of second_feature
    ax2 = plt.subplot(222, sharey=ax1, sharex=ax4)
    sns.countplot(x=second_feature, data=df_train, color='tomato')
    plt.setp(ax2.get_yticklabels(), visible=False)
    plt.setp(ax2.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Adjusting the spaces between graphs
    plt.subplots_adjust(wspace=0, hspace=0)

    plt.show()

In [None]:
def create_boxen_count_2x3(first_feature, second_feature, third_feature, figsize):
    plt.figure(figsize=figsize)

    # Create boxenplot of first_feature and Log_SalePrice
    ax4 = plt.subplot(234)
    sns.boxenplot(x=first_feature, y='Log_SalePrice', data=df_train, color='dimgrey')

    # Create boxenplot of second_feature and Log_SalePrice
    ax5 = plt.subplot(235, sharey=ax4)
    sns.boxenplot(x=second_feature, y='Log_SalePrice', data=df_train, color='tomato')
    plt.setp(ax5.get_yticklabels(), visible=False)
    plt.ylabel('')

    # Create boxenplot of third_feature and Log_SalePrice
    ax6 = plt.subplot(236, sharey=ax4)
    sns.boxenplot(x=third_feature, y='Log_SalePrice', data=df_train, color='darkseagreen')
    plt.setp(ax6.get_yticklabels(), visible=False)
    plt.ylabel('')

    #---------------------------------------------------------------------------------------

    # Create countplot of first_feature
    ax1 = plt.subplot(231, sharex=ax4)
    sns.countplot(x=first_feature, data=df_train, color='dimgrey')
    plt.setp(ax1.get_xticklabels(), visible=False)
    plt.xlabel('')

    # Create countplot of second_feature
    ax2 = plt.subplot(232, sharey=ax1, sharex=ax5)
    sns.countplot(x=second_feature, data=df_train, color='tomato')
    plt.setp(ax2.get_yticklabels(), visible=False)
    plt.setp(ax2.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Create countplot of third_feature
    ax3 = plt.subplot(233, sharey=ax1, sharex=ax6)
    sns.countplot(x=third_feature, data=df_train, color='darkseagreen')
    plt.setp(ax3.get_yticklabels(), visible=False)
    plt.setp(ax3.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Adjusting the spaces between graphs
    plt.subplots_adjust(wspace=0, hspace=0)

    plt.show()

In [None]:
def create_boxen_count_2x4(first_feature, second_feature, third_feature, fourth_feature, figsize):
    plt.figure(figsize=figsize)

    # Create boxenplot of first_feature and Log_SalePrice
    ax5 = plt.subplot(245)
    sns.boxenplot(x=first_feature, y='Log_SalePrice', data=df_train, color='dimgrey')

    # Create boxenplot of second_feature and Log_SalePrice
    ax6 = plt.subplot(246, sharey=ax5)
    sns.boxenplot(x=second_feature, y='Log_SalePrice', data=df_train, color='tomato')
    plt.setp(ax6.get_yticklabels(), visible=False)
    plt.ylabel('')

    # Create boxenplot of third_feature and Log_SalePrice
    ax7 = plt.subplot(247, sharey=ax5)
    sns.boxenplot(x=third_feature, y='Log_SalePrice', data=df_train, color='darkseagreen')
    plt.setp(ax7.get_yticklabels(), visible=False)
    plt.ylabel('')

    # Create boxenplot of fourth_feature and Log_SalePrice
    ax8 = plt.subplot(248, sharey=ax5)
    sns.boxenplot(x=fourth_feature, y='Log_SalePrice', data=df_train, color='seagreen')
    plt.setp(ax8.get_yticklabels(), visible=False)
    plt.ylabel('')

    #---------------------------------------------------------------------------------------

    # Create countplot of first_feature
    ax1 = plt.subplot(241, sharex=ax5)
    sns.countplot(x=first_feature, data=df_train, color='dimgrey')
    plt.setp(ax1.get_xticklabels(), visible=False)
    plt.xlabel('')

    # Create countplot of second_feature
    ax2 = plt.subplot(242, sharey=ax1, sharex=ax6)
    sns.countplot(x=second_feature, data=df_train, color='tomato')
    plt.setp(ax2.get_yticklabels(), visible=False)
    plt.setp(ax2.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Create countplot of third_feature
    ax3 = plt.subplot(243, sharey=ax1, sharex=ax7)
    sns.countplot(x=third_feature, data=df_train, color='darkseagreen')
    plt.setp(ax3.get_yticklabels(), visible=False)
    plt.setp(ax3.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Create countplot of fourth_feature
    ax4 = plt.subplot(244, sharey=ax1, sharex=ax8)
    sns.countplot(x=fourth_feature, data=df_train, color='seagreen')
    plt.setp(ax4.get_yticklabels(), visible=False)
    plt.setp(ax4.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Adjusting the spaces between graphs
    plt.subplots_adjust(wspace=0, hspace=0)

    plt.show()

## 1. SalePrice

**SalePrice** is the target feature. Let's plot the distribution of **SalePrice**.

In [None]:
# Create distribution plot of SalePrice feature
plt.figure(figsize=(14,5))
sns.distplot(df_train['SalePrice'], bins=150, color='g')
plt.show()

It loosely resembles a normal distribution, but the right tail looks longer than the left. Let's check its skewness and kurtosis.

In [None]:
# Calculate the skewness and kurtosis of SalePrice distribution
print('Skewness of the distribution of SalePrice: {}'.format(skew(df_train['SalePrice'])))
print('Kurtosis of the distribution of SalePrice: {}'.format(kurtosis(df_train['SalePrice'])))

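Both the skewness and the kurtosis deviate significantly from those of a normal distribution (both would be 0, since `scipy.stats.kurtosis` reports excess kurtosis by default). We apply a log transformation to bring the distribution closer to normal.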

In [None]:
# Log transform the SalePrice feature
df_train['Log_SalePrice'] = np.log(df_train['SalePrice'])

In [None]:
# Create a distribution plot of SalePrice feature after log transformation
plt.figure(figsize=(14,5))
sns.distplot(df_train['Log_SalePrice'], bins=150, color='r')
plt.show()

Let's check the skewness and kurtosis after log transformation.

In [None]:
# Calculate the skewness and kurtosis of SalePrice distribution after log transformation
print('Skewness of the distribution of SalePrice after log transformation: {}'.format(skew(df_train['Log_SalePrice'])))
print('Kurtosis of the distribution of SalePrice after log transformation: {}'.format(kurtosis(df_train['Log_SalePrice'])))
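
To illustrate why a log transformation helps with a right-skewed target, here is a minimal sketch on synthetic lognormal "prices". The data, the seed, and the variable names are assumptions for illustration only; the numbers it prints say nothing about the Ames dataset.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Synthetic right-skewed "prices" (lognormal), not the Ames data
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

raw_skew = skew(prices)
log_skew = skew(np.log(prices))
raw_kurt = kurtosis(prices)          # excess kurtosis (0 for a normal distribution)
log_kurt = kurtosis(np.log(prices))

print('raw: skew={:.2f}, kurtosis={:.2f}'.format(raw_skew, raw_kurt))
print('log: skew={:.2f}, kurtosis={:.2f}'.format(log_skew, log_kurt))
```

Because the logarithm of a lognormal variable is normal by construction, the second line should show values close to 0, while the first should show a clearly positive skew.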

## 2. Surroundings-related features
Surroundings-related features consist of:
- [**MSZoning**](#section 2.1.): identifies the general zoning classification of the sale
- [**Neighborhood**](#section 2.2.): physical locations within Ames city limits
- [**Condition1** & **Condition2**](#section 2.3.): proximity to various conditions

### 2.1. MSZoning

We treat **MSZoning** as an ordinal categorical feature and transform it into a numerical feature to make trends/relationships easier to identify during visualization. The encoding follows the population density of the zone, from 1 (least populated) to 5 (most populated).

In [None]:
# Transform MSZoning into a numerical feature 
MSZoning_map = {'FV':1, 'RL':2, 'RM':3, 'RH':4, 'C (all)':5}
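# Assign the result back: calling replace(..., inplace=True) on a column selection
# may not update the DataFrame when pandas copy-on-write is enabled
df_train['MSZoning'] = df_train['MSZoning'].replace(MSZoning_map)
df_test['MSZoning'] = df_test['MSZoning'].replace(MSZoning_map)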
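
When encoding categories by hand, it can help to check that every category in the data is actually covered by the mapping. The sketch below shows one way to do this with `Series.map`; the helper `encode_ordinal` and the toy `zoning` column (including the made-up label `'A'`) are assumptions for illustration, not part of the original kernel.

```python
import pandas as pd

def encode_ordinal(series, mapping):
    """Map categories to numbers and report any category missing from the mapping."""
    unmapped = sorted(set(series.dropna().unique()) - set(mapping))
    # Series.map yields NaN for missing values and for categories not in the mapping
    return series.map(mapping), unmapped

# Toy column with one missing value and one unexpected label ('A' is made up)
zoning = pd.Series(['RL', 'RM', None, 'FV', 'A'])
MSZoning_map = {'FV': 1, 'RL': 2, 'RM': 3, 'RH': 4, 'C (all)': 5}

encoded, unmapped = encode_ordinal(zoning, MSZoning_map)
print(encoded.tolist())
print(unmapped)
```

Unlike `replace`, which leaves unmapped labels unchanged, `map` turns them into NaN, so a non-empty `unmapped` list is a signal to extend the mapping before plotting.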

Let's visualize **MSZoning**. 

In [None]:
create_boxen_count_2x1('MSZoning', (7,8))

**Observations**:
- Most of the houses are located in zone 2 (residential, low population density).
- Houses in lower-density zones generally have a higher **Log_SalePrice** than houses in higher-density zones.

### 2.2. Neighborhood

Let's visualize **Neighborhood**.

In [None]:
create_boxen_count_2x1('Neighborhood', (20,8))

### 2.3. Condition1 & Condition2

Let's visualize **Condition1** and **Condition2**.

In [None]:
create_boxen_count_2x2('Condition1', 'Condition2', (13,8))

**Observations**:
- Most of the houses have a normal condition.
- Neither **Condition1** nor **Condition2** seems to be related to **Log_SalePrice**.

### 2.4. Correlation among surroundings-related features

Let's visualize the correlation between **MSZoning** and **Log_SalePrice**.

In [None]:
surround_feats = df_train[['MSZoning', 'Log_SalePrice']]

In [None]:
plt.figure(figsize=(2,2))

sns.heatmap(surround_feats.corr(), annot=True, cmap='RdBu')

plt.show()

**Observations:**
- There is a weak negative correlation between **MSZoning** and **Log_SalePrice**.
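
Because the encoded feature is ordinal, a rank-based coefficient such as Spearman's may be a useful cross-check on the Pearson coefficient that `DataFrame.corr()` computes by default. The sketch below compares the two on synthetic data; the column names `OrdinalCode` and `Target` and the generated values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic ordinal codes 1..5 and a target that decreases with the code (illustrative only)
codes = rng.integers(1, 6, size=500)
target = 12.5 - 0.1 * codes + rng.normal(0, 0.3, size=500)
toy = pd.DataFrame({'OrdinalCode': codes, 'Target': target})

pearson = toy.corr(method='pearson').loc['OrdinalCode', 'Target']
spearman = toy.corr(method='spearman').loc['OrdinalCode', 'Target']
print('Pearson: {:.2f}, Spearman: {:.2f}'.format(pearson, spearman))
```

In the notebook, the same idea would amount to passing `method='spearman'` to the `corr()` call that feeds the heatmap.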

## 3. Land/lot-related features
Land/lot-related features consist of:
- [**LotFrontage**](#section 3.1.): linear feet of street connected to property
- [**LotArea**](#section 3.1.): lot size in square feet
- [**Street**](#section 3.2.): type of road access to property
- [**Alley**](#section 3.2.): type of alley access to property
- Lot characteristics, including:
  - [**LotShape**](#section 3.3.): general shape of property
  - [**LandContour**](#section 3.3.): flatness of the property
  - [**LotConfig**](#section 3.3.): lot configuration
  - [**LandSlope**](#section 3.3.): slope of property

### 3.1. LotFrontage & LotArea
Let's plot the distribution of **LotFrontage** and **LotArea**. 

In [None]:
plt.figure(figsize=(14,10))

# Create distribution plot of LotFrontage
plt.subplot(211)
sns.distplot(df_train['LotFrontage'].dropna(), bins=100, color='r')

# Create distribution plot of LotArea
plt.subplot(212)
sns.distplot(df_train['LotArea'].dropna(), bins=100, color='g')

# Adjusting the spaces between graphs
plt.subplots_adjust(hspace = 0.2)

plt.show()

They resemble a normal distribution. Let's check their skewness and kurtosis.

In [None]:
# Calculate the skewness and kurtosis of LotFrontage and LotArea distribution
print('Skewness of the distribution of LotFrontage: {}'.format(skew(df_train['LotFrontage'].dropna())))
print('Kurtosis of the distribution of LotFrontage: {}'.format(kurtosis(df_train['LotFrontage'].dropna())))
print('')
print('Skewness of the distribution of LotArea: {}'.format(skew(df_train['LotArea'].dropna())))
print('Kurtosis of the distribution of LotArea: {}'.format(kurtosis(df_train['LotArea'].dropna())))

Their skewness and kurtosis deviate significantly from those of a normal distribution. We apply a log transformation to bring them closer to normal.

In [None]:
# Log transform LotFrontage and LotArea
df_train['LotFrontage'] = np.log(df_train['LotFrontage'])
df_train['LotArea'] = np.log(df_train['LotArea'])

df_test['LotFrontage'] = np.log(df_test['LotFrontage'])
df_test['LotArea'] = np.log(df_test['LotArea'])

Let's check the skewness and kurtosis after log transformation.

In [None]:
# Calculate the skewness and kurtosis of LotFrontage and LotArea distribution after log transformation
print('Skewness of the distribution of LotFrontage after log transformation: {}'.format(skew(df_train['LotFrontage'].dropna())))
print('Kurtosis of the distribution of LotFrontage after log transformation: {}'.format(kurtosis(df_train['LotFrontage'].dropna())))
print('')
print('Skewness of the distribution of LotArea after log transformation: {}'.format(skew(df_train['LotArea'].dropna())))
print('Kurtosis of the distribution of LotArea after log transformation: {}'.format(kurtosis(df_train['LotArea'].dropna())))

Now the distributions more closely resemble a normal distribution.

Let's check the relationship of **LotFrontage** and **LotArea** with **Log_SalePrice** by using scatter plots.

In [None]:
plt.figure(figsize=(10,8))

ax1 = plt.subplot(121)
sns.regplot(x='LotFrontage', y='Log_SalePrice', data=df_train, color='lightcoral')

ax2 = plt.subplot(122, sharey=ax1)
sns.regplot(x='LotArea', y='Log_SalePrice', data=df_train, color='darkseagreen')
plt.setp(ax2.get_yticklabels(), visible=False)
plt.ylabel('')

# Adjusting the spaces between graphs
plt.subplots_adjust(wspace = 0)

plt.show()

**Observation**:
- There seems to be a linear relationship between **LotFrontage** and **Log_SalePrice**.
- There seems to be a linear relationship between **LotArea** and **Log_SalePrice**.

### 3.2. Street & Alley
Before visualizing, we need to replace the *NaN* values in the **Alley** feature with *No alley access*.

In [None]:
# Fill NaN with 'No alley access'
df_train['Alley'] = df_train['Alley'].fillna('No alley access')
df_test['Alley'] = df_test['Alley'].fillna('No alley access')

Now, let's visualize **Street** and **Alley**. 

In [None]:
create_boxen_count_2x2('Street', 'Alley', (10,10))

**Observation**:
- Most of the houses have paved road access. 
- Most of the houses have no alley access. 
- There seems to be no relationship between **Street** and **Log_SalePrice**.
- There seems to be no relationship between **Alley** and **Log_SalePrice**.

### 3.3. Lot Characteristics
Lot characteristics include **LotShape**, **LandContour**, **LotConfig**, and **LandSlope**. We transform all four into numerical features to make trends/relationships easier to identify during visualization: **LotShape** and **LandSlope** are encoded as ordinal features, while **LandContour** and **LotConfig** are reduced to binary indicators (level vs. not level, and inside lot vs. any other configuration).

In [None]:
# Transform LotShape into a numerical feature
LotShape_map = {'Reg':1, 'IR1':2, 'IR2':3, 'IR3':4}
df_train['LotShape'] = df_train['LotShape'].replace(LotShape_map)
df_test['LotShape'] = df_test['LotShape'].replace(LotShape_map)

# Transform LandContour into a binary feature (0 = level, 1 = not level)
LandContour_map = {'Lvl':0, 'Bnk':1, 'Low':1, 'HLS':1}
df_train['LandContour'] = df_train['LandContour'].replace(LandContour_map)
df_test['LandContour'] = df_test['LandContour'].replace(LandContour_map)

# Transform LotConfig into a binary feature (0 = inside lot, 1 = any other configuration)
LotConfig_map = {'Inside':0, 'FR2':1, 'Corner':1, 'CulDSac':1, 'FR3':1}
df_train['LotConfig'] = df_train['LotConfig'].replace(LotConfig_map)
df_test['LotConfig'] = df_test['LotConfig'].replace(LotConfig_map)

# Transform LandSlope into a numerical feature
LandSlope_map = {'Gtl':0, 'Mod':1, 'Sev':2}
df_train['LandSlope'] = df_train['LandSlope'].replace(LandSlope_map)
df_test['LandSlope'] = df_test['LandSlope'].replace(LandSlope_map)

In [None]:
create_boxen_count_2x4('LotShape', 'LandContour', 'LotConfig', 'LandSlope', (20,12))

**Observation**: 
- There seems to be no relationship between lot characteristics related features and **Log_SalePrice**.
- These features are not likely to help in predicting house prices. 

### 3.4. Correlation among land/lot-related features
Let's visualize the correlation among **LotFrontage**, **LotArea**, **Street**, **Alley**, **LotShape**, **LandContour**, **LotConfig**, **LandSlope**, and **Log_SalePrice**.

In [None]:
lot_feats = df_train[['LotFrontage', 'LotArea', 'Street', 'Alley', 
                      'LotShape', 'LandContour', 'LotConfig', 'LandSlope', 
                      'Log_SalePrice']]

In [None]:
plt.figure(figsize=(10,8))

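# Street and Alley still hold strings, so restrict the correlation to numeric columns
sns.heatmap(lot_feats.corr(numeric_only=True), annot=True, cmap='RdBu')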

plt.show()

**Observation**:
There is a weak positive correlation between: 
- **LotFrontage** and **Log_SalePrice**
- **LotArea** and **Log_SalePrice**
- **LotShape** and **Log_SalePrice**

Additionally, **LotArea** and **LotFrontage** are correlated with each other, which can introduce multicollinearity if both are used as predictors.
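
One common way to quantify multicollinearity beyond a pairwise heatmap is the variance inflation factor (VIF), which for standardized predictors equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below computes it on synthetic predictors; the columns `a`, `b`, and `c` are made up for illustration and do not come from the Ames data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2000
# Synthetic predictors: 'b' is built from 'a', 'c' is independent (illustrative only)
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = rng.normal(size=n)
X = pd.DataFrame({'a': a, 'b': b, 'c': c})

# The VIFs are the diagonal entries of the inverse of the correlation matrix
corr = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
print(vif.round(2))
```

A VIF close to 1 indicates a predictor that is nearly uncorrelated with the others, while larger values indicate that it is largely explained by them. Applying this to the notebook's features would first require dropping or imputing rows with missing values.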

## 4. Overall-building-related features
Overall-building-related features consist of:
- [**MSSubClass**](#section 4.1.): identifies the type of dwelling involved in the sale 
- [**BldgType**](#section 4.1.): type of dwelling 
- [**HouseStyle**](#section 4.1.): style of dwelling
- [**OverallQual**](#section 4.2.): rates the overall material and finish of the house
- [**OverallCond**](#section 4.2.): rates the overall condition of the house
- [**Functional**](#section 4.2.): home functionality (Assume typical unless deductions are warranted)
- [**YearBuilt**](#section 4.3.): original construction year
- [**YearRemodAdd**](#section 4.3.): remodel year (same as construction year if no remodeling or additions)
- [**YrSold**](#section 4.3.): year sold

### 4.1. MSSubClass, BldgType, & HouseStyle
Let's visualize **MSSubClass**, **BldgType**, and **HouseStyle**.

In [None]:
create_boxen_count_2x3('MSSubClass', 'BldgType', 'HouseStyle', (16,8))

**Observations**:
- There seems to be no relationship between **MSSubClass** and **Log_SalePrice**.
- There seems to be no relationship between **BldgType** and **Log_SalePrice**.
- There seems to be no relationship between **HouseStyle** and **Log_SalePrice**.

### 4.2. OverallQual, OverallCond, Functional
**Functional** is an ordinal categorical feature. We will transform it into a numerical feature to make it easier to identify trends/relationships during visualization.

In [None]:
# Transform Functional into a numerical feature 
Functional_map = {'Sal':0, 'Sev':1, 'Maj2':2, 'Maj1':3, 'Mod':4, 'Min2':5, 'Min1':6, 'Typ':7}
df_train['Functional'] = df_train['Functional'].replace(Functional_map)
df_test['Functional'] = df_test['Functional'].replace(Functional_map)

Let's visualize **OverallQual**, **OverallCond**, and **Functional**.

In [None]:
create_boxen_count_2x3('OverallQual', 'OverallCond', 'Functional', (16,8))

**Observations**:
- Houses with higher **OverallQual** have higher **Log_SalePrice**. In other words, there seems to be a strong relationship between **OverallQual** and **Log_SalePrice**. 
- There seems to be no relationship between **OverallCond** and **Log_SalePrice**. Similarly, there seems to be no relationship between **Functional** and **Log_SalePrice**. These features are not likely to help in predicting house prices. 

### 4.3. YearBuilt, YearRemodAdd, & YrSold
We can create a **TotalAge** feature that gives the age of the property from the year it was built (**YearBuilt**) to the year it was sold (**YrSold**). Similarly, we can create a **YrSinceRemod** feature that gives the number of years from the remodeling year (**YearRemodAdd**) to the year sold (**YrSold**).

In [None]:
# Create TotalAge feature
df_train['TotalAge'] = df_train['YrSold'] - df_train['YearBuilt']
df_test['TotalAge'] = df_test['YrSold'] - df_test['YearBuilt']

# Create YrSinceRemod feature
df_train['YrSinceRemod'] = df_train['YrSold'] - df_train['YearRemodAdd']
df_test['YrSinceRemod'] = df_test['YrSold'] - df_test['YearRemodAdd']
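
Derived durations like these are worth a quick sanity check, since a recorded remodel or construction year later than the sale year would produce a negative value. The sketch below flags such rows on made-up data and shows one possible treatment; the toy rows and the choice to clip at zero are assumptions, not something the original kernel does.

```python
import pandas as pd

# Toy rows (made up): the last one was remodeled after the recorded sale year
toy = pd.DataFrame({
    'YearBuilt':    [1995, 2003, 2007],
    'YearRemodAdd': [1995, 2005, 2008],
    'YrSold':       [2008, 2006, 2007],
})
toy['TotalAge'] = toy['YrSold'] - toy['YearBuilt']
toy['YrSinceRemod'] = toy['YrSold'] - toy['YearRemodAdd']

# Flag rows where a derived duration is negative
negative = toy[(toy['TotalAge'] < 0) | (toy['YrSinceRemod'] < 0)]
print(negative)

# One possible treatment: clip negative durations to zero
toy['YrSinceRemod'] = toy['YrSinceRemod'].clip(lower=0)
```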

Let's check the relationship of **TotalAge** and **YrSinceRemod** with **Log_SalePrice** by using scatter plots.

In [None]:
plt.figure(figsize=(14,8))

ax1 = plt.subplot(121)
sns.regplot(x='TotalAge', y='Log_SalePrice', data=df_train, color='lightcoral')

ax2 = plt.subplot(122, sharey=ax1)
sns.regplot(x='YrSinceRemod', y='Log_SalePrice', data=df_train, color='darkseagreen')
plt.setp(ax2.get_yticklabels(), visible=False)
plt.ylabel('')

# Adjusting the spaces between graphs
plt.subplots_adjust(wspace = 0)

plt.show()

**Observations**:
- Older houses typically have lower prices. There seems to be a negative relationship between **TotalAge** and **Log_SalePrice**.
- Recently remodeled houses have higher prices. There seems to be a linear relationship between **YrSinceRemod** and **Log_SalePrice**.

### 4.4. Correlation among overall-building-related features
Let's visualize the correlation among **OverallQual**, **OverallCond**, **Functional**, **TotalAge**, **YrSinceRemod**, and **Log_SalePrice**.

In [None]:
overall_feats = df_train[['OverallQual', 'OverallCond', 'Functional', 
                          'TotalAge', 'YrSinceRemod', 'Log_SalePrice']]

In [None]:
plt.figure(figsize=(10,8))

sns.heatmap(overall_feats.corr(), annot=True, cmap='RdBu')

plt.show()

**Observations**:
- There is a positive correlation between **OverallQual** and **Log_SalePrice**. 
- There is a negative correlation between **TotalAge** and **Log_SalePrice**. Similarly, there is a negative correlation between **YrSinceRemod** and **Log_SalePrice**. However, note that **TotalAge** and **YrSinceRemod** are correlated with each other, which can introduce multicollinearity.

## 5. External characteristics of the house
External characteristics of the house include:
- [**RoofStyle**](#section 5.1.): type of roof
- [**RoofMatl**](#section 5.1.): roof material
- [**Exterior1st** & **Exterior2nd**](#section 5.2.): exterior covering on house
- [**MasVnrType**](#section 5.3.): masonry veneer type
- [**MasVnrArea**](#section 5.3.): masonry veneer area in square feet
- [**ExterQual**](#section 5.4.): evaluates the quality of the material on the exterior 
- [**ExterCond**](#section 5.4.): evaluates the present condition of the material on the exterior
- [**Foundation**](#section 5.5.): type of foundation

### 5.1. RoofStyle & RoofMatl
Let's visualize **RoofStyle** and **RoofMatl**.

In [None]:
create_boxen_count_2x2('RoofStyle', 'RoofMatl', (15,9))

**Observations**:
- Most of the houses have a gable or hip roof style, and most have standard (composite) shingle roof material.
- There seems to be no relationship between **RoofStyle** and **Log_SalePrice**. Similarly, there seems to be no relationship between **RoofMatl** and **Log_SalePrice**. Therefore, these features may add little value for predicting house prices.

### 5.2. Exterior1st & Exterior2nd
Let's visualize **Exterior1st** and **Exterior2nd**.

In [None]:
create_boxen_count_2x2('Exterior1st', 'Exterior2nd', (15,9))

**Observations**:
- There seems to be no relationship between **Exterior1st** and **Log_SalePrice**. Similarly, there seems to be no relationship between **Exterior2nd** and **Log_SalePrice**. Therefore, these features may not help in predicting house prices.

### 5.3. MasVnrType & MasVnrArea
Let's visualize **MasVnrType**. 

In [None]:
create_boxen_count_2x1('MasVnrType', (8,8))

**Observations**:
- There seems to be no relationship between **MasVnrType** and **Log_SalePrice**. 

We transform **MasVnrType** into a numerical feature. If **MasVnrType** is *None*, we transform the value to 0. If **MasVnrType** is *BrkFace*, *Stone*, or *BrkCmn*, we transform the value to 1.

In [None]:
# Transform MasVnrType into a numerical feature 
MasVnrType_map = {'None':0, 'BrkFace':1, 'Stone':1, 'BrkCmn':1}
df_train['MasVnrType'] = df_train['MasVnrType'].replace(MasVnrType_map)
df_test['MasVnrType'] = df_test['MasVnrType'].replace(MasVnrType_map)

Now, let's visualize **MasVnrType**.

In [None]:
create_boxen_count_2x1('MasVnrType', (8,8))

**Observations**:
- Houses with **MasVnrType** 0 generally have a lower **Log_SalePrice** than houses with **MasVnrType** 1.

After we visualize **MasVnrType**, let's visualize **MasVnrArea**. 

In [None]:
# sns.jointplot creates its own figure, so no separate plt.figure() call is needed

sns.jointplot(x='MasVnrArea', 
              y='Log_SalePrice', 
              data=df_train, 
              kind='reg', 
              height=9,
              color='darkseagreen')

plt.show()

**Observations**:
- Houses with higher **MasVnrArea** have higher **Log_SalePrice**. There seems to be a relationship between **MasVnrArea** and **Log_SalePrice**. 

### 5.4. ExterQual & ExterCond
**ExterQual** and **ExterCond** are ordinal categorical features. We will transform them into numerical features to make it easier to identify trends/relationships during visualization.

In [None]:
# ExterQual and ExterCond share the same rating scale, so one map serves both
ExterQual_map = {'Po':1 ,'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}

df_train['ExterQual'] = df_train['ExterQual'].replace(ExterQual_map)
df_test['ExterQual'] = df_test['ExterQual'].replace(ExterQual_map)

df_train['ExterCond'] = df_train['ExterCond'].replace(ExterQual_map)
df_test['ExterCond'] = df_test['ExterCond'].replace(ExterQual_map)

Let's visualize **ExterQual** and **ExterCond**.

In [None]:
create_boxen_count_2x2('ExterQual', 'ExterCond', (10,9))

**Observations**:
- Houses with higher **ExterQual** have higher **Log_SalePrice**. There seems to be a strong relationship between **ExterQual** and **Log_SalePrice**. 

### 5.5. Foundation
Let's visualize **Foundation**.

In [None]:
create_boxen_count_2x1('Foundation', (8,8))

**Observations**:
- There seems to be no relationship between **Foundation** and **Log_SalePrice**. 

### 5.6. Correlation among external characteristics of the house

Let's visualize the correlation among **MasVnrType**, **MasVnrArea**, **ExterQual**, **ExterCond**, and **Log_SalePrice**.

In [None]:
ext_feats = df_train[['MasVnrType', 'MasVnrArea', 'ExterQual', 
                      'ExterCond', 'Log_SalePrice']]

In [None]:
plt.figure(figsize=(10,8))

sns.heatmap(ext_feats.corr(), annot=True, cmap='RdBu')

plt.show()

**Observations**:
- There is a weak positive correlation between **MasVnrType** and **Log_SalePrice**. Similarly, there is a positive correlation between **MasVnrArea** and **Log_SalePrice**. However, **MasVnrArea** and **MasVnrType** are also correlated with each other, which indicates multicollinearity.
- There is a strong positive correlation between **ExterQual** and **Log_SalePrice**.
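To avoid reading collinear pairs off the heatmap by eye, the pairs can also be listed programmatically. The following is a minimal sketch on made-up data; the helper name `high_corr_pairs`, the 0.8 threshold, and the toy columns are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy data (hypothetical): 'b' is an exact multiple of 'a'; 'c' is only weakly related
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = high_corr_pairs(toy, threshold=0.8)
print(pairs)
```

On the real data, the same helper could be called on `ext_feats` after dropping **Log_SalePrice**, so that only feature-to-feature pairs are reported.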

## 6. Basement-related features
Basement-related features include:
- [**BsmtQual**](#section 6.1.): evaluates the height of the basement
- [**BsmtCond**](#section 6.1.): evaluates the general condition of the basement
- [**BsmtExposure**](#section 6.1.): refers to walkout or garden level walls
- [**BsmtFinType1** & **BsmtFinType2**](#section 6.2.): rating of basement finished area
- [**BsmtFinSF1**](#section 6.3.): type 1 finished square feet
- [**BsmtFinSF2**](#section 6.3.): type 2 finished square feet
- [**BsmtUnfSF**](#section 6.3.): unfinished square feet of basement area
- [**TotalBsmtSF**](#section 6.3.): total square feet of basement area

### 6.1. BsmtQual, BsmtCond, & BsmtExposure
**BsmtQual**, **BsmtCond**, and **BsmtExposure** are ordinal categorical features. We will transform them into numerical features to make it easier to identify trends/relationships during visualization.

In [None]:
BsmtQualCond_map = {'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}

df_train['BsmtQual'].replace(BsmtQualCond_map, inplace=True)
df_test['BsmtQual'].replace(BsmtQualCond_map, inplace=True)

df_train['BsmtCond'].replace(BsmtQualCond_map, inplace=True)
df_test['BsmtCond'].replace(BsmtQualCond_map, inplace=True)

BsmtExposure_map = {'No':0 ,'Mn':1, 'Av':2, 'Gd':3}
df_train['BsmtExposure'].replace(BsmtExposure_map, inplace=True)
df_test['BsmtExposure'].replace(BsmtExposure_map, inplace=True)

We also need to replace `NaN` values, which indicate houses with no basement, with 0.

In [None]:
df_train['BsmtQual'].fillna(0, inplace=True)
df_test['BsmtQual'].fillna(0, inplace=True)

df_train['BsmtCond'].fillna(0, inplace=True)
df_test['BsmtCond'].fillna(0, inplace=True)

df_train['BsmtExposure'].fillna(0, inplace=True)
df_test['BsmtExposure'].fillna(0, inplace=True)
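The two steps above (mapping the labels, then filling `NaN` with 0) can also be combined into one small helper that returns a new column instead of modifying it in place. This is only a sketch on a toy column; the function name `encode_ordinal` and the sample values are assumptions, not part of the original kernel.

```python
import numpy as np
import pandas as pd

def encode_ordinal(series, mapping, missing_value=0):
    """Map category labels to integers; NaN or unmapped labels become missing_value."""
    return series.map(mapping).fillna(missing_value).astype(int)

qual_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Toy column (hypothetical values): NaN stands for "no basement"
toy_qual = pd.Series(['Gd', 'TA', np.nan, 'Ex', 'Fa'])
encoded = encode_ordinal(toy_qual, qual_map)
print(encoded.tolist())
```

Note one difference from `replace`: `Series.map` turns any label that is missing from the dictionary into `NaN` (and therefore into 0 here), whereas `replace` leaves such labels unchanged. Which behavior is preferable depends on whether unexpected labels should be surfaced or silently zeroed.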

Let's visualize them.

In [None]:
create_boxen_count_2x3('BsmtQual', 'BsmtCond', 'BsmtExposure', (16,10))

**Observations**:
- Houses with higher **BsmtQual** have higher **Log_SalePrice**. There seems to be a strong linear relationship between **BsmtQual** and **Log_SalePrice**.

### 6.2. BsmtFinType1 & BsmtFinType2
**BsmtFinType1** and **BsmtFinType2** are ordinal categorical features. We will transform them into numerical features to make it easier to identify trends/relationships during visualization.

In [None]:
BsmtFinType_map = {'Unf':0, 'LwQ':1, 'Rec':2, 'BLQ':3, 'ALQ':4, 'GLQ':5}

df_train['BsmtFinType1'].replace(BsmtFinType_map, inplace=True)
df_test['BsmtFinType1'].replace(BsmtFinType_map, inplace=True)

df_train['BsmtFinType2'].replace(BsmtFinType_map, inplace=True)
df_test['BsmtFinType2'].replace(BsmtFinType_map, inplace=True)

Let's visualize **BsmtFinType1** and **BsmtFinType2**.

In [None]:
create_boxen_count_2x2('BsmtFinType1', 'BsmtFinType2', (14,10))

**Observations**:
- There seems to be no relationship between **BsmtFinType1** and **Log_SalePrice**. Similarly, there seems to be no relationship between **BsmtFinType2** and **Log_SalePrice**. 

### 6.3. BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, & TotalBsmtSF
Let's visualize **BsmtFinSF1**, **BsmtFinSF2**, **BsmtUnfSF**, and **TotalBsmtSF**. 

In [None]:
plt.figure(figsize=(20,15))

g = sns.pairplot(df_train[['BsmtFinSF1', 'BsmtFinSF2', 
                           'BsmtUnfSF', 'TotalBsmtSF', 'Log_SalePrice']], 
             palette='Accent',
             kind='scatter',
             diag_kind='auto',
             height=3)

plt.show()

**Observations**:
- There seems to be a linear relationship between **TotalBsmtSF** and **Log_SalePrice**. Similarly, there seems to be a linear relationship between **BsmtFinSF1** and **Log_SalePrice**.

### 6.4. Correlation among basement-related features
Let's visualize the correlation among basement-related features.

In [None]:
Bsmt_feats = df_train[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                       'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 
                       'BsmtUnfSF', 'TotalBsmtSF', 'Log_SalePrice']].dropna()

In [None]:
plt.figure(figsize=(10,8))

sns.heatmap(Bsmt_feats.corr(), annot=True, cmap='RdBu')

plt.show()

**Observations**:
- There is a strong correlation between **TotalBsmtSF** and **Log_SalePrice**, and between **BsmtQual** and **Log_SalePrice**.

## 7. Utilities-related features
Utilities-related features include:
- [**Heating**](#section 7.1.): type of heating
- [**HeatingQC**](#section 7.1.): heating quality and condition
- [**Utilities**](#section 7.2.): type of utilities available
- [**CentralAir**](#section 7.2.): central air conditioning
- [**Electrical**](#section 7.2.): electrical system

### 7.1. Heating & HeatingQC
**HeatingQC** is an ordinal categorical feature. We will transform it into a numerical feature to make it easier to identify trends/relationships during visualization. For **Heating**, we will transform the value into 1 if the type of heating is gas (*GasA* or *GasW*), and into 0 for other types of heating. 

In [None]:
HeatingQC_map = {'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}
df_train['HeatingQC'].replace(HeatingQC_map, inplace=True)
df_test['HeatingQC'].replace(HeatingQC_map, inplace=True)

Heating_map = {'GasA':1, 'GasW':1, 'Grav':0, 'Wall':0, 'OthW':0, 'Floor':0}
df_train['Heating'].replace(Heating_map, inplace=True)
df_test['Heating'].replace(Heating_map, inplace=True)

Let's visualize **Heating** and **HeatingQC**. 

In [None]:
create_boxen_count_2x2('Heating', 'HeatingQC', (14,10))

**Observations**:
- Houses with gas heating (**Heating** = 1) generally have higher **Log_SalePrice** than houses with other types of heating (**Heating** = 0). 
- There seems to be a weak linear relationship between **HeatingQC** and **Log_SalePrice**. 

### 7.2. Utilities, CentralAir, & Electrical
Before visualization, we transform the values of **Electrical** so that houses with standard circuit breakers and Romex (*SBrkr*) have a value of 1. Houses with other electrical systems are given a value of 0. 

In [None]:
Electrical_map = {'SBrkr':1, 'FuseA':0, 'FuseF':0, 'FuseP':0, 'Mix':0}
df_train['Electrical'].replace(Electrical_map, inplace=True)
df_test['Electrical'].replace(Electrical_map, inplace=True)

Let's visualize **Utilities**, **CentralAir**, and **Electrical**.

In [None]:
create_boxen_count_2x3('Utilities', 'CentralAir', 'Electrical', (16,10))

**Observations**:
- Houses with central air conditioning generally have higher **Log_SalePrice** than houses with no central air conditioning.
- Houses with standard circuit breakers and Romex (**Electrical** = 1) generally have higher **Log_SalePrice** than houses with other electrical systems (**Electrical** = 0).
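Because a 0/1 coding is easy to misread on a plot, the direction of the effect can be checked numerically by comparing the median target value per code. The snippet below is a minimal sketch on invented numbers; the toy values and the column name `Electrical_bin` are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Ames dataset)
toy = pd.DataFrame({
    'Electrical': ['SBrkr', 'FuseA', 'SBrkr', 'FuseF', 'SBrkr', 'Mix'],
    'Log_SalePrice': [12.1, 11.6, 12.3, 11.4, 12.0, 11.5],
})

# Same coding as Electrical_map: SBrkr -> 1, everything else -> 0
toy['Electrical_bin'] = (toy['Electrical'] == 'SBrkr').astype(int)

# Median target per code shows which group sits higher
medians = toy.groupby('Electrical_bin')['Log_SalePrice'].median()
print(medians)
```

Running the same `groupby` on `df_train` would give the actual medians behind the boxen plots.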

### 7.3. Correlation among utilities-related features
Let's visualize the correlation among utilities-related features.

In [None]:
utilities_feats = df_train[['Heating', 'HeatingQC', 'Utilities', 
                            'CentralAir', 'Electrical', 'Log_SalePrice']]

In [None]:
plt.figure(figsize=(10,8))

# Keep only numeric columns; any column still stored as text is skipped
sns.heatmap(utilities_feats.select_dtypes('number').corr(), annot=True, cmap='RdBu')

plt.show()

**Observations**:
- There is a weak positive correlation between **HeatingQC** and **Log_SalePrice**, and between **Electrical** and **Log_SalePrice**.

## 8. Living-area-related features
Living-area-related features include:
- [**1stFlrSF**](#section 8.0.): first floor square feet
- [**2ndFlrSF**](#section 8.0.): second floor square feet
- [**LowQualFinSF**](#section 8.0.): low quality finished square feet (all floors)
- [**GrLivArea**](#section 8.0.): above grade (ground) living area square feet

Let's visualize the univariate distribution of these features. 

In [None]:
plt.figure(figsize=(13,10))

plt.subplot(221)
sns.distplot(df_train['1stFlrSF'], bins=100, color='g')

plt.subplot(222)
sns.distplot(df_train['2ndFlrSF'], bins=100, color='r')

plt.subplot(223)
sns.distplot(df_train['LowQualFinSF'], bins=100, color='b', kde=False)

plt.subplot(224)
sns.distplot(df_train['GrLivArea'], bins=100, color='g')

plt.show()

**Observations**:
- Most of the houses have 0 **LowQualFinSF** and **2ndFlrSF**.
- The distributions of **1stFlrSF** and **GrLivArea** roughly resemble a normal distribution. 

Let's check the skewness and kurtosis of **1stFlrSF** and **GrLivArea**.

In [None]:
# Calculate the skewness and kurtosis of 1stFlrSF and GrLivArea distribution
print('Skewness of the distribution of 1stFlrSF: {}'.format(skew(df_train['1stFlrSF'].dropna())))
print('Kurtosis of the distribution of 1stFlrSF: {}'.format(kurtosis(df_train['1stFlrSF'].dropna())))
print('')
print('Skewness of the distribution of GrLivArea: {}'.format(skew(df_train['GrLivArea'].dropna())))
print('Kurtosis of the distribution of GrLivArea: {}'.format(kurtosis(df_train['GrLivArea'].dropna())))

We apply a log transformation to bring both distributions closer to a normal distribution.

In [None]:
df_train['1stFlrSF'] = np.log(df_train['1stFlrSF'])
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])

df_test['1stFlrSF'] = np.log(df_test['1stFlrSF'])
df_test['GrLivArea'] = np.log(df_test['GrLivArea'])

Let's check the skewness and kurtosis after transformation.

In [None]:
# Calculate the skewness and kurtosis of 1stFlrSF and GrLivArea distribution
print('Skewness of the distribution of 1stFlrSF: {}'.format(skew(df_train['1stFlrSF'].dropna())))
print('Kurtosis of the distribution of 1stFlrSF: {}'.format(kurtosis(df_train['1stFlrSF'].dropna())))
print('')
print('Skewness of the distribution of GrLivArea: {}'.format(skew(df_train['GrLivArea'].dropna())))
print('Kurtosis of the distribution of GrLivArea: {}'.format(kurtosis(df_train['GrLivArea'].dropna())))

Now, the distributions more closely resemble a normal distribution.
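The effect of the log transformation on skewness can be illustrated on synthetic data. The sketch below assumes a log-normal sample (not the Ames data) and uses pandas' `Series.skew()`, which is a bias-corrected estimator, so its values can differ slightly from those of `scipy.stats.skew` with default settings.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (assumption: log-normal, not the Ames data)
rng = np.random.default_rng(0)
area = pd.Series(rng.lognormal(mean=7.0, sigma=0.4, size=5000))

skew_before = area.skew()
skew_after = np.log(area).skew()

print('Skewness before log transform: {:.3f}'.format(skew_before))
print('Skewness after log transform:  {:.3f}'.format(skew_after))
```

A log-normal variable becomes exactly normal after taking logs, so the skewness here should drop to roughly zero; real features such as **GrLivArea** will typically improve less cleanly.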

Let's check the relationship of **1stFlrSF** and **GrLivArea** with **Log_SalePrice** by using scatter plots.

In [None]:
plt.figure(figsize=(10,9))

ax1 = plt.subplot(121)
sns.regplot(x='1stFlrSF', y='Log_SalePrice', data=df_train, color='lightcoral')

ax2 = plt.subplot(122, sharey=ax1, sharex=ax1)
sns.regplot(x='GrLivArea', y='Log_SalePrice', data=df_train, color='darkseagreen')
plt.setp(ax2.get_yticklabels(), visible=False)
plt.ylabel('')

# Adjusting the spaces between graphs
plt.subplots_adjust(wspace = 0)

plt.show()

**Observations**:
- There seems to be a strong linear relationship between **1stFlrSF** and **Log_SalePrice**, and between **GrLivArea** and **Log_SalePrice**.

## 9. Bathroom- and bedroom-related features
Bathroom- and bedroom-related features include:
- [**BsmtFullBath**](#section 9.1.): basement full bathrooms
- [**BsmtHalfBath**](#section 9.1.): basement half bathrooms
- [**FullBath**](#section 9.1.): full bathrooms above grade
- [**HalfBath**](#section 9.1.): half baths above grade
- [**BedroomAbvGr**](#section 9.3.): bedrooms above grade (does NOT include basement bedrooms)

### 9.1. BsmtFullBath, BsmtHalfBath, FullBath, & HalfBath

Let's visualize **BsmtFullBath**, **BsmtHalfBath**, **FullBath**, and **HalfBath**. 

In [None]:
create_boxen_count_2x4('BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', (20,12))

**Observations**:
- There is no clear relationship between bathroom-related features and **Log_SalePrice**.

### 9.2. BsmtFull+HalfBath, Full+HalfBath, & TotalBath
We create additional features from bathroom-related features as follows:
- **BsmtFull+HalfBath** is the weighted sum of **BsmtFullBath** and **BsmtHalfBath**, where each half bathroom counts as 0.5.
- **Full+HalfBath** is the weighted sum of **FullBath** and **HalfBath**, where each half bathroom counts as 0.5.
- **TotalBath** is the sum of **BsmtFull+HalfBath** and **Full+HalfBath**.

In [None]:
# Create 'BsmtFull+HalfBath' feature
df_train['BsmtFull+HalfBath'] = df_train['BsmtFullBath'] + (df_train['BsmtHalfBath']*0.5)
df_test['BsmtFull+HalfBath'] = df_test['BsmtFullBath'] + (df_test['BsmtHalfBath']*0.5)

# Create 'Full+HalfBath' feature
df_train['Full+HalfBath'] = df_train['FullBath'] + (df_train['HalfBath']*0.5)
df_test['Full+HalfBath'] = df_test['FullBath'] + (df_test['HalfBath']*0.5)

# Create 'TotalBath' feature
df_train['TotalBath'] = df_train['Full+HalfBath'] + df_train['BsmtFull+HalfBath']
df_test['TotalBath'] = df_test['Full+HalfBath'] + df_test['BsmtFull+HalfBath']
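As a quick worked example of the weighting, here are the same formulas applied to three made-up houses (the toy values are assumptions for illustration):

```python
import pandas as pd

# Toy rows (hypothetical houses)
toy = pd.DataFrame({
    'BsmtFullBath': [1, 0, 1],
    'BsmtHalfBath': [0, 1, 1],
    'FullBath':     [2, 1, 2],
    'HalfBath':     [1, 0, 1],
})

# Each half bathroom contributes 0.5
toy['BsmtFull+HalfBath'] = toy['BsmtFullBath'] + 0.5 * toy['BsmtHalfBath']
toy['Full+HalfBath'] = toy['FullBath'] + 0.5 * toy['HalfBath']
toy['TotalBath'] = toy['BsmtFull+HalfBath'] + toy['Full+HalfBath']

print(toy[['BsmtFull+HalfBath', 'Full+HalfBath', 'TotalBath']])
```

The first house, for instance, has one full basement bathroom, two full bathrooms, and one half bathroom above grade, giving 1 + 2 + 0.5 = 3.5.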

Let's visualize these features.

In [None]:
plt.figure(figsize=(16,6))

ax1 = plt.subplot(131)
sns.boxenplot(x='BsmtFull+HalfBath', y='Log_SalePrice', data=df_train, color='dimgrey')

ax2 = plt.subplot(132, sharey=ax1)
sns.boxenplot(x='Full+HalfBath', y='Log_SalePrice', data=df_train, color='lightcoral')
plt.setp(ax2.get_yticklabels(), visible=False)
plt.ylabel('')

ax3 = plt.subplot(133, sharey=ax1)
sns.boxenplot(x='TotalBath', y='Log_SalePrice', data=df_train, color='darkseagreen')
plt.setp(ax3.get_yticklabels(), visible=False)
plt.ylabel('')


# Adjusting the spaces between graphs
plt.subplots_adjust(wspace = 0)

plt.show()

**Observations**:
- There seems to be a linear relationship between **Full+HalfBath** and **Log_SalePrice**, and between **TotalBath** and **Log_SalePrice**. 

### 9.3. BedroomAbvGr
Let's visualize **BedroomAbvGr**.

In [None]:
create_boxen_count_2x1('BedroomAbvGr', (8,8))

**Observations**:
- There seems to be no relationship between **BedroomAbvGr** and **Log_SalePrice**. 

### 9.4. Correlation among bathroom- and bedroom-related features
Let's visualize the correlation between bathroom- and bedroom-related features and **Log_SalePrice**. 

In [None]:
# Include the target so its correlation with each feature appears in the heatmap
bathbed_feats = df_train[['BsmtFull+HalfBath', 'Full+HalfBath', 'TotalBath', 
                          'BedroomAbvGr', 'Log_SalePrice']]

In [None]:
plt.figure(figsize=(10,8))

sns.heatmap(bathbed_feats.corr(), annot=True, cmap='RdBu')

plt.show()

**Observations**:
- There is a weak positive linear correlation between **Full+HalfBath** and **Log_SalePrice**, and between **TotalBath** and **Log_SalePrice**. Additionally, there is multicollinearity between **Full+HalfBath** and **TotalBath**.

## 10. Kitchen-related features
Kitchen-related features include:
- [**KitchenAbvGr**](#section 10.0.): kitchens above grade
- [**KitchenQual**](#section 10.0.): kitchen quality

**KitchenQual** is an ordinal categorical feature. We will transform it into a numerical feature to make it easier to identify trends/relationships during visualization.

In [None]:
KitchenQual_map = {'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}
df_train['KitchenQual'].replace(KitchenQual_map, inplace=True)
df_test['KitchenQual'].replace(KitchenQual_map, inplace=True)

Let's visualize **KitchenAbvGr** and **KitchenQual**.

In [None]:
create_boxen_count_2x2('KitchenAbvGr', 'KitchenQual', (14,10))

**Observations**:
- There seems to be a linear relationship between **KitchenQual** and **Log_SalePrice**, but no relationship between **KitchenAbvGr** and **Log_SalePrice**.

## 11. Fireplace-related features
Fireplace-related features include:
- [**Fireplaces**](#section 11.0.): number of fireplaces
- [**FireplaceQu**](#section 11.0.): fireplace quality

**FireplaceQu** is an ordinal categorical feature. We will transform it into a numerical feature to make it easier to identify trends/relationships during visualization.

In [None]:
FireplaceQu_map = {'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}
df_train['FireplaceQu'].replace(FireplaceQu_map, inplace=True)
df_test['FireplaceQu'].replace(FireplaceQu_map, inplace=True)

We also need to replace `NaN` values, which indicate `No Fireplace`, with 0.

In [None]:
df_train['FireplaceQu'].fillna(0, inplace=True)
df_test['FireplaceQu'].fillna(0, inplace=True)

Let's visualize **Fireplaces** and **FireplaceQu**.

In [None]:
create_boxen_count_2x2('Fireplaces', 'FireplaceQu', (14,10))

**Observations**:
- There seems to be a linear relationship between **Fireplaces** and **Log_SalePrice**, and between **FireplaceQu** and **Log_SalePrice**.

## 12. Garage-related features
Garage-related features include:
- [**GarageCars**](#section 12.1.): size of garage in car capacity
- [**GarageQual**](#section 12.1.): garage quality
- [**GarageCond**](#section 12.1.): garage condition
- [**GarageType**](#section 12.1.): garage location
- [**GarageArea**](#section 12.2.): size of garage in square feet
- [**GarageYrBlt**](#section 12.3.): year garage was built

### 12.1. GarageCars, GarageQual, GarageCond, & GarageType
**GarageQual** and **GarageCond** are ordinal categorical features. We will transform them into numerical features to make it easier to identify trends/relationships during visualization.

In [None]:
GarageQual_map = {'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}

df_train['GarageQual'].replace(GarageQual_map, inplace=True)
df_test['GarageQual'].replace(GarageQual_map, inplace=True)

df_train['GarageCond'].replace(GarageQual_map, inplace=True)
df_test['GarageCond'].replace(GarageQual_map, inplace=True)

We also need to replace `NaN` values, which indicate `No Garage`, with 0.

In [None]:
df_train['GarageQual'].fillna(0, inplace=True)
df_test['GarageQual'].fillna(0, inplace=True)

df_train['GarageCond'].fillna(0, inplace=True)
df_test['GarageCond'].fillna(0, inplace=True)

Let's visualize **GarageCars**, **GarageQual**, **GarageCond**, and **GarageType**. 

In [None]:
create_boxen_count_2x4('GarageCars', 'GarageQual', 'GarageCond', 'GarageType', (20,12))

**Observations**:
- There seems to be a linear relationship between **GarageCars** and **Log_SalePrice**, and between **GarageQual** and **Log_SalePrice**.
- There seems to be no relationship between **GarageCond** and **Log_SalePrice**, or between **GarageType** and **Log_SalePrice**.

### 12.2. GarageArea
Let's visualize **GarageArea**.

In [None]:
plt.figure()

sns.jointplot(x='GarageArea', 
              y='Log_SalePrice', 
              data=df_train, 
              kind='reg', 
              height=9,
              color='seagreen')

plt.show()

**Observations**:
- There seems to be a positive linear correlation between **GarageArea** and **Log_SalePrice**. 

### 12.3. GarageYrBlt

Let's visualize **GarageYrBlt**.

In [None]:
plt.figure(figsize=(24,6))

# Create the top axis first so that the bottom axis can share its x-axis
ax1 = plt.subplot(211)
sns.countplot(x='GarageYrBlt', data=df_train, color='dimgrey')
ax1.tick_params(labelbottom=False)  # hide x tick labels on the top plot
plt.xlabel('')

ax2 = plt.subplot(212, sharex=ax1)
sns.boxenplot(x='GarageYrBlt', y='Log_SalePrice', data=df_train, color='lightcoral')
plt.xticks(rotation='vertical')

# Adjusting the spaces between graphs
plt.subplots_adjust(hspace=0)

plt.show()

### 12.4. Correlation among garage-related features
Let's visualize the correlation between garage-related features and **Log_SalePrice**.

In [None]:
garage_feats = df_train[['GarageCars', 'GarageQual', 'GarageCond', 
                         'GarageType', 'GarageArea', 'GarageYrBlt', 'Log_SalePrice']]

In [None]:
plt.figure(figsize=(10,8))

# Keep only numeric columns; any column still stored as text is skipped
sns.heatmap(garage_feats.select_dtypes('number').corr(), annot=True, cmap='RdBu')

plt.show()

**Observations**:
- There is a strong positive correlation between **GarageCars** and **Log_SalePrice**, and between **GarageArea** and **Log_SalePrice**. However, there is multicollinearity between **GarageArea** and **GarageCars**.
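One common way to quantify multicollinearity, which the analysis above does not use, is the variance inflation factor (VIF). For standardized features, the VIFs equal the diagonal of the inverse correlation matrix. The sketch below applies this to synthetic data; the helper name `vif_from_corr` and the generated columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def vif_from_corr(df):
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns)

# Synthetic data (assumption): 'area' is roughly proportional to 'cars'
rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=500).astype(float)
area = 250.0 * cars + rng.normal(0.0, 40.0, size=500)
other = rng.normal(0.0, 1.0, size=500)

toy = pd.DataFrame({'cars': cars, 'area': area, 'other': other})
vifs = vif_from_corr(toy)
print(vifs.round(2))
```

A VIF close to 1 means a feature is nearly uncorrelated with the others; values well above roughly 5 to 10 are often treated as a warning sign, although that cutoff is a rule of thumb rather than a fixed standard.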

## 13. Deck-/porch-related features 

Deck-/porch-related features consist of:
- **WoodDeckSF**: wood deck area in square feet.
- **OpenPorchSF**: open porch area in square feet.
- **EnclosedPorch**: enclosed porch area in square feet.
- **3SsnPorch**: three season porch area in square feet.
- **ScreenPorch**: screen porch area in square feet.

Let's visualize them using a pairplot.

In [None]:
plt.figure(figsize=(20,15))

g = sns.pairplot(df_train[['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 
                           '3SsnPorch', 'ScreenPorch', 'Log_SalePrice']], 
             palette='Accent',
             kind='scatter',
             diag_kind='auto',
             height=3)

plt.show()

Let's visualize the correlation between deck-/porch-related features and **Log_SalePrice**.

In [None]:
deck_feats = df_train[['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 
                        '3SsnPorch', 'ScreenPorch', 'Log_SalePrice']]

In [None]:
plt.figure(figsize=(10,8))

sns.heatmap(deck_feats.corr(), annot=True, cmap='RdBu')

plt.show()

**Observations**:
- There is a weak positive correlation between **WoodDeckSF** and **Log_SalePrice**, and between **OpenPorchSF** and **Log_SalePrice**.

## 14. Pool-related features
Pool-related features consist of:
- [**PoolArea**](#section 14.1.): pool area in square feet
- [**PoolQC**](#section 14.2.): Pool quality

### 14.1. PoolArea
Let's visualize **PoolArea**.

In [None]:
plt.figure()

sns.jointplot(x='PoolArea', 
              y='Log_SalePrice', 
              data=df_train, 
              kind='scatter', 
              height=9,
              color='seagreen')

plt.show()

**Observations**:
- Only a small number of houses have a pool.
- Having a pool does not guarantee that a house has a high price.
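Since most values of **PoolArea** are 0, one option is to summarize the feature as a binary indicator and look at how rare pools are. This is a minimal sketch on invented values; the column name `HasPool` is hypothetical and does not appear in the original kernel.

```python
import pandas as pd

# Toy data (hypothetical values): most houses have PoolArea == 0
toy = pd.DataFrame({'PoolArea': [0, 0, 0, 512, 0, 0, 0, 648, 0, 0]})

# 1 if the house has a pool, 0 otherwise
toy['HasPool'] = (toy['PoolArea'] > 0).astype(int)

pool_share = toy['HasPool'].mean()
print('Share of houses with a pool: {:.1%}'.format(pool_share))
```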

### 14.2. PoolQC
**PoolQC** is an ordinal categorical feature. We will transform it into a numerical feature to make it easier to identify trends/relationships during visualization.

In [None]:
PoolQC_map = {'NA':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}
df_train['PoolQC'].replace(PoolQC_map, inplace=True)
df_test['PoolQC'].replace(PoolQC_map, inplace=True)

We also need to fill `NaN` values, which indicate `No Pool`, with 0.

In [None]:
df_train['PoolQC'].fillna(0, inplace=True)
df_test['PoolQC'].fillna(0, inplace=True)

Let's visualize **PoolQC**.

In [None]:
create_boxen_count_2x1('PoolQC', (8,8))

**Observations**:
- Only a small number of houses have a pool.
- Having a pool does not guarantee that a house has a high price.

## 15. Miscellaneous features
Miscellaneous features consist of:
- [**Fence**](#section 15.1.): fence quality.
- [**MiscFeature**](#section 15.2.): miscellaneous feature not covered in other categories.
- [**MiscVal**](#section 15.3.): value of miscellaneous feature.
- [**MoSold**](#section 15.4.): month Sold (MM).
- [**SaleType**](#section 15.5.): type of sale.
- [**SaleCondition**](#section 15.6.): condition of sale.

### 15.1. Fence
**Fence** is an ordinal categorical feature. We will transform it into a numerical feature to make it easier to identify trends/relationships during visualization.

In [None]:
Fence_map = {'MnWw':1, 'GdWo':2, 'MnPrv':3, 'GdPrv':4}
df_train['Fence'].replace(Fence_map, inplace=True)
df_test['Fence'].replace(Fence_map, inplace=True)

We also need to fill `NaN` values, which indicate `No Fence`, with 0. 

In [None]:
df_train['Fence'].fillna(0, inplace=True)
df_test['Fence'].fillna(0, inplace=True)

Let's visualize **Fence**.

In [None]:
create_boxen_count_2x1('Fence', (8,8))

**Observations**:
- It seems that there is no relationship between **Fence** and **Log_SalePrice**.

### 15.2. MiscFeature
We need to fill `NaN` values, which indicate that the house has no miscellaneous feature, with the string `'None'`.

In [None]:
df_train['MiscFeature'].fillna('None', inplace=True)
df_test['MiscFeature'].fillna('None', inplace=True)

Let's visualize **MiscFeature**.

In [None]:
create_boxen_count_2x1('MiscFeature', (8,8))

### 15.3. MiscVal
Let's visualize **MiscVal**.

In [None]:
plt.figure()

sns.jointplot(x='MiscVal', 
              y='Log_SalePrice', 
              data=df_train, 
              kind='scatter', 
              height=9,
              color='seagreen')

plt.show()

### 15.4. MoSold
Let's visualize **MoSold**.

In [None]:
create_boxen_count_2x1('MoSold', (8,8))

### 15.5. SaleType
Let's visualize **SaleType**.

In [None]:
create_boxen_count_2x1('SaleType', (8,8))

### 15.6. SaleCondition
Let's visualize **SaleCondition**.

In [None]:
create_boxen_count_2x1('SaleCondition', (8,8))

## 16. Summary of exploratory data analysis
After the analysis above, we have a sense of which features are important predictors of house prices. Note that some of the features have a fairly high correlation with the target feature; these are likely to be the most useful predictors. 
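One simple way to turn this summary into a ranked list is to sort the numeric features by the absolute value of their Pearson correlation with the target. The sketch below does this on a toy frame; the helper name `rank_by_target_corr` and all values are assumptions for illustration, and linear correlation is only one of several possible importance measures.

```python
import pandas as pd

def rank_by_target_corr(df, target):
    """Sort numeric features by absolute Pearson correlation with the target."""
    corr = df.select_dtypes('number').corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'quality': [3, 4, 5, 6, 7, 8],
    'area':    [900, 1500, 1100, 1900, 1600, 2400],
    'month':   [6, 1, 9, 3, 12, 5],
    'target':  [11.2, 11.6, 11.9, 12.1, 12.4, 12.9],
})
ranking = rank_by_target_corr(toy, 'target')
print(ranking)
```

On the real data, the call would be `rank_by_target_corr(df_train, 'Log_SalePrice')`, assuming the ordinal features have already been converted to numbers as in the sections above.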