# <u>House Prices: Comprehensive EDA & Visualization</u>

This notebook shows how to quickly visualize every feature of the House Prices dataset as a first step before modeling.  

* [1. Preparations](#1)
    * [1.1 Import libraries](#1.1)
    * [1.2 Load dataset](#1.2)
    * [1.3 Split numeric and categorical data](#1.3)
* [2. SalePrice: Objective variable](#2)
    * [2.1 Original vs Log-transformation](#2.1)
* [3. Overall Features](#3)
    * [3.1 Data Type Ratio](#3.1)
    * [3.2 Missing Values](#3.2)
    * [3.3 Feature importances](#3.3)
* [4. Numeric Features](#4)
    * [4.1 Correlation Heatmap](#4.1)
    * [4.2 Correlation coefficient](#4.2)
    * [4.3 Feature importances](#4.3)
    * [4.4 Histogram of numeric features](#4.4)
    * [4.5 Relation of numeric features to target](#4.5)
    * [4.6 Skewness: Original vs Log-transformation](#4.6)
* [5. Categorical Features](#5)
    * [5.1 Feature importances](#5.1)
    * [5.2 Relation of features to target](#5.2)

<a id="1"></a><h1 style='background:blue; border:.; color:white'><center>1. Preparations</center></h1>

## 1.1 Import libraries<a id="1.1"></a>
**Import all required libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', palette='rainbow')
from scipy.stats import skew, norm
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from scipy import stats

## 1.2 Load dataset<a id="1.2"></a>
**Load the training data as a pandas DataFrame**

In [None]:
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
train.head()

## 1.3 Split numeric and categorical data<a id="1.3"></a>
**Separate numeric and categorical columns for further processing**

In [None]:
# extract numeric data
num_cols = train.loc[:,train.dtypes != 'object'].drop(['Id'], axis=1).columns
num_train = train[num_cols]
# extract categorical data
cat_cols = train.loc[:,train.dtypes == 'object'].columns
cat_train = train[cat_cols]

<a id="2"></a><h1 style='background:blue; border:.; color:white'><center>2. SalePrice: Objective variable</center></h1>

## 2.1 Original vs Log-transformation<a id="2.1"></a>

In [None]:
fig, ax = plt.subplots(2,3, figsize=(14,8))
##### Original
# histogram
sns.distplot(train['SalePrice'] , fit=norm, ax=ax[0,0])
mu, sigma = norm.fit(train['SalePrice'])
# raw f-string so that \mu and \sigma are not treated as escape sequences
ax[0,0].legend([rf'Normal dist. ($\mu=${mu:.2f}, $\sigma=${sigma:.2f})'], loc='best')
ax[0,0].set_ylabel('Frequency')
ax[0,0].set_title('Distribution(Original)')
# Q-Q plot
_ = stats.probplot(train['SalePrice'], plot=ax[0,1])
ax[0,1].set_title('Q-Q plot(Original)')
# plot boxplot
sns.boxplot(y=train['SalePrice'], ax=ax[0,2])
ax[0,2].set_title('Boxplot(Original)')

##### Log-transformation
logged = np.log1p(train["SalePrice"])
# histogram
sns.distplot(logged , fit=norm, ax=ax[1,0])
mu, sigma = norm.fit(logged)
# raw f-string so that \mu and \sigma are not treated as escape sequences
ax[1,0].legend([rf'Normal dist. ($\mu=${mu:.2f}, $\sigma=${sigma:.2f})'], loc='best')
ax[1,0].set_ylabel('Frequency')
ax[1,0].set_title('Distribution(Log-transformation)')
# Q-Q plot
_ = stats.probplot(logged, plot=ax[1,1])
ax[1,1].set_title('Q-Q plot(Log-transformation)')
# plot boxplot
sns.boxplot(y=logged, ax=ax[1,2])
ax[1,2].set_title('Boxplot(Log-transformation)')

fig.tight_layout()
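
The visual comparison above can be backed by a numeric check. The following is a minimal sketch that assumes a synthetic, right-skewed sample (`prices`, a hypothetical stand-in for `train['SalePrice']`, not the real data) and compares the sample skewness before and after `np.log1p`:

```python
import numpy as np
import pandas as pd

# Synthetic log-normal sample standing in for train['SalePrice'] (an assumption)
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5000)), name='SalePrice')

skew_original = prices.skew()          # clearly positive for right-skewed data
skew_logged = np.log1p(prices).skew()  # much closer to zero after the transform

print(f'skewness (original): {skew_original:.3f}')
print(f'skewness (log1p):    {skew_logged:.3f}')
```

On the real data, replacing `prices` with `train['SalePrice']` should give the same kind of comparison.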

<a id="3"></a><h1 style='background:blue; border:.; color:white'><center>3. Overall Features</center></h1>

## 3.1 Data Type Ratio<a id="3.1"></a>
**Display the data types ratio in a pie chart**

In [None]:
dtype, count = np.unique(train.dtypes.values, return_counts=True)
df = pd.DataFrame(data={'count': count}, index=dtype)
ax = df.plot.pie(y='count', autopct="%1.1f%%", figsize=(6,6), legend=False)
ax.set_ylabel('')
ax.set_title('Data Type Ratio', fontsize=18);

## 3.2 Missing Values<a id="3.2"></a>
**Show the percentage of missing values for each feature that has any**

In [None]:
# count missing values (extract only over 0)
mv = train.isnull().sum().sort_values(ascending=False)
mv = mv[mv.values > 0]
# convert to percentage
mv = mv / train.shape[0] * 100

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.barplot(x=mv.values, y=mv.index)
ax.set_xlabel('(%)')
ax.set_ylabel('feature')
ax.set_title('Percentage of missing values', fontsize=16);
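
To make the calculation concrete, here is a small sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration. It uses `isnull().mean()` as a shortcut for "count divided by number of rows", which should give the same percentages as the cell above:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names echo the dataset
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'Alley': [np.nan, np.nan, np.nan, 'Grvl'],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Fraction of missing rows per column, expressed as a percentage
mv_toy = toy.isnull().mean().mul(100)
# Keep only columns that have missing values, largest share first
mv_toy = mv_toy[mv_toy > 0].sort_values(ascending=False)
print(mv_toy)
```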

## 3.3 Feature importances<a id="3.3"></a>
**Show the feature importances in a bar chart, with numeric and categorical features color-coded**

In [None]:
# fill nan with "NULL"
tmp_df = train.copy()
tmp_df[cat_cols] = tmp_df[cat_cols].fillna('NULL')
# label encoding
for col in cat_cols:
    le = LabelEncoder()
    le.fit(tmp_df[col])
    tmp_df[col] = le.transform(tmp_df[col])
# train data
X_train = tmp_df.drop(['SalePrice', 'Id'], axis=1)
y_train = tmp_df['SalePrice']
lgb_train = lgb.Dataset(X_train, y_train)
params = {'objective': 'regression', 'metric': 'rmse'}
gbm = lgb.train(params, lgb_train)
# create DataFrame
feat_importances = pd.DataFrame({'importance': gbm.feature_importance()}, index=X_train.columns).sort_values('importance', ascending=False)
# label each feature by dtype, following the sorted index so that labels stay aligned with their rows
feat_importances['dtype'] = ['numeric' if feat in num_cols else 'categorical' for feat in feat_importances.index]
feat_importances.head()
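
The encoding step in the cell above can be illustrated on a toy column. This is a minimal sketch that uses pandas category codes rather than `LabelEncoder`, and the `Street` values are made up for illustration. Both approaches assign integer codes following the sorted order of the unique strings, so for string data the codes should agree, but treat that equivalence as an assumption to verify on your own data.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value (illustrative values only)
street = pd.Series(['Pave', np.nan, 'Grvl', 'Pave'], name='Street')

# Treat missing values as their own category, as in the cell above
filled = street.fillna('NULL')

# Categories are sorted alphabetically: Grvl -> 0, NULL -> 1, Pave -> 2
codes = filled.astype('category').cat.codes
print(codes.tolist())
```

The integer codes carry no ordinal meaning; tree-based models such as LightGBM can still split on them, which is why this simple encoding is usually enough for a quick importance check.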

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
bars = ax.bar(feat_importances.index, feat_importances['importance'])
leg1 = ax.legend(['categorical'], loc=(0.9, 0.9))
fig.gca().add_artist(leg1)
# set labels and title
ax.set_xlabel('features')
ax.set_ylabel('importance')
ax.set_xticklabels(feat_importances.index, rotation=90)
ax.set_title('Feature importances', fontsize=16)
# change color of numeric columns
for i in np.where(feat_importances['dtype'].values=='numeric')[0]:
    bars[i].set_color('red')
leg2 = ax.legend(['numeric'], loc=(0.9, 0.85));

<a id="4"></a><h1 style='background:blue; border:.; color:white'><center>4. Numeric Features</center></h1>

## 4.1 Correlation Heatmap<a id="4.1"></a>
**Show the strength of correlation between numeric features**

In [None]:
plt.figure(figsize=(10, 9))
# use the numeric columns only; DataFrame.corr() raises on object columns in recent pandas
sns.heatmap(num_train.corr(), cmap='YlGnBu');

## 4.2 Correlation coefficient<a id="4.2"></a>
**Show the correlation coefficient with SalePrice**

In [None]:
ax = num_train.corr()['SalePrice'].sort_values()[:-1].plot(kind='barh', figsize=(16,10))
ax.set_title('Correlation coefficient with SalePrice', fontsize=16)
ax.set_xlabel('correlation coefficient')
ax.set_ylabel('features');

## 4.3 Feature importances<a id="4.3"></a>
**Show the importances of numeric features**

In [None]:
# plot importance of numeric features
num_feat_importances = feat_importances.loc[num_cols[:-1]].sort_values('importance', ascending=False)
ax = num_feat_importances.plot.bar(figsize=(16,5))
ax.set_title('Importance of Numeric Features', fontsize=16)
ax.set_xlabel('features')
ax.set_ylabel('count');

## 4.4 Histogram of numeric features<a id="4.4"></a>
**Show a histogram of each numeric feature**

In [None]:
num_train.hist(figsize=(18, 21), bins=30, xlabelsize=8, ylabelsize=8);

## 4.5 Relation of numeric features to target<a id="4.5"></a>
**Display the relationship between each numeric feature and "SalePrice" using seaborn regplot, and add the correlation coefficient and feature importance**

In [None]:
cols = num_feat_importances.index
fig, ax = plt.subplots(-(-len(cols)//4), 4, figsize=(14, len(cols)/1.2))

for idx, feat in enumerate(cols):
    # position of this feature in the 4-column grid
    row, col = divmod(idx, 4)
    # show regplot with SalePrice
    sns.regplot(data=num_train, x=feat, y='SalePrice', ax=ax[row][col])

    # show correlation coefficient and feature importance
    corr = num_train['SalePrice'].corr(num_train[feat])
    feat_imp = num_feat_importances.loc[feat, 'importance']
    ax[row][col].set_title(f'corr: {corr:.3f}, importance: {feat_imp}')

    # display scale and label only on the left edge
    if col != 0:
        ax[row][col].set_ylabel('')
        ax[row][col].tick_params(labelleft=False)

fig.tight_layout()
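
The expression `-(-len(cols)//4)` in the cell above is a ceiling division: it yields the number of rows needed to fit all features into a 4-column grid. A minimal sketch of the same arithmetic, using hypothetical helper names `n_rows` and `grid_position` that do not appear in the notebook:

```python
def n_rows(n_items, ncols=4):
    """Rows needed to place n_items in a grid with ncols columns (ceiling division)."""
    return -(-n_items // ncols)


def grid_position(idx, ncols=4):
    """Row and column of the idx-th item when the grid is filled row by row."""
    return divmod(idx, ncols)


print(n_rows(8), n_rows(9))   # an exact fit vs. one extra item
print(grid_position(5))       # sixth item: second row, second column
```

When the number of features is not a multiple of 4, the last row contains unused axes, which is why some grids end with empty panels.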

## 4.6 Skewness: Original vs Log-transformation<a id="4.6"></a>
**Display the skewness of the logarithmic data side by side with the original**

In [None]:
# logarithm of numeric data
log_num = np.log1p(num_train)
# compare skewnesses original with log-transformation
skewness = pd.concat([num_train.apply(lambda x: skew(x.dropna())),
                      log_num.apply(lambda x: skew(x.dropna()))],
                     axis=1).rename(columns={0:'original', 1:'log-transformation'}).sort_values('original')
ax = skewness.plot.barh(figsize=(15,12), title='Comparison of skewness of original and log-transformation', width=0.8)
ax.set_xlabel('skewness');
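
One reason `np.log1p` is a convenient choice here is that it computes `log(1 + x)`, so it stays finite for features that contain zeros, and `np.expm1` inverts it. A short sketch on made-up values:

```python
import numpy as np

# Made-up feature values, including a zero (e.g. a house with no garage area)
x = np.array([0.0, 1.0, 99.0])

logged = np.log1p(x)         # log(1 + x): zero maps to zero instead of -inf
restored = np.expm1(logged)  # exp(y) - 1: inverse of log1p

print(logged)
print(restored)
```

Note that `log1p` is only defined for values greater than -1, so this sketch assumes non-negative features.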

<a id="5"></a><h1 style='background:blue; border:.; color:white'><center>5. Categorical Features</center></h1>

## 5.1 Feature importances<a id="5.1"></a>
**Show the importances of categorical features**

In [None]:
# plot importance of categorical features
cat_feat_importances = feat_importances.loc[cat_cols].sort_values('importance', ascending=False)
ax = cat_feat_importances.plot.bar(figsize=(15,5))
ax.set_title('Importance of Categorical Features', fontsize=16);

## 5.2 Relation of features to target<a id="5.2"></a>
**Display the relationship between each categorical feature and "SalePrice" using seaborn boxplot, and add the feature importance**

In [None]:
cols = cat_feat_importances.index
fig, ax = plt.subplots(-(-len(cols)//4), 4, figsize=(14, len(cols)/1.2))

# combine the categorical features with the target once, outside the loop
cat_with_target = pd.concat([cat_train, train['SalePrice']], axis=1)
for idx, feat in enumerate(cols):
    # position of this feature in the 4-column grid
    row, col = divmod(idx, 4)
    # show boxplot with SalePrice
    sns.boxplot(data=cat_with_target, x=feat, y='SalePrice', ax=ax[row][col])

    # show feature importance
    feat_imp = cat_feat_importances.loc[feat, 'importance']
    ax[row][col].set_title(f'importance: {feat_imp}')

    # display scale and label only on the left edge
    if col != 0:
        ax[row][col].set_ylabel('')
        ax[row][col].tick_params(labelleft=False)

fig.tight_layout()