# Exploratory Data Analysis for Kc_House_Data

This is a tutorial notebook on how to perform EDA for a dataset.

The chosen data for this tutorial is House Sales in King County, USA, available on [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction).

Check the blog post [How to perform EDA for machine learning?](https://mlwithhamza.blogspot.com/2021/07/how-to-perform-eda-for-machine-learning.html) for more information about the EDA method used in this notebook.

In [None]:
# Importing basic libraries for EDA

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import plotly.graph_objects as go

# Importing the os library to get the path for the data file (not used in the EDA).
import os

# The following magic command lets matplotlib display images in the cells outputs.
%matplotlib inline

# Setting seaborn style
sb.set(style="darkgrid")

In [None]:
# Getting the list of entries in the data directory
#  The data file 'kc_house_data.csv' has to be in the folder '../input/housesalesprediction/'
#  for it to appear in the following list

In [None]:
os.listdir('../input/housesalesprediction/')

In [None]:
# Reading the data .csv file
data = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')

## Overall View

In [None]:
# List of the columns names
data.columns

In [None]:
# Checking the head (top 5 rows) of the dataframe and showing all the columns
pd.set_option("display.max_columns", len(data.columns))
data.head()

In [None]:
data.info()

In [None]:
# Helper function to print the values count of features
def feature_val_count(data, feature_name):
    s = data[feature_name].value_counts()
    # print() returns None, so there is nothing useful to return here
    print(f"The value counts of the feature {feature_name}:\n{s}")

In [None]:
# Checking the values count of the features to determine their nature
feature_val_count(data, 'condition')

### Feature Nature:
From the type of each feature and its value counts, we can determine the nature of each feature:
- **Qualitative:**
  - **Nominal:** id, waterfront, zipcode
  - **Ordinal:** date, view, condition
- **Quantitative:**
  - **Discrete:** bedrooms, sqft_living, sqft_lot, sqft_above, sqft_basement, yr_built, yr_renovated, sqft_living15, sqft_lot15
  - **Continuous:** price, floors, lat, long
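
The classification above relies on each column's dtype and its number of distinct values. A minimal sketch of how such a summary could be produced is shown here; the helper `summarize_features` and the toy values are hypothetical and only stand in for the real `data` frame.

```python
import pandas as pd

def summarize_features(df: pd.DataFrame, n_samples: int = 3) -> pd.DataFrame:
    """Return the dtype, number of unique values and a few sample values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        # Join a few distinct values into a string so the table stays readable
        "sample": pd.Series(
            {col: ", ".join(map(str, df[col].unique()[:n_samples])) for col in df.columns}
        ),
    })

# Toy frame standing in for `data` (values are made up for illustration)
toy = pd.DataFrame({
    "waterfront": [0, 0, 1, 0],
    "condition": [3, 4, 3, 5],
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
})
summary = summarize_features(toy)
print(summary)
```

A column with very few distinct values (such as `waterfront`) is a good candidate for a qualitative feature, while a float column with almost as many distinct values as rows is most likely continuous.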

In [None]:
# Checking the basic statistics for each feature(column) [Count, Mean, Standard Deviation, Minimum, Quartiles, and Maximum]
data.describe()

## Univariate Analysis

In this step of the EDA, each variable is examined and assessed by itself. Usually this step doesn't provide deep insight on its own; however, it helps in understanding each feature better by visualizing its distribution and examining its statistics.

In [None]:
data.head()

Usually the 'id' variable is ignored because it has no meaning and it is only used to index each row with a unique identifier.
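
Before dropping an identifier column, it can be worth checking that it really is unique. The sketch below uses a made-up toy frame (not the real dataset) to show one way to do this with pandas.

```python
import pandas as pd

# Toy frame with one repeated id (values are made up for illustration)
toy = pd.DataFrame({
    "id": [101, 102, 103, 102],
    "price": [250000, 300000, 410000, 320000],
})

# True only if every id appears exactly once
ids_unique = toy["id"].is_unique

# All rows whose id appears more than once
repeated = toy[toy["id"].duplicated(keep=False)]

print(f"All ids unique: {ids_unique}")
print(repeated)
```

If repeated ids show up, they may correspond to the same house appearing in several rows, which is worth knowing before treating rows as independent.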

#### Date

In [None]:
type(data.date[0])

**Remark:**
Notice that the dates are stored as 'str', so we need to convert them to timestamps, which is achieved using the pandas function pd.to_datetime()

In [None]:
data.date = pd.to_datetime(data.date, infer_datetime_format=True)
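
As an alternative to letting pandas infer the layout, the format can be passed explicitly. The sketch below assumes the raw strings look like `20141013T000000` (year, month, day, then a `T` and the time); that layout is an assumption and should be checked against the output of `data.head()`.

```python
import pandas as pd

# Example strings in the assumed raw layout: YYYYMMDD, a literal 'T', then HHMMSS
raw_dates = pd.Series(["20141013T000000", "20150225T000000"])

# An explicit format avoids any guessing by the parser
parsed = pd.to_datetime(raw_dates, format="%Y%m%dT%H%M%S")

# The .dt accessor exposes the date components of a datetime Series
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```

An explicit format makes the conversion fail loudly if a string does not match, instead of silently parsing it in an unexpected way.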

In [None]:
data.head()

In [None]:
# Checking that the date type changed correctly
type(data.date[0])

In [None]:
# Creating lists of the years and months extracted from the date feature.
Years = list(pd.DatetimeIndex(data.date).year)
Months = list(pd.DatetimeIndex(data.date).month)

In [None]:
# Creating bar plots for the sales count by Year and by Month
# Creating a boxplot for the monthly sales count distribution

fig = plt.figure(figsize=(20,6))
grid = plt.GridSpec(2, 2, width_ratios=(1, 2), height_ratios=(1,5), hspace=0.2, wspace=0.2)
Left_ax = fig.add_subplot(grid[:, 0])
Right_top = fig.add_subplot(grid[0, 1])
Right_bot = fig.add_subplot(grid[1, 1])

sb.countplot(x=Years, palette='mako', ax=Left_ax)
Left_ax.set_title('House sales count by Year', fontdict={'fontsize':15})
sb.countplot(x=Months, palette='mako', ax=Right_bot)
# Label the months after plotting so that seaborn does not overwrite the labels
# (this assumes that all 12 months occur in the data)
Right_bot.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
sb.boxplot(x=Months, ax=Right_top)
Right_top.set_title('House sales count by Month', fontdict={'fontsize':15});

#### Price

In [None]:
# Sorting the data by date and extracting some basic statistics about the price feature
# Calculating the upper and lower whiskers of the boxplot

data_sorted = data.sort_values(by='date')

median = np.median(data.price)
upper_quartile = np.percentile(data.price, 75)
lower_quartile = np.percentile(data.price, 25)

iqr = upper_quartile - lower_quartile
upper_whisker = data.price[data.price<=upper_quartile+1.5*iqr].max()
lower_whisker = data.price[data.price>=lower_quartile-1.5*iqr].min()

In [None]:
# '\033[1m' switches to bold text and '\033[0m' resets it so later output is not bold
print('\033[1m' + 'Price feature statistics:\n' + '\033[0m')

display(data_sorted.price.describe())
print('')
print(f'Upper Whisker: {upper_whisker}')
print(f'Lower Whisker: {lower_whisker}')

In [None]:
# Counting the prices above the upper whisker (there are no outliers below the lower whisker)
n_outliers = (data_sorted.price>upper_whisker).sum()
per_outliers = n_outliers/len(data_sorted.price)*100
print(f'Number of outliers: {n_outliers}')
print(f'Percentage of outliers: {per_outliers:.2f}%')

In [None]:
# Plotting the price feature using 3 different types of plots to better visualize the distribution

plt.figure(figsize=(20,8))
sb.scatterplot(x=range(len(data_sorted.price)) ,y=data_sorted.price, alpha=0.4)
plt.plot((0, len(data.price)), (lower_whisker, lower_whisker), 'm--',linewidth=3)
plt.plot((0, len(data.price)), (upper_whisker, upper_whisker), 'r--',linewidth=3)
plt.legend(['Lower Whisker', 'Upper Whisker', 'House Price'])
plt.title('Scatter plot of the house price feature', fontdict={'fontsize':15})

plt.figure(figsize=(20,8))
plt.subplot(121)
sb.histplot(data=data.price, bins=140)
plt.title('Distribution of the house prices', fontdict={'fontsize':15})

plt.subplot(122)
sb.boxplot(x=data.price)
plt.title('Boxplot of the house prices', fontdict={'fontsize':15});

**Remark:**
Notice that the distribution of prices is extremely right-skewed, and that we have 1146 outliers out of 21613 entries.
Almost 94.7% of the house prices are at or below the upper whisker of 1127500.
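
The whisker rule used for these numbers (the most extreme observations within 1.5 times the interquartile range of the quartiles) can be wrapped in a small helper. This is a minimal sketch on made-up prices; `tukey_whiskers` is a hypothetical name, not part of any library.

```python
import numpy as np
import pandas as pd

def tukey_whiskers(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) boxplot whiskers of a numeric Series."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # The whiskers are the most extreme observations that still lie inside the fences
    lower = values[values >= q1 - k * iqr].min()
    upper = values[values <= q3 + k * iqr].max()
    return lower, upper

# Toy prices with one obvious outlier (values are made up for illustration)
toy_prices = pd.Series([100, 120, 130, 150, 160, 170, 1000])
lower, upper = tukey_whiskers(toy_prices)
n_above = (toy_prices > upper).sum()

print(f"Whiskers: {lower} to {upper}, values above the upper whisker: {n_above}")
```

Applied to `data.price`, the same helper should reproduce the whisker values computed earlier in this section.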

#### Bedrooms & Bathrooms

In [None]:
plt.figure(figsize=(20,6))
plt.subplot(121)
sb.countplot(x=data.bedrooms, palette='mako' )
plt.title('Number of bedrooms distribution', fontdict={'fontsize':15})
plt.subplot(122)
sb.countplot(y=data.bathrooms, palette='mako' )
plt.title('Number of bathrooms distribution', fontdict={'fontsize':15});

As shown in the plot above, some homes have fractional bathroom counts such as a half or three-quarters of a bathroom.
A 1.5 bath means one full bathroom and one half bathroom. A 0.5 bathroom is called a half bath; the name does not refer to its size in square feet. A half bath offers a sink and a toilet but no shower or bathtub, while a three-quarter bath usually adds a shower but no bathtub. This notation for bathrooms is commonly used in the USA, which is why it appears in this dataset.
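
One simple way to read these fractional counts is to split each value into its whole part and its fractional part. The helper below is a hypothetical sketch of that naive reading; real listings may combine partial bathrooms differently (for example, two three-quarter baths also add up to 1.5).

```python
def split_bathrooms(count: float):
    """Split a bathroom count into (number of full baths, name of the partial bath or None)."""
    full = int(count)
    partial = round(count - full, 2)
    # Naive mapping from the fractional part to the usual name of a partial bath
    names = {0.0: None, 0.25: "quarter", 0.5: "half", 0.75: "three-quarter"}
    return full, names.get(partial, "unknown")

for value in (1.5, 2.75, 3.0):
    print(value, "->", split_bathrooms(value))
```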

#### sqft_living, sqft_lot, sqft_living15, sqft_lot15, sqft_above, and sqft_basement

sqft_living15 & sqft_lot15: living area and lot area in 2015, which implies that some renovations may have taken place.

In [None]:
sqft_des = pd.DataFrame(data=[data.sqft_living.describe(),data.sqft_lot.describe()])

In [None]:
pd.DataFrame((data.sqft_basement>0).value_counts()).transpose()

In [None]:
plt.figure(figsize=(20,15))
plt.subplot(321)
sb.histplot(x=data.sqft_living, kde=True, bins= 110)
sb.histplot(x=data.sqft_living15, kde=True, bins= 110, color='red')
plt.legend(['sqft_living','sqft_living15'])
plt.title('Living area distribution', fontdict={'fontsize':15})
plt.subplot(322)
ax = sb.histplot(x=data.sqft_lot)
ax = sb.histplot(x=data.sqft_lot15, color='red')
plt.legend(['sqft_lot','sqft_lot15'])
plt.title('Lot area distribution', fontdict={'fontsize':15})
ax.set_xscale('log')
plt.subplot(323)
sb.boxplot(x=data.sqft_living)
plt.subplot(324)
ax2 = sb.boxplot(x=data.sqft_lot)
ax2.set_xscale('log')
plt.subplot(325)
sb.histplot(x=data.sqft_above)
plt.subplot(326)
ax3 = sb.histplot(x=data[data.sqft_basement>0]['sqft_basement'])
#ax3.set_xscale('log')
plt.tight_layout()

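# Counting houses with/without a basement; explicit column names keep this independent of the pandas version
basement_bool = (data.sqft_basement>0).value_counts().rename_axis('has_basement').reset_index(name='count')
plt.figure(figsize=(8,5))
ax = sb.barplot(y=basement_bool['count'], x=basement_bool['has_basement'], palette='mako')
ax.set(ylabel='Count', xlabel='Basement');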

#### floors, waterfront, view, condition, and grade

In [None]:
plt.figure(figsize=(20,20))
plt.subplot(321)
sb.countplot(x=data.floors, palette='mako')
plt.title('Distribution of houses with respect to floor count', fontdict={'fontsize':15})
plt.subplot(322)
sb.countplot(x=data.waterfront, palette='mako')
plt.title('Number of houses with/without a water front', fontdict={'fontsize':15})
plt.subplot(323)
sb.countplot(x=data.view, palette='mako')
plt.title('Distribution of the views count', fontdict={'fontsize':15})
plt.subplot(324)
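sb.countplot(x=data.condition, palette='mako')
plt.title('Houses condition distribution', fontdict={'fontsize':15})
plt.subplot(325)
sb.countplot(x=data.grade, palette='mako')
plt.title('Houses grade distribution', fontdict={'fontsize':15});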

#### yr_built and yr_renovated

In [None]:
plt.figure(figsize=(15,20))
plt.subplot(121)
sb.countplot(y=data.yr_built, palette='mako')
plt.title('Distribution of yr_built feature', fontdict={'fontsize':15})
plt.subplot(122)
sb.countplot(y=data[data.yr_renovated>0]['yr_renovated'], palette='mako')
plt.title('Distribution of yr_renovated feature for renovated houses', fontdict={'fontsize':15})

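plt.figure(figsize=(8,5))
# Counting renovated vs. non-renovated houses; explicit column names keep this independent of the pandas version
yr_renov_bool = (data.yr_renovated>0).value_counts().rename_axis('renovated').reset_index(name='count')
# Keeping the returned axes so that the labels are set on this figure and not on an earlier one
ax = sb.barplot(y=yr_renov_bool['count'], x=yr_renov_bool['renovated'], palette='mako')
ax.set(ylabel='Count', xlabel='Renovated');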

#### long and lat

The best practice when dealing with longitude and latitude variables is to plot them on a map, which shows how the positions are scattered at real scale. This holds for both univariate and multivariate analysis.

In [None]:
df = data[['long','lat']].copy()
df['loc']='USA'
df.rename(columns={'long':'lon'}, inplace=True)

In [None]:
fig = go.Figure(data=px.scatter_geo(
        lon = df['lon'],
        lat = df['lat'],
        center={'lat':df['lat'].mean(), 'lon':df['lon'].mean()},
        width=700,
        height=600,
        opacity=0.5
        ))

fig.update_layout(
        title = 'Houses Location in USA-King County',
        geo_scope='usa'
    )
fig.show()

### Conclusion of univariate analysis

Many of the categorical features in the dataset are heavily unbalanced, such as 'condition', 'view', 'waterfront', and 'floors', which may be a cause of the extreme skewness of the distributions of house prices and areas. These speculations can be inspected further by carrying out a multivariate analysis, which is the subject of the following sections.
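
The imbalance mentioned here can be quantified as the share of rows that fall into the most frequent category of each feature. The following is a minimal sketch on a made-up toy frame; `majority_share` is a hypothetical helper name.

```python
import pandas as pd

def majority_share(df: pd.DataFrame, columns) -> pd.Series:
    """Return, for each column, the fraction of rows in its most frequent category."""
    # value_counts sorts by frequency, so the first entry is the majority category
    return pd.Series(
        {col: df[col].value_counts(normalize=True).iloc[0] for col in columns}
    )

# Toy frame standing in for `data` (values are made up for illustration)
toy = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "condition": [3, 3, 3, 3, 3, 3, 4, 4, 5, 2],
})
shares = majority_share(toy, ["waterfront", "condition"])
print(shares)
```

A share close to 1 means that almost all rows belong to a single category, so the feature carries little variation on its own.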

## Multivariate Analysis

We start with bivariate analysis. Since we have a target variable, the house price, we can limit the bivariate analysis to 'price' versus each of the other significant features. But first, let's take a quick look at the pair scatter plot of the numerical features.

In [None]:
sb.pairplot(data=data[['price','bedrooms','bathrooms','sqft_living','sqft_lot','grade','sqft_above','sqft_basement','yr_built']], palette='mako');

In [None]:
plt.figure(figsize=(18,13))
plt.title('Heatmap correlation of the most important features', fontsize=18)
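# Keeping only numeric columns: the converted 'date' column cannot be correlated directly
sb.heatmap(data=data.iloc[:,1:].select_dtypes('number').corr(), annot=True);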
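
A heatmap with many features can be hard to read, so it may help to rank the features by the strength of their correlation with the target alone. This is a minimal sketch on made-up values; `correlation_with_target` is a hypothetical helper, and 'price' is assumed to be the target column.

```python
import pandas as pd

def correlation_with_target(df: pd.DataFrame, target: str = "price") -> pd.Series:
    """Rank the numeric features by the absolute Pearson correlation with the target."""
    numeric = df.select_dtypes("number")         # drop text and datetime columns
    corr = numeric.corr()[target].drop(target)   # correlation of each feature with the target
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy frame standing in for `data` (values are made up for illustration)
toy = pd.DataFrame({
    "price": [200000, 300000, 400000, 500000, 600000],
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "yr_built": [1990, 1950, 2005, 1970, 2010],
    "city": ["a", "b", "c", "d", "e"],
})
ranked = correlation_with_target(toy)
print(ranked)
```

Pearson correlation only captures linear relations, so a low value here does not rule out a non-linear dependence on the target.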

In [None]:
plt.figure(figsize=(20,20))
plt.suptitle('Relation between categorical variables and the target variable', y=0.91, fontsize=20)
plt.subplot(421)
sb.barplot(x=data.bedrooms, y=data.price, palette='mako')
plt.subplot(422)
sb.barplot(x=data.waterfront, y=data.price, palette='mako')
plt.subplot(423)
sb.barplot(x=data.grade, y=data.price, palette='mako')
plt.subplot(424)
sb.barplot(x=data.floors, y=data.price, palette='mako')
plt.subplot(425)
sb.barplot(x=data.condition, y=data.price, palette='mako')
plt.subplot(426)
sb.barplot(x=data.view, y=data.price, palette='mako')
plt.subplot(414)
sb.barplot(x=data.bathrooms, y=data.price, palette='mako')

In [None]:
sb.lmplot(x='sqft_living', y='price', hue='waterfront', data=data, palette='mako', height=8, aspect=1.7);

In [None]:
sb.lmplot(x='sqft_lot', y='price', hue='waterfront', data=data, palette='mako', height=8, aspect=1.7, ci=0);

In the image below the house **condition** is coded in symbols and the **grade** variable in colors.

In [None]:
df = data[['long','lat','price','grade','condition','yr_built']].copy()
df.rename(columns={'long':'lon'}, inplace=True)

fig = go.Figure(data=px.scatter_geo(
        lon = df['lon'],
        lat = df['lat'],
        center={'lat':df['lat'].mean(), 
                'lon':df['lon'].mean()
               },
        size=df['price'],
        color=df['grade'],
        symbol=df['condition'],
        animation_frame=df['yr_built'],
        width=700,
        height=600,
        opacity=0.5,
        color_continuous_scale="deep"
        ))

fig.update_layout(
        title = 'Houses prices map',
        geo_scope='usa'
    )
fig.layout.legend.y = 1.05
fig.layout.legend.x = 1.035
fig.layout.coloraxis.colorbar.y = 0.25
fig.show()

## Conclusion

In this notebook, I presented a simple and methodical way of performing an EDA for structured and clean data. In practice, data are collected in a raw state and need more cleaning work. The presented EDA was not aimed at a specific task, even though we have considered the price feature as a target variable for a regression task; if we are going to build a model, more analysis is needed. For instance, we can further inspect the drop in price for the houses that have 6.5-7.5 bathrooms, and we can also think about binning some features and recheck whether or not a pattern has emerged.
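
The binning idea mentioned above can be sketched with `pd.cut`. The bin edges, labels, and prices below are assumptions chosen only for illustration; suitable edges for the real data would have to be picked after looking at the bathroom distribution.

```python
import pandas as pd

# Toy frame standing in for `data` (values are made up for illustration)
toy = pd.DataFrame({
    "bathrooms": [1.0, 1.5, 2.25, 2.5, 3.5, 4.0, 6.5, 7.5],
    "price": [250000, 300000, 450000, 500000, 700000, 900000, 2000000, 1500000],
})

# Right-closed bins: (0, 2], (2, 4] and (4, 8]
toy["bath_bin"] = pd.cut(toy["bathrooms"], bins=[0, 2, 4, 8], labels=["<=2", "2-4", ">4"])

# Median price per bin; the median is less sensitive to extreme prices than the mean
median_price = toy.groupby("bath_bin", observed=True)["price"].median()
print(median_price)
```

Grouping rare values such as 6.5 or 7.5 bathrooms into a wider bin gives each group more rows, which makes the comparison of prices between groups less noisy.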