# Exploratory data analysis (EDA)

Exploratory Data Analysis (EDA) is the initial step in the data analysis process. Its purpose is to summarize the main characteristics of a data set: gain insights into the data, understand its structure, identify patterns, and generate hypotheses for further analysis.



<img src="http://sharpsightlabs.com/wp-content/uploads/2016/05/1_data-analysis-for-ML_how-we-use-dataAnalysis_2016-05-16.png" />

## Preparations

First let's import the libraries we are going to use. Note that the `np`, `pd`, `sns`, `plt` shorthands are established conventions.

In [None]:
import numpy as np  # for fast math and specifically linear algebra operations
import pandas as pd  # Like Excel but in Python
import seaborn as sns  # for easy plotting
import matplotlib.pyplot as plt  # for lower-level plotting

# Keep plots inside the Jupyter notebook (an interactive backend is also possible)
%matplotlib inline

sns.set()  # Use the seaborn default style, which is quite good

Download the [dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)

In [None]:
# The URL is quoted so that the shell does not treat `&` as a background operator
!wget -q -O house-pricing-train.csv "https://drive.google.com/u/0/uc?id=1NdAQ-DhI7bSBtu1C-GNiTERIW-5CHDwg&export=download"

In [None]:
df = pd.read_csv('house-pricing-train.csv')
df.head()

In [None]:
df.shape

In [None]:
df.info()

## Are there any features with too many missing values?

How would you define "too many"?

**Bonus:** check the dataset description and find out if a missing value is a legitimate value for some features. Perhaps you can just substitute it with something for some columns to make it usable?
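
One way to approach the bonus, as a minimal sketch: in `data_description.txt`, `NA` in columns such as `Alley`, `PoolQC` and `Fence` means that the house simply has no alley access, pool or fence. The helper name `fill_absent_category` and the placeholder label `"None"` are assumptions made for illustration, and the toy frame only stands in for `df`.

```python
import numpy as np
import pandas as pd


def fill_absent_category(frame, columns, label="None"):
    """Return a copy where NaN in the given columns is replaced by an explicit label."""
    result = frame.copy()
    # Ignore requested columns that are not in the frame
    present = [col for col in columns if col in result.columns]
    if present:
        result[present] = result[present].fillna(label)
    return result


# Toy stand-in for the housing data (hypothetical values)
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "Fence": ["MnPrv", np.nan, np.nan],
    "LotFrontage": [65.0, np.nan, 80.0],
})
filled = fill_absent_category(toy, ["PoolQC", "Fence", "Alley"])
print(filled)
```

Numeric columns such as `LotFrontage` are left untouched here, because a missing measurement is not the same thing as an absent feature.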

Write code below to drop features with too many missing values.

In [None]:
columns_to_drop = # Write code here to get columns with too many missing values
df.drop(columns=columns_to_drop, inplace=True)
df.shape
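
A possible way to fill in the placeholder above, as a sketch rather than the reference solution: compute the share of missing values per column and keep the names that exceed a cut-off. The function name `columns_with_many_missing` and the 30% threshold are assumptions; the right threshold depends on how you answered the question above.

```python
import numpy as np
import pandas as pd


def columns_with_many_missing(frame, threshold=0.3):
    """Return the names of columns whose share of missing values exceeds threshold."""
    missing_share = frame.isna().mean()  # fraction of NaN per column
    return missing_share[missing_share > threshold].index.tolist()


# Toy stand-in for the housing data (hypothetical values)
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave"],   # 75% missing
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],   # 25% missing
    "LotArea": [8450, 9600, 11250, 9550],        # complete
})
toy_columns_to_drop = columns_with_many_missing(toy, threshold=0.3)
print(toy_columns_to_drop)
```

In the notebook this would be used as `columns_to_drop = columns_with_many_missing(df)`.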

## How is the target value distributed?

The target value is `SalePrice`.

Does it have missing values?

How is it distributed?

Are there any outliers?

Plot a distribution plot (a histogram) to answer these questions.
*Hint:* consider reading up on `sns.histplot` or `pd.DataFrame.hist`

In [None]:
print(df['SalePrice'].describe())
plt.figure(figsize=(9, 8))
sns.histplot(df.SalePrice)
plt.show()
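
The three questions above can also be answered numerically. The sketch below counts missing values, measures skewness and flags points outside the 1.5 * IQR fences. The IQR rule is only one common convention for outliers, and the function name `summarize_target` and the toy prices are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def summarize_target(values):
    """Count missing values, measure skewness and count points outside the 1.5 * IQR fences."""
    values = pd.Series(values, dtype="float64")
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Comparisons with NaN are False, so missing values are never counted as outliers
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return {
        "n_missing": int(values.isna().sum()),
        "skewness": float(values.skew()),
        "n_outliers": int(outside.sum()),
    }


# Hypothetical prices: mostly moderate values plus one very expensive house
toy_prices = [120_000, 135_000, 150_000, 160_000, 175_000, 190_000, 755_000, np.nan]
summary = summarize_target(toy_prices)
print(summary)
```

A positive skewness means a long right tail, which is typical for prices. In the notebook the call would be `summarize_target(df['SalePrice'])`.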

## Numerical data distribution

For this part, let's look at the distribution of all the features by plotting them.

To do so, let's first list all the data types in our dataset and keep only the numerical ones:

In [None]:
list(set(df.dtypes.tolist()))

In [None]:
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num.head()

Now let's plot them all:

In [None]:
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);  # the trailing ; suppresses matplotlib's verbose text output

Are there any features with a similar distribution to the target variable?

#### Correlation

Now we'll try to find which features are strongly correlated with `SalePrice`. We'll store them in a var called `golden_features_list`. We'll reuse our `df_num` dataset to do so.

In [None]:
df_num_corr = df_num.corr()  # obtain the correlation matrix
df_num_corr = df_num_corr['SalePrice'].drop('SalePrice')  # select only correlations with SalePrice, excluding SalePrice itself
df_num_corr[:10]

In the cell below, obtain all features with a strong correlation with `SalePrice` (absolute value `>= 0.5`).
*Hint:* Negatively correlated features can also be used for predicting `SalePrice`!

In [None]:
golden_features_list = # Your code here
print(f"There are {len(golden_features_list)} features strongly correlated with SalePrice:\n{golden_features_list}")
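
One possible way to complete the cell above, shown on toy data as a hedged sketch: filter on the absolute value of the correlation so that strongly negative features are kept too. The function name `strongly_correlated_features` and the toy column `Age` are assumptions, not part of the dataset.

```python
import pandas as pd


def strongly_correlated_features(frame, target, threshold=0.5):
    """Return features whose absolute Pearson correlation with target is >= threshold."""
    correlations = frame.corr()[target].drop(target)
    strong = correlations[correlations.abs() >= threshold]
    return strong.sort_values(ascending=False)


# Toy stand-in for df_num (hypothetical values)
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "Age": [50, 40, 30, 20, 10],        # negatively related to price
    "MoSold": [2, 1, 3, 1, 2],          # essentially unrelated to price
    "SalePrice": [100, 160, 190, 260, 300],
})
toy_golden = strongly_correlated_features(toy, "SalePrice")
print(toy_golden.index.tolist())
```

With the notebook's data this would read `golden_features_list = strongly_correlated_features(df_num, 'SalePrice').index.tolist()`.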

Perfect, we now have a list of strongly correlated features, but this list is incomplete because correlation is affected by outliers. So we could proceed as follows:

- Plot the numerical features and see which ones have very few or explainable outliers
- Remove the outliers from these features and see which one can have a good correlation without their outliers
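
To see why outliers matter here, consider the small sketch below. The data are hypothetical: six points with a weak negative trend, to which one extreme point is added.

```python
import numpy as np

# Hypothetical data: six points with a weak negative trend
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 1, 2, 1, 2, 1], dtype=float)
r_without = np.corrcoef(x, y)[0, 1]

# The same data plus one extreme point far from the rest
x_out = np.append(x, 30.0)
y_out = np.append(y, 30.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(f"without outlier: {r_without:.2f}, with outlier: {r_with:.2f}")
```

A single point is enough to move the Pearson coefficient from weakly negative to close to 1, so a high correlation should always be checked against a plot.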
    

Correlation by itself does not always explain the relationships in data. Why?
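
One illustration of this question, as a sketch: the Pearson coefficient only measures linear association, so a feature can fully determine another one and still show a correlation of about zero. The variables `x` and `y` below are synthetic.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 101)  # symmetric around zero
y = x ** 2                       # y is fully determined by x

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```

The coefficient is zero up to floating-point error, although a scatter plot of `y` against `x` would show a perfect parabola.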

Let's plot numeric features against `SalePrice` to visually inspect them.

In [None]:
for i in range(0, len(df_num.columns), 5):
    sns.pairplot(data=df_num,
                x_vars=df_num.columns[i:i+5],
                y_vars=['SalePrice'])

We can clearly identify some relationships. Most features seem to have a linear relationship with `SalePrice`. If we look closely, we can also see that a lot of data points are located at `x = 0`, which may indicate that the house does not have that feature at all.

Take `OpenPorchSF` (third row of plots from the end): I doubt that all houses have a porch (mine doesn't, for instance, but I don't lose hope that one day... yeah, one day...).

Bonus: remove these `0` values and repeat the process of finding correlated values.
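
A minimal sketch of this bonus, under stated assumptions: for each feature, only the rows where that feature is non-zero are used, and features with too few such rows are skipped. The function name `correlations_without_zeros`, the `min_rows` parameter and the toy values are hypothetical.

```python
import pandas as pd


def correlations_without_zeros(frame, target, min_rows=3):
    """Correlate each feature with target using only rows where the feature is non-zero."""
    results = {}
    for column in frame.columns.drop(target):
        subset = frame.loc[frame[column] != 0, [column, target]].dropna()
        if len(subset) >= min_rows:  # skip features with too few non-zero rows
            results[column] = subset[column].corr(subset[target])
    return pd.Series(results, dtype="float64").sort_values(ascending=False)


# Toy stand-in for df_num (hypothetical values)
toy = pd.DataFrame({
    "OpenPorchSF": [0, 0, 0, 20, 40, 60],
    "PoolArea": [0, 0, 0, 0, 0, 500],
    "SalePrice": [300, 100, 200, 150, 250, 350],
})
with_zeros = toy.corr()["SalePrice"].drop("SalePrice")
without_zeros = correlations_without_zeros(toy, "SalePrice")
print(with_zeros)
print(without_zeros)
```

In this toy example the porch area looks only moderately correlated with the price when the zeros are included, and perfectly correlated once they are removed. `PoolArea` is skipped because a single non-zero row cannot support a correlation.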

## Categorical features

Let's select some categorical features. Note that not all categorical features have a string data type (`'O'` for object in numpy/pandas terminology): some, such as `MSSubClass`, are stored as numeric codes.

In [None]:
df_cat = df.select_dtypes(include = ['O'])
df_cat.head()

Now let's plot the distribution of `SalePrice` conditioned on the value of a categorical feature.

In [None]:
plt.figure(figsize=(10, 7))
sns.histplot(data=df, x="SalePrice", hue="KitchenQual", kde=True, stat="density", common_bins=False, common_norm=False);

**In the following cell, plot a bar chart to inspect the count of each value of `KitchenQual`**

In [None]:
# Your code here
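
As a hedged sketch of the numbers behind such a bar chart, the counts per category can be obtained with `value_counts`. The toy frame below is hypothetical and only stands in for `df`.

```python
import pandas as pd

# Hypothetical sample of kitchen quality ratings
toy = pd.DataFrame({"KitchenQual": ["TA", "Gd", "TA", "Ex", "Gd", "TA", "Fa"]})

# Number of houses per rating, most frequent first
kitchen_counts = toy["KitchenQual"].value_counts()
print(kitchen_counts)
```

Calling `kitchen_counts.plot.bar()` draws the chart from these counts; `sns.countplot(x="KitchenQual", data=df)` is an alternative that counts and plots in one step.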

It's evident that houses with different kitchens have very different price distributions.


**How would you visualize the relationship between two categorical features?**
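
One possible answer, sketched on toy data: build a contingency table with `pd.crosstab` and then display it, for example as a heatmap. The pairing of `KitchenQual` with `CentralAir` is an arbitrary choice for illustration, and the values are hypothetical.

```python
import pandas as pd

# Toy stand-in for two categorical columns (hypothetical values)
toy = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "TA", "Ex", "Gd", "TA"],
    "CentralAir": ["Y", "Y", "N", "Y", "Y", "N"],
})

# Rows: kitchen quality, columns: central air, cells: number of houses
contingency = pd.crosstab(toy["KitchenQual"], toy["CentralAir"])
print(contingency)
```

Passing the table to `sns.heatmap(contingency, annot=True)` turns it into a picture.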

**Find a variable that corresponds to the ordinal type (rank of some kind) and visualize its relationship with `SalePrice`**
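
As one example, `OverallQual` rates the overall quality of a house on an ordered scale. The sketch below, on hypothetical values, summarizes the price per quality level before any plotting.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values)
toy = pd.DataFrame({
    "OverallQual": [5, 7, 5, 8, 7, 8, 5],
    "SalePrice": [130, 200, 150, 290, 220, 310, 140],
})

# Median price per quality level, ordered by the rank itself
median_by_quality = toy.groupby("OverallQual")["SalePrice"].median().sort_index()
print(median_by_quality)
```

For the visual version, `sns.boxplot(x="OverallQual", y="SalePrice", data=df)` shows the whole price distribution per level rather than only the median.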

## Feature to feature relationship

Plotting all the numerical features in a seaborn pairplot would take too much time and be hard to interpret. Instead, we can check whether some variables are correlated with each other and then try to explain their relationship with common sense.

In [None]:
corr = df_num.drop('SalePrice', axis=1).corr()  # We already examined SalePrice correlations
plt.figure(figsize=(12, 10))

sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.4)],
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);

A lot of features seem to be correlated with each other, but some of them, such as `YearBuilt`/`GarageYrBlt`, may just indicate price inflation over the years. As for `1stFlrSF`/`TotalBsmtSF`, it is natural that the larger the first floor is (considering many houses have only one floor), the larger the total basement area will be.
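
Reading pairs off a large heatmap is error-prone, so it can help to list them explicitly. The sketch below keeps the upper triangle of the correlation matrix, so that each pair appears once, and sorts the pairs by absolute correlation. The function name `correlated_pairs`, the 0.5 threshold and the toy values are assumptions.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.5):
    """List feature pairs whose absolute correlation is at least threshold, strongest first."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.reindex(pairs.abs().sort_values(ascending=False).index)


# Toy stand-in for df_num without SalePrice (hypothetical values)
toy = pd.DataFrame({
    "1stFlrSF": [800, 1000, 1200, 1400, 1600],
    "TotalBsmtSF": [780, 990, 1150, 1420, 1580],
    "MoSold": [2, 1, 3, 1, 2],
})
toy_pairs = correlated_pairs(toy)
print(toy_pairs)
```

In the notebook the call would be `correlated_pairs(df_num.drop('SalePrice', axis=1))`.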

# Drawing conclusions

Based on the analysis up to this point, answer the following questions.

* What insights can we draw from the data about house pricing?
* Which features are definitely going to be useful for modelling house pricing?
* What features should we exclude or combine with other features when making a predictive model?
* Were there any counter-intuitive findings?
* What findings remain unexplained?

# Additional analysis

## Q -> Q (Quantitative to Quantitative relationship)

Let's now examine the quantitative features of our dataframe and how they relate to the `SalePrice` which is also quantitative (hence the relation Q -> Q).

Some of the features of our dataset are categorical. To separate the categorical from the quantitative features, let's refer to the `data_description.txt` file. According to this file, we end up with the following columns:

In [None]:
quantitative_features_list = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'TotalBsmtSF', '1stFlrSF',
    '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
    'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
    'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice']
df_quantitative_values = df[quantitative_features_list]
df_quantitative_values.head()

Still, we have a lot of features to analyse here, so let's take the *strongly correlated quantitative* features from this dataset and analyse them one by one.

In [None]:
features_to_analyse = [x for x in quantitative_features_list if x in golden_features_list]
features_to_analyse.append('SalePrice')
features_to_analyse

Let's look at how each of them relates to `SalePrice`.

In [None]:
fig, axes = plt.subplots(round(len(features_to_analyse) / 3), 3, figsize=(18, 12))

for i, ax in enumerate(fig.axes):
    if i < len(features_to_analyse) - 1:  # the last entry is SalePrice itself
        sns.regplot(x=features_to_analyse[i], y='SalePrice', data=df[features_to_analyse], ax=ax)

## C -> Q (Categorical to Quantitative relationship)

Let's get all the categorical features of our dataset and see if we can find some insight in them.
Instead of reopening our `data_description.txt` file and checking which features are categorical, let's just remove the columns in `quantitative_features_list` from our entire dataframe.

In [None]:
# Keep every column that is not a quantitative feature. SalePrice is the last entry
# of quantitative_features_list, so [:-1] leaves it out of the exclusion list and we keep it.
categorical_features = [col for col in df.columns if col not in quantitative_features_list[:-1]]
df_categ = df[categorical_features]
df_categ.head()

And don't forget the non-numerical features

In [None]:
df_not_num = df_categ.select_dtypes(include = ['O'])
print('There are {} non-numerical features including:\n{}'.format(len(df_not_num.columns), df_not_num.columns.tolist()))

Now let's plot some of them

In [None]:
plt.figure(figsize = (10, 6))
ax = sns.boxplot(x='BsmtExposure', y='SalePrice', data=df_categ)
tmp = plt.setp(ax.artists, alpha=.5, linewidth=2, edgecolor="k")
tmp = plt.xticks(rotation=45)

In [None]:
plt.figure(figsize = (12, 6))
ax = sns.boxplot(x='SaleCondition', y='SalePrice', data=df_categ)
tmp =  plt.setp(ax.artists, alpha=.5, linewidth=2, edgecolor="k")
tmp = plt.xticks(rotation=45)

And finally let's look at their distribution

In [None]:
n_cols = 3
n_rows = -(-len(df_not_num.columns) // n_cols)  # ceiling division so every feature gets an axis
fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 30))

for i, ax in enumerate(fig.axes):
    if i < len(df_not_num.columns):
        sns.countplot(x=df_not_num.columns[i], alpha=0.7, data=df_not_num, ax=ax)
        ax.tick_params(axis='x', labelrotation=45)  # rotate the labels once the plot has set them

fig.tight_layout()

<font color='chocolate'>We can see that for some features, such as `Utilities`, `Heating`, `GarageCond` and `Functional`, a single category dominates. These features may not be relevant for our predictive model.</font>
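
The visual impression can be backed by a number. The sketch below computes, for each column, the share of rows taken by its most frequent value. The function name `dominant_category_share` and the toy values are assumptions for illustration.

```python
import pandas as pd


def dominant_category_share(frame):
    """For each column, return the share of rows taken by its most frequent value."""
    shares = {
        column: frame[column].value_counts(normalize=True, dropna=False).iloc[0]
        for column in frame.columns
    }
    return pd.Series(shares, dtype="float64").sort_values(ascending=False)


# Toy stand-in for df_not_num (hypothetical values)
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 9 + ["NoSeWa"],
    "KitchenQual": ["TA", "Gd", "TA", "Ex", "Gd", "TA", "Fa", "Gd", "TA", "Gd"],
})
toy_shares = dominant_category_share(toy)
print(toy_shares)
```

Columns whose share is close to 1 carry almost no variation. In the notebook, `dominant_category_share(df_not_num)` would rank the real features this way.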