# **House Prices Study**

## Objectives

This notebook answers business requirement 1:
* We will inspect the data related to house prices.
* We will perform a correlation study to investigate the most relevant variables correlated to the sale price.
* We will visualize these variables against the sale price, display and summarize the insights.

## Inputs

* outputs/datasets/cleaned/house_prices_records.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit app.

### CRISP-DM
* Data Understanding


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/cleaned/house_prices_records.csv")
df.head(5)

---

# Data Exploration

We want to become more familiar with the dataset: check variable types and distributions, missing levels, and what these variables mean in a business context.

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

---

# Correlation Study

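We check for correlations with SalePrice using the Spearman and Pearson methods. Categorical variables are one-hot encoded first so that they can be included in the correlation.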

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

We notice a strong correlation between SalePrice and 5 features with both the Pearson and Spearman methods.

We also find that the above ground living area, **GrLivArea**, has a strong correlation with SalePrice, just as our first hypothesis states.
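
The difference between the two methods can be illustrated with a minimal sketch on synthetic data (the values below are made up for illustration and do not come from the house prices dataset). Pearson measures how linear a relationship is, while Spearman only measures whether it is monotonic, so a curved but steadily increasing relationship should score higher with Spearman.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only: y increases monotonically
# with x, but the relationship is exponential rather than linear.
x = pd.Series(np.arange(1, 51, dtype=float))
y = np.exp(x / 5.0)

pearson = x.corr(y, method="pearson")    # sensitive to non-linearity
spearman = x.corr(y, method="spearman")  # rank-based, only needs monotonicity

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

This is why a feature can rank differently in the two top 10 lists above.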

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        # np.bool was removed in NumPy 1.24; the builtin bool works on all versions
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        # np.bool was removed in NumPy 1.24; the builtin bool works on all versions
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                  df_corr_spearman = df_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.5, PPS_Threshold =0.2,
                  figsize=(12,10), font_annot=10)

The heatmap shows that the correlation between YearBuilt and OverallCond is moderate with the Pearson method and strong with the Spearman method, which supports hypothesis 2.
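
Reading a single pair off a large heatmap can be error-prone, so the two coefficients can also be computed directly. The sketch below uses a hypothetical helper, `pair_correlation`, and a toy frame with placeholder column names and made-up values; neither is part of the original notebook.

```python
import pandas as pd


def pair_correlation(df, x, y):
    """Return the Pearson and Spearman coefficients for two numeric columns.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return {
        method: df[x].corr(df[y], method=method)
        for method in ("pearson", "spearman")
    }


# Toy data with placeholder names and made-up values, for illustration only.
toy_pair = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [4, 5, 5, 6, 8, 9],
})
pair_result = pair_correlation(toy_pair, "feature_a", "feature_b")
print(pair_result)
```

In the notebook the call would be `pair_correlation(df, "YearBuilt", "OverallCond")`. The sign of each coefficient shows the direction of the relationship, which the thresholded heatmap only shows in its annotations.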

* We take the top 5 features from each method so we can study them further.

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

We notice that the top 5 features are not the same for Pearson and Spearman, so we continue with the 6 features that appear in either top 5.
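
To see which features the two methods agree on and which appear in only one list, the set comparison can be made explicit. This is a minimal sketch: `compare_top_features` is a hypothetical helper, and the demo series use placeholder feature names and made-up scores rather than the notebook's results.

```python
import pandas as pd


def compare_top_features(corr_a, corr_b, top_n=5):
    """Compare the top_n index labels of two sorted correlation Series.

    Hypothetical helper for illustration; it assumes both Series are
    already sorted by absolute correlation, strongest first.
    """
    top_a = set(corr_a.index[:top_n])
    top_b = set(corr_b.index[:top_n])
    return {
        "shared": sorted(top_a & top_b),
        "only_first": sorted(top_a - top_b),
        "only_second": sorted(top_b - top_a),
        "union": sorted(top_a | top_b),
    }


# Placeholder series with made-up values, for illustration only.
demo_pearson = pd.Series([0.9, 0.8, 0.7], index=["f1", "f2", "f3"])
demo_spearman = pd.Series([0.9, 0.85, 0.6], index=["f1", "f3", "f4"])
overlap = compare_top_features(demo_pearson, demo_spearman, top_n=2)
print(overlap)
```

In the notebook the call would be `compare_top_features(corr_pearson, corr_spearman, top_n)`, and the `union` entry corresponds to the set printed in the cell above.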

We will study the following:
* The house price is higher when the first floor area is bigger.
* The house price is higher when the above ground area is bigger.
* The house price is higher when the house has a garage, and it increases with the size of the garage.
* The house price is higher when the overall quality of the house is better.
* The house price is higher when the house is younger.

# EDA on selected features

### Features distribution by SalePrice

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

corr = df["SalePrice"].corr(df["1stFlrSF"])

plt.scatter(df["1stFlrSF"], df["SalePrice"])
plt.title("Correlation: {:.3f}".format(corr))
plt.xlabel("1stFlrSF (sq ft)")
plt.ylabel("Price ($)")
plt.show()

* The scatterplot for 1stFlrSF shows that, typically, the larger the area in sq ft, the higher the price.

In [None]:
corr = df["SalePrice"].corr(df["GrLivArea"])

plt.scatter(df["GrLivArea"], df["SalePrice"])
plt.title("Correlation: {:.3f}".format(corr))
plt.xlabel("GrLivArea (sq ft)")
plt.ylabel("Price ($)")
plt.show()

* The scatterplot for GrLivArea shows that, typically, the larger the area in sq ft, the higher the price.

In [None]:
corr = df["SalePrice"].corr(df["GarageArea"])

plt.scatter(df["GarageArea"], df["SalePrice"])
plt.title("Correlation: {:.3f}".format(corr))
plt.xlabel("GarageArea (sq ft)")
plt.ylabel("Price ($)")
plt.show()

* The scatterplot for GarageArea shows that the price typically rises with the garage area, although the trend is less clear than in the scatterplots above. The many zero values also indicate that a lot of houses have no garage.

In [None]:
corr = df["SalePrice"].corr(df["OverallQual"])

plt.scatter(df["OverallQual"], df["SalePrice"])
plt.title("Correlation: {:.3f}".format(corr))
plt.xlabel("OverallQual")
plt.ylabel("Price ($)")
plt.show()

The scatterplot shows that houses with a higher quality rating normally have higher prices. We also notice that the highest rating (10) has the largest spread in price.
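
The spread within each rating can be quantified rather than judged by eye. The sketch below assumes a hypothetical helper, `price_spread_by_group`, and uses a toy frame with placeholder column names and made-up values.

```python
import pandas as pd


def price_spread_by_group(df, group_col, price_col="SalePrice"):
    """Summarise the price level and spread within each group.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    summary = df.groupby(group_col)[price_col].agg(["count", "median", "min", "max"])
    summary["range"] = summary["max"] - summary["min"]
    return summary


# Toy data with made-up values, for illustration only.
toy_quality = pd.DataFrame({
    "rating": [1, 1, 2, 2, 2],
    "price": [100, 120, 150, 300, 200],
})
spread = price_spread_by_group(toy_quality, "rating", price_col="price")
print(spread)
```

In the notebook the call would be `price_spread_by_group(df, "OverallQual")`. The `count` column is worth checking too, since a rating with very few houses can show an extreme range by chance.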

In [None]:
corr = df["SalePrice"].corr(df["TotalBsmtSF"])

plt.scatter(df["TotalBsmtSF"], df["SalePrice"])
plt.title("Correlation: {:.3f}".format(corr))
plt.xlabel("TotalBsmtSF")
plt.ylabel("Price ($)")
plt.show()

The scatterplot shows that a larger basement area in sq ft is correlated with a higher price, and that a lot of houses have no basement (zero values).
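
How many houses have no garage or no basement can be checked by counting the zeros in the area columns. This is a minimal sketch with a hypothetical helper, `zero_share`, shown on a toy frame with placeholder column names and made-up values; it assumes that a value of 0 means the house lacks that feature.

```python
import pandas as pd


def zero_share(df, columns):
    """Return the fraction of rows equal to zero for each given column.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return (df[columns] == 0).mean().rename("share_of_zeros")


# Toy data with made-up values, for illustration only.
toy_areas = pd.DataFrame({
    "area_a": [0, 10, 0, 20],
    "area_b": [5, 0, 5, 5],
})
zeros = zero_share(toy_areas, ["area_a", "area_b"])
print(zeros)
```

In the notebook the call would be `zero_share(df, ["GarageArea", "TotalBsmtSF"])`.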

In [None]:
corr = df["SalePrice"].corr(df["YearBuilt"])

plt.scatter(df["YearBuilt"], df["SalePrice"])
plt.title("Correlation: {:.3f}".format(corr))
plt.xlabel("YearBuilt")
plt.ylabel("Price ($)")
plt.show()

The correlation is hard to see, but the plot suggests that the highest prices are found among the newest houses.
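
When a trend is hard to read from a scatterplot, grouping the years into decades and comparing the median price per decade can make it easier to see. The sketch below is one possible approach, not the notebook's method: `median_price_by_decade` is a hypothetical helper, and the toy frame uses placeholder column names and made-up values.

```python
import pandas as pd


def median_price_by_decade(df, year_col="YearBuilt", price_col="SalePrice"):
    """Return the median price for each decade of construction.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Floor each year to the start of its decade, e.g. 1958 -> 1950
    decade = (df[year_col] // 10) * 10
    return df.groupby(decade)[price_col].median()


# Toy data with made-up values, for illustration only.
toy_years = pd.DataFrame({
    "year": [1951, 1958, 1960, 1969, 2003],
    "price": [100, 200, 150, 250, 400],
})
by_decade = median_price_by_decade(toy_years, year_col="year", price_col="price")
print(by_decade)
```

In the notebook the call would be `median_price_by_decade(df)`, and the result could be drawn with `by_decade.plot()`.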

---

# Conclusions and Next steps

The correlation results and the interpretation of the plots agree.

* The top 6 features correlated with the sale price are: '1stFlrSF', 'GarageArea', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 'YearBuilt'.
* The house price is higher when the first floor area is bigger.
* The house price is higher when the above ground area is bigger.
* The house price is higher when the house has a garage, and it increases with the size of the garage.
* The house price is higher when the overall quality of the house is better.
* The house price is higher when the house is younger.

* Our first hypothesis is confirmed: houses with a larger GrLivArea have a higher SalePrice.
* Our second hypothesis is also confirmed: the younger the house (YearBuilt), the better its OverallCond.

---