# Sale Price Study Notebook

## Objectives

Address the client's first business requirement, visualising which property features correlate most strongly with the sale price, by:
 - Analysing correlation and PPS scores.
 - Conducting further exploratory data analysis using ProfileReport and other tools.
 - Visualising findings in plots that can be later used in a Streamlit app.

## Inputs

* /inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv
* outputs/datasets/cleaned/HousePricesCleaned.csv (the cleaned dataset loaded in the Data Exploration section)

## Outputs

* Generate plots that can be used later in a Streamlit app.

## Additional Comments

* We adapt the method used in https://github.com/Code-Institute-Solutions/churnometer/blob/main/jupyter_notebooks/02%20-%20Churned%20Customer%20Study.ipynb to the scope and requirements of this project.

## Notebook in Relation to CRISP-DM

* This notebook forms the Data Understanding phase of the CRISP-DM framework.


---

# Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir
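
One caveat with the directory change above: re-running that cell moves up one more level each time. A minimal guard is sketched below. It assumes the notebook lives in a folder named `jupyter_notebooks`, which is borrowed from the layout of the Churnometer walkthrough repository and is not stated in this notebook. The helper name `project_root` is hypothetical.

```python
import os


def project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the (assumed) notebooks folder."""
    if os.path.basename(cwd) == notebooks_dir:
        return os.path.dirname(cwd)
    # Already at the project root (or somewhere else): leave the path unchanged
    return cwd


# Illustrative path only; it does not need to exist on disk
example = os.path.join("workspace", "project", "jupyter_notebooks")
print(project_root(example))

# In the notebook this could replace the unconditional chdir:
# os.chdir(project_root(os.getcwd()))
```

Because the function returns the path unchanged once the working directory is no longer the notebooks folder, running it repeatedly has no further effect.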

## Import Libraries

In [None]:
import pandas as pd
from ydata_profiling import ProfileReport
from feature_engine.encoding import OneHotEncoder
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
import ppscore as pps

# Data Exploration

## Load Cleaned Data

In [None]:
df = pd.read_csv("outputs/datasets/cleaned/HousePricesCleaned.csv")
df.head(5)

## Generate a Profile Report to Recap Data Types and General Insights

In [None]:
ydata_report = ProfileReport(df=df, minimal=True)
ydata_report.to_notebook_iframe()

## Correlation Analysis

Let's examine both the Spearman and Pearson coefficients. The functions used to conduct this analysis are taken from Code Institute's "Churnometer" walkthrough project.
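
To illustrate why it is worth looking at both coefficients, here is a small sketch on invented data (the columns `x` and `y` are not part of the housing dataset). The relationship is perfectly monotonic but far from linear, so Spearman reports a perfect score while Pearson does not.

```python
import numpy as np
import pandas as pd

# Toy data: y grows exponentially with x, so the ranks match exactly
toy = pd.DataFrame({"x": np.arange(1, 11)})
toy["y"] = np.exp(toy["x"])

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two values for a housing feature would hint at a relationship with sale price that is monotonic but not linear.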

First, as Spearman and Pearson tests deal only with numerical values, we will use OneHotEncoder to encode categorical variables as corresponding numerical values.

In [None]:
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_encoded = encoder.fit_transform(df)
print(df_encoded.shape)
df_encoded.head(3)
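
As a rough illustration of what the encoding step produces, the sketch below applies one-hot encoding to a toy frame. It uses `pd.get_dummies` rather than feature_engine's `OneHotEncoder`, purely to keep the example dependency-free, and the column names and values are invented.

```python
import pandas as pd

# Toy frame: one categorical column and one numeric column (invented values)
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex", "TA"],
    "SalePrice": [200000, 150000, 320000, 145000],
})

# Each category becomes its own 0/1 indicator column
toy_encoded = pd.get_dummies(toy, columns=["KitchenQual"], dtype=int)

print(toy_encoded.shape)
print(toy_encoded)
```

Each of the three categories becomes its own indicator column, so the two-column frame grows to four columns. The same growth explains the larger shape printed for `df_encoded`.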

Then we will visualise the correlations:

In [None]:
def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print("PPS detects linear or non-linear relationships between two columns.\n"
          "The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
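
The masking logic in `heatmap_corr` can be checked on a small, invented correlation matrix. The sketch below rebuilds the same mask (upper triangle including the diagonal, plus any cell whose absolute value is below the threshold) and shows which cells would remain visible. The feature names `A` and `B` are placeholders.

```python
import numpy as np
import pandas as pd

# Invented 3x3 correlation matrix
labels = ["SalePrice", "A", "B"]
corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.5],
     [0.1, 0.5, 1.0]],
    index=labels, columns=labels,
)
threshold = 0.45

# True means "hide this cell" in seaborn's heatmap
mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True               # upper triangle and diagonal
mask |= (corr.abs() < threshold).to_numpy()           # weak correlations

# Replace hidden cells with NaN to see what the heatmap would annotate
visible = corr.mask(mask)
print(visible)
```

Only the lower-triangle cells at or above the threshold survive, which is why raising `CorrThreshold` leaves fewer annotated cells in the plot.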

In [None]:
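# CalculateCorrAndPPS returns three objects, so unpack all three
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df_encoded)

In [None]:
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman,
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.45,
                  PPS_Threshold=0.2,  # assumed value: adjust using the PPS IQR printed above
                  figsize=(20, 12), font_annot=10)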
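
For intuition about where a score between 0 and 1 can come from, the sketch below shows the normalisation idea for a numeric target: a model's mean absolute error is compared with that of a naive predictor that always returns the median. This is an assumption-laden simplification. The helper `normalised_mae_score` is hypothetical and not part of ppscore, which, to my understanding, also fits and cross-validates a decision tree to obtain the model predictions.

```python
import numpy as np


def normalised_mae_score(y_true, y_pred):
    """Hypothetical helper: 1 - MAE(model) / MAE(naive median), floored at 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae_naive = np.mean(np.abs(y_true - np.median(y_true)))
    mae_model = np.mean(np.abs(y_true - y_pred))
    if mae_naive == 0:
        # Constant target: nothing to predict, so report no predictive power
        return 0.0
    return float(max(0.0, 1.0 - mae_model / mae_naive))


prices = [1, 2, 3, 4, 5]
print(normalised_mae_score(prices, [1, 2, 3, 4, 5]))  # perfect predictions
print(normalised_mae_score(prices, [3, 3, 3, 3, 3]))  # no better than the median
```

A score near 0 therefore means the feature does not help beyond a naive guess, while a score near 1 means it predicts the target almost perfectly.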

Next, let's list the variables most strongly correlated with sale price, our target variable, ordered by the absolute strength of the correlation, for both the Spearman and Pearson coefficients. The slice starts at position 1 so that SalePrice itself, which correlates perfectly with itself and therefore sorts first, is omitted.

In [None]:
# Use the encoded frame so that corr() receives numeric columns only
corr_spearman = df_encoded.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
corr_spearman

In [None]:
# Use the encoded frame so that corr() receives numeric columns only
corr_pearson = df_encoded.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
corr_pearson
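
The ranking idiom used in the two cells above can be checked on a toy Series. The values and the feature names `A`, `B` and `C` are invented. Dropping the target by label is shown alongside the positional slice as a possible alternative that does not rely on SalePrice sorting first.

```python
import pandas as pd

# Invented correlations of each variable with the target
corr_with_target = pd.Series(
    [1.0, -0.7, 0.3, 0.5],
    index=["SalePrice", "A", "B", "C"],
)

# Sort by absolute value, then skip the first entry (the target itself)
ranked = corr_with_target.sort_values(key=abs, ascending=False)[1:]

# Alternative: remove the target by label before sorting
ranked_by_label = corr_with_target.drop("SalePrice").sort_values(key=abs, ascending=False)

print(ranked)
```

Sorting with `key=abs` keeps strong negative correlations, such as `A` here, near the top instead of pushing them to the bottom of the list.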

---


# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid Python
except Exception as e:
    print(e)
