# **Notebook Correlation Study**

## Objectives

* Determine which house attributes correlate with the target. The target is 'sale price'.

## Inputs

* Cleaned dataset from the previous step (data_cleaning), namely outputs/datasets/house_prices_cleaned.csv

## Outputs

* Business requirement number one states that the client expects data visualisations of the variables/attributes correlated with sale price. The outputs of this notebook are therefore data plots/visualisations to be used in the Streamlit App.

## Conclusions from the correlation study

*  


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\My_Folders\\CodeInstitute\\Project_5_files\\Project-5\\Project-5\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\My_Folders\\CodeInstitute\\Project_5_files\\Project-5\\Project-5'

In [4]:
import os; print(os.getcwd())

c:\My_Folders\CodeInstitute\Project_5_files\Project-5\Project-5


# Load Data

In [5]:
import pandas as pd

# Relative path; assumes the working directory is the project root set above
file_path = "outputs/datasets/cleaned/HousePricesCleaned.csv"

df = pd.read_csv(file_path)


# Data Exploration

Data profiling is used to review the dataset, with particular focus on how individual variables correlate with the target ('sale price').

Data exploration is done with Pandas Profiling.
Pandas Profiling shows that 10 variables have missing values. It is unclear why values are missing here, since house_prices_cleaned.csv, in which all missing data was imputed, was used as the source for the profiling. I will proceed from here and see whether the variables with missing values can still be removed in a subsequent step.
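One way to check the report's claim directly is to count missing values per column with pandas. The sketch below uses a small made-up DataFrame (`toy_df` is hypothetical, not the project data); on the real dataset the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset
toy_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "GarageYrBlt": [2003.0, 1976.0, np.nan],
})

# Count missing values per column and keep only columns that have any
missing_counts = toy_df.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

If the same check returns an empty Series for `df`, the loaded file itself has no missing values, and the discrepancy would have to come from somewhere else, for example from which file was actually profiled.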

---

In [6]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()


# Correlation and PPS Analyses

Pandas Profiling shows one variable to be of type "text" rather than "numeric". From the DataCleaning&FeatureEngineering Excel spreadsheet, it is known that this is a categorical variable of ordinal nature.

This categorical variable must be encoded with OneHotEncoder before it can be used in the correlation and PPS analyses.
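To illustrate what one-hot encoding produces, here is a minimal sketch that uses `pandas.get_dummies` on made-up data instead of feature_engine's `OneHotEncoder`. The toy values are assumptions for illustration only; the column name mirrors the `KitchenQual_*` columns that appear in the encoded output.

```python
import pandas as pd

# Hypothetical toy column with kitchen-quality style categories
toy_df = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Gd", "Ex"]})

# Each category becomes its own 0/1 indicator column
toy_ohe = pd.get_dummies(toy_df, columns=["KitchenQual"], dtype=int)
print(toy_ohe)
```

Every row has exactly one indicator set to 1, which makes the variable numeric and therefore usable in a correlation matrix.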



In [7]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(5)


(1460, 22)


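| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtFinSF1 | BsmtUnfSF | GarageArea | GarageYrBlt | GrLivArea | LotArea | LotFrontage | ... | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd | SalePrice | KitchenQual_Gd | KitchenQual_TA | KitchenQual_Ex | KitchenQual_Fa |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 856 | 854.0 | 3.0 | 706 | 150 | 548 | 2003.0 | 1710 | 8450 | 65.0 | ... | 5 | 7 | 856 | 2003 | 2003 | 208500 | 1 | 0 | 0 | 0 |
| 1 | 1262 | 0.0 | 3.0 | 978 | 284 | 460 | 1976.0 | 1262 | 9600 | 80.0 | ... | 8 | 6 | 1262 | 1976 | 1976 | 181500 | 0 | 1 | 0 | 0 |
| 2 | 920 | 866.0 | 3.0 | 486 | 434 | 608 | 2001.0 | 1786 | 11250 | 68.0 | ... | 5 | 7 | 920 | 2001 | 2002 | 223500 | 1 | 0 | 0 | 0 |
| 3 | 961 | 0.0 | 2.883272 | 216 | 540 | 642 | 1998.0 | 1717 | 9550 | 60.0 | ... | 5 | 7 | 756 | 1915 | 1970 | 140000 | 1 | 0 | 0 | 0 |
| 4 | 1145 | 0.0 | 4.0 | 655 | 490 | 836 | 2000.0 | 2198 | 14260 | 84.0 | ... | 5 | 8 | 1145 | 2000 | 2000 | 250000 | 1 | 0 | 0 | 0 |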
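The `Unnamed: 0` column in the output above matches the pattern pandas produces when a CSV is written with its index and then read back without `index_col`. Whether that is what happened in the cleaning notebook is an assumption; the sketch below only reproduces the pattern on made-up data and shows one way to avoid carrying the stray column into the correlation analysis.

```python
import io
import pandas as pd

# Hypothetical two-row frame written the default way (index included)
toy_df = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
csv_text = toy_df.to_csv()

# Reading it back naively turns the saved index into an extra column
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# Treating the first column as the index avoids the extra column
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

Left in place, such a column is treated as a numeric feature by `df.corr()` and `pps.matrix()`, so it would show up in the heatmaps alongside the real attributes.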


---

- Heatmaps are created to visualise the correlation matrices of the dataset
- The correlation matrices and the PPS matrix are calculated and the results displayed
- Code taken from CI Predictive Power Score Unit 1: Introduction, Video No. 6
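The difference between the two correlation methods used below can be seen on a small example. This is a sketch on made-up data (the `x` and `y` columns are hypothetical): `y` always increases with `x`, but not along a straight line, so the rank-based Spearman coefficient is perfect while the Pearson coefficient is lower.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically with x, but not linearly
toy_df = pd.DataFrame({"x": range(1, 11)})
toy_df["y"] = toy_df["x"] ** 3

spearman = toy_df.corr(method="spearman").loc["x", "y"]
pearson = toy_df.corr(method="pearson").loc["x", "y"]

print(f"Spearman: {spearman:.3f}")
print(f"Pearson:  {pearson:.3f}")
```

This is why both matrices are worth plotting: a feature with a strong but curved relationship to sale price can score noticeably higher on Spearman than on Pearson.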

In [8]:
import os

# Create directories if they don't exist
os.makedirs('docs/plots', exist_ok=True)


In [None]:
import numpy as np
import ppscore as pps
import seaborn as sns
sns.set(style="whitegrid")
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.title("This is a test plot saved inside docs/plots")
plt.savefig('docs/plots/test_plot.png', bbox_inches='tight')
plt.show()

def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
  """
  Create a heatmap from a correlation matrix and save it to docs/plots
  """
  if len(df.columns) > 1:
    # Boolean mask: True hides a cell (upper triangle and weak correlations)
    mask = np.zeros_like(df, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    mask[abs(df) < threshold] = True

    fig, axes = plt.subplots(figsize=figsize)
    sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                linewidth=0.5)
    axes.set_yticklabels(df.columns, rotation=0)
    plt.ylim(len(df.columns), 0)
    # Save heatmaps to the docs folder; the relative path assumes the working
    # directory is the project root (set at the top of the notebook)
    if df.name == "corr_spearman":
      plt.savefig('docs/plots/heatmap_corr_spearman.png', bbox_inches='tight')
    else:
      # Any other matrix (here: Pearson) gets its own file
      plt.savefig('docs/plots/heatmap_corr_pearson.png', bbox_inches='tight')
    plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
  """
  Create heatmap with PPS (predictive power score).
  """
  if len(df.columns) > 1:
    # Boolean mask: True hides scores below the threshold
    mask = np.zeros_like(df, dtype=bool)
    mask[abs(df) < threshold] = True

    fig, ax = plt.subplots(figsize=figsize)
    ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                     mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                     linewidth=0.05, linecolor='grey')

    plt.ylim(len(df.columns), 0)
    # Save heatmap to docs folder
    plt.savefig('docs/plots/heatmap_pps.png', bbox_inches='tight')
    plt.show()


def CalculateCorrAndPPS(df):
  """
  Function that calculates the correlations and predictive power scores
  """
  df_corr_spearman = df.corr(method="spearman")
  df_corr_spearman.name = 'corr_spearman'
  df_corr_pearson = df.corr(method="pearson")
  df_corr_pearson.name = 'corr_pearson'

  pps_matrix_raw = pps.matrix(df)
  pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

  pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
  print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
  print(pps_score_stats.round(3))

  return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix,CorrThreshold,PPS_Threshold,
                      figsize=(20,12), font_annot=8 ):
  """
  Function to display correlations and predictive power scores
  """

  print("\n")
  print("* Analyze how the target variable for your ML models is correlated with other variables (features and target)")
  print("* Analyze multicollinearity, that is, how the features are correlated among themselves")

  print("\n")
  print("*** Heatmap: Spearman Correlation ***")
  print("It evaluates monotonic relationship \n")
  heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Pearson Correlation ***")
  print("It evaluates the linear relationship between two continuous variables \n")
  heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Predictive Power Score (PPS) ***")
  print("PPS detects linear or non-linear relationships between two columns.\n"
        "The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
  heatmap_pps(df=pps_matrix,threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)

  plt.show()
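
The masking step in `heatmap_corr` decides which cells of the heatmap are drawn. The sketch below replays that logic on a small made-up correlation matrix; the matrix values and the `threshold` of 0.4 are assumptions for illustration, and `np.zeros` with `.to_numpy()` is used here in place of the notebook's `np.zeros_like(df, ...)` so the example depends only on plain NumPy arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 correlation matrix for illustration
corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.5],
     [0.1, 0.5, 1.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)
threshold = 0.4  # assumed value

# True means "hide this cell" in seaborn's heatmap
mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True              # diagonal and upper triangle
mask[(corr.abs() < threshold).to_numpy()] = True     # weak correlations

print(mask)
```

Only the lower-triangle cells at or above the threshold remain visible, so each pair of variables appears once and weak relationships do not clutter the plot.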


---

# Push files to Repo


In [10]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block has a body until a folder name is set
except Exception as e:
  print(e)