# **Notebook 3: Split data**

## Objectives

* Generate train and test sets for feature engineering.
* Run correlation and PPS analysis to visualize the relationships between the variables in the dataset.


## Inputs

* outputs/datasets/cleaned/house_prices_records_cleaned.csv

## Outputs

* Generate Train and Test sets from cleaned data, saved under outputs/datasets/cleaned/train and outputs/datasets/cleaned/test

## Comment

* Note for readers: in the previous notebooks the main dataframe was called 'records_df'. From this notebook on, it is simply called 'df', with variations when we use a subset of it for a particular purpose. In this notebook we split df into train and test sets, named 'TrainSet' and 'TestSet' respectively.



---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
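
The effect of os.path.dirname() can be checked on a plain string without touching the real working directory. This is a minimal sketch that assumes a hypothetical folder layout; the path below is illustrative and not taken from this project.

```python
import os

# Hypothetical notebook location, used only for illustration
example_dir = os.path.join("workspace", "project", "jupyter_notebooks")

# os.path.dirname() strips the last path component, giving the parent folder
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```

Note that the os.chdir() cell moves up one level each time it runs, so running it twice in the same session would typically leave the notebook one folder too high.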

# Load data

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/cleaned/house_prices_records_cleaned.csv"
df = pd.read_csv(df_raw_path)
df.head(3)

# Split train and test data

In the cell below, the target variable 'SalePrice' is separated from the rest of the data, and the split produces both a train and a test set for the features (TrainSet and TestSet) and for the target (train_target and test_target).

The code differs from the walkthrough because there I was passing df['SalePrice'] to train_test_split() as the target variable while also leaving it in df. That would put 'SalePrice' in both the features and the target, which is not what we want.

In [None]:
from sklearn.model_selection import train_test_split

features = df.drop('SalePrice', axis=1)  # drop the target variable from the feature set
target = df['SalePrice']

TrainSet, TestSet, train_target, test_target = train_test_split(
    features,
    target,
    test_size=0.2,
    random_state=0
)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")


As we see in the output, the train set has 1168 rows which is 80% of the data, and the test set accounts for the remaining 20%.
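
The row counts can be reproduced by hand. The sketch below assumes 1460 rows in total (inferred from 1168 being 80% of the data, not read from the file) and assumes that train_test_split() rounds the test share up and gives the remaining rows to the train set.

```python
import math

n_rows = 1460      # assumed total, inferred from 1168 / 0.8
test_size = 0.2    # same value as in the split above

# Assumed rounding rule: test rows are rounded up, train rows get the rest
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```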

We then save the train and test sets in their respective folders.

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/train/TrainSetCleaned.csv", index=False)
TestSet.to_csv("outputs/datasets/cleaned/test/TestSetCleaned.csv", index=False)

We also save the target variable for each set.

In [None]:
train_target.to_csv("outputs/datasets/cleaned/train/TrainSetTarget.csv", index=False)
test_target.to_csv("outputs/datasets/cleaned/test/TestSetTarget.csv", index=False)
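
to_csv() does not create missing folders, and a target saved as a single-column file comes back from read_csv() as a DataFrame. The sketch below illustrates both points with toy values in a temporary directory; the folder name and the numbers are hypothetical, not the project's real outputs.

```python
import os
import tempfile

import pandas as pd

# Toy target values; the real ones come from df['SalePrice']
toy_target = pd.Series([200000, 150000, 310000], name="SalePrice")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output folder, created on demand so to_csv() does not fail
    out_dir = os.path.join(tmp_dir, "train")
    os.makedirs(out_dir, exist_ok=True)

    path = os.path.join(out_dir, "TrainSetTarget.csv")
    toy_target.to_csv(path, index=False)

    # read_csv() returns a DataFrame, so select the column to get a Series back
    reloaded = pd.read_csv(path)["SalePrice"]

print(reloaded.equals(toy_target))
```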

# Load cleaned training and test sets

Train set

In [None]:
import pandas as pd

TrainSet = pd.read_csv("outputs/datasets/cleaned/train/TrainSetCleaned.csv")
TrainSet.head()

Test set

In [None]:
TestSet = pd.read_csv("outputs/datasets/cleaned/test/TestSetCleaned.csv")
TestSet.head()

# Data Exploration

In this section we evaluate which transformations we could apply to our variables.

In [None]:
%matplotlib inline

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()
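
As a lightweight numeric complement to the profile report, skewness gives a first hint of which variables might benefit from a transformation. This is a sketch on made-up columns, not on TrainSet, and choosing skewness as the criterion is an assumption rather than something the original notebook does.

```python
import pandas as pd

# Made-up columns: one symmetric, one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 1, 2, 2, 3, 50],
})

# Values near 0 suggest a symmetric distribution; large positive values a right tail
skewness = toy.skew()
print(skewness.round(2))
```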

# Correlation and PPS Analysis

Note: The first notebooks in this project do not follow the same order as the walkthrough project Churnometer: here, data cleaning was done before the EDA (which is a valid approach). We therefore assess correlation levels and PPS here, before starting the feature engineering process in the next notebook.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)  # np.bool was removed in NumPy 1.24
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)  # np.bool was removed in NumPy 1.24
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
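    # Correlation is only defined for numeric columns, so select them explicitly
    # (pandas 2.0+ raises an error on text columns instead of dropping them)
    df_numeric = df.select_dtypes(include="number")
    df_corr_spearman = df_numeric.corr(method="spearman")
    df_corr_pearson = df_numeric.corr(method="pearson")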

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
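
The masking logic in heatmap_corr() can be checked on a small matrix. The sketch below uses made-up correlation values and a threshold of 0.6; it hides the upper triangle (including the diagonal) and every cell whose absolute value is below the threshold, which is how the function above decides what to draw.

```python
import numpy as np
import pandas as pd

# Made-up correlation matrix for three hypothetical variables
toy_corr = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, -0.7],
     [0.2, -0.7, 1.0]],
    index=list("abc"), columns=list("abc"),
)
threshold = 0.6

mask = np.zeros(toy_corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True                   # hide upper triangle and diagonal
mask[np.abs(toy_corr.to_numpy()) < threshold] = True      # hide weak correlations

# True means the cell is hidden in the heatmap
print(mask)
```

Only the cells for (b, a) and (c, b) stay visible here, because their absolute values reach the threshold and they sit below the diagonal.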

Calculate Correlations and Predictive Power Score

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)
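
Inside CalculateCorrAndPPS(), the long table with columns 'x', 'y' and 'ppscore' is pivoted into a square matrix. The sketch below reproduces that step on a hand-written stand-in for the pps.matrix() output; the scores are made up.

```python
import pandas as pd

# Hand-written stand-in for the output of pps.matrix(); the values are made up
toy_long = pd.DataFrame({
    "x": ["a", "a", "b", "b"],
    "y": ["a", "b", "a", "b"],
    "ppscore": [1.0, 0.4, 0.1, 1.0],
})

# Rows are the predicted variable (y), columns are the predictor (x)
toy_matrix = toy_long.pivot(columns="x", index="y", values="ppscore")
print(toy_matrix)
```

In this layout the cell in row 'b', column 'a' reads as "how well a predicts b", and it does not have to match the cell in row 'a', column 'b'.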

What does this mean?

- count: There are 462 pairs of variables for which the PPS was calculated.
- mean: On average, the PPS among all the pairs is around 0.061. This indicates that, on average, variables have a relatively weak predictive power.
- std: The standard deviation is 0.105, which indicates a relatively large variation in the PPS scores across different pairs of variables.
- min: The lowest PPS score among all pairs is 0. This means that for at least one pair, there is no predictive power from one variable to another.
- 25% (1st Quartile): 25% of the pairs have a PPS of 0.
- 50% (Median): At least half of the pairs have a PPS of 0. This indicates that for many variable pairs there is no predictive power at all.
- 75% (3rd Quartile): 75% of the pairs have a PPS of at most 0.083. This further demonstrates that the majority of variable pairs have very low predictive power.
- max: The highest PPS score among all pairs is 0.625. This means that there is at least one pair of variables where one variable predicts the other fairly well. Note that PPS is a normalized score between 0 and 1, not an accuracy percentage.

The output suggests that most pairs of variables have very low or no predictive power, but there are a few pairs with relatively high predictive power (max PPS is 0.625).
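
The count of 462 can be sanity-checked with simple arithmetic. The sketch assumes that the PPS matrix scores every ordered pair of columns, including each column with itself, and that the self-pairs are the only ones removed by the "ppscore < 1" filter; the column count of 22 is an assumption chosen because it matches the reported figure.

```python
n_columns = 22  # assumed; chosen because 22 * 21 matches the reported count

n_all_pairs = n_columns ** 2   # every ordered (x, y) combination
n_self_pairs = n_columns       # a column paired with itself scores 1
print(n_all_pairs - n_self_pairs)
```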

Therefore we display heatmaps to help visualize the relationships among the variables in the dataset. Note that the thresholds are set relatively high, because the maps become difficult to read when all moderately correlated variables are shown.

In [None]:
%matplotlib inline

In [None]:
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman,
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.6, PPS_Threshold=0.3,
                  figsize=(12, 10), font_annot=10)

A couple of observations from the heatmaps:
- 'TotalBsmtSF' and '1stFlrSF':
These variables are highly correlated with each other, as shown by their high Spearman (0.83) and Pearson (0.82) scores. This means that they tend to increase or decrease together. For example, a larger basement ('TotalBsmtSF') often comes with a larger first floor ('1stFlrSF'). Their Predictive Power Score (PPS) of 0.44 indicates moderate predictive power. In other words, knowledge of 'TotalBsmtSF' would be moderately useful in predicting the value of '1stFlrSF'.

- 'YearBuilt' and 'GarageYrBlt':
Again, these variables are highly correlated, with Spearman (0.85) and Pearson (0.78) scores. This implies that garages tend to be built in the same year as the house itself. The high PPS of 0.59 indicates that the year a house was built ('YearBuilt') is a good predictor of the year its garage was built ('GarageYrBlt'). Since PPS is directional, the score for the reverse direction is read from its own cell in the heatmap.

- This list is not exhaustive; several other pairs of variables show similar tendencies.

The findings suggest that the variables within each of these pairs carry similar information, which indicates multicollinearity.
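
Reading pairs off a heatmap can be complemented by listing them. The sketch below extracts the pairs above a threshold from a toy correlation matrix; only the 0.82 between 'TotalBsmtSF' and '1stFlrSF' comes from the observations above, while the 'LotArea' values are made up for illustration.

```python
import numpy as np
import pandas as pd

cols = ["TotalBsmtSF", "1stFlrSF", "LotArea"]
toy_corr = pd.DataFrame(
    [[1.00, 0.82, 0.10],
     [0.82, 1.00, 0.30],
     [0.10, 0.30, 1.00]],
    index=cols, columns=cols,
)
threshold = 0.6

# Keep only the strict lower triangle so each pair appears once
lower = np.tril(np.ones(toy_corr.shape, dtype=bool), k=-1)
pairs = toy_corr.where(lower).stack()

# Keep the pairs whose absolute correlation reaches the threshold
high_pairs = pairs[pairs.abs() >= threshold]
print(high_pairs)
```

The same steps could be applied to df_corr_pearson or df_corr_spearman to obtain the full list of candidate pairs for the feature engineering notebook.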

---

# Conclusion

- The analysis shows overall low predictive power in most pairs of variables, with a few variables with higher PPS.
- There is a tendency towards multicollinearity among several variable pairs, which is important to keep in mind in the upcoming modeling stages.

# Next step

- In the next notebook we engineer our features. 
- Since we have 4 categorical variables, these will need to be transformed into numerical variables. 
- We want the numerical variables to be as close to normally distributed as possible, and will therefore apply numerical transformations.
- Lastly we will handle features that are highly correlated with one another, which we saw in this notebook.