# **Data Cleaning Notebook**

## Objectives

* Prepare the data sets for further analysis.
* Load and inspect the data prepared during data collection.
* Data Cleaning.
* Conclusion and next steps.

## Inputs

* outputs/datasets/collection/breast-cancer.csv

## Outputs

* Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned

## Additional Comments

* As previously mentioned, this project relies on the guidelines provided in the walkthrough projects and the lessons provided throughout the course (especially in the Predictive Analysis)
* Therefore, this notebook relates to the Data Preparation step of the CRISP-DM methodology.
* ***Don't forget: Garbage In, Garbage Out.***


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
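
As a minimal sketch of what `os.path.dirname()` returns, the snippet below applies it to a made-up path without touching the real working directory. The folder names (`workspace`, `project`, `jupyter_notebooks`) are hypothetical and only serve as an illustration.

```python
import os

# Hypothetical notebook folder; in practice this would be the output of os.getcwd()
example_dir = os.path.join(os.sep, "workspace", "project", "jupyter_notebooks")

# os.path.dirname() strips the last path component, giving the parent folder
parent_dir = os.path.dirname(example_dir)

print(f"notebook folder: {example_dir}")
print(f"parent folder:   {parent_dir}")
```

Note that re-running the `os.chdir` cell moves up one more level each time, so it should be run only once per session.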

# Load Collected Data

In [None]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/breast-cancer.csv')
df.head()

# Data Exploration

We will check the distribution and shape of any variables with missing data.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

We notice that there are no variables with missing data.
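
To illustrate how the missing-data filter behaves when gaps do exist, here is a small sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration and are not part of the breast cancer dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): only column 'b' contains a missing entry
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})

# isna().sum() counts missing values per column; the boolean mask keeps
# only the columns whose count is above zero
toy_missing = toy.columns[toy.isna().sum() > 0].to_list()
print(toy_missing)

# After removing the incomplete row, the same filter returns an empty list
toy_complete = toy.dropna()
complete_missing = toy_complete.columns[toy_complete.isna().sum() > 0].to_list()
print(complete_missing)
```

An empty list is what triggers the "There are no variables with missing data" branch in the cell above.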

# Correlation and PPS Analysis

* We will create a replacement for the categorical dtype check that ppscore uses, so that it is compatible with recent versions of pandas.
* This removes the deprecation warning that would otherwise appear.

In [None]:
from pandas.api.types import CategoricalDtype
import ppscore.calculation as pps_calc

def _is_categorical_dtype(series):
    """
    Check if a pandas Series is of categorical dtype.
    This fixes the PPS deprecation warning.
    """
    return isinstance(series.dtype, CategoricalDtype)

pps_calc.is_categorical_dtype = _is_categorical_dtype
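
The behaviour of this `isinstance`-based check can be sketched on two toy Series. The function name `is_categorical` and the sample values are hypothetical; the check itself mirrors the helper defined above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def is_categorical(series):
    """Same isinstance-based check as the helper above, under a hypothetical name."""
    return isinstance(series.dtype, CategoricalDtype)


labels = pd.Series(["M", "B", "M"])        # plain text labels, not categorical
labels_cat = labels.astype("category")     # same values stored as a categorical

print(is_categorical(labels))
print(is_categorical(labels_cat))
```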

In [None]:
%matplotlib inline

In the following code cell we will:
* Create a function to plot a heatmap to visualize Spearman and Pearson correlations among the variables.
* Create a function to plot a heatmap to visualize the Predictive Power Score (PPS) among the variables.
* Create a function to calculate correlation and PPS.
* Create a function to display the heatmaps in succession.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps


def heatmap_corr(df, threshold, figsize=(20, 15), font_annot=10, rotation=45):
    """
    Plot a heatmap for the correlation matrix.
    Used to visualize spearman and pearson correlations.
    """
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, fmt=".2f", xticklabels=True, yticklabels=True,
                    mask=mask, cmap='plasma_r', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5, cbar_kws={'shrink': 0.5})
        axes.set_yticklabels(df.columns, rotation=0)
        axes.set_xticklabels(df.columns, rotation=rotation, ha='right')
        plt.ylim(len(df.columns), 0)
        plt.tight_layout()
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 15), font_annot=10, rotation=45):
    """
    Plot a heatmap for the Predictive Power Score (PPS) matrix.
    Used to visualize the predictive power of features.
    """
    if len(df.columns) > 1:
        # Hide cells whose score is below the threshold
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        # Pass ax explicitly so the heatmap is drawn on the axes created above
        sns.heatmap(df, annot=True, fmt=".2f", xticklabels=True, yticklabels=True,
                    mask=mask, cmap='rocket_r', annot_kws={"size": font_annot}, ax=ax,
                    linewidth=0.05, linecolor='grey', cbar_kws={'shrink': 0.5})
        ax.set_yticklabels(df.columns, rotation=0)
        ax.set_xticklabels(df.columns, rotation=rotation, ha='right')
        plt.ylim(len(df.columns), 0)
        plt.tight_layout()
        plt.show()


def CalculateCorrAndPPS(df):
    """
    Calculate correlation and Predictive Power Score (PPS) matrices.
    """
    df_corr_spearman = df.corr(method="spearman", numeric_only=True)
    df_corr_pearson = df.corr(method="pearson", numeric_only=True)

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                    figsize=(20, 12), font_annot=8):
    """
    Display the correlation and PPS heatmaps.
    """
    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print("PPS detects linear or non-linear relationships between two columns.\n"
        "The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
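
The masking logic used by `heatmap_corr` can be sketched without any plotting. The 3x3 matrix below is made up for illustration, and the names `corr` and `visible` are hypothetical. The sketch converts the comparison to a NumPy array explicitly with `.to_numpy()`, a small variation on the cell above.

```python
import numpy as np
import pandas as pd

# Made-up 3x3 correlation matrix for illustration only
corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.5],
     [0.1, 0.5, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"])
threshold = 0.4

mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True             # hide upper triangle and diagonal
mask[(corr.abs() < threshold).to_numpy()] = True    # hide correlations below the threshold

# Cells that would still be drawn on the heatmap, as (row, column, value)
visible = [(corr.index[i], corr.columns[j], corr.iat[i, j])
           for i, j in zip(*np.where(~mask))]
print(visible)
```

Only the pairs at or above the threshold in the lower triangle remain, which is why a higher `CorrThreshold` produces a sparser heatmap.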

### Calculate Correlations and Predictive Power Score

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

### Display Heatmaps

In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                df_corr_spearman = df_corr_spearman, 
                pps_matrix = pps_matrix,
                CorrThreshold = 0.4, PPS_Threshold =0.2,
                figsize=(20,15), font_annot=10)

## Dataset Analysis

### Correlation and Predictive Power Score (PPS) Analysis

The Spearman and Pearson heatmaps and the PPS matrix show that:
* The worst (largest) values of radius, texture, area and perimeter correlate highly with their corresponding mean values, and also with one another.
* The correlation between the mean and worst values of the concave variables is very high.
* Nearly all of the variables show some correlation, weak or strong, so it is better to retain them for our future ML tasks.
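
One way to back up observations like these with numbers is to rank the feature pairs by absolute correlation. The sketch below does this on synthetic data. The column names (`size_mean`, `size_worst`, `noise`) and the generated values are hypothetical and do not come from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "size_mean": base,
    "size_worst": base * 1.5 + rng.normal(scale=0.1, size=200),  # tracks size_mean closely
    "noise": rng.normal(size=200),                               # unrelated column
})
corr = toy.corr(method="spearman")

# Keep each pair once (strict lower triangle) and sort by absolute strength
rows, cols = np.tril_indices_from(corr.to_numpy(), k=-1)
pairs = sorted(
    ((corr.index[r], corr.columns[c], corr.iat[r, c]) for r, c in zip(rows, cols)),
    key=lambda p: abs(p[2]), reverse=True)

for left, right, value in pairs:
    print(f"{left} ~ {right}: {value:.2f}")
```

Applied to `df_corr_spearman`, the same ranking would list the strongest pairs first.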

## Data Cleaning

### Assess missing data levels

In [None]:
df.head(3)

In [None]:
def EvaluateMissingData(df):
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute/len(df)*100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_data_absolute,
                                "PercentageOfDataset": missing_data_percentage,
                                "DataType": df.dtypes}
                                    )
                        .sort_values(by=['PercentageOfDataset'], ascending=False)
                        .query("PercentageOfDataset > 0")
                        )

    return df_missing_data

In [None]:
EvaluateMissingData(df)

* Again, we confirm that there are no variables with missing data, so we can proceed to the next step.

### Data Cleaning Summary

* We will convert `diagnosis` values from `M` and `B` to `1` and `0` respectively, so that the saved datasets already have the target variable in a numeric format and any ML model can consume it directly without extra preprocessing later.

* We will split the dataset into train and test sets, with the train set containing 80% of the total dataset and the test set containing 20%.

* We will take the **Drop Variables** imputation approach and drop the `id` variable from our train and test sets.

### Convert 'diagnosis' values

In [None]:
df['diagnosis'].unique()

We will convert the diagnosis to binary values

* `M` = 1 (for Malignant)
* `B` = 0 (for Benign)

***!! Warning !!***
**( Do not run the next code cell twice. A second run maps the already converted values and turns every `diagnosis` value into `NaN`. )**

In [None]:
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
df.tail(3)
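
A guarded version of the conversion can be sketched as follows, assuming we only want to map the column while it still holds text labels. The helper name `encode_diagnosis` and the toy rows are hypothetical.

```python
import pandas as pd


def encode_diagnosis(frame):
    """Map 'M'/'B' to 1/0 only if the column is not already numeric."""
    out = frame.copy()
    if not pd.api.types.is_numeric_dtype(out["diagnosis"]):
        out["diagnosis"] = out["diagnosis"].map({"M": 1, "B": 0})
    return out


toy = pd.DataFrame({"diagnosis": ["M", "B", "B"]})  # made-up rows
once = encode_diagnosis(toy)
twice = encode_diagnosis(once)   # second call leaves the values unchanged
print(twice["diagnosis"].tolist())

# Without the guard, mapping the already converted column yields only NaN
unguarded = once["diagnosis"].map({"M": 1, "B": 0})
print(unguarded.isna().all())
```

With a guard like this, re-running the cell would be harmless, at the cost of a slightly longer cell.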

### Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split
# Split the full DataFrame; the target column stays in both sets
TrainSet, TestSet = train_test_split(df, test_size=0.2, random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")
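
As a quick sanity check on a split like this, the sketch below splits a made-up 10-row frame and prints the class proportions in each part. The frame `toy` and its columns are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 10 rows; 'diagnosis' is already encoded as 0/1
toy = pd.DataFrame({"feature": range(10), "diagnosis": [0, 1] * 5})

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

print(toy_train.shape, toy_test.shape)
# Share of each class in both parts; large differences hint at an unbalanced split
print(toy_train["diagnosis"].value_counts(normalize=True))
print(toy_test["diagnosis"].value_counts(normalize=True))
```

If the proportions differ noticeably between the two parts, the `stratify` parameter of `train_test_split` is one option to consider. It is not used in this notebook.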

### Drop Variables


* Step 1: imputation approach: Drop Variables
* Step 2: Select variables to apply the imputation approach

In [None]:
variables_method = ['id' ]

print(f"* {len(variables_method)} variables to drop \n\n"
    f"{variables_method}")

* We are dropping `id` because it is a unique identifier for each patient and adds no relevant value to our diagnosis prediction.

* Step 3: Create a separate DataFrame applying this imputation approach to the selected variables.

In [None]:
from feature_engine.selection import DropFeatures
imputer = DropFeatures(features_to_drop=variables_method)
imputer.fit(TrainSet)
df_method = imputer.transform(TrainSet)

* Step 4: Assess the effect on the variable's distribution.

In this case, there is no effect on the distribution of the variable, since we are not removing rows, but columns. 

* Step 5: Apply the transformation to our data.

In [None]:
from feature_engine.selection import DropFeatures
imputer = DropFeatures(features_to_drop=variables_method)
imputer.fit(TrainSet)

TrainSet, TestSet = imputer.transform(TrainSet) , imputer.transform(TestSet)
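
For comparison, the same column removal can be sketched with plain pandas. This is a swapped-in equivalent using `DataFrame.drop`, not the `DropFeatures` transformer used above, and the toy rows are made up.

```python
import pandas as pd

# Made-up rows; 'id' plays the role of the patient identifier
toy_train = pd.DataFrame({"id": [101, 102], "diagnosis": [1, 0], "feature": [0.5, 0.7]})
toy_test = pd.DataFrame({"id": [103], "diagnosis": [0], "feature": [0.6]})

columns_to_drop = ["id"]

# DataFrame.drop returns a new frame; the row count is unchanged
toy_train_clean = toy_train.drop(columns=columns_to_drop)
toy_test_clean = toy_test.drop(columns=columns_to_drop)

print(toy_train_clean.columns.to_list())
print(len(toy_train) == len(toy_train_clean))
```

The transformer version has the advantage that it can later be placed inside a scikit-learn pipeline, which plain `drop` calls cannot.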

* Step 6: Conclude the data cleaning process with a dataset assessment.

In [None]:
TrainSet.head(3)

In [None]:
TestSet.head(3)

---

# Push files to Repo

In [None]:
import os
# Create outputs/datasets/cleaned; exist_ok avoids an error if the folder already exists
os.makedirs(name='outputs/datasets/cleaned', exist_ok=True)


### Saving the Train Set

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

### Saving the Test Set

In [None]:
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
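
To confirm that a saved file reloads as expected, a round-trip check can be sketched in a temporary folder. The toy frame and the temporary location are hypothetical; the sub-path only mirrors `outputs/datasets/cleaned`.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"diagnosis": [1, 0], "feature": [0.5, 0.7]})  # made-up rows

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location mirroring outputs/datasets/cleaned
    out_dir = os.path.join(tmp_dir, "outputs", "datasets", "cleaned")
    os.makedirs(out_dir, exist_ok=True)   # no error if the folder already exists
    csv_path = os.path.join(out_dir, "TrainSetCleaned.csv")

    toy.to_csv(csv_path, index=False)     # index=False keeps the index out of the file
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape == toy.shape)
print(reloaded.columns.to_list())
```

Because `index=False` is used, the reloaded frame has no extra unnamed index column.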

Being satisfied with our data cleaning process, we save the Train and Test datasets and move on to the Feature Engineering process.