# CI Portfolio Project 5 - Filter Maintenance Predictor 2022
## **Data Cleaning Notebook**

## Objectives

*   Confirm / Evaluate missing data
*   Clean data in preparation for analysis

### Inputs

1. Test Dataset : `outputs/datasets/collection/PredictiveMaintenanceTest.csv`

2. Train Dataset : `outputs/datasets/collection/PredictiveMaintenanceTrain.csv`

### Outputs

* Generate cleaned Train and Test sets, both saved under `outputs/datasets/cleaned`

### Conclusions

  * Data Cleaning Pipeline
  * Drop Variables as Required

---

# Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("Current directory set to new location")

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Collection Data

In [None]:
import pandas as pd
# Load the collected train and test sets (plain strings; no f-string needed)
df_train = pd.read_csv('outputs/datasets/collection/PredictiveMaintenanceTrain.csv')
df_test = pd.read_csv('outputs/datasets/collection/PredictiveMaintenanceTest.csv')

In [None]:
df_train.info()

In [None]:
df_test.info()

---

# Data Exploration

### Evaluate Missing Data

We confirm that no variables have missing data; if any do, we examine their distribution and shape.
* Note: we are aware that the **df_train** dataset does not have values for `RUL`, so the two sets are checked separately.

If we combined the sets before checking, `RUL` would be reported as having missing values, like so:

In [None]:
df_total = pd.concat([df_train, df_test])
vars_with_missing_data = df_total.columns[df_total.isna().sum() > 0].to_list()
vars_with_missing_data
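
As a minimal illustration of why the combined check flags `RUL`, the sketch below uses two small made-up frames, one of which lacks the `RUL` column. The names `toy_a`, `toy_b` and all values are hypothetical and not taken from the project data. `pd.concat` aligns on column names and fills the absent column with `NaN`, which the missing-data check then reports.

```python
import pandas as pd

# Hypothetical toy frames, not the project data
toy_a = pd.DataFrame({'Differential_pressure': [1.0, 2.0]})               # no RUL column
toy_b = pd.DataFrame({'Differential_pressure': [3.0], 'RUL': [40.0]})     # has RUL

# Concatenating aligns columns by name; rows from toy_a get NaN for RUL
toy_total = pd.concat([toy_a, toy_b], ignore_index=True)
toy_missing = toy_total.columns[toy_total.isna().sum() > 0].to_list()
print(toy_missing)
```

The `NaN` values here come from the structure of the two frames rather than from gaps in the recorded observations, which is why the sets are checked separately.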

#### To check both datasets for missing data at the same time

Define a helper function that reports which dataframe is being checked

In [None]:
def name_dataframe(data):
    """Print the name of the global variable that refers to `data`."""
    # Scan the notebook's global namespace and take the first name bound to this object
    name = [n for n in globals() if globals()[n] is data][0]
    print(f'Dataframe name: {name}')

Check for missing data and, if any is found, display a profile report of the affected variables

In [None]:
from pandas_profiling import ProfileReport

for df in (df_train, df_test):
    vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
    if vars_with_missing_data:
        profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
        profile.to_notebook_iframe()
    else:
        name_dataframe(df)
        print('There are no variables with missing data')
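
The `globals()` lookup works inside a notebook, but it returns whichever matching global name happens to come first. A hedged alternative sketch, assuming it is acceptable to name the frames explicitly, keeps them in a dictionary. The function `missing_columns_by_frame` and the toy frames are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def missing_columns_by_frame(frames):
    """Return {frame name: list of columns containing NaN} for a dict of dataframes."""
    return {
        name: frame.columns[frame.isna().sum() > 0].to_list()
        for name, frame in frames.items()
    }

# Hypothetical toy frames standing in for df_train and df_test
toy_frames = {
    'toy_train': pd.DataFrame({'Time': [0.1, 0.2], 'Flow_rate': [80.0, 81.0]}),
    'toy_test': pd.DataFrame({'Time': [0.1, None], 'Flow_rate': [80.0, 81.0]}),
}

report = missing_columns_by_frame(toy_frames)
for name, cols in report.items():
    print(name, cols if cols else 'no variables with missing data')
```

Because the names are supplied by the caller, the report does not depend on what else is defined in the notebook's namespace.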

---

## Outliers in differential pressure observations

Within each bin, the size and direction of the first few `change_DP` observations suggest that they may be outliers. We have considered three main methods for dealing with outliers:
* Log transformation.
* Winsorize method.
* Dropping the outliers.

These will be handled in the [feature engineering](https://github.com/roeszler/filter-maintenance-predictor/blob/main/jupyter_notebooks/03_FeatureEngineering.ipynb) notebook
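
To make the three options concrete, here is a minimal sketch on a made-up series. The values, the percentile cut-offs and the 1.5 * IQR rule are assumptions chosen for illustration, not the settings used in the feature engineering notebook. Note also that a log transformation only applies to values above -1 when using `log1p`, so it would need care if `change_DP` takes negative values.

```python
import numpy as np
import pandas as pd

# Hypothetical values: small changes preceded by two large early readings
toy_change = pd.Series([9.0, 6.5, 0.4, 0.5, 0.3, 0.6, 0.5, 0.4, 0.5, 0.6])

# 1. Log transformation: compresses large values (log1p requires values > -1)
log_version = np.log1p(toy_change)

# 2. Winsorizing: cap values at chosen percentiles instead of removing them
#    (5th and 80th percentiles are arbitrary choices for this toy sample)
lower, upper = toy_change.quantile([0.05, 0.80])
winsorized = toy_change.clip(lower=lower, upper=upper)

# 3. Dropping: keep only the rows inside the 1.5 * IQR fences
q1, q3 = toy_change.quantile([0.25, 0.75])
iqr = q3 - q1
kept = toy_change[toy_change.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(len(toy_change), len(winsorized), len(kept))
```

Winsorizing and the log transformation keep every row, whereas dropping shortens the series, which matters when observations within a bin form a time sequence.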

In [None]:
df_bin = df_train[df_train['Data_No'] == 46]
df_bin

Plot of Bin 46 change in Differential Pressure

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df_bin = df_train[df_train['Data_No'] == 46]
# sns.boxplot(x = df_bin['change_DP'])
# sns.stripplot(x = df_bin['change_DP'])
# sns.swarmplot(x = df_bin['change_DP'])
sns.displot(x = df_bin['change_DP'])
# sns.barplot(x = df_bin['change_DP'])
# sns.relplot(x = df_bin['change_DP'], kind='line')
# sns.scatterplot(x = df_bin['change_DP'])
# sns.histplot(x = df_bin['change_DP'])
plt.show()

---

# Correlation Study

---

# Correlation and Predictive Power Score Analysis

`pip install ppscore`
* The following code is derived from Code Institute [Exploratory Data Analysis Tools](https://learn.codeinstitute.net/courses/course-v1:CodeInstitute+DDA101+2021_T4/courseware/468437859a944f7d81a34234957d825b/c8ea2343476c48739676b7f03ba9b08e/) 2022.

In [None]:
import numpy as np
import ppscore as pps

def heatmap_corr(df, threshold, figsize=(10, 8), font_annot=8):
    """
    Heatmap for pearson (linear) and spearman (monotonic) correlations to 
    visualize only those correlation levels greater than a given threshold.
    """
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={'size': font_annot}, ax=axes,
                    linewidth=0.01, linecolor='WhiteSmoke'
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(10, 8), font_annot=8):
    """
    Heatmap for predictive power score (PPS)
    PPS == 0 means that there is no predictive power
    PPS < 0.2 often means that there is some relevant predictive power but it is weak
    PPS > 0.2 often means that there is strong predictive power
    PPS > 0.8 often means that there is a deterministic relationship in the data
    """
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={'size': font_annot},
                         linewidth=0.01, linecolor='WhiteSmoke')
        plt.ylim(len(df.columns), 0)
        plt.show()


def calculate_corr_and_pps(df):
    """
    Calculate the correlations and ppscore of a given dataframe
    """
    df_corr_spearman = df.corr(method='spearman')
    df_corr_pearson = df.corr(method='pearson')

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query('ppscore < 1').filter(['ppscore']).describe().T
    print('PPS threshold - check PPS score IQR to decide threshold for heatmap \n')
    print(pps_score_stats.round(4))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def display_corr_and_pps(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(10, 8), font_annot=8):
    """
    Render the correlations and ppscore heatmaps for a given dataframe
    """
    # print('\n')
    print('To analyze: \n** Collinearity: how the target variable is correlated with the other features (variables)')
    print('** Multicollinearity: how the features correlate among themselves')

    print('\n')
    print('*** Heatmap: Pearson Correlation ***')
    print('It evaluates the linear relationship between two continuous variables \n'
          '* A +ve correlation indicates that as one variable increases the other variable tends to increase.\n'
          '* A correlation near zero indicates that as one variable increases, there is no tendency in the other variable to either increase or decrease.\n'
          '* A -ve correlation indicates that as one variable increases the other variable tends to decrease.')
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print('\n')
    print('*** Heatmap: Spearman Correlation ***')
    print('It evaluates the monotonic relationship between two variables \n'
          'Spearman correlation coefficients range from -1 to +1.\n'
          'The sign of the coefficient indicates whether it is a positive or negative monotonic relationship.\n'
          '* A positive correlation means that as one variable increases, the other variable also tends to increase.')
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print('\n')
    print('*** Heatmap: Predictive Power Score (PPS) ***')
    print(f'PPS detects linear or non-linear relationships between two columns.\n'
          f'The variable on the x-axis is used to predict the corresponding variable on the y-axis.\n'
          f'The score ranges from 0 (no predictive power) to 1 (perfect predictive power)\n\n'
          f'* PPS == 0 means that there is no predictive power\n'
          f'* PPS < 0.2 often means that there is some relevant predictive power but it is weak\n'
          f'* PPS > 0.2 often means that there is strong predictive power\n'
          f'* PPS > 0.8 often means that there is a deterministic relationship in the data\n')
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
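
The masking logic in `heatmap_corr` can be checked on its own, without plotting. The sketch below applies the same two steps (hide the diagonal and upper triangle, then hide values below the threshold) to a made-up 3x3 correlation matrix. The matrix `toy_corr` and the threshold of 0.4 are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 correlation matrix
toy_corr = pd.DataFrame(
    [[1.0, 0.7, 0.1],
     [0.7, 1.0, -0.5],
     [0.1, -0.5, 1.0]],
    index=['a', 'b', 'c'], columns=['a', 'b', 'c'])
threshold = 0.4

# True means "hide this cell" in seaborn's heatmap
mask = np.zeros(toy_corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True                  # diagonal and upper triangle
mask[(toy_corr.abs() < threshold).to_numpy()] = True     # correlations weaker than the threshold

print(mask)
```

Only the lower-triangle cells for the pairs (b, a) and (c, b) stay visible, since 0.7 and -0.5 both exceed the threshold in absolute value while 0.1 does not.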

In [None]:
df_total

Drop the calculated variables, then calculate the correlations and Predictive Power Score

In [None]:
df_drop = df_total.drop(['Data_No', 'change_DP', 'mass_g', 'cumulative_mass_g', 'Tt', 'filter_balance'], axis=1)
# df_corr_pearson, df_corr_spearman, pps_matrix = calculate_corr_and_pps(df_total)
df_corr_pearson, df_corr_spearman, pps_matrix = calculate_corr_and_pps(df_drop)

**Pairplot** to quickly visualize the relationships among the provided variables

In [None]:
sns.pairplot(data=df_drop)

Heatmaps for the **df_total** dataset (with the calculated variables dropped)

In [None]:
display_corr_and_pps(df_corr_pearson=df_corr_pearson, df_corr_spearman=df_corr_spearman,
                     pps_matrix=pps_matrix, CorrThreshold=0, PPS_Threshold=0,
                     figsize=(12, 10), font_annot=10)


### Observations
#### Heatmap: **Pearson Correlation**
* A linear relationship is one where a change in one variable is associated with a proportional change in the other variable.
* Positive relationships can be observed between:
    * **Differential Pressure** and **Time**, plus **Flow Rate**
    * **Flow Rate** and **RUL**, plus **Time**
* A strongly negative relationship can be observed between **Differential Pressure** and **RUL**.

#### Heatmap: **Spearman Correlation**
* A monotonic relationship is one where a change in one variable is associated with a change in a **consistent direction** in the other variable, though not necessarily at a constant rate.
    * e.g. Does an increase in X consistently coincide with an increase (or consistently with a decrease) in Y?
    * We consider Spearman’s correlation when
        * we have pairs of continuous variables whose relationship doesn’t follow a straight line (curvilinear), and/or
        * we have pairs of ordinal data (like time)

* **Spearman's rho Values and Direction**
    * **Differential Pressure** is strongly positively correlated to **Time**, less strongly to **Flow Rate**, and negatively correlated to **RUL**.
    * **Dust Feed** is negatively correlated to **RUL**, whereas **Dust Type** is positively correlated to **RUL**.
    * **Flow Rate** is positively correlated to **Time** and, as noted above, to **Differential Pressure**.
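
The difference between a linear and a monotonic relationship can be seen on a small made-up example. In the sketch below, `y` always increases with `x` but along a curve, so Spearman's rho is 1 while Pearson's r is noticeably lower. The data are hypothetical, and Spearman is computed here as Pearson on the ranks, which is its definition.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y always increases with x, but not along a straight line
x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)

pearson_r = x.corr(y)                     # Pearson on the raw values
spearman_rho = x.rank().corr(y.rank())    # Spearman: Pearson on the ranks

print(round(pearson_r, 3), round(spearman_rho, 3))
```

This is why the two heatmaps are read together: a pair that scores higher on Spearman than on Pearson hints at a curved but consistently directed relationship.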

#### Heatmap: **Predictive Power Score (PPS)**
* Detects linear or non-linear relationships between two columns.
* We see strong predictive power between **Dust_feed** and **RUL**, and a weaker, yet still notable, score between **Dust_feed** and **Flow_rate**
    * RUL, as a calculation of **time** remaining, is logically affected by the volume of dust per second: the lower the flow or feed, the higher the RUL. This is, however, dictated by the simple fact that the filter needs to filter dust. Reducing either rate defeats the purpose of the filtering process, so we will treat this as a **confounding** relationship that cannot be described purely in terms of correlations or associations.
* When considering the absolute levels of the scores in the dataset, we see a predictive relationship between **Differential Pressure** and **RUL** that is weak in absolute terms yet strong relative to the other scores
    * Differential pressure has predictive power for RUL, whereas RUL has no predictive power for differential pressure
    * Naturally, we also see a weak two-way relationship between **Differential Pressure** and **Time**
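
Unlike a correlation coefficient, a predictive score can be asymmetric: one column may predict another without the reverse being true. The sketch below illustrates that idea with a deliberately simplified stand-in, not the `ppscore` implementation. As we understand it, `ppscore` fits a cross-validated decision tree and, for numeric targets, normalises its mean absolute error against a median baseline; here the "model" is just an in-sample median lookup. The function `toy_predictive_score` and the data are hypothetical.

```python
import pandas as pd

def toy_predictive_score(df, feature, target):
    """1 - MAE(model) / MAE(median baseline), floored at 0. In-sample, illustration only."""
    # "Model": predict the median target observed for each feature value
    model_pred = df.groupby(feature)[target].transform('median')
    mae_model = (df[target] - model_pred).abs().mean()
    # Baseline: always predict the overall median of the target
    mae_baseline = (df[target] - df[target].median()).abs().mean()
    if mae_baseline == 0:
        return 0.0
    return max(0.0, 1 - mae_model / mae_baseline)

# Hypothetical data: y is fully determined by x, but x is not determined by y
toy = pd.DataFrame({'x': [-3, -2, -1, 0, 1, 2, 3]})
toy['y'] = toy['x'] ** 2

x_predicts_y = toy_predictive_score(toy, 'x', 'y')
y_predicts_x = toy_predictive_score(toy, 'y', 'x')
print(x_predicts_y, y_predicts_x)
```

Knowing `x` pins down `y` exactly, whereas knowing `y` leaves the sign of `x` open, so the lookup does no better than the baseline in that direction. The same kind of asymmetry is what the heatmap shows when one variable predicts another but not the reverse.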


---

## Save Datasets

Save the files to the `outputs/datasets/cleaned` folder

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/cleaned', exist_ok=True)

df_train.to_csv('outputs/datasets/cleaned/dfCleanTrain.csv', index=False)
df_test.to_csv('outputs/datasets/cleaned/dfCleanTest.csv', index=False)
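
A quick way to confirm that a save worked is to reload the file and compare it with the frame in memory. The sketch below does this with a made-up frame in a temporary directory so that it does not touch the project's `outputs` folder. The names `toy_clean` and `toyClean.csv` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy frame standing in for a cleaned dataset
toy_clean = pd.DataFrame({'Time': [0.1, 0.2], 'RUL': [50.0, 49.9]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, 'outputs', 'datasets', 'cleaned')
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, 'toyClean.csv')
    # index=False keeps the row index out of the file, so no extra column appears on reload
    toy_clean.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy_clean.shape
same_columns = list(reloaded.columns) == list(toy_clean.columns)
print(same_shape, same_columns)
```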

---

# Conclusions and Next steps

#### Conclusions: 
* 

#### Next Steps:
* Correlation Study
* Feature Engineering

---