# **Maternal Health Risk Study A**

## Objectives

* Answer business requirement 1: carry out descriptive analytics on the maternal health dataset.
  
## Inputs

* The maternal health dataset from outputs/datasets/collection/maternal-health-risk-dataset.csv

## Outputs

* Code to answer business requirement 1 and to use to build a Streamlit dashboard
* Plots to visualise the analysis

## Additional Comments
* This is the first part of the maternal health risk study. It contains an exploratory data analysis of the dataset, an outlier study and a correlation study.
* See part B of this notebook for the rest of the analysis (EDA of selected correlated variables).

---

# Import Packages for Data Collection

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport
import plotly.express as px
from feature_engine.discretisation import ArbitraryDiscretiser

%matplotlib inline

# Change working directory

We need to change the working directory from its current folder to its parent folder
* Access current directory with os.getcwd()

In [None]:
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory, and confirm the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory
* os.getcwd() gets the current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"New current directory set to {current_dir}.")
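As a side note, the parent directory can also be obtained with `pathlib`. The sketch below only compares the two approaches and does not change the working directory; the variable names are illustrative and not part of the notebook.

```python
import os
from pathlib import Path

# Parent of the current working directory, computed in two equivalent ways.
parent_via_os = Path(os.path.dirname(os.getcwd()))
parent_via_pathlib = Path.cwd().parent

# Both approaches should point to the same folder.
print(parent_via_os == parent_via_pathlib)
```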

# Load Data

In [None]:
df = pd.read_csv('outputs/datasets/collection/maternal-health-risk-dataset.csv')
df.head()

---

# Data Exploration

In this section we perform a first exploratory data analysis on the dataset.

In [None]:
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

### Notes from the profiling report:

* The target variable (RiskLevel) has three classes, and 40% of the target values are zero (low risk).
  * This could be a hint that the dataset is imbalanced.
  * Depending on the balance of the other two classes, the imbalance could be mild or significant, and we might need to rebalance the dataset during the feature engineering step.
  * The class distribution is 40%, 33.1% and 26.8%. This is a moderate imbalance, and ML algorithms might perform well enough without rebalancing.
  * We will keep this in mind in case one class has notably worse prediction results.
* There are no missing values in any column.
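
The class shares quoted above can also be checked directly with pandas rather than read from the profiling report. This is a minimal sketch on hypothetical toy data, not the study dataset; the helper name `class_distribution` and the encoding of classes 1 and 2 as mid and high risk are assumptions for illustration.

```python
import pandas as pd


def class_distribution(frame: pd.DataFrame, target: str = "RiskLevel") -> pd.Series:
    """Return the share of each target class in percent, most frequent first."""
    return frame[target].value_counts(normalize=True).mul(100).round(1)


# Hypothetical toy data: 0 = low risk; 1 and 2 are assumed to mean mid and high risk.
toy = pd.DataFrame({"RiskLevel": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]})
shares = class_distribution(toy)
print(shares)
```

Calling `class_distribution(df)` on the loaded dataset should give the class shares to compare with the report.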

---

# Outlier Study

In this section we look at outliers to get a first impression and see whether there could be possible errors in the dataset.

We will further handle the outliers in the data cleaning and feature engineering notebooks.

### Short Statistical Summary

We look at a quick statistical summary of the dataset to check the averages as well as the minimum and maximum values, and to get a first impression of possible outliers or extreme values.

In [None]:
df.describe()

* We notice that the minimum value for age is 10 and the maximum is 70 years, which seems odd for data about pregnancy.
  * We will decide in the data cleaning notebook how to proceed with the extreme values for age.
* Since we look at medical data, it is useful to gather information about normal ranges of the other variables:
  * Blood Pressure (mm Hg) (from [Blood Pressure UK](https://www.bloodpressureuk.org/your-blood-pressure/understanding-your-blood-pressure/what-do-the-numbers-mean/)):
    * low blood pressure: 70-90 systolic and 40-60 diastolic
    * ideal blood pressure: 90-120 systolic and 60-80 diastolic
    * pre-high blood pressure: 120-140 systolic and 80-90 diastolic
    * high blood pressure: 140-190 systolic and 90-100 diastolic
    * Blood pressure values in the dataset look to be in a realistic range
  * Blood Sugar (mmol/L):
    * We do not have enough details about the measurement (e.g. the time at which it was taken) to know whether the minimum and maximum values are in a realistic range. We have to assume they are and carry out the outlier analysis as usual.
  * Body Temperature looks to be in a normal range.
  * Heart Rate seems to be in a normal range.
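
The blood pressure ranges listed above can be turned into a quick plausibility check. The sketch below bins systolic readings with `pd.cut`; the readings are hypothetical, the column name `SystolicBP` is an assumption, and treating each band as left-closed (so that 90 counts as ideal rather than low) is a choice made here because the listed ranges share their boundary values.

```python
import pandas as pd

# Systolic bands from the Blood Pressure UK ranges listed above (mm Hg).
SYSTOLIC_BINS = [70, 90, 120, 140, 190]
SYSTOLIC_LABELS = ["low", "ideal", "pre-high", "high"]


def systolic_band(values: pd.Series) -> pd.Series:
    """Label each reading with its band; readings outside [70, 190) become NaN."""
    return pd.cut(values, bins=SYSTOLIC_BINS, labels=SYSTOLIC_LABELS, right=False)


# Hypothetical readings, not taken from the dataset.
readings = pd.Series([85, 110, 120, 139, 150, 200], name="SystolicBP")
bands = systolic_band(readings)
print(bands.value_counts(sort=False))
print(f"Readings outside the listed ranges: {bands.isna().sum()}")
```

Any reading that falls outside the listed ranges shows up as NaN, which makes implausible values easy to count.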

To visualise the outliers, let us create a boxplot for each variable.

In [None]:
df.head()

In [None]:
for var in df.columns[:-1]:
    plt.figure(figsize=(4,4))
    sns.boxplot(data=df[var])
    plt.title(f"{var} Box Plot", fontsize=15, y=1.05)
    plt.show()
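
The points drawn beyond the whiskers of a box plot can also be counted. By default, seaborn's box plot places the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, so the same rule is sketched below on hypothetical toy values; the helper name `iqr_outlier_count` is not part of the notebook.

```python
import pandas as pd


def iqr_outlier_count(values: pd.Series, whisker: float = 1.5) -> int:
    """Count values lying more than `whisker` * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return int(((values < lower) | (values > upper)).sum())


# Hypothetical ages, not taken from the dataset.
toy_age = pd.Series([19, 22, 23, 25, 27, 29, 30, 32, 35, 70], name="Age")
print(iqr_outlier_count(toy_age))
```

Applied per column, for example `df.iloc[:, :-1].apply(iqr_outlier_count)`, this would give a numeric companion to the plots.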

---

# Correlation Study

In this section we study correlation in the dataset.

We are interested in
* the correlation of each feature with the target
* correlation between the features to find possible collinearity

Note that the focus in this notebook is on the correlation of the features with the target variable. We leave a more detailed study of the correlations between the features to the feature engineering notebook, where it comes to deciding which features to train the ML model on.

We start by looking at the correlation levels of RiskLevel with all the features. We drop the correlation of the target with itself and sort the correlation levels by absolute value (key=abs) in descending order to find the most relevant correlated variables.

Since the variables are not normally distributed and not all of them are continuous (most notably the target), Spearman's rank correlation is the one we have to consider.

Pearson's correlation measures linear association between continuous variables, so it does not make sense for our discrete target variable.
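
To illustrate the difference between the two measures, the sketch below uses hypothetical data in which `y` grows monotonically but non-linearly with `x`. Spearman's coefficient is Pearson's coefficient computed on the ranks, so it captures the monotonic relationship fully, while Pearson's coefficient on the raw values does not.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically but non-linearly with x.
x = np.arange(1, 11)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
# Spearman's coefficient equals Pearson's coefficient computed on the ranks.
pearson_on_ranks = toy.rank().corr(method="pearson").loc["x", "y"]

print(f"Pearson:          {pearson:.3f}")
print(f"Spearman:         {spearman:.3f}")
print(f"Pearson on ranks: {pearson_on_ranks:.3f}")
```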

In [None]:
corr_vars_spearman = df.corr(method='spearman')["RiskLevel"].sort_values(key=abs, ascending=False).drop("RiskLevel")
corr_vars_spearman

### Heatmap

Let us now visualise the correlations of all variables in a heatmap.

We will look into this in more detail in the feature engineering notebook.

In [None]:
df_corr = df.corr(method="spearman")
# Create a mask to cover the upper triangle of the heatmap since the correlation matrix is symmetric
# Get zeros in the shape of df_corr of boolean type
# (np.bool was removed in NumPy 1.24, so we use the built-in bool)
upper_mask = np.zeros_like(df_corr, dtype=bool)
# Select all indices in the upper right triangle and set to True
upper_mask[np.triu_indices_from(upper_mask)] = True
sns.heatmap(data=df_corr, annot=True, mask=upper_mask, linewidths=0.7, annot_kws={"size": 9}, cmap='crest')
plt.show()
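
To see what the mask hides, here is a small sketch on a hypothetical 3x3 correlation matrix. It builds the same upper-triangle mask in one step with `np.triu` and lists the cells that remain visible; the diagonal is masked as well, since each variable correlates perfectly with itself.

```python
import numpy as np

# Made-up symmetric correlation matrix for illustration only.
toy_corr = np.array([
    [1.0, 0.4, 0.2],
    [0.4, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])

# True marks cells hidden in the heatmap: the diagonal and everything above it.
upper_mask = np.triu(np.ones_like(toy_corr, dtype=bool))
print(upper_mask)

# Row/column index pairs of the cells that stay visible (strictly lower triangle).
visible = np.argwhere(~upper_mask)
print(visible.tolist())
```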

### Conclusions from Correlation Study

* We notice mostly weak and some moderate correlation levels between the target and the features.
* The variable with the strongest correlation is BloodSugar.
* We include all variables with correlation levels greater than 0.2 in the further study.
* Hence, we drop HeartRate and BodyTemp for now since they only correlate weakly with the RiskLevel.
* Also note that all correlation levels are positive, meaning that the RiskLevel tends to increase when a feature increases.
* Most notably, this analysis leads us to suspect that:
  * Patients with high risk tend to have high blood sugar levels
  * Patients with high risk tend to have high systolic blood pressure levels
  * Patients with high risk tend to have high diastolic blood pressure levels
  * Patients with high risk tend to be older
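
The 0.2 cut-off mentioned above can also be applied by value rather than by position. This is a minimal sketch; the helper name `select_by_correlation`, the feature names and the coefficients are made up for illustration and are not the study's results.

```python
import pandas as pd


def select_by_correlation(corr_with_target: pd.Series, threshold: float = 0.2) -> list:
    """Return feature names whose absolute correlation exceeds the threshold,
    ordered from strongest to weakest."""
    strong = corr_with_target[corr_with_target.abs() > threshold]
    return strong.sort_values(key=abs, ascending=False).index.to_list()


# Made-up coefficients for illustration only.
toy_corr = pd.Series(
    {"FeatureA": 0.55, "FeatureB": -0.31, "FeatureC": 0.12, "FeatureD": 0.25}
)
selected = select_by_correlation(toy_corr)
print(selected)
```

Filtering by value keeps the selection correct if the number of features above the threshold changes, whereas slicing the first four entries relies on exactly four features passing it.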

We now take the four weakly to moderately correlated variables and store them in corr_vars_study for the further analysis in part B of this study.

In [None]:
corr_vars_study = corr_vars_spearman[:4].index.to_list()
corr_vars_study 

---

# Conclusions and Next Steps



We continue our analysis in the next notebook, 03-MaternalHealthRiskStudyB.ipynb.