# **Data Cleaning**

## Objectives

* Clean data
* Split into train and test sets
  
## Inputs

* The maternal health dataset from outputs/datasets/collection/maternal-health-risk-dataset.csv

## Outputs

* Generate cleaned train and test sets and save to outputs/datasets/cleaned

---

# Import Packages for Data Cleaning

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# Change working directory

We need to change the working directory from its current folder to its parent folder
* Access current directory with os.getcwd()

In [2]:
current_dir = os.getcwd()
current_dir

'/workspaces/ML-maternal-health-risk/jupyter_notebooks'

Make the parent of the current directory the new current directory, and confirm the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory
* os.getcwd() gets the current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"New current directory set to {current_dir}.")

New current directory set to /workspaces/ML-maternal-health-risk.


# Load Data

In [4]:
df = pd.read_csv('outputs/datasets/collection/maternal-health-risk-dataset.csv')
df.head()

Unnamed: 0,Age,SystolicBP,DiastolicBP,BloodSugar,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,36.7,86,2
1,35,140,90,13.0,36.7,70,2
2,29,90,70,8.0,37.8,80,2
3,30,140,85,7.0,36.7,70,2
4,35,120,60,6.1,36.7,76,0


---

# Data Exploration

To help us decide which steps to take in the data cleaning process, let us review the data and its statistical summary again.

In [6]:
df.head()

Unnamed: 0,Age,SystolicBP,DiastolicBP,BloodSugar,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,36.7,86,2
1,35,140,90,13.0,36.7,70,2
2,29,90,70,8.0,37.8,80,2
3,30,140,85,7.0,36.7,70,2
4,35,120,60,6.1,36.7,76,0


In [7]:
df.tail()

Unnamed: 0,Age,SystolicBP,DiastolicBP,BloodSugar,BodyTemp,HeartRate,RiskLevel
1009,22,120,60,15.0,36.7,80,2
1010,55,120,90,18.0,36.7,60,2
1011,35,85,60,19.0,36.7,86,2
1012,43,120,90,18.0,36.7,70,2
1013,32,120,65,6.0,38.3,76,1


In [5]:
df.describe()

Unnamed: 0,Age,SystolicBP,DiastolicBP,BloodSugar,BodyTemp,HeartRate,RiskLevel
count,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0
mean,29.871795,113.198225,76.460552,8.725986,37.059763,74.301775,0.86785
std,13.474386,18.403913,13.885796,3.293532,0.743991,8.088702,0.807353
min,10.0,70.0,49.0,6.0,36.7,7.0,0.0
25%,19.0,100.0,65.0,6.9,36.7,70.0,0.0
50%,26.0,120.0,80.0,7.5,36.7,76.0,1.0
75%,39.0,120.0,90.0,8.0,36.7,80.0,2.0
max,70.0,160.0,100.0,19.0,39.4,90.0,2.0


### Summary:

* There are no missing values in any column: every column has a count of 1014.
* The target variable (RiskLevel) has three classes, and 40% of the target values are zero (low risk).
  * On its own, this could hint that the dataset is imbalanced; how severe the imbalance is depends on the balance of the other two classes.
  * The class distribution is 40%, 33.1%, 26.8%. This is only a moderate imbalance, and ML algorithms might perform well enough without rebalancing.
  * We will keep this in mind in case one class shows notably worse prediction results, and consider rebalancing the dataset during the feature engineering step if so.
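
The missing-value check and the class shares quoted in the summary can be reproduced with a few pandas calls. The following is a minimal sketch, not the notebook's original code: the helper name `summarise_target` is hypothetical, and the sample frame only contains the five rows shown in the `df.head()` output, so its percentages differ from those of the full dataset.

```python
import pandas as pd


def summarise_target(df: pd.DataFrame, target: str = "RiskLevel"):
    """Return missing-value counts per column and class shares (in %) of the target."""
    # Number of missing (NaN) entries in each column
    missing = df.isnull().sum()
    # Relative frequency of each class, as a percentage rounded to one decimal
    shares = df[target].value_counts(normalize=True).sort_index().mul(100).round(1)
    return missing, shares


# Illustrative rows copied from the df.head() output, not the full dataset
sample = pd.DataFrame(
    {
        "Age": [25, 35, 29, 30, 35],
        "SystolicBP": [130, 140, 90, 140, 120],
        "DiastolicBP": [80, 90, 70, 85, 60],
        "BloodSugar": [15.0, 13.0, 8.0, 7.0, 6.1],
        "BodyTemp": [36.7, 36.7, 37.8, 36.7, 36.7],
        "HeartRate": [86, 70, 80, 70, 76],
        "RiskLevel": [2, 2, 2, 2, 0],
    }
)

missing, shares = summarise_target(sample)
print(missing)
print(shares)
```

In the notebook itself, calling `summarise_target(df)` on the loaded DataFrame would give the figures for all 1014 rows.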

---

# Outlier Study

In this section we look at outliers to get a first impression and to check for possible errors in the dataset.

We will further handle the outliers in the data cleaning and feature engineering notebooks.

### Short Statistical Summary

We look at a quick statistical summary of the dataset to check the averages as well as the minimum and maximum of each variable, which gives a first impression of possible outliers and extreme values.

In [None]:
df.describe()

* We notice that the minimum value for age is 10 and the maximum value for age is 70 years old, which seems odd for data about pregnancy.
  * We will decide in the data cleaning notebook how to proceed with the extreme values for age.
* Since we look at medical data, it is useful to gather information about normal ranges of the other variables:
  * Blood Pressure (mm Hg) (from [Blood Pressure UK](https://www.bloodpressureuk.org/your-blood-pressure/understanding-your-blood-pressure/what-do-the-numbers-mean/)):
    * low blood pressure: 70-90 systolic and 40-60 diastolic
    * ideal blood pressure: 90-120 systolic and 60-80 diastolic
    * pre-high blood pressure: 120-140 systolic and 80-90 diastolic
    * high blood pressure: 140-190 systolic and 90-100 diastolic
    * Blood pressure values in the dataset look to be in a realistic range
  * Blood Sugar (mmol/L):
    * We do not have enough details about the measurement (e.g. at what time it was taken) to know whether the min and max values are in a realistic range. We have to assume they are and proceed with the outlier analysis as usual.
  * Body Temperature looks to be in a normal range.
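  * Heart Rate is mostly in a normal range, but the minimum of 7 in the statistical summary is not physiologically plausible and points to a likely data entry error.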
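
The comparison against reference ranges can also be done programmatically. The sketch below counts values outside a given range per column. It is an illustration under stated assumptions: the helper `count_out_of_range` and the dictionary `BP_RANGES` are hypothetical names, the bounds simply span the lowest to the highest blood pressure band listed above, and the example values are made up rather than taken from the dataset.

```python
import pandas as pd

# Bounds spanning the blood pressure bands listed in the notes (lowest to highest band).
# Bounds for any other column would be assumptions and must be supplied by the reader.
BP_RANGES = {
    "SystolicBP": (70, 190),
    "DiastolicBP": (40, 100),
}


def count_out_of_range(df, ranges):
    """Count, per column, the values outside the inclusive [low, high] range."""
    counts = {}
    for column, (low, high) in ranges.items():
        # between() is inclusive on both ends; missing values count as outside
        outside = ~df[column].between(low, high)
        counts[column] = int(outside.sum())
    return pd.Series(counts, name="out_of_range")


# Hypothetical example values, not taken from the dataset
example = pd.DataFrame(
    {"SystolicBP": [120, 65, 200, 90], "DiastolicBP": [80, 60, 105, 40]}
)
result = count_out_of_range(example, BP_RANGES)
print(result)
```

Applied to the notebook's `df`, a count of zero for both columns would support the observation that the blood pressure values lie in a realistic range.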

To visualise the outliers, let us create a boxplot for each variable.

In [None]:
df.head()

In [None]:
for var in df.columns[:-1]:
    plt.figure(figsize=(4,4))
    sns.boxplot(data=df[var])
    plt.title(f"{var} Box Plot", fontsize=15, y=1.05)
    plt.show()
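
Box plots draw points as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is the default whisker setting in seaborn. To complement the plots with numbers, the following sketch counts such points per column. The helper `iqr_outlier_counts` is a hypothetical name and the example values are invented for illustration; quartile interpolation may differ slightly from the plotting library in edge cases.

```python
import pandas as pd


def iqr_outlier_counts(df, whis=1.5):
    """Count values beyond whis * IQR from the quartiles for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Fences below and above which a value is treated as an outlier
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    is_outlier = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return is_outlier.sum()


# Invented values for illustration: one implausibly low heart rate
example = pd.DataFrame({"HeartRate": [70, 72, 74, 76, 78, 80, 7]})
counts = iqr_outlier_counts(example)
print(counts)
```

In the notebook, `iqr_outlier_counts(df[df.columns[:-1]])` would cover the same feature columns as the box plot loop, leaving out the target.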

# Conclusions and Next Steps

