# **ML Modeling and Evaluation - Classification**

## Objectives

* Fit a feature engineering pipeline
* Fit a classification model to predict the health risk level of patients
* Evaluate the classification model performance
  
## Inputs

* Cleaned Maternal Health Risks dataset: `outputs/datasets/cleaned/maternal-health-risk-dataset-clean.csv`
* Instructions on feature engineering steps to take (see previous notebook)

## Outputs

* Train set (features and target)
* Test set (features and target)
* Feature engineering pipeline
* Modeling pipeline
* Feature importance plot

---

# Import Packages for Data Collection

In [1]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from feature_engine import transformation as vt
from feature_engine.selection import SmartCorrelatedSelection

%matplotlib inline

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* Access the current directory with os.getcwd()

In [2]:
current_dir = os.getcwd()
current_dir

'/workspaces/ML-maternal-health-risk/jupyter_notebooks'

Make the parent of the current directory the new current directory, then confirm the change.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory
* os.getcwd() gets the current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"New current directory set to {current_dir}.")

New current directory set to /workspaces/ML-maternal-health-risk.


# Load Data

Recall that we can load the cleaned dataset directly, since the only data cleaning required was removing 3 (suspected) erroneous data points. No cleaning steps needed to be performed in a data cleaning pipeline.

In [7]:
df = pd.read_csv('outputs/datasets/cleaned/maternal-health-risk-dataset-clean.csv')
print(df.shape)
df.head()

(1011, 7)


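| Unnamed: 0 | Age | SystolicBP | DiastolicBP | BloodSugar | BodyTemp | HeartRate | RiskLevel |
|---|---|---|---|---|---|---|---|
| 0 | 25 | 130 | 80 | 15.0 | 36.7 | 86 | 2 |
| 1 | 35 | 140 | 90 | 13.0 | 36.7 | 70 | 2 |
| 2 | 29 | 90 | 70 | 8.0 | 37.8 | 80 | 2 |
| 3 | 30 | 140 | 85 | 7.0 | 36.7 | 70 | 2 |
| 4 | 35 | 120 | 60 | 6.1 | 36.7 | 76 | 0 |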


---

# Data Exploration

---

# Conclusions and Next Steps

* We analysed six different numerical transformations on our six features and decided to include the following in our feature engineering pipeline:
  * Box-Cox transformation on the following features: Age, SystolicBP, BloodSugar
* We also looked into Smart Correlated Selection and found:
  * The DiastolicBP feature is strongly correlated with SystolicBP and can thus be dropped

We will include these transformations in the feature engineering pipeline in the next notebook:

* Box-Cox transformation: `["Age", "SystolicBP", "BloodSugar"]`
* SmartCorrelatedSelection: `["DiastolicBP"]`