[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yusufokunlola/maternal-health/blob/main/notebook.ipynb)

### Context
Data has been collected from different hospitals, community clinics, and maternal health care centres through an IoT-based risk monitoring system.

- Age: Age in years when a woman is pregnant.
- SystolicBP: Upper value of blood pressure in mmHg, a significant attribute during pregnancy.
- DiastolicBP: Lower value of blood pressure in mmHg, another significant attribute during pregnancy.
- BS: Blood glucose level as a molar concentration, in mmol/L.
- BodyTemp: Body temperature; the values plotted later in this notebook (about 98 to 100) suggest degrees Fahrenheit.
- HeartRate: Resting heart rate in beats per minute.
- RiskLevel: Predicted risk intensity level during pregnancy, based on the preceding attributes.


### Data Source

Data was sourced from [Kaggle](https://www.kaggle.com/datasets/csafrit2/maternal-health-risk-data).

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)

import warnings
warnings.filterwarnings("ignore")

In [None]:
# load data
df = pd.read_csv('MaternalHealthRisk.csv')
print("Data imported successfully")

In [None]:
# inspect dataframe by printing out the first 5 rows
df.head()

In [None]:
# display all columns for better visibility
pd.set_option('display.max_columns', None)

In [None]:
# explore top 5 and bottom 5 data (full columns)
df

In [None]:
# check the shape of the data
df.shape

In [None]:
# info about the data
df.info()

In [None]:
# check for null values in each column
df.isna().sum()

In [None]:
# return a total count for each RiskLevel in the dataset
df.RiskLevel.value_counts()

In [None]:
# check the datatype counts of the dataset
df.dtypes.value_counts()

In [None]:
# number of unique values in each features
df.nunique()

In [None]:
# statistical summary
df.describe()

### Visualize the distribution of the numerical variables

In [None]:
sns.histplot(df['Age'], kde=True)
plt.title('Distribution of Age')
plt.show()

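The Age distribution above is read here as indicating that women aged 25 to 48 are at higher risk of maternal health issues, and that women aged 48 to 50 should also take precautions. Note that a histogram of Age on its own shows how ages are spread across the sample; how risk varies with age is examined in the Age vs. RiskLevel box plot further down.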
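
To check a claim like the one above against the data, one option is to tabulate the share of each risk level within age bands. The sketch below uses a small hypothetical table rather than the real dataset, and the band edges (24 and 48) are assumptions chosen only to mirror the ages mentioned in the text.

```python
import pandas as pd

# Hypothetical toy records; in the notebook this would be df[['Age', 'RiskLevel']]
toy = pd.DataFrame({
    "Age": [19, 23, 30, 35, 42, 48, 50, 60],
    "RiskLevel": ["low risk", "low risk", "mid risk", "high risk",
                  "high risk", "high risk", "mid risk", "low risk"],
})

# Bin ages into illustrative bands: (0, 24], (24, 48], (48, 120]
toy["AgeBand"] = pd.cut(toy["Age"], bins=[0, 24, 48, 120],
                        labels=["<=24", "25-48", ">48"])

# Share of each risk level within every age band (each row sums to 1)
risk_share = pd.crosstab(toy["AgeBand"], toy["RiskLevel"], normalize="index")
print(risk_share.round(2))
```

Normalizing by row matters here: a band with many records will dominate a raw histogram even if its share of high-risk cases is small.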

In [None]:
df.columns

In [None]:
# Assign age distribution to a new column
# Define the conditions and corresponding designations
conditions = [
    df['Age'] <= 17,
    df['Age'] <= 44,
    df['Age'] <= 55,
    df['Age'] >= 56
]
designations = [
    'Children',
    'Pre-menopausal Adults',
    'Menopausal Adults',
    'Post-menopausal Adults'
]

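# Use numpy.select() to create the new column; the first matching condition wins.
# An explicit string default avoids a dtype clash in recent NumPy versions between
# the string labels and the integer default (0) used for rows that match no condition.
df['AgeDist'] = np.select(conditions, designations, default='Unknown')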

In [None]:
# Count the frequency of each value in the 'AgeDist' column
age_dist_counts = df['AgeDist'].value_counts()

# Create a pie chart of the frequency counts
age_dist_counts.plot.pie(autopct='%1.1f%%', startangle=90, labeldistance=1.2, figsize=(8, 8), legend=True)

# Set the title of the chart
plt.title('Age Distribution')

# Show the legend
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, title='Age Distribution')

# Show the chart
plt.show()

The age bands above are based on definitions from the W.H.O. (2022) and the National Institute on Aging. The chart shows that older adults account for 54.4% of the records, which is read here as this group being at higher risk of maternal health issues.
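
As an alternative way to build the same age bands, `pd.cut` expresses the cut-offs as bin edges instead of a list of conditions. This is a sketch on a hypothetical sample of ages, not the notebook's data. For whole-number ages it gives the same bands as the conditions used earlier (17, 44 and 55 as upper edges).

```python
import numpy as np
import pandas as pd

# Hypothetical sample of ages, chosen to sit on both sides of each cut-off
ages = pd.Series([15, 17, 18, 44, 45, 55, 56, 70], name="Age")

labels = ["Children", "Pre-menopausal Adults",
          "Menopausal Adults", "Post-menopausal Adults"]

# Right-inclusive bins: (-inf, 17], (17, 44], (44, 55], (55, inf)
age_dist = pd.cut(ages, bins=[-np.inf, 17, 44, 55, np.inf], labels=labels)

# Percentage of records in each band, as shown on the pie chart
share = age_dist.value_counts(normalize=True).mul(100).round(1)
print(share)
```

The result is an ordered categorical, so the bands keep their natural order in tables and plots.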

In [None]:
# Select only the numerical columns
num_cols = ['Age', 'SystolicBP', 'DiastolicBP', 'BS', 'BodyTemp', 'HeartRate']

# Plot histograms for all numerical columns
df[num_cols].hist(bins=10, figsize=(15, 12))

# Add an overall title and tidy the layout
plt.suptitle('Histograms of Numerical Variables', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Generate box plots for each column
df.boxplot()

# Show the plot
plt.show()

Questions

What is the acceptable range of values for each of the following columns?

1. Age
2. SystolicBP
3. DiastolicBP
4. BS
5. BodyTemp
6. HeartRate
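
A data-driven starting point for these questions is the rule that box plots use by default: points further than 1.5 times the interquartile range (IQR) from the quartiles are drawn as outliers. The sketch below applies that rule to a hypothetical heart-rate sample. `iqr_bounds` is an illustrative helper, not part of the notebook, and the bounds it returns are statistical, not clinical, limits.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical heart-rate readings with one implausible value
heart_rate = pd.Series([70, 72, 76, 77, 80, 80, 86, 88, 7])

low, high = iqr_bounds(heart_rate)
outliers = heart_rate[(heart_rate < low) | (heart_rate > high)]
print(low, high, outliers.tolist())
```

Values flagged this way still need a domain check before they are dropped or corrected.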

### Visualize the relationship between the numerical variables and the target variable

In [None]:
sns.boxplot(x='RiskLevel', y='Age', data=df)
plt.title('Age vs. RiskLevel')
plt.show()

The box plot above suggests that women aged 25 to 48 are at higher risk of maternal health problems.

In [None]:
sns.boxplot(x='RiskLevel', y='SystolicBP', data=df)
plt.title('SystolicBP vs. RiskLevel')
plt.show()

SystolicBP measures the pressure in the arteries when the heart beats. In the plot above, women with a systolic BP between 120 and 140 mmHg are at high risk of maternal health problems.

In [None]:
sns.boxplot(x='RiskLevel', y='DiastolicBP', data=df)
plt.title('DiastolicBP vs. RiskLevel')
plt.show()

DiastolicBP measures the pressure in the arteries when the heart rests between beats. The box plot above shows that women with a diastolic BP between 75 and 100 mmHg are at high risk of maternal health problems.

In [None]:
sns.boxplot(x='RiskLevel', y='BS', data=df)
plt.title('BS vs. RiskLevel')
plt.show()

BS is the blood glucose level, also known as the blood sugar level. As shown above, women with a blood sugar level between 8 and 14.5 mmol/L are at high risk of maternal health problems.

In [None]:
sns.boxplot(x='RiskLevel', y='BodyTemp', data=df)
plt.title('BodyTemp vs. RiskLevel')
plt.show()

The effect of a pregnant woman's body temperature varies with the stage of the pregnancy. However, the box plot above shows that women with a body temperature between 98 and 100 fall in both the high-risk and mid-risk groups.

In [None]:
sns.boxplot(x='RiskLevel', y='HeartRate', data=df)
plt.title('HeartRate vs. RiskLevel')
plt.show()

During pregnancy, the amount of blood pumped by the heart increases by 30 to 50%, according to the MSD Manuals. However, a heart rate of 67 to 90 beats per minute, as shown above, is associated with maternal risk.
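
The ranges quoted from the box plots can also be read off as numbers, since each box spans the first to third quartile of a group. This is a minimal sketch on hypothetical values. The `RiskLevel` and `SystolicBP` names match the notebook's columns, but the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical readings for two risk groups
toy = pd.DataFrame({
    "RiskLevel": ["low risk"] * 4 + ["high risk"] * 4,
    "SystolicBP": [100, 110, 120, 120, 120, 130, 140, 140],
})

# Quartiles per risk level: the 0.25 and 0.75 columns are the box edges
quartiles = (
    toy.groupby("RiskLevel")["SystolicBP"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

Running the same grouping on the real dataframe would give the exact box edges for each variable instead of values estimated by eye.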

In [None]:
# correlation analysis on the numerical columns only
corrmat = df.drop(columns=['RiskLevel', 'AgeDist']).corr()
# boolean mask that hides the upper triangle and the diagonal, so each pair appears once
mask = np.triu(np.ones_like(corrmat, dtype=bool))
plt.figure(figsize=(6,4))
# plot heat map
sns.heatmap(corrmat, annot=True, fmt='.2f', mask=mask, cmap='Spectral_r');
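
A correlation matrix is symmetric, so a boolean upper-triangle mask is a common way to show each pair only once in a heatmap. The sketch below builds such a mask on a hypothetical three-column table and applies it without plotting, to make visible which cells survive.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: b rises with a, c falls as a rises
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
corr = toy.corr()

# True on and above the diagonal; these cells are hidden
mask = np.triu(np.ones_like(corr, dtype=bool))

# Keep only the lower triangle; masked cells become NaN
lower = corr.mask(mask)
print(lower)
```

Passing the same `mask` to `sns.heatmap` hides the masked cells in the plot.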

## Modeling

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import pickle

In [None]:
# the 'Age' column will be dropped as we now have the 'AgeDist' column
df.drop(columns=['Age'], inplace=True)

In [None]:
# find categorical variable and encode
cat_features = df.select_dtypes(include=['object', 'category']).columns
num_features = [col for col in df.columns if col not in cat_features]

# print categorical variable
print("Categorical features: ", cat_features)

# print numerical variable
print("Numerical features: ", num_features)

In [None]:
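# label encoding: fit one encoder per column and keep it so codes can be mapped back
# (LabelEncoder assigns codes in sorted label order, not in order of risk or age)
encoders = {}
for col in cat_features:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])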
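
`LabelEncoder` numbers classes in sorted order, so the integer codes do not follow the severity of the risk levels. The sketch below shows the resulting mapping and an explicit ordinal mapping as an alternative. The label strings (`low risk`, `mid risk`, `high risk`) are assumed to match the dataset's values, and `order` is a hypothetical mapping introduced only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings; the sample itself is hypothetical
levels = ["low risk", "mid risk", "high risk", "low risk"]

# LabelEncoder sorts the classes alphabetically before assigning codes
le = LabelEncoder().fit(levels)
mapping = {label: int(code) for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)

# Alternative: an explicit mapping that preserves severity order
order = {"low risk": 0, "mid risk": 1, "high risk": 2}
encoded = [order[level] for level in levels]
print(encoded)
```

For the target column the ordering does not affect the classifiers, but it matters when reading the confusion matrix and classification report, where rows follow the encoded order.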

In [None]:
# save file to disk for modeling
df.to_csv("healthrisk.csv", index=False)

In [None]:
# split data into train and test set
X = df.drop(columns='RiskLevel')
y = df['RiskLevel']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

num_features = ['SystolicBP', 'DiastolicBP', 'BS', 'BodyTemp', 'HeartRate', 'AgeDist']

# standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train[num_features])
X_test = scaler.transform(X_test[num_features])

In [None]:
# instantiate models
modelclasses = {'LogisticReg': LogisticRegression(), 
                'SVC': SVC(), 
                'DecisionTree': DecisionTreeClassifier(),
                'RandomForest': RandomForestClassifier()
                }

In [None]:
# Iteration of models

# create a list to store model results
acc_scores = []
f1_scores = []

for model_name, model_method in modelclasses.items():
       
    # fit model to training data
    model_method.fit(X_train, y_train)
    
    # predict the outcomes on the test set
    y_pred = model_method.predict(X_test)
    
    # append accuracy evaluation metric for the model to the list 
    acc_scores.append(accuracy_score(y_test, y_pred))
    
    # append macro-averaged F1 score for the model to the list
    f1_scores.append(f1_score(y_test, y_pred, average='macro'))
    
# create a dataframe to store the results
cla_results = pd.DataFrame({"Model": list(modelclasses.keys()), "Accuracy Score": acc_scores, "F1 Score": f1_scores})
cla_results
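
The F1 score reported above is macro-averaged: F1 is computed for each class separately and the per-class values are averaged with equal weight, regardless of class size. A small sketch with hypothetical labels makes this concrete.

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# One F1 value per class, in label order
per_class = f1_score(y_true, y_pred, average=None)

# Macro F1 is the unweighted mean of the per-class values
macro = f1_score(y_true, y_pred, average="macro")
print(per_class, macro)
```

Because every class counts equally, a model that ignores a small class is penalized more by macro F1 than by accuracy.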

In [None]:
# plot the F1 score for each model
sns.barplot(x=cla_results['F1 Score'], y=cla_results.Model);

In [None]:
# Hyperparameter tuning for the Random Forest classifier using RandomizedSearchCV
# ("r2" is a regression metric; macro F1 matches the metric used to compare models above)
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(),
                                  param_distributions={
                                    'max_depth': [10, 20, 30, 40],
                                    'min_samples_split': [2, 5, 10],
                                    'n_estimators': [100, 80, 60, 55, 51, 45]
                                  },cv=5, scoring="f1_macro",verbose=1,n_jobs=-1, 
                                  n_iter=50, random_state = 1
                                )

random_search.fit(X_train, y_train)

best_params=random_search.best_params_

print(" Results from Random Search " )
print("\n The best estimator across ALL searched params:\n", random_search.best_estimator_)
print("\n The best score across ALL searched params:\n", random_search.best_score_)
print("\n The best parameters across ALL searched params:\n", best_params)
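
For context on `n_iter=50`: the three parameter lists in the search define a finite grid, and the randomized search evaluates only a sample of it. The sketch below counts the full grid using scikit-learn's `ParameterGrid`. The variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the search above
param_distributions = {
    "max_depth": [10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [100, 80, 60, 55, 51, 45],
}

# Total number of distinct combinations in the grid
n_candidates = len(ParameterGrid(param_distributions))
print(n_candidates)
```

With 50 of the combinations tried under 5-fold cross-validation, the search fits 250 models plus one final refit on the full training set.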

In [None]:
# Build the Random Forest classifier with the best parameters found
model = RandomForestClassifier(max_depth=best_params["max_depth"], n_estimators=best_params["n_estimators"], min_samples_split=best_params['min_samples_split'])

# fit model
model.fit(X_train, y_train)

# make predictions on the test set
y_pred = model.predict(X_test)

In [None]:
# Evaluation Metrics
def evaluation_metrics_func(y_test, y_pred):
    
    #print('Evaluation metric results:-')
    print('Accuracy score : \n', accuracy_score(y_test, y_pred))
    print('F1 score : \n', f1_score(y_test, y_pred, average="macro"))
    print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred))
    print('Classification Report : \n', classification_report(y_test, y_pred))

In [None]:
# computing the evaluation metrics
evaluation_metrics_func(y_test, y_pred)

In [None]:
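# plot confusion matrix for RandomForestClassifier
# (plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap='Spectral_r');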

In [None]:
# obtain feature importances from the random forest classifier
feat_imp = model.feature_importances_

# create a dataframe of feature importances
feat_imp_ = pd.DataFrame(feat_imp, X.columns, columns=["Feature Importance"]).sort_values(by="Feature Importance", ascending=False)

# plot feature importances (labels are set after plotting so seaborn does not overwrite them)
plt.figure(figsize=(11,8))
sns.barplot(x=feat_imp_['Feature Importance'], y=feat_imp_.index)
plt.ylabel('Features')
plt.title('Feature Importance');

In [None]:
# save the model to disk; the context manager closes the file handle
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
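
The saved model was trained on standardized features, so anything that loads it later needs the same scaler. One way to keep the two together is to pickle a scikit-learn `Pipeline`. The sketch below does this on synthetic data from `make_classification`. The file name and data are hypothetical, and this is an alternative arrangement, not what the notebook does.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six features and three risk classes
X, y = make_classification(n_samples=60, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# Scaler and classifier travel together as one object
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=10, random_state=0))
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.sav")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline scales raw inputs itself before predicting
same = bool((restored.predict(X) == pipe.predict(X)).all())
print(same)
```

Pickle files should only be loaded from trusted sources and with a compatible scikit-learn version.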