# Maternal Health Risk Prediction

Dataset Description:

This dataset contains a small set of attributes that can be used to determine whether a patient is at risk of maternal health complications.

The data were collected from hospitals, community clinics, and maternal health care centers in rural areas of Bangladesh through an IoT-based risk monitoring system.

The attributes are:

Age,
SystolicBP,
DiastolicBP,
BS,
BodyTemp,
HeartRate,
RiskLevel.

There are 1,014 instances and three risk levels (low, mid, and high).

Source of dataset: 
UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Maternal+Health+Risk+Data+Set)

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
np.random.seed(1)

**Load data**

In [3]:
md=pd.read_csv("./Maternal Health Risk Data Set.csv")

**Data Exploration**

In [4]:
md.head(3)

,Age,SystolicBP,DiastolicBP,BS,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,98.0,86,high risk
1,35,140,90,13.0,98.0,70,high risk
2,29,90,70,8.0,100.0,80,high risk


In [5]:
md.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1014 entries, 0 to 1013
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          1014 non-null   int64  
 1   SystolicBP   1014 non-null   int64  
 2   DiastolicBP  1014 non-null   int64  
 3   BS           1014 non-null   float64
 4   BodyTemp     1014 non-null   float64
 5   HeartRate    1014 non-null   int64  
 6   RiskLevel    1014 non-null   object 
dtypes: float64(2), int64(4), object(1)
memory usage: 55.6+ KB


In [6]:
md.describe()

,Age,SystolicBP,DiastolicBP,BS,BodyTemp,HeartRate
count,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0
mean,29.871795,113.198225,76.460552,8.725986,98.665089,74.301775
std,13.474386,18.403913,13.885796,3.293532,1.371384,8.088702
min,10.0,70.0,49.0,6.0,98.0,7.0
25%,19.0,100.0,65.0,6.9,98.0,70.0
50%,26.0,120.0,80.0,7.5,98.0,76.0
75%,39.0,120.0,90.0,8.0,98.0,80.0
max,70.0,160.0,100.0,19.0,103.0,90.0


In [7]:
#checking if there are any null values
md.isna().sum()

Age            0
SystolicBP     0
DiastolicBP    0
BS             0
BodyTemp       0
HeartRate      0
RiskLevel      0
dtype: int64

In [8]:
#getting list of Categorical Variables
cat_var_list = list(md.select_dtypes(include='object').columns)
cat_var_list

['RiskLevel']

In [9]:
#checking for unique values in each column
for cat in cat_var_list: 
    print(f"Category: {cat} Values: {md[cat].unique()}")

Category: RiskLevel Values: ['high risk' 'low risk' 'mid risk']


In [10]:
#Mapping the categorical labels to integers explicitly instead of using LabelEncoder, which assigns codes
#alphabetically (high risk=0, low risk=1, mid risk=2) and so loses the natural low < mid < high ordering
label_mapping = {"low risk": 0, "mid risk": 1, "high risk": 2}
md["RiskLevel"] = md["RiskLevel"].map(label_mapping)
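
The ordering issue behind this choice can be checked on a few toy labels. The sketch below is illustrative only: the `labels` series is made up, and it simply contrasts the codes `LabelEncoder` produces with the explicit mapping used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical), using the same three category names as RiskLevel.
labels = pd.Series(["high risk", "low risk", "mid risk", "low risk"])

# LabelEncoder sorts the class names, so the codes follow alphabetical order.
encoder = LabelEncoder()
alphabetical_codes = encoder.fit_transform(labels)
print(list(encoder.classes_))   # class names in the order they were coded
print(list(alphabetical_codes))

# An explicit mapping keeps the ordering low < mid < high.
label_mapping = {"low risk": 0, "mid risk": 1, "high risk": 2}
ordinal_codes = labels.map(label_mapping)
print(list(ordinal_codes))
```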

In [11]:
md.head(10)

,Age,SystolicBP,DiastolicBP,BS,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,98.0,86,2
1,35,140,90,13.0,98.0,70,2
2,29,90,70,8.0,100.0,80,2
3,30,140,85,7.0,98.0,70,2
4,35,120,60,6.1,98.0,76,0
5,23,140,80,7.01,98.0,70,2
6,23,130,70,7.01,98.0,78,1
7,35,85,60,11.0,102.0,86,2
8,32,120,90,6.9,98.0,70,1
9,42,130,80,18.0,98.0,70,2


**Split Data**

In [12]:
train_df, test_df = train_test_split(md, test_size=0.3)
target = 'RiskLevel'
predictors = list(md.columns)
predictors.remove(target)

In [13]:
#Standardizing the numerical columns to have a common scale
scaler = preprocessing.StandardScaler()
cols_to_stdize = [ 'Age', 'SystolicBP', 
                   'DiastolicBP', 'BS', 'BodyTemp', 
                   'HeartRate']                
               
# Fit the scaler on the training set only, then apply the same transform to the test set
train_df[cols_to_stdize] = scaler.fit_transform(train_df[cols_to_stdize])
test_df[cols_to_stdize] = scaler.transform(test_df[cols_to_stdize])
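
To make the fit-on-train, transform-on-test pattern concrete, here is a minimal NumPy sketch of the same z-score computation. The values are hypothetical stand-ins for a single predictor; the point is only that the test values are scaled with the training mean and standard deviation, not their own.

```python
import numpy as np

# Hypothetical toy values standing in for one predictor (e.g. SystolicBP).
train_values = np.array([90.0, 120.0, 130.0, 140.0])
test_values = np.array([100.0, 160.0])

# The statistics come from the training data only.
train_mean = train_values.mean()
train_std = train_values.std()  # population std (ddof=0), which StandardScaler also uses

train_scaled = (train_values - train_mean) / train_std
test_scaled = (test_values - train_mean) / train_std  # reuse the training statistics

print(train_scaled.mean(), train_scaled.std())  # approximately 0 and 1
print(test_scaled)                              # generally not centered at 0
```

Computing the statistics on the training set alone keeps information about the test set out of the preprocessing step.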

In [14]:
train_df.RiskLevel.value_counts()

0    289
1    235
2    185
Name: RiskLevel, dtype: int64

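Resample the minority classes with replacement so that every class in the training data has 289 rows, matching the majority class. Only the training set is resampled; the test set keeps its original class distribution.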

In [15]:
class0 = train_df[train_df['RiskLevel']==0]
class1 = train_df[train_df['RiskLevel']==1]
class2 = train_df[train_df['RiskLevel']==2]

In [16]:
from sklearn.utils import resample
train_df_class1_resampled = resample(class1, 
                                 replace=True,     
                                 n_samples=289,    
                                 random_state=111)

In [17]:
train_df_class2_resampled = resample(class2, 
                                 replace=True,     
                                 n_samples=289,    
                                 random_state=111)

In [18]:
print(class0.shape,train_df_class1_resampled.shape,train_df_class2_resampled.shape)

(289, 7) (289, 7) (289, 7)
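
The per-class cells above can also be written as one loop. The following is a sketch, not the notebook's method: `upsample_to_majority` is a hypothetical helper, and it uses pandas `DataFrame.sample` rather than `sklearn.utils.resample`, so the rows drawn will differ from those above even with the same seed.

```python
import pandas as pd

def upsample_to_majority(df, target, random_state=111):
    """Resample every smaller class with replacement up to the majority class size."""
    majority_size = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        if len(group) < majority_size:
            # Draw majority_size rows with replacement from this class.
            group = group.sample(n=majority_size, replace=True, random_state=random_state)
        parts.append(group)
    return pd.concat(parts)

# Toy frame (hypothetical values) with class sizes 3, 2, and 1.
toy = pd.DataFrame({"Age": [25, 35, 29, 30, 42, 23],
                    "RiskLevel": [0, 0, 0, 1, 1, 2]})
balanced = upsample_to_majority(toy, "RiskLevel")
print(balanced["RiskLevel"].value_counts().sort_index())
```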


In [19]:
#Final training Dataset
train_df=pd.concat([class0,train_df_class1_resampled,train_df_class2_resampled])

In [20]:
train_X=train_df[predictors]
train_y = train_df[target] 
test_X = test_df[predictors]
test_y = test_df[target]

In [21]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

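Precision, recall, and F1 are micro-averaged because this is a multi-class problem. Note that for single-label multi-class classification, micro-averaged precision, recall, and F1 are all equal to accuracy, which is why the four metric columns in the results table are identical. Macro averaging or per-class scores would show how each risk level is handled.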
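
A small check of how the averaging mode affects the scores, using made-up labels rather than this notebook's predictions. It is a sketch showing that micro-averaged recall and precision coincide with accuracy for single-label multi-class data, while macro averaging and `average=None` expose per-class behavior.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels for three classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 0]

accuracy = accuracy_score(y_true, y_pred)
micro_recall = recall_score(y_true, y_pred, average="micro")
micro_precision = precision_score(y_true, y_pred, average="micro")
macro_recall = recall_score(y_true, y_pred, average="macro")
per_class_recall = recall_score(y_true, y_pred, average=None)

print(accuracy, micro_recall, micro_precision)  # all three are the same number
print(macro_recall)                             # unweighted mean of the per-class recalls
print(per_class_recall)                         # one recall value per class
```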

**Fitting Logistic Regression**

In [22]:
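# Unpenalized logistic regression; on scikit-learn < 1.2 pass penalty='none' (the string) instead of None
log_reg_model = LogisticRegression(penalty=None, max_iter=700)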
_ = log_reg_model.fit(train_X, np.ravel(train_y))
model_preds = log_reg_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': accuracy_score(test_y, model_preds), 
                                                    'Precision': precision_score(test_y, model_preds, average='micro'), 
                                                    'Recall': recall_score(test_y, model_preds, average='micro'), 
                                                    'F1': f1_score(test_y, model_preds, average='micro')
                                                     }, index=[0])])

**Fitting Decision Tree Classifier**

In [23]:
Dt=DecisionTreeClassifier(max_depth=15)
Dt=Dt.fit(train_X,np.ravel(train_y))
model_preds=Dt.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)  # 3x3 matrix: rows are true classes, columns are predicted classes
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree Classifier", 
                                                    'Accuracy': accuracy_score(test_y, model_preds), 
                                                    'Precision': precision_score(test_y, model_preds, average='micro'), 
                                                    'Recall': recall_score(test_y, model_preds, average='micro'), 
                                                    'F1': f1_score(test_y, model_preds, average='micro')
                                                     }, index=[0])])
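
With three classes the confusion matrix is 3x3, so single TP/TN/FP/FN values read from fixed cells do not describe it. A one-vs-rest breakdown per class can be derived as in the sketch below. The matrix values are hypothetical, not this notebook's results.

```python
import numpy as np

# Hypothetical 3x3 confusion matrix: rows are true classes, columns are predicted classes.
c_matrix = np.array([
    [50,  8,  2],
    [10, 30,  5],
    [ 3,  4, 40],
])

tp = np.diag(c_matrix)                # correctly predicted, per class
fp = c_matrix.sum(axis=0) - tp        # predicted as the class but actually another
fn = c_matrix.sum(axis=1) - tp        # actually the class but predicted as another
tn = c_matrix.sum() - (tp + fp + fn)  # everything else

per_class_recall = tp / (tp + fn)
per_class_precision = tp / (tp + fp)
print(tp, fp, fn, tn)
print(per_class_recall, per_class_precision)
```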

**Fitting SVM**

In [24]:
#SVM using linear kernel
svm_lin_model = SVC(kernel="linear",probability=True)
_ = svm_lin_model.fit(train_X, np.ravel(train_y))
model_preds = svm_lin_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)  # 3x3 matrix: rows are true classes, columns are predicted classes
performance = pd.concat([performance, pd.DataFrame({'model':"linear svm", 
                                                    'Accuracy': accuracy_score(test_y, model_preds), 
                                                    'Precision': precision_score(test_y, model_preds, average='micro'), 
                                                    'Recall': recall_score(test_y, model_preds, average='micro'), 
                                                    'F1': f1_score(test_y, model_preds, average='micro')
                                                     }, index=[0])])

In [25]:
#SVM using RBF Kernel
svm_rbf_model = SVC(kernel="rbf", C=10, gamma='scale')
_ = svm_rbf_model.fit(train_X, np.ravel(train_y))
model_preds = svm_rbf_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)  # 3x3 matrix: rows are true classes, columns are predicted classes
performance = pd.concat([performance, pd.DataFrame({'model':"rbf svm", 
                                                    'Accuracy': accuracy_score(test_y, model_preds), 
                                                    'Precision': precision_score(test_y, model_preds, average='micro'), 
                                                    'Recall': recall_score(test_y, model_preds, average='micro'), 
                                                    'F1': f1_score(test_y, model_preds, average='micro')
                                                     }, index=[0])])

In [26]:
#SVM using poly Kernel
svm_poly_model = SVC(kernel="poly", degree=3,coef0=1,C=1,probability=True)
_ = svm_poly_model.fit(train_X, np.ravel(train_y))
model_preds = svm_poly_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)  # 3x3 matrix: rows are true classes, columns are predicted classes
performance = pd.concat([performance, pd.DataFrame({'model':"poly svm", 
                                                    'Accuracy': accuracy_score(test_y, model_preds), 
                                                    'Precision': precision_score(test_y, model_preds, average='micro'), 
                                                    'Recall': recall_score(test_y, model_preds, average='micro'), 
                                                    'F1': f1_score(test_y, model_preds, average='micro')
                                                     }, index=[0])])

In [27]:
performance

,model,Accuracy,Precision,Recall,F1
0,default logistic,0.636066,0.636066,0.636066,0.636066
0,Decision Tree Classifier,0.796721,0.796721,0.796721,0.796721
0,linear svm,0.636066,0.636066,0.636066,0.636066
0,rbf svm,0.704918,0.704918,0.704918,0.704918
0,poly svm,0.695082,0.695082,0.695082,0.695082


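For this dataset, recall is the most relevant metric for comparison, because failing to flag an at-risk patient is the costliest error. Among the models above, the Decision Tree Classifier fits the data best: its micro-averaged recall on the test set is 0.797, compared with 0.705 for the RBF SVM, 0.695 for the polynomial SVM, and 0.636 for both logistic regression and the linear SVM. Because the scores are micro-averaged, recall here is identical to accuracy.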
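
The scoring code repeated in each model cell could be collected into one helper. The sketch below is one possible shape, assuming a hypothetical `score_row` function and toy labels. It builds the results table from a list of rows, which also gives each model its own index instead of repeating 0.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def score_row(name, y_true, y_pred, average="micro"):
    """Return one row of the results table for a fitted model's predictions."""
    return {
        "model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average=average),
        "Recall": recall_score(y_true, y_pred, average=average),
        "F1": f1_score(y_true, y_pred, average=average),
    }

# Toy labels and predictions (hypothetical) for two models.
y_true = [0, 1, 2, 2, 1, 0]
rows = [
    score_row("model A", y_true, [0, 1, 2, 1, 1, 0]),
    score_row("model B", y_true, [0, 2, 2, 2, 0, 0]),
]
performance_demo = pd.DataFrame(rows)
print(performance_demo)
```

Passing `average="macro"` to the helper would report scores that weight each risk level equally.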