# Group Project Report: Maternal Health Risk Classification

Members: Ruby Liu, Yu Wei Chen, Annabel Lim, Heather Jia

## Introduction

### Background Information

Depending on a number of factors, a pregnancy can be considered “high-risk”. In a high-risk pregnancy, both mother and child are more likely to experience health problems, so closer monitoring is needed to minimize harm. Knowing whether a pregnancy is high-risk allows medical professionals to take the preventative measures needed to protect the health of both mother and baby.

### Project Question

Can we use the maternal risk factor measurements provided in our data (age, systolic blood pressure, diastolic blood pressure, and blood sugar level) to predict whether someone who is pregnant is at high, mid, or low risk of maternal mortality?

### Data Set

We will be using the Maternal Health Risk Data Set from the UCI Machine Learning Repository. The data were collected from hospitals, community clinics, and maternal health care centres in rural areas of Bangladesh, where region-specific studies have found that 1 in 10 pregnant women have low blood glucose levels. The data set has columns for age, systolic blood pressure, diastolic blood pressure, blood sugar, body temperature, heart rate, and risk level.


## Methods and Results 

In [None]:
install.packages("tidymodels")
install.packages("kknn")


In [None]:
# load libraries 
library(tidyverse)
library(tidymodels)

# set seed value
set.seed(4)

In [None]:
# data set url
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00639/Maternal%20Health%20Risk%20Data%20Set.csv"

# read data 
maternity_data <- read_csv(url) |>
    mutate(RiskLevel = as_factor(RiskLevel))
head(maternity_data)

In [None]:
# keep predictors: Age, Systolic Blood Pressure, Diastolic Blood Pressure, Blood Sugar
# target variable: Risk Level 
maternity_selected <- maternity_data |>
    select(Age, DiastolicBP, SystolicBP, BS, RiskLevel)
head(maternity_selected)

In [None]:
set.seed(4)

# split data by 75% training, 25% testing 
maternity_split <- initial_split(maternity_selected, prop = 0.75, strata = RiskLevel) 

# training set 
maternity_train <- training(maternity_split)

# testing set
maternity_test <- testing(maternity_split)

head(maternity_train)
head(maternity_test)
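
Because the split is stratified on `RiskLevel`, the class proportions in the training and testing sets should be close to those in the full data. A minimal sketch of how this could be checked, using a made-up toy data set (the `toy` tibble, its `label` column, and the 100/60/40 class counts are assumptions for illustration, not the maternal health data):

```r
library(tidymodels)

set.seed(4)

# Toy data with an imbalanced three-level label (values made up for illustration)
toy <- tibble(
  x = rnorm(200),
  label = factor(rep(c("low", "mid", "high"), times = c(100, 60, 40)))
)

# Same call pattern as the split above: 75% training, stratified by the label
toy_split <- initial_split(toy, prop = 0.75, strata = label)

# Percentage of each class within the training and testing partitions
prop_table <- bind_rows(
  train = count(training(toy_split), label),
  test  = count(testing(toy_split), label),
  .id = "set"
) |>
  group_by(set) |>
  mutate(percent = 100 * n / sum(n)) |>
  ungroup()

prop_table
```

If stratification worked as intended, the `percent` column should look similar for the `train` and `test` rows of each class.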

In [None]:
# find proportions of labels
maternity_proportions <- maternity_train |>
    group_by(RiskLevel) |>
    summarize(n = n()) |>
    mutate(percent = 100*n/nrow(maternity_train))

maternity_proportions

# find mean of each predictor
maternity_predictor_means <- maternity_train |>
    select(- RiskLevel) |>
    map_df(mean)

maternity_predictor_means
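
The label proportions also give a useful reference point: a classifier that always predicts the most common class would be correct about as often as that class appears. A small sketch of this majority-class baseline, using hypothetical counts (the numbers in `toy_counts` are made up and are not the values from our training set):

```r
library(tidyverse)

# Hypothetical label counts, made up purely for illustration
toy_counts <- tibble(
  RiskLevel = c("low risk", "mid risk", "high risk"),
  n = c(40, 35, 25)
)

# The most frequent class and its share of the data:
# this share is the accuracy of always guessing that class
baseline <- toy_counts |>
  mutate(percent = 100 * n / sum(n)) |>
  slice_max(percent, n = 1)

baseline
```

Any model we train should have an accuracy clearly above this baseline to be considered useful.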

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)

# plot systolic BP (y) vs diastolic BP (x), colour by risk level
maternity_diastolic_vs_systolic <- maternity_train |>
    ggplot(aes(x = DiastolicBP, y = SystolicBP, color = RiskLevel)) +
        geom_point() +
        labs(x = "Diastolic Blood Pressure (mmHg)", y = "Systolic Blood Pressure (mmHg)", color = "Risk Level") +
        ggtitle("Systolic Blood Pressure vs Diastolic Blood Pressure by Risk Level") +
        theme(text = element_text(size = 16)) +
        theme(plot.title = element_text(hjust = 0.5))

maternity_diastolic_vs_systolic

# plot blood sugar vs age, colour by risk level
maternity_BS_vs_Age <- maternity_train |>
    ggplot(aes(x = Age, y = BS, color = RiskLevel)) +
        geom_point() +
        labs(x = "Age (years)", y = "Blood Sugar (mmol/L)", color = "Risk Level") +
        ggtitle("Blood Sugar vs Age by Risk Level") +
        theme(text = element_text(size = 16)) +
        theme(plot.title = element_text(hjust = 0.5))

maternity_BS_vs_Age

# plot number of individuals at each age, filled by risk level
maternity_age_vs_risk <- maternity_train |>
    ggplot(aes(x = Age, fill = RiskLevel)) +
        geom_bar() +
        labs(x = "Age (years)", y = "Number of Pregnant Individuals", fill = "Risk Level") +
        ggtitle("Risk Level Counts by Age") +
        theme(text = element_text(size = 16)) +
        theme(plot.title = element_text(hjust = 0.5))

maternity_age_vs_risk

In [None]:
set.seed(4)

# create recipe with all predictors, standardized (scaled, then centered)
maternity_recipe <- recipe(RiskLevel ~ ., data = maternity_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# create knn model specification
knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# create 5 cross-validation folds from the training set, stratified by risk level
train_vfold <- vfold_cv(maternity_train, v = 5, strata = RiskLevel)

# create data frame where k = 1 to 10
k_vals <- tibble(neighbors = seq(from = 1, to = 10))
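
The recipe scales and centers every predictor because K-nearest neighbours relies on distances, so a variable measured on a larger scale would otherwise dominate. A minimal base R sketch of what scaling followed by centering does to a single variable (the values in `x` are made up for illustration):

```r
# Toy predictor on a blood-pressure-like scale (values made up for illustration)
x <- c(90, 120, 140, 110)

# Scale by the standard deviation, then center: same order as the recipe steps
x_scaled <- x / sd(x)
x_standardized <- x_scaled - mean(x_scaled)

# The result matches the usual z-score, (x - mean) / sd
z <- (x - mean(x)) / sd(x)
all.equal(x_standardized, z)
```

After this transformation the variable has mean 0 and standard deviation 1, so each predictor contributes on a comparable scale to the distance calculation.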

In [None]:
# use workflow to combine recipe + model spec
# then tune model using tune_grid and collect metrics
knn_results <- workflow() |>
    add_recipe(maternity_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = train_vfold, grid = k_vals) |>
    collect_metrics()

# keep only the accuracy rows of the cross-validation metrics
accuracies <- knn_results |>
    filter(.metric == "accuracy")

head(accuracies)
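
Besides reading the best value off a plot, the k with the highest estimated accuracy could also be pulled out programmatically. A small sketch, assuming a table shaped like `accuracies` with `neighbors` and `mean` columns (the `toy_accuracies` values below are made up and are not our cross-validation results):

```r
library(tidyverse)

# Hypothetical cross-validation summary; the accuracy values are made up
toy_accuracies <- tibble(
  neighbors = 1:5,
  mean = c(0.70, 0.71, 0.73, 0.69, 0.68),
  std_err = c(0.02, 0.01, 0.02, 0.02, 0.03)
)

# Row with the highest mean accuracy; with_ties = FALSE keeps a single row
best_k <- toy_accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

When several values of k have nearly the same accuracy, the plot and the standard errors are still worth checking before settling on one.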

In [None]:
# plot k (neighbors) vs accuracy (mean)
cross_val_plot <- accuracies |>
    ggplot(aes(x = neighbors, y = mean)) +
        geom_point() +
        geom_line() +
        labs(x = "Neighbors (k)", y = "Accuracy Estimate") +
        ggtitle("Cross-Validation Plot for Estimated Accuracy") +
        scale_x_continuous(breaks = 1:10) +
        theme(text = element_text(size = 16)) +
        theme(plot.title = element_text(hjust = 0.5))

# view the plot to judge which k gives the highest estimated accuracy
cross_val_plot

In [None]:
# choose k = 2

# create knn model specification with chosen k
final_spec <- nearest_neighbor(weight_func = "rectangular",
                               neighbors = 2) |>
    set_engine("kknn") |>
    set_mode("classification")

# fit the final model on the training set, then predict on the testing set
final_results <- workflow() |>
    add_recipe(maternity_recipe) |>
    add_model(final_spec) |>
    fit(data = maternity_train) |>
    predict(maternity_test) |>
    bind_cols(maternity_test)

# extracting model's accuracy
final_accuracy <- final_results |>
    metrics(truth = RiskLevel, estimate = .pred_class) |>
    filter(.metric == "accuracy")
final_accuracy

# extracting model's confusion matrix
final_conf_mat <- final_results |>
    conf_mat(truth = RiskLevel, estimate = .pred_class)
final_conf_mat
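
Overall accuracy can hide how well each class is predicted, which matters here because missing a high-risk pregnancy is the costliest mistake. A minimal base R sketch of per-class recall computed from a confusion matrix laid out like the `conf_mat` output (rows are predictions, columns are the truth); the counts in `cm` are made up and are not our results:

```r
# Hypothetical confusion matrix (rows = predicted, columns = truth); counts made up
classes <- c("high risk", "low risk", "mid risk")
cm <- matrix(
  c(50,  2,  8,
     3, 70, 20,
     7, 28, 56),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Truth = classes)
)

# Recall per class: correct predictions divided by the true count of that class
recall <- diag(cm) / colSums(cm)

# Overall accuracy: all correct predictions divided by all observations
accuracy <- sum(diag(cm)) / sum(cm)

recall
accuracy
```

A low recall for the high-risk class would mean many high-risk individuals are labelled as lower risk, even if the overall accuracy looks acceptable.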

## Discussion

## References 