<font size="8">**Final Report**

# **Introduction**

### What is Diabetes?

Diabetes is a metabolic condition in which the body is unable to regulate blood sugar levels effectively (American Diabetes Association, 2013). It is a common disease: a medical study reported that 38.5% of men and 32.8% of women in the US are at risk of the condition (Gray et al., 2015). There are two types of diabetes: Type I and Type II. In Type I diabetes, the body doesn’t produce insulin, a blood sugar-regulating hormone, and is therefore unable to regulate blood sugar levels. Type I affects around 5-10% of those with diabetes. In contrast, in Type II diabetes the body either doesn’t produce enough insulin or doesn’t use it effectively; this type accounts for around 90-95% of those with diabetes (American Diabetes Association, 2013).


### Diagnosing Diabetes & Question

The standard for diagnosing diabetes is a blood test showing a Hemoglobin A1c (a component of blood) level ≥ 6.5% (American Diabetes Association, 2013; Patel et al., 2023). Higher blood glucose levels are also typically associated with diabetes (American Diabetes Association, 2013; Patel et al., 2023). Interestingly, a study of factors associated with diabetes strongly suggests that Body Mass Index (BMI) is one of them: even a moderately higher BMI is associated with an increased risk of developing diabetes (Gray et al., 2015; Patel et al., 2023). Thus, for this project, we aim to answer the question: **Can we predict a patient's diabetes diagnosis based on their blood glucose level (mg/dL) and BMI (kg/m^2)?**
    
### Dataset

The dataset we will be using for this project contains demographic and laboratory variables on African-American patients, including height, weight, gender, age, Hemoglobin A1c level, and blood pressure. It was initially compiled by Mohamadreza Momeni for use in machine learning models for diabetes diagnosis (Momeni, 2023).

### Biases in Diabetes Literature Review

The motivation for using this dataset is to encourage equity in medical research by using data from a racially diverse sample. A 2023 study found that the current literature on diabetes diagnosis is biased, as a large number of diagnosis models are based on data collected largely from non-Hispanic Whites. This bias implies a dangerous overdiagnosis of diabetes among non-Hispanic Whites *and* an underdiagnosis of diabetes among non-Hispanic Blacks (Cronjé et al., 2023). Thus, we have chosen to conduct our project using this dataset of African-American participants to create a model that furthers the goal of inclusion and equity in healthcare.


# **Methods & Results** 
    
-- Description of methods -- write this after we've done all the code --

Please run the following cell to load the necessary packages.


In [None]:
# Run this cell before continuing
library(rvest)
library(tidyverse)
library(tidymodels)
library(repr)
install.packages("themis")
library(themis)
install.packages('kknn')
library(kknn)
source('tests.R')
source('cleanup.R')
set.seed(2023)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune    

### **1) Load Data**

Our data is loaded from a raw-file URL generated through GitHub.

In [None]:
#Read data 
URL <- 'https://raw.githubusercontent.com/wmma2/group_18_project/main/diabetes.csv'
diabetes_data <- read_csv(URL)

#Check all the available columns
glimpse(diabetes_data)

### **2) Clean & Wrangle Data**

The dataset is already tidy. 

To prepare for our analysis, we will need to create 2 new columns: **diagnosis** (our categorical column) and **BMI** (in *kg/m^2*) (one of our predictor variables). 

- The **diagnosis** column will use data from the `glyhb` (Glycosylated Hemoglobin) column, which corresponds to Hemoglobin A1c levels, to indicate whether a person has diabetes. Rows with `glyhb` greater than or equal to 6.5 will be labelled 'yes' and rows with `glyhb` less than 6.5 will be labelled 'no'.

- The **BMI** column will use data from the `weight` (pounds) and `height` (inches) columns. After converting to kilograms and metres, BMI will be calculated using the standard formula (Fehring et al., 2007): BMI = (`weight` $\times$ 0.45359237) / (`height` $\times$ 0.0254)$^2$

Additionally, we filter out rows with missing values in `glyhb`, `BMI` or `stab.glu`, as these rows cannot be used to label a diagnosis or to make predictions.

Finally, we will select the necessary columns: `BMI`, `diagnosis` and `stab.glu`, as we are interested in whether the factors `stab.glu` (stable blood glucose levels) and `BMI` are related to the diabetes diagnosis.

Our clean and wrangled data will be assigned to a new tibble called `tidy_diabetes`.
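
As a quick sanity check of the two rules described above, the sketch below applies them to a single made-up patient in base R. The values (150 lb, 65 in, `glyhb` of 7.1) and the variable names are hypothetical and are not taken from the dataset.

```r
# Hypothetical patient: values are illustrative only, not from the dataset
weight_lb <- 150
height_in <- 65
glyhb <- 7.1

# Convert to metric units, then apply BMI = kg / m^2
weight_kg <- weight_lb * 0.45359237
height_m <- height_in * 0.0254
bmi <- weight_kg / height_m^2

# Label the diagnosis with the 6.5 threshold on glycosylated hemoglobin
diagnosis <- ifelse(glyhb >= 6.5, "yes", "no")

round(bmi, 2)
diagnosis
```

A result just under 25 kg/m^2 is in the range expected for an adult, which suggests the unit conversions are being applied correctly.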

In [None]:
tidy_diabetes <- diabetes_data |>

#Create 'diagnosis' column
        mutate(diagnosis = if_else(glyhb >= 6.5, "yes", "no")) |>
        mutate(diagnosis = as_factor(diagnosis))|>

#Create 'BMI' column
        mutate(height_m = height*0.0254, 
               weight_kg = weight*0.45359237,
               BMI = weight_kg/height_m^2) |>

#Filter missing values
        filter(!is.na(glyhb + BMI + stab.glu)) |>

#Select necessary columns
        select(stab.glu, BMI, diagnosis)
               
head(tidy_diabetes)

With the clean and wrangled data above, we can now continue with our data analysis.

### **3) Exploratory Data Analysis**

#### **3.1) Split data**

First, we split the `tidy_diabetes` dataset into training (`diabetes_train`) and testing (`diabetes_test`) sets so that we can carry out the exploratory data analysis on the training set only. We have chosen a 75/25 split, stratified by `diagnosis`: 75% of the rows go to the training set and 25% to the testing set.

We do not need to check for missing values as we have already filtered them out in the previous step. 
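
To illustrate why the split is stratified, here is a minimal base-R sketch on a made-up data frame (`toy`, with 32 'no' and 8 'yes' rows; all names and values are hypothetical). It samples 75% of the rows within each class, so both sets keep the same class proportions. This only illustrates the idea and is not how `initial_split()` is implemented.

```r
set.seed(2023)

# Hypothetical imbalanced data: 32 "no" rows and 8 "yes" rows
toy <- data.frame(
    id = 1:40,
    diagnosis = factor(rep(c("no", "yes"), times = c(32, 8)))
)

# Within each class, draw 75% of that class's row indices for training
rows_by_class <- split(seq_len(nrow(toy)), toy$diagnosis)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Class proportions are the same in both sets
prop.table(table(toy_train$diagnosis))
prop.table(table(toy_test$diagnosis))
```

Without stratification, a random split of a small imbalanced dataset could leave very few 'yes' rows in the testing set, which would make the test accuracy less informative.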

In [None]:
set.seed(2023)

diabetes_split <- initial_split(tidy_diabetes, prop = 0.75 , strata = diagnosis)
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)

#### **3.2) Uneven data proportion**

While looking at our data on Kaggle, we noticed that there seemed to be many more rows for patients without diabetes than for patients with diabetes. To check this, we count the number of rows for each `diagnosis` label.

In [None]:
train_count <- diabetes_train |>
    group_by(diagnosis)|>
    summarize(count = n()) 

#### **3.3) Distributions of BMI & stab.glu**

Next, we calculate the means of the variables `BMI` and `stab.glu` to check whether we will need to standardize our data.

In [None]:
train_mean <- diabetes_train |>
    summarize(stab.glu_mean = mean(stab.glu),
              BMI_mean = mean(BMI))

#### **3.4) Count missing Data**

Finally, we count the number of missing values in our data to see if we need to remove any missing data.

In [None]:
train_NAs <- sum(is.na(diabetes_train))

#### **Summary of exploratory data analysis**

In [None]:
#run this cell
train_count
train_mean
train_NAs

From `train_count`, we see that the two diagnosis classes are imbalanced, so the minority class needs to be upsampled when training our algorithm.

From `train_mean`, we see that the means of `stab.glu` and `BMI` differ by quite a bit, thus we will need to standardize the data.

We can also see that our dataset has no missing values, meaning we do not need to remove any values.
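
To make the idea of upsampling concrete, the following base-R sketch balances a made-up data frame by re-drawing minority-class rows with replacement until both classes have the same count. The data and variable names are hypothetical, and this is only an illustration of random upsampling, not the internals of `step_upsample()`.

```r
set.seed(2023)

# Hypothetical imbalanced data: 8 "no" rows and 2 "yes" rows
toy <- data.frame(
    stab.glu = c(82, 90, 76, 101, 95, 88, 79, 85, 180, 210),
    diagnosis = factor(c(rep("no", 8), rep("yes", 2)))
)

# Work out which class is the minority and how many extra rows it needs
class_counts <- table(toy$diagnosis)
minority <- names(which.min(class_counts))
n_extra <- max(class_counts) - min(class_counts)

# Re-draw minority rows with replacement and append them
minority_rows <- which(toy$diagnosis == minority)
extra_rows <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
balanced <- rbind(toy, toy[extra_rows, ])

table(balanced$diagnosis)
```

Because the added rows are exact copies, upsampling gives the minority class more votes among the nearest neighbours without adding any new information.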

### **4) Visualization**

We will now plot blood glucose level (`stab.glu`) against `BMI` to check the distribution of our predictors and to see if there is any obvious correlation. We will also color-code the data points by their `diagnosis` label to check for patterns.

In [None]:
#run this cell 
options(repr.plot.width = 10, repr.plot.height = 5)

train_plot <- diabetes_train |>
    ggplot(aes(x = stab.glu, y = BMI, colour = diagnosis)) +
    geom_point() +
    labs(x = "Blood glucose (mg/dl)", y = "Body Mass Index (kg/m^2)", colour = "Diabetes diagnosis?") +
    ggtitle('Diabetes diagnosis in relation to Blood Glucose and BMI') +
    theme(text=element_text(size = 15))

train_plot

From the plot above, we can see that a lower blood glucose level seems to be associated with no diabetes. The trend with BMI is harder to tell as both diagnosis labels appear to fall in the same range.

From the visualization, we can also see that blood glucose (`stab.glu`) spans a much larger range than `BMI`, meaning it will have a greater effect on the distances computed by our k-NN model. To counter this, we will standardize the predictors in our data analysis.
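
The effect of scale on distance can be seen with a small base-R sketch. The three patients below are made up for illustration (they are not rows from the dataset): patient 2 differs from patient 1 mostly in BMI, while patient 3 differs mostly in blood glucose.

```r
# Hypothetical toy values, not taken from the diabetes dataset
toy <- data.frame(
    stab.glu = c(80, 85, 200),
    BMI = c(20, 40, 21)
)

# Euclidean distance from patient 1 to patients 2 and 3 on the raw scale
raw_dist <- as.matrix(dist(toy))[1, 2:3]

# Standardize each column (mean 0, standard deviation 1), then recompute
scaled_dist <- as.matrix(dist(scale(toy)))[1, 2:3]

raw_dist
scaled_dist
```

On the raw scale, patient 3 is several times farther from patient 1 than patient 2 is, purely because glucose is measured in larger numbers. After standardizing, the two distances are roughly equal, so both predictors contribute comparably to the choice of nearest neighbours.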

### **5) Data Analysis**

Insert description...

#### **5.1) Upsampling uneven data**


In [None]:


#Create a recipe to upsample the minority class in our training data
#(standardization is handled separately in diabetes_recipe)

ups_recipe <- recipe(diagnosis~., data = diabetes_train) |>
    step_upsample(diagnosis, over_ratio = 1, skip = FALSE) |>
    prep()
diabetes_even_train <- bake(ups_recipe, diabetes_train)
                                         
#Check count

check_count<- diabetes_even_train |>
    group_by(diagnosis)|>
    summarize (count = n())

check_count

#### **5.2) Creating cross-validation sets**

In [None]:
#Apply cross-validation
set.seed(2023)

diabetes_vfold <- vfold_cv(diabetes_even_train, v = 5, strata = diagnosis)

# diabetes_recipe <- recipe(diagnosis ~ stab.glu + BMI,  data = even_data)



#### **5.3) Tuning K**

In [None]:
#Tuning


knn_tune <- nearest_neighbor(weight_func = 'rectangular', neighbors = tune()) |>
    set_engine('kknn') |>
    set_mode('classification')
knn_tune


#### **5.4) Recipe**

In [None]:

#scaled data 
diabetes_recipe <- recipe(diagnosis ~ stab.glu + BMI, data = diabetes_even_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

diabetes_recipe

In [None]:

knn_results <- workflow() |>
       add_recipe(diabetes_recipe) |>
       add_model(knn_tune) |>
       tune_grid(resamples = diabetes_vfold, grid = 10) |>
       collect_metrics()
knn_results

In [None]:
#Filter for accuracies
accuracies <- knn_results |> 
      filter(.metric == 'accuracy')

#plot best K
accuracy_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean))+
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate") +
      scale_x_continuous(breaks = seq(0, 14, by = 1)) +  # adjusting the x-axis
      scale_y_continuous(limits = c(0.8, 1.0)) # adjusting the y-axis
accuracy_versus_k


In [None]:

#set k
knn_spec_optimal <- nearest_neighbor(weight_func = 'rectangular', neighbors = 10) |>
    set_engine('kknn') |>
    set_mode('classification')
knn_spec_optimal

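#results: estimate accuracy for the chosen k with cross-validation
#(fit_resamples is used because this specification has nothing left to tune)
knn_results_optimal <- workflow() |>
       add_recipe(diabetes_recipe) |>
       add_model(knn_spec_optimal) |>
       fit_resamples(resamples = diabetes_vfold) |>
       collect_metrics()
knn_results_optimal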

#### **5.5) Create K-NN classifier and train the classifier, k = 10**

In [None]:
#The k-NN specification with k = 10 (knn_spec_optimal) was created above

#Training the classifier on the upsampled training data
diabetes_workflow <- workflow() |>
    add_recipe(diabetes_recipe) |>
    add_model(knn_spec_optimal) |>
    fit(data = diabetes_even_train)

diabetes_workflow

#### **5.6) Fit Data**

In [None]:
#Use the fitted workflow so that new data passes through the same recipe (standardization)
diabetes_fit <- diabetes_workflow
    
diabetes_fit

In [None]:
#predictions

diabetes_test_predictions <- predict(diabetes_fit, diabetes_test) |>
        bind_cols(diabetes_test)

head(diabetes_test_predictions)
tail(diabetes_test_predictions)

diabetes_metrics <- diabetes_test_predictions |>
    metrics(truth = diagnosis, estimate = .pred_class) |>
    filter(.metric == 'accuracy')
diabetes_metrics

In [None]:
#matrix
diabetes_mat <- diabetes_test_predictions |>
    conf_mat(truth = diagnosis, estimate = .pred_class)
diabetes_mat

## **Discussion**

summarize what you found
discuss whether this is what you expected to find?
discuss what impact could such findings have?
discuss what future questions could this lead to?

## References

<font size="2">American Diabetes Association. (2013). Diagnosis and Classification of Diabetes Mellitus. *Diabetes Care, 37(1)*, S81–S90. https://doi.org/10.2337/dc14-S081

<font size="2">Cronjé, Héléne T., Katsiferis, Alexandros, Elsenburg, Leonie K., Andersen, Theo O., Rod, Naja H., & Varga, Tibor V. (2023). Assessing racial bias in type 2 diabetes risk prediction algorithms. *PLOS Global Public Health, 3(5)*, e0001556. https://doi.org/10.1371/journal.pgph.0001556
    
<font size="2">Fehring, Thomas K., Odum, Susan M., Griffin, William L., Mason, Bohannon, & McCoy, Thomas H. (2007). The Obesity Epidemic: Its Effect on Total Joint Arthroplasty. *The Journal of Arthroplasty, 22(6)*, 71-76. https://doi.org/10.1016/j.arth.2007.04.014

<font size="2">Gray, Natallia, Picone, Gabriel, Sloan, Frank, & Yashkin, Arseniy. (2015). The Relationship between BMI and Onset of Diabetes Mellitus and its Complications. *Southern Medical Journal, 108(1)*, 29-36. https://doi.org/10.14423/SMJ.0000000000000214

<font size="2">Momeni, Mohamadreza. (2023). Diabetes. Version 1. Retrieved Oct 24, 2023 from https://www.kaggle.com/datasets/imtkaggleteam/diabetes

<font size="2">Patel, B. J., Mehta, D. N., Vaghani, A., & Patel, K. (2023). Correlation of Body Mass Index (BMI) with Saliva and Blood Glucose Levels in Diabetic and Non-Diabetic Patients. *Journal of pharmacy & bioallied sciences, 15(Suppl 2)*, S1204–S1207. https://doi.org/10.4103/jpbs.jpbs_159_23
