In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import altair as alt
# Altair refuses to plot more than 5000 rows by default.
# Uncomment the next line to lift that limit:
# alt.data_transformers.disable_max_rows()
# Uncomment the next line to restore the default 5000-row limit:
# alt.data_transformers.enable('default', max_rows=5000)


# Predicting Diabetes: A Data-Driven Approach Using Medical History and Demographic Data
##### Dataset: [Diabetes prediction dataset (kaggle.com)](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/code)
### Proposal by: 

![](https://storage.googleapis.com/kaggle-datasets-images/3102947/5344155/d4f2d9d63736fff7b6ba10f73774752e/dataset-cover.png?t=2023-04-08-06-42-24)
*istockphoto.com*

### Introduction:
Diabetes continues to be a critical research topic in modern healthcare and medicine. The condition affects how the body handles glucose: an individual with diabetes either has trouble producing insulin or cannot use insulin effectively to process glucose. This can lead to complications such as cardiovascular disease and nerve damage. Several risk factors are associated with diabetes, including obesity and age. Given that medical research has found a correlation between these risk factors and diabetes, this project aims to address the following question: **Can we predict the onset of diabetes based on a patient’s medical history and demographic data?**

To answer this question, we will use the <u>Diabetes prediction dataset</u> by Mohammed Mustafa. This dataset contains medical and demographic data collected worldwide, including features such as age, BMI, heart disease, HbA1c level, and blood glucose level.

### Methods:
We will begin by loading the dataset and changing the classification labels from numerical values (0 and 1) to categorical labels (negative and positive). Because only about 8.5% of patients in the dataset are positive, we upsample the positive class so that both classes are equally represented. We will then split the data into training and testing sets, stratifying on diabetes status so that both sets contain the same proportion of each class. This step ensures that our model is trained on a representative sample and is evaluated effectively. In this analysis, we will focus on essential variables such as age, BMI, heart disease, HbA1c level, and blood glucose level. These attributes were chosen for their potential relevance to predicting diabetes.
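
The relabeling step described above could look something like the following. This is only a minimal sketch on a tiny hand-made frame: the helper name `label_diabetes` and the toy values are assumptions made for illustration, not part of the dataset or of the analysis code in this proposal.

```python
import pandas as pd

def label_diabetes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the 0/1 `diabetes` column mapped to text labels."""
    labeled = df.copy()
    labeled["diabetes"] = labeled["diabetes"].map({0: "negative", 1: "positive"})
    return labeled

# Toy rows, made up for illustration only
toy = pd.DataFrame({"age": [80.0, 54.0, 28.0], "diabetes": [0, 1, 0]})
toy_labeled = label_diabetes(toy)
```

Working on a copy leaves the original numeric column available for any step that still expects 0/1 values.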

To explore the data, we will create visualizations, such as bar charts, that show the distribution of each attribute among patients, segmented by diabetes status. These techniques help us understand the data, identify patterns, and draw preliminary conclusions. Our approach is designed to be transparent and informative, guiding us toward building a robust prediction model for diabetes.



### Expected outcomes and significance:
By analyzing data from a population that includes people with and without diabetes, we aim to develop a prediction model that assesses a patient’s likelihood of having diabetes.

Such a model enables proactive measures. Identifying individuals at risk of developing diabetes allows them to make lifestyle modifications that reduce their risk. Preventing diabetes or addressing it in its early stages can also reduce costs for both patients and healthcare systems. Furthermore, governments and healthcare organizations could use these predictions to inform strategies to combat diabetes.

These findings could prompt further questions about how diabetes predictions can be used efficiently, and how governments and healthcare organizations are using them to prevent or mitigate the risk of diabetes in individuals.


### Preliminary exploratory data analysis:

In [2]:
diabetes = pd.read_csv("diabetes_prediction_dataset.csv")  # read the data
display(diabetes)  # preview the first and last rows
display(diabetes.info())  # column types and non-null counts
diabetes["diabetes"].value_counts(normalize = True)  # class proportions of the label

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0
...,...,...,...,...,...,...,...,...,...
99995,Female,80.0,0,0,No Info,27.32,6.2,90,0
99996,Female,2.0,0,0,No Info,17.37,6.5,100,0
99997,Male,66.0,0,0,former,27.83,5.7,155,0
99998,Female,24.0,0,0,never,35.42,4.0,100,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


None

0    0.915
1    0.085
Name: diabetes, dtype: float64

In [3]:
np.random.seed(1)  # seed NumPy's global RNG so resampling and splitting are reproducible

# Balance the classes: upsample the positive rows (sampling with replacement)
# until there are as many positive rows as negative rows
diabetes_negative = diabetes[diabetes["diabetes"] == 0]
diabetes_positive = diabetes[diabetes["diabetes"] == 1]
diabetes_positive_upscaled = resample(
    diabetes_positive, n_samples = diabetes_negative.shape[0]
)
diabetes_upsampled = pd.concat((diabetes_negative, diabetes_positive_upscaled))
diabetes_upsampled["diabetes"].value_counts(normalize = True)  # confirm the 50/50 balance

0    0.5
1    0.5
Name: diabetes, dtype: float64

In [4]:
# Split into 75% training and 25% testing, stratified so both sets keep the same class balance
diabetes_train, diabetes_test = train_test_split(
    diabetes_upsampled, train_size = .75, stratify = diabetes_upsampled["diabetes"]
)
display(diabetes_train.info())
diabetes_train["diabetes"].value_counts(normalize = True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 137250 entries, 94511 to 85718
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               137250 non-null  object 
 1   age                  137250 non-null  float64
 2   hypertension         137250 non-null  int64  
 3   heart_disease        137250 non-null  int64  
 4   smoking_history      137250 non-null  object 
 5   bmi                  137250 non-null  float64
 6   HbA1c_level          137250 non-null  float64
 7   blood_glucose_level  137250 non-null  int64  
 8   diabetes             137250 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 10.5+ MB


None

1    0.5
0    0.5
Name: diabetes, dtype: float64
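
One point worth checking before modeling: when the positive class is upsampled before the split, copies of the same positive row can end up in both the training and the testing set, which may make test scores look better than they should. A common alternative is to split first and upsample only the training portion. The sketch below shows that ordering on synthetic data; the `toy` frame, its values, and the `random_state` arguments are assumptions made for illustration and are not the analysis performed in this proposal.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data: 1000 rows, roughly 8.5% positive (assumed values)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "HbA1c_level": rng.normal(5.5, 1.0, size=1000),
    "diabetes": (rng.random(1000) < 0.085).astype(int),
})

# 1) Split first, stratifying on the label, so no row can land in both sets
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["diabetes"], random_state=1
)

# 2) Upsample the positive class inside the training set only
train_neg = toy_train[toy_train["diabetes"] == 0]
train_pos = toy_train[toy_train["diabetes"] == 1]
train_pos_up = resample(train_pos, n_samples=len(train_neg), random_state=1)
toy_train_balanced = pd.concat((train_neg, train_pos_up))
```

With this ordering the test set keeps the original class proportions, so accuracy alone can be misleading there and metrics such as recall or precision become more informative.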

In [5]:
# Keep only the numeric predictors: drop the text columns and the label
diabetes_stats = diabetes_train.drop(["gender", "smoking_history", "diabetes"], axis=1)
# Mean and standard deviation per column (for the 0/1 columns, the mean is the proportion of 1s)
diabetes_stats.agg(["mean","std"])

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level
mean,50.515819,0.152612,0.08937,29.42797,6.162753,163.420546
std,21.54606,0.359614,0.285278,7.44374,1.280485,56.88576
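
The table above pools both classes together. Since the goal is to compare patients with and without diabetes, the same summary could also be computed per class. The following is a minimal sketch on a small made-up frame; the `toy` values are assumptions for illustration, not rows or results from the dataset.

```python
import pandas as pd

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "HbA1c_level": [5.0, 5.7, 6.6, 7.5, 8.8, 4.8],
    "bmi": [23.45, 27.32, 25.19, 31.0, 35.42, 20.14],
    "diabetes": [0, 0, 0, 1, 1, 0],
})

# Mean and standard deviation of each predictor, computed separately per class
by_status = toy.groupby("diabetes")[["HbA1c_level", "bmi"]].agg(["mean", "std"])
```

The result has one row per class and a two-level column index of (variable, statistic), for example `by_status.loc[1, ("HbA1c_level", "mean")]`.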


In [6]:
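# Plot a random sample of 5000 rows to stay within Altair's default row limit
diabetes_small = diabetes_train.sample(5000)
diabetes_HbA1c_plot = alt.Chart(diabetes_small).mark_bar().encode(
    x = alt.X("HbA1c_level"),
    y = "count()",
    color = alt.Color("diabetes:N")  # treat the 0/1 label as a category, not a number
).facet("diabetes:N")
diabetes_HbA1c_plot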

This graph suggests that HbA1c level is strongly associated with the presence of diabetes: in this sample, all patients with very high HbA1c have diabetes, and all with very low HbA1c do not.
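
A reading like this can also be checked numerically by binning HbA1c and computing the share of positive patients in each bin. The sketch below illustrates the idea on made-up rows; the `toy` values and the bin edges (5.0 and 7.0) are assumptions chosen for illustration, not clinical thresholds or results from the dataset.

```python
import pandas as pd

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "HbA1c_level": [3.5, 4.0, 5.7, 6.0, 6.5, 6.6, 7.5, 9.0],
    "diabetes":    [0,   0,   0,   1,   0,   1,   1,   1],
})

# Illustrative bins: (0, 5] low, (5, 7] mid, (7, 10] high
toy["HbA1c_bin"] = pd.cut(
    toy["HbA1c_level"], bins=[0, 5.0, 7.0, 10.0], labels=["low", "mid", "high"]
)

# Because the label is 0/1, its mean within a bin is the share of positive patients
share_positive = toy.groupby("HbA1c_bin", observed=False)["diabetes"].mean()
```

A share of 0 in the lowest bin and 1 in the highest bin would match what the chart shows, while the middle bin indicates how much the two classes overlap.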

In [7]:
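diabetes_bmi_plot = alt.Chart(diabetes_small).mark_bar().encode(
    x = alt.X("bmi", bin = alt.Bin(maxbins = 30)),  # group the continuous BMI values into bins
    y = "count()",
    color = alt.Color("diabetes:N")  # treat the 0/1 label as a category, not a number
).facet("diabetes:N")
diabetes_bmi_plot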