# Predicting Diabetes: A Data-Driven Approach Using Medical History and Demographic Data
 Report by: Trevor Harms, Jena Arianto, Jina Kang, Ricky Wong



![](https://storage.googleapis.com/kaggle-datasets-images/3102947/5344155/d4f2d9d63736fff7b6ba10f73774752e/dataset-cover.png?t=2023-04-08-06-42-24)
*istockphoto.com*

## Introduction
Diabetes remains a pervasive and important health concern within contemporary healthcare and medical research. Individuals with diabetes encounter challenges either in insulin production or the efficient utilization of insulin for glucose processing, resulting in potential complications such as cardiovascular issues and nerve damage. The multifaceted nature of this condition emphasizes the significance of research efforts aimed at explaining predictive factors for diabetes onset. Notably, existing medical literature has established correlations between diabetes and various risk factors, including obesity, age, and other demographic variables.

This project aims to address a pivotal question in the realm of diabetes research: **Can the onset of diabetes be predicted based on an analysis of a patient's medical history and demographic data?** To explore this question, we turn our attention to the [diabetes prediction dataset](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/code) meticulously curated by Mohammed Mustafa. Drawing from a global pool of medical and demographic data, this dataset contains a diverse array of features, including but not limited to age, body mass index (BMI), heart disease status, HbA1c level, and blood glucose level.

In the pursuit of answering our research question, we draw inspiration from relevant studies, such as the work by Brown et al. (2015), which explores predictive modeling in the context of diabetes, and the research by Diabetes Care (2005), delving into the prediction of diabetes development in older populations. These references serve as foundational pillars, guiding our approach and establishing connections between our project and established research in the field.


## Methods & Results

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import altair as alt
from sklearn.model_selection import cross_val_score

# Disables maximum rows allowed for altair plots
alt.data_transformers.disable_max_rows()
# Uncomment below to re-enable max rows
# alt.data_transformers.enable('default', max_rows=5000)

DataTransformerRegistry.enable('default')

In [4]:
url = "https://drive.google.com/file/d/1dTmTAiRGM5skZzMb9NwpOkcQrY0Dpq6t/view?usp=sharing"
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
diabetes = pd.read_csv(url) #read data
display(diabetes)
display(diabetes.info())
diabetes["diabetes"].value_counts(normalize = True) #show classification variable distribution

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0
...,...,...,...,...,...,...,...,...,...
99995,Female,80.0,0,0,No Info,27.32,6.2,90,0
99996,Female,2.0,0,0,No Info,17.37,6.5,100,0
99997,Male,66.0,0,0,former,27.83,5.7,155,0
99998,Female,24.0,0,0,never,35.42,4.0,100,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


None

0    0.915
1    0.085
Name: diabetes, dtype: float64

Next we need to resample the data to create an even distribution of positive and negative labels.

In [5]:
np.random.seed(1) # set seed

diabetes_negative = diabetes[diabetes["diabetes"] == 0] #create even amounts of positive and negative labels
diabetes_positive = diabetes[diabetes["diabetes"] == 1]
diabetes_negative_downscaled = resample(
    diabetes_negative, n_samples = diabetes_positive.shape[0]
)
diabetes_negative_downscaled.shape[0]
diabetes_downsampled = pd.concat((diabetes_positive, diabetes_negative_downscaled))
display(diabetes_downsampled["diabetes"].value_counts(normalize = True))
diabetes_downsampled.info()

1    0.5
0    0.5
Name: diabetes, dtype: float64

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17000 entries, 6 to 50426
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   gender               17000 non-null  object 
 1   age                  17000 non-null  float64
 2   hypertension         17000 non-null  int64  
 3   heart_disease        17000 non-null  int64  
 4   smoking_history      17000 non-null  object 
 5   bmi                  17000 non-null  float64
 6   HbA1c_level          17000 non-null  float64
 7   blood_glucose_level  17000 non-null  int64  
 8   diabetes             17000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 1.3+ MB


Now that the data is resampled we can create the train/test split

In [6]:
diabetes_train, diabetes_test = train_test_split(
    diabetes_downsampled, train_size = .75, stratify = (diabetes_downsampled["diabetes"]), random_state=42 # split data
)
display(diabetes_train.info())
diabetes_train["diabetes"].value_counts(normalize = True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12750 entries, 7874 to 68193
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   gender               12750 non-null  object 
 1   age                  12750 non-null  float64
 2   hypertension         12750 non-null  int64  
 3   heart_disease        12750 non-null  int64  
 4   smoking_history      12750 non-null  object 
 5   bmi                  12750 non-null  float64
 6   HbA1c_level          12750 non-null  float64
 7   blood_glucose_level  12750 non-null  int64  
 8   diabetes             12750 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 996.1+ KB


None

0    0.5
1    0.5
Name: diabetes, dtype: float64

### Categorical Visualization

Categorical visualization serves as a tool for distilling complex data into insights, particularly when exploring variations across different categories within a dataset. By using bar chart as categorical visualization technique, we can effectively communicate the distribution of key categories, offering a clear and concise means to understand how specific factors vary across counties. 


In [16]:
gender_value = diabetes["gender"].value_counts()

gender_value

Female    58552
Male      41430
Other        18
Name: gender, dtype: int64

In [17]:
gender_bar = alt.Chart(pd.DataFrame({"gender": gender_value.index, "count": gender_value})).mark_bar().encode(
    x=alt.X("gender:N", title="gender"),  
    y=alt.Y("count:Q", title="Count"), 
).properties(
    width=400,
    height=200,
    title="Gender Bar"
)

gender_bar

In [18]:
hypertension_value = diabetes["hypertension"].value_counts()

hypertension_value

0    92515
1     7485
Name: hypertension, dtype: int64

In [19]:
hypertension_bar = alt.Chart(pd.DataFrame({"hypertension": hypertension_value.index, "count": hypertension_value})).mark_bar().encode(
    x=alt.X("hypertension:N", title="hypertension"),  
    y=alt.Y("count:Q", title="Count"), 
).properties(
    width=400,
    height=200,
    title="Hypertension Bar"
)

hypertension_bar

In [23]:
heart_disease_value = diabetes["heart_disease"].value_counts()

heart_disease_value

0    96058
1     3942
Name: heart_disease, dtype: int64

In [24]:
heart_disease_bar = alt.Chart(pd.DataFrame({"heart_disease": heart_disease_value.index, "count": heart_disease_value})).mark_bar().encode(
    x=alt.X("heart_disease:N", title="heart_disease"),  
    y=alt.Y("count:Q", title="Count"), 
).properties(
    width=400,
    height=200,
    title="Heart Disease Bar"
)

heart_disease_bar

In [25]:
diabetes_value = diabetes["diabetes"].value_counts()

diabetes_value

0    91500
1     8500
Name: diabetes, dtype: int64

In [26]:
diabetes_bar = alt.Chart(pd.DataFrame({"diabetes": diabetes_value.index, "count": diabetes_value})).mark_bar().encode(
    x=alt.X("diabetes:N", title="Diabetes"),  
    y=alt.Y("count:Q", title="Count"), 
).properties(
    width=400,
    height=200,
    title="Diabetes bar"
)

diabetes_bar

In [27]:
smoking_history_value = diabetes["smoking_history"].value_counts()

smoking_history_value

No Info        35816
never          35095
former          9352
current         9286
not current     6447
ever            4004
Name: smoking_history, dtype: int64

In [28]:
smoking_history_bar = alt.Chart(pd.DataFrame({"smoking_history": smoking_history_value.index, "count": smoking_history_value})).mark_bar().encode(
    x=alt.X("smoking_history:N", title="Smoking History Types"),  
    y=alt.Y("count:Q", title="Count"), 
).properties(
    width=400,
    height=200,
    title="Smoking History Bar"
)

smoking_history_bar

Now we filter the data to the numeric columns to aggregate and observe trends for remaining variables.

In [20]:
diabetes_stats = diabetes.drop(["gender", "heart_disease", "hypertension", "smoking_history", "diabetes"], axis=1) # filter to numerical values values
display(diabetes_stats.agg(["mean","std"])) #show average + variability demographics for survey

Unnamed: 0,age,bmi,HbA1c_level,blood_glucose_level
mean,41.885856,27.320767,5.527507,138.05806
std,22.51684,6.636783,1.070672,40.708136


### Preprocessing

Next we'll make the preprocessor to use in K Means Classification

In [8]:
feature_names = ["age", "bmi", "HbA1c_level", "blood_glucose_level"]

diabetes_preprocessor = make_column_transformer(
    (StandardScaler(), feature_names),
)
diabetes_preprocessor

In [9]:
X_train = diabetes_train[["age", "bmi", "HbA1c_level", "blood_glucose_level"]]
y_train = diabetes_train["diabetes"]

diabetes_preprocessor.fit(X_train)
X_train_scaled = diabetes_preprocessor.transform(X_train)
diabetes_scaled_df = pd.DataFrame(X_train_scaled, columns=feature_names)
diabetes_scaled_df

Unnamed: 0,age,bmi,HbA1c_level,blood_glucose_level
0,0.717328,-0.105324,-0.286005,0.639068
1,-1.790850,-1.594597,-1.063292,-1.119277
2,0.949567,-0.025256,0.491282,-0.415939
3,0.392194,-0.083973,0.335825,-1.295112
4,0.763776,0.664667,-1.685122,-0.152187
...,...,...,...,...
12745,-0.397418,-0.280140,0.024910,-0.328022
12746,0.717328,2.991990,1.579484,-0.081853
12747,-0.072284,0.691357,0.024910,1.694076
12748,0.392194,0.348397,-0.286005,-0.591774


Using our preprocessor, we can train a new model.

In [10]:
diabetes_pipe_knn = make_pipeline(diabetes_preprocessor, KNeighborsClassifier())

cv_scores = cross_val_score(diabetes_pipe_knn, X_train, y_train, cv=5, scoring='accuracy')

cv_scores_std = np.std(cv_scores)
print("average cv score:", cv_scores.mean(), "±", cv_scores_std)

average cv score: 0.8830588235294119 ± 0.005887056942680853


Here, we see a score of 0.88305. This is fine, however let's see if we can improve our score by tuning the `n_neighbors` hyperparameter in `KNeighborsClassifier()`.

### Hyperparameter Optimization
Now, we'll make a GridSearch CV object and a range of potential K values to find the best K value

In [11]:
diabetes_grid = {
    "kneighborsclassifier__n_neighbors"  : range(
        1,60, 2),
}
diabetes_pipe = make_pipeline(diabetes_preprocessor, KNeighborsClassifier())
diabetes_grid = GridSearchCV(
    estimator = diabetes_pipe,
    param_grid = diabetes_grid,
    cv = 5
)
accuracies_grid = pd.DataFrame(
    diabetes_grid.fit(
        diabetes_train[["age", "bmi", "HbA1c_level", "blood_glucose_level"]],
        diabetes_train["diabetes"],
    ).cv_results_
)
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "std_test_score"
    ]]
    .assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2))
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
    .drop(columns=["std_test_score"])
)
accuracies_grid

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,0.87302,0.00177
1,3,0.877882,0.002248
2,5,0.883059,0.001862
3,7,0.887843,0.001569
4,9,0.888941,0.001294
5,11,0.892863,0.001753
6,13,0.894039,0.001591
7,15,0.89302,0.00155
8,17,0.892627,0.001156
9,19,0.893725,0.000976


In [12]:
accuracy = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x = alt.X("n_neighbors"),
    y = alt.Y("mean_test_score", title='Mean Test Score', scale=alt.Scale(zero=False)),
).properties(
    title='Ideal n_neighbors value based off Mean Test Score',
    width=600
)
accuracy

Based off `accuracies_grid` and our accuracy plot, we can see that `KNeighborsClassifier()` will perform best when `n_neighbors = 31` based off mean test score. Although `n_neighbors = 31` has a high `sem_test_score` such that there may be more variability in the mean test score, it will likely perform better than all other `n_neighbors` values up to `n_neighbors = 60`. There is a possibility of a better `n_neighbors` value past 60, however we will be unable to determine such value due to limited computational power.

In [14]:
diabetes_pipe_knn_31 = make_pipeline(diabetes_preprocessor, KNeighborsClassifier(n_neighbors=31))

X_train_knn_31 = diabetes_train[["age", "bmi", "HbA1c_level", "blood_glucose_level"]]
y_train_knn_31 = diabetes_train["diabetes"]

cv_scores = cross_val_score(diabetes_pipe_knn_31, X_train_knn_31, y_train_knn_31, cv=5, scoring='accuracy')

cv_scores_std = np.std(cv_scores)
print("average cv score:", cv_scores.mean(), "±", cv_scores_std)

average cv score: 0.8965490196078431 ± 0.0016711588825617852


After performing a cross-validation where cv=5, our model has a mean accuracy of 0.89655 with a standard deviation of 0.00167. A high accuracy may indicate that our model is able to correctly make predictions, and a low standard deviation indicates that our model is likely not underfitting nor overfitting. Compared to our pre-tuned model, the post-tuned model sees a slight improvement.

### Feature Selection
So far, we have found a feature set that allows us to train a relatively decent model. However, we would like to see whether there is another pool of features which may produce a more accurate model. To do this, we will separate our data into numerical features, binary features, and categorical features. We will also drop 'gender', as it may introduce bias into our final model. We could choose to represent 'gender' as a binary feature or an ordinal feature, but we believe it is best to not to include it at all.

We will pass all numerical features through `StandardScaler()` like previously, but this time we will pass categorical features (`smoking_history`) through `OrdinalEncoder()`, as we will choose to interpret smoking history as different levels of smoking intensity (never < former < current). However, we would also like to note that that 'no info' may affect our results, and using `OneHotEncoder()` is another option to consider in the future.

As for our binary features, we will use 'passthrough' as they do not need any additional processing to be used in `KNeighborsClassifier()`.

Lastly, we need to consider that the previous `n_neighbors=31` hyperparamater that we found is specific to the feature set containing only numerical features. Thus, we will use the default value of `n_neighbors=5` in our `KNeighborsClassifier()`.

In [12]:
X_train_all = diabetes_train.drop(columns=['diabetes', 'gender'])
y_train_all = diabetes_train['diabetes']
X_test_all = diabetes_test.drop(columns=['diabetes', 'gender'])
y_test_all = diabetes_test['diabetes']

numerical_features = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']
binary_features = ['hypertension', 'heart_disease'] ## No need to perform any preprocessing
categorical_features = ['smoking_history', ]

preprocessor_all = make_column_transformer(
    (StandardScaler(), numerical_features),
    (OrdinalEncoder(), categorical_features),
    ('passthrough', binary_features)
)

X_train_transformed = preprocessor_all.fit_transform(X_train_all)
X_test_transformed = preprocessor_all.fit_transform(X_test_all)
y_train_transformed = y_train_all
y_test_transformed = y_test_all

diabetes_pipe_all = make_pipeline(preprocessor_all, KNeighborsClassifier())

cv_scores = cross_val_score(diabetes_pipe_all, X_train_all, y_train_all, cv=5, scoring='accuracy')

cv_scores_std = np.std(cv_scores)
print("average cv score:", cv_scores.mean(), "±", cv_scores_std)

average cv score: 0.8898823529411766 ± 0.005531497312158574


Our score seems to be be better than the previous pre-tuned model, while performing worse than the post-tuned model. Let's see if we can select an ideal combinations of features by manually sifting through all combinations and finding the set with the best score.

In [13]:
from sklearn.model_selection import cross_val_score
from itertools import combinations

## Used ChatGPT to help iterate over all possible combinations of features as to select the top ten feature sets with the greatest score

def score_feature_combinations(X, y, model, preprocessors):
    scores = []

    for r in range(1, len(X.columns) + 1):
        for feature_combination in combinations(X.columns, r):
            preprocessor = make_column_transformer(
                (StandardScaler(), [feature for feature in feature_combination if feature in numerical_features]),
                (OrdinalEncoder(), [feature for feature in feature_combination if feature in categorical_features]),
                ('passthrough', [feature for feature in feature_combination if feature in binary_features])
            )

            pipe = make_pipeline(preprocessor, model)

            avg_score = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
            scores.append((feature_combination, avg_score))
    scores.sort(key=lambda x: x[1], reverse=True)

    return scores

knn_model = KNeighborsClassifier()
feature_combination_scores = score_feature_combinations(X_train_all, y_train_all, knn_model, preprocessor_all)

for feature_combination, score in feature_combination_scores[:10]:
    print(f"Feature Combination: {feature_combination}, Score: {score}")

Feature Combination: ('age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', 'blood_glucose_level'), Score: 0.8909803921568628
Feature Combination: ('age', 'hypertension', 'smoking_history', 'bmi', 'HbA1c_level', 'blood_glucose_level'), Score: 0.8904313725490196
Feature Combination: ('age', 'hypertension', 'heart_disease', 'smoking_history', 'bmi', 'HbA1c_level', 'blood_glucose_level'), Score: 0.8898823529411766
Feature Combination: ('age', 'hypertension', 'bmi', 'HbA1c_level', 'blood_glucose_level'), Score: 0.8889411764705881
Feature Combination: ('age', 'smoking_history', 'bmi', 'HbA1c_level', 'blood_glucose_level'), Score: 0.8871372549019607
Feature Combination: ('age', 'heart_disease', 'smoking_history', 'bmi', 'HbA1c_level', 'blood_glucose_level'), Score: 0.8864313725490197
Feature Combination: ('age', 'hypertension', 'smoking_history', 'HbA1c_level', 'blood_glucose_level'), Score: 0.8853333333333333
Feature Combination: ('age', 'hypertension', 'heart_disease', 'smoking_his

The feature combination of `'age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', 'blood_glucose_level'` produces a model with a score of 0.89098, slightly outperforming our model which considers all features (0.88988) and the model which only considers only numerical features with hyperparameter optimization (0.88306).

There is other information we can infer from our feature combinations. `age, blood_glucose_level` and `HbA1c_level` seem to be the most important features to consider when predicting if a patient has diabetes as these three features all appear in top ten feature combinations. However, the addition of `hypertension`, `bmi` and/or `smoking_history` are necessary to help improve the accuracy of our models. 

In [14]:
X_train_opt = diabetes_train.drop(columns=['diabetes', 'gender', 'smoking_history'])
y_train_opt = diabetes_train['diabetes']

numerical_features_opt = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']
binary_features_opt = ['hypertension', 'heart_disease'] ## No need to perform any preprocessing

preprocessor_opt = make_column_transformer(
    (StandardScaler(), numerical_features_opt),
    ('passthrough', binary_features_opt)
)


diabetes_pipe_opt = make_pipeline(preprocessor_opt, KNeighborsClassifier())

cv_scores_opt = cross_val_score(diabetes_pipe_opt, X_train_opt, y_train_opt, cv=5, scoring='accuracy')

cv_scores_opt_std = np.std(cv_scores)
print("average cv score:", cv_scores_opt.mean(), "±", cv_scores_opt_std)

average cv score: 0.8909803921568628 ± 0.005531497312158574


To further improve our model accuracy, let's perform hyperparamater optimization on `n_neighbors`.

In [15]:
diabetes_grid = {
    "kneighborsclassifier__n_neighbors"  : range(
        1,60, 2),
}
diabetes_pipe = make_pipeline(preprocessor_opt, KNeighborsClassifier())
diabetes_grid = GridSearchCV(
    estimator = diabetes_pipe_opt,
    param_grid = diabetes_grid,
    cv = 5
)
accuracies_grid = pd.DataFrame(
    diabetes_grid.fit(
        X_train_opt,
        y_train_opt,
    ).cv_results_
)
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "std_test_score"
    ]]
    .assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2))
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
    .drop(columns=["std_test_score"])
)
accuracies_grid

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,0.877961,0.002519
1,3,0.883373,0.002673
2,5,0.89098,0.001766
3,7,0.894353,0.001342
4,9,0.896078,0.001038
5,11,0.895608,0.00069
6,13,0.896235,0.000605
7,15,0.896863,0.001563
8,17,0.896,0.001116
9,19,0.896078,0.001117


In [16]:
accuracy = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x = alt.X("n_neighbors"),
    y = alt.Y("mean_test_score", title='Mean Test Score', scale=alt.Scale(zero=False)),
).properties(
    title='Ideal n_neighbors value based off Mean Test Score',
    width=600
)
accuracy

Cool! we can see that the best `n_neighbors` value for our optimized feature set is `n_neighbors=23`. Let's apply this to our model.

In [17]:
diabetes_pipe_opt = make_pipeline(preprocessor_opt, KNeighborsClassifier(n_neighbors=23))

cv_scores_opt = cross_val_score(diabetes_pipe_opt, X_train_opt, y_train_opt, cv=5, scoring='accuracy')

cv_scores_opt_std = np.std(cv_scores)
print("average cv score:", cv_scores_opt.mean(), "±", cv_scores_opt_std)

average cv score: 0.8974117647058824 ± 0.005531497312158574


Our score seemed to have improved, leaving us with a score of 0.897412! Comparing this to our numerical-feature tuned model of 0.89655, both models perform well, with our tuned optimized-feature model performing just slightly better.

Now, we should use test our tuned models using our test set which we set aside at the beginning.

In [18]:
# Numerical Features only, optimized to knn=31
diabetes_pipe_knn_31.fit(X_train, y_train)

X_test_knn_31 = diabetes_test[["age", "bmi", "HbA1c_level", "blood_glucose_level"]]
y_test_knn_31 = diabetes_test["diabetes"]

test_score_knn_31 = diabetes_pipe_knn_31.score(X_test_knn_31, y_test_knn_31)
print("Test set accuracy for numerical only:", test_score_knn_31)


# # All features
# diabetes_pipe_all.fit(X_train_all, y_train_all)

# X_test_all = diabetes_test.drop(columns=['diabetes', 'gender'])
# y_test_all = diabetes_test["diabetes"]

# test_score_all = diabetes_pipe_all.score(X_test_all, y_test_all)
# print("Test set accuracy for all features:", test_score_all)


# Optimized Feature set
diabetes_pipe_opt.fit(X_train_opt, y_train_opt)

X_test_opt = diabetes_test.drop(columns=['diabetes', 'gender', 'smoking_history'])
y_test_opt = diabetes_test['diabetes']

test_score_opt = diabetes_pipe_opt.score(X_test_opt, y_test_opt)
print("Test set accuracy for optimized feature set:", test_score_opt)

Test set accuracy for numerical only: 0.8967058823529411
Test set accuracy for optimized feature set: 0.8950588235294118


These test scores indicate that each of our models are likely to generalize well to new, unseen data. The numerical only model seems to perform slightly better compared to our optimized feature set, however both models seem to perform well. 

In [29]:
top_row = alt.hconcat(gender_bar, hypertension_bar, heart_disease_bar)
bottom_row = alt.hconcat(diabetes_bar, smoking_history_bar)
final_layout = alt.vconcat(top_row, bottom_row)

final_layout

Now, after visualizing the frequency of each contributing factors. We can compare each factors with presene of diabetes.

In [62]:
gender_diabetes_value = diabetes.groupby(["gender", "diabetes"]).size().reset_index(name="count")

gender_diabetes_chart = alt.Chart(gender_diabetes_value).mark_bar().encode(
    x=alt.X("gender", title="Gender"),  
    y=alt.Y("count", title="Count"), 
    color=alt.Color("diabetes", title="Diabetes")
).properties(
    width=400,
    height=200,
    title="Diabetes & Diabetes"
)

gender_diabetes_chart

In [63]:
hypertension_diabetes_value = diabetes.groupby(["hypertension", "diabetes"]).size().reset_index(name="count")

hypertension_diabetes_chart = alt.Chart(hypertension_diabetes_value).mark_bar().encode(
    x=alt.X("hypertension", title="Hypertension"),  
    y=alt.Y("count", title="Count"), 
    color=alt.Color("diabetes", title="Diabetes")
).properties(
    width=400,
    height=200,
    title="Hypertension & Diabetes"
)

hypertension_diabetes_chart

In [64]:
heart_disease_diabetes_value = diabetes.groupby(["heart_disease", "diabetes"]).size().reset_index(name="count")

heart_disease_diabetes_chart = alt.Chart(heart_disease_diabetes_value).mark_bar().encode(
    x=alt.X("heart_disease", title="Heart Disease"),  
    y=alt.Y("count", title="Count"), 
    color=alt.Color("diabetes", title="Diabetes")
).properties(
    width=400,
    height=200,
    title="Heart Disease & Diabetes"
)

heart_disease_diabetes_chart

In [65]:
smoking_history_diabetes_value = diabetes.groupby(["smoking_history", "diabetes"]).size().reset_index(name="count")

smoking_history_diabetes_chart = alt.Chart(smoking_history_diabetes_value).mark_bar().encode(
    x=alt.X("smoking_history", title="Smoking History"),  
    y=alt.Y("count", title="Count"), 
    color=alt.Color("diabetes", title="Diabetes")
).properties(
    width=400,
    height=200,
    title="Smoking History & Diabetes"
)

smoking_history_diabetes_chart

In [66]:
smoking_history_diabetes_value = diabetes.groupby(["smoking_history", "diabetes"]).size().reset_index(name="count")

smoking_history_diabetes_chart = alt.Chart(smoking_history_diabetes_value).mark_bar().encode(
    x=alt.X("smoking_history", title="Smoking History"),  
    y=alt.Y("count", title="Count"), 
    color=alt.Color("diabetes:N", title="Diabetes")
).properties(
    width=400,
    height=200,
    title="Smoking History & Diabetes"
)

smoking_history_diabetes_chart

In [67]:
top_row = alt.hconcat(gender_diabetes_chart, hypertension_diabetes_chart)
bottom_row = alt.hconcat(heart_disease_diabetes_chart, smoking_history_diabetes_chart)
final_layout = alt.vconcat(top_row, bottom_row)

final_layout

The charts indicate that, despite a higher representation of females in the dataset, a comparable number of males and females have diabetes. Additionally, there appears to be no discernible relationship between diabetes and hypertension or smoking history. Interestingly, there are fewer individuals with heart disease among those with diabetes compared to those without diabetes.

Now, we are going to plot a histogram to see the distribution across age, bmi and HbA1c level.

In [68]:
age_distribution_value = diabetes["age"].value_counts()

age_distribution_value

80.00    5621
51.00    1619
47.00    1574
48.00    1568
53.00    1542
         ... 
0.48       83
1.00       83
0.40       66
0.16       59
0.08       36
Name: age, Length: 102, dtype: int64

In [69]:
age_distribution_chart = alt.Chart(pd.DataFrame({"age": age_distribution_value.index, "count": age_distribution_value})).mark_bar().encode(
    x=alt.X("age", title="age"),  
    y=alt.Y("count", title="Count"), 
).properties(
    width=400,
    height=200,
    title="Age Distribution"
)

age_distribution_chart

In [70]:
bmi_distribution_value = diabetes["bmi"].value_counts()

bmi_distribution_value

27.32    25495
23.00      103
27.12      101
27.80      100
24.96      100
         ...  
58.23        1
48.18        1
55.57        1
57.07        1
60.52        1
Name: bmi, Length: 4247, dtype: int64

In [71]:
bmi_distribution_chart = alt.Chart(pd.DataFrame({"bmi": bmi_distribution_value.index, "count": bmi_distribution_value})).mark_bar().encode(
    x=alt.X("bmi", title="BMI"),
    y=alt.Y("count", title="Count"), 
).properties(
    width=400,
    height=200,
    title="bmi Distribution"
)

bmi_distribution_chart

In [72]:
HbA1c_level_value = diabetes["HbA1c_level"].value_counts()

HbA1c_level_value

6.6    8540
5.7    8413
6.5    8362
5.8    8321
6.0    8295
6.2    8269
6.1    8048
3.5    7662
4.8    7597
4.5    7585
4.0    7542
5.0    7471
8.8     661
8.2     661
9.0     654
7.5     643
6.8     642
7.0     634
Name: HbA1c_level, dtype: int64

In [73]:
HbA1c_level_chart = alt.Chart(pd.DataFrame({"HbA1c_level": HbA1c_level_value.index, "count": HbA1c_level_value})).mark_bar().encode(
    x=alt.X("HbA1c_level", title="HbA1c_level"),
    y=alt.Y("count", title="Count"), 
).properties(
    width=00,
    height=300,
    title="HbA1c level Distribution"
)

HbA1c_level_chart

In [74]:
blood_glucose_level_value = diabetes["blood_glucose_level"].value_counts()

blood_glucose_level_value

130    7794
159    7759
140    7732
160    7712
126    7702
145    7679
200    7600
155    7575
90     7112
80     7106
158    7026
100    7025
85     6901
280     729
300     674
240     636
260     635
220     603
Name: blood_glucose_level, dtype: int64

In [75]:
blood_glucose_level_chart = alt.Chart(pd.DataFrame({"blood_glucose_level": blood_glucose_level_value.index, "count": blood_glucose_level_value})).mark_bar().encode(
    x=alt.X("blood_glucose_level", title="blood_glucose_level"),
    y=alt.Y("count", title="Count"), 
).properties(
    width=400,
    height=200,
    title="Blood Glucose Level Distribution"
)

blood_glucose_level_chart

In [76]:
top_row = alt.hconcat(age_distribution_chart, bmi_distribution_chart)
bottom_row = alt.hconcat(HbA1c_level_chart, blood_glucose_level_chart)
final_layout = alt.vconcat(top_row, bottom_row)

final_layout

In [77]:
data= diabetes.head(5000)


scatter_matrix = alt.Chart(data).mark_circle().encode(
    alt.X(alt.repeat("column"), type="quantitative"),
    alt.Y(alt.repeat("row"), type="quantitative"),
    color='diabetes:N'
).properties(
    width=150,
    height=150
).repeat(
    row=["age", "hypertension", "heart_disease", "smoking_history", "HbA1c_level", "bmi", "blood_pressure"],
    column=["age", "hypertension", "heart_disease", "smoking_history", "HbA1c_level", "bmi", "blood_pressure"]
).interactive()

scatter_matrix

## Discussion

### Summary of Findings:

In our investigation to predict the onset of diabetes using a classification approach, we explored multiple feature combinations and their corresponding scores. Notably, the feature combination (`age`, `hypertension`, `heart_disease`, `bmi`, `HbA1c_level`, `blood_glucose_level`) emerged as the top performer with a test score of 0.89506 after hyperparameter tuning. This combination includes crucial variables such as age, hypertension, heart disease, BMI, HbA1c level, and blood glucose level.

Based on the scatter matrix and visualizations of the frequency of each variable, `age`, `bmi`, and `HbA1c_level` appear to be the features that are most important for indicating the presence of diabetes. 

### Expectations vs. Findings:

Our findings align with expectations to a considerable extent. The inclusion of age, hypertension, presence of heart disease, and indicators of metabolic health (BMI, HbA1c, blood glucose) is consistent with established medical knowledge linking these factors to diabetes risk. However, the presence of age, bmi, and HbA1c level are most crucial since these variables possess a stronger correlation to patients with diabetes.

### Impact of Findings:

Finding the best combination of features is important for predicting diabetes earlier on. Achieving a high score with the selected features highlights their predictive relevance, potentially enabling healthcare professionals to proactively identify individuals at risk. This can facilitate early intervention and lifestyle modifications, contributing to improved patient outcomes and reduced healthcare costs.

### Future Questions and Considerations:

Our findings creates opportunities for more in-depth exploration and improvement of models predicting the onset of diabetes. Future studies could look closely at how specific demographic and medical factors interact, aiming to find subtle details that make predictions more accurate. It is also valuable to check if the effectiveness of certain features changes over time or with different health habits.

Moreover, checking our results on different datasets can make our findings more widely applicable. For instance, bringing in genetic data and exploring new technologies such as wearables and continuous glucose monitoring could add more details to improve our predictive models.

As for our models, it is hard to conclusively say that a model performs well based off training and test scores. Future projects should consider using other metrics of evaluation, such as Precision/recall or F1 scores, or even testing using other classification methods such as RandomForesClassifier(). 

In summary, our analysis offers valuable insights into which features predict diabetes early, suggesting the potential for timely intervention. The unexpected absence of certain variables leaves room for more questions, emphasizing the changing nature of predictive modeling in healthcare. As we navigate these findings, continuous research efforts can refine models and contribute to the ever-evolving field of predictive analytics in healthcare.

## References
Herman, W. H., Robinson, N., & Aubert, R. E. (2005). Predicting the Development of Diabetes in Older Adults. Diabetes Care, 28(2), 404-408. https://doi.org/10.2337/diacare.28.2.404

Zhang, P., Zhang, X., Brown, J., Vistisen, D., Sicree, R., Shaw, J., & Nichols, G. (2016). Big Data Approaches for Predicting the Onset of Type 2 Diabetes Mellitus in a Chinese Hospital Database. Diabetes Care, 39(7), 1095-1101. https://doi.org/10.2337/dc15-2042

https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/code