### Task 7HD : Data Mining Challenge
### Truong Khang Thinh Nguyen - 223446545
### Email: s223446545@gmail.com
### SIT220 - Undergraduate



#### This report examines the NHANES datasets spanning 2017 to March 2020, with a focus on diabetes. It aims to extract insights about the behaviors and characteristics of individuals with and without diabetes across the Laboratory, Demographic, Examination, Dietary, and Questionnaire datasets.
#### Furthermore, this report concentrates on the standard indicators associated with diabetes, including BMI, weight, height, glycohemoglobin, insulin level, fasting plasma glucose, exercise intensity, sleep duration, and dietary habits, as well as race and gender.

#### Finally, it constructs a simple regression model that predicts whether an individual has diabetes using data extracted from the NHANES datasets.


#### To start, I import the libraries needed for the analysis: NumPy and Pandas for data manipulation, Bokeh for interactive plotting, and Scikit-learn, SciPy, and Statsmodels for statistical testing and regression modeling. I then load the relevant datasets.

In [745]:
# import necessary packages
# Data Manipulation
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np

# Plotting
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool, Legend, CustomJS, CheckboxGroup
from bokeh.palettes import Category10
from bokeh.transform import dodge
from bokeh.layouts import row, column

# Statistical Test and build model
import scipy.stats as ss
from scipy.stats import kruskal , gaussian_kde
from statsmodels.miscmodels.ordinal_model import OrderedModel
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

In [746]:
# Load datasets
# The files are in the SAS transport (.XPT) format, so pd.read_sas is used to load them
demograph = pd.read_sas("Demographics.XPT")
phy_act = pd.read_sas("PhysicalActivity.XPT")
sleep_dis = pd.read_sas("Sleep_Dis.XPT")
nutrient_1st = pd.read_sas("Nutrient_In_1st.XPT")
nutrient_2nd = pd.read_sas("Nutrient_In_2nd.XPT")
BMI = pd.read_sas("BMI.XPT")
GlycoHemo = pd.read_sas("GlycoHemo.XPT")
PlasmaFastingGlu = pd.read_sas("PlasmaFastingGlu.XPT")
insulin = pd.read_sas("Insulin.XPT")
Diabetes = pd.read_sas("Diabetes.XPT")

# Create a list of dataframes
dataframes = [Diabetes, demograph, phy_act, sleep_dis,
             nutrient_1st, nutrient_2nd, BMI, GlycoHemo, PlasmaFastingGlu,insulin]


#### After importing the datasets, I merge every dataframe in the list created above into a single dataframe, using the respondent sequence number (code: SEQN) as the shared key.

In [747]:
# Merge DataFrames

# Initial merge with the first DataFrame which is the Diabetes dataframe
df = dataframes[0]

# Merge remaining DataFrames
for df_merge in dataframes[1:]:
    df = pd.merge(df, df_merge, on="SEQN", how="left")
print("Initially Created Dataframe:")
display(df.head())
print("Shape:",df.shape)

Initially Created Dataframe:


Unnamed: 0,SEQN,DIQ010,DID040,DIQ160,DIQ180,DIQ050,DID060,DIQ060U,DIQ070,DIQ230,...,BMXHIP,BMIHIP,LBXGH,WTSAFPRP_x,LBXGLU,LBDGLUSI,WTSAFPRP_y,LBXIN,LBDINSI,LBDINLC
0,109263.0,2.0,,,,,,,,,...,,,,,,,,,,
1,109264.0,2.0,,2.0,2.0,,,,,,...,85.0,,5.3,27533.174559,97.0,5.38,27533.174559,6.05,36.3,5.397605e-79
2,109265.0,2.0,,,,,,,,,...,,,,,,,,,,
3,109266.0,2.0,,1.0,1.0,,,,2.0,,...,126.1,,5.2,,,,,,,
4,109267.0,2.0,,2.0,2.0,,,,,,...,,,,,,,,,,


Shape: (14986, 362)
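
#### The `WTSAFPRP_x` and `WTSAFPRP_y` columns above appear because two of the merged tables share a column name other than the key. The sketch below is one possible way to avoid such duplicates, assuming the shared column holds the same values in both tables; the three small frames and their values are made up for illustration and are not NHANES data.

```python
import pandas as pd

# Toy stand-ins for three NHANES tables (values are invented for illustration)
diabetes = pd.DataFrame({"SEQN": [1, 2, 3], "DIQ010": [2, 1, 2]})
glucose = pd.DataFrame({"SEQN": [1, 3], "LBXGLU": [97.0, 103.0], "WTSAFPRP": [10.0, 12.0]})
insulin = pd.DataFrame({"SEQN": [1, 3], "LBXIN": [6.05, 16.96], "WTSAFPRP": [10.0, 12.0]})

merged = diabetes
for other in (glucose, insulin):
    # Keep the key plus only the columns that are not already in the merged frame
    new_cols = ["SEQN"] + [c for c in other.columns if c not in merged.columns]
    # validate="one_to_one" raises an error if SEQN is duplicated on either side
    merged = merged.merge(other[new_cols], on="SEQN", how="left", validate="one_to_one")

print(merged.columns.tolist())
```

#### With `how="left"`, respondents that are missing from a table keep their row and simply receive NaN in that table's columns.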


#### Since the NHANES datasets identify their attributes by codes, I keep only the relevant columns and rename them with labels that reveal what each attribute means.


In [748]:
# Filter out irrelevant columns and keep only the appropriate ones to analyse the data effectively

filtered_df = df[["DIQ010", # Diabetes column
                  "RIAGENDR","RIDAGEYR","RIDRETH3", # Demographic columns
                  "LBXGH", # Glycohemoglobin column
                  "LBXGLU", # Plasma Fasting Glucose column
                  "DR1TKCAL","DR1TPROT", "DR1TCARB", # Total Nutrient Intakes First Day columns
                  "DR1TSUGR","DR1TFIBE", 
                  
                  "DR2TKCAL","DR2TPROT","DR2TCARB", # Total Nutrient Intakes Second Day columns
                  "DR2TSUGR","DR2TFIBE",
                  
                  "PAQ610","PAQ625", # Physical Activity columns
                  "SLD012","SLD013", # Sleep Disorders columns
                  "BMXBMI", "BMXWT","BMXHT","BMXWAIST", # Body Measures columns
                  "LBXIN"]].copy() # Insulin column; copy so the rename below acts on an independent frame

# List of the renamed columns
rename_columns  = ["Told to have Diabetes", # Diabetes
                   "Gender","Age","Race", # Demographic
                   "Glycohemoglobin(%)", # Glycohemoglobin
                   "Fasting Glucose(mg/dL)", # Plasma Fasting Glucose
                  "Energy_1(kcal)","Protein_1(gm)","Carbonhydrate_1(gm)", # Total Nutrient Intakes First Day
                   "Total Sugars_1(gm)","Dietary Fiber_1(gm)", 
                   
                  "Energy_2(kcal)","Protein_2(gm)","Carbonhydrate_2(gm)", # Total Nutrient Intakes Second Day
                  "Total Sugars_2(gm)","Dietary Fiber_2(gm)",
                   
                  "Day_Vigorous","Day_Moderate", # Physical Activity
                   "Hr_sleep_weekday","Hr_sleep_weekend", # Sleep Disorders
                  "BMI","Weight(kg)","Height(cm)","Waist Circum(cm)", # Body Measures
                   "Insulin(uU/mL)"] # Insulin

# Rename the columns to readable text
filtered_df.columns = rename_columns
print("Decoded-columns Dataframe:")
display(filtered_df.head())
print("Shape:",filtered_df.shape)

Decoded-columns Dataframe:


Unnamed: 0,Told to have Diabetes,Gender,Age,Race,Glycohemoglobin(%),Fasting Glucose(mg/dL),Energy_1(kcal),Protein_1(gm),Carbonhydrate_1(gm),Total Sugars_1(gm),...,Dietary Fiber_2(gm),Day_Vigorous,Day_Moderate,Hr_sleep_weekday,Hr_sleep_weekend,BMI,Weight(kg),Height(cm),Waist Circum(cm),Insulin(uU/mL)
0,2.0,1.0,2.0,6.0,,,1402.0,52.79,187.65,73.42,...,4.3,,,,,,,,,
1,2.0,2.0,13.0,1.0,5.3,97.0,1046.0,55.55,121.68,27.86,...,17.2,,,,,17.6,42.2,154.7,63.8,6.05
2,2.0,1.0,2.0,3.0,,,1926.0,57.47,246.53,157.08,...,9.5,,,,,15.0,12.0,89.3,41.2,
3,2.0,2.0,29.0,6.0,5.2,,1698.0,52.58,217.69,94.2,...,18.7,,,7.5,8.0,37.8,97.1,160.2,117.9,
4,2.0,2.0,21.0,2.0,,,,,,,...,,,,8.0,8.0,,,,,


Shape: (14986, 25)
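
#### The renaming above relies on the two lists being in exactly the same order. As a hedged alternative, a dictionary keeps each code next to its label, so a reordered or missing entry cannot silently mislabel a column. The `column_labels` dictionary and the toy `raw` frame below are illustrative assumptions covering only a few of the columns.

```python
import pandas as pd

# Hypothetical subset of the code-to-label mapping
column_labels = {
    "DIQ010": "Told to have Diabetes",
    "RIAGENDR": "Gender",
    "RIDAGEYR": "Age",
    "BMXBMI": "BMI",
}

# Toy frame whose columns arrive in a different order, plus one unused column
raw = pd.DataFrame({
    "SEQN": [109266.0, 109271.0],
    "BMXBMI": [37.8, 29.7],
    "DIQ010": [2.0, 2.0],
    "RIDAGEYR": [29.0, 49.0],
    "RIAGENDR": [2.0, 1.0],
})

# Select by code, then rename by code: the order of `raw` no longer matters
decoded = raw[list(column_labels)].rename(columns=column_labels)
print(decoded.columns.tolist())
```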


#### First, we need to check the eligibility conditions of each column in the dataframe, since they differ. Most columns contain information for individuals aged 0-150 years, but there are exceptions: the BMI column covers ages 2-150, the glycohemoglobin column covers ages 12-150, and the sleep hours columns cover only ages 16-150.
#### As a result, we exclude rows where the age is under 16. This is acceptable because diabetes predominantly occurs in teenagers and older individuals, so removing these rows should not significantly affect our data frame.
#### Furthermore, the "Told to have Diabetes" column takes the values 1 (Yes), 2 (No), 3 (Borderline), 7 (Refused), and 9 (Don't know). We remove rows with the codes 7 and 9 because they carry no usable diabetes status and could distort the insights drawn from the dataframe.

In [749]:
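# Keep only respondents aged 16 and over
filter_df = filtered_df.query("Age >= 16")

# Filter out the codes 7 (Refused) and 9 (Don't know); copy so later in-place edits act on an independent frame
filter_df = filter_df.query("`Told to have Diabetes` != 7 and `Told to have Diabetes` != 9").copy()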
print("Filtered out Age Dataframe:")
display(filter_df.head())
print("Shape:",filter_df.shape)

Filtered out Age Dataframe:


Unnamed: 0,Told to have Diabetes,Gender,Age,Race,Glycohemoglobin(%),Fasting Glucose(mg/dL),Energy_1(kcal),Protein_1(gm),Carbonhydrate_1(gm),Total Sugars_1(gm),...,Dietary Fiber_2(gm),Day_Vigorous,Day_Moderate,Hr_sleep_weekday,Hr_sleep_weekend,BMI,Weight(kg),Height(cm),Waist Circum(cm),Insulin(uU/mL)
3,2.0,2.0,29.0,6.0,5.2,,1698.0,52.58,217.69,94.2,...,18.7,,,7.5,8.0,37.8,97.1,160.2,117.9,
4,2.0,2.0,21.0,2.0,,,,,,,...,,,,8.0,8.0,,,,,
5,2.0,2.0,18.0,3.0,,,,,,,...,,5.0,5.0,8.5,8.0,,,,,
8,2.0,1.0,49.0,3.0,5.6,103.0,2310.0,110.62,207.54,85.0,...,16.2,,2.0,10.0,13.0,29.7,98.8,182.3,120.4,16.96
9,2.0,1.0,36.0,3.0,5.1,,1403.0,56.73,265.59,162.78,...,17.5,3.0,,6.5,8.0,21.9,74.3,184.2,86.8,


Shape: (10190, 25)
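
#### The same two filters can also be written as a single boolean mask, which some readers find easier to check than a query string. This is a minimal sketch on an invented six-row frame, not the NHANES data.

```python
import pandas as pd

# Toy respondents: ages and raw DIQ010-style codes (invented values)
people = pd.DataFrame({
    "Age": [2, 13, 16, 29, 49, 67],
    "Told to have Diabetes": [2, 2, 9, 2, 1, 7],
})

# Keep ages 16 and over, and drop the codes 7 (Refused) and 9 (Don't know)
keep = (people["Age"] >= 16) & ~people["Told to have Diabetes"].isin([7, 9])
adults = people[keep].copy()

print(f"kept {len(adults)} of {len(people)} rows")
```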


#### The shape shows that the two filters reduced the dataframe from 14986 to 10190 rows.

#### Next, I'll average each nutrient intake measure across the first and second days.

In [750]:
# Calculate the mean values for day 1 and day 2 nutrient intakes for different measures
# List of Average Measure 
avg_measure = ["Energy(kcal)", "Protein(gm)","Carbonhydrate(gm)",
              "Total Sugars(gm)","Dietary Fiber(gm)"]

# List of the first day measure
first_day = ["Energy_1(kcal)","Protein_1(gm)","Carbonhydrate_1(gm)",
            "Total Sugars_1(gm)","Dietary Fiber_1(gm)"]
# List of the second day measure
second_day = ["Energy_2(kcal)","Protein_2(gm)","Carbonhydrate_2(gm)",
             "Total Sugars_2(gm)","Dietary Fiber_2(gm)"]

# Create a new column that holds the average value of both days
# Iterate over the indices of avg_measure
for i in range(len(avg_measure)):
    # Calculate the average of the corresponding columns for each row
    filter_df[avg_measure[i]] = filter_df[ [first_day[i], second_day[i]] ].mean(axis="columns")
    # Drop the redundant columns
    filter_df.drop(columns=[first_day[i], second_day[i]], inplace=True)

display(filter_df.head())
print("Shape:",filter_df.shape)

Unnamed: 0,Told to have Diabetes,Gender,Age,Race,Glycohemoglobin(%),Fasting Glucose(mg/dL),Day_Vigorous,Day_Moderate,Hr_sleep_weekday,Hr_sleep_weekend,BMI,Weight(kg),Height(cm),Waist Circum(cm),Insulin(uU/mL),Energy(kcal),Protein(gm),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
3,2.0,2.0,29.0,6.0,5.2,,,,7.5,8.0,37.8,97.1,160.2,117.9,,1797.0,57.75,246.655,79.955,19.7
4,2.0,2.0,21.0,2.0,,,,,8.0,8.0,,,,,,,,,,
5,2.0,2.0,18.0,3.0,,,5.0,5.0,8.5,8.0,,,,,,,,,,
8,2.0,1.0,49.0,3.0,5.6,103.0,,2.0,10.0,13.0,29.7,98.8,182.3,120.4,16.96,3149.5,125.55,303.21,129.51,12.85
9,2.0,1.0,36.0,3.0,5.1,,3.0,,6.5,8.0,21.9,74.3,184.2,86.8,,1766.0,68.84,243.94,134.545,11.85


Shape: (10190, 20)
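
#### One detail of the averaging step is worth making explicit: `mean(axis="columns")` skips missing values by default, so a respondent with only one recorded day keeps that single value, and the average is NaN only when both days are missing. The small frame below uses invented values to illustrate this.

```python
import numpy as np
import pandas as pd

# Toy intakes: both days present, both present, both missing, only day 1 present
intake = pd.DataFrame({
    "Energy_1(kcal)": [1698.0, 2310.0, np.nan, 1500.0],
    "Energy_2(kcal)": [1896.0, 3989.0, np.nan, np.nan],
})

# Row-wise mean; NaN values are ignored unless every value in the row is NaN
intake["Energy(kcal)"] = intake[["Energy_1(kcal)", "Energy_2(kcal)"]].mean(axis="columns")
print(intake["Energy(kcal)"].tolist())
```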


#### For the next step, we'll convert certain columns from numerical representations to categorical ones. This includes columns like Race, Gender, and Told to have Diabetes, where the values correspond to specific categories as outlined in the NHANES documentation.

In [751]:
# Decode the Told to have Diabetes column (assign back instead of a chained in-place replace)
filter_df["Told to have Diabetes"] = filter_df["Told to have Diabetes"].replace({1:"Yes",2:"No",3:"Borderline"})

# Decode the gender
filter_df["Gender"] = filter_df["Gender"].replace({1:"Male",2:"Female"})

# Decode Race
filter_df["Race"] = filter_df["Race"].replace({1:"Mexican American",2:"Other Hispanic",
                         3:"Non-Hispanic White",4:"Non-Hispanic Black",
                         6:"Non-Hispanic Asian",7:"Other Race"})
display(filter_df)

Unnamed: 0,Told to have Diabetes,Gender,Age,Race,Glycohemoglobin(%),Fasting Glucose(mg/dL),Day_Vigorous,Day_Moderate,Hr_sleep_weekday,Hr_sleep_weekend,BMI,Weight(kg),Height(cm),Waist Circum(cm),Insulin(uU/mL),Energy(kcal),Protein(gm),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
3,No,Female,29.0,Non-Hispanic Asian,5.2,,,,7.5,8.0,37.8,97.1,160.2,117.9,,1797.0,57.750,246.655,79.955,19.70
4,No,Female,21.0,Other Hispanic,,,,,8.0,8.0,,,,,,,,,,
5,No,Female,18.0,Non-Hispanic White,,,5.0,5.0,8.5,8.0,,,,,,,,,,
8,No,Male,49.0,Non-Hispanic White,5.6,103.0,,2.0,10.0,13.0,29.7,98.8,182.3,120.4,16.96,3149.5,125.550,303.210,129.510,12.85
9,No,Male,36.0,Non-Hispanic White,5.1,,3.0,,6.5,8.0,21.9,74.3,184.2,86.8,,1766.0,68.840,243.940,134.545,11.85
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14978,No,Male,52.0,Non-Hispanic Black,5.8,102.0,5.0,,6.0,6.0,29.5,94.3,178.8,99.3,7.10,1618.5,75.910,134.795,39.905,7.05
14980,Yes,Female,67.0,Mexican American,6.6,,3.0,,8.0,8.0,37.9,82.8,147.8,110.0,,1131.0,51.330,90.330,51.030,5.80
14981,No,Male,40.0,Non-Hispanic Black,5.9,,,,6.0,7.0,38.2,108.8,168.7,114.7,,3397.0,86.565,432.870,249.520,18.65
14984,Borderline,Male,63.0,Non-Hispanic Black,5.9,125.0,1.0,,8.0,9.0,25.5,79.5,176.4,97.1,7.75,1698.0,138.100,110.590,50.570,6.70
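
#### The decoded diabetes status has a natural order (No, Borderline, Yes). As a sketch of how that order could be stored explicitly, the `CategoricalDtype` imported at the top can mark the column as an ordered categorical. The `codes` series here is invented, and choosing this ordering is an assumption made for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy raw codes as they appear in the source column (1 = Yes, 2 = No, 3 = Borderline)
codes = pd.Series([2.0, 1.0, 3.0, 2.0], name="Told to have Diabetes")

# Ordered categories: No < Borderline < Yes
status_dtype = CategoricalDtype(categories=["No", "Borderline", "Yes"], ordered=True)
status = codes.map({1.0: "Yes", 2.0: "No", 3.0: "Borderline"}).astype(status_dtype)

# Integer codes follow the declared order: No -> 0, Borderline -> 1, Yes -> 2
print(status.cat.codes.tolist())
```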


#### After merging, formatting, cleaning, and filtering out irrelevant columns and rows, we conduct Exploratory Data Analysis (EDA) to gain a general understanding of the cleaned dataframe.
#### Let's start by using the describe() function and checking the null values with the info() function.

In [752]:
print("Information about the cleaned dataframe:")
display(filter_df.describe())

print("Checking null values:")
display(filter_df.info())

Information about the cleaned dataframe:


Unnamed: 0,Age,Glycohemoglobin(%),Fasting Glucose(mg/dL),Day_Vigorous,Day_Moderate,Hr_sleep_weekday,Hr_sleep_weekend,BMI,Weight(kg),Height(cm),Waist Circum(cm),Insulin(uU/mL),Energy(kcal),Protein(gm),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
count,10190.0,8900.0,4365.0,2417.0,4186.0,10101.0,10095.0,9260.0,9281.0,9275.0,8913.0,4264.0,8550.0,8550.0,8550.0,8550.0,8550.0
mean,47.960255,5.807337,112.332417,4.258585,4.4656,7.640382,8.36107,29.668326,82.804601,166.812119,99.598642,14.991637,2031.072749,77.329136,236.098028,100.638917,15.95038
std,19.502153,1.093816,37.442766,3.783973,4.941849,1.682259,1.823536,7.635875,23.272752,10.040574,17.63942,23.555759,862.85172,35.839438,108.468809,63.958059,9.242116
min,16.0,2.8,47.0,1.0,1.0,2.0,2.0,13.2,32.6,131.1,56.4,0.71,14.0,1.45,1.0,0.03,5.397605e-79
25%,31.0,5.3,95.0,3.0,3.0,7.0,7.0,24.3,66.5,159.3,87.0,6.21,1434.5,53.35125,162.7675,56.9525,9.55
50%,49.0,5.5,102.0,5.0,5.0,7.5,8.0,28.4,79.1,166.4,98.2,10.04,1898.5,71.69,220.5,87.9875,14.05
75%,64.0,5.9,114.0,5.0,5.0,8.5,9.5,33.5,95.0,174.0,110.5,16.55,2484.375,94.73375,291.3425,128.2775,20.15
max,80.0,16.2,524.0,99.0,99.0,14.0,14.0,92.3,254.3,199.6,187.5,512.5,8967.5,370.83,1183.345,892.645,103.4


Checking null values:
<class 'pandas.core.frame.DataFrame'>
Index: 10190 entries, 3 to 14985
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Told to have Diabetes   10190 non-null  object 
 1   Gender                  10190 non-null  object 
 2   Age                     10190 non-null  float64
 3   Race                    10190 non-null  object 
 4   Glycohemoglobin(%)      8900 non-null   float64
 5   Fasting Glucose(mg/dL)  4365 non-null   float64
 6   Day_Vigorous            2417 non-null   float64
 7   Day_Moderate            4186 non-null   float64
 8   Hr_sleep_weekday        10101 non-null  float64
 9   Hr_sleep_weekend        10095 non-null  float64
 10  BMI                     9260 non-null   float64
 11  Weight(kg)              9281 non-null   float64
 12  Height(cm)              9275 non-null   float64
 13  Waist Circum(cm)        8913 non-null   float64
 14  Insulin(uU/mL)        

None

#### Next, we examine the age distribution of individuals with and without diabetes.

In [753]:
# Ensure the output is directed to the Jupyter notebook
output_notebook()

# Split the data into two groups: diabetes (Yes or Borderline) and no diabetes
data_dict = {
    'Diabetes': filter_df[filter_df['Told to have Diabetes'].isin(['Yes', 'Borderline'])],
    'No Diabetes': filter_df[filter_df['Told to have Diabetes'] == 'No']
}

# Colors for each category
colors = {
    'Diabetes': 'red',
    'No Diabetes': 'orange'
}

# Create figure
age = figure(title="Age Distribution by Diabetes Status", x_axis_label='Age', y_axis_label='Frequency', width=800, height=400)

# Create histograms for each category
for label, data in data_dict.items():
    hist, edges = np.histogram(data['Age'], bins=20)
    source = ColumnDataSource(data=dict(top=hist, left=edges[:-1], right=edges[1:], label=[label] * len(hist), count=hist))
    
    age.quad(top='top', bottom=0, left='left', right='right', color=colors[label], alpha=0.7,
            source=source, legend_label=label)

# Add hover tool
hover = HoverTool(tooltips=[("Category", "@label"), ("Count", "@count"), ("Age Range", "@left{0.0} - @right{0.0}")])
age.add_tools(hover)

# Add legend
age.legend.location = 'top_right'
age.legend.click_policy = 'hide'

# Show the plot
show(age)



#### First, the largest group of individuals without diabetes is teenagers, which suggests that younger people generally have a lower incidence of diabetes. As people age, the likelihood of developing diabetes begins to rise, a trend that becomes apparent from the 30s onward.

#### The data shows that most people diagnosed with diabetes are in their 60s and 70s, which highlights that middle-aged and older adults are more susceptible to diabetes than younger individuals. A particularly interesting observation is the sharp increase in the diabetes count for individuals close to 80 years old. Note that NHANES top-codes age at 80, so the final bin pools everyone aged 80 and over; part of this spike likely reflects that pooling rather than a much higher likelihood of diabetes at that exact age.
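
#### Because the histogram shows raw counts, a group can look large simply because many respondents fall in that age range. One way to check the age trend is to compute the share of respondents with diabetes inside each age band. The sketch below does this on ten invented respondents; the band edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Ten invented respondents
sample = pd.DataFrame({
    "Age": [17, 19, 25, 38, 45, 52, 61, 66, 72, 80],
    "Told to have Diabetes": ["No", "No", "No", "No", "Borderline",
                              "No", "Yes", "No", "Yes", "Yes"],
})

# Same grouping as the plot: Yes and Borderline both count as diabetes
sample["Has diabetes"] = sample["Told to have Diabetes"].isin(["Yes", "Borderline"])

# Left-closed age bands: [16, 30), [30, 45), [45, 60), [60, 81)
age_band = pd.cut(sample["Age"], bins=[16, 30, 45, 60, 81], right=False)

# The mean of a boolean column is the share of True values in each band
share = sample.groupby(age_band, observed=True)["Has diabetes"].mean()
print(share)
```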

#### Let's compare, by gender, the number of individuals who reported diabetes, borderline diabetes (prediabetes), or no diabetes.

In [754]:
# Ensure the output is directed to the Jupyter notebook
output_notebook()

# Group the data by 'Gender' and 'Told to have Diabetes', then count occurrences
grouped_data = filter_df.groupby(['Gender', 'Told to have Diabetes']).size().unstack(fill_value=0)

# Convert DataFrame to ColumnDataSource
source = ColumnDataSource(grouped_data)

# Define the categories and colors
categories = list(grouped_data.columns)
colors = Category10[len(categories)]

# Create the Bokeh plot
p = figure(x_range=grouped_data.index.tolist(), height=350,
           title="Number of people told to have Diabetes by Gender",
           toolbar_location=None, tools="")

# Plot bars for each category
renderers = []
for i, category in enumerate(categories):
    renderer = p.vbar(x=dodge('Gender', i*0.2, range=p.x_range), top=category, width=0.2, source=source,
                      color=colors[i], legend_label=category)
    renderers.append(renderer)

# Add hover tool with tooltips for each 'Told to have Diabetes' category
hover = HoverTool()
hover.tooltips = [("Gender", "@Gender"), ("No", "@No"), ("Borderline", "@Borderline"), ("Yes", "@Yes")]
p.add_tools(hover)

# Set plot attributes
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Gender"
p.yaxis.axis_label = "Number of individuals"
p.legend.title = "Told to have Diabetes"
p.legend.location = "top_right"
p.legend.click_policy = "hide"

# Show the plot
show(p)

#### The visualization highlights a clear trend: far more individuals in the dataset report not having diabetes than report having it, which suggests that diabetes affects a minority of the surveyed population. Among those who reported diabetes, there appears to be a gender difference, with more males than females reporting a diagnosis.

#### Furthermore, individuals with prediabetes (borderline) form a much smaller group than either those with diabetes or those without it.
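
#### Counts alone can mislead when the two genders are not equally represented, so it may help to look at row-wise shares as well. A minimal sketch with `pd.crosstab` on invented answers follows; `normalize="index"` makes each gender's row sum to 1.

```python
import pandas as pd

# Invented answers for five males and five females
answers = pd.DataFrame({
    "Gender": ["Male"] * 5 + ["Female"] * 5,
    "Told to have Diabetes": ["No", "No", "No", "Yes", "Borderline",
                              "No", "No", "No", "No", "Yes"],
})

# Share of each answer within each gender
shares = pd.crosstab(answers["Gender"], answers["Told to have Diabetes"], normalize="index")
print(shares)
```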

#### The previous plot showed counts based solely on respondents' answers about whether they had been told they have diabetes.
#### Let's now examine two key indicators associated with diabetes: BMI (Body Mass Index) and glycohemoglobin (%). BMI is not a direct measure of blood sugar, but it is closely linked to diabetes risk because it correlates with obesity, a significant risk factor for type 2 diabetes. Glycohemoglobin (%), also known as HbA1c, reflects average blood sugar levels over the previous two to three months.


In [755]:
from bokeh.palettes import Spectral6  # Importing a different color palette

# Ensure the output is directed to the Jupyter notebook
output_notebook()

# Preparing data for Glycohemoglobin(%)
ghb_data = filter_df.pivot_table(index='Gender', columns='Told to have Diabetes', values='Glycohemoglobin(%)', aggfunc='mean').fillna(0).reset_index()
ghb_data_source = ColumnDataSource(ghb_data)

# Preparing data for BMI
bmi_data = filter_df.pivot_table(index='Gender', columns='Told to have Diabetes', values='BMI', aggfunc='mean').fillna(0).reset_index()
bmi_data_source = ColumnDataSource(bmi_data)

# Define categories and colors
categories = ['No', 'Borderline', 'Yes']
colors = Spectral6[:len(categories)]

# Create figures
p1 = figure(y_range=['Male', 'Female'], title="Glycohemoglobin(%) by Gender", 
            x_axis_label="Glycohemoglobin(%)", y_axis_label="Gender", width=450, height=450)
p2 = figure(y_range=['Male', 'Female'], title="BMI by Gender", 
            x_axis_label="BMI", y_axis_label="Gender", width=450, height=450)

# Plot Glycohemoglobin(%)
renderers_ghb = []
for i, category in enumerate(categories):
    renderer = p1.hbar(y=dodge('Gender', i*0.2, range=p1.y_range), right=category, height=0.2, source=ghb_data_source, 
                       color=colors[i], legend_label=category)
    renderers_ghb.append(renderer)

# Plot BMI
renderers_bmi = []
for i, category in enumerate(categories):
    renderer = p2.hbar(y=dodge('Gender', i*0.2, range=p2.y_range), right=category, height=0.2, source=bmi_data_source, 
                       color=colors[i], legend_label=category)
    renderers_bmi.append(renderer)

# Add hover tool for Glycohemoglobin
hover_ghb = HoverTool(tooltips=[("Gender", "@Gender")] +
                                [(f"Glycohemoglobin ({category})", f"@{category}%") for category in categories])
p1.add_tools(hover_ghb)

# Add hover tool for BMI
hover_bmi = HoverTool(tooltips=[("Gender", "@Gender")] +
                               [(f"BMI ({category})", f"@{category}") for category in categories])
p2.add_tools(hover_bmi)

# Create legend
legend = Legend(items=[(category, [renderer]) for category, renderer in zip(categories, renderers_ghb)], 
                location="top_center", title="Told to have Diabetes")
p1.add_layout(legend, "right")

# Share y-axis
p2.y_range = p1.y_range

# Hide the legend on the Glycohemoglobin plot and show it on the BMI plot
p1.legend.visible = False
p2.legend.visible = True

# Turn off y-axis in BMI plot ==> Share Y
p2.yaxis.visible = False

# Show the layout with plots arranged horizontally
show(row(p1, p2))




#### There is a clear association between higher glycohemoglobin (%) levels and reported diabetes. While the American Diabetes Association (ADA) typically sets a threshold of 6.5% or higher for diagnosing diabetes, the mean glycohemoglobin (%) of individuals with diabetes in our dataset is about 7.4% to 7.5% across the two genders, well above that threshold.

#### For BMI, there is again a noticeable trend: higher mean BMI corresponds to reported diabetes. However, unlike glycohemoglobin (%), the mean BMI of individuals with diabetes and of those with borderline diabetes barely differs. This is particularly striking among females, where the two values are almost identical, in stark contrast to glycohemoglobin (%), where the values for individuals with diabetes are notably higher than for those with borderline diabetes or without the condition.

#### These observations are only initial insights drawn from the visualizations. To verify them, we need regression analysis, which quantifies the relationships between the variables and offers a more rigorous way to confirm our findings.
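
#### As a quick check before any modeling, a rank-based test such as Kruskal-Wallis (the `kruskal` function imported at the top) can indicate whether an indicator differs between the three diabetes groups. The sketch below runs it on synthetic glycohemoglobin-like values generated with NumPy; the group means, spreads, and sizes are assumptions chosen for illustration, not estimates from NHANES.

```python
import numpy as np
from scipy.stats import kruskal

# Synthetic glycohemoglobin-like samples for the three groups (assumed parameters)
rng = np.random.default_rng(0)
no_group = rng.normal(5.4, 0.4, 200)
borderline_group = rng.normal(6.0, 0.5, 60)
yes_group = rng.normal(7.4, 1.5, 80)

# Kruskal-Wallis H-test: do the groups come from the same distribution?
stat, p_value = kruskal(no_group, borderline_group, yes_group)
print(f"H = {stat:.1f}, p = {p_value:.3g}")
```

#### A small p-value only says that at least one group differs; it does not say which one, so pairwise comparisons or a regression model are still needed afterwards.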

#### Let's take a look again at our dataset to identify any missing values.

In [756]:
filter_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10190 entries, 3 to 14985
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Told to have Diabetes   10190 non-null  object 
 1   Gender                  10190 non-null  object 
 2   Age                     10190 non-null  float64
 3   Race                    10190 non-null  object 
 4   Glycohemoglobin(%)      8900 non-null   float64
 5   Fasting Glucose(mg/dL)  4365 non-null   float64
 6   Day_Vigorous            2417 non-null   float64
 7   Day_Moderate            4186 non-null   float64
 8   Hr_sleep_weekday        10101 non-null  float64
 9   Hr_sleep_weekend        10095 non-null  float64
 10  BMI                     9260 non-null   float64
 11  Weight(kg)              9281 non-null   float64
 12  Height(cm)              9275 non-null   float64
 13  Waist Circum(cm)        8913 non-null   float64
 14  Insulin(uU/mL)          4264 non-null   flo

#### The large number of missing values in the columns "Fasting Glucose(mg/dL)", "Day_Vigorous", "Day_Moderate", and "Insulin(uU/mL)" (each is present in fewer than half of the 10190 rows) could affect the validity of our analysis if these columns were included in our model.

#### After removing these 4 columns, 16 columns remain in the dataset. Despite this reduction, the remaining columns should still provide enough variables for the analysis.
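
#### The choice of columns to drop can also be made reproducible by computing the share of missing values per column and comparing it with a cut-off. The sketch below uses an invented five-row frame, and the 50% `threshold` is a hypothetical value, not one taken from this report.

```python
import numpy as np
import pandas as pd

# Invented frame with different amounts of missing data per column
toy = pd.DataFrame({
    "BMI": [37.8, np.nan, 29.7, 21.9, 29.5],
    "Insulin(uU/mL)": [np.nan, np.nan, 16.96, np.nan, 7.10],
    "Age": [29, 21, 49, 36, 52],
})

# isna() gives booleans, so the column mean is the fraction of missing values
missing_share = toy.isna().mean()

threshold = 0.5  # hypothetical cut-off
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_share)
print("Dropped:", to_drop)
```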

In [757]:
# Drop columns that have a lot of null values
filter_df.drop(columns = ["Fasting Glucose(mg/dL)","Day_Vigorous",
                          "Day_Moderate","Insulin(uU/mL)"],inplace = True)
filter_df

Unnamed: 0,Told to have Diabetes,Gender,Age,Race,Glycohemoglobin(%),Hr_sleep_weekday,Hr_sleep_weekend,BMI,Weight(kg),Height(cm),Waist Circum(cm),Energy(kcal),Protein(gm),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
3,No,Female,29.0,Non-Hispanic Asian,5.2,7.5,8.0,37.8,97.1,160.2,117.9,1797.0,57.750,246.655,79.955,19.70
4,No,Female,21.0,Other Hispanic,,8.0,8.0,,,,,,,,,
5,No,Female,18.0,Non-Hispanic White,,8.5,8.0,,,,,,,,,
8,No,Male,49.0,Non-Hispanic White,5.6,10.0,13.0,29.7,98.8,182.3,120.4,3149.5,125.550,303.210,129.510,12.85
9,No,Male,36.0,Non-Hispanic White,5.1,6.5,8.0,21.9,74.3,184.2,86.8,1766.0,68.840,243.940,134.545,11.85
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14978,No,Male,52.0,Non-Hispanic Black,5.8,6.0,6.0,29.5,94.3,178.8,99.3,1618.5,75.910,134.795,39.905,7.05
14980,Yes,Female,67.0,Mexican American,6.6,8.0,8.0,37.9,82.8,147.8,110.0,1131.0,51.330,90.330,51.030,5.80
14981,No,Male,40.0,Non-Hispanic Black,5.9,6.0,7.0,38.2,108.8,168.7,114.7,3397.0,86.565,432.870,249.520,18.65
14984,Borderline,Male,63.0,Non-Hispanic Black,5.9,8.0,9.0,25.5,79.5,176.4,97.1,1698.0,138.100,110.590,50.570,6.70


In [758]:
# Checking null values again
filter_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10190 entries, 3 to 14985
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Told to have Diabetes  10190 non-null  object 
 1   Gender                 10190 non-null  object 
 2   Age                    10190 non-null  float64
 3   Race                   10190 non-null  object 
 4   Glycohemoglobin(%)     8900 non-null   float64
 5   Hr_sleep_weekday       10101 non-null  float64
 6   Hr_sleep_weekend       10095 non-null  float64
 7   BMI                    9260 non-null   float64
 8   Weight(kg)             9281 non-null   float64
 9   Height(cm)             9275 non-null   float64
 10  Waist Circum(cm)       8913 non-null   float64
 11  Energy(kcal)           8550 non-null   float64
 12  Protein(gm)            8550 non-null   float64
 13  Carbonhydrate(gm)      8550 non-null   float64
 14  Total Sugars(gm)       8550 non-null   float64
 15  Dietary

#### Having removed the columns with a significant number of null values, we are left with some columns that have a moderate number of nulls. These are not prevalent enough to warrant deleting the columns, and deleting them would severely affect our regression analysis.

#### Instead, we will employ imputation techniques such as median, mean, and random sample imputation to handle the remaining null values in the columns. Subsequently, we will compare the effects of these three techniques on the distribution of the data to determine the most suitable approach for filling the null values.

#### Let's conduct an experiment on our "BMI" column.

In [759]:
# BMI

# Mean values
mean_value = filter_df["BMI"].mean()
mean_impt = filter_df["BMI"].fillna(mean_value)

# Median value 
median = filter_df["BMI"].median()
median_impt = filter_df["BMI"].fillna(median)


# Random sample imputation
def random_sample_imputation(df):
    # Work on a copy so the caller's DataFrame (or a slice of it) is never modified
    df = df.copy()
    cols_with_missing_values = df.columns[df.isna().any()].tolist()

    for var in cols_with_missing_values:
        # Draw, with replacement, one observed value for every missing entry
        random_sample = df[var].dropna().sample(df[var].isnull().sum(), replace=True, random_state=0)
        # Replace the missing entries with the sampled values
        df.loc[df[var].isnull(), var] = random_sample.values

    return df

random_impt = random_sample_imputation(filter_df[["BMI"]])


In [760]:
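def create_kde_plot(data, color, legend_label, visible):
    # Estimate the density of the data with a Gaussian kernel
    kde = gaussian_kde(data)
    # Evaluate the density on an even grid spanning the observed range
    x_vals = np.linspace(min(data), max(data), 1000)
    y_vals = kde(x_vals)
    # Draw the curve on the module-level plot_figure defined below
    if visible:
        plot_figure.line(x_vals, y_vals, color=color, legend_label=legend_label)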

# Data for different imputation methods
data_dict = {
    "Random Sample Imputation": random_impt["BMI"],
    "Median Imputation": median_impt,
    "Mean Imputation": mean_impt,
    "Original Data": filter_df["BMI"].dropna()
}

# Colors for each plot
colors = ['red', 'green', 'blue', 'orange']

# Create Bokeh plot
plot_figure = figure(title="BMI Distribution with different imputation methods", x_axis_label="BMI",
               y_axis_label="Density", width=600, height=400)

# Plot KDE for each imputation method with different colors
for i, (label, data) in enumerate(data_dict.items()):
    create_kde_plot(data, color=colors[i], legend_label=label, visible=True)

# Checkbox callback function
checkboxes = CheckboxGroup(labels=list(data_dict.keys()), active=list(range(len(data_dict))))
checkbox_callback = CustomJS(args=dict(checkboxes=checkboxes, plot_figure=plot_figure), code="""
    const active = checkboxes.active;
    for (let i = 0; i < checkboxes.labels.length; i++) {
        const visible = active.includes(i);
        const glyph = plot_figure.renderers[i];
        glyph.visible = visible;
    }
""")
checkboxes.js_on_change('active', checkbox_callback)

# Show the plot with checkboxes
column_layout = column(plot_figure, checkboxes)
show(column_layout)

#### The BMI density plot indicates that random sample imputation stays closest to the original distribution, whereas mean and median imputation add an artificial peak at the single imputed value.
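
#### The visual comparison can be backed by a simple number: filling every gap with the same value pulls the standard deviation down, while drawing replacements from the observed values tends to keep it close to the original. The sketch below illustrates this on synthetic BMI-like data; the distribution parameters and the 10% missing rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic BMI-like values with every tenth entry blanked out
rng = np.random.default_rng(0)
bmi = pd.Series(rng.normal(29.7, 7.6, 1000))
bmi.iloc[::10] = np.nan

# Mean imputation: every gap receives the same value
mean_filled = bmi.fillna(bmi.mean())

# Random sample imputation: every gap receives a randomly drawn observed value
draws = bmi.dropna().sample(bmi.isna().sum(), replace=True, random_state=0)
random_filled = bmi.copy()
random_filled[bmi.isna()] = draws.to_numpy()

print("original std:     ", bmi.std())
print("mean-imputed std: ", mean_filled.std())
print("random-sample std:", random_filled.std())
```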

#### Let's test another column to confirm our assumption. We'll use the "Protein(gm)" column for the next experiment.

In [761]:
# Protein 
# Mean values
mean_value = filter_df["Protein(gm)"].mean()
mean_impt = filter_df["Protein(gm)"].fillna(mean_value)

# Median value 
median = filter_df["Protein(gm)"].median()
median_impt = filter_df["Protein(gm)"].fillna(median)

# Random sample
random_impt = random_sample_imputation(filter_df[["Protein(gm)"]])




In [762]:
# Data for different imputation methods
data_dict = {
    "Random Sample Imputation": random_impt["Protein(gm)"],
    "Median Imputation": median_impt,
    "Mean Imputation": mean_impt,
    "Original Data": filter_df["Protein(gm)"].dropna()
}

# Colors for each plot
colors = ['black', 'green', 'purple', 'yellow']

# Create Bokeh plot
plot_figure = figure(title="Protein(gm) Distribution with different imputation methods", x_axis_label="Protein(gm)",
               y_axis_label="Density", width=600, height=400)

# Plot KDE for each imputation method with different colors
for i, (label, data) in enumerate(data_dict.items()):
    create_kde_plot(data, color=colors[i], legend_label=label, visible=True)

# Checkbox callback function
checkboxes = CheckboxGroup(labels=list(data_dict.keys()), active=list(range(len(data_dict))))
checkbox_callback = CustomJS(args=dict(checkboxes=checkboxes, plot_figure=plot_figure), code="""
    const active = checkboxes.active;
    for (let i = 0; i < checkboxes.labels.length; i++) {
        const visible = active.includes(i);
        const glyph = plot_figure.renderers[i];
        glyph.visible = visible;
    }
""")
checkboxes.js_on_change('active', checkbox_callback)

# Show the plot with checkboxes
column_layout = column(plot_figure, checkboxes)
show(column_layout)

#### Both experiments suggest that random sample imputation best preserves the original distribution when replacing null values.
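#### As a rough numerical check of the visual comparison, the sketch below contrasts mean imputation with random sample imputation on synthetic data. The helper `impute_random_sample`, the synthetic `bmi` series, and the 30% missing rate are illustrative assumptions; this is not the `random_sample_imputation` function used in this report.

```python
import numpy as np
import pandas as pd


def impute_random_sample(series, seed=0):
    """Fill nulls by drawing, with replacement, from the observed values (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    out = series.copy()
    missing = out.isna()
    observed = out.dropna().to_numpy()
    out.loc[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return out


# Synthetic stand-in for a BMI column with roughly 30% of the values missing
rng = np.random.default_rng(42)
bmi = pd.Series(rng.normal(29, 7, size=2000))
bmi.loc[rng.random(2000) < 0.3] = np.nan

mean_filled = bmi.fillna(bmi.mean())
random_filled = impute_random_sample(bmi)

# Compare the spread of each version with the spread of the observed values
spread = pd.Series({
    "observed": bmi.std(),
    "mean imputation": mean_filled.std(),
    "random sample imputation": random_filled.std(),
})
print(spread.round(2))
```

#### Mean imputation shrinks the standard deviation because every filled value sits exactly at the mean, while random sample imputation keeps the spread close to that of the observed data.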

#### We therefore use this technique to replace the null values in every column that contains them.

In [763]:
# Create a list of columns that have null values
columns_with_null = filter_df.columns[filter_df.isnull().any()].tolist()
# Use `col` as the loop variable so it does not shadow bokeh.layouts.column
for col in columns_with_null:
    filter_df[col] = random_sample_imputation(filter_df[[col]])

filter_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10190 entries, 3 to 14985
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Told to have Diabetes  10190 non-null  object 
 1   Gender                 10190 non-null  object 
 2   Age                    10190 non-null  float64
 3   Race                   10190 non-null  object 
 4   Glycohemoglobin(%)     10190 non-null  float64
 5   Hr_sleep_weekday       10190 non-null  float64
 6   Hr_sleep_weekend       10190 non-null  float64
 7   BMI                    10190 non-null  float64
 8   Weight(kg)             10190 non-null  float64
 9   Height(cm)             10190 non-null  float64
 10  Waist Circum(cm)       10190 non-null  float64
 11  Energy(kcal)           10190 non-null  float64
 12  Protein(gm)            10190 non-null  float64
 13  Carbonhydrate(gm)      10190 non-null  float64
 14  Total Sugars(gm)       10190 non-null  float64
 15  Dietary


#### After dealing with null values, we must define the target variable and predictors for the ordinal regression analysis. Our target variable is "Told to have Diabetes", which records whether an individual's diabetes diagnosis status is No, Borderline, or Yes.

#### We divide the remaining columns into two types: categorical and continuous. "Gender" and "Race" are categorical, while the rest serve as continuous predictors.

#### In this step, we assess whether the target variable is associated with "Gender" and "Race" by computing Cramér's V.

In [764]:
# Build a new dataframe that holds the target and the categorical variables in our dataset
category_df = pd.DataFrame(({"Told to have Diabetes": filter_df["Told to have Diabetes"],
                         "Gender": filter_df["Gender"],
                         "Race": filter_df["Race"]}))
category_df

Unnamed: 0,Told to have Diabetes,Gender,Race
3,No,Female,Non-Hispanic Asian
4,No,Female,Other Hispanic
5,No,Female,Non-Hispanic White
8,No,Male,Non-Hispanic White
9,No,Male,Non-Hispanic White
...,...,...,...
14978,No,Male,Non-Hispanic Black
14980,Yes,Female,Mexican American
14981,No,Male,Non-Hispanic Black
14984,Borderline,Male,Non-Hispanic Black


In [765]:
# Bias-corrected Cramér's V for a contingency table passed as a NumPy array
def cramers_v(confusion_matrix):
    chi2 = ss.chi2_contingency(confusion_matrix)[0]  # chi-squared statistic
    n = confusion_matrix.sum()                       # total number of observations
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Apply the bias correction to phi^2 and to the table dimensions
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

In [766]:
matrix = pd.crosstab(filter_df["Told to have Diabetes"],filter_df["Gender"])
cramers_v(matrix.values)

0.04066328225684235

In [767]:
matrix = pd.crosstab(filter_df["Told to have Diabetes"],filter_df["Race"])
cramers_v(matrix.values)

0.0316639654883984

#### In both cases Cramér's V is very low (about 0.04 for "Gender" and 0.03 for "Race"), so the association between these variables and the target variable is weak.
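#### To give a feel for the scale of Cramér's V, the sketch below evaluates it on two made-up contingency tables using SciPy's `scipy.stats.contingency.association`. The tables are hypothetical, and SciPy computes the uncorrected statistic, so its values can differ slightly from the bias-corrected `cramers_v` function used in this report.

```python
import numpy as np
from scipy.stats.contingency import association

# Hypothetical tables (rows: diabetes status, columns: a two-level factor)
independent = np.array([[40, 40],
                        [10, 10],
                        [20, 20]])   # identical column split in every row
dependent = np.array([[50, 0],
                      [0, 30],
                      [25, 25]])     # column split depends strongly on the row

# Uncorrected Cramér's V: 0 means no association, 1 means perfect association
v_independent = association(independent, method="cramer")
v_dependent = association(dependent, method="cramer")
print(round(v_independent, 3), round(v_dependent, 3))
```

#### Values of about 0.03 to 0.04, as obtained for "Gender" and "Race", sit very close to the no-association end of this scale.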

#### We therefore remove these two categorical columns from our dataset.

In [768]:
filter_df.drop(columns = ["Gender","Race"],inplace = True)
filter_df

Unnamed: 0,Told to have Diabetes,Age,Glycohemoglobin(%),Hr_sleep_weekday,Hr_sleep_weekend,BMI,Weight(kg),Height(cm),Waist Circum(cm),Energy(kcal),Protein(gm),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
3,No,29.0,5.2,7.5,8.0,37.8,97.1,160.2,117.9,1797.0,57.750,246.655,79.955,19.70
4,No,21.0,6.1,8.0,8.0,25.6,86.1,168.3,98.8,1022.0,33.520,158.635,43.285,5.95
5,No,18.0,7.1,8.5,8.0,27.5,61.3,154.4,116.4,3970.0,144.755,519.490,194.120,45.30
8,No,49.0,5.6,10.0,13.0,29.7,98.8,182.3,120.4,3149.5,125.550,303.210,129.510,12.85
9,No,36.0,5.1,6.5,8.0,21.9,74.3,184.2,86.8,1766.0,68.840,243.940,134.545,11.85
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14978,No,52.0,5.8,6.0,6.0,29.5,94.3,178.8,99.3,1618.5,75.910,134.795,39.905,7.05
14980,Yes,67.0,6.6,8.0,8.0,37.9,82.8,147.8,110.0,1131.0,51.330,90.330,51.030,5.80
14981,No,40.0,5.9,6.0,7.0,38.2,108.8,168.7,114.7,3397.0,86.565,432.870,249.520,18.65
14984,Borderline,63.0,5.9,8.0,9.0,25.5,79.5,176.4,97.1,1698.0,138.100,110.590,50.570,6.70


#### Next, we investigate the relationship between our target variable and the remaining continuous variables.

#### Because our target variable is categorical, Pearson's correlation coefficient is not appropriate. Instead, we use the Kruskal-Wallis test to check whether the distribution of each continuous variable differs across the three diabetes groups.

In [769]:
accept_column = []
target = filter_df["Told to have Diabetes"]
# Run the Kruskal-Wallis test on each continuous column across the three groups
for col in filter_df.columns[1:]:
    h_statistic, p_value = kruskal(filter_df.loc[target == "No", col],
                                   filter_df.loc[target == "Borderline", col],
                                   filter_df.loc[target == "Yes", col])
    # Keep the column if the groups differ at the 5% significance level
    if p_value < 0.05:
        accept_column.append(col)
print("Columns that pass the test:", accept_column)

Columns that pass the test: ['Age', 'Glycohemoglobin(%)', 'Hr_sleep_weekend', 'BMI', 'Weight(kg)', 'Waist Circum(cm)', 'Energy(kcal)', 'Carbonhydrate(gm)', 'Total Sugars(gm)', 'Dietary Fiber(gm)']
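
#### With more than 10,000 rows, even small group differences can yield p-values below 0.05, so it can help to report an effect size next to the test. The sketch below runs the Kruskal-Wallis test on synthetic groups and computes epsilon squared as H / (n - 1). The group means, sizes, and variable names are illustrative assumptions, not values from NHANES.

```python
import numpy as np
from scipy.stats import kruskal

# Synthetic stand-ins for one continuous variable split by diabetes status
rng = np.random.default_rng(1)
group_no = rng.normal(5.4, 0.5, size=300)
group_borderline = rng.normal(5.9, 0.5, size=60)
group_yes = rng.normal(7.0, 1.0, size=120)

h_stat, p_value = kruskal(group_no, group_borderline, group_yes)

# Epsilon squared: share of rank variability explained by group membership
n_total = len(group_no) + len(group_borderline) + len(group_yes)
epsilon_sq = h_stat / (n_total - 1)

print(f"H = {h_stat:.1f}, p = {p_value:.3g}, epsilon^2 = {epsilon_sq:.2f}")
```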


#### Let's filter out those columns that have not passed the test

In [770]:
filter_df = filter_df[ ["Told to have Diabetes"] + accept_column]
filter_df

Unnamed: 0,Told to have Diabetes,Age,Glycohemoglobin(%),Hr_sleep_weekend,BMI,Weight(kg),Waist Circum(cm),Energy(kcal),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
3,No,29.0,5.2,8.0,37.8,97.1,117.9,1797.0,246.655,79.955,19.70
4,No,21.0,6.1,8.0,25.6,86.1,98.8,1022.0,158.635,43.285,5.95
5,No,18.0,7.1,8.0,27.5,61.3,116.4,3970.0,519.490,194.120,45.30
8,No,49.0,5.6,13.0,29.7,98.8,120.4,3149.5,303.210,129.510,12.85
9,No,36.0,5.1,8.0,21.9,74.3,86.8,1766.0,243.940,134.545,11.85
...,...,...,...,...,...,...,...,...,...,...,...
14978,No,52.0,5.8,6.0,29.5,94.3,99.3,1618.5,134.795,39.905,7.05
14980,Yes,67.0,6.6,8.0,37.9,82.8,110.0,1131.0,90.330,51.030,5.80
14981,No,40.0,5.9,7.0,38.2,108.8,114.7,3397.0,432.870,249.520,18.65
14984,Borderline,63.0,5.9,9.0,25.5,79.5,97.1,1698.0,110.590,50.570,6.70


In [771]:
# Check the values range
filter_df.describe()

Unnamed: 0,Age,Glycohemoglobin(%),Hr_sleep_weekend,BMI,Weight(kg),Waist Circum(cm),Energy(kcal),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
count,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0
mean,47.960255,5.809764,8.361531,29.666075,82.700353,99.5363,2033.074141,236.32481,100.614493,15.97332
std,19.502153,1.100796,1.821767,7.597456,23.296978,17.682151,863.040285,108.53688,63.701624,9.250914
min,16.0,2.8,2.0,13.2,32.6,56.4,14.0,1.0,0.03,5.397605e-79
25%,31.0,5.3,7.0,24.3,66.3,86.9,1432.0,162.67125,56.64125,9.6
50%,49.0,5.5,8.0,28.4,79.1,98.2,1899.0,220.325,88.075,14.075
75%,64.0,5.9,9.5,33.5,94.9,110.4,2489.0,292.265,128.305,20.1375
max,80.0,16.2,14.0,92.3,254.3,187.5,8967.5,1183.345,892.645,103.4


#### The columns span very different scales: Energy(kcal), for example, ranges from 14 to 8967.5, while Glycohemoglobin(%) ranges from 2.8 to 16.2. Standardizing the dataset is therefore necessary to put all features on a comparable scale.

In [772]:
continuous_df = filter_df.drop(columns = ["Told to have Diabetes"])
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the data and transform it
std_df = scaler.fit_transform(continuous_df)

# Convert the standardized array back to a DataFrame
std_df = pd.DataFrame(std_df, columns=continuous_df.columns)


In [773]:
display(std_df.info())
display(std_df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10190 entries, 0 to 10189
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 10190 non-null  float64
 1   Glycohemoglobin(%)  10190 non-null  float64
 2   Hr_sleep_weekend    10190 non-null  float64
 3   BMI                 10190 non-null  float64
 4   Weight(kg)          10190 non-null  float64
 5   Waist Circum(cm)    10190 non-null  float64
 6   Energy(kcal)        10190 non-null  float64
 7   Carbonhydrate(gm)   10190 non-null  float64
 8   Total Sugars(gm)    10190 non-null  float64
 9   Dietary Fiber(gm)   10190 non-null  float64
dtypes: float64(10)
memory usage: 796.2 KB


None

Unnamed: 0,Age,Glycohemoglobin(%),Hr_sleep_weekend,BMI,Weight(kg),Waist Circum(cm),Energy(kcal),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
count,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0,10190.0
mean,1.534047e-16,-2.768258e-16,1.136589e-16,2.105828e-16,5.2994360000000003e-17,-4.413872e-16,4.6021410000000005e-17,-1.554966e-16,1.0459410000000001e-17,-1.115671e-16
std,1.000049,1.000049,1.000049,1.000049,1.000049,1.000049,1.000049,1.000049,1.000049,1.000049
min,-1.638887,-2.734305,-3.492127,-2.167421,-2.150614,-2.439659,-2.339605,-2.168262,-1.579072,-1.726759
25%,-0.8697033,-0.4631099,-0.7474049,-0.7063335,-0.7040037,-0.714671,-0.6964955,-0.6786373,-0.6903341,-0.688973
50%,0.05331698,-0.2814143,-0.1984604,-0.1666527,-0.1545492,-0.07557711,-0.1553586,-0.1474208,-0.196857,-0.2052133
75%,0.8225006,0.081977,0.6249563,0.5046576,0.5236836,0.614418,0.5283047,0.5154279,0.4347121,0.4501596
max,1.642963,9.439302,3.095206,8.24447,7.366108,4.974961,8.035275,8.725759,12.43405,9.451064
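
#### The standard deviation of 1.000049 rather than exactly 1 in the table above has a simple explanation: `StandardScaler` divides by the population standard deviation (ddof=0), whereas `DataFrame.describe()` reports the sample standard deviation (ddof=1). The sketch below illustrates this on a small made-up column; the `toy` dataframe is an assumption for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative column
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

pop_std = scaled["x"].std(ddof=0)      # what StandardScaler normalizes to 1
sample_std = scaled["x"].std(ddof=1)   # what describe() reports
expected_sample_std = np.sqrt(len(toy) / (len(toy) - 1))

# The same ratio for the 10190 rows in this report
report_std = np.sqrt(10190 / 10189)
print(pop_std, sample_std, expected_sample_std, round(report_std, 6))
```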


#### Now that we have a standardized continuous dataframe, let's plug it into our ordinal regression model.

In [774]:
# Make our target variable an ordered categorical: No < Borderline < Yes
cat_type = CategoricalDtype(categories=["No","Borderline","Yes"], ordered=True)
filter_df["Told to have Diabetes"] = filter_df["Told to have Diabetes"].astype(cat_type)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filter_df["Told to have Diabetes"] = filter_df["Told to have Diabetes"].astype(cat_type)


In [775]:
# reset_index returns a new dataframe, so assign the result back
std_df = std_df.reset_index(drop = True)
print("Predictors:")
display(std_df)

Predictors:


Unnamed: 0,Age,Glycohemoglobin(%),Hr_sleep_weekend,BMI,Weight(kg),Waist Circum(cm),Energy(kcal),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
0,-0.972261,-0.553958,-0.198460,1.070664,0.618121,1.038595,-0.273551,0.095181,-0.324332,0.402865
1,-1.382492,0.263673,-0.198460,-0.535215,0.145934,-0.041643,-1.171583,-0.715827,-0.900013,-1.083548
2,-1.536329,1.172151,-0.198460,-0.285119,-0.918634,0.953760,2.244415,2.609059,1.467939,3.170295
3,0.053317,-0.190566,2.546262,0.004466,0.691096,1.179988,1.293660,0.616274,0.453629,-0.337639
4,-0.613309,-0.644806,-0.198460,-1.022244,-0.360595,-0.720327,-0.309473,0.070166,0.532674,-0.445742
...,...,...,...,...,...,...,...,...,...,...
10185,0.207154,-0.008871,-1.296349,-0.021860,0.497928,-0.013364,-0.480388,-0.935487,-0.953076,-0.964635
10186,0.976337,0.717912,-0.198460,1.083827,0.004277,0.591795,-1.045280,-1.345183,-0.778425,-1.099763
10187,-0.408193,0.081977,-0.747405,1.123316,1.120357,0.857613,1.580451,1.810950,2.337661,0.289357
10188,0.771222,0.081977,0.350484,-0.548378,-0.137379,-0.137790,-0.388268,-1.158509,-0.785646,-1.002471


In [776]:
# Reset the index (it still holds the original respondent row labels) so that it aligns with std_df's 0..n-1 index
filter_df = filter_df.reset_index(drop = True)
filter_df

Unnamed: 0,Told to have Diabetes,Age,Glycohemoglobin(%),Hr_sleep_weekend,BMI,Weight(kg),Waist Circum(cm),Energy(kcal),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm)
0,No,29.0,5.2,8.0,37.8,97.1,117.9,1797.0,246.655,79.955,19.70
1,No,21.0,6.1,8.0,25.6,86.1,98.8,1022.0,158.635,43.285,5.95
2,No,18.0,7.1,8.0,27.5,61.3,116.4,3970.0,519.490,194.120,45.30
3,No,49.0,5.6,13.0,29.7,98.8,120.4,3149.5,303.210,129.510,12.85
4,No,36.0,5.1,8.0,21.9,74.3,86.8,1766.0,243.940,134.545,11.85
...,...,...,...,...,...,...,...,...,...,...,...
10185,No,52.0,5.8,6.0,29.5,94.3,99.3,1618.5,134.795,39.905,7.05
10186,Yes,67.0,6.6,8.0,37.9,82.8,110.0,1131.0,90.330,51.030,5.80
10187,No,40.0,5.9,7.0,38.2,108.8,114.7,3397.0,432.870,249.520,18.65
10188,Borderline,63.0,5.9,9.0,25.5,79.5,97.1,1698.0,110.590,50.570,6.70


In [777]:
# Attach the target variable to our standardized dataframe
std_df["Told to have Diabetes"] = filter_df["Told to have Diabetes"]
std_df

Unnamed: 0,Age,Glycohemoglobin(%),Hr_sleep_weekend,BMI,Weight(kg),Waist Circum(cm),Energy(kcal),Carbonhydrate(gm),Total Sugars(gm),Dietary Fiber(gm),Told to have Diabetes
0,-0.972261,-0.553958,-0.198460,1.070664,0.618121,1.038595,-0.273551,0.095181,-0.324332,0.402865,No
1,-1.382492,0.263673,-0.198460,-0.535215,0.145934,-0.041643,-1.171583,-0.715827,-0.900013,-1.083548,No
2,-1.536329,1.172151,-0.198460,-0.285119,-0.918634,0.953760,2.244415,2.609059,1.467939,3.170295,No
3,0.053317,-0.190566,2.546262,0.004466,0.691096,1.179988,1.293660,0.616274,0.453629,-0.337639,No
4,-0.613309,-0.644806,-0.198460,-1.022244,-0.360595,-0.720327,-0.309473,0.070166,0.532674,-0.445742,No
...,...,...,...,...,...,...,...,...,...,...,...
10185,0.207154,-0.008871,-1.296349,-0.021860,0.497928,-0.013364,-0.480388,-0.935487,-0.953076,-0.964635,No
10186,0.976337,0.717912,-0.198460,1.083827,0.004277,0.591795,-1.045280,-1.345183,-0.778425,-1.099763,Yes
10187,-0.408193,0.081977,-0.747405,1.123316,1.120357,0.857613,1.580451,1.810950,2.337661,0.289357,No
10188,0.771222,0.081977,0.350484,-0.548378,-0.137379,-0.137790,-0.388268,-1.158509,-0.785646,-1.002471,Borderline


#### Next, we'll employ an ordinal regression model to evaluate the association between the predictors and the target variable.
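
#### For reference, the logit ordinal regression model can be written as a cumulative logit model. The notation below (thresholds $\theta_j$, coefficient vector $\beta$) is introduced here for illustration; it follows the parameterization of statsmodels' `OrderedModel`, in which a single coefficient vector is shared across the two thresholds.

```latex
P(Y \le j \mid x) = \frac{1}{1 + \exp\!\left(-(\theta_j - x^{\top}\beta)\right)},
\qquad j \in \{\text{No}, \text{Borderline}\},
\qquad \theta_{\text{No}} < \theta_{\text{Borderline}}
```

#### Under this form, a positive coefficient lowers $P(Y \le j \mid x)$ and therefore shifts probability toward the higher categories (Borderline and Yes).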

In [779]:
# Logit ordinal regression model (predictors are the standardized continuous columns)
continuous_df = std_df.drop(columns = ["Told to have Diabetes"])
mod_prob = OrderedModel(std_df["Told to have Diabetes"],
                        continuous_df,
                        distr='logit')
 
res_log = mod_prob.fit(method='bfgs')
print(res_log.summary())

Optimization terminated successfully.
         Current function value: 0.362304
         Iterations: 54
         Function evaluations: 55
         Gradient evaluations: 55
                               OrderedModel Results                              
Dep. Variable:     Told to have Diabetes   Log-Likelihood:                -3691.9
Model:                      OrderedModel   AIC:                             7408.
Method:               Maximum Likelihood   BIC:                             7495.
Date:                   Sun, 19 May 2024                                         
Time:                           15:40:44                                         
No. Observations:                  10190                                         
Df Residuals:                      10178                                         
Df Model:                             10                                         
                         coef    std err          z      P>|z|      [0.025      0.975]
---

#### The summary of our ordinal regression model indicates that several columns, namely "Hr_sleep_weekend", "Energy(kcal)", "Carbonhydrate(gm)", and notably "BMI", which is commonly assumed to have a strong influence on diabetes, are not statistically significant predictors of diabetes status in this model. Their p-values are greater than 0.05, so there is insufficient evidence to reject the null hypothesis that their coefficients are zero once the other predictors are accounted for.

#### After constructing the ordinal regression model, we need to check its assumptions. Three are considered here. First, the dependent variable must be ordered, that is, measured in ranked levels or categories. Second, one or more independent variables must be continuous, categorical, or ordinal. Third, there should be no multicollinearity among the independent variables, meaning they should not be highly correlated with one another.

#### The first two assumptions are met by construction, so only the third remains to be checked.

#### We test for multicollinearity using the variance inflation factor (VIF).


In [780]:
continuous_df = std_df.drop(columns = ["Told to have Diabetes"])
def calculate_vif(df):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data

vif_result = calculate_vif(continuous_df)
print(vif_result)



              Feature        VIF
0                 Age   1.251188
1  Glycohemoglobin(%)   1.108860
2    Hr_sleep_weekend   1.034384
3                 BMI   3.728717
4          Weight(kg)   3.791290
5    Waist Circum(cm)   3.158437
6        Energy(kcal)   4.945455
7   Carbonhydrate(gm)  12.256326
8    Total Sugars(gm)   4.847410
9   Dietary Fiber(gm)   1.864115


#### Based on the VIF results, one column, "Carbonhydrate(gm)", has a VIF of about 12.3, which exceeds the common rule-of-thumb threshold of 10 and indicates multicollinearity within the dataset.
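
#### To make the VIF concrete: for predictor j it equals 1 / (1 - R²), where R² comes from regressing predictor j on all the other predictors. The sketch below computes this by hand on synthetic data; the function `vif_by_hand` and the columns `a`, `b`, `c`, `d` are hypothetical and are not part of the report's pipeline.

```python
import numpy as np


def vif_by_hand(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # almost a linear combination of a and b
d = rng.normal(size=500)                     # unrelated to the other columns
X = np.column_stack([a, b, c, d])

vifs = [vif_by_hand(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))
```

#### The nearly redundant column `c` receives a very large VIF, while the unrelated column `d` stays close to 1.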

#### We therefore drop that column.



In [781]:
std_df.drop(columns = ["Carbonhydrate(gm)"],inplace = True)

In [783]:
# Recompute the VIF to check for multicollinearity again
continuous_df = std_df.drop(columns = ["Told to have Diabetes"])
vif_result = calculate_vif(continuous_df)
print(vif_result)

              Feature       VIF
0                 Age  1.233665
1  Glycohemoglobin(%)  1.108023
2    Hr_sleep_weekend  1.034374
3                 BMI  3.728521
4          Weight(kg)  3.778346
5    Waist Circum(cm)  3.157339
6        Energy(kcal)  2.509722
7    Total Sugars(gm)  1.911666
8   Dietary Fiber(gm)  1.431470


#### Now that all VIF values are below 10, we can rebuild our model.
#### Since this module is introductory, constructing more intricate models aimed at maximizing accuracy is beyond the scope of this report.

In [784]:
# Drop columns that are not statistically significant predictors of our target variable
std_df.drop(columns = ["Hr_sleep_weekend","BMI","Energy(kcal)"],inplace = True)



In [785]:
# Logit Ordinal Regression model
mod_prob = OrderedModel(std_df["Told to have Diabetes"],
                        std_df.drop(columns= ["Told to have Diabetes"]),
                        distr='logit')
 
res_log = mod_prob.fit(method='bfgs')
print(res_log.summary())

Optimization terminated successfully.
         Current function value: 0.362308
         Iterations: 39
         Function evaluations: 40
         Gradient evaluations: 40
                               OrderedModel Results                              
Dep. Variable:     Told to have Diabetes   Log-Likelihood:                -3691.9
Model:                      OrderedModel   AIC:                             7400.
Method:               Maximum Likelihood   BIC:                             7458.
Date:                   Sun, 19 May 2024                                         
Time:                           15:42:34                                         
No. Observations:                  10190                                         
Df Residuals:                      10182                                         
Df Model:                              6                                         
                         coef    std err          z      P>|z|      [0.025      0.975]
---
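
#### The two model summaries can be compared through their information criteria. The sketch below recomputes AIC and BIC from the reported log-likelihood of -3691.9, assuming the parameter count is the number of predictors plus the two thresholds (12 for the first model, 8 for the reduced one). That count is an assumption inferred from the "Df Residuals" rows, and the variable names are illustrative.

```python
import math

n_obs = 10190        # "No. Observations" in both summaries
log_lik = -3691.9    # "Log-Likelihood" in both summaries (rounded)


def aic(k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_lik


def bic(k):
    """Bayesian information criterion for k estimated parameters."""
    return k * math.log(n_obs) - 2 * log_lik


k_full, k_reduced = 10 + 2, 6 + 2   # predictors plus two thresholds (assumed)
print(round(aic(k_full)), round(bic(k_full)))        # first model
print(round(aic(k_reduced)), round(bic(k_reduced)))  # reduced model
```

#### Since the log-likelihood is essentially unchanged, the lower AIC and BIC of the reduced model come entirely from it using fewer parameters.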

#### The reduced model fits about as well as the first one while using fewer predictors. Let's check its accuracy.

In [786]:
predicted = res_log.model.predict(res_log.params, exog=std_df.drop(columns = ["Told to have Diabetes"]))
predicted = pd.DataFrame(predicted,columns = ["No","Borderline","Yes"])

print("The probability of having the Diabetes condition :")
display(predicted)

The probability of having the Diabetes condition :


Unnamed: 0,No,Borderline,Yes
0,0.969787,0.008320,0.021893
1,0.958585,0.011309,0.030106
2,0.887269,0.029084,0.083647
3,0.908492,0.024025,0.067483
4,0.985027,0.004170,0.010803
...,...,...,...
10185,0.876069,0.031673,0.092257
10186,0.568764,0.078599,0.352638
10187,0.941048,0.015883,0.043069
10188,0.817703,0.044239,0.138057


In [787]:
from sklearn.metrics import accuracy_score

# Encode the ordered categories as integers: No -> 0, Borderline -> 1, Yes -> 2
true_labels = std_df["Told to have Diabetes"].cat.codes.to_numpy()

# Identify the predicted label as the class with the highest probability in each row
predicted_labels = predicted.to_numpy().argmax(axis=1)
# Calculate the accuracy score
accuracy = accuracy_score(true_labels, predicted_labels)
print("Accuracy Score:", accuracy)


Accuracy Score: 0.882531894013739


#### Our model achieves a prediction accuracy of around 88%. This figure should be read with caution: the three classes are imbalanced, with Borderline occurring less often than Yes and No, so overall accuracy can be dominated by the most frequent class and may hide poor performance on the rarer ones.
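
#### The sketch below illustrates why accuracy alone can mislead on imbalanced classes. It scores a trivial classifier that always predicts "No" on made-up labels; the class counts are hypothetical and are not the NHANES counts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 0 = No, 1 = Borderline, 2 = Yes
y_true = np.array([0] * 90 + [1] * 3 + [2] * 7)
y_pred = np.zeros_like(y_true)  # always predict the majority class "No"

plain_accuracy = accuracy_score(y_true, y_pred)
# Balanced accuracy averages the recall of each class, so rare classes count equally
balanced = balanced_accuracy_score(y_true, y_pred)
print(plain_accuracy, round(balanced, 3))
```

#### Here the trivial classifier reaches 90% accuracy while never identifying a Borderline or Yes case, which is why a per-class measure is a useful companion to the accuracy score.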


#### In summary, our analysis sheds light on the relationship between common indicators such as age, weight, and waist circumference and diabetes status: older age and larger body size are associated with a higher likelihood of diabetes. Surprisingly, BMI was not a statistically significant predictor in the regression model, even though it passed the Kruskal-Wallis screening. Because BMI overlaps with weight and waist circumference, which are also in the model, this result is better read as BMI adding little information beyond those measures than as evidence that BMI is unrelated to diabetes.

#### Furthermore, we found only a very weak association between gender and diabetes status (Cramér's V of about 0.04), which does not support the notion that gender strongly affects diabetes susceptibility. Instead, biochemical markers, particularly Glycohemoglobin(%), emerge as strong predictors of diabetes status. Dietary habits also matter: higher sugar consumption and lower fiber intake are associated with elevated diabetes risk.

#### Expanding on these insights, we recognize the multifaceted nature of diabetes prediction. While age and physical appearance provide initial clues, focusing solely on these visible cues overlooks critical biochemical and dietary indicators. By incorporating markers like Glycohemoglobin(%) and dietary patterns into our assessment, we gain a more nuanced understanding of diabetes risk.

#### However, it's important to acknowledge the limitations of our analysis. The NHANES dataset, while rich in information, may not fully represent the broader population. Factors such as response bias, where individuals may withhold or provide inaccurate information, can influence the validity of our findings. Therefore, while our analysis offers valuable insights, further research incorporating diverse datasets and robust methodologies is necessary to validate and generalize these findings.