## The dataset for this project was downloaded from [Kaggle](https://www.kaggle.com/datasets/ashmitcajla/dataset-for-blood-glucose-level-readings?). <br/>
The dataset records blood glucose level readings for people of different age groups, both
diabetic and non-diabetic, together with superficial body features such as body temperature,
heart rate, and blood pressure. <br/>
The main purpose of the dataset is to understand the effect of blood glucose level on the human
body. <br/>
The different superficial body parameters vary significantly with changes in blood glucose
level. <br/>
The records combine true records with synthetic records. For diabetic patients, the true records
were gathered with electronic gadgets that read body temperature, heart rate, and blood pressure,
and the blood glucose readings were taken either by the pricking method or by the flash glucose
monitoring technique used by the patients. For non-diabetic people, readings were gathered with
electronic gadgets such as smart wristbands, and a synthetic blood glucose level was recorded
based on each person's average blood glucose level over 5 days. For non-diabetic people there is
not much variation in blood glucose levels.

In [1]:
# Import the libraries needed for loading, encoding, and splitting the data
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [3]:
# Load the dataset from the Excel file
file_path = '../data/blood_glucose_level.xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet1')
df.head()

| | Age | Blood Glucose Level(BGL) | Diastolic Blood Pressure | Systolic Blood Pressure | Heart Rate | Body Temperature | SPO2 | Sweating (Y/N) | Shivering (Y/N) | Diabetic/NonDiabetic (D/N) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 79 | 73 | 118 | 98 | 98.300707 | 99 | 0 | 0 | N |
| 1 | 9 | 80 | 73 | 119 | 102 | 98.300707 | 94 | 1 | 0 | N |
| 2 | 9 | 70 | 76 | 110 | 81 | 98.300707 | 98 | 1 | 0 | N |
| 3 | 9 | 70 | 78 | 115 | 96 | 98.300707 | 96 | 1 | 0 | N |
| 4 | 66 | 100 | 96 | 144 | 92 | 97.807052 | 98 | 0 | 0 | N |


### The dataset contains the following key columns:

1. Age: Age of the person. <br/>
2. Blood Glucose Level (BGL): Blood glucose level reading. <br/>
3. Diastolic and Systolic Blood Pressure: Blood pressure measurements. <br/>
4. Heart Rate: Heartbeats per minute. <br/>
5. Body Temperature: Body temperature in degrees (the values, around 97 to 98, are consistent with degrees Fahrenheit). <br/>
6. SPO2: Blood oxygen saturation level. <br/>
7. Sweating (Y/N) and Shivering (Y/N): Binary indicators (1 = Yes, 0 = No). <br/>
8. Diabetic/NonDiabetic (D/N): Target variable to classify (D = diabetic, N = non-diabetic). <br/>
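
Column names with spaces and punctuation, such as `Blood Glucose Level(BGL)`, are easy to mistype when selecting columns. A minimal sketch of collapsing repeated whitespace in headers, assuming a small hand-made frame rather than the real file (the helper name `tidy_columns` is hypothetical and not part of this notebook):

```python
import re
import pandas as pd

def tidy_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names have runs of whitespace collapsed."""
    cleaned = [re.sub(r"\s+", " ", str(name)).strip() for name in frame.columns]
    return frame.rename(columns=dict(zip(frame.columns, cleaned)))

# Toy frame for illustration only; the values are not taken from the dataset.
toy = pd.DataFrame({"Sweating  (Y/N)": [0, 1], "Shivering (Y/N)": [0, 0]})
tidy = tidy_columns(toy)
print(list(tidy.columns))
```

The original frame is left untouched, so the rest of the notebook can keep using the raw names if preferred.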

In [4]:
# Display basic information about the dataset
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16969 entries, 0 to 16968
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         16969 non-null  int64  
 1   Blood Glucose Level(BGL)    16969 non-null  int64  
 2   Diastolic Blood Pressure    16969 non-null  int64  
 3   Systolic Blood Pressure     16969 non-null  int64  
 4   Heart Rate                  16969 non-null  int64  
 5   Body Temperature            16969 non-null  float64
 6   SPO2                        16969 non-null  int64  
 7   Sweating  (Y/N)             16969 non-null  int64  
 8   Shivering (Y/N)             16969 non-null  int64  
 9   Diabetic/NonDiabetic (D/N)  16969 non-null  object 
dtypes: float64(1), int64(8), object(1)
memory usage: 1.3+ MB
None


In [5]:
# Check for missing values
print(df.isnull().sum())

Age                           0
Blood Glucose Level(BGL)      0
Diastolic Blood Pressure      0
Systolic Blood Pressure       0
Heart Rate                    0
Body Temperature              0
SPO2                          0
Sweating  (Y/N)               0
Shivering (Y/N)               0
Diabetic/NonDiabetic (D/N)    0
dtype: int64


In [6]:
# Statistical summary of numerical features
print(df.describe())

                Age  Blood Glucose Level(BGL)  Diastolic Blood Pressure  \
count  16969.000000              16969.000000              16969.000000   
mean      30.988862                 95.722789                 77.173493   
std       25.585606                 42.994199                  7.241511   
min        9.000000                 50.000000                 60.000000   
25%        9.000000                 68.000000                 71.000000   
50%       14.000000                 83.000000                 76.000000   
75%       55.000000                108.000000                 83.000000   
max       77.000000                250.000000                 98.000000   

       Systolic Blood Pressure    Heart Rate  Body Temperature          SPO2  \
count             16969.000000  16969.000000      16969.000000  16969.000000   
mean                118.187165     91.524191         97.356146     97.382403   
std                   7.700363     10.409780          0.813555      0.848689   
min 

In [7]:
# Encode the target variable
label_encoder = LabelEncoder()
df['Target'] = label_encoder.fit_transform(df['Diabetic/NonDiabetic (D/N)'])

# Display the mapping
print(dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))

{'D': 0, 'N': 1}
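
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `D` becomes 0 and `N` becomes 1 here. If a different coding is preferred, for example diabetic as the positive class, an explicit mapping is one option. A sketch on a toy series (the name `explicit_codes` and the mapping are illustrative assumptions, not what this notebook uses):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(["N", "D", "N", "D", "D"])  # toy labels, not the real data

# LabelEncoder sorts the unique labels, so 'D' -> 0 and 'N' -> 1.
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(list(encoder.classes_), list(encoded))

# Alternative: an explicit mapping that makes 'D' the positive class.
explicit_codes = labels.map({"N": 0, "D": 1})
print(list(explicit_codes))
```

Whichever coding is used, the choice of "positive" class determines how `scale_pos_weight` and the per-class metrics should be read.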


In [8]:
# Drop the original target column
df.drop(columns=['Diabetic/NonDiabetic (D/N)'], inplace=True)

In [9]:
# Check the distribution of target classes
print(df['Target'].value_counts())

Target
0    16641
1      328
Name: count, dtype: int64


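The dataset is highly imbalanced: the majority class (Diabetic, encoded as 0) makes up about 98% of the data (16,641 of 16,969 rows), while the Non-diabetic class (encoded as 1) has only 328 rows.

We will use an XGBoost classifier and compensate for the imbalance with its `scale_pos_weight` parameter, which up-weights the positive class (label 1, the minority here) during training.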
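
With this much imbalance, accuracy alone says little: a model that always predicts the majority class would already be right about 98% of the time. The arithmetic, using the class counts printed above:

```python
# Class counts from the value_counts() output above.
counts = {0: 16641, 1: 328}  # 0 = diabetic (D), 1 = non-diabetic (N)

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
rounded_shares = {label: round(share, 4) for label, share in shares.items()}

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(shares.values())

print("total rows:", total)
print("class shares:", rounded_shares)
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

This is why the evaluation below is better judged by the per-class precision and recall than by the overall accuracy.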

In [10]:
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [12]:
# Define features (X) and target (y)
X = df.drop(columns=['Target'])
y = df['Target']

# Normalize numerical features
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [13]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)
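
Note that the scaler above is fitted on all rows before the split, so the test rows influence the minimum and maximum used for scaling. For a tree-based model this is unlikely to matter, since tree splits depend only on the ordering of feature values, but a stricter variant fits the scaler on the training split only. A sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and an imbalanced target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 180 + [1] * 20)

# Split first, keeping the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(y_tr.mean(), y_te.mean())
```

Scaled test values may fall slightly outside the range 0 to 1 in this variant, because the test rows did not contribute to the fitted minimum and maximum.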

In [14]:
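# Calculate the class weight ratio for XGBoost's scale_pos_weight.
# Label 0 (diabetic) is the majority "negative" class and
# label 1 (non-diabetic) is the minority "positive" class.
num_negative = (y_train == 0).sum()
num_positive = (y_train == 1).sum()
scale_pos_weight = num_negative / num_positive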
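
XGBoost's documentation suggests `sum(negative instances) / sum(positive instances)` as a typical value for `scale_pos_weight`. Because the positive class here (label 1, non-diabetic) is the minority, the ratio is well above 1. A worked check, assuming the stratified 80/20 split leaves the training set with the full class counts minus the test-set supports of 3,328 and 66 reported in the classification report:

```python
# Full-dataset counts minus the test-set supports give the training counts.
num_negative = 16641 - 3328  # class 0 (diabetic) rows in the training set
num_positive = 328 - 66      # class 1 (non-diabetic) rows in the training set

scale_pos_weight = num_negative / num_positive
print(num_negative, num_positive, round(scale_pos_weight, 4))
```

The ratio agrees with the `scale_pos_weight=50.81297709923664` shown in the fitted model's parameters.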

In [15]:
# Initialize and train the XGBoost classifier with scale_pos_weight
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', scale_pos_weight=scale_pos_weight, random_state=42)
model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=12,
              num_parallel_tree=1, random_state=42, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=50.81297709923664, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

In [16]:
# Make predictions
y_pred = model.predict(X_test)

In [17]:
# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Confusion Matrix:
 [[3325    3]
 [   1   65]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3328
           1       0.96      0.98      0.97        66

    accuracy                           1.00      3394
   macro avg       0.98      0.99      0.98      3394
weighted avg       1.00      1.00      1.00      3394
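
The per-class figures for class 1 can be re-derived from the confusion matrix, which is a useful sanity check given that overall accuracy is dominated by class 0. In scikit-learn's layout, rows are true labels and columns are predicted labels:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
cm = [[3325, 3],
      [1, 65]]

tp = cm[1][1]  # true class 1, predicted 1
fp = cm[0][1]  # true class 0, predicted 1
fn = cm[1][0]  # true class 1, predicted 0
tn = cm[0][0]  # true class 0, predicted 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

Only 4 of the 3,394 test rows are misclassified: 3 diabetic rows predicted as non-diabetic and 1 non-diabetic row predicted as diabetic.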

