# MetaboliQ AI
### ~ by Mavericks

This Jupyter Notebook contains the *Random Forest Classifier Model* for the third functionality of the MetaboliQ AI.  
Problem Statement: To predict the type of diabetes based on the medical reports of a patient.  
Developed for: 1. Doctors  
               2. Nurses  
               3. Trained Professionals  
               4. Laboratory Professionals  
Output: Predicts the type of diabetes.

## Importing the required libraries

For this model, we have used the sklearnex (Extension for Scikit-Learn) module by intel.  
Details: https://github.com/oneapi-src/oneAPI-samples

In [1]:
import pandas as pd
import numpy as np
from sklearnex import patch_sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [2]:
data = pd.read_csv("diabetes_dataset00.csv")
data.head()

Unnamed: 0,Target,Genetic Markers,Autoantibodies,Family History,Environmental Factors,Insulin Levels,Age,BMI,Physical Activity,Dietary Habits,...,Pulmonary Function,Cystic Fibrosis Diagnosis,Steroid Use History,Genetic Testing,Neurological Assessments,Liver Function Tests,Digestive Enzyme Levels,Urine Test,Birth Weight,Early Onset Symptoms
0,Steroid-Induced Diabetes,Positive,Negative,No,Present,40,44,38,High,Healthy,...,76,No,No,Positive,3,Normal,56,Ketones Present,2629,No
1,Neonatal Diabetes Mellitus (NDM),Positive,Negative,No,Present,13,1,17,High,Healthy,...,60,Yes,No,Negative,1,Normal,28,Glucose Present,1881,Yes
2,Prediabetic,Positive,Positive,Yes,Present,27,36,24,High,Unhealthy,...,80,Yes,No,Negative,1,Abnormal,55,Ketones Present,3622,Yes
3,Type 1 Diabetes,Negative,Positive,No,Present,8,7,16,Low,Unhealthy,...,89,Yes,No,Positive,2,Abnormal,60,Ketones Present,3542,No
4,Wolfram Syndrome,Negative,Negative,Yes,Present,17,10,17,High,Healthy,...,41,No,No,Positive,1,Normal,24,Protein Present,1770,No


In [3]:
data["Target"].value_counts()

Target
MODY                                          5553
Secondary Diabetes                            5479
Cystic Fibrosis-Related Diabetes (CFRD)       5464
Type 1 Diabetes                               5446
Neonatal Diabetes Mellitus (NDM)              5408
Wolcott-Rallison Syndrome                     5400
Type 2 Diabetes                               5397
Prediabetic                                   5376
Gestational Diabetes                          5344
Type 3c Diabetes (Pancreatogenic Diabetes)    5320
Wolfram Syndrome                              5315
Steroid-Induced Diabetes                      5275
LADA                                          5223
Name: count, dtype: int64

## Encoding
Our dataset mostly comprises of categorical features which make calculations and interpretations very difficult.  
Hence we use label encoding to map them to numerical features accordingly.

In [4]:
from sklearn.preprocessing import LabelEncoder
le_label = LabelEncoder()
data['Target']=le_label.fit_transform(data['Target'])
data["Target"].unique()

array([ 7,  4,  5,  8, 12,  2,  9, 11,  6, 10,  1,  0,  3])

In [5]:
data.head()

Unnamed: 0,Target,Genetic Markers,Autoantibodies,Family History,Environmental Factors,Insulin Levels,Age,BMI,Physical Activity,Dietary Habits,...,Pulmonary Function,Cystic Fibrosis Diagnosis,Steroid Use History,Genetic Testing,Neurological Assessments,Liver Function Tests,Digestive Enzyme Levels,Urine Test,Birth Weight,Early Onset Symptoms
0,7,Positive,Negative,No,Present,40,44,38,High,Healthy,...,76,No,No,Positive,3,Normal,56,Ketones Present,2629,No
1,4,Positive,Negative,No,Present,13,1,17,High,Healthy,...,60,Yes,No,Negative,1,Normal,28,Glucose Present,1881,Yes
2,5,Positive,Positive,Yes,Present,27,36,24,High,Unhealthy,...,80,Yes,No,Negative,1,Abnormal,55,Ketones Present,3622,Yes
3,8,Negative,Positive,No,Present,8,7,16,Low,Unhealthy,...,89,Yes,No,Positive,2,Abnormal,60,Ketones Present,3542,No
4,12,Negative,Negative,Yes,Present,17,10,17,High,Healthy,...,41,No,No,Positive,1,Normal,24,Protein Present,1770,No


In [6]:
data["Target"].value_counts()

Target
3     5553
6     5479
0     5464
8     5446
4     5408
11    5400
9     5397
5     5376
1     5344
10    5320
12    5315
7     5275
2     5223
Name: count, dtype: int64

In [7]:
data['Genetic Markers']=le_label.fit_transform(data['Genetic Markers'])
data["Genetic Markers"].unique()

array([1, 0])

In [8]:
data['Autoantibodies']=le_label.fit_transform(data['Autoantibodies'])
data['Family History']=le_label.fit_transform(data['Family History'])
data['Environmental Factors']=le_label.fit_transform(data['Environmental Factors'])
data['Physical Activity']=le_label.fit_transform(data['Physical Activity'])
data['Dietary Habits']=le_label.fit_transform(data['Dietary Habits'])
data['Ethnicity']=le_label.fit_transform(data['Ethnicity'])
data['Socioeconomic Factors']=le_label.fit_transform(data['Socioeconomic Factors'])
data['Smoking Status']=le_label.fit_transform(data['Smoking Status'])
data['Alcohol Consumption']=le_label.fit_transform(data['Alcohol Consumption'])
data['Glucose Tolerance Test']=le_label.fit_transform(data['Glucose Tolerance Test'])
data['History of PCOS']=le_label.fit_transform(data['History of PCOS'])
data['Previous Gestational Diabetes']=le_label.fit_transform(data['Previous Gestational Diabetes'])
data['Pregnancy History']=le_label.fit_transform(data['Pregnancy History'])
data['Cystic Fibrosis Diagnosis']=le_label.fit_transform(data['Cystic Fibrosis Diagnosis'])
data['Steroid Use History']=le_label.fit_transform(data['Steroid Use History'])
data['Genetic Testing']=le_label.fit_transform(data['Genetic Testing'])
data['Liver Function Tests']=le_label.fit_transform(data['Liver Function Tests'])
data['Urine Test']=le_label.fit_transform(data['Urine Test'])
data['Early Onset Symptoms']=le_label.fit_transform(data['Early Onset Symptoms'])

In [9]:
data.head()

Unnamed: 0,Target,Genetic Markers,Autoantibodies,Family History,Environmental Factors,Insulin Levels,Age,BMI,Physical Activity,Dietary Habits,...,Pulmonary Function,Cystic Fibrosis Diagnosis,Steroid Use History,Genetic Testing,Neurological Assessments,Liver Function Tests,Digestive Enzyme Levels,Urine Test,Birth Weight,Early Onset Symptoms
0,7,1,0,0,1,40,44,38,0,0,...,76,0,0,1,3,1,56,1,2629,0
1,4,1,0,0,1,13,1,17,0,0,...,60,1,0,0,1,1,28,0,1881,1
2,5,1,1,1,1,27,36,24,0,1,...,80,1,0,0,1,0,55,1,3622,1
3,8,0,1,0,1,8,7,16,1,1,...,89,1,0,1,2,0,60,1,3542,0
4,12,0,0,1,1,17,10,17,0,0,...,41,0,0,1,1,1,24,3,1770,0


In [10]:
data["Ethnicity"].value_counts()

Ethnicity
1    35018
0    34982
Name: count, dtype: int64

In [11]:
data.dtypes

Target                           int64
Genetic Markers                  int64
Autoantibodies                   int64
Family History                   int64
Environmental Factors            int64
Insulin Levels                   int64
Age                              int64
BMI                              int64
Physical Activity                int64
Dietary Habits                   int64
Blood Pressure                   int64
Cholesterol Levels               int64
Waist Circumference              int64
Blood Glucose Levels             int64
Ethnicity                        int64
Socioeconomic Factors            int64
Smoking Status                   int64
Alcohol Consumption              int64
Glucose Tolerance Test           int64
History of PCOS                  int64
Previous Gestational Diabetes    int64
Pregnancy History                int64
Weight Gain During Pregnancy     int64
Pancreatic Health                int64
Pulmonary Function               int64
Cystic Fibrosis Diagnosis

In [12]:
data = data.astype(float)

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 34 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Target                         70000 non-null  float64
 1   Genetic Markers                70000 non-null  float64
 2   Autoantibodies                 70000 non-null  float64
 3   Family History                 70000 non-null  float64
 4   Environmental Factors          70000 non-null  float64
 5   Insulin Levels                 70000 non-null  float64
 6   Age                            70000 non-null  float64
 7   BMI                            70000 non-null  float64
 8   Physical Activity              70000 non-null  float64
 9   Dietary Habits                 70000 non-null  float64
 10  Blood Pressure                 70000 non-null  float64
 11  Cholesterol Levels             70000 non-null  float64
 12  Waist Circumference            70000 non-null 

## Developing the Model

In [14]:
# Making the features(X) and target(y) variables.
X = data.drop("Target", axis=1)  # Features
y = data["Target"] #Target

In [15]:
# Standard Train-Test-Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [16]:
# Initialize Random Forest Classifier Model
rfc = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')

In [17]:
# Training the model
rfc.fit(X_train, y_train)

In [18]:
# Make predictions
y_pred = rfc.predict(X_test)
y_pred

array([2., 9., 8., ..., 4., 1., 2.])

In [19]:
# Evaluating the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))

Confusion Matrix:
 [[ 994   56   11   21    0    3    0    0    7    1    0    0    0]
 [   7  993   26   16    0   23    0    0    4    0    0    0    0]
 [   3   55  971    0    0   14    0    1    0    0    0    0    0]
 [   0    0    0  927    0    0    0    0  184    0    0    0    0]
 [   0    0    0    0 1082    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0 1075    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0  856   71    0   66  103    0    0]
 [   0    3    1    0    0    0   33  867    0   32  119    0    0]
 [   0    0    0    3    0    0    0    0 1086    0    0    0    0]
 [   0    0    1    0    0    0  188  104    0  763   23    0    0]
 [   0    0    0    0    0    0    0    0    0    0 1064    0    0]
 [   0    0    0    0    0    0    0    0    0    0    0  900  180]
 [   0    0    0    0    0    0    0    0    0    0    0   16 1047]]

Classification Report:
               precision    recall  f1-score   support

         0.0    

In [20]:
importances = rfc.feature_importances_
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)
print("\nTop Features:\n", feature_importances.head(10))


Top Features:
                          Feature  Importance
5                            Age    0.139449
12          Blood Glucose Levels    0.129539
9                 Blood Pressure    0.084363
21  Weight Gain During Pregnancy    0.084052
29       Digestive Enzyme Levels    0.080523
6                            BMI    0.077157
11           Waist Circumference    0.071506
4                 Insulin Levels    0.065140
10            Cholesterol Levels    0.058867
23            Pulmonary Function    0.051606


In [21]:
data.iloc[0]

Target                              7.0
Genetic Markers                     1.0
Autoantibodies                      0.0
Family History                      0.0
Environmental Factors               1.0
Insulin Levels                     40.0
Age                                44.0
BMI                                38.0
Physical Activity                   0.0
Dietary Habits                      0.0
Blood Pressure                    124.0
Cholesterol Levels                201.0
Waist Circumference                50.0
Blood Glucose Levels              168.0
Ethnicity                           1.0
Socioeconomic Factors               2.0
Smoking Status                      1.0
Alcohol Consumption                 0.0
Glucose Tolerance Test              1.0
History of PCOS                     0.0
Previous Gestational Diabetes       0.0
Pregnancy History                   1.0
Weight Gain During Pregnancy       18.0
Pancreatic Health                  36.0
Pulmonary Function                 76.0


## Testing the model on real-world data in interactive mode

In [22]:
# Example: Real-time user input
patient_data = {
    'Genetic Markers': 0.0,
    'Autoantibodies': 1.0,
    'Family History': 1.0,
    'Environmental Factors': 0.0,
    'Insulin Levels': 30.0,
    'Age': 20.0,
    'BMI': 18.0,
    'Physical Activity': 1.0,
    'Dietary Habits': 1.0,
    'Blood Pressure': 110.0,
    'Cholesterol Levels': 80.0,
    'Waist Circumference': 28.0,
    'Blood Glucose Levels': 80.0,
    'Ethnicity': 0.0,
    'Socioeconomic Factors': 1.0,
    'Smoking Status': 0.0,
    'Alcohol Consumption': 1.0,
    'Glucose Tolerance Test': 1.0,
    'History of PCOS': 0.0,
    'Previous Gestational Diabetes': 0.0,
    'Pregnancy History': 0.0,
    'Weight Gain During Pregnancy': 0.0,
    'Pancreatic Health': 20.0,
    'Pulmonary Function': 53.0,
    'Cystic Fibrosis Diagnosis': 1.0,
    'Steroid Use History': 1.0,
    'Genetic Testing': 0.0,
    'Neurological Assessments': 0.0,
    'Liver Function Tests': 0.0,
    'Digestive Enzyme Levels': 26.0,
    'Urine Test': 0.0,
    'Birth Weight': 1353.0,
    'Early Onset Symptoms': 1.0
}

# Convert the input dictionary to a DataFrame
patient_df = pd.DataFrame([patient_data])

# Make a prediction
prediction = rfc.predict(patient_df)

# Map prediction to diabetes type
diabetes_types = { 3: 'MODY', 6: 'Secondary Diabetes', 0: 'Cystic Fibrosis-Related Diabetes (CFRD)', 8: 'Type 1 Diabetes', 
                  4: 'Neonatal Diabetes Mellitus (NDM)', 11: 'Wolcott-Rallison Syndrome', 9: 'Type 2 Diabetes', 5: 'Prediabetic',
                  1: 'Gestational Diabetes', 10: 'Type 3c Diabetes (Pancreatogenic Diabetes)', 12: 'Wolfram Syndrome', 7: 'Steroid-Induced Diabetes',
                  2: 'LADA' }
predicted_diabetes_type = diabetes_types.get(prediction[0], 'Unknown')

# Display the result
print(f"The patient is classified as: {predicted_diabetes_type}.")


The patient is classified as: Wolcott-Rallison Syndrome.


In [23]:
data["Physical Activity"].value_counts()

Physical Activity
2.0    23427
1.0    23348
0.0    23225
Name: count, dtype: int64

In [24]:
data["Urine Test"].value_counts()

Urine Test
3.0    17628
2.0    17528
1.0    17422
0.0    17422
Name: count, dtype: int64

In [28]:
from joblib import dump
dump(rfc, 'random_forest_model.joblib')

['random_forest_model.joblib']