## About Dataset

This refined dataset is based on the "Diabetes Dataset" uploaded by Ahlam Rashid to Mendeley Data (https://data.mendeley.com/datasets/wj9rwkp9c2/1). The original dataset contains 1000 subjects divided into three classes: diabetic, non-diabetic, and predict-diabetic.

Of the 1000 subjects, 844 are diabetic, 103 are non-diabetic, and 53 are predict-diabetic, which is an extreme class imbalance. We found 174 duplicate subjects in the original dataset and removed them, leaving 826 subjects: 690 diabetic, 96 non-diabetic, and 40 predict-diabetic.
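The deduplication step described above can be sketched with pandas. This is only an illustration under stated assumptions: the `toy` frame and its values are made up, and treating a "duplicate subject" as a row identical in every column is an assumption, since the exact matching rule is not stated here.

```python
import pandas as pd

# Hypothetical toy records standing in for the original data; all values are made up.
toy = pd.DataFrame(
    {
        "AGE": [50, 50, 26, 45, 45, 33],
        "HbA1c": [4.9, 4.9, 4.9, 6.2, 6.2, 8.0],
        "BMI": [24.0, 24.0, 23.0, 21.0, 21.0, 30.0],
        "Class": [0, 0, 0, 1, 1, 2],
    }
)

# Count rows that repeat an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each record and drop the repeats.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Share of each class after deduplication, to quantify the imbalance.
class_share = deduped["Class"].value_counts(normalize=True).sort_index()

print("Duplicates removed:", n_duplicates)
print(class_share)
```

Applied to the counts reported above, the same share calculation gives 690 / 826, or roughly 84% diabetic subjects after deduplication.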

<img src="./table.bmp">

## Instructions

### Note:
1. Use `clf` as the variable name of the classifier.
2. Do not delete any cells.
3. Delete the line containing "NotImplementedError".


In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("train_multiclass_diabetes.csv")

print("Columns in dataset:", df.columns.tolist())
print("Class distribution:\n", df['Class'].value_counts())

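X = df.drop(columns=['Class'])
y = df['Class']

# Split first, then fit the scaler on the training data only,
# so that test-set statistics do not leak into preprocessing.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=60)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)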

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

target_names = ['non-diabetic', 'predict-diabetic', 'diabetic']
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=target_names))

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Columns in dataset: ['Gender', 'AGE', 'Urea', 'Cr', 'HbA1c', 'Chol', 'TG', 'HDL', 'LDL', 'VLDL', 'BMI', 'Class']
Class distribution:
 Class
2    102
0     77
1     32
Name: count, dtype: int64
Classification Report:
                   precision    recall  f1-score   support

    non-diabetic       0.94      1.00      0.97        16
predict-diabetic       1.00      1.00      1.00         6
        diabetic       1.00      0.95      0.98        21

        accuracy                           0.98        43
       macro avg       0.98      0.98      0.98        43
    weighted avg       0.98      0.98      0.98        43

Confusion Matrix:
 [[16  0  0]
 [ 0  6  0]
 [ 1  0 20]]
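
To connect the confusion matrix to the classification report, the per-class figures can be recomputed by hand. The sketch below copies the matrix printed above and assumes the scikit-learn convention that rows are true classes and columns are predicted classes, in label order 0, 1, 2.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[16, 0, 0],
               [0, 6, 0],
               [1, 0, 20]])

true_positives = np.diag(cm)

# Recall divides by row totals (all samples that truly belong to the class).
recall = true_positives / cm.sum(axis=1)

# Precision divides by column totals (all samples predicted as the class).
precision = true_positives / cm.sum(axis=0)

# Accuracy is the share of samples on the diagonal.
accuracy = true_positives.sum() / cm.sum()

print("recall   :", np.round(recall, 2))
print("precision:", np.round(precision, 2))
print("accuracy :", round(float(accuracy), 2))
```

The single off-diagonal entry, in the last row and first column, is one class 2 sample predicted as class 0. It lowers recall for class 2 to 20/21 and precision for class 0 to 16/17. With only 43 test samples, one error moves these scores by several points, so they are rough estimates.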
