
# Diabetes Prediction Project

1. Introduction
2. Data Loading
3. Exploratory Data Analysis (EDA)
4. Data Preprocessing
5. Modeling
6. Model Evaluation
7. Conclusions and Recommendations


## 1. Introduction

- Briefly describe the problem and its importance.
- Explain why you chose this dataset and project.


## Data Loading

In [None]:
import pandas as pd

df = pd.read_csv('diabetes_prediction_dataset.csv')


df.head()

## Data Overview

In [None]:
df.info()

In [None]:
print("\nMissing values per column:")
print(df.isnull().sum())

df.describe()

We found 1 missing value in the target variable 'diabetes'.  
Since the target variable is critical for supervised learning, we will drop this row from the dataset.


In [None]:
df = df.dropna(subset=['diabetes'])


The basic data overview is complete:
- No column other than the target variable has missing values.
- The target variable had one missing value, and that row has been dropped.
- The dataset is now ready for exploratory data analysis.


## EDA



In this section, I will:
- Visualize the distribution of key features.
- Explore the relationships between features and the target variable.
- Check for class imbalance in the target variable.
- Analyze feature correlations.


In [None]:
#Target Variable Distribution

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x = 'diabetes', data = df)
plt.title('Distribution of Diabetes')
plt.xlabel('Diabetes')
plt.ylabel('Count')
plt.show()


print(df['diabetes'].value_counts())



> Observation: Target Variable Imbalance

- The dataset shows a significant imbalance between non-diabetic (0) and diabetic (1) cases:
    - Non-diabetic: 75,549 cases (~91.5%)
    - Diabetic: 7,015 cases (~8.5%)
- This imbalance could affect model performance by biasing predictions toward the majority class.
- To address this, we will consider:
    - Using appropriate evaluation metrics (recall, precision, F1-score, ROC-AUC).
    - Potentially applying resampling techniques (e.g., oversampling, undersampling) during model training.
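
To make the imbalance concrete, the sketch below uses the class counts reported above to compute two quantities: the accuracy of a trivial model that always predicts "non-diabetic", and the "balanced" class weights implied by those counts. The weight formula `n_samples / (n_classes * class_count)` is the heuristic scikit-learn applies for `class_weight='balanced'`; the variable names are illustrative and not part of the notebook.

```python
# Class counts reported by value_counts() above.
counts = {0: 75_549, 1: 7_015}
n_samples = sum(counts.values())
n_classes = len(counts)

# Accuracy of a model that always predicts the majority class (0).
baseline_accuracy = max(counts.values()) / n_samples

# "Balanced" weights: n_samples / (n_classes * count_of_class).
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
for label, weight in balanced_weights.items():
    print(f"Weight for class {label}: {weight:.2f}")
```

Any model evaluated later should be judged against this baseline rather than against 50%, which is why recall, precision, and F1 for the diabetic class matter more than raw accuracy here.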




In [None]:
#Feature Distributions

numeric_features = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']

for feature in numeric_features:
    plt.figure(figsize=(6, 4))
    sns.histplot(df[feature], kde=True)
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.show()





>  Observations from Feature Distributions

- **Age** shows a relatively uniform distribution from 0 to 80 years, with a noticeable spike at 80. This might indicate grouped values for older patients.
- **BMI** has a suspiciously high peak around 40, suggesting potential data duplication or outliers. Further investigation is needed.
- **HbA1c_level** and **blood_glucose_level** both exhibit stepped distributions with repeated identical values. This could result from rounding or discretization in data collection.
- **Blood_glucose_level** also shows some high-value outliers (above 250), which might require outlier handling or transformation.
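
The BMI spike and the glucose outliers noted above can be quantified before deciding how to handle them. The following is a minimal sketch, assuming pandas is available; the helper name `summarize_spike_and_outliers` and the toy values are hypothetical, and the 1.5 × IQR fence is one common outlier convention rather than the method used in this notebook.

```python
import pandas as pd

def summarize_spike_and_outliers(frame, spike_col="bmi",
                                 outlier_col="blood_glucose_level"):
    """Return the most frequent value of spike_col, its share of rows,
    the upper IQR fence of outlier_col, and the count of values above it."""
    top_value = frame[spike_col].value_counts().idxmax()
    top_share = (frame[spike_col] == top_value).mean()

    q1, q3 = frame[outlier_col].quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    n_outliers = int((frame[outlier_col] > upper_fence).sum())
    return top_value, top_share, upper_fence, n_outliers

# Toy data for illustration only; in the notebook, pass `df` instead.
toy = pd.DataFrame({
    "bmi": [40.0, 40.0, 40.0, 22.1, 31.8, 25.0],
    "blood_glucose_level": [90, 100, 110, 120, 130, 300],
})
top_value, top_share, upper_fence, n_outliers = summarize_spike_and_outliers(toy)
print(f"Most frequent BMI: {top_value} ({top_share:.0%} of rows)")
print(f"Glucose values above {upper_fence}: {n_outliers}")
```

If a single BMI value accounts for a large share of rows, it may be an imputed or default value rather than a measurement, which would argue for flagging it instead of treating it as an outlier.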



In [None]:
# Boxplots by Target Variable

for feature in numeric_features:
  plt.figure(figsize=(6,4))
  sns.boxplot(x = 'diabetes', y = feature, data = df)
  plt.title(f'{feature} by Diabetes')
  plt.xlabel('Diabetes')
  plt.ylabel(feature)
  plt.show()



>  Observations from Boxplots by Diabetes

- **Age:** Diabetic patients tend to be older than non-diabetic patients, aligning with medical expectations.
- **BMI:** No significant difference in BMI between diabetic and non-diabetic patients, though many outliers are present.
- **HbA1c_level:** Diabetic patients have clearly higher HbA1c levels compared to non-diabetic patients, indicating it is a strong predictor.
- **Blood_glucose_level:** Diabetic patients also exhibit higher blood glucose levels, making this another important predictor.




In [None]:
#Correlation Heatmap

plt.figure(figsize = (12,8))
corr = df[numeric_features + ['diabetes']].corr()
sns.heatmap(corr, annot=True, cmap = 'coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()

## Analysis of Categorical Features



Although the correlation heatmap only includes numerical features, we will include all features in our modeling phase.  
Categorical features like gender, hypertension, and smoking_history may have important relationships with diabetes that are not captured by simple correlation analysis.


In [None]:
categorical_features = ['gender', 'hypertension', 'heart_disease', 'smoking_history']

for feature in categorical_features:
    plt.figure(figsize=(6, 4))
    sns.countplot(x=feature, hue='diabetes', data=df)
    plt.title(f'{feature} by Diabetes')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.show()


> Observations from Categorical Features

- **Gender:** Both male and female groups have diabetic and non-diabetic patients. Gender alone may not be a strong predictor, but it could still provide additional context.
- **Hypertension:** Patients with hypertension show a higher proportion of diabetes cases compared to those without hypertension.
- **Heart Disease:** Patients with heart disease also exhibit a higher proportion of diabetes cases, suggesting potential predictive value.
- **Smoking History:** All categories show diabetic cases, but categories like "never" and "No Info" dominate in count. This feature may benefit from one-hot encoding or category grouping during preprocessing.
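
Count plots show absolute counts, so statements about a "higher proportion" are easier to check with a row-normalized crosstab. The sketch below illustrates this on toy data, assuming pandas; in the notebook the same call would be applied to `df` for each entry in `categorical_features`.

```python
import pandas as pd

# Toy data for illustration only; in the notebook, use `df` directly.
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 1, 1, 1, 1],
    "diabetes":     [0, 0, 0, 1, 0, 0, 1, 1],
})

# Row-normalized crosstab: each row sums to 1, so column 1 is the
# share of diabetic patients within each hypertension group.
rates = pd.crosstab(toy["hypertension"], toy["diabetes"], normalize="index")
diabetes_rate = rates[1]
print(diabetes_rate)
```

Comparing these rates across categories separates a real difference in diabetes prevalence from a difference in group size.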


In [None]:
# Smoking history feature engineering
def simplify_smoking(status):
    if status in ['former', 'ever', 'not current']:
        return 'former'
    elif status == 'No Info':
        return 'unknown'
    else:
        return status

df['smoking_history_simplified'] = df['smoking_history'].apply(simplify_smoking)


print(df['smoking_history_simplified'].value_counts())

category_counts = df['smoking_history_simplified'].value_counts()
category_percentages = category_counts / len(df) * 100


print("\nCategory Percentages:")
print(category_percentages)

I calculated the distribution of the simplified smoking_history variable:

- The output above shows both the counts and the percentages of each category.

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x='smoking_history_simplified', hue='diabetes', data=df)
plt.title('Smoking History (Simplified) by Diabetes')
plt.xlabel('Smoking History (Simplified)')
plt.ylabel('Count')
plt.show()


## Data Preprocessing



> In this section, I will:
- Handle outliers if necessary.
- Encode categorical variables.
- Scale numerical features.
- Split the dataset into training and testing sets for model building.
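
One caveat on the order of these steps: fitting a scaler on the full dataset before splitting lets test-set minima and maxima influence the training features. A common alternative is to split first and fit the preprocessing on the training portion only. The sketch below shows that pattern on synthetic data, assuming scikit-learn's `ColumnTransformer`; the toy columns and values are illustrative and this is not the pipeline used in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the dataset (illustrative values only).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.uniform(0, 80, n),
    "bmi": rng.uniform(15, 45, n),
    "gender": rng.choice(["Female", "Male"], n),
    "diabetes": rng.integers(0, 2, n),
})

X = toy.drop(columns="diabetes")
y = toy["diabetes"]

# Split first, so the test set stays unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# sparse_threshold=0 forces a dense array, which is easier to inspect.
preprocess = ColumnTransformer(
    [
        ("num", MinMaxScaler(), ["age", "bmi"]),
        ("cat", OneHotEncoder(drop="first"), ["gender"]),
    ],
    sparse_threshold=0,
)

# Fit on the training data only, then reuse the fitted transform on test data.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print("Train shape:", X_train_prepared.shape)
print("Test shape:", X_test_prepared.shape)
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected and reflects what the model would see on new data.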



In [None]:

categorical_features = ['gender', 'hypertension', 'heart_disease', 'smoking_history_simplified']


df_encoded = pd.get_dummies(df, columns=categorical_features, drop_first=True)

print("Shape after encoding:", df_encoded.shape)


In [None]:
from sklearn.preprocessing import MinMaxScaler


numerical_features = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']

scaler = MinMaxScaler()
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])

df_encoded[numerical_features].head()


In [None]:
from sklearn.model_selection import train_test_split


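# Drop the target and the original string column 'smoking_history',
# which was replaced by 'smoking_history_simplified' and is not encoded.
X = df_encoded.drop(columns=['diabetes', 'smoking_history'], errors='ignore')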
y = df_encoded['diabetes']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

# The data is split into training and test sets so that model performance is measured on unseen data, which helps detect overfitting.

In [None]:

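# Remove the original string column; the modeling cell rebuilds X and the
# train/test split from this cleaned frame. errors='ignore' keeps the cell re-runnable.
df_encoded = df_encoded.drop(columns=['smoking_history'], errors='ignore')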


## Modeling

In this section, I will:
- Train multiple machine learning models (Logistic Regression, Decision Tree, Random Forest).
- Evaluate model performance using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
- Compare models and select the best one for further analysis.
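
The three models are evaluated with the same metrics, so the comparison can be expressed as one helper applied in a loop. The following is a minimal sketch under stated assumptions: the `evaluate` function is hypothetical, and `make_classification` generates synthetic imbalanced data in place of the notebook's `X_train`/`X_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return positive-class metrics on the test set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    return {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_prob),
    }

# Synthetic imbalanced data (roughly 90% / 10%), for illustration only.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}
results = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Collecting the scores in one dictionary makes it straightforward to build a comparison table with `pd.DataFrame(results).T`.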


In [None]:

X = df_encoded.drop('diabetes', axis=1)
y = df_encoded['diabetes']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2,
    random_state=42,
    stratify=y
)


from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)


In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import seaborn as sns
import matplotlib.pyplot as plt

# Predict on test data
y_pred = lr_model.predict(X_test)
y_pred_prob = lr_model.predict_proba(X_test)[:, 1]

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Print ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_pred_prob)
print("ROC-AUC Score:", roc_auc)

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f"ROC-AUC: {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression')
plt.legend()
plt.show()


Logistic Regression Results

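- **Accuracy:** 96%. Because about 91.5% of cases are non-diabetic, always predicting the majority class would already reach roughly 91.5%, so accuracy alone overstates performance.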
- **Precision:** 0.88 for diabetic class — good precision in identifying diabetics.
- **Recall:** 0.63 for diabetic class — relatively low recall means some diabetics are misclassified as healthy.
- **F1-Score:** 0.73 for diabetic class — indicates a trade-off between precision and recall.
- **ROC-AUC:** 0.96 — excellent discrimination between classes.

- **Recommendation:** Since recall is lower for the diabetic class, further analysis or advanced models may be needed to improve sensitivity.
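
One inexpensive way to trade precision for recall, without changing the model, is to lower the decision threshold applied to the predicted probabilities (the default for `predict` is 0.5). The sketch below illustrates the idea on synthetic data; the threshold 0.3 is an arbitrary example value, and `metrics_at` is a hypothetical helper. In the notebook, `y_pred_prob` and `y_test` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, for illustration only.
X, y = make_classification(
    n_samples=3000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

def metrics_at(threshold):
    """Precision and recall when predicting positive at prob >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    return precision, recall

scores = {threshold: metrics_at(threshold) for threshold in (0.5, 0.3)}
for threshold, (precision, recall) in scores.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision; the threshold itself should be chosen on a validation set rather than on the test set.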


## Decision Tree Classifier

In this section, I will train a Decision Tree Classifier to predict diabetes.  
Decision Trees can capture nonlinear relationships and handle categorical features well.  
We will evaluate its performance using metrics like precision, recall, F1-score, and ROC-AUC.


In [None]:
from sklearn.tree import DecisionTreeClassifier


dt_model = DecisionTreeClassifier(random_state=42)


dt_model.fit(X_train, y_train)


y_pred_dt = dt_model.predict(X_test)
y_pred_prob_dt = dt_model.predict_proba(X_test)[:, 1]


from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import seaborn as sns
import matplotlib.pyplot as plt

print("Classification Report:")
print(classification_report(y_test, y_pred_dt))

roc_auc_dt = roc_auc_score(y_test, y_pred_prob_dt)
print("ROC-AUC Score:", roc_auc_dt)

# Confusion Matrix
cm_dt = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm_dt, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# ROC Curve
fpr_dt, tpr_dt, thresholds_dt = roc_curve(y_test, y_pred_prob_dt)
plt.figure(figsize=(6, 4))
plt.plot(fpr_dt, tpr_dt, label=f"ROC-AUC: {roc_auc_dt:.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Decision Tree')
plt.legend()
plt.show()


Decision Tree Classifier Results

- **Accuracy:** 95% — slightly lower than Logistic Regression (96%).
- **Precision (diabetic class):** 0.71 — lower than Logistic Regression (0.88).
- **Recall (diabetic class):** 0.72 — higher than Logistic Regression (0.63).
- **F1-Score (diabetic class):** 0.72, marginally lower than Logistic Regression (0.73).
- **ROC-AUC:** 0.85 — lower than Logistic Regression (0.96).

- **Observation:** The Decision Tree model detects more diabetic cases (higher recall) but also makes more false positive predictions (lower precision). It may be a better choice if we prioritize finding diabetic cases over minimizing false alarms.
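
A plausible contributor to the Decision Tree's lower ROC-AUC, not verified on this dataset, is that an unpruned tree grows until its leaves are nearly pure, so `predict_proba` returns mostly 0 or 1 and offers few usable thresholds for ranking patients. The sketch below demonstrates the effect on synthetic data and compares it with a depth-limited tree; `max_depth=5` is an arbitrary example value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data with some label noise, for illustration only.
X, y = make_classification(
    n_samples=3000, n_features=8, weights=[0.9, 0.1], flip_y=0.05, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

summary = {}
for label, depth in (("full", None), ("max_depth=5", 5)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    probs = tree.predict_proba(X_test)[:, 1]
    summary[label] = {
        "distinct_probs": np.unique(probs),  # distinct predicted probabilities
        "roc_auc": roc_auc_score(y_test, probs),
    }

for label, info in summary.items():
    print(f"{label}: {len(info['distinct_probs'])} distinct probabilities, "
          f"ROC-AUC = {info['roc_auc']:.3f}")
```

If the same pattern appears on the diabetes data, tuning `max_depth` or `min_samples_leaf` would be a natural next experiment before ruling the Decision Tree out.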


## Random Forest Classifier

In this section, we will train a Random Forest Classifier to predict diabetes.  
Random Forests are ensemble models that can capture complex patterns and often perform better than single trees.  
We will evaluate its performance using metrics like precision, recall, F1-score, and ROC-AUC.


In [None]:
from sklearn.ensemble import RandomForestClassifier


rf_model = RandomForestClassifier(random_state=42)


rf_model.fit(X_train, y_train)


y_pred_rf = rf_model.predict(X_test)
y_pred_prob_rf = rf_model.predict_proba(X_test)[:, 1]
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import seaborn as sns
import matplotlib.pyplot as plt

print("Classification Report:")
print(classification_report(y_test, y_pred_rf))

roc_auc_rf = roc_auc_score(y_test, y_pred_prob_rf)
print("ROC-AUC Score:", roc_auc_rf)

# Confusion Matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# ROC Curve
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_pred_prob_rf)
plt.figure(figsize=(6, 4))
plt.plot(fpr_rf, tpr_rf, label=f"ROC-AUC: {roc_auc_rf:.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Random Forest')
plt.legend()
plt.show()


Random Forest Classifier Results

- **Accuracy:** 97% — highest among all models.
- **Precision (diabetic class):** 0.95 — highest, indicating very few false positives.
- **Recall (diabetic class):** 0.68 — slightly lower than Decision Tree but higher than Logistic Regression.
- **F1-Score (diabetic class):** 0.79 — highest overall.
- **ROC-AUC:** 0.96 — excellent discrimination between classes.

- **Observation:** Random Forest achieves the best balance between precision and recall while maintaining high ROC-AUC. This makes it the best-performing model among the three.
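
As a consistency check on the reported numbers, F1 can be recomputed from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The sketch below does this for the three models, using the rounded values quoted in the results sections; small differences from the reported F1 are expected because the inputs are rounded to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) for the diabetic class, as quoted above.
reported = {
    "Logistic Regression": (0.88, 0.63, 0.73),
    "Decision Tree": (0.71, 0.72, 0.72),
    "Random Forest": (0.95, 0.68, 0.79),
}

recomputed = {name: f1(p, r) for name, (p, r, _) in reported.items()}
for name, (p, r, f1_reported) in reported.items():
    print(f"{name}: recomputed F1 = {recomputed[name]:.3f}, reported = {f1_reported}")
```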


Feature Importance - Random Forest

In this section, we will analyze which features the Random Forest Classifier considers most important for predicting diabetes.  
This analysis helps us understand which variables contribute most to the model's decision-making process.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


importances = rf_model.feature_importances_

feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': importances
})


feature_importances = feature_importances.sort_values(by='Importance', ascending=False)


print(feature_importances)


plt.figure(figsize=(10, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'])
plt.gca().invert_yaxis()
plt.xlabel('Importance')
plt.title('Feature Importance - Random Forest')
plt.show()


Feature Importance - Random Forest

The most important features for predicting diabetes are:
- **HbA1c_level** (0.41) — a strong indicator of blood sugar control.
- **blood_glucose_level** (0.33) — directly related to diabetes.
- **bmi** (0.12) — an important factor in diabetes risk.
- **age** (0.10) — older patients are at higher risk.

Other features (hypertension, heart disease, gender, smoking history) contributed less to the model.
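
The values above are impurity-based importances, which are computed from the training data and can favor features with many distinct values. Permutation importance on the test set is a common cross-check. The following is a minimal sketch on synthetic data using scikit-learn's `permutation_importance`; in the notebook, `rf_model`, `X_test`, and `y_test` would replace the synthetic objects, and the choice of `scoring="roc_auc"` and `n_repeats=5` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test ROC-AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=42
)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for index in ranking:
    print(f"feature {index}: {result.importances_mean[index]:.4f} "
          f"+/- {result.importances_std[index]:.4f}")
```

If both methods rank HbA1c_level and blood_glucose_level at the top, the conclusion about the dominant predictors is more robust.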


# Summary


In this project, I aimed to build a machine learning model to predict diabetes based on medical and lifestyle features. I started with a thorough exploratory data analysis (EDA), including both numerical and categorical variables, and handled data preprocessing (one-hot encoding, scaling, train/test split).

We tested three different machine learning models:
- **Logistic Regression:** High ROC-AUC (0.96) but relatively low recall (0.63) for detecting diabetic cases.
- **Decision Tree:** Improved recall (0.72) but lower ROC-AUC (0.85), with more false positives.
- **Random Forest:** Best overall performance with ROC-AUC (0.96), precision (0.95), and F1-score (0.79) for diabetic class, balancing between precision and recall.

The feature importance analysis highlighted:
- **HbA1c_level** and **blood_glucose_level** as the most important predictors.
- **BMI** and **age** as moderately important.
- Other features like **hypertension**, **heart disease**, **gender**, and **smoking history** had minimal impact on the model.

**Conclusion:**  
- Random Forest is recommended as the best model for predicting diabetes in this dataset.
- HbA1c_level and blood_glucose_level are critical features for prediction and should be closely monitored in medical practice.

**Next Steps:**  
- Further improve recall for diabetic class, possibly with techniques like class weights or SMOTE.
- Validate the model on external datasets.
- Consider deploying the model in a clinical setting with interpretability tools such as SHAP.
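
As a starting point for the first next step, class weights can be enabled with a single argument. The sketch below compares diabetic-class recall with and without `class_weight='balanced'` on synthetic imbalanced data; it is illustrative only and makes no claim about the size of the effect on the diabetes dataset. SMOTE is not shown because it requires the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, for illustration only.
X, y = make_classification(
    n_samples=3000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

comparison = {}
for class_weight in (None, "balanced"):
    model = LogisticRegression(
        max_iter=1000, class_weight=class_weight, random_state=42
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    comparison[str(class_weight)] = {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }

for setting, scores in comparison.items():
    print(f"class_weight={setting}: precision={scores['precision']:.2f}, "
          f"recall={scores['recall']:.2f}")
```

The same `class_weight` argument is accepted by `DecisionTreeClassifier` and `RandomForestClassifier`, so the experiment can be repeated for all three models.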
