# Diabetes Prediction Model

This notebook explores a dataset of 100,000 records and builds machine learning models that predict diabetes from health metrics.

## 1. Introduction

Diabetes is a chronic disease that affects how your body turns food into energy. It occurs when your blood glucose (blood sugar) is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy.

Early detection of diabetes is crucial for effective management and prevention of complications. In this project, we will build a machine learning model to predict whether a person has diabetes based on various health metrics.

## 2. Import Libraries

First, let's import all the necessary libraries for our analysis.

In [None]:
# Import basic libraries for data manipulation and analysis
import numpy as np
import pandas as pd

# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Import libraries for machine learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score,
)

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

## 3. Load and Explore the Dataset

Let's load the dataset and take a look at its structure.

In [None]:
# Load the dataset
df = pd.read_csv('../data/diabetes_prediction_dataset.csv')

# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Check the shape of the dataset
print(f"Dataset shape: {df.shape}")

In [None]:
# Get information about the dataset
print("Dataset information:")
df.info()

In [None]:
# Get statistical summary of the dataset
print("Statistical summary of numerical features:")
df.describe()

In [None]:
# Check for missing values
print("Missing values in each column:")
df.isnull().sum()

In [None]:
# Check the distribution of the target variable
print("Distribution of diabetes (target variable):")
print(df['diabetes'].value_counts())
print(f"Percentage of diabetic patients: {df['diabetes'].mean() * 100:.2f}%")

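# Visualize the distribution
# Make sure the output folder exists before the first savefig call
import os
os.makedirs('../images', exist_ok=True)
plt.figure(figsize=(8, 6))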
sns.countplot(x='diabetes', data=df)
plt.title('Distribution of Diabetes')
plt.xlabel('Diabetes (0: No, 1: Yes)')
plt.ylabel('Count')
plt.savefig('../images/diabetes_distribution.png')
plt.show()
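
When the target is imbalanced, accuracy alone can look strong even for a model that never predicts the positive class. The sketch below illustrates this with a majority-class baseline on synthetic labels. The 9% positive rate and the names `X_demo` and `y_demo` are assumptions for illustration, not values taken from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, illustrative labels: 910 negatives and 90 positives (assumed ratio)
y_demo = np.array([0] * 910 + [1] * 90)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for this baseline

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline recall:   {baseline_recall:.2f}")
```

The baseline reaches 91% accuracy while finding none of the positive cases, which is why precision, recall, F1, and AUC are worth reading alongside accuracy.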

## 4. Data Cleaning and Preprocessing

Let's clean the data and prepare it for analysis.

In [None]:
# Check unique values in categorical columns
print("Unique values in 'gender':")
print(df['gender'].unique())

print("\nUnique values in 'smoking_history':")
print(df['smoking_history'].unique())

In [None]:
# Handle any missing values if they exist
# For numerical columns, fill with median
numerical_cols = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']
for col in numerical_cols:
    if df[col].isnull().sum() > 0:
        # Assign the result back: inplace=True on a column selection
        # may not modify df in recent pandas versions
        df[col] = df[col].fillna(df[col].median())

# For categorical columns, fill with mode
categorical_cols = ['gender', 'smoking_history']
for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mode()[0])
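
The cell above computes fill values from the full dataset. An alternative, sketched here on a toy frame, is to learn the median from the training rows only and reuse it on the test rows with scikit-learn's `SimpleImputer`. The frames `train_demo` and `test_demo` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with missing values (illustrative numbers only)
train_demo = pd.DataFrame({'bmi': [22.0, np.nan, 30.0, 26.0]})
test_demo = pd.DataFrame({'bmi': [np.nan, 40.0]})

# Learn the median from the training rows, then apply it to both splits
imputer = SimpleImputer(strategy='median')
train_filled = imputer.fit_transform(train_demo)
test_filled = imputer.transform(test_demo)

print("Learned median:", imputer.statistics_[0])
print("Filled test values:", test_filled.ravel())
```

The missing test value is filled with the training median (26.0), so no information from the test rows influences preprocessing.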

In [None]:
# Convert categorical variables to numerical using one-hot encoding
df_encoded = pd.get_dummies(df, columns=['gender', 'smoking_history'], drop_first=True)

# Display the first few rows of the encoded dataset
print("First 5 rows of the encoded dataset:")
df_encoded.head()
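
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The category values are assumptions for illustration and may differ from those in the dataset.

```python
import pandas as pd

# Tiny illustrative frame; the category values are assumed
demo = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female'],
    'smoking_history': ['never', 'current', 'former'],
})

# drop_first=True removes the first (alphabetically sorted) level of each column
demo_encoded = pd.get_dummies(demo, columns=['gender', 'smoking_history'], drop_first=True)
encoded_columns = demo_encoded.columns.tolist()
print(encoded_columns)
```

Each categorical column with k levels becomes k - 1 indicator columns; the dropped level ('Female' and 'current' in this toy example) acts as the reference category.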

## 5. Exploratory Data Analysis (EDA)

Let's explore the relationships between different features and the target variable.

In [None]:
# Distribution of age
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='age', hue='diabetes', kde=True, bins=30)
plt.title('Age Distribution by Diabetes Status')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('../images/age_distribution.png')
plt.show()

In [None]:
# Distribution of BMI
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='bmi', hue='diabetes', kde=True, bins=30)
plt.title('BMI Distribution by Diabetes Status')
plt.xlabel('BMI')
plt.ylabel('Count')
plt.savefig('../images/bmi_distribution.png')
plt.show()

In [None]:
# Distribution of HbA1c level
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='HbA1c_level', hue='diabetes', kde=True, bins=30)
plt.title('HbA1c Level Distribution by Diabetes Status')
plt.xlabel('HbA1c Level')
plt.ylabel('Count')
plt.savefig('../images/hba1c_distribution.png')
plt.show()

In [None]:
# Distribution of blood glucose level
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='blood_glucose_level', hue='diabetes', kde=True, bins=30)
plt.title('Blood Glucose Level Distribution by Diabetes Status')
plt.xlabel('Blood Glucose Level')
plt.ylabel('Count')
plt.savefig('../images/glucose_distribution.png')
plt.show()

In [None]:
# Correlation matrix
# Select only numerical columns for correlation
numerical_df = df[['age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', 'blood_glucose_level', 'diabetes']]

# Calculate correlation matrix
corr_matrix = numerical_df.corr()

# Plot correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Features')
plt.savefig('../images/correlation_matrix.png')
plt.show()

In [None]:
# Diabetes prevalence by gender
plt.figure(figsize=(8, 6))
sns.countplot(x='gender', hue='diabetes', data=df)
plt.title('Diabetes Prevalence by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.savefig('../images/diabetes_by_gender.png')
plt.show()

In [None]:
# Diabetes prevalence by smoking history
plt.figure(figsize=(12, 6))
sns.countplot(x='smoking_history', hue='diabetes', data=df)
plt.title('Diabetes Prevalence by Smoking History')
plt.xlabel('Smoking History')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.savefig('../images/diabetes_by_smoking.png')
plt.show()

In [None]:
# Diabetes prevalence by hypertension
plt.figure(figsize=(8, 6))
sns.countplot(x='hypertension', hue='diabetes', data=df)
plt.title('Diabetes Prevalence by Hypertension')
plt.xlabel('Hypertension (0: No, 1: Yes)')
plt.ylabel('Count')
plt.savefig('../images/diabetes_by_hypertension.png')
plt.show()

In [None]:
# Diabetes prevalence by heart disease
plt.figure(figsize=(8, 6))
sns.countplot(x='heart_disease', hue='diabetes', data=df)
plt.title('Diabetes Prevalence by Heart Disease')
plt.xlabel('Heart Disease (0: No, 1: Yes)')
plt.ylabel('Count')
plt.savefig('../images/diabetes_by_heart_disease.png')
plt.show()

In [None]:
# Pairplot for numerical features
# pairplot creates its own figure, so no plt.figure() call is needed
pair_grid = sns.pairplot(data=df, vars=['age', 'bmi', 'HbA1c_level', 'blood_glucose_level'], hue='diabetes')
pair_grid.savefig('../images/pairplot.png')
plt.show()

## 6. Feature Selection and Data Preparation

Let's prepare our data for modeling.

In [None]:
# Separate features and target variable
X = df_encoded.drop('diabetes', axis=1)
y = df_encoded['diabetes']

# Display the features
print("Features used for modeling:")
print(X.columns.tolist())

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
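
For an imbalanced target, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both splits. A small sketch on synthetic labels (the 10% positive rate is an assumed value):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an assumed 10% positive rate
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate in train: {train_rate:.3f}")
print(f"Positive rate in test:  {test_rate:.3f}")
```

With 100,000 rows a plain random split will usually land close to the overall ratio anyway; stratification simply removes that source of variation.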

In [None]:
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
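
`cross_val_score` is imported in Section 2. One way to use it without leaking information from validation folds is to wrap the scaler and the model in a pipeline, so the scaler is refit on each training fold. The sketch below uses synthetic data from `make_classification` as a stand-in; the sample size, feature count, and class weights are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

# The scaler is fit inside each training fold, never on the validation fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='roc_auc')

print(f"ROC AUC per fold: {cv_scores.round(3)}")
print(f"Mean ROC AUC: {cv_scores.mean():.3f}")
```

The spread across folds gives a rough sense of how much a single train/test split could vary.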

## 7. Model Building and Evaluation

Let's build and evaluate different machine learning models.

### 7.1 Logistic Regression

In [None]:
# Build a logistic regression model
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test_scaled)

# Evaluate the model
print("Logistic Regression Model Evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_lr):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_lr):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred_lr):.4f}")

# Display confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('../images/lr_confusion_matrix.png')
plt.show()

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))

In [None]:
# ROC curve for logistic regression
y_pred_proba_lr = lr_model.predict_proba(X_test_scaled)[:, 1]
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_proba_lr)
auc_lr = roc_auc_score(y_test, y_pred_proba_lr)

plt.figure(figsize=(8, 6))
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {auc_lr:.4f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression')
plt.legend(loc='lower right')
plt.savefig('../images/lr_roc_curve.png')
plt.show()
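
The ROC curve describes performance across all decision thresholds, whereas `predict` effectively applies a single cutoff of 0.5 to the predicted probability. The sketch below shows how lowering the threshold trades precision for recall. The labels and probabilities are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.70, 0.90])

threshold_results = {}
for threshold in (0.5, 0.3):
    # Convert probabilities to class labels at the chosen cutoff
    y_hat = (y_proba >= threshold).astype(int)
    p = precision_score(y_true, y_hat)
    r = recall_score(y_true, y_hat)
    threshold_results[threshold] = (p, r)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

In a screening setting, where a missed case may cost more than a false alarm, a threshold below 0.5 can be a reasonable choice; the right value depends on those costs.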

### 7.2 Decision Tree

In [None]:
# Build a decision tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_dt = dt_model.predict(X_test_scaled)

# Evaluate the model
print("Decision Tree Model Evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_dt):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_dt):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred_dt):.4f}")

# Display confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('../images/dt_confusion_matrix.png')
plt.show()

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_dt))

In [None]:
# ROC curve for decision tree
y_pred_proba_dt = dt_model.predict_proba(X_test_scaled)[:, 1]
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_proba_dt)
auc_dt = roc_auc_score(y_test, y_pred_proba_dt)

plt.figure(figsize=(8, 6))
plt.plot(fpr_dt, tpr_dt, label=f'Decision Tree (AUC = {auc_dt:.4f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Decision Tree')
plt.legend(loc='lower right')
plt.savefig('../images/dt_roc_curve.png')
plt.show()

### 7.3 Random Forest

In [None]:
# Build a random forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test_scaled)

# Evaluate the model
print("Random Forest Model Evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_rf):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred_rf):.4f}")

# Display confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('../images/rf_confusion_matrix.png')
plt.show()

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))

In [None]:
# ROC curve for random forest
y_pred_proba_rf = rf_model.predict_proba(X_test_scaled)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_proba_rf)
auc_rf = roc_auc_score(y_test, y_pred_proba_rf)

plt.figure(figsize=(8, 6))
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.4f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Random Forest')
plt.legend(loc='lower right')
plt.savefig('../images/rf_roc_curve.png')
plt.show()

### 7.4 Compare Models

In [None]:
# Compare ROC curves of all models
plt.figure(figsize=(10, 8))
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {auc_lr:.4f})')
plt.plot(fpr_dt, tpr_dt, label=f'Decision Tree (AUC = {auc_dt:.4f})')
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.4f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend(loc='lower right')
plt.savefig('../images/model_comparison_roc.png')
plt.show()

In [None]:
# Compare model performance metrics
models = ['Logistic Regression', 'Decision Tree', 'Random Forest']
accuracy = [accuracy_score(y_test, y_pred_lr), accuracy_score(y_test, y_pred_dt), accuracy_score(y_test, y_pred_rf)]
precision = [precision_score(y_test, y_pred_lr), precision_score(y_test, y_pred_dt), precision_score(y_test, y_pred_rf)]
recall = [recall_score(y_test, y_pred_lr), recall_score(y_test, y_pred_dt), recall_score(y_test, y_pred_rf)]
f1 = [f1_score(y_test, y_pred_lr), f1_score(y_test, y_pred_dt), f1_score(y_test, y_pred_rf)]
auc = [auc_lr, auc_dt, auc_rf]

# Create a DataFrame for comparison
comparison_df = pd.DataFrame({
    'Model': models,
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1,
    'AUC': auc
})

print("Model Performance Comparison:")
comparison_df
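
The evaluation cells in Sections 7.1 to 7.3 repeat the same metric calls for each model. One possible way to reduce that repetition is a small helper that fits a model and returns its metrics as a dictionary. `evaluate_model` is a hypothetical name introduced here, and the data is a synthetic stand-in generated with `make_classification`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_model(name, model, X_tr, y_tr, X_te, y_te):
    """Fit `model` and return its test-set metrics (hypothetical helper)."""
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    y_score = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    return {
        'Model': name,
        'Accuracy': accuracy_score(y_te, y_hat),
        'Precision': precision_score(y_te, y_hat, zero_division=0),
        'Recall': recall_score(y_te, y_hat, zero_division=0),
        'F1 Score': f1_score(y_te, y_hat, zero_division=0),
        'AUC': roc_auc_score(y_te, y_score),
    }


# Synthetic stand-in data, not the diabetes dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}
demo_comparison = pd.DataFrame(
    [evaluate_model(name, model, X_tr, y_tr, X_te, y_te)
     for name, model in candidates.items()]
)
print(demo_comparison.round(3))
```

Adding another model then only requires a new entry in `candidates`.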

In [None]:
# Visualize model comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'AUC']
comparison_data = comparison_df[metrics].values

plt.figure(figsize=(12, 8))
x = np.arange(len(models))
width = 0.15
multiplier = 0

for attribute, measurement in zip(metrics, comparison_data.T):
    offset = width * multiplier
    rects = plt.bar(x + offset, measurement, width, label=attribute)
    plt.bar_label(rects, fmt='%.2f', padding=3)
    multiplier += 1

plt.xlabel('Models')
plt.ylabel('Scores')
plt.title('Model Performance Comparison')
plt.xticks(x + width * 2, models)
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), ncol=5)
plt.ylim(0, 1.2)
plt.savefig('../images/model_comparison_metrics.png')
plt.show()

## 8. Feature Importance Analysis

In [None]:
# Get feature importance from Random Forest model
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
})

# Sort by importance
feature_importance = feature_importance.sort_values('Importance', ascending=False).reset_index(drop=True)

print("Feature Importance from Random Forest:")
feature_importance

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
plt.title('Top 10 Feature Importance from Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.savefig('../images/feature_importance.png')
plt.show()
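
`feature_importances_` from a Random Forest is based on impurity reduction measured on the training data. As a cross-check, scikit-learn also provides `permutation_importance`, which measures how much the held-out score drops when one feature is shuffled. The sketch below compares the two on synthetic data; the feature names and dataset parameters are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 2 informative features out of 5
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
feature_names = [f'feature_{i}' for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and record the mean drop in accuracy
result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=42)

importance_table = pd.DataFrame({
    'Feature': feature_names,
    'Impurity': rf.feature_importances_,
    'Permutation': result.importances_mean,
}).sort_values('Permutation', ascending=False).reset_index(drop=True)
print(importance_table.round(3))
```

If both measures rank the same features at the top, that adds some confidence to the ranking; large disagreements are worth investigating.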

## 9. Model Interpretation and Insights

Based on our analysis, we can draw the following insights:

1. **Model Performance**: The Random Forest model performed the best among the three models we built, with the highest accuracy, precision, recall, F1 score, and AUC.

2. **Key Predictors**: The most important features for predicting diabetes are:
   - Blood glucose level
   - HbA1c level
   - Age
   - BMI

3. **Medical Relevance**: This aligns with clinical practice: blood glucose and HbA1c are the measurements used to diagnose diabetes, so it is expected that they dominate the model.

4. **Risk Factors**: Age and BMI also rank among the top predictors, which is consistent with their status as established risk factors in the medical literature.

5. **Gender and Lifestyle Factors**: Gender and smoking history contribute less to the model's predictions than the physiological measurements. Note that feature importance reflects predictive value in this model, not a causal effect on diabetes risk.

## 10. Conclusion and Recommendations

### Conclusion

In this project, we built and compared three machine learning models to predict diabetes based on various health metrics. The Random Forest model performed the best, achieving high accuracy and AUC scores. The most important predictors were blood glucose level, HbA1c level, age, and BMI.

### Recommendations

1. **Regular Monitoring**: Individuals should regularly monitor their blood glucose and HbA1c levels, especially if they are in high-risk categories (older age, high BMI).

2. **Lifestyle Changes**: Maintaining a healthy BMI through diet and exercise can significantly reduce the risk of developing diabetes.

3. **Early Intervention**: Early detection and intervention can help manage diabetes effectively and prevent complications.

4. **Model Deployment**: This model could be integrated into healthcare systems to help identify high-risk individuals who may benefit from preventive interventions.

5. **Further Research**: Future work could explore more complex models or incorporate additional features such as dietary habits, physical activity levels, and family history for even more accurate predictions.

## 11. Save the Best Model

In [None]:
# Import joblib for saving the model
import joblib

# Save the Random Forest model (best performing model)
joblib.dump(rf_model, '../src/diabetes_prediction_model.pkl')

# Save the scaler for preprocessing new data
joblib.dump(scaler, '../src/scaler.pkl')

print("Model and scaler saved successfully!")
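
To use the saved model on new data, the new rows need the same columns, in the same order, as the training features. The sketch below shows one possible approach: store the column list alongside the model and scaler, then align new records with `reindex`. The toy training frame, the `bundle` dictionary, and the temporary file path are assumptions for illustration, not part of the saved artifacts above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Toy training frame (illustrative values only)
train = pd.DataFrame({
    'age': [25, 60, 45, 70, 33, 52],
    'bmi': [22.0, 31.5, 27.3, 35.1, 24.8, 29.9],
    'gender': ['Female', 'Male', 'Female', 'Male', 'Male', 'Female'],
    'diabetes': [0, 1, 0, 1, 0, 1],
})
X_train_demo = pd.get_dummies(
    train.drop(columns='diabetes'), columns=['gender'], drop_first=True, dtype=int
)
feature_columns = X_train_demo.columns.tolist()

demo_scaler = StandardScaler().fit(X_train_demo)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42)
demo_model.fit(demo_scaler.transform(X_train_demo), train['diabetes'])

# Save model, scaler, and column order together, then load them back
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, 'bundle.pkl')
    joblib.dump(
        {'model': demo_model, 'scaler': demo_scaler, 'columns': feature_columns},
        bundle_path,
    )
    bundle = joblib.load(bundle_path)

# A new record: encode it, then align to the training columns
new_record = pd.DataFrame({'age': [58], 'bmi': [30.2], 'gender': ['Female']})
X_new = pd.get_dummies(new_record, columns=['gender'], dtype=int)
X_new = X_new.reindex(columns=bundle['columns'], fill_value=0)

prediction = bundle['model'].predict(bundle['scaler'].transform(X_new))
print("Aligned columns:", X_new.columns.tolist())
print("Predicted class:", int(prediction[0]))
```

`reindex` adds indicator columns that are missing from the new record (filled with 0) and drops any that were not seen during training, so the scaler and model receive the layout they were fit on.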