<a href="https://colab.research.google.com/github/umarzaib1/Early-Detection-of-Diabetes-from-Its-symptoms-using-Machine-Learning-/blob/main/EDA_Of_Diabetes_Dataset_and_Machine_Learning_Model_for_Early_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Details

**Project Title:** EDA of a Diabetes Dataset and Early Detection of Diabetes from Its Symptoms Using Machine Learning

**Project Goal:** To perform data analysis on the provided dataset and develop machine learning models for early detection of diabetes from its symptoms. The final goal is to identify a suitable model for potential deployment in an Android application for general users.

**Role:** Final-semester bachelor's student, aiming to showcase skills in data analysis, machine learning model development, and evaluation.

**Step-by-Step Process:**

1.  **Project Setup and Data Loading:**
    *   Set up the Colab environment, import necessary libraries, and load the `diabetes.csv` dataset.
    *   Add initial markdown cells to introduce the project and the dataset.

2.  **Exploratory Data Analysis (EDA):**
    *   Perform a thorough EDA to understand the dataset's structure, features, and target variable.
    *   Analyze descriptive statistics (mean, median, standard deviation, etc.).
    *   Visualize the distribution of individual features (histograms, box plots).
    *   Explore the relationships between features and the target variable (scatter plots, bar plots, correlation matrices).
    *   Identify missing values, outliers, and potential data inconsistencies.
    *   Document findings and insights using markdown cells.

3.  **Data Preprocessing and Feature Engineering:**
    *   Handle missing values (e.g., imputation).
    *   Address outliers if necessary.
    *   Encode categorical features (if any).
    *   Scale or normalize numerical features if required by the chosen models.
    *   Consider creating new features that might improve model performance.
    *   Split the data into training and testing sets.

4.  **Model Selection and Training:**
    *   Select appropriate machine learning models for binary classification (e.g., Logistic Regression, Support Vector Machines, Tree-based models like Random Forest and Gradient Boosting).
    *   Justify the choice of models based on the data characteristics and project goals.
    *   Train the selected models on the training data.

5.  **Model Evaluation:**
    *   Evaluate the performance of the trained models using relevant metrics (e.g., Accuracy, Precision, Recall, F1-score, ROC AUC).
    *   Use techniques like cross-validation for more robust evaluation.
    *   Analyze confusion matrices to understand model performance in terms of true positives, true negatives, false positives, and false negatives.

6.  **Hyperparameter Tuning:**
    *   Tune the hyperparameters of the selected models to optimize their performance.
    *   Use techniques like GridSearchCV or RandomizedSearchCV.

7.  **Final Model Selection and Interpretation:**
    *   Select the best-performing model based on the evaluation metrics and the project's goals (considering the impact of false positives and false negatives for a medical application).
    *   Interpret the chosen model to understand which features are most important for prediction.

8.  **Model Export and Preparation for Deployment:**
    *   Save the trained and tuned final model (e.g., using `joblib` or `pickle`).
    *   Document the model's requirements (input features, data types, scaling/preprocessing steps needed before prediction).

9.  **Documentation and Presentation:**
    *   Organize the notebook logically with clear headings and explanations in markdown cells.
    *   Summarize the EDA findings, model performance, and conclusions.
    *   Prepare a presentation summarizing the project, methodology, results, and the chosen model's suitability for the Android application.

10. **Considerations for Android Application:**
    *   Discuss the technical requirements for integrating the model into an Android app (e.g., using TensorFlow Lite, ONNX Runtime, or a cloud-based API).
    *   Mention potential challenges and future work related to deployment.

11. **Finish task:** Review the entire project, ensure all requirements are met, and prepare for final submission and presentation.

### Step 1: Project Setup and Data Loading

This step involves importing the necessary libraries and loading the dataset into a pandas DataFrame for further analysis.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import joblib

#### Loading the Dataset

In [None]:
# Load the dataset
data = pd.read_csv('diabetes.csv')

# Display the first few rows and the dataset information
display(data.head())
data.info()  # info() prints its summary and returns None, so display() is not needed

### Step 2: Exploratory Data Analysis (EDA)

In this step, we will explore the dataset to understand its structure, features, and the distribution of the target variable. We will also check for missing values and outliers.

In [None]:
# Display descriptive statistics
display(data.describe())

# Check for missing values
display(data.isnull().sum())

# Display the distribution of the target variable
display(data['class'].value_counts())

# Visualize the distribution of the target variable
sns.countplot(x='class', data=data)
plt.title('Distribution of the Target Variable (class)')
plt.show()
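
The plan for this step also calls for relating individual features to the target. A minimal sketch of one way to do that with a cross-tabulation follows. It uses a small made-up DataFrame in place of `diabetes.csv`, and the `Yes`/`No` and `Positive`/`Negative` values are assumptions about how the columns are coded.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({
    "Polyuria": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "class": ["Positive", "Positive", "Negative", "Negative", "Positive", "Positive"],
})

# Share of each class within each symptom group (every row sums to 1).
polyuria_vs_class = pd.crosstab(toy["Polyuria"], toy["class"], normalize="index")
print(polyuria_vs_class)
```

On the real data, the same call with `data` in place of `toy` would show how strongly each symptom is associated with the target.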

### Step 3: Data Preprocessing and Feature Engineering

In this step, we label-encode the categorical features and the target variable, then split the data into training and testing sets. No imputation, outlier handling, or scaling is applied in the code below; `Age` is used as-is.

In [None]:
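# Label-encode each categorical feature; Age is already numeric and is used unchanged.
# The single encoder is refit per column, so afterwards `enc` only holds the mapping for `class`.
enc=preprocessing.LabelEncoder()
age=data["Age"]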
polyuria=list(enc.fit_transform(data["Polyuria"]))
polydipsia=list(enc.fit_transform(data["Polydipsia"]))
sdn_wt_ls=list(enc.fit_transform(data["sudden weight loss"]))
weakness=list(enc.fit_transform(data["weakness"]))
polyphagia=list(enc.fit_transform(data["Polyphagia"]))
gntl_trsh=list(enc.fit_transform(data["Genital thrush"]))
visual_blur=list(enc.fit_transform(data["visual blurring"]))
itching=list(enc.fit_transform(data["Itching"]))
irritability=list(enc.fit_transform(data["Irritability"]))
dlyd_hlng=list(enc.fit_transform(data["delayed healing"]))
partial_paresis=list(enc.fit_transform(data["partial paresis"]))
msl_stfns=list(enc.fit_transform(data["muscle stiffness"]))
alopecia=list(enc.fit_transform(data["Alopecia"]))
obesity=list(enc.fit_transform(data["Obesity"]))
cls=list(enc.fit_transform(data["class"]))
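
Because the encoder above is refit for every column, the label-to-integer mapping of each feature is not kept, yet that mapping is needed later to encode new inputs the same way. A possible sketch that records it, assuming the columns hold strings such as `Yes`/`No` (the toy values and the `mappings` dictionary are illustrative, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for two columns of the dataset; the values are assumed.
toy = pd.DataFrame({
    "Polyuria": ["Yes", "No", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

# Fit one encoder per column and keep its label -> integer mapping.
mappings = {}
encoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    col_encoder = LabelEncoder()
    encoded[col] = col_encoder.fit_transform(toy[col])
    # classes_ is sorted, and a label's position is its integer code.
    mappings[col] = {str(label): code for code, label in enumerate(col_encoder.classes_)}

print(mappings)
```

`LabelEncoder` assigns codes in sorted label order, so under these assumptions `No` maps to 0 and `Yes` maps to 1.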

#### Split into Training and Test Sets


In [None]:
x=np.array(list(zip(age,polyuria,polydipsia,sdn_wt_ls,weakness,
                    polyphagia,gntl_trsh,visual_blur,itching,irritability,
                    dlyd_hlng,partial_paresis,msl_stfns,alopecia,obesity)))
y=np.array(list(cls))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (416, 15)
Testing set shape: (104, 15)


### Step 4: Model Selection and Training

In this step, we will select appropriate machine learning models for binary classification and train them on the training data. We will choose models commonly used for such tasks.

In [None]:
# Initialize the models
log_reg = LogisticRegression(random_state=42)
rf_clf = RandomForestClassifier(random_state=42)
gb_clf = GradientBoostingClassifier(random_state=42)

# Train the models
print("Training Logistic Regression...")
# All features in X_train are numeric after label encoding (note: 'Gender' is not among the 15 features in x)
log_reg.fit(X_train, y_train)

print("Training Random Forest Classifier...")
rf_clf.fit(X_train, y_train)

print("Training Gradient Boosting Classifier...")
gb_clf.fit(X_train, y_train)

print("Models training complete.")

Training Logistic Regression...
Training Random Forest Classifier...
Training Gradient Boosting Classifier...
Models training complete.


### Step 5: Model Evaluation

In this step, we will evaluate the performance of the trained models using relevant metrics to understand how well they generalize to unseen data.

In [None]:
# Evaluate Logistic Regression
print("Logistic Regression Evaluation:")
y_pred_lr = log_reg.predict(X_test)
print(classification_report(y_test, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("ROC AUC Score:", roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1]))
print("-" * 30)

# Evaluate Random Forest Classifier
print("Random Forest Classifier Evaluation:")
y_pred_rf = rf_clf.predict(X_test)
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("ROC AUC Score:", roc_auc_score(y_test, rf_clf.predict_proba(X_test)[:, 1]))
print("-" * 30)

# Evaluate Gradient Boosting Classifier
print("Gradient Boosting Classifier Evaluation:")
y_pred_gb = gb_clf.predict(X_test)
print(classification_report(y_test, y_pred_gb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gb))
print("ROC AUC Score:", roc_auc_score(y_test, gb_clf.predict_proba(X_test)[:, 1]))
print("-" * 30)

Logistic Regression Evaluation:
              precision    recall  f1-score   support

           0       0.85      1.00      0.92        40
           1       1.00      0.89      0.94        64

    accuracy                           0.93       104
   macro avg       0.93      0.95      0.93       104
weighted avg       0.94      0.93      0.93       104

Confusion Matrix:
 [[40  0]
 [ 7 57]]
ROC AUC Score: 0.98984375
------------------------------
Random Forest Classifier Evaluation:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        40
           1       1.00      1.00      1.00        64

    accuracy                           1.00       104
   macro avg       1.00      1.00      1.00       104
weighted avg       1.00      1.00      1.00       104

Confusion Matrix:
 [[40  0]
 [ 0 64]]
ROC AUC Score: 1.0
------------------------------
Gradient Boosting Classifier Evaluation:
              precision    recall  f1-score   support

### Step 6: Hyperparameter Tuning

In this step, we tune the Random Forest with `GridSearchCV`, using 5-fold cross-validation and ROC AUC as the scoring metric. The same approach applies to the other models; `RandomizedSearchCV` is an alternative when the parameter grid is large.

In [None]:
# Hyperparameter tuning for Random Forest Classifier as an example
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search_rf = GridSearchCV(estimator=rf_clf, param_grid=param_grid_rf, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search_rf.fit(X_train, y_train)

print("Best parameters for Random Forest:", grid_search_rf.best_params_)
print("Best ROC AUC score for Random Forest:", grid_search_rf.best_score_)

# You would repeat this process for other models (Logistic Regression, Gradient Boosting)

Best parameters for Random Forest: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best ROC AUC score for Random Forest: 0.9964507918552036


### Step 7: Final Model Selection and Interpretation

In this step, we will select the best-performing model based on the evaluation metrics and the project's goals (considering the impact of false positives and false negatives for a medical application). We will then interpret the chosen model to understand which features are most important for prediction.

Based on the evaluation metrics from Step 5 and the hyperparameter tuning results from Step 6, we can compare the performance of the Logistic Regression, Random Forest, and Gradient Boosting models.

**Comparison of Models:**

*   **Logistic Regression:** Accuracy of 0.93 and ROC AUC of 0.990 on the 104-sample test set. The confusion matrix shows no false positives but 7 false negatives, so recall for class 1 is 0.89.
*   **Random Forest:** Classifies all 104 test samples correctly (ROC AUC 1.0) with default hyperparameters. The grid search in Step 6 reports a 5-fold cross-validated ROC AUC of 0.996 on the training set, with `max_depth=10` and `n_estimators=100` as the best settings.
*   **Gradient Boosting:** The evaluation output for this model is cut off in Step 5, so its metrics are not compared here.

**Final Model Selection:**

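Considering the project's goal of early diabetes detection for a potential Android application, it's important to balance precision and recall, and a missed diabetic case (false negative) is arguably the costlier error in a screening setting. On that basis, the Random Forest model with tuned hyperparameters appears to offer the best balance of metrics for this application: it produces no false negatives on the test set, whereas Logistic Regression misses 7 of 64 class-1 cases. A perfect score on 104 samples should be confirmed on additional data before deployment.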
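
To make the precision and recall trade-off concrete, the sketch below recomputes the headline metrics from the Logistic Regression confusion matrix printed in Step 5. It assumes class 1 denotes the diabetic class.

```python
import numpy as np

# Logistic Regression confusion matrix from Step 5
# (rows: actual class 0/1, columns: predicted class 0/1).
cm = np.array([[40, 0],
               [7, 57]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # share of class-1 cases that are detected
precision = tp / (tp + fp)       # share of class-1 predictions that are correct
accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct

print(f"recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
```

The 7 false negatives are what pull recall down to 0.89 even though precision is 1.00.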

**Model Interpretation (Example for Random Forest - requires fitting the best model):**

To understand which features are most important for the chosen model (e.g., the tuned Random Forest), we can look at feature importances.
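
A minimal sketch of reading feature importances from a fitted Random Forest follows. It trains a stand-in model on synthetic data, and the five feature names are only a subset borrowed from Step 3 for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative subset of feature names; the data itself is synthetic.
feature_names = ["Age", "Polyuria", "Polydipsia", "sudden weight loss", "weakness"]
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Impurity-based importances; they are non-negative and sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook itself, `model` would be `grid_search_rf.best_estimator_` and `feature_names` would list all 15 features in the order used to build `x`.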

### Step 8: Model Export and Preparation for Deployment

In this step, we will save the trained and tuned final model (e.g., using `joblib` or `pickle`). We will also document the model's requirements for deployment.

In [None]:
# Save the selected model. Here the tuned Random Forest from Step 6 is used;
# if another model is chosen after tuning all candidates, assign it to best_model instead.
best_model = grid_search_rf.best_estimator_
joblib.dump(best_model, 'diabetes_detection_model.pkl')

print("Trained model saved as 'diabetes_detection_model.pkl'")

Trained model saved as 'diabetes_detection_model.pkl'


**Documentation for Deployment:**

For deployment in an Android application, it's important to document:

*   **Input Features:** The exact names and order of the features the model expects.
*   **Data Types:** The data types of each input feature.
*   **Preprocessing Steps:** Any preprocessing steps applied to the data before training (e.g., scaling, imputation, encoding) that need to be applied to new data before making predictions.
*   **Model Output:** What the model's output represents (e.g., probability of having diabetes, the predicted class).
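
One way to keep this documentation next to the model is a small JSON file saved alongside it. The sketch below is a hypothetical illustration: the stand-in model is trained on random data, and the file names and metadata keys are assumptions. Only the feature order is taken from Step 3.

```python
import json
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature order taken from the array built in Step 3.
FEATURES = ["Age", "Polyuria", "Polydipsia", "sudden weight loss", "weakness",
            "Polyphagia", "Genital thrush", "visual blurring", "Itching",
            "Irritability", "delayed healing", "partial paresis",
            "muscle stiffness", "Alopecia", "Obesity"]

# Stand-in model trained on random data, for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(50, len(FEATURES)))
y_demo = rng.integers(0, 2, size=50)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "demo_model.pkl")
meta_path = os.path.join(out_dir, "demo_model_meta.json")

# Save the model and a description of the inputs it expects.
joblib.dump(demo_model, model_path)
with open(meta_path, "w") as f:
    json.dump({"features": FEATURES, "output": "predicted class (0 or 1)"}, f, indent=2)

# Reload both files and check that predictions survive the round trip.
reloaded = joblib.load(model_path)
with open(meta_path) as f:
    meta = json.load(f)
same = bool((reloaded.predict(X_demo) == demo_model.predict(X_demo)).all())
print(len(meta["features"]), same)
```

Any client that calls the model can then read the feature order from the metadata file instead of hard-coding it.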

### Step 9: Documentation and Presentation



### Summary of EDA Findings

Based on the exploratory data analysis:

*   **Dataset Structure:** The dataset contains 520 entries (416 training and 104 test rows after the split in Step 3). The models use 15 features: `Age` plus 14 symptom indicators such as `Polyuria` and `Polydipsia`. Apart from `Age`, the features are label-encoded in Step 3.
*   **Missing Values:** Missing values are checked with `data.isnull().sum()` in Step 2; no imputation is applied in Step 3.
*   **Target Variable Distribution:** The stratified test split contains 40 class-0 and 64 class-1 cases, so about 62% of the records belong to class 1. This indicates a moderate class imbalance.
*   **Descriptive Statistics:** The descriptive statistics (`data.describe()`) summarize the central tendency, spread, and range of the numerical columns. In the feature set used here, `Age` is the only feature used without encoding.
*   **Feature Distributions:** The count plot in Step 2 shows the class distribution. Histograms and box plots of individual features are part of the EDA plan but are not yet included in the notebook.
*   **Relationships with Target Variable:** Correlation analysis and feature-versus-target plots are also planned but not yet included, so further analysis is needed to understand how individual features relate to the target variable.