# Diabetes Prediction Using Machine Learning

## Project Overview
This project, authored by **Roy Njuguna**, focuses on predicting the likelihood of diabetes in patients based on diagnostic health measurements. The dataset used in this project originates from the National Institute of Diabetes and Digestive and Kidney Diseases and contains medical information from female patients of Pima Indian heritage, all of whom are at least 21 years old.

The goal of this project is to develop a machine learning model that predicts whether a patient has diabetes from key health indicators. The model is trained and evaluated on diagnostic features such as glucose level, blood pressure, and body mass index (BMI).

## Dataset Information
The dataset includes several critical medical parameters:

- **Pregnancies**: Number of times the patient has been pregnant.
- **Glucose**: Plasma glucose concentration after 2 hours in an oral glucose tolerance test.
- **Blood Pressure**: Diastolic blood pressure (in mm Hg).
- **Skin Thickness**: Triceps skin fold thickness (in mm).
- **Insulin**: 2-hour serum insulin (in mu U/ml).
- **BMI**: Body mass index (calculated as weight in kg/(height in m)^2).
- **Diabetes Pedigree Function**: A score that represents the likelihood of diabetes based on family history.
- **Age**: Age of the patient (in years).
- **Outcome**: The class label indicating whether the patient has diabetes (1) or not (0).

## Objective
The primary objective of this project is to build a machine learning model that can accurately classify patients as diabetic or non-diabetic, helping in early diagnosis based on the medical data provided.

## Data Source
The dataset was provided by Vincent Sigillito of Johns Hopkins University and was originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases.

---


This project focuses on developing a machine learning model for diabetes prediction using various health metrics. To interact with the model and test its functionality, please visit the following link:

[Try My Diabetes Prediction Web App](https://huggingface.co/spaces/roy123njuguna/diabetes_prediction)


In [None]:
!pip install gradio

Import the dependencies

In [None]:
# Import necessary libraries

import numpy as np  # Used for numerical computations and handling arrays
import pandas as pd  # Used for data manipulation, analysis, and handling DataFrames

# Import libraries from scikit-learn (sklearn) for preprocessing, model training, and evaluation
from sklearn.preprocessing import StandardScaler  # Standardizes features by removing the mean and scaling to unit variance
from sklearn.model_selection import train_test_split  # Used to split the dataset into training and testing sets
from sklearn import svm  # Support Vector Machine (SVM) algorithm for classification tasks
from sklearn.metrics import accuracy_score  # To calculate the accuracy of the machine learning model

# Import additional libraries
import joblib  # Used for saving and loading the trained machine learning models
import gradio as gr  # Gradio is used to create an easy-to-use user interface for model predictions


## Data Collection and Analysis

In this section, we focus on the data collection process and conduct an initial analysis of the dataset used for diabetes prediction.

### Data Collection
The dataset used in this project is sourced from the [Pima Indians Diabetes Database](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database). It contains medical information from female patients of Pima Indian heritage, including various health metrics that contribute to diabetes prediction.

### Dataset Features
The dataset consists of the following features:

- **Pregnancies**: Number of times the individual has been pregnant
- **Glucose**: Plasma glucose concentration after 2 hours in an oral glucose tolerance test
- **Blood Pressure**: Diastolic blood pressure (mm Hg)
- **Skin Thickness**: Triceps skin fold thickness (mm)
- **Insulin**: 2-Hour serum insulin (mu U/ml)
- **BMI**: Body mass index (weight in kg/(height in m)^2)
- **Diabetes Pedigree Function**: A function that scores the likelihood of diabetes based on family history
- **Age**: Age of the individual (in years)
- **Outcome**: Target variable indicating the presence (1) or absence (0) of diabetes

### Preliminary Analysis
To better understand the dataset, we perform a preliminary analysis by checking for missing values, data distribution, and relationships between features. This analysis helps identify patterns and potential issues in the data that may impact model performance.


In [None]:
# Loading the dataset into a pandas DataFrame
diabetes_dataset = pd.read_csv('/content/diabetes.csv')

In [None]:
# Printing the first five rows of the dataset to understand its structure
diabetes_dataset.head()

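| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |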


In [None]:
# Getting some basic statistics of the dataset (like mean, min, max values, etc.)
diabetes_dataset.describe()


| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
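The summary above reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. A value of 0 is not physiologically plausible for these measurements, so zeros in these columns most likely stand for missing readings that `isnull()` alone would not catch. The following is a minimal sketch of how both kinds of gaps could be counted. It uses the first five rows from the `head()` output as sample data, and the `zero_as_missing` column list is an assumption rather than part of the original notebook.

```python
import numpy as np
import pandas as pd

# The first five rows of the dataset, copied from the head() output
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a value of 0 is assumed to mean "not measured"
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Explicit missing values (NaN) per column
null_counts = sample.isnull().sum()

# Zeros per column, which describe() treats as ordinary values
zero_counts = (sample[zero_as_missing] == 0).sum()

# Optionally mark the zeros as NaN so a later imputation step can handle them
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

print(zero_counts)
```

On the full dataset, the same two counts can be obtained by replacing `sample` with `diabetes_dataset`.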


In [None]:
# Checking for the number of rows and columns in the dataset
diabetes_dataset.shape

(768, 9)

In [None]:
# Check the distribution of the target variable 'Outcome'
# 'Outcome' is the target variable, where 1 indicates the patient has diabetes, and 0 indicates they do not
# This will give us an idea of whether the dataset is balanced or imbalanced
diabetes_dataset['Outcome'].value_counts()


| Outcome | count |
| --- | --- |
| 0 | 500 |
| 1 | 268 |


0 represents non-diabetic and 1 represents diabetic. With 500 non-diabetic and 268 diabetic patients, the dataset is moderately imbalanced.
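The class counts also give a useful reference point for the accuracy scores reported later: a model that always predicts the majority class would already be right for every non-diabetic patient. A small sketch of that arithmetic, using only the counts from the output above:

```python
# Class counts taken from the value_counts() output
counts = {0: 500, 1: 268}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Class proportions: {proportions}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")  # 0.651
```

Any trained model should be judged against this baseline of roughly 65% rather than against 50%.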

In [None]:
# Calculate the mean of each feature grouped by 'Outcome'
# This provides insight into how each feature varies between diabetic (1) and non-diabetic (0) patients
diabetes_dataset.groupby('Outcome').mean()


| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


Separating features and labels

In [None]:
# Separate the dataset into features (X) and target variable (Y)
# X contains all the input features used for prediction (i.e., the dataset without the 'Outcome' column)
# Y contains the target variable 'Outcome', which represents whether a patient has diabetes (1) or not (0)
X = diabetes_dataset.drop('Outcome', axis=1)  # Dropping the 'Outcome' column from the dataset
Y = diabetes_dataset['Outcome']  # Storing the 'Outcome' column as the target variable

# Print the features (X) and the target variable (Y) to verify the separation
print(X)
print(Y)


     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1                       0.351   31  


Data standardization

In [None]:
# Initialize the StandardScaler for feature scaling
# StandardScaler standardizes features by removing the mean and scaling them to unit variance
# This puts all features on a comparable scale, so features with large numeric ranges do not dominate the model
scaler = StandardScaler()

In [None]:
# Apply the StandardScaler to the features (X)
# The fit_transform method first computes the mean and standard deviation, then scales the data accordingly
# This results in standardized data where each feature has a mean of 0 and a standard deviation of 1
standardized_data = scaler.fit_transform(X)

# Print the standardized feature data to verify the transformation
print(standardized_data)


[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
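As a worked check of what `StandardScaler` computes, the sketch below reproduces the standardized Glucose value of the first row from the summary statistics shown earlier. Each value is transformed as z = (x − mean) / std. One detail matters here: `describe()` reports the sample standard deviation (dividing by n − 1), while `StandardScaler` uses the population standard deviation (dividing by n), so the reported value is converted first.

```python
import math

# Values from the describe() output for the Glucose column
n = 768
mean = 120.894531
sample_std = 31.972618  # pandas describe() divides by n - 1

# StandardScaler divides by n instead, so convert to the population standard deviation
population_std = sample_std * math.sqrt((n - 1) / n)

# Glucose value of the first row in the dataset
glucose_first_row = 148
z = (glucose_first_row - mean) / population_std

print(round(z, 4))  # 0.8483
```

The result agrees with the second entry of the first row in the printed array (0.84832379).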


In [None]:
# Assign the standardized data back to X
# This ensures that the features now have uniform scales, which is essential for many machine learning algorithms
X = standardized_data

# Print the standardized features (X) and target variable (Y) to verify the updates
print(X)
print(Y)


[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


In [None]:
# Split the data into training and testing sets
# We use 80% of the data for training and 20% for testing (test_size=0.2)
# The random_state parameter ensures reproducibility of the split
# The stratify parameter ensures that the proportion of classes in Y is maintained in both the training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2, stratify=Y)
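Note that in the cells above the scaler is fitted on all 768 rows before the split, so the means and standard deviations also reflect the test rows. This is a mild form of data leakage. One common alternative, shown here only as a sketch and not as the notebook's method, is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows alone. The data below is synthetic (`make_classification`) and stands in for the real features; the names `X_demo`, `y_demo`, and `model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data: 768 rows, 8 features (hypothetical values)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2)

# Split first, so the test rows stay unseen while the scaler is fitted
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then trains the SVM
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

# score() applies the already fitted scaler to the test rows before predicting
test_accuracy = model.score(X_te, y_te)
print(X_tr.shape, X_te.shape)  # (614, 8) (154, 8)
```

With 768 rows and `test_size=0.2`, the split yields 614 training rows and 154 test rows, which is also the split size used in this notebook.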


Training the model

In [None]:
# Initialize the Support Vector Classifier (SVC) with a linear kernel
# The SVC model uses the 'linear' kernel, which is suitable for linearly separable data
# The linear kernel works by finding the best hyperplane to separate the two classes (diabetic vs. non-diabetic)
classifier = svm.SVC(kernel='linear')
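To make the idea of a separating hyperplane concrete, the sketch below trains a linear SVC on a tiny hypothetical two-feature dataset and recomputes its decision score by hand as w · x + b. The data and variable names are illustrative assumptions; only `coef_`, `intercept_`, and `decision_function` are actual scikit-learn attributes and methods.

```python
import numpy as np
from sklearn import svm

# Tiny hypothetical dataset: two features, two well-separated classes
X_demo = np.array([
    [-2.0, -1.0], [-1.0, -1.5], [-1.5, -0.5],
    [1.0, 1.5], [2.0, 1.0], [1.5, 0.5],
])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel="linear").fit(X_demo, y_demo)

# For a linear kernel, the learned hyperplane is exposed as coef_ (w) and intercept_ (b)
w, b = clf.coef_[0], clf.intercept_[0]

# Signed score w . x + b: positive scores map to class 1, negative scores to class 0
manual_scores = X_demo @ w + b
library_scores = clf.decision_function(X_demo)
manual_labels = (manual_scores > 0).astype(int)

print(manual_labels)
```

The size of the score indicates how far a point lies from the hyperplane, which can serve as a rough confidence signal alongside the predicted label.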


In [None]:
# Train the SVM classifier using the training data
# The classifier learns the relationship between the features (X_train) and the target variable (Y_train)
# The fit method trains the model to find the optimal hyperplane that separates the classes
classifier.fit(X_train, Y_train)


In [None]:
# Save the trained SVM classifier to a file using joblib
# This allows us to reuse the model later without retraining, facilitating deployment and inference
# The model is saved as 'diabetes_prediction_model.joblib'
joblib.dump(classifier, 'diabetes_prediction_model.joblib')


['diabetes_prediction_model.joblib']
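The saved file contains only the classifier. Because the model was trained on standardized features, any application that loads it also needs the fitted scaler to transform new inputs in the same way. One possible approach, sketched below under assumptions, is to store both objects together. The bundle file name, the dictionary keys, and the synthetic stand-in data are all hypothetical; in the notebook, the existing `scaler` and `classifier` would be saved instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; in the notebook, use the fitted `scaler` and `classifier`
rng = np.random.default_rng(2)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(60, 8))
y_demo = (X_demo[:, 1] > 50.0).astype(int)

scaler = StandardScaler().fit(X_demo)
classifier = svm.SVC(kernel="linear").fit(scaler.transform(X_demo), y_demo)

# Store both objects in one file so they cannot get out of sync (file name is an assumption)
path = os.path.join(tempfile.mkdtemp(), "diabetes_model_bundle.joblib")
joblib.dump({"scaler": scaler, "classifier": classifier}, path)

# Later, for example inside the web app: load the bundle and predict on raw inputs
bundle = joblib.load(path)
new_row = X_demo[:1]
restored_prediction = bundle["classifier"].predict(bundle["scaler"].transform(new_row))

print(restored_prediction)
```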

## Model Evaluation

In this section, we will evaluate the performance of our machine learning model using the accuracy score as the primary metric.

### Accuracy Score
The accuracy score is a key performance metric that measures the proportion of correct predictions made by the model out of the total predictions. It is calculated using the formula:

\[
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
\]

This metric provides a straightforward indication of how well the model performs on both the training and test datasets.
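The formula can be written out directly in a few lines. The sketch below uses hypothetical labels for eight patients; the helper name `accuracy` is illustrative, and in the notebook `sklearn.metrics.accuracy_score` performs the same calculation.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for eight patients (1 = diabetic, 0 = non-diabetic)
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 6 correct out of 8 -> 0.75
```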

### Evaluation Process
We will compute the accuracy score on both the training and test datasets. The test score shows how well the model generalizes to unseen data. A training score that is much higher than the test score suggests overfitting, while low scores on both suggest underfitting or the need for additional feature engineering.

### Interpretation of Results
By evaluating the accuracy score, we can determine the effectiveness of our diabetes prediction model. If the accuracy is satisfactory, we may proceed with deploying the model; if not, we may explore further enhancements to improve performance.



In [None]:
# Make predictions on the training data using the trained classifier
X_train_prediction = classifier.predict(X_train)

# Calculate the accuracy score of the model on the training data
# The accuracy score compares the predicted values (X_train_prediction) with the actual values (Y_train)
# This metric indicates how well the model has learned from the training data
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # Signature is accuracy_score(y_true, y_pred)

# Print the accuracy score of the training data
print('Accuracy score of training data: ', training_data_accuracy)


Accuracy score of training data:  0.7866449511400652


In [None]:
# Make predictions on the test data using the trained classifier
X_test_prediction = classifier.predict(X_test)

# Calculate the accuracy score of the model on the test data
# The accuracy score compares the predicted values (X_test_prediction) with the actual values (Y_test)
# This metric helps evaluate how well the model generalizes to unseen data
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # Signature is accuracy_score(y_true, y_pred)

# Print the accuracy score of the test data
print('Accuracy score of test data: ', test_data_accuracy)


Accuracy score of test data:  0.7727272727272727
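Accuracy alone does not show which class the errors fall on, and with roughly two non-diabetic patients for every diabetic one, missed diabetic cases can hide behind a reasonable overall score. A confusion matrix and per-class report make this visible. The sketch below uses hypothetical labels; in the notebook, `Y_test` and `X_test_prediction` would take their place.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels; in the notebook these would be Y_test and X_test_prediction
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Recall for the diabetic class: share of actual diabetic cases the model found
recall_diabetic = tp / (tp + fn)

print(cm)
print(classification_report(y_true, y_pred, target_names=["non-diabetic", "diabetic"]))
```

In this made-up example the accuracy is 70%, yet only half of the diabetic cases are detected, which is the kind of gap the accuracy score by itself would not reveal.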


## Making a Predictive System

In this section, we will outline the process of creating a predictive system for diabetes diagnosis using the trained machine learning model.

### Overview of the Predictive System
The predictive system allows users to input specific health metrics and receive a prediction regarding the likelihood of diabetes. This system can assist healthcare providers and individuals in early diagnosis and intervention.

### Components of the Predictive System
1. **User Input**: The system will require input data representing an individual's health metrics, such as glucose levels, blood pressure, BMI, and other relevant features.

2. **Data Preprocessing**: Before making predictions, the input data must be standardized to ensure it matches the scale of the training data. This involves:
   - Converting input data into a suitable format (e.g., NumPy array).
   - Reshaping the array to the correct dimensions.
   - Applying the same scaling method used during training (e.g., StandardScaler).

3. **Prediction**: Using the trained machine learning model, the system will predict whether the individual has diabetes based on the standardized input data. The model outputs either a positive or negative diagnosis.

4. **Output**: The system will present the prediction result to the user in a clear and concise manner, indicating whether the individual is likely to have diabetes or not.

### Implementation
To implement the predictive system, we will utilize libraries such as Gradio for building a user-friendly interface that allows easy interaction with the model. This will enable users to enter their health metrics and receive immediate feedback on their diabetes risk.
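A Gradio interface for this could look roughly like the sketch below. It is not the code behind the deployed web app, which is not shown in this notebook. The function names `predict_diabetes` and `build_interface`, the field labels, and the synthetic stand-in model are assumptions made so the example runs on its own; in practice, the fitted `scaler` and `classifier` from the earlier cells would be used.

```python
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler

FEATURE_NAMES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

# Hypothetical stand-ins so the sketch runs on its own; in the notebook,
# use the fitted `scaler` and `classifier` from the earlier cells instead.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(60, len(FEATURE_NAMES)))
y_demo = (X_demo[:, 1] > 50.0).astype(int)
scaler = StandardScaler().fit(X_demo)
classifier = svm.SVC(kernel="linear").fit(scaler.transform(X_demo), y_demo)


def predict_diabetes(pregnancies, glucose, blood_pressure, skin_thickness,
                     insulin, bmi, pedigree, age):
    """Scale one row of raw feature values and return a readable prediction."""
    features = np.array(
        [[pregnancies, glucose, blood_pressure, skin_thickness,
          insulin, bmi, pedigree, age]],
        dtype=float,
    )
    label = classifier.predict(scaler.transform(features))[0]
    if label == 1:
        return "Individual has diabetes"
    return "Individual does not have diabetes"


def build_interface():
    """Create the Gradio UI; Gradio is imported here so the rest runs without it."""
    import gradio as gr

    return gr.Interface(
        fn=predict_diabetes,
        inputs=[gr.Number(label=name) for name in FEATURE_NAMES],
        outputs=gr.Textbox(label="Prediction"),
        title="Diabetes Prediction",
    )


# To start the app locally: build_interface().launch()
```

Keeping the prediction logic in a plain function separate from the interface makes it possible to test the model path without starting the web server.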

### Conclusion
The predictive system serves as a valuable tool in promoting awareness and early detection of diabetes, potentially leading to better health outcomes through timely interventions.


In [None]:
# Input data representing a new individual's health metrics
# The values follow the training column order: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
input_data = (11, 143, 94, 33, 146, 36.6, 0.254, 51)

# Convert the input data to a NumPy array for processing
input_data_as_array = np.array(input_data)

# Reshape the array to ensure it has the correct dimensions for prediction
# Reshaping to (1, -1) converts the array into a 2D array with one sample
input_data_reshaped = input_data_as_array.reshape(1, -1)

# Standardize the input data using the previously fitted scaler
# This ensures that the input features are on the same scale as the training data
std_data = scaler.transform(input_data_reshaped)

# Use the trained classifier to make a prediction based on the standardized input data
prediction = classifier.predict(std_data)

# Output the prediction result
if prediction[0] == 0:  # predict() returns an array, so read its single element
    print('Individual does not have diabetes')
else:
    print('Individual has diabetes')


## Test the Web App

To experience the functionality of the diabetes prediction model, you can test the web application created for this project. The web app allows users to input health metrics and receive predictions regarding the likelihood of diabetes.

### Access the Web App
Follow the link below to access and test the web app for the diabetes prediction model:

[Diabetes Prediction Web App](https://huggingface.co/spaces/roy123njuguna/diabetes_prediction)

### Instructions for Use
1. Click on the link above to open the web app.
2. Enter the required health metrics in the provided input fields.
3. Submit the data to receive a prediction on whether the individual is likely to have diabetes.

### Feedback
I welcome any feedback on your experience with the web app; it will help me improve its functionality and user interface.
