# **Project-2**

***Project Title:*** Predicting Diabetes

***Project Description:*** In this project, you will build a machine learning model to predict whether a person has diabetes or not based on their health metrics such as BMI, blood pressure, glucose levels, etc. The data set includes information on individuals' health metrics, including whether they have diabetes or not.

***Dataset Details:*** The data set contains 750 records of female patients aged 21 years or older. It has eight features (e.g., age, BMI, blood pressure, insulin level, etc.) and one target variable that indicates whether the person has diabetes or not.

***Datasets Location:*** Canvas -> Modules -> Week 13 -> Datasets -> **"patients.csv"**.

***Tasks:*** 

1) *Data Exploration and Preprocessing:* You will explore the data set, handle missing values, perform feature engineering, and preprocess the data to get it ready for model building.

2) *Model Building:* You will train and evaluate several machine learning models on the preprocessed data set, including logistic regression, decision trees, and support vector machines.

3) *Model Evaluation:* You will evaluate the models' performance using several metrics such as accuracy, precision, recall, F1-score, and ROC curve analysis. You will also compare the models' performance and select the best-performing one.

4) *Deployment:* Once you have selected the best-performing model, you will deploy it and make predictions on new, unseen data.

This project will give you hands-on experience with supervised classification, data preprocessing, and model evaluation. It also has real-world applications in healthcare, where early detection of diabetes can help in the timely management of the disease.


In [4]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive

drive.mount('/content/drive')

diabetes_df = pd.read_csv("drive/My Drive/Machine Learning/Project 2/patients.csv")


Mounted at /content/drive


Task 1: Data Exploration and Preprocessing

In [14]:
# first five rows of the dataset
print(diabetes_df.head())

   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \
0     0.644576  0.866954       0.110259       0.883801 -0.705231  0.214926   
1    -0.842907 -1.109289      -0.230279       0.504234 -0.705231 -0.684454   
2     1.239570  1.964866      -0.343792      -1.330341 -0.705231 -1.108447   
3    -0.842907 -0.983813      -0.230279       0.124667  0.106027 -0.491729   
4    -1.140404  0.521895      -1.705944       0.883801  0.744677  1.435513   

   DiabetesPedigreeFunction       Age  Outcome  
0                  0.461673  1.430763        1  
1                 -0.368227 -0.184414        0  
2                  0.596983 -0.099404        1  
3                 -0.921494 -1.034507        0  
4                  5.456109 -0.014395        1  


In [15]:
# shape of the dataset
print(diabetes_df.shape)

(750, 9)


In [16]:
# Getting the datatype and the number of non-null values in each column
print(diabetes_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750 entries, 0 to 749
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               750 non-null    float64
 1   Glucose                   750 non-null    float64
 2   BloodPressure             750 non-null    float64
 3   SkinThickness             750 non-null    float64
 4   Insulin                   750 non-null    float64
 5   BMI                       750 non-null    float64
 6   DiabetesPedigreeFunction  750 non-null    float64
 7   Age                       750 non-null    float64
 8   Outcome                   750 non-null    int64  
dtypes: float64(8), int64(1)
memory usage: 52.9 KB
None


In [17]:
# summary statistics of the numerical columns
print(diabetes_df.describe())

        Pregnancies       Glucose  BloodPressure  SkinThickness       Insulin  \
count  7.500000e+02  7.500000e+02   7.500000e+02   7.500000e+02  7.500000e+02   
mean  -4.736952e-17 -8.526513e-17   5.980401e-17  -4.973799e-17 -9.592327e-17   
std    1.000667e+00  1.000667e+00   1.000667e+00   1.000667e+00  1.000667e+00   
min   -1.140404e+00 -3.775649e+00  -3.976198e+00  -1.330341e+00 -7.052307e-01   
25%   -8.429073e-01 -6.701241e-01  -3.437921e-01  -1.330341e+00 -7.052307e-01   
50%   -2.479139e-01 -1.368522e-01   1.102587e-01   1.246667e-01 -3.556993e-01   
75%    6.445762e-01  6.160022e-01   5.643094e-01   7.414633e-01  4.167220e-01   
max    3.917040e+00  2.466769e+00   2.948076e+00   4.932517e+00  6.596092e+00   

                BMI  DiabetesPedigreeFunction           Age     Outcome  
count  7.500000e+02              7.500000e+02  7.500000e+02  750.000000  
mean   5.589603e-16             -4.026409e-17 -8.408089e-17    0.333333  
std    1.000667e+00              1.000667e+00  1

In [18]:
# Checking for missing values: count of explicit NaN entries in each column
print(diabetes_df.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
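
A zero count from `isnull()` only rules out explicit `NaN` entries. In the widely used Pima Indians Diabetes data, which has the same column names, missing measurements are recorded as 0, and the repeated minimum in the `SkinThickness` and `Insulin` summaries above (minimum equal to the 25% quantile) is consistent with that. The following is a minimal sketch, assuming the raw (unscaled) `patients.csv` follows that convention; the helper name `impute_zero_placeholders`, the column list, and the demo values are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Columns where a value of 0 is physiologically implausible (assumed list).
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

def impute_zero_placeholders(raw_df, columns=ZERO_AS_MISSING):
    """Return a copy of raw_df where zeros in `columns` are replaced by the column median."""
    df = raw_df.copy()
    # Treat zeros as missing so they do not bias the median.
    df[columns] = df[columns].replace(0, np.nan)
    # Fill each column with the median of its observed (non-zero) values.
    df[columns] = df[columns].fillna(df[columns].median())
    return df

# Tiny made-up frame to show the behaviour; not real patient data.
demo = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BloodPressure': [72, 66, 0],
    'SkinThickness': [35, 29, 0],
    'Insulin': [0, 0, 94],
    'BMI': [33.6, 26.6, 0.0],
})
cleaned = impute_zero_placeholders(demo)
print(cleaned)
```

This would have to run on the raw values, before `StandardScaler` is applied, because zeros are no longer recognizable after scaling. In a stricter workflow the medians would be computed on the training split only.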


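In [6]:
# Scaling the numerical features.
# Note: this cell was executed before the exploration cells above (In [14] to In [18]),
# which is why their output already shows standardized values (mean ~0, std ~1).
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
num_features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

diabetes_df[num_features] = scaler.fit_transform(diabetes_df[num_features])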


In [19]:
# Encoding the target: Outcome is already stored as 0/1 integers (int64),
# so LabelEncoder leaves the values unchanged
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
diabetes_df['Outcome'] = label_encoder.fit_transform(diabetes_df['Outcome'])

# Separating the features (X) from the target (y)
X = diabetes_df.drop(['Outcome'], axis=1)
y = diabetes_df['Outcome']


In [20]:
# Feature selection: keep the 4 features with the highest ANOVA F-scores
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=4)
X_new = selector.fit_transform(X, y)

In [21]:
# Getting the selected features
selected_features = X.columns[selector.get_support()]

# Create a new df with the selected features
X_new_df = pd.DataFrame(X_new, columns=selected_features)
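
The notebook does not print which four columns survive the selection. A small sketch of how the ANOVA F-scores behind `SelectKBest(f_classif, k=4)` could be inspected follows. It runs on synthetic data from `make_classification` with made-up column names `f0` to `f7`, so the scores say nothing about `patients.csv`; the names `X_demo`, `y_demo`, `selector_demo`, and `ranking` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for X and y (8 features, binary target); not the patient data.
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                    n_redundant=0, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

selector_demo = SelectKBest(f_classif, k=4).fit(X_demo, y_demo)

# One row per feature: F-score, p-value, and whether it was kept.
ranking = pd.DataFrame({
    'feature': X_demo.columns,
    'f_score': selector_demo.scores_,
    'p_value': selector_demo.pvalues_,
    'selected': selector_demo.get_support(),
}).sort_values('f_score', ascending=False).reset_index(drop=True)

print(ranking)
```

Applied to the notebook's own `selector`, `X.columns[selector.get_support()]` gives the kept names, and `selector.scores_` shows how far apart the features are.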

Task 2: Model Building

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_new_df, y, test_size=0.2, random_state=42)

# Training the Logistic Regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)

# Training the Decision Tree Classifier model
dtc_model = DecisionTreeClassifier(random_state=42)
dtc_model.fit(X_train, y_train)

# Training the Support Vector Machine model
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
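
In the cells above, `StandardScaler` and `SelectKBest` are fitted on all 750 rows before `train_test_split`, so the test rows influence the preprocessing. One possible way to avoid this is to wrap the steps in a scikit-learn `Pipeline`, which refits them on the training portion only. The sketch below uses synthetic data in place of `patients.csv` and is an alternative arrangement, not the notebook's original method; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the project data (750 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=750, n_features=8, n_informative=4,
                                     n_redundant=0, weights=[2/3, 1/3], random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# Scaling and feature selection are learned inside the pipeline,
# so only the rows passed to fit() are ever used to estimate them.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(f_classif, k=4)),
    ('clf', LogisticRegression(random_state=42)),
])

# 5-fold cross-validation on the training split; each fold refits every step.
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='accuracy')
print('CV accuracy per fold:', cv_scores)

pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print('Held-out accuracy:', test_accuracy)
```

Here `stratify=y_demo` keeps the class ratio the same in both splits, which the original `train_test_split` call does not request. Swapping the `'clf'` step for `DecisionTreeClassifier` or `SVC` would give the other two models the same treatment.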


Task 3: Model Evaluation

In [26]:
# Making predictions using the trained models
lr_pred = lr_model.predict(X_test)
dtc_pred = dtc_model.predict(X_test)
svm_pred = svm_model.predict(X_test)

# Evaluating the performance of the logistic regression
print('Logistic Regression:')
print('Accuracy:', accuracy_score(y_test, lr_pred))
print('Precision:', precision_score(y_test, lr_pred))
print('Recall:', recall_score(y_test, lr_pred))
print('F1 Score:', f1_score(y_test, lr_pred))
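# Note: roc_auc_score receives hard 0/1 predictions here (and in the two cells below),
# so the value equals the average of recall and specificity, not a score-based AUC
print('ROC AUC Score:', roc_auc_score(y_test, lr_pred))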

Logistic Regression:
Accuracy: 0.8133333333333334
Precision: 0.84
Recall: 0.4666666666666667
F1 Score: 0.6000000000000001
ROC AUC Score: 0.7142857142857144


In [27]:
# Evaluating the performance of the decision tree
print('\nDecision Tree Classifier:')
print('Accuracy:', accuracy_score(y_test, dtc_pred))
print('Precision:', precision_score(y_test, dtc_pred))
print('Recall:', recall_score(y_test, dtc_pred))
print('F1 Score:', f1_score(y_test, dtc_pred))
print('ROC AUC Score:', roc_auc_score(y_test, dtc_pred))


Decision Tree Classifier:
Accuracy: 0.6933333333333334
Precision: 0.4883720930232558
Recall: 0.4666666666666667
F1 Score: 0.47727272727272724
ROC AUC Score: 0.6285714285714286


In [28]:
# Evaluating the performance of the SVM
print('\nSupport Vector Machine:')
print('Accuracy:', accuracy_score(y_test, svm_pred))
print('Precision:', precision_score(y_test, svm_pred))
print('Recall:', recall_score(y_test, svm_pred))
print('F1 Score:', f1_score(y_test, svm_pred))
print('ROC AUC Score:', roc_auc_score(y_test, svm_pred))


Support Vector Machine:
Accuracy: 0.78
Precision: 0.7142857142857143
Recall: 0.4444444444444444
F1 Score: 0.5479452054794521
ROC AUC Score: 0.6841269841269841
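
The three evaluation cells pass hard 0/1 predictions to `roc_auc_score`. With binary labels as the score, the result reduces to the average of recall and specificity (balanced accuracy) rather than the area under the full ROC curve. For example, the logistic regression figures are consistent with 21 of 45 positives and 101 of 105 negatives classified correctly in the 150-row test set, giving (21/45 + 101/105) / 2 ≈ 0.714. A hedged sketch of the difference on synthetic data, using `predict_proba` to supply continuous scores (all names below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem; stands in for the notebook's X_new_df and y.
X_demo, y_demo = make_classification(n_samples=750, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LogisticRegression(random_state=42).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)             # 0/1 predictions at the default threshold
pos_scores = model.predict_proba(X_te)[:, 1]  # estimated probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, pos_scores)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print('AUC from hard labels:  ', auc_from_labels)
print('Balanced accuracy:     ', balanced_acc)
print('AUC from probabilities:', auc_from_scores)
```

`DecisionTreeClassifier` also offers `predict_proba`. For `SVC`, the output of `decision_function` can be passed to `roc_auc_score` instead, or the model can be built with `probability=True`.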


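Logistic regression has the best overall performance of the three models: it achieves the highest accuracy (0.813), precision (0.840), F1 score (0.600), and ROC AUC score (0.714), and it ties the decision tree for the highest recall (0.467). Recall is below 0.5 for all three models, so each one misses more than half of the diabetic patients in the test set.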

Task 4: Deployment
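
The notebook stops at this heading. One common way to carry out this step is to persist the fitted model with `joblib` and reload it to score new rows. The sketch below is an assumption about how that could look, not the author's implementation: it trains on synthetic data, and the file name `diabetes_model.joblib` and the sample feature values are made up.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data with 4 features, mirroring the 4 selected columns.
X_demo, y_demo = make_classification(n_samples=750, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=42)

# Bundle preprocessing with the classifier so new data is scaled the same way.
model = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(random_state=42)),
]).fit(X_demo, y_demo)

# Persist the fitted pipeline (hypothetical file name, written to a temp directory).
model_path = os.path.join(tempfile.mkdtemp(), 'diabetes_model.joblib')
joblib.dump(model, model_path)

# Later, e.g. in a separate process: reload and predict on unseen rows.
loaded = joblib.load(model_path)
new_rows = np.array([[0.5, -1.2, 0.3, 2.0],
                     [-0.7, 0.4, -1.5, 0.1]])  # made-up feature values
predictions = loaded.predict(new_rows)
probabilities = loaded.predict_proba(new_rows)[:, 1]
print(predictions, probabilities)
```

Saving the scaler together with the classifier matters here: new patient records arrive in raw units, and the model was trained on standardized values. In the notebook's setting, new rows would also need to be reduced to the same four selected columns before prediction.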