# Diabetes Prediction with Machine Learning

# Understanding the Problem

Diabetes is a chronic disease that affects millions of people worldwide, and early detection is crucial for effective management. Machine learning models can estimate a person's risk of diabetes from routine clinical measurements such as glucose level, BMI, and age.

# Dataset

A common dataset used for diabetes prediction is the Pima Indians Diabetes Dataset. It contains medical records of Pima Indian women and includes attributes such as:

- Number of pregnancies
- Glucose level
- Blood pressure
- Skin thickness
- Insulin level
- BMI
- Diabetes pedigree function
- Age
- Outcome (0 for non-diabetic, 1 for diabetic)

# Steps Involved

# Task 1. Import Necessary Libraries:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Task 2. Load and Explore the Dataset:

In [2]:
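# Load the dataset
data = pd.read_csv("Diabeteas_Dataset.csv")
# In a notebook only the last expression of a cell is displayed,
# so run head() and describe() in their own cells to see their output.
data.head()
data.describe()
# info() prints the column summary directly
data.info()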

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
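The `info()` summary shows no null entries, but it cannot reveal zeros that stand in for missing measurements. A minimal sketch of how such zeros could be counted is shown below; the small `sample` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the real data (illustrative values only)
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Count zero entries per column; for these measurements a zero
# is not a plausible reading and likely marks a missing value
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Running the same expression on `data` would show which columns are affected and how heavily.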


# Task 3. Data Preprocessing:

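The dataset has no null entries (info() reports 768 non-null values in every column), but zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI are not physiologically plausible and are treated as missing values. The cell below replaces them with the column median. Outlier treatment is not applied in this notebook, and feature scaling is handled in a later step, after the train/test split.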

In [3]:
# Replace zero values in certain columns with the median value of that column
columns_with_zero_values = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for column in columns_with_zero_values:
    data[column] = data[column].replace(0, data[column].median())
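The cell above computes each median over all rows, zeros included, so a column with many zeros gets a median that is pulled toward zero. One possible variation, sketched here on a made-up `demo` column, computes the median from the observed (non-zero) values only. The names `demo`, `median_with_zeros`, and `median_observed` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up column in which zeros stand in for missing insulin readings
demo = pd.DataFrame({"Insulin": [0, 0, 0, 90, 100, 110]})

# Median over all rows (zeros included) vs. median of observed values only
median_with_zeros = demo["Insulin"].median()
median_observed = demo["Insulin"].replace(0, np.nan).median()

# Impute the zeros with the median of the observed values
demo["Insulin_imputed"] = demo["Insulin"].replace(0, median_observed)

print("median including zeros:", median_with_zeros)
print("median of observed values:", median_observed)
print(demo)
```

Switching the notebook to this variant would change the imputed values, so the reported metrics would no longer match the output shown later.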

# Task 4. Feature Engineering:

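New features can be created where useful (for example, by combining existing ones), and feature selection can help model performance. This notebook does neither: it keeps all eight original features and only separates them from the target variable.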

In [4]:
# Separate the features and the target variable
X = data.drop(columns='Outcome')
y = data['Outcome']
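As a rough illustration of what "combining existing features" could look like, the sketch below derives two extra columns from a made-up frame. The feature names `Glucose_x_BMI` and `AgeBand`, the bin edges, and the sample values are all hypothetical; whether such features help would need to be checked on held-out data.

```python
import pandas as pd

# Made-up rows with three of the dataset's columns (illustrative values only)
features = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Age": [50, 31, 32],
})

# Hypothetical interaction feature: product of glucose level and BMI
features["Glucose_x_BMI"] = features["Glucose"] * features["BMI"]

# Hypothetical coarse age bands; intervals are (20, 30], (30, 40], (40, 120]
features["AgeBand"] = pd.cut(
    features["Age"],
    bins=[20, 30, 40, 120],
    labels=["21-30", "31-40", "41+"],
)

print(features)
```

A categorical column such as `AgeBand` would still need numeric encoding before being passed to a scikit-learn estimator.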

# Task 5. Split Data into Training and Testing Sets:

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
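The split above is purely random, so the share of diabetic cases can differ between the training and test sets. When the classes are imbalanced, passing `stratify=y` keeps the class proportions the same in both parts. The sketch below shows the effect on synthetic labels; the arrays `X_demo` and `y_demo` and their 65/35 balance are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 65 negatives and 35 positives (illustrative)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 65 + [1] * 35)

# stratify=y_demo preserves the 65/35 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```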

# Task 6. Feature Scaling:

In [6]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
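Fitting the scaler on the training set only, as above, keeps test-set statistics out of the preprocessing. The median imputation earlier in the notebook, by contrast, is computed on the full dataset before the split. One way to keep both steps restricted to training data is a scikit-learn `Pipeline`; the sketch below uses synthetic data with `NaN` as the missing-value marker, and the names `prep`, `X_demo`, and `X_tr_prepared` are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix with roughly 10% of entries marked missing
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=15.0, size=(200, 4))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Imputation and scaling chained so both are fitted on training rows only
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_tr_prepared = prep.fit_transform(X_tr)  # learns medians, means, and std devs
X_te_prepared = prep.transform(X_te)      # reuses the training statistics

print("training medians used for imputation:", prep.named_steps["impute"].statistics_)
```

With the real data, the zeros would first have to be converted to `NaN` (or the imputer configured with a different missing-value marker) for this to apply.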

# Task 7. Model Selection and Training:

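Candidate algorithms include Logistic Regression, Decision Trees, Random Forest, SVM, and Naive Bayes. This notebook trains a single Random Forest classifier with default hyperparameters on the scaled training set. Tree-based models are insensitive to feature scaling, so the scaled inputs matter mainly if a different algorithm is substituted.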

In [7]:
model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)
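To compare several of the candidate algorithms on equal terms, cross-validation on the training data is a common approach. The sketch below is one way to set this up, run on synthetic data from `make_classification` rather than the diabetes dataset; the `candidates` dictionary and its hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem with 8 features (illustrative)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Scale-sensitive models get their own scaler inside a pipeline
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm_rbf": make_pipeline(StandardScaler(), SVC()),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Mean 5-fold cross-validated accuracy for each candidate
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, est in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Putting the scaler inside each pipeline means it is refitted on every training fold, so no fold's validation rows influence the scaling.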

# Task 8. Model Evaluation:

Evaluate the model on the held-out test set using accuracy, precision, recall, F1-score, and the confusion matrix. When several models are trained, the same metrics allow their performance to be compared.

In [8]:
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Task 9. Report the Results:

In [9]:
print(f'Accuracy: {accuracy * 100:.2f}%')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)

Accuracy: 76.62%
Confusion Matrix:
[[80 19]
 [17 38]]
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.81      0.82        99
           1       0.67      0.69      0.68        55

    accuracy                           0.77       154
   macro avg       0.75      0.75      0.75       154
weighted avg       0.77      0.77      0.77       154
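The figures in the report follow directly from the confusion matrix. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so the worked arithmetic below reproduces the accuracy and the class-1 (diabetic) row of the report from the four counts printed above.

```python
# Counts from the confusion matrix above: rows = true class, columns = predicted class
tn, fp = 80, 19   # true non-diabetic: correctly rejected, falsely flagged
fn, tp = 17, 38   # true diabetic: missed, correctly detected

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 118 / 154
precision = tp / (tp + fp)                           # 38 / 57
recall = tp / (tp + fn)                              # 38 / 55
f1 = 2 * precision * recall / (precision + recall)   # 76 / 112

print(f"accuracy:  {accuracy * 100:.2f}%")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
```

A recall of 0.69 for the diabetic class means 17 of the 55 diabetic cases in the test set were missed, which for a screening task may matter more than the overall accuracy.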

