## 🩺 Diabetes Prediction Using Logistic Regression: Summary

This notebook demonstrates a full machine learning classification workflow:

### ✔ Steps Completed
- Loaded the Pima Indians Diabetes dataset
- Performed data cleaning and exploration
- Visualized correlations
- Standardized the features and trained a Logistic Regression model
- Evaluated accuracy, confusion matrix, and classification report
- Predicted diabetes for a sample patient

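### 📈 Model Performance
- Accuracy: typically **0.74–0.78** on the held-out 20% test split
- Logistic Regression is simple and interpretable: each coefficient is the change in the log-odds of diabetes per unit of its (standardized) feature, which makes it a sensible baseline for tabular medical classification tasks.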
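To make the interpretability point concrete, here is a minimal sketch of reading a fitted model's coefficients as odds ratios. It uses synthetic data from `make_classification` and hypothetical feature names (`f0` to `f3`), both assumptions for illustration, not the notebook's actual data or results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the Pima dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

# Standardize so that every coefficient refers to a one-standard-deviation change.
X_scaled = StandardScaler().fit_transform(X_demo)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-standard-deviation increase in that feature.
odds_ratios = np.exp(clf.coef_[0])
for name, ratio in zip(feature_names, odds_ratios):
    print(f"{name}: odds ratio per 1 SD = {ratio:.2f}")
```

An odds ratio above 1 means the feature pushes predictions toward the positive class, and below 1 toward the negative class.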

### 🎯 Skills Demonstrated
- Data preprocessing
- Classification modeling
- Model evaluation
- Visualization
- Reproducible ML workflow

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [22]:
df = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
df.head()

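|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |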


In [23]:
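df.info()
# Only the last expression in a cell is rendered, so wrap describe() in
# display() or print() if its summary table should appear as well.
df.describe()
df.isnull().sum()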

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
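The null check reports no missing values, yet the `df.head()` rows already contain zeros for `SkinThickness` and `Insulin`. In this dataset zeros in measurement columns are commonly treated as missing readings. The sketch below rebuilds the five rows shown by `df.head()` and counts those zeros. The list of columns in `zero_as_missing` is an assumption about which features cannot plausibly be zero, and replacing zeros with `NaN` is one possible choice, not a step this notebook performs.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), rebuilt so the example is self-contained.
df_head = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns means "not measured", not a real value.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zeros per column before deciding how to handle them.
zero_counts = (df_head[zero_as_missing] == 0).sum()
print(zero_counts)

# One option: mark them as NaN so isnull() and imputers can see them.
cleaned = df_head.copy()
for col in zero_as_missing:
    cleaned[col] = cleaned[col].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Once the zeros are marked as `NaN`, they can be imputed (for example with the column median) or the affected rows can be dropped, depending on how much data is affected.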

In [24]:
X = df.drop("Outcome", axis=1)
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
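The split above is purely random, so the share of positive cases in the test set can drift from the share in the full data. A minimal sketch of a stratified split follows. The feature matrix is random noise and the label vector with roughly 35% positives is a hypothetical stand-in for `y`, used only to show the effect of `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 768 rows, 8 features, about 35% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = np.array([1] * 268 + [0] * 500)

# stratify=y_demo keeps the class ratio nearly identical in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive rate, full data:", round(y_demo.mean(), 3))
print("Positive rate, train:    ", round(y_tr.mean(), 3))
print("Positive rate, test:     ", round(y_te.mean(), 3))
```

With an imbalanced outcome, stratifying tends to make accuracy and recall figures more comparable across different random seeds.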

In [25]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
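The scaler is fitted on the training rows only and then reused on the test rows, so the test set is standardized with the training mean and standard deviation. The toy arrays below are made up for illustration. The sketch checks that `scaler.transform` matches the manual formula `(x - train_mean) / train_std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up toy data: two features on very different scales.
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

# Fit on the training rows only; the statistics are stored on the scaler.
scaler_toy = StandardScaler().fit(X_train_toy)

# Manual standardization with the training mean and (population) std.
manual = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)

print("Stored training mean:", scaler_toy.mean_)
print("Scaled test row:     ", scaler_toy.transform(X_test_toy))
print("Matches manual formula:", np.allclose(scaler_toy.transform(X_test_toy), manual))
```

Fitting the scaler on the full dataset instead would let test-set statistics leak into training, which is why `fit_transform` is applied to `X_train` and only `transform` to `X_test`.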

In [26]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")
plt.title("Confusion Matrix")
plt.show()
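The confusion matrix and the classification report carry the same information in different forms. As a rough illustration, the sketch below derives accuracy, precision, and recall from the four cells of a confusion matrix. The labels and predictions are hypothetical values chosen by hand, not results from this model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only.
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print("Accuracy:", accuracy)
print("Precision:", precision, "| sklearn:", precision_score(y_true_toy, y_pred_toy))
print("Recall:", recall, "| sklearn:", recall_score(y_true_toy, y_pred_toy))
```

For a screening task such as diabetes prediction, recall on the positive class is often more informative than accuracy alone, because a false negative is a missed case.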

In [28]:
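# 🔍 Predict for a sample patient
# The model was fitted on standardized features, so scale the sample with the same fitted scaler.
sample = pd.DataFrame([[5, 116, 74, 0, 0, 25.6, 0.201, 30]], columns=X.columns)
sample_scaled = scaler.transform(sample)
prediction = model.predict(sample_scaled)[0]
print("Prediction (1 = Diabetic, 0 = Non-Diabetic):", prediction)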
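A model trained on standardized features expects standardized input at prediction time, and it is easy to forget that step for a single sample. One way to make it automatic is to bundle the scaler and the classifier in a scikit-learn pipeline. The sketch below assumes synthetic data from `make_classification`, shifted to a raw-looking range. The name `pipe` is illustrative, and this is an alternative arrangement, not the notebook's own code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), shifted and stretched to look unscaled.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)
X_demo = X_demo * 50 + 100

# The pipeline fits the scaler and the model together and reapplies the
# scaler inside every predict() call.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# Raw, unscaled input goes in directly.
raw_sample = X_demo[:1]
pred = pipe.predict(raw_sample)[0]
proba = pipe.predict_proba(raw_sample)[0, 1]
print("Predicted class:", pred)
print("Estimated probability of class 1:", round(proba, 3))
```

Reporting `predict_proba` next to the hard label also shows how confident the model is, which a bare 0 or 1 hides.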


# ✅ Summary

We trained a Logistic Regression classifier on the Pima Indians Diabetes Dataset.

### Key Results:
- Accuracy: typically 0.74–0.78 (74–78%) on the test split
- Model: Logistic Regression
- Scaling: StandardScaler
- Data: Pima Indians Diabetes Dataset
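Accuracy is given as a range because a single train/test split yields a single number that depends on which rows land in the test set. A rough way to see that spread is k-fold cross-validation. The sketch below uses synthetic data as a stand-in (an assumption), so its scores say nothing about the Pima dataset. It only illustrates the procedure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with X and y from the notebook.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)

# Scaling sits inside the pipeline, so each fold fits its own scaler
# on its training part only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean:", round(scores.mean(), 3), "| Std:", round(scores.std(), 3))
```

Reporting the mean and standard deviation across folds gives a more stable summary than one number from one split.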