# Diabetes Prediction (Research/Screening): A Correct Pipeline (No Data Leakage)

Notes:
- This notebook is for **research/screening** and is not a substitute for a doctor's diagnosis.
- The Pima dataset contains `0` values in variables where zero is biologically implausible (Glucose, BloodPressure, SkinThickness, Insulin, BMI). Here `0` is treated as a *missing value* and imputed with the median.
- Preprocessing (imputer + scaler) is **fit on the training data only** to prevent *data leakage*.
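
As a minimal sketch of the last two notes, the toy example below marks zeros as missing and fits a median imputer on the training rows only. The glucose values are invented for illustration (they are not taken from the Pima data), and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy values invented for illustration; 0 stands for "not recorded".
train = pd.DataFrame({"Glucose": [100, 0, 120, 140]})
test = pd.DataFrame({"Glucose": [0, 200]})

# Treat 0 as missing in both splits.
train_nan = train.replace(0, np.nan)
test_nan = test.replace(0, np.nan)

# Fit on train only: the median is computed from 100, 120, 140.
imputer = SimpleImputer(strategy="median")
imputer.fit(train_nan)
train_median = imputer.statistics_[0]

# The test split is only transformed, so its own values (e.g. 200)
# never influence the fill value.
test_filled = imputer.transform(test_nan)

print("Median learned from train:", train_median)
print("Imputed test values:", test_filled.ravel())
```

If the imputer were fit on train and test together, the value 200 would shift the median, which is exactly the leakage the pipeline is meant to avoid.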

Model output:
- `predict_proba` → **risk score** (probability) for Outcome=1
- Evaluation: accuracy, precision, recall, F1, ROC-AUC, confusion matrix


In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report
)
import pickle

# Load data
df = pd.read_csv("diabetes.csv")
df.head()


In [None]:
# Quick sanity checks
print("Shape:", df.shape)
print(df.isna().sum())

zero_cols = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]
print("\nZero counts (should be treated as missing):")
print((df[zero_cols] == 0).sum())


## 1) Split data (train/test)

In [None]:
X = df.drop(columns=["Outcome"])
y = df["Outcome"].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)


## 2) Preprocessing + Model (Pipeline)

Steps:
1. Convert `0` values in the selected columns to `NaN`
2. Impute with the median
3. StandardScaler
4. LogisticRegression (outputs a probability → risk score)


In [None]:
# Columns where 0 is not a plausible measurement and is treated as missing
zero_as_missing_cols = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]

def zero_to_nan(X_in: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X_in with 0 replaced by NaN in zero_as_missing_cols."""
    X_out = X_in.copy()
    for c in zero_as_missing_cols:
        X_out[c] = X_out[c].replace(0, np.nan)
    return X_out

zero_to_nan_tf = FunctionTransformer(zero_to_nan, feature_names_out="one-to-one")

preprocess = Pipeline(steps=[
    ("zero_to_nan", zero_to_nan_tf),
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

model = LogisticRegression(max_iter=2000)

pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", model)
])

pipe


## 3) Train

In [None]:
pipe.fit(X_train, y_train)
print("Trained.")


## 4) Evaluate

In [None]:
# Predictions
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)[:, 1]

# Metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, zero_division=0)
rec = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
roc = roc_auc_score(y_test, y_proba)

print("Accuracy :", acc)
print("Precision:", prec)
print("Recall   :", rec)
print("F1       :", f1)
print("ROC-AUC  :", roc)

print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred, zero_division=0))
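
A single train/test split can give a noisy estimate. Because the imputer and scaler live inside the pipeline, cross-validation refits them on each training fold, so the estimate stays free of leakage. The sketch below is one possible way to do this; it uses synthetic data from `make_classification` as a stand-in (an assumption, so the block runs without `diabetes.csv`), and `demo_pipe`, `X_demo`, and `y_demo` are hypothetical names. In the notebook you would pass `pipe`, `X_train`, and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 numeric features, binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000)),
])

# Each fold refits the whole pipeline on its own training part.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("ROC-AUC per fold:", scores.round(3))
print("Mean ROC-AUC    :", scores.mean().round(3))
```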


## 5) Save pipeline (imputer + scaler + model)

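Save everything as a single object so that the Streamlit app applies the same preprocessing as training. Note that `pickle` stores the `zero_to_nan` function by reference (module and name), not by value, so the script that loads the file must provide an identical `zero_to_nan` under the same module name before calling `pickle.load`.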


In [None]:
with open("diabetes_pipeline.pkl", "wb") as f:
    pickle.dump(pipe, f)

print("Saved: diabetes_pipeline.pkl")
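
Since `pickle` stores plain Python functions by reference, a pipeline that contains a `FunctionTransformer` around a notebook-defined function can only be loaded where that function is available. One possible way to sidestep this, shown here as an alternative sketch rather than the notebook's method, is to use only built-in transformers: `SimpleImputer(missing_values=0)` restricted to selected columns through a `ColumnTransformer`. The data is synthetic, all names (`X_demo`, `portable_pipe`, `zero_imputer`) are hypothetical, and the approach assumes `0` is the only missing-value marker in those columns.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): zeros in Glucose mean "missing",
# zeros in Pregnancies are real values.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Pregnancies": rng.integers(0, 10, size=200),
    "Glucose": rng.integers(60, 200, size=200),
})
X_demo.loc[::10, "Glucose"] = 0  # inject "missing" zeros
y_demo = (X_demo["Glucose"] > 120).astype(int)

# Impute zeros only in the listed columns; pass the others through unchanged.
zero_imputer = ColumnTransformer(
    transformers=[
        ("zero_cols", SimpleImputer(missing_values=0, strategy="median"), ["Glucose"]),
    ],
    remainder="passthrough",
)

portable_pipe = Pipeline(steps=[
    ("impute", zero_imputer),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000)),
]).fit(X_demo, y_demo)

# Round trip through bytes; pickle.dump / pickle.load on a file work the same way.
restored = pickle.loads(pickle.dumps(portable_pipe))

same = np.allclose(
    portable_pipe.predict_proba(X_demo)[:, 1],
    restored.predict_proba(X_demo)[:, 1],
)
print("Identical probabilities after reload:", same)
```

Note that `remainder="passthrough"` places the passed-through columns after the imputed ones, which changes the column order seen by the scaler and model but not the predictions.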


## 6) Example inference + risk level (heuristic)

Risk score categories for a probability `p` (adjust as needed):
- p < 0.30 → Rendah (low)
- 0.30 ≤ p < 0.60 → Sedang (medium)
- p ≥ 0.60 → Tinggi (high)


In [None]:
def risk_level(p: float) -> str:
    if p < 0.30:
        return "Rendah"
    if p < 0.60:
        return "Sedang"
    return "Tinggi"

sample = pd.DataFrame([{
    "Pregnancies": 5,
    "Glucose": 166,
    "BloodPressure": 72,
    "SkinThickness": 19,
    "Insulin": 175,
    "BMI": 25.8,
    "DiabetesPedigreeFunction": 0.587,
    "Age": 51
}])

p = pipe.predict_proba(sample)[:, 1][0]
pred = pipe.predict(sample)[0]
print("Risk score:", round(float(p), 4), "| Level:", risk_level(float(p)), "| Pred:", pred)
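
When scoring many rows at once, the same heuristic can be applied in one call with `pandas.cut`. This is a sketch under the same thresholds as `risk_level`; the probabilities below are made up and stand in for the output of `predict_proba(...)[:, 1]`.

```python
import numpy as np
import pandas as pd

# Hypothetical probabilities, chosen to hit the category boundaries.
proba = np.array([0.10, 0.30, 0.59, 0.60, 1.00])

# right=False makes each bin closed on the left:
# [0, 0.30), [0.30, 0.60), [0.60, inf)
levels = pd.cut(
    proba,
    bins=[0.0, 0.30, 0.60, np.inf],
    labels=["Rendah", "Sedang", "Tinggi"],
    right=False,
)
print(list(levels))
# → ['Rendah', 'Sedang', 'Sedang', 'Tinggi', 'Tinggi']
```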
