Predict serious diseases a patient may develop in the future, based on the patient's recorded history of conditions.

### Feature Engineering
- Collect every condition a patient has ever had → vector of historical condition codes

- Aggregate them into a representation:

    - One-hot encoding of diseases

    - TF-IDF or embeddings can also be used to represent the disease codes
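
As a rough illustration of the two simplest representations listed above, the sketch below encodes a few made-up patient histories with scikit-learn. The `histories` data is hypothetical, and treating each history as a "document" of codes for TF-IDF is one possible reading of the idea, not necessarily the one intended here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical patient histories: one list of condition codes per patient.
histories = [
    ["Respiratory", "Infectious Disease", "Respiratory"],
    ["Neurological"],
    ["Cardiovascular", "Respiratory"],
]

# One-hot (multi-hot) encoding: 1 if the code ever appears in the history.
mlb = MultiLabelBinarizer()
X_onehot = mlb.fit_transform(histories)

# TF-IDF: each history is treated as a "document" whose tokens are the codes.
# A callable analyzer makes the vectorizer use the lists as they are.
tfidf = TfidfVectorizer(analyzer=lambda codes: codes)
X_tfidf = tfidf.fit_transform(histories)

print(mlb.classes_)
print(X_onehot)
print(X_tfidf.toarray().round(2))
```

Unlike the one-hot matrix, the TF-IDF matrix keeps how often a code recurs and down-weights codes that are common across patients.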

### Label Generation
- For each patient, take the most recent conditions (e.g., diseases that appeared in the last month)

- Use them as the target labels (multi-label classification)
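
The labeling rule above could be sketched as follows. The records are made up, the column names mirror those used later in this notebook (`PATIENT_ID`, `CODE_NAME`, `ONSET_TIME`), and the helper `split_history_and_label` plus the choice to measure the 30-day window back from each patient's own last record are assumptions for illustration.

```python
import pandas as pd

# Hypothetical condition records (made-up rows).
records = pd.DataFrame({
    "PATIENT_ID": ["p1", "p1", "p1", "p2", "p2"],
    "CODE_NAME": ["Asthma", "Acute bronchitis", "Pneumonia", "Headache", "Concussion"],
    "ONSET_TIME": pd.to_datetime(
        ["2020-01-10", "2021-03-05", "2021-06-20", "2019-05-01", "2021-06-25"]
    ),
})

def split_history_and_label(df, window_days=30):
    """Split each patient's records into history (features) and recent conditions (labels)."""
    # Latest onset per patient, broadcast back to every row of that patient
    last_onset = df.groupby("PATIENT_ID")["ONSET_TIME"].transform("max")
    # Records inside the final window become labels; everything earlier is history
    is_label = df["ONSET_TIME"] > last_onset - pd.Timedelta(days=window_days)
    history = df[~is_label].groupby("PATIENT_ID")["CODE_NAME"].apply(list)
    labels = df[is_label].groupby("PATIENT_ID")["CODE_NAME"].apply(list)
    return history, labels

history, labels = split_history_and_label(records)
print(history)
print(labels)
```

Keeping the label window out of the feature set this way means the model never sees the conditions it is asked to predict.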

### Modeling
- Problem: Multi-label Classification

- Example models:

    - Logistic Regression + OneVsRest

    - Random Forest Classifier

    - XGBoost

    - (Advanced) LSTM or Transformer if temporal order matters

    - (Recent) Graph Neural Networks (GNN) to exploit relations between diseases
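
For the first option in the list, a minimal sketch with scikit-learn might look like this. The feature and label matrices are made up; only the combination of `OneVsRestClassifier` and `LogisticRegression` comes from the list above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical multi-hot history features (rows: patients, columns: condition codes).
X = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
])
# Hypothetical multi-label targets (columns: two possible future conditions).
Y = np.array([
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
    [1, 0],
    [0, 1],
])

# One independent binary logistic regression is fitted per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# predict_proba returns one column of probabilities per label.
proba = clf.predict_proba(np.array([[1, 1, 0]]))
print(proba)
```

Because every label gets its own classifier, the probabilities do not sum to one, which is what a Top-3 risk ranking needs.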



### Output Model
Patient: Abel832 Keebler762
Top-3 Risks:
1. Pneumonia (86%)
2. Bronchitis (72%)
3. Asthma (55%)

This is not a diagnosis, but an early warning.

In [None]:
# === TRAINING & SAVE MODEL ===
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report
import joblib

df_code_ref = pd.read_csv('../data/csvdata/condition_code_ref.csv')
df_condition = pd.read_csv('../data/csvdata/condition.csv')
df_encounter = pd.read_csv('../data/csvdata/encounter.csv')
df_medical_procedure = pd.read_csv('../data/csvdata/medical_procedure.csv')
df_observation = pd.read_csv('../data/csvdata/observation.csv')
df_patient = pd.read_csv('../data/csvdata/patient.csv')


# --- DATA PREPROCESSING ---
# Join condition → encounter → patient
df_condition_patient = pd.merge(
    df_condition,
    df_encounter[['ID', 'PATIENT_ID']],
    left_on='ENCOUNTER_ID',
    right_on='ID',
    how='left'
).drop(columns=['ID_y']).rename(columns={'ID_x': 'ID'})

# Join with condition_code_ref
df_condition_cat = pd.merge(
    df_condition_patient,
    df_code_ref,
    left_on='CODE',
    right_on='SNOMED_CODE',
    how='left'
)

# Keep only the needed columns and drop the "Other" category
df_condition_cat = df_condition_cat[['PATIENT_ID', 'DISEASE_CATEGORY']].dropna()
df_condition_cat = df_condition_cat[df_condition_cat['DISEASE_CATEGORY'] != "Other"]

# Group per patient: one list of disease categories per PATIENT_ID
df_grouped = df_condition_cat.groupby('PATIENT_ID')['DISEASE_CATEGORY'].apply(list).reset_index()

# Add the NAME_GIVEN column
df_grouped = pd.merge(df_grouped, df_patient[['ID', 'NAME_GIVEN']], left_on='PATIENT_ID', right_on='ID', how='left')
df_grouped.drop(columns=['ID'], inplace=True)

# --- FEATURE ENGINEERING ---
# X: multi-hot encoding of the full history (this includes the last entry used as the label below)
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(df_grouped['DISEASE_CATEGORY'])
# Label: the last category in each patient's list, in the row order of condition.csv
# (the rows are not sorted by onset time in this cell)
y = [d[-1:] for d in df_grouped['DISEASE_CATEGORY']]
Y = mlb.transform(y)

# --- MODEL TRAINING ---
X_train, X_test, y_train, y_test, name_train, name_test = train_test_split(
    X, Y, df_grouped['NAME_GIVEN'], test_size=0.3, random_state=42
)

model = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
model.fit(X_train, y_train)

# --- EVALUATION ---
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, target_names=mlb.classes_, output_dict=True, zero_division=0)
df_report = pd.DataFrame(report).transpose()

# --- SAVE MODEL, ENCODER, & DATA ---
import os
os.makedirs("model/rf", exist_ok=True)  # make sure the output folder exists
df_grouped.to_csv("model/patient_disease_history.csv", index=False)
joblib.dump(model, "model/rf/model_saved.pkl")
joblib.dump(mlb, "model/rf/encoder_saved.pkl")

print("✅ Model, encoder, and patient data saved successfully.")

✅ Model, encoder, and patient data saved successfully.


In [14]:
df_code_ref.head()

| Unnamed: 0 | ID | CODE | SNOMED_CODE | ICD10_CODE | DISEASE_CATEGORY | DESCRIPTION |
|---|---|---|---|---|---|---|
| 0 | 1 | Acute bronchitis (disorder) | 10509002 | M54 | Respiratory | Inflammation of the bronchial tubes in the lungs. |
| 1 | 2 | Concussion with no loss of consciousness | 62106007 | M54 | Neurological | A brain injury caused by a blow to the head. |
| 2 | 3 | Acute viral pharyngitis (disorder) | 195662009 | J02.9 | Infectious Disease | Inflammation of the pharynx, leading to a sore... |
| 3 | 4 | Headache (finding) | 25064002 | F32 | Neurological | Pain in the head or upper neck. |
| 4 | 5 | Sputum finding (finding) | 248595008 | E11 | Laboratory Finding | Material expelled from the respiratory tract. |


In [15]:
df_code_ref.DISEASE_CATEGORY.unique()

array(['Respiratory', 'Neurological', 'Infectious Disease',
       'Laboratory Finding', 'Other', 'Endocrine / Metabolic',
       'Cardiovascular', 'Musculoskeletal', 'Mental and Behavioral',
       'Neoplasms'], dtype=object)

### Predicting future disease risk from a patient's disease history
The model learns patterns from all patients (not only from the patient in question) and applies them to a specific patient's input.

For example, if many people with epilepsy and stroke have also had Alzheimer's, the model learns that this combination indicates a risk of Alzheimer's.

If the last disease in the data is also an old one (for example, Concussion with no loss of consciousness), the model can "re-select" that disease if it is dominant.

Because we use a multi-label formulation with a single output (only the last recorded condition per patient), the model can only predict the most common or most recent disease, not long-term disease progression.
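
The co-occurrence idea from the epilepsy and stroke example can be reproduced on a toy cohort. Everything in the sketch below is made up: two feature columns (`epilepsy`, `stroke`), one target (Alzheimer's), and a population in which only patients with both conditions carry the target. It only illustrates how a Random Forest picks up such a pattern across patients.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up cohort. Feature columns: [epilepsy, stroke]; target: Alzheimer's.
# Each of the four combinations is repeated for 25 hypothetical patients.
combos = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
targets = np.array([1, 0, 0, 0])  # only "both conditions" carries the target
X = np.repeat(combos, 25, axis=0)
y = np.repeat(targets, 25)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Probability of the target for a patient with both conditions vs. only one
p_both = clf.predict_proba([[1, 1]])[0, 1]
p_epilepsy_only = clf.predict_proba([[1, 0]])[0, 1]
print(f"both: {p_both:.2f}, epilepsy only: {p_epilepsy_only:.2f}")
```

The score for a new patient therefore reflects what happened to other patients with the same combination, not anything specific to that individual.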

In [None]:
# === TESTING FROM SAVED MODEL ===
import ast
import pandas as pd
import numpy as np
import joblib

# Load artifacts
model = joblib.load("model/rf/model_saved.pkl")
mlb = joblib.load("model/rf/encoder_saved.pkl")
df_grouped = pd.read_csv("model/patient_disease_history.csv")
df_code_ref = pd.read_csv("../data/csvdata/condition_code_ref.csv")

# Map each DISEASE_CATEGORY to a representative CODE (the first one listed)
disease_map = df_code_ref.groupby("DISEASE_CATEGORY")["CODE"].first().to_dict()

# --- PREDICTION FUNCTION ---
def predict_top3_from_saved_model(name_given: str):
    patient_row = df_grouped[df_grouped['NAME_GIVEN'] == name_given]
    if patient_row.empty:
        print(f"❌ No patient found with the name '{name_given}'.")
        return

    # The list column was stored as a string in the CSV; parse it without eval()
    disease_list = ast.literal_eval(patient_row.iloc[0]['DISEASE_CATEGORY'])
    X_input = mlb.transform([disease_list])
    probas = model.predict_proba(X_input)

    adjusted_probs = []
    for p in probas:
        if hasattr(p, 'shape') and len(p.shape) == 2 and p.shape[1] == 2:
            adjusted_probs.append(p[:, 1])
        else:
            adjusted_probs.append(np.zeros(p.shape[0]))

    df_proba = pd.DataFrame(np.array(adjusted_probs).T, columns=mlb.classes_)
    top3 = df_proba.iloc[0].sort_values(ascending=False).head(3)

    # 🔁 Look up the CODE names in df_code_ref that belong to the patient's historical categories
    history_codes = df_code_ref[df_code_ref['DISEASE_CATEGORY'].isin(disease_list)]['CODE'].unique()

    print(f"🧑 Patient: {name_given}")
    print("📋 Previous Disease History:")
    for code in history_codes:
        print(f"  - {code}")

    print("🔮 Top-3 Disease Risks:")
    for i, (disease_cat, score) in enumerate(top3.items(), 1):
        # Show a representative CODE name for the predicted category
        disease_name = disease_map.get(disease_cat, disease_cat)
        print(f"  {i}. {disease_name} ({score*100:.1f}%)")

# --- TESTING ---
nama_pasien = df_grouped['NAME_GIVEN'].iloc[123]
predict_top3_from_saved_model(nama_pasien)

🧑 Patient: Leroy603
📋 Previous Disease History:
  - Acute viral pharyngitis (disorder)
🔮 Top-3 Disease Risks:
  1. Acute viral pharyngitis (disorder) (100.0%)
  2. Hypertension (0.0%)
  3. Prediabetes (0.0%)


### Train & Save the XGBoost Model

In [2]:
# === PREPROCESSING ===
import pandas as pd
import numpy as np
import pickle
import os

# Make sure the output folder 'model/xgboost' exists (all files below are saved there)
os.makedirs("model/xgboost", exist_ok=True)

# Load CSV
df_condition = pd.read_csv("../data/csvdata/condition.csv")
df_encounter = pd.read_csv("../data/csvdata/encounter.csv")
df_code_ref = pd.read_csv("../data/csvdata/condition_code_ref.csv")
df_patient = pd.read_csv("../data/csvdata/patient.csv")

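# Join CONDITION → ENCOUNTER → PATIENT
# ID_y duplicates ENCOUNTER_ID after the merge, so drop it instead of renaming it
# (renaming it to 'ENCOUNTER_ID' would create two columns with the same name)
df_condition = pd.merge(
    df_condition,
    df_encounter[['ID', 'PATIENT_ID']],
    left_on='ENCOUNTER_ID',
    right_on='ID',
    how='left'
).drop(columns=['ID_y']).rename(columns={'ID_x': 'CONDITION_ID'})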

# Join CONDITION → CODE_REF to get the disease names
df_code_ref = df_code_ref.rename(columns={"CODE": "CODE_NAME"})
df_condition = pd.merge(
    df_condition,
    df_code_ref[['SNOMED_CODE', 'CODE_NAME']],
    left_on='CODE',
    right_on='SNOMED_CODE',
    how='left'
)

# Filter and sort by patient and onset time
df_condition = df_condition.dropna(subset=['CODE_NAME'])
df_condition['ONSET_TIME'] = pd.to_datetime(df_condition['ONSET_TIME'], errors='coerce')
df_condition = df_condition.dropna(subset=['ONSET_TIME'])
df_condition = df_condition.sort_values(by=['PATIENT_ID', 'ONSET_TIME'])

# Map disease name → column index
unique_codes = df_condition['CODE_NAME'].unique()
code_to_index = {code: i for i, code in enumerate(sorted(unique_codes))}

# Build sequential data per patient: one multi-hot vector per onset timestamp
sequences = {}
for _, row in df_condition.iterrows():
    pid = row['PATIENT_ID']
    onset = row['ONSET_TIME']
    code = row['CODE_NAME']
    idx = code_to_index.get(code)
    if idx is None: continue

    if pid not in sequences:
        sequences[pid] = {}
    if onset not in sequences[pid]:
        sequences[pid][onset] = [0] * len(code_to_index)
    sequences[pid][onset][idx] = 1

# Map PATIENT_ID → NAME_GIVEN
id_to_name = dict(zip(df_patient['ID'], df_patient['NAME_GIVEN']))

# Build X_seq (all steps except the last), y_seq (the last step), and name_seq
X_seq, y_seq, name_seq = [], [], []
for pid, timeline in sequences.items():
    steps = [timeline[t] for t in sorted(timeline.keys())]
    if len(steps) > 1:
        X_seq.append(steps[:-1])
        y_seq.append(steps[-1])
        name_seq.append(id_to_name.get(pid, f"Unknown_{pid}"))

# Save the results
with open("model/xgboost/X_seq.pkl", "wb") as f:
    pickle.dump(X_seq, f)
with open("model/xgboost/y_seq.pkl", "wb") as f:
    pickle.dump(y_seq, f)
with open("model/xgboost/code_index_mapping.pkl", "wb") as f:
    pickle.dump(code_to_index, f)
with open("model/xgboost/name_seq.pkl", "wb") as f:
    pickle.dump(name_seq, f)

print("✅ Stage 1: Sequential data processed and saved successfully.")

✅ Stage 1: Sequential data processed and saved successfully.


In [3]:
# === TRAINING MODEL XGBOOST ===
import pickle
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Load data
with open("model/xgboost/X_seq.pkl", "rb") as f:
    X_seq = pickle.load(f)
with open("model/xgboost/y_seq.pkl", "rb") as f:
    y_seq = pickle.load(f)
with open("model/xgboost/code_index_mapping.pkl", "rb") as f:
    code_to_index = pickle.load(f)
with open("model/xgboost/name_seq.pkl", "rb") as f:
    name_seq = pickle.load(f)

# Validate that the data lengths match
assert len(X_seq) == len(y_seq) == len(name_seq)

# Max pooling over time steps to get one tabular row per patient
X_tabular = [np.max(seq, axis=0) for seq in X_seq]
y_tabular = y_seq

# Split
X_train, X_test, y_train, y_test, name_train, name_test = train_test_split(
    X_tabular, y_tabular, name_seq, test_size=0.2, random_state=42
)

# Multi-label XGBoost model (one binary classifier per disease)
model = MultiOutputClassifier(XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    use_label_encoder=False,
    eval_metric='logloss',
    base_score=0.5,
    verbosity=0
))
model.fit(X_train, y_train)

# Save the model and the test data
with open("model/xgboost/xgb_multioutput_model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model/xgboost/xgb_test_data.pkl", "wb") as f:
    pickle.dump((X_test, y_test), f)
with open("model/xgboost/xgb_test_names.pkl", "wb") as f:
    pickle.dump(name_test, f)

print("✅ Stage 2: XGBoost model trained and saved successfully.")


✅ Stage 2: XGBoost model trained and saved successfully.
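
The max pooling step in the training cell collapses each patient's timeline into a single row. The small sketch below shows what that does to a made-up three-step timeline. The count-based variant `x_sum` is not used in this notebook and is shown only for comparison.

```python
import numpy as np

# Hypothetical timeline for one patient: 3 time steps x 4 condition codes.
seq = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]

x_max = np.max(seq, axis=0)  # 1 if the code appeared at any step (what the cell above does)
x_sum = np.sum(seq, axis=0)  # number of steps in which each code appeared (alternative)

print(x_max)
print(x_sum)
```

Max pooling keeps only whether a code ever occurred, so both the order of events and how often a condition recurred are lost before the data reaches XGBoost.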


In [12]:
# === TOP-3 PREDICTION + DISEASE HISTORY ===
import pickle
import pandas as pd
import numpy as np

# Load the model and data
with open("model/xgboost/xgb_multioutput_model.pkl", "rb") as f:
    model = pickle.load(f)
with open("model/xgboost/xgb_test_data.pkl", "rb") as f:
    X_test, y_test = pickle.load(f)
with open("model/xgboost/xgb_test_names.pkl", "rb") as f:
    name_test = pickle.load(f)
with open("model/xgboost/code_index_mapping.pkl", "rb") as f:
    code_to_index = pickle.load(f)

# Map index → disease name
index_to_code = {v: k for k, v in code_to_index.items()}

# Top-3 prediction function
def predict_top3_risiko(x_input):
    probas = model.predict_proba([x_input])
    pred_probs = np.array([p[0][1] if len(p[0]) > 1 else 0.0 for p in probas])
    top3_idx = pred_probs.argsort()[-3:][::-1]
    return [(index_to_code[i], pred_probs[i]) for i in top3_idx]

# Load the historical data used to print the disease history
df_condition = pd.read_csv("../data/csvdata/condition.csv")
df_encounter = pd.read_csv("../data/csvdata/encounter.csv")
df_code_ref = pd.read_csv("../data/csvdata/condition_code_ref.csv")
df_patient = pd.read_csv("../data/csvdata/patient.csv")

df_code_ref = df_code_ref.rename(columns={"CODE": "CODE_NAME"})
df_cond = pd.merge(df_condition, df_encounter[['ID', 'PATIENT_ID']], left_on='ENCOUNTER_ID', right_on='ID', how='left')
df_cond = pd.merge(df_cond, df_patient[['ID', 'NAME_GIVEN']], left_on='PATIENT_ID', right_on='ID', how='left')
df_cond = pd.merge(df_cond, df_code_ref[['SNOMED_CODE', 'CODE_NAME']], left_on='CODE', right_on='SNOMED_CODE', how='left')

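# 🔍 Predict for the i-th patient
i = 123  # Change to the desired patient index in the test set
nama = name_test[i]
x_input = X_test[i]
top3 = predict_top3_risiko(x_input)
# Note: the lookup is by NAME_GIVEN, so it returns every recorded condition for that
# name, including the final time step that served as the prediction target.
riwayat = df_cond[df_cond['NAME_GIVEN'] == nama]['CODE_NAME'].dropna().unique().tolist()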

# Print the results
print(f"🧑 Patient: {nama}")
print("📋 Previous Disease History:")
for code in riwayat:
    print(f"  - {code}")
print("🔮 Top-3 Disease Risks:")
for j, (code, score) in enumerate(top3, 1):
    print(f"  {j}. {code} ({score*100:.1f}%)")

🧑 Patient: Majorie11
📋 Previous Disease History:
  - Appendicitis
  - History of appendectomy
  - Body mass index 30+ - obesity (finding)
  - Acute viral pharyngitis (disorder)
  - Acute bronchitis (disorder)
  - Viral sinusitis (disorder)
  - Sinusitis (disorder)
  - Normal pregnancy
  - Headache (finding)
  - Sputum finding (finding)
  - Nausea (finding)
  - Vomiting symptom (finding)
  - Fever (finding)
  - Loss of taste (finding)
  - Suspected COVID-19
  - COVID-19
🔮 Top-3 Disease Risks:
  1. Respiratory distress (finding) (54.6%)
  2. Hypoxemia (disorder) (54.6%)
  3. Pneumonia (40.9%)
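
A single patient's Top-3 list says little about overall quality. One possible check, not part of this notebook, is a hit rate at k over the whole test set: the share of patients for whom at least one true label appears among the k highest-scoring diseases. The helper `hit_rate_at_k` and the small arrays below are hypothetical.

```python
import numpy as np

def hit_rate_at_k(proba, y_true, k=3):
    """Fraction of rows where at least one true label is among the k highest scores.

    Rows without any true label are skipped.
    """
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true)
    # Column indices of the k largest scores in each row
    topk = np.argsort(proba, axis=1)[:, -k:]
    hits, counted = 0, 0
    for row_top, row_true in zip(topk, y_true):
        if row_true.sum() == 0:
            continue
        counted += 1
        hits += int(row_true[row_top].any())
    return hits / counted if counted else float("nan")

# Made-up scores and labels for two patients and four diseases
proba = [[0.9, 0.1, 0.5, 0.2], [0.1, 0.2, 0.3, 0.8]]
y_true = [[0, 0, 1, 0], [1, 0, 0, 0]]
rate = hit_rate_at_k(proba, y_true, k=2)
print(rate)  # 0.5: the first patient is a hit, the second is a miss
```

To apply it to the saved model, the per-label outputs of `model.predict_proba(X_test)` would first have to be stacked into one matrix of positive-class probabilities, as `predict_top3_risiko` does for a single row.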
