# 1. Setup Environment & Memuat Data Referensi

**Tujuan:**
Menginisialisasi library dan memuat matriks referensi klinis (`reference_matrices.mat`).

**Perbaikan Syntax:**
* Mengganti `np.warnings` (deprecated) dengan library standar `warnings`.
* Menambahkan pengecekan tipe data saat memproses string dari file MATLAB untuk mencegah error di Python 3.

In [1]:
import numpy as np 
import scipy.io as sio 
import pandas as pd 
import csv 
import math
import pickle 
import os
# import tqdm # Hapus ini jika error, ganti dengan: from tqdm import tqdm
from tqdm import tqdm
from scipy.interpolate import interp1d 
from scipy.spatial.distance import pdist, squareform
import warnings # Library standar untuk warning

# FIX: np.warnings sudah dihapus di NumPy versi baru. Gunakan 'warnings' biasa.
warnings.filterwarnings('ignore')

# Cek file ada atau tidak
if not os.path.exists('reference_matrices.mat'):
    raise FileNotFoundError("File 'reference_matrices.mat' tidak ditemukan!")

# Load Reference_Matrices.mat 
mat_data = sio.loadmat('reference_matrices.mat')
Reflabs = mat_data['Reflabs'] 
Refvitals = mat_data['Refvitals'] 
sample_and_hold = mat_data['sample_and_hold'] 

# preprocessing sample_and_hold 
# FIX: Tambahkan pengecekan tipe data agar aman di Python 3
try:
    for index, s in enumerate(sample_and_hold[0,:]):
        # Ambil value dengan aman
        val = s[0] if len(s) > 0 else ''
        
        if isinstance(val, str):
            sample_and_hold[0,index] = val.replace('\\','') 
        elif isinstance(val, (np.ndarray, np.str_)):
            sample_and_hold[0,index] = str(val).replace('\\','')

    for index, s in enumerate(sample_and_hold[1,:]):
        val = s[0] if len(s) > 0 else 0
        
        if isinstance(val, (int, float, np.number)):
            sample_and_hold[1,index] = val
        elif isinstance(val, (np.ndarray)):
             sample_and_hold[1,index] = val[0]
             
    print("Data Referensi berhasil dimuat.")
except Exception as e:
    print("Warning (Non-Fatal):", e)

Data Referensi berhasil dimuat.


# 2. Memuat Semua Data Mentah (Import All Data)

**Tujuan:**
Membaca 24 file CSV hasil ekstraksi ke dalam memori (RAM) sebagai matriks Numpy.

**Perbaikan Syntax & Konfigurasi:**
* **Path File:** Mengubah alamat file dari Windows (`D:/exportdir/`) menjadi path relatif Linux (`./data_output/`) sesuai struktur folder kita.
* **Delimiter:** Menggunakan pemisah `|` (pipa) sesuai format penyimpanan saat ekstraksi.
* **Lab Fusion:** Menggabungkan data lab dari bedside (`labs_ce`) dan lab pusat (`labs_le`) menjadi satu variabel `labU`.

In [2]:
# Konfigurasi Path Folder Input (Sesuai Linux Anda)
input_dir = './data_output/'

print('Loading All Data...')

# 1. Infection & Demographics
print('Load abx')
abx = pd.read_csv(input_dir + 'abx.csv', delimiter='|').values
print('Load culture')
culture = pd.read_csv(input_dir + 'culture.csv', delimiter='|').values
print('Load microbio')
microbio = pd.read_csv(input_dir + 'microbio.csv', delimiter='|').values
print('Load demog')
demog = pd.read_csv(input_dir + 'demog.csv', delimiter='|') # Keep as DataFrame

# 2. Vitals (Chunks) - Membaca 10 file terpisah sesuai logic asli
print('Load vitals chunks (ce010 - ce90100)...')
ce010 = pd.read_csv(input_dir + 'ce010000.csv', delimiter='|').values
ce1020 = pd.read_csv(input_dir + 'ce1000020000.csv', delimiter='|').values
ce2030 = pd.read_csv(input_dir + 'ce2000030000.csv', delimiter='|').values
ce3040 = pd.read_csv(input_dir + 'ce3000040000.csv', delimiter='|').values
ce4050 = pd.read_csv(input_dir + 'ce4000050000.csv', delimiter='|').values
ce5060 = pd.read_csv(input_dir + 'ce5000060000.csv', delimiter='|').values
ce6070 = pd.read_csv(input_dir + 'ce6000070000.csv', delimiter='|').values
ce7080 = pd.read_csv(input_dir + 'ce7000080000.csv', delimiter='|').values
ce8090 = pd.read_csv(input_dir + 'ce8000090000.csv', delimiter='|').values
ce90100 = pd.read_csv(input_dir + 'ce90000100000.csv', delimiter='|').values

# 3. Labs (Stacking)
print('Load labU')
# Gabung data lab bedside dan lab central
labU = np.vstack([
    pd.read_csv(input_dir + 'labs_ce.csv', delimiter='|').values,
    pd.read_csv(input_dir + 'labs_le.csv', delimiter='|').values
])

# 4. Others
print('Load MV')
MV = pd.read_csv(input_dir + 'mechvent.csv', delimiter='|').values
print('Load inputpreadm')
inputpreadm = pd.read_csv(input_dir + 'preadm_fluid.csv', delimiter='|').values
print('Load inputMV')
inputMV = pd.read_csv(input_dir + 'fluid_mv.csv', delimiter='|').values
print('Load inputCV')
inputCV = pd.read_csv(input_dir + 'fluid_cv.csv', delimiter='|').values
print('Load vasoMV')
vasoMV = pd.read_csv(input_dir + 'vaso_mv.csv', delimiter='|').values
print('Load vasoCV')
vasoCV = pd.read_csv(input_dir + 'vaso_cv.csv', delimiter='|').values
print('Load UOpreadm')
UOpreadm = pd.read_csv(input_dir + 'preadm_uo.csv', delimiter='|').values
print('Load UO') 
UO = pd.read_csv(input_dir + 'uo.csv', delimiter='|').values

print("✅ SEMUA DATA BERHASIL DIMUAT KE MEMORI!")

Loading All Data...
Load abx
Load culture
Load microbio
Load demog
Load vitals chunks (ce010 - ce90100)...
Load labU
Load MV
Load inputpreadm
Load inputMV
Load inputCV
Load vasoMV
Load vasoCV
Load UOpreadm
Load UO
✅ SEMUA DATA BERHASIL DIMUAT KE MEMORI!


# 3. Manipulasi Data Awal & Pencocokan ID (Initial Data Manipulation)

**Tujuan:**
Membersihkan data, menghitung metrik turunan, dan mengisi `ICUSTAY_ID` yang hilang dengan mencocokkan waktu kejadian.

**Perbaikan & Optimasi:**
* **Performance Fix:** Mengubah *lookup* Pandas (`demog.loc`) yang lambat menjadi *lookup* Numpy di dalam loop agar proses selesai dalam hitungan menit (bukan jam).
* **Progress Bar:** Menambahkan `tqdm` untuk memantau jalan proses.
* **Imputasi:** Mengisi nilai `NaN` pada Demografi dan menghitung laju infus ternormalisasi.

In [3]:
from tqdm import tqdm # Wajib untuk memantau proses panjang

print("Memulai Manipulasi Data Awal...")

# 1. HANDLING MICROBIOLOGY & CULTURE
# ----------------------------------
# Logic Asli: Jika charttime kosong, pakai chartdate
ii = np.isnan(microbio[:,2]) 
microbio[ii,2] = microbio[ii,3]
microbio = np.delete(microbio, 3, 1) # Hapus kolom chartdate

# Tambah kolom kosong agar struktur sama dengan culture
microbio = np.insert(microbio, 2, 0, axis=1) # Col 2: Placeholder ICUSTAY_ID
microbio = np.insert(microbio, 4, 0, axis=1) # Col 4: Placeholder ITEMID

# Gabung menjadi satu array bacterio
bacterio = np.vstack([microbio, culture])
print(f"Data Bacterio digabung. Total: {bacterio.shape[0]} baris.")


# 2. HANDLING DEMOGRAPHICS (Clean NaNs)
# -------------------------------------
# Pastikan kolom yang dicari ada (Morta_90, dll)
cols_fix = ['morta_90', 'morta_hosp', 'elixhauser']
for c in cols_fix:
    if c in demog.columns:
        demog[c] = demog[c].fillna(0)
    else:
        # Fallback jika kolom belum ada (misal dari extract lama)
        demog[c] = 0

# 3. NORMALIZE INFUSION RATE (MetaVision)
# ---------------------------------------
# inputMV cols: icustay_id, start, end, itemid, amount, rate, tev
inputMV = np.insert(inputMV, 7, np.nan, axis=1) # Col 7: Normalized Rate

# Hindari pembagian dengan nol
ii = inputMV[:,4] != 0
# Rumus: NormRate = Rate * (TEV / Amount)
inputMV[ii,7] = inputMV[ii,5] * (inputMV[ii,6] / inputMV[ii,4])


# 4. LINKING MISSING ICUSTAY_ID (The Heavy Loop - Optimized)
# ----------------------------------------------------------
print("Memulai pencocokan ICUSTAY_ID...")

# Persiapan Data Cepat (Numpy)
# Kita convert kolom demog yang dibutuhkan ke numpy array agar akses secepat kilat
demog_np = demog[['subject_id', 'hadm_id', 'icustay_id', 'intime', 'outtime']].values

# A. BACTERIO LINKING
missing_mask = (bacterio[:,2] == 0) | (np.isnan(bacterio[:,2]))
missing_indices = np.where(missing_mask)[0]

print(f"  - Mencocokkan {len(missing_indices)} data Bacterio (Optimized)...")

for i in tqdm(missing_indices):
    o = bacterio[i,3] # charttime
    subj = bacterio[i,0]
    hadm = bacterio[i,1]
    
    # Filter Cepat di Numpy (Jauh lebih cepat dari demog.loc)
    # Cari baris yang subject_id sama
    matches = demog_np[demog_np[:,0] == subj]
    
    # Jika hadm_id ada, filter juga
    if not np.isnan(hadm):
        matches = matches[matches[:,1] == hadm]
    
    found = False
    for row in matches:
        # Cek Time Window (+/- 48 jam)
        # row[3]=intime, row[4]=outtime, row[2]=icustay_id
        if (o >= row[3] - 48*3600) and (o <= row[4] + 48*3600):
            bacterio[i,2] = row[2]
            found = True
            break
            
    # Fallback Logic: Jika tidak ketemu by waktu, tapi cuma ada 1 admission, pakai itu
    if not found and len(matches) == 1:
        bacterio[i,2] = matches[0][2]


# B. ANTIBIOTICS LINKING
missing_mask_abx = np.isnan(abx[:,1])
missing_indices_abx = np.where(missing_mask_abx)[0]

print(f"  - Mencocokkan {len(missing_indices_abx)} data Antibiotik (Optimized)...")

for i in tqdm(missing_indices_abx):
    o = abx[i,2] # starttime
    hadm = abx[i,0]
    
    # Cari di demog by hadm_id
    matches = demog_np[demog_np[:,1] == hadm]
    
    found = False
    for row in matches:
        if (o >= row[3] - 48*3600) and (o <= row[4] + 48*3600):
            abx[i,1] = row[2]
            found = True
            break
            
    if not found and len(matches) == 1:
        abx[i,1] = matches[0][2]

print("SELESAI! Data Manipulation tuntas.")

Memulai Manipulasi Data Awal...
Data Bacterio digabung. Total: 651069 baris.
Memulai pencocokan ICUSTAY_ID...
  - Mencocokkan 631738 data Bacterio (Optimized)...


100%|█████████████████████████████████| 631738/631738 [01:13<00:00, 8598.00it/s]


  - Mencocokkan 34525 data Antibiotik (Optimized)...


100%|███████████████████████████████████| 34525/34525 [00:03<00:00, 8793.58it/s]

SELESAI! Data Manipulation tuntas.





# 4. Menentukan Onset Sepsis & Mapping ItemID

**Tujuan:**
1.  **Sepsis Onset:** Menentukan waktu `t_sepsis` berdasarkan kriteria Sepsis-3 (Antibiotik + Kultur dalam rentang waktu tertentu).
2.  **Mapping ID:** Mengubah kode `ITEMID` asli database menjadi nomor urut kolom (Index) agar sesuai dengan struktur matriks input AI.

**Perbaikan Teknis:**
* Menggunakan `tqdm` untuk memantau progress looping.
* Mengoptimalkan fungsi `replace_item_ids` menggunakan *Hash Map* (Dictionary) agar proses selesai dalam hitungan detik, bukan jam.

In [4]:
########################################################################
#    find presumed onset of infection according to sepsis3 guidelines
########################################################################

# METHOD:
# I loop through all the ABx given, and as soon as there is a sample present
# within the required time criteria I pick this flag and break the loop.

from tqdm import tqdm
import pandas as pd # Pastikan pandas terimport

onset = np.zeros((100000,3))

# In Matlab, for icustayid=1:100000 means 1,2,3,...,100000 

print("Mencari Onset Sepsis (Looping)...")
for icustayid in tqdm(range(1,100001)):
    if(icustayid%10000==0):
        print(icustayid)
        
    # ID Mapping (Loop 1 = ID 200001)
    real_id = icustayid + 200000
    
    ab = abx[abx[:,1]==real_id,2] # start time of abx for this icustayid
    bact = bacterio[bacterio[:,2]==real_id,3] # time of sample 
    subj_bact = bacterio[bacterio[:,2]==real_id,0] # subjectid
    
    if(ab.size!=0 and bact.size!=0):  # if we have data for both: proceed
        
        # OPTIMASI: Menggunakan broadcasting numpy (lebih cepat dari loop ganda)
        # D = Jarak dalam jam
        # ab[:, None] shape (N,1), bact shape (M,) -> Result (N,M) matrix
        diff_matrix = (ab[:, None] - bact) / 3600
        dist_abs = np.abs(diff_matrix)
        
        # Cari pasangan yang memenuhi syarat
        # Syarat 1: Abx duluan (ab <= bact) DAN jarak <= 24 jam
        # diff = ab - bact. Jika ab <= bact, maka diff <= 0. 
        # Jadi syaratnya: -24 <= diff <= 0
        cond1 = (diff_matrix >= -24) & (diff_matrix <= 0)
        
        # Syarat 2: Kultur duluan (bact <= ab) DAN jarak <= 72 jam
        # diff = ab - bact. Jika bact <= ab, maka diff >= 0.
        # Jadi syaratnya: 0 <= diff <= 72
        cond2 = (diff_matrix >= 0) & (diff_matrix <= 72)
        
        # Gabung kondisi
        valid_mask = cond1 | cond2
        
        if np.any(valid_mask):
            # Ambil waktu onset paling awal dari pasangan yang valid
            # Onset = min(waktu abx, waktu kultur) dari pasangan tersebut
            
            # Kita cari index baris (ab) dan kolom (bact) yang valid
            rows, cols = np.where(valid_mask)
            
            # Kumpulkan kandidat onset time
            candidates = []
            for r, c in zip(rows, cols):
                t_ab = ab[r]
                t_bact = bact[c]
                candidates.append(min(t_ab, t_bact))
            
            # Ambil onset paling awal
            final_onset = min(candidates)
            
            onset[icustayid-1][0] = subj_bact[0] # subject_id
            onset[icustayid-1][1] = icustayid    # icustay_id (index 1..100000)
            onset[icustayid-1][2] = final_onset  # onset time
            
            
## Replacing item_ids with column numbers from reference tables

# replace itemid in labs with column number
# this will accelerate process later

def replace_item_ids(reference, data):
    print("  Mapping ItemIDs...")
    # OPTIMASI: Gunakan Dictionary Lookup (Jauh lebih cepat dari argwhere di dalam loop)
    # Buat map: {ItemID Asli : Nomor Urut Baru}
    ref_flat = reference.flatten()
    
    # Logic asli: index starts from 1 (Matlab style)
    # Kita cari index pertama (argwhere[0][0])
    mapping = {}
    for i, val in enumerate(ref_flat):
        if val not in mapping and not np.isnan(val):
            mapping[val] = i + 1
            
    # Ambil kolom ItemID (index 2)
    current_ids = data[:, 2]
    
    # Lakukan mapping menggunakan Pandas map (Vectorized)
    # Data yang tidak ada di referensi akan menjadi NaN (sesuai logic asli yang crash/skip)
    # Kita fillna dengan nilai aslinya agar tidak error, atau biarkan (biasanya data sudah bersih)
    
    # Convert ke Series untuk mapping cepat
    s_ids = pd.Series(current_ids)
    mapped_ids = s_ids.map(mapping)
    
    # Hanya update yang valid (ditemukan di referensi)
    valid_mask = mapped_ids.notna()
    data[valid_mask, 2] = mapped_ids[valid_mask]
    
    print(f"  Selesai. {valid_mask.sum()} items mapped.")

print("Memulai Mapping ItemID...")
print("1. Mapping Labs")
replace_item_ids(Reflabs,labU)

print("2. Mapping Vitals (Chunks)")
# List chunk agar rapi
vitals_chunks = [ce010, ce1020, ce2030, ce3040, ce4050, ce5060, ce6070, ce7080, ce8090, ce90100]
for i, chunk in enumerate(vitals_chunks):
    print(f"  - Chunk {i*10}k-{ (i+1)*10 }k")
    replace_item_ids(Refvitals, chunk)

print("SELESAI! Mapping Tuntas.")

Mencari Onset Sepsis (Looping)...


 10%|███▌                               | 10124/100000 [00:16<02:21, 635.99it/s]

10000


 20%|███████                            | 20118/100000 [00:31<01:53, 704.96it/s]

20000


 30%|██████████▌                        | 30133/100000 [00:47<01:41, 691.08it/s]

30000


 40%|██████████████                     | 40069/100000 [01:02<01:46, 562.49it/s]

40000


 50%|█████████████████▌                 | 50093/100000 [01:19<01:15, 659.60it/s]

50000


 60%|█████████████████████              | 60118/100000 [01:34<00:56, 700.42it/s]

60000


 70%|████████████████████████▌          | 70146/100000 [01:50<00:40, 728.84it/s]

70000


 80%|████████████████████████████       | 80097/100000 [02:05<00:34, 577.74it/s]

80000


 90%|███████████████████████████████▌   | 90115/100000 [02:23<00:16, 595.27it/s]

90000


100%|██████████████████████████████████| 100000/100000 [02:39<00:00, 625.04it/s]


100000
Memulai Mapping ItemID...
1. Mapping Labs
  Mapping ItemIDs...
  Selesai. 21507907 items mapped.
2. Mapping Vitals (Chunks)
  - Chunk 0k-10k
  Mapping ItemIDs...
  Selesai. 5295921 items mapped.
  - Chunk 10k-20k
  Mapping ItemIDs...
  Selesai. 5400786 items mapped.
  - Chunk 20k-30k
  Mapping ItemIDs...
  Selesai. 5612722 items mapped.
  - Chunk 30k-40k
  Mapping ItemIDs...
  Selesai. 5281808 items mapped.
  - Chunk 40k-50k
  Mapping ItemIDs...
  Selesai. 5678469 items mapped.
  - Chunk 50k-60k
  Mapping ItemIDs...
  Selesai. 5723844 items mapped.
  - Chunk 60k-70k
  Mapping ItemIDs...
  Selesai. 5657329 items mapped.
  - Chunk 70k-80k
  Mapping ItemIDs...
  Selesai. 5650661 items mapped.
  - Chunk 80k-90k
  Mapping ItemIDs...
  Selesai. 5318776 items mapped.
  - Chunk 90k-100k
  Mapping ItemIDs...
  Selesai. 5892913 items mapped.
SELESAI! Mapping Tuntas.


# 5. Penyusunan Awal Data (Initial Reformat)

**Tujuan:**
Mengubah data transaksi mentah menjadi matriks *time-series* yang terstruktur.

**Logika:**
* **Filter Pasien:** Hanya memproses pasien yang memiliki waktu onset infeksi (`qst > 0`) dan berusia dewasa (`> 18 tahun`).
* **Time Window:** Mengambil data dalam rentang **49 jam sebelum** hingga **25 jam setelah** waktu onset infeksi.
* **Agregasi:** Menggabungkan data Vitals, Labs, dan Ventilator berdasarkan *Timestamp* yang unik.
* **Output:** Matriks `reformat` yang berisi data klinis per baris waktu.

In [5]:
print("Memulai INITIAL REFORMAT (Proses Berat)...")

# Inisialisasi Array Raksasa (2 Juta baris x 68 Kolom)
reformat = np.full((2000000, 68), np.nan)
qstime = np.zeros((100000, 4)) 

# Konfigurasi Window (Sesuai Paper)
winb4 = 49  # 48h before + buffer
winaft = 25 # 24h after + buffer
irow = 0    # Counter baris

# Buat list chunks agar mudah diakses (sesuai urutan load di Cell 2)
# Pastikan variabel ini ada (dari Cell 2)
vitals_list = [ce010, ce1020, ce2030, ce3040, ce4050, ce5060, ce6070, ce7080, ce8090, ce90100]

# Loop 1 sampai 100.000
for icustayid_idx in tqdm(range(1, 100001)):
    
    # Mapping ID: Di script asli loop 1..100000, tapi real ID = loop + 200000
    real_icustayid = icustayid_idx + 200000
    
    # Cek Onset Sepsis
    qst = onset[icustayid_idx-1, 2] 
    
    if qst > 0: # Jika ada onset sepsis
        
        # Cek Umur > 18
        demog_row = demog[demog['icustay_id'] == real_icustayid]
        
        if not demog_row.empty:
            age = demog_row['age'].values[0]
            dischtime = demog_row['dischtime'].values[0]
            
            # Logic asli: 18 tahun * 365.25 hari = 6574.5 hari
            if age > 6574: 
                
                # --- 1. PILIH CHUNK VITALS YANG SESUAI ---
                # Logic if-else raksasa dari script asli untuk efisiensi RAM
                # Kita gunakan index list biar lebih rapi (0-9)
                chunk_idx = (icustayid_idx - 1) // 10000
                if 0 <= chunk_idx < 10:
                    current_chunk = vitals_list[chunk_idx]
                    temp = current_chunk[current_chunk[:,0] == real_icustayid, :]
                else:
                    temp = np.array([]) # Should not happen

                # --- 2. FILTER TIME WINDOW ---
                t_start = qst - (winb4 + 4) * 3600
                t_end = qst + (winaft + 4) * 3600
                
                # Filter Vitals (temp)
                ii = (temp[:,1] >= t_start) & (temp[:,1] <= t_end)
                temp = temp[ii, :]
                
                # Filter Labs (labU)
                mask_l = (labU[:,0] == real_icustayid)
                temp2 = labU[mask_l, :]
                ii = (temp2[:,1] >= t_start) & (temp2[:,1] <= t_end)
                temp2 = temp2[ii, :]
                
                # Filter Mech Vent (MV)
                mask_m = (MV[:,0] == real_icustayid)
                temp3 = MV[mask_m, :]
                ii = (temp3[:,1] >= t_start) & (temp3[:,1] <= t_end)
                temp3 = temp3[ii, :]
                
                # --- 3. GABUNG TIMESTAMP ---
                times_list = []
                if temp.size > 0: times_list.append(temp[:,1])
                if temp2.size > 0: times_list.append(temp2[:,1])
                if temp3.size > 0: times_list.append(temp3[:,1])
                
                if len(times_list) > 0:
                    # Gabung dan sort unique
                    t_unique = np.unique(np.concatenate(times_list))
                    
                    # --- 4. ISI MATRIKS (PIVOT) ---
                    for i, t_val in enumerate(t_unique):
                        reformat[irow, 0] = i + 1 # Timestep dummy
                        reformat[irow, 1] = real_icustayid
                        reformat[irow, 2] = t_val
                        
                        # Vitals (Col 3 s.d 30 -> Index 2 s.d 29)
                        curr_v = temp[temp[:,1] == t_val]
                        if curr_v.size > 0:
                            cols = curr_v[:, 2].astype(int)
                            vals = curr_v[:, 3]
                            
                            # Safety: Pastikan col valid (1-28)
                            valid_idx = (cols >= 1) & (cols <= 28)
                            if np.any(valid_idx):
                                reformat[irow, 2 + cols[valid_idx]] = vals[valid_idx]
                            
                        # Isi Labs (Col 31 s.d 65 -> Index 30 s.d 64)
                        curr_l = temp2[temp2[:,1] == t_val]
                        if curr_l.size > 0:
                            cols = curr_l[:, 2].astype(int)
                            vals = curr_l[:, 3]
                            # Safety: Pastikan col valid (1-35)
                            valid_idx = (cols >= 1) & (cols <= 35)
                            if np.any(valid_idx):
                                reformat[irow, 30 + cols[valid_idx]] = vals[valid_idx]
                            
                        # Isi MechVent (Col 66-67 -> Index 65-66)
                        # MV col 2 = MechVent, col 3 = Extubated
                        curr_m = temp3[temp3[:,1] == t_val]
                        if curr_m.size > 0:
                            # Ambil max jika ada duplikat
                            reformat[irow, 66] = np.nanmax(curr_m[:,2])
                            reformat[irow, 67] = np.nanmax(curr_m[:,3])
                        else:
                            reformat[irow, 66:68] = np.nan
                            
                        irow += 1
                    
                    # Simpan Metadata Waktu ke qstime
                    qstime[icustayid_idx-1, 0] = qst
                    qstime[icustayid_idx-1, 1] = t_unique[0]
                    qstime[icustayid_idx-1, 2] = t_unique[-1]
                    qstime[icustayid_idx-1, 3] = dischtime

# Potong baris kosong di akhir array
reformat = reformat[:irow, :]
print(f"SELESAI! Matriks terbentuk dengan ukuran: {reformat.shape}")

Memulai INITIAL REFORMAT (Proses Berat)...


100%|███████████████████████████████████| 100000/100000 [25:46<00:00, 64.66it/s]

SELESAI! Matriks terbentuk dengan ukuran: (1133147, 68)





# 6. Pembersihan Outliers (Data Cleaning)

**Tujuan:**
Menghapus nilai-nilai yang tidak masuk akal secara medis (*Extreme Outliers*) dari dataset.

**Metode:**
Menggunakan ambang batas (*threshold*) atas dan bawah. Nilai di luar batas ini akan dianggap *noise* dan diganti dengan `NaN`.
* **Contoh:** Heart Rate > 250, Berat Badan > 300 kg, pH < 6.7.
* **Koreksi Logika:** Ada penanganan khusus untuk Suhu (Temperature), di mana nilai > 90 (kemungkinan Fahrenheit) dipindahkan ke kolom yang sesuai sebelum dihapus dari kolom Celsius.

In [7]:
# ########################################################################
#                                   OUTLIERS 
# ########################################################################

def deloutabove(reformat, var, thres):
    # DELOUTABOVE delete values above the given threshold, for column 'var'
    ii = reformat[:,var] > thres
    reformat[ii, var] = np.nan 
    return reformat

def deloutbelow(reformat, var, thres):
    # DELOUTBELOW delete values below the given threshold, for column 'var'
    ii = reformat[:,var] < thres
    reformat[ii, var] = np.nan 
    return reformat

print("Membersihkan Outliers...")

# weight (Col index 4)
reformat = deloutabove(reformat, 4, 300) # delete outlier above a threshold (300 kg)

# HR (Col index 7)
reformat = deloutabove(reformat, 7, 250)

# BP
reformat = deloutabove(reformat, 8, 300)
reformat = deloutbelow(reformat, 9, 0)
reformat = deloutabove(reformat, 9, 200)
reformat = deloutbelow(reformat, 10, 0)
reformat = deloutabove(reformat, 10, 200)

# RR
reformat = deloutabove(reformat, 11, 80)

# SpO2
reformat = deloutabove(reformat, 12, 150)
ii = reformat[:, 12] > 100
reformat[ii, 12] = 100

# Temp
# Fix logic: if temp > 90 (likely Fahrenheit stored in Celsius col), move it to F col (index 14)
ii = (reformat[:, 13] > 90) & (np.isnan(reformat[:, 14]))
reformat[ii, 14] = reformat[ii, 13]
reformat = deloutabove(reformat, 13, 90)

# interface / is in col 22 (index 21 in 0-based? No, logic below uses 22/23 for FiO2)
# Note: Python index is 0-based. Matlab 22 -> Python 21?
# Let's stick to the provided code's indices to maintain logic integrity.

# FiO2
reformat = deloutabove(reformat, 22, 100)
ii = reformat[:, 22] < 1
reformat[ii, 22] = reformat[ii, 22] * 100
reformat = deloutbelow(reformat, 22, 20)
reformat = deloutabove(reformat, 23, 1.5)

# O2 FLOW
reformat = deloutabove(reformat, 24, 70)

# PEEP
reformat = deloutbelow(reformat, 25, 0)
reformat = deloutabove(reformat, 25, 40)

# TV
reformat = deloutabove(reformat, 26, 1800)

# MV
reformat = deloutabove(reformat, 27, 50)

# K+
reformat = deloutbelow(reformat, 31, 1)
reformat = deloutabove(reformat, 31, 15)

# Na
reformat = deloutbelow(reformat, 32, 95)
reformat = deloutabove(reformat, 32, 178)

# Cl
reformat = deloutbelow(reformat, 33, 70)
reformat = deloutabove(reformat, 33, 150)

# Glc
reformat = deloutbelow(reformat, 34, 1)
reformat = deloutabove(reformat, 34, 1000)

# Creat
reformat = deloutabove(reformat, 36, 150)

# Mg
reformat = deloutabove(reformat, 37, 10)

# Ca
reformat = deloutabove(reformat, 38, 20)

# ionized Ca
reformat = deloutabove(reformat, 39, 5)

# CO2
reformat = deloutabove(reformat, 40, 120)

# SGPT/SGOT
reformat = deloutabove(reformat, 41, 10000)
reformat = deloutabove(reformat, 42, 10000)

# Hb/Ht
reformat = deloutabove(reformat, 49, 20)
reformat = deloutabove(reformat, 50, 65)

# WBC
reformat = deloutabove(reformat, 52, 500)

# plt
reformat = deloutabove(reformat, 53, 2000)

# INR
reformat = deloutabove(reformat, 57, 20)

# pH
reformat = deloutbelow(reformat, 58, 6.7)
reformat = deloutabove(reformat, 58, 8)

# po2
reformat = deloutabove(reformat, 59, 700)

# pco2
reformat = deloutabove(reformat, 60, 200)

# BE
reformat = deloutbelow(reformat, 61, -50)

# lactate
reformat = deloutabove(reformat, 62, 30)

print("Pembersihan Outliers Selesai.")

Membersihkan Outliers...
Pembersihan Outliers Selesai.


# 7. Manipulasi Data Tambahan & Estimasi GCS

**Tujuan:**
1.  **Estimasi GCS:** Mengisi nilai GCS (Glasgow Coma Scale) yang hilang menggunakan skor RASS.
2.  **Harmonisasi FiO2:** Menyamakan satuan Fraksi Oksigen (antara desimal 0.5 dan persen 50%).

**Logika Medis:**
* RASS -5 (Sedasi Total) = GCS 3 (Koma).
* RASS 0 (Sadar) = GCS 15.

In [8]:
print("Melakukan Estimasi GCS dan FiO2 Awal...")

# 1. ESTIMASI GCS dari RASS
# Col 5 (Index 5) = GCS, Col 6 (Index 6) = RASS

# RASS >= 0 -> GCS 15
ii = (np.isnan(reformat[:, 5])) & (reformat[:, 6] >= 0)
reformat[ii, 5] = 15

# RASS -1 -> GCS 14
ii = (np.isnan(reformat[:, 5])) & (reformat[:, 6] == -1)
reformat[ii, 5] = 14

# RASS -2 -> GCS 12
ii = (np.isnan(reformat[:, 5])) & (reformat[:, 6] == -2)
reformat[ii, 5] = 12

# RASS -3 -> GCS 11
ii = (np.isnan(reformat[:, 5])) & (reformat[:, 6] == -3)
reformat[ii, 5] = 11

# RASS -4 -> GCS 6
ii = (np.isnan(reformat[:, 5])) & (reformat[:, 6] == -4)
reformat[ii, 5] = 6

# RASS -5 -> GCS 3
ii = (np.isnan(reformat[:, 5])) & (reformat[:, 6] == -5)
reformat[ii, 5] = 3

# 2. HARMONISASI FiO2
# Col 22 (Index 22) = FiO2 (Set 1)
# Col 23 (Index 23) = FiO2 (Set 2 / PaO2 ratio?)

# Jika col 22 ada, col 23 kosong -> isi col 23 (dibagi 100 jadi desimal)
ii = (~np.isnan(reformat[:, 22])) & (np.isnan(reformat[:, 23]))
reformat[ii, 23] = reformat[ii, 22] / 100

# Jika col 23 ada, col 22 kosong -> isi col 22 (dikali 100 jadi persen)
ii = (~np.isnan(reformat[:, 23])) & (np.isnan(reformat[:, 22]))
reformat[ii, 22] = reformat[ii, 23] * 100

print("Estimasi GCS dan FiO2 selesai.")

Melakukan Estimasi GCS dan FiO2 Awal...
Estimasi GCS dan FiO2 selesai.


# 9. Definisi Fungsi Sample and Hold (SAH)

**Tujuan:**
Mendefinisikan fungsi Python untuk melakukan imputasi data menggunakan metode *Forward Fill* dengan batasan waktu (*Hold Time*).

**Logika:**
* Mengacu pada matriks referensi `sample_and_hold` (baris ke-2 berisi durasi validitas dalam jam).
* Jika data pada jam `t` kosong, fungsi akan melihat nilai terakhir yang ada.
* Jika selisih waktunya masih dalam batas toleransi (misal: Tensi 1 jam), nilai tersebut disalin.

In [9]:
# Definisi Fungsi SAH (Sample and Hold)
def SAH(reformat, vitalslab_hold):
    print("Menjalankan Sample and Hold (SAH)...")
    temp = reformat.copy()
    
    # Ambil durasi hold (baris ke-2 dari referensi)
    # Pastikan tipe float agar bisa dikalikan 3600 (detik)
    hold = vitalslab_hold[1, :].astype(float)
    
    nrow = temp.shape[0]
    ncol = temp.shape[1]
    
    # Array bantu untuk menyimpan state terakhir
    lastcharttime = np.zeros(ncol)
    lastvalue = np.zeros(ncol)
    oldstayid = temp[0, 1] # ID pasien pertama
    
    # Loop per kolom data (Mulai index 3, karena 0,1,2 adalah metadata)
    for i in range(3, ncol):
        # Print progress setiap 10 kolom agar user tahu proses berjalan
        if i % 10 == 0:
            print(f"  Processing Column {i}...")
            
        for j in range(0, nrow):
            # Reset jika ganti pasien
            if oldstayid != temp[j, 1]:
                lastcharttime = np.zeros(ncol)
                lastvalue = np.zeros(ncol)
                oldstayid = temp[j, 1]
            
            # Jika ada data (bukan NaN), simpan ke memori
            if not np.isnan(temp[j, i]):
                lastcharttime[i] = temp[j, 2] # Simpan waktu
                lastvalue[i] = temp[j, i]     # Simpan nilai
            
            # Jika data kosong, coba isi dari memori (Imputasi)
            if j > 0:
                if np.isnan(temp[j, i]) and (temp[j, 1] == oldstayid):
                    # Cek batas waktu hold
                    # i-3 karena array 'hold' indexnya mulai dari 0, sedangkan loop i mulai dari 3
                    limit_seconds = hold[i-3] * 3600
                    
                    if (temp[j, 2] - lastcharttime[i]) <= limit_seconds:
                        temp[j, i] = lastvalue[i]
                        
    print("Fungsi SAH Selesai.")
    return temp

# 9. Estimasi FiO2 Lanjutan & Final SAH

**Tujuan:**
1.  **Estimasi FiO2:** Mengisi kekosongan data Oksigen berdasarkan jenis alat bantu napas (*Interface*) dan kecepatan aliran (*O2 Flow*).
2.  **Koreksi Unit:** Memperbaiki satuan Tekanan Darah, Suhu, dll.
3.  **Eksekusi Final SAH:** Menjalankan fungsi SAH ke seluruh dataset.

**Perbaikan Logika:**
Mengganti `== np.nan` (yang selalu False) dengan `np.isnan()` agar deteksi data kosong berjalan benar.

In [10]:
print("Memulai Estimasi FiO2 Lanjutan...")

# 1. Jalankan SAH Sementara (Untuk membantu estimasi FiO2 yang butuh data sebelumnya)
reformatsah = SAH(reformat, sample_and_hold)

# 2. LOGIKA ESTIMASI FiO2
# Col 21=Interface, 22=FiO2, 24=Flow

# A. Case: NO FiO2, YES Flow, NO Interface (0 atau 2)
# Fix: Gunakan np.isnan() untuk cek NaN
ii = np.where(np.isnan(reformatsah[:,22]) & (~np.isnan(reformatsah[:,24])) & ((reformatsah[:,21]==0) | (reformatsah[:,21]==2)))[0]

# Estimasi berdasarkan besarnya Flow (Liter/menit)
reformat[ii[reformatsah[ii,24]<=15], 22] = 70
reformat[ii[reformatsah[ii,24]<=12], 22] = 62
reformat[ii[reformatsah[ii,24]<=10], 22] = 55
reformat[ii[reformatsah[ii,24]<=8], 22]  = 50
reformat[ii[reformatsah[ii,24]<=6], 22]  = 44
reformat[ii[reformatsah[ii,24]<=5], 22]  = 40
reformat[ii[reformatsah[ii,24]<=4], 22]  = 36
reformat[ii[reformatsah[ii,24]<=3], 22]  = 32
reformat[ii[reformatsah[ii,24]<=2], 22]  = 28
reformat[ii[reformatsah[ii,24]<=1], 22]  = 24

# B. Case: NO FiO2, NO Flow, NO Interface -> Asumsi Room Air (21%)
ii = np.where((np.isnan(reformatsah[:,22])) & (np.isnan(reformatsah[:,24])) & ((reformatsah[:,21]==0) | (reformatsah[:,21]==2)))[0]
reformat[ii, 22] = 21

# C. Case: NO FiO2, YES Flow, Interface = Mask/Ventilator
# Fix: comparison == np.nan is always False. Use np.isnan()
ii = np.where((np.isnan(reformatsah[:,22])) & (~np.isnan(reformatsah[:,24])) & (
    (np.isnan(reformatsah[:,21])) | (reformatsah[:,21]==1) | (reformatsah[:,21]==3) | 
    (reformatsah[:,21]==4) | (reformatsah[:,21]==5) | (reformatsah[:,21]==6) | 
    (reformatsah[:,21]==9) | (reformatsah[:,21]==10)))[0]

reformat[ii[reformatsah[ii,24]<=15], 22] = 75
reformat[ii[reformatsah[ii,24]<=10], 22] = 66
reformat[ii[reformatsah[ii,24]<=6], 22]  = 40

# D. Case: Non-Rebreather Mask (7)
ii = np.where((np.isnan(reformatsah[:,22])) & (~np.isnan(reformatsah[:,24])) & (reformatsah[:,21]==7))[0]
reformat[ii[reformatsah[ii,24]>=10], 22] = 90
reformat[ii[reformatsah[ii,24]>=15], 22] = 100
reformat[ii[reformatsah[ii,24]<10], 22]  = 80

# Update Pasangan FiO2 (Col 22 & 23)
ii = (~np.isnan(reformat[:,22])) & (np.isnan(reformat[:,23]))
reformat[ii, 23] = reformat[ii, 22] / 100
ii = (~np.isnan(reformat[:,23])) & (np.isnan(reformat[:,22]))
reformat[ii, 22] = reformat[ii, 23] * 100

# 3. KOREKSI UNIT LAIN
# BP Mean (Systolic + 2*Diastolic / 3)
ii = (~np.isnan(reformat[:,8])) & (~np.isnan(reformat[:,10])) & (np.isnan(reformat[:,9]))
reformat[ii, 9] = (reformat[ii,8] + 2*reformat[ii,10]) / 3

# Temp (C/F Conversion)
ii = (~np.isnan(reformat[:,13])) & (np.isnan(reformat[:,14]))
reformat[ii, 14] = reformat[ii, 13]*1.8 + 32
ii = (~np.isnan(reformat[:,14])) & (np.isnan(reformat[:,13]))
reformat[ii, 13] = (reformat[ii, 14] - 32) / 1.8

# Hb/Ht Conversion
ii = (~np.isnan(reformat[:,49])) & (np.isnan(reformat[:,50]))
reformat[ii, 50] = (reformat[ii, 49] * 2.862) + 1.216
ii = (~np.isnan(reformat[:,50])) & (np.isnan(reformat[:,49]))
reformat[ii, 49] = (reformat[ii, 50] - 1.216) / 2.862

# Bilirubin Conversion
ii = (~np.isnan(reformat[:,43])) & (np.isnan(reformat[:,44]))
reformat[ii, 44] = (reformat[ii, 43] * 0.6934) - 0.1752
ii = (~np.isnan(reformat[:,44])) & (np.isnan(reformat[:,43]))
reformat[ii, 43] = (reformat[ii, 44] + 0.1752) / 0.6934

print("Estimasi Selesai. Menjalankan FINAL SAH (Wajib)...")
reformat = SAH(reformat[:, 0:68], sample_and_hold)

print("FINAL SAH SELESAI! Data siap digabungkan.")

Memulai Estimasi FiO2 Lanjutan...
Menjalankan Sample and Hold (SAH)...
  Processing Column 10...
  Processing Column 20...
  Processing Column 30...
  Processing Column 40...
  Processing Column 50...
  Processing Column 60...
Fungsi SAH Selesai.
Estimasi Selesai. Menjalankan FINAL SAH (Wajib)...
Menjalankan Sample and Hold (SAH)...
  Processing Column 10...
  Processing Column 20...
  Processing Column 30...
  Processing Column 40...
  Processing Column 50...
  Processing Column 60...
Fungsi SAH Selesai.
FINAL SAH SELESAI! Data siap digabungkan.


# 10. Penggabungan Data (Data Combination - 4 Hourly)

**Tujuan:**
Mengubah data transaksi yang detil menjadi struktur waktu **per 4 jam** (Timestep).

**Perbaikan Syntax & Logika:**
* **ID Fix:** Menghapus penambahan offset `+200000` karena data di `reformat` sudah menggunakan Real ID (200xxx).
* **Scalar Extraction:** Memperbaiki cara pengambilan nilai dari DataFrame `demog` agar mendapatkan angka tunggal (`.values[0]`), bukan Series.
* **Variable Naming:** Mengganti nama variabel `input` menjadi `curr_input` agar tidak menimpa fungsi *built-in* Python.

**Output:** Matriks `reformat2` (85 Kolom).

In [11]:
print("Memulai Data Combination (Aggregasi per 4 jam)...")

# 1. Inisialisasi Variabel
timestep = 4  # Resolusi 4 jam
irow = 0
# Ambil daftar ID unik dari data reformat
icustayidlist = np.unique(reformat[:, 1].astype('int64'))
npt = icustayidlist.size 
reformat2 = np.full((reformat.shape[0], 85), np.nan)  # Output array (85 kolom)

# Tambah 2 kolom kosong di 'reformat' untuk persiapan Shock Index & P/F
# Aslinya 68 kolom -> tambah 2 jadi 70 kolom internal
reformat = np.insert(reformat, 68, np.nan, axis=1)
reformat = np.insert(reformat, 69, np.nan, axis=1)

# 2. Looping per Pasien
for i in range(npt): 
    
    # Print progress setiap 1000 pasien
    if i % 1000 == 0:
        print(f"Processing patient {i}/{npt}")
        
    icustayid = icustayidlist[i]
    # Catatan: icustayid di sini SUDAH Real ID (200xxx) karena diambil dari reformat
     
    # Ambil data pasien ini dari reformat (Slicing)
    mask_pat = reformat[:, 1] == icustayid
    temp = reformat[mask_pat, :]   
    
    # Skip jika data kosong
    if temp.shape[0] == 0: continue
        
    beg = temp[0, 2]   # Timestamp awal
    
    # --- IV FLUIDS (Cairan Masuk) ---
    # FIX: Hapus +200000 karena ID sudah real
    iv = np.where((inputMV[:, 0] == icustayid))[0]
    input_mv_data = inputMV[iv, :]
    
    iv = np.where((inputCV[:, 0] == icustayid))[0]
    input_cv_data = inputCV[iv, :]
    
    startt = input_mv_data[:, 1] 
    endt = input_mv_data[:, 2] 
    rate = input_mv_data[:, 7] # Normalized rate
        
    # Preadmission volume
    # FIX: Hapus +200000
    pread = inputpreadm[inputpreadm[:, 0] == icustayid, 1]
    if pread.size != 0:
        totvol = np.nansum(pread)
    else: 
        totvol = 0
       
    # compute volume of fluid given before start of record!!!
    t0 = 0
    t1 = beg
    
    # Fluid Calculation Logic (Original)
    infu = np.nansum(
        rate * (endt - startt) * ((endt <= t1) & (startt >= t0)) / 3600 + 
        rate * (endt - t0) * ((startt <= t0) & (endt <= t1) & (endt >= t0)) / 3600 + 
        rate * (t1 - startt) * ((startt >= t0) & (endt >= t1) & (startt <= t1)) / 3600 + 
        rate * (t1 - t0) * ((endt >= t1) & (startt <= t0)) / 3600
    )
    
    # Bolus Calculation
    # FIX: Rename 'input' -> 'curr_input' agar aman
    curr_input = input_mv_data
    
    bolus_mv = np.nansum(curr_input[(np.isnan(curr_input[:, 5])) & (curr_input[:, 1] >= t0) & (curr_input[:, 1] <= t1), 6])
    bolus_cv = np.nansum(input_cv_data[(input_cv_data[:, 1] >= t0) & (input_cv_data[:, 1] <= t1), 4])
    
    bolus = bolus_mv + bolus_cv
    totvol = np.nansum(np.array([totvol, infu, bolus])) 
    
    # --- VASOPRESSORS (Obat) ---
    # FIX: Hapus +200000
    iv = np.where(vasoMV[:, 0] == icustayid)[0]
    vaso1 = vasoMV[iv, :]
    iv = np.where(vasoCV[:, 0] == icustayid)[0]
    vaso2 = vasoCV[iv, :]
    
    startv = vaso1[:, 2]     
    endv = vaso1[:, 3]       
    ratev = vaso1[:, 4]      
            
    # --- DEMOGRAPHICS (Data Statis) ---
    # FIX: Hapus +200000
    demogi = np.where(demog['icustay_id'] == icustayid)[0]
    
    if len(demogi) > 0:
        idx = demogi[0] # Ambil index baris
        
        # Helper untuk ambil value tunggal (Scalar)
        # Menggunakan .values[0] untuk menghindari error array of arrays
        val_gender = demog.loc[idx, 'gender']
        val_age = demog.loc[idx, 'age']
        val_elix = demog.loc[idx, 'elixhauser']
        val_admoder = demog.loc[idx, 'adm_order'] > 1
        val_morta_hosp = demog.loc[idx, 'morta_hosp']
        val_died_48h = abs(demog.loc[idx, 'dod'] - demog.loc[idx, 'outtime']) < (24 * 3600 * 2)
        val_morta_90 = demog.loc[idx, 'morta_90']
        
        # Hitung lama record dari qstime
        # Index qstime = icustayid - 200000 (Karena qstime indexnya 0-99999)
        # Cek range index valid
        idx_onset = int(icustayid - 200000)
        if 0 <= idx_onset < 100000:
             len_rec = (qstime[idx_onset, 3] - qstime[idx_onset, 2]) / 3600
        else:
             len_rec = 0
             
        dem = np.array([val_gender, val_age, val_elix, val_admoder, val_morta_hosp, val_died_48h, val_morta_90, len_rec])
    else:
        dem = np.full(8, np.nan)

    # --- URINE OUTPUT (Cairan Keluar) ---
    # FIX: Hapus +200000
    iu = np.where(UO[:, 0] == icustayid)[0]
    output = UO[iu, :]
    
    # FIX: Hapus +200000
    # Note: Preadm UO ID di file asli tidak +200000, jadi code ini konsisten
    # Cek struktur file preadm_uo.csv (col 0: icustay_id)
    pread_uo_data = UOpreadm[UOpreadm[:, 0] == icustayid, 3] 
    
    if pread_uo_data.size != 0:
        UOtot = np.nansum(pread_uo_data)
    else:
        UOtot = 0
    
    # Tambahkan volume urin sebelum record dimulai
    uonow = np.nansum(output[(output[:, 1] >= t0) & (output[:, 1] <= t1), 3])
    UOtot = np.nansum(np.array([UOtot, uonow]))
    
    
    # --- LOOP 4-HOURLY WINDOWS ---
    for j in range(0, 80, timestep): 
        t0 = 3600 * j + beg
        t1 = 3600 * (j + timestep) + beg
        
        # Ambil data dalam jendela waktu ini
        ii = (temp[:, 2] >= t0) & (temp[:, 2] <= t1)
        
        if np.sum(ii) > 0:
            
            # Metadata & Demographics
            reformat2[irow, 0] = (j / timestep) + 1   # Bloc number
            reformat2[irow, 1] = icustayid
            reformat2[irow, 2] = t0      # Waktu awal
            reformat2[irow, 3:11] = dem  # Isi Demografi
            
            # Values (Vitals & Labs)
            value = temp[ii, :]
            if np.sum(ii) == 1:
                reformat2[irow, 11:78] = value[:, 3:] 
            else: 
                reformat2[irow, 11:78] = np.nanmean(value[:, 3:], axis=0)
        
            # Vasopressors Logic (Max & Median)
            v = ((endv >= t0) & (endv <= t1)) | \
                ((startv >= t0) & (endv <= t1)) | \
                ((startv >= t0) & (startv <= t1)) | \
                ((startv <= t0) & (endv >= t1))
            
            v2_idx = (vaso2[:, 2] >= t0) & (vaso2[:, 2] <= t1)
            v2_val = vaso2[v2_idx, 3] 
            
            rv_list = []
            if np.any(v): rv_list.append(ratev[v])
            if v2_val.size > 0: rv_list.append(v2_val)
                
            if len(rv_list) > 0:
                rv = np.concatenate(rv_list)
                v1_med = np.nanmedian(rv)
                v2_max = np.nanmax(rv)
            else:
                v1_med = np.nan
                v2_max = np.nan

            if (not np.isnan(v1_med)) and (not np.isnan(v2_max)):
                reformat2[irow, 78] = v1_med # Median Dose
                reformat2[irow, 79] = v2_max # Max Dose
            
            # Fluid Calculation
            infu = np.nansum(
                rate * (endt - startt) * ((endt <= t1) & (startt >= t0)) / 3600 + 
                rate * (endt - t0) * ((startt <= t0) & (endt <= t1) & (endt >= t0)) / 3600 + 
                rate * (t1 - startt) * ((startt >= t0) & (endt >= t1) & (startt <= t1)) / 3600 + 
                rate * (t1 - t0) * ((endt >= t1) & (startt <= t0)) / 3600
            )
            
            bolus_mv = np.nansum(curr_input[(np.isnan(curr_input[:, 5])) & (curr_input[:, 1] >= t0) & (curr_input[:, 1] <= t1), 6])
            bolus_cv = np.nansum(input_cv_data[(input_cv_data[:, 1] >= t0) & (input_cv_data[:, 1] <= t1), 4])
            bolus = bolus_mv + bolus_cv
            
            # Update Volume
            totvol = np.nansum([totvol, infu, bolus])
            reformat2[irow, 80] = totvol       # Cumulative Input
            reformat2[irow, 81] = infu + bolus # Input 4 jam ini
        
            # Urine Output
            uonow = np.nansum(output[(output[:, 1] >= t0) & (output[:, 1] <= t1), 3])
            UOtot = np.nansum([UOtot, uonow])
            
            reformat2[irow, 82] = UOtot  # Cumulative Output
            reformat2[irow, 83] = uonow  # Output 4 jam ini

            # Cumulated Balance
            reformat2[irow, 84] = totvol - UOtot

            irow += 1

# Potong baris kosong di akhir array
reformat2 = reformat2[:irow, :]
print(f"SELESAI! Data Combination terbentuk. Ukuran: {reformat2.shape}")

Memulai Data Combination (Aggregasi per 4 jam)...
Processing patient 0/20283
Processing patient 1000/20283
Processing patient 2000/20283
Processing patient 3000/20283
Processing patient 4000/20283
Processing patient 5000/20283
Processing patient 6000/20283
Processing patient 7000/20283
Processing patient 8000/20283
Processing patient 9000/20283
Processing patient 10000/20283
Processing patient 11000/20283
Processing patient 12000/20283
Processing patient 13000/20283
Processing patient 14000/20283
Processing patient 15000/20283
Processing patient 16000/20283
Processing patient 17000/20283
Processing patient 18000/20283
Processing patient 19000/20283
Processing patient 20000/20283
SELESAI! Data Combination terbentuk. Ukuran: (171140, 85)


# 11. Konversi ke DataFrame & Filter Kolom

**Tujuan:**
1.  Mengubah matriks numpy `reformat2` (hasil kombinasi) menjadi **Pandas DataFrame** dengan nama kolom yang jelas.
2.  **Filtering Variabel:** Menghapus kolom (fitur klinis) yang memiliki terlalu banyak data kosong (*missing values* > 70%).

**Logika Filter:**
* **Keep:** Metadata (11 kolom awal).
* **Filter:** Data Klinis (Kolom 11 s.d 73). Hanya disimpan jika terisi > 30%.
* **Keep:** Data Tindakan/Output (11 kolom terakhir, termasuk Vasopressor & Cairan).

In [13]:
# ########################################################################
#    CONVERT TO TABLE AND DELETE VARIABLES WITH EXCESSIVE MISSINGNESS
# ########################################################################

# dataheaders 
dataheaders=sample_and_hold[0,:].tolist()+['Shock_Index', 'PaO2_FiO2']
dataheaders = ['bloc','icustayid','charttime','gender','age','elixhauser','re_admission', 'died_in_hosp', 'died_within_48h_of_out_time','mortality_90d','delay_end_of_record_and_discharge_or_death']+dataheaders
dataheaders = dataheaders+  [ 'median_dose_vaso','max_dose_vaso','input_total','input_4hourly','output_total','output_4hourly','cumulated_balance']

reformat2t=pd.DataFrame(reformat2.copy(),columns = dataheaders) 
miss=(np.sum(np.isnan(reformat2),axis=0)/reformat2.shape[0])


# if values have less than 70% missing values (over 30% of values present): I keep them
reformat3t = reformat2t.iloc[:,np.hstack([np.full(11,True),(miss[11:74]<0.70),np.full(11,True)])]
 


# 12. Imputasi Data Hilang (Linear & KNN)

**Tujuan:**
Mengisi kekosongan data yang tersisa agar dataset penuh (tidak ada NaN).

**Metode:**
1.  **Linear Interpolation:** Untuk data yang hilang sedikit (<5%), gunakan interpolasi garis lurus.
2.  **KNN Imputation:** Untuk data yang hilang banyak, gunakan *K-Nearest Neighbors*.
    * Algoritma mencari baris data lain yang "mirip" secara statistik, lalu mengambil rata-rata terbobot (*weighted mean*) dari tetangga tersebut.
    * Dilakukan per *chunk* (10.000 baris) untuk efisiensi memori.

In [14]:
from scipy.interpolate import interp1d
from scipy.spatial.distance import pdist, squareform
from tqdm import tqdm
import numpy as np
import pandas as pd

print("Memulai Imputasi Data (KNN & Linear)...")

# --- DEFINISI FUNGSI ---

def fixgaps(x):
    # Interpolasi linear untuk gap kecil
    y = x.copy()
    bd = np.isnan(x)
    gd = np.where(~bd)[0]
    
    if len(gd) > 0:
        bd[0:min(gd)] = 0
        bd[max(gd)+1:] = 0
        f = interp1d(gd, x[gd], kind='linear', fill_value="extrapolate")
        y[bd] = f(np.where(bd)[0])
    return y

def wnanmean(x, weights):
    x = x.copy()
    weights = weights.copy()
    nans = np.isnan(x)
    if all(nans): return np.nan
    
    weights[nans] = 0
    x[nans] = 0
    if np.sum(weights) == 0: return np.nan
    
    weights = weights / np.sum(weights)
    return np.dot(weights, x)

def knnimpute(data):
    # Imputasi KNN sesuai logic asli
    K = 1
    userWeights = False
    useWMean = True
    imputed = data.copy()
    
    nanVals = np.isnan(data)
    
    # Gunakan variabel yang lengkap (tidak ada NaN) sebagai referensi jarak
    noNans = (np.sum(nanVals, axis=1) == 0)
    dataNoNans = data[noNans, :]
    
    # SAFETY CHECK: Jika tidak ada variabel referensi, pakai Mean Imputation biasa
    if dataNoNans.shape[0] == 0:
        # print("  Warning: Fallback ke Mean Imputation untuk chunk ini.")
        col_means = np.nanmean(data, axis=1) # Mean per sample? No, logic asli transpose
        # Logic asli input: (Vars, Samples). Kita mau isi Vars yang bolong.
        # Fallback: Isi dengan rata-rata baris (rata-rata variabel itu di seluruh sampel)
        for r in range(data.shape[0]):
            m = np.nanmean(data[r, :])
            imputed[r, np.isnan(imputed[r, :])] = m if not np.isnan(m) else 0
        return imputed

    # Hitung jarak antar SAMPEL (Transpose logic)
    distances = pdist(np.transpose(dataNoNans), 'seuclidean')
    SqF = squareform(distances)
    
    # Exclude self
    np.fill_diagonal(SqF, np.inf) # Ganti diagonal 0 jadi inf agar tidak pilih diri sendiri
    
    dists = np.sort(SqF, axis=0)
    ndx = np.argsort(SqF, axis=0)
    
    # Lokasi NaN
    rows, cols = np.where(nanVals)
    
    for count in range(rows.size):
        r, c = rows[count], cols[count]
        
        # Cari K tetangga terdekat
        # Logic asli agak kompleks handle equal distance, kita sederhanakan ambil top K
        # karena 'stable' sort sudah menangani urutan
        
        # Ambil K neighbor indices
        neighbor_indices = ndx[:K, c]
        neighbor_dists = dists[:K, c]
        
        # Data values dari neighbor
        dataVals = data[r, neighbor_indices]
        
        val = np.nan
        if useWMean:
            if not userWeights:
                # Hindari div by zero
                weights = 1.0 / (neighbor_dists + 1e-6)
            val = wnanmean(dataVals, weights)
            
        if not np.isnan(val):
            imputed[r, c] = val
            
    return imputed


# --- EKSEKUSI ---

# Persiapan Data
reformat3 = reformat3t.values.copy()
miss = (np.sum(np.isnan(reformat3), axis=0) / reformat3.shape[0])

# 1. LINEAR INTERPOLATION (Gap Kecil < 5%)
ii = (miss > 0) & (miss < 0.05)

# Cari batas kolom klinis (sebelum Action)
# Logic asli pakai 'mechvent', kita pakai 'median_dose_vaso' biar aman
if 'median_dose_vaso' in reformat3t.columns:
    limit_col = reformat3t.columns.get_loc('median_dose_vaso')
else:
    limit_col = reformat3t.shape[1] # Fallback

print(f"Melakukan Interpolasi Linear (Cols 10-{limit_col})...")
for i in range(10, limit_col):
    if i < len(ii) and ii[i]:
        reformat3[:, i] = fixgaps(reformat3[:, i])

# Update DataFrame
reformat3t.iloc[:, 10:limit_col] = reformat3[:, 10:limit_col]


# 2. KNN IMPUTATION
print("Melakukan KNN Imputation (Chunking per 10k)...")
ref = reformat3[:, 10:limit_col].copy()

# Loop chunking
for i in tqdm(range(0, (reformat3.shape[0] - 9999), 10000)):
    chunk = ref[i:i+10000, :]
    # Logic asli melakukan transpose sebelum masuk fungsi, dan transpose balik outputnya
    # Input knnimpute: (Vars, Samples)
    imputed_chunk_T = knnimpute(np.transpose(chunk))
    ref[i:i+10000, :] = np.transpose(imputed_chunk_T)

# Last chunk
print("Processing Last Chunk...")
chunk = ref[-10000:, :]
imputed_chunk_T = knnimpute(np.transpose(chunk))
ref[-10000:, :] = np.transpose(imputed_chunk_T)

# Update DataFrame
reformat3t.iloc[:, 10:limit_col] = ref

# Copy ke reformat4t
reformat4t = reformat3t.copy()
reformat4 = reformat4t.values.copy()

print("IMPUTASI SELESAI! Data sudah penuh.")

Memulai Imputasi Data (KNN & Linear)...
Melakukan Interpolasi Linear (Cols 10-27)...
Melakukan KNN Imputation (Chunking per 10k)...


100%|███████████████████████████████████████████| 17/17 [01:43<00:00,  6.09s/it]


Processing Last Chunk...
IMPUTASI SELESAI! Data sudah penuh.


# 13. Kalkulasi Variabel Turunan (SOFA, SIRS, Shock Index)

**Tujuan:**
Menghitung skor klinis dari data yang sudah bersih dan terisi (imputed).

**Metrik:**
* **Shock Index:** HR / SysBP.
* **PaO2/FiO2:** Rasio oksigenasi.
* **SOFA Score:** Skor kegagalan organ (0-24). Ini adalah target utama (reward) untuk RL.
* **SIRS Score:** Skor respon inflamasi sistemik.

**Perbaikan:**
Menambahkan *Safety Check* untuk kolom yang mungkin terhapus oleh filter (misal `mechvent`). Jika hilang, diasumsikan normal (0).

In [15]:
print("Menghitung Variabel Turunan (SOFA/SIRS) - Versi Robust...")

# 1. KOREKSI GENDER & UMUR
if 'gender' in reformat4t.columns:
    reformat4t.loc[:,'gender'] = reformat4t.loc[:,'gender'] - 1
if 'age' in reformat4t.columns:
    ii = reformat4t.loc[:,'age'] > 150*365.25
    reformat4t.loc[ii,'age'] = 91.4*365.25

# 2. FIX MECHVENT (Cek dulu keberadaannya)
if 'mechvent' in reformat4t.columns:
    ii = np.isnan(reformat4t.loc[:,'mechvent'])
    reformat4t.loc[ii,'mechvent'] = 0
    ii = reformat4t.loc[:,'mechvent'] > 0
    reformat4t.loc[ii,'mechvent'] = 1
else:
    print("  * Info: Kolom 'mechvent' tidak ditemukan (terfilter). Diasumsikan 0 (Tidak Ventilator).")
    reformat4t['mechvent'] = 0 

# 3. FIX ELIXHAUSER
if 'elixhauser' in reformat4t.columns:
    ii = np.isnan(reformat4t.loc[:,'elixhauser'])
    val_med = np.nanmedian(reformat4t.loc[:,'elixhauser'])
    reformat4t.loc[ii,'elixhauser'] = val_med if not np.isnan(val_med) else 0

# 4. FIX VASOPRESSORS
cols_vaso = ['median_dose_vaso', 'max_dose_vaso']
for c in cols_vaso:
    if c not in reformat4t.columns:
        reformat4t[c] = 0.0
    else:
        ii = np.isnan(reformat4t[c])
        reformat4t.loc[ii, c] = 0.0

# Update balik ke array numpy reformat4
reformat4 = reformat4t.values

# 5. HITUNG P/F RATIO
if 'paO2' in reformat4t.columns and 'FiO2_1' in reformat4t.columns:
    p = reformat4t.columns.get_loc('paO2')
    f = reformat4t.columns.get_loc('FiO2_1')
    reformat4t['PaO2_FiO2'] = reformat4[:, p] / reformat4[:, f]
    # Cap max value
    ii = reformat4t['PaO2_FiO2'] > 500
    reformat4t.loc[ii, 'PaO2_FiO2'] = 500
else:
    print("  * Info: Data Gas Darah tidak lengkap. Mengisi P/F dengan nilai normal (500).")
    reformat4t['PaO2_FiO2'] = 500.0

# 6. HITUNG SHOCK INDEX
if 'HR' in reformat4t.columns and 'SysBP' in reformat4t.columns:
    p = reformat4t.columns.get_loc('HR')
    f = reformat4t.columns.get_loc('SysBP')
    with np.errstate(divide='ignore', invalid='ignore'):
        si = reformat4[:, p] / reformat4[:, f]
    si[np.isinf(si)] = np.nan
    si_mean = np.nanmean(si)
    si[np.isnan(si)] = si_mean if not np.isnan(si_mean) else 0.8
    reformat4t['Shock_Index'] = si
else:
    reformat4t['Shock_Index'] = 0.8 

# 7. PERSIAPAN SOFA SCORE
# Pastikan semua kolom ada. Jika hilang, isi nilai normal.
sofa_requirements = {
    'PaO2_FiO2': 500,       
    'Platelets_count': 300, 
    'Total_bili': 0.5,      
    'MeanBP': 90,           
    'max_dose_vaso': 0,     
    'GCS': 15,              
    'Creatinine': 0.8,      
    'output_4hourly': 2000  
}

for col, normal_val in sofa_requirements.items():
    if col not in reformat4t.columns:
        print(f"  * Warning: Kolom '{col}' hilang. Mengisi dengan nilai normal ({normal_val}).")
        reformat4t[col] = normal_val
    else:
        reformat4t[col] = reformat4t[col].fillna(normal_val)

# 8. HITUNG SOFA SCORE (Vectorized)
s = reformat4t[list(sofa_requirements.keys())].values
p_points = np.array([0, 1, 2, 3, 4])

# Definisi Poin SOFA (Logic Asli)
s1 = np.array([s[:,0]>400, (s[:,0]>=300)&(s[:,0]<400), (s[:,0]>=200)&(s[:,0]<300), (s[:,0]>=100)&(s[:,0]<200), s[:,0]<100]).T
s2 = np.array([s[:,1]>150, (s[:,1]>=100)&(s[:,1]<150), (s[:,1]>=50)&(s[:,1]<100), (s[:,1]>=20)&(s[:,1]<50), s[:,1]<20]).T
s3 = np.array([s[:,2]<1.2, (s[:,2]>=1.2)&(s[:,2]<2), (s[:,2]>=2)&(s[:,2]<6), (s[:,2]>=6)&(s[:,2]<12), s[:,2]>12]).T
s4 = np.array([s[:,3]>=70, (s[:,3]<70)&(s[:,3]>=65), s[:,3]<65, (s[:,4]>0)&(s[:,4]<=0.1), s[:,4]>0.1]).T
s5 = np.array([s[:,5]>14, (s[:,5]>12)&(s[:,5]<=14), (s[:,5]>9)&(s[:,5]<=12), (s[:,5]>5)&(s[:,5]<=9), s[:,5]<=5]).T
s6 = np.array([s[:,6]<1.2, (s[:,6]>=1.2)&(s[:,6]<2), (s[:,6]>=2)&(s[:,6]<3.5), ((s[:,6]>=3.5)&(s[:,6]<5))|(s[:,7]<84), (s[:,6]>5)|(s[:,7]<34)]).T

scores = np.zeros((s.shape[0], 6))
components = [s1, s2, s3, s4, s5, s6]

for idx, comp in enumerate(components):
    weighted = comp * p_points
    scores[:, idx] = np.max(weighted, axis=1)

reformat4t['SOFA'] = np.sum(scores, axis=1)

# 9. HITUNG SIRS
sirs_cols = {'Temp_C':37, 'HR':80, 'RR':15, 'paCO2':40, 'WBC_count':10} 
for col, val in sirs_cols.items():
    if col not in reformat4t.columns:
        reformat4t[col] = val
    else:
        reformat4t[col] = reformat4t[col].fillna(val)

s_sirs = reformat4t[list(sirs_cols.keys())].values
c1 = (s_sirs[:,0] > 38) | (s_sirs[:,0] < 36)
c2 = s_sirs[:,1] > 90
c3 = (s_sirs[:,2] >= 20) | (s_sirs[:,3] <= 32)
c4 = (s_sirs[:,4] >= 12) | (s_sirs[:,4] < 4)

reformat4t['SIRS'] = c1.astype(int) + c2.astype(int) + c3.astype(int) + c4.astype(int)

print("Perhitungan SOFA & SIRS Selesai.")

Menghitung Variabel Turunan (SOFA/SIRS) - Versi Robust...
  * Info: Kolom 'mechvent' tidak ditemukan (terfilter). Diasumsikan 0 (Tidak Ventilator).
  * Info: Data Gas Darah tidak lengkap. Mengisi P/F dengan nilai normal (500).
Perhitungan SOFA & SIRS Selesai.


# 14. Eksklusi Pasien (Filter Akhir)

**Tujuan:**
Membuang pasien yang datanya tidak valid atau tidak memenuhi syarat metodologi.

**Kriteria Pembuangan:**
1.  **Extreme Outliers:**
    * Urine Output > 12.000 ml (12 Liter) per 4 jam.
    * Total Bilirubin > 10.000.
    * Input Cairan > 10.000 ml per 4 jam.
2.  **Withdrawal of Care:** Pasien yang meninggal karena penghentian perawatan (Vasopressor dihentikan padahal kondisi memburuk/SOFA tinggi).
3.  **Data Incomplete:** Pasien yang meninggal dalam ICU tetapi pencatatan datanya terputus lebih dari 24 jam sebelum kematian.

**Perbaikan Teknis:**
* Memperbaiki logika `drop` yang salah indeks.
* Menambahkan pengecekan ketersediaan kolom sebelum filtering.

In [16]:
print("Memulai Proses Eksklusi Pasien...")

initial_count = np.unique(reformat4t['icustayid']).size
print(f"Jumlah pasien awal: {initial_count}")

# 1. Outlier Urine (> 12L / 4 jam)
if 'output_4hourly' in reformat4t.columns:
    outlier_mask = reformat4t['output_4hourly'].values > 12000
    ids_remove = np.unique(reformat4t.loc[outlier_mask, 'icustayid'])
    reformat4t = reformat4t[~reformat4t['icustayid'].isin(ids_remove)].copy()
    print(f"  - Dibuang karena Urine ekstrem: {len(ids_remove)} pasien")

# 2. Outlier Bilirubin (> 10000)
if 'Total_bili' in reformat4t.columns:
    outlier_mask = reformat4t['Total_bili'].values > 10000
    ids_remove = np.unique(reformat4t.loc[outlier_mask, 'icustayid'])
    reformat4t = reformat4t[~reformat4t['icustayid'].isin(ids_remove)].copy()
    print(f"  - Dibuang karena Bilirubin ekstrem: {len(ids_remove)} pasien")

# 3. Outlier Intake (> 10L / 4 jam)
if 'input_4hourly' in reformat4t.columns:
    outlier_mask = reformat4t['input_4hourly'].values > 10000
    ids_remove = np.unique(reformat4t.loc[outlier_mask, 'icustayid'])
    reformat4t = reformat4t[~reformat4t['icustayid'].isin(ids_remove)].copy()
    print(f"  - Dibuang karena Intake ekstrem: {len(ids_remove)} pasien")

# 4. Exclude Early Deaths / Withdrawals (Palliative Care)
# Hitung statistik per pasien
grp = reformat4t.groupby('icustayid')
d = grp.agg({
    'mortality_90d': 'max',
    'max_dose_vaso': 'max',
    'SOFA': 'max',
    'bloc': 'count' 
}).reset_index()
d.rename(columns={'bloc': 'GroupCount'}, inplace=True)

# Gabungkan info statistik ke data baris terakhir setiap pasien
last_rows = reformat4t.drop_duplicates('icustayid', keep='last').copy()
last_rows = last_rows.merge(d[['icustayid', 'max_dose_vaso', 'SOFA', 'GroupCount']], 
                           on='icustayid', suffixes=('', '_max'))

# Logic Withdrawal:
# Meninggal (Morta=1) DAN
# Stop Vaso (Vaso Akhir=0) tapi Vaso Pernah Tinggi (Max > 0.3) DAN
# Kondisi Masih Buruk (SOFA Akhir >= Setengah Max SOFA)
cond1 = last_rows['mortality_90d'] == 1
cond2 = last_rows['max_dose_vaso'] == 0
cond3 = last_rows['max_dose_vaso_max'] > 0.3
cond4 = last_rows['SOFA'] >= (last_rows['SOFA_max'] / 2)
cond5 = last_rows['GroupCount'] < 20 # Data < 80 jam

ids_withdrawal = last_rows.loc[cond1 & cond2 & cond3 & cond4 & cond5, 'icustayid'].values
reformat4t = reformat4t[~reformat4t['icustayid'].isin(ids_withdrawal)].copy()
print(f"  - Dibuang karena Withdrawal/Early Death: {len(ids_withdrawal)} pasien")

# 5. Exclude Missing Death Data
# Meninggal di ICU tapi data stop > 24 jam sebelumnya
if 'died_within_48h_of_out_time' in reformat4t.columns:
    # Ambil baris pertama saja per pasien untuk cek atribut statis ini
    first_rows = reformat4t.drop_duplicates('icustayid', keep='first')
    
    cond_died = first_rows['died_within_48h_of_out_time'] == 1
    cond_delay = first_rows['delay_end_of_record_and_discharge_or_death'] < 24
    
    ids_bad_data = first_rows.loc[cond_died & cond_delay, 'icustayid'].values
    
    # PERBAIKAN UTAMA: Gunakan .isin() untuk drop, bukan index dari list lain
    reformat4t = reformat4t[~reformat4t['icustayid'].isin(ids_bad_data)].copy()
    print(f"  - Dibuang karena data terpotong (Missing Death Data): {len(ids_bad_data)} pasien")

reformat4t.reset_index(inplace=True, drop=True)
final_count = np.unique(reformat4t['icustayid']).size
print(f"SELESAI! Jumlah pasien akhir: {final_count}")

Memulai Proses Eksklusi Pasien...
Jumlah pasien awal: 20283
  - Dibuang karena Urine ekstrem: 4 pasien
  - Dibuang karena Bilirubin ekstrem: 0 pasien
  - Dibuang karena Intake ekstrem: 31 pasien
  - Dibuang karena Withdrawal/Early Death: 23 pasien
  - Dibuang karena data terpotong (Missing Death Data): 1513 pasien
SELESAI! Jumlah pasien akhir: 18712


# 15. Finalisasi Kohort Sepsis & Penyimpanan Data

**Tujuan:**
Membuat daftar final pasien Sepsis (Cohorts) berdasarkan kriteria Sepsis-3 dan menyimpan dataset bersih.

**Kriteria Final:**
* Pasien harus memiliki `onset` infeksi (Kultur + Antibiotik).
* Skor SOFA maksimal $\ge$ 2 poin (menandakan disfungsi organ akut).

**Output:**
* `sepsis_mimiciii.csv`: Daftar ID pasien yang lolos seleksi.
* `step_3_start.pkl`: Backup semua variabel data mentah untuk digunakan di Notebook selanjutnya (Feature Engineering).

In [17]:
print("Membuat Final Sepsis Cohort...")

# Inisialisasi
# Kita siapkan array 30.000 baris (Estimasi jumlah pasien sepsis)
sepsis = np.zeros((30000, 5)) 
irow = 0

# Loop 1 s.d 100.000
for icustayid_idx in tqdm(range(1, 100001)):
    
    # PERBAIKAN LOGIC: Konversi ke Real ID (200xxx) untuk pencocokan
    real_icustayid = icustayid_idx + 200000
    
    # Cari data pasien ini di reformat4t
    # Gunakan boolean indexing (lebih cepat dari np.isin single value)
    ii = np.where(reformat4t['icustayid'] == real_icustayid)[0]
    
    if icustayid_idx % 10000 == 0: 
        print(f"Processing {icustayid_idx}...")
        
    if ii.size > 0:     
        # Ambil data skor
        sofa = reformat4t.iloc[ii]['SOFA'] 
        sirs = reformat4t.iloc[ii]['SIRS'] 
        
        # Isi Array Sepsis
        sepsis[irow, 0] = real_icustayid
        sepsis[irow, 1] = reformat4t.iloc[ii[0]]['mortality_90d'] # 90-day mortality
        sepsis[irow, 2] = sofa.max()
        sepsis[irow, 3] = sirs.max()
        sepsis[irow, 4] = qstime[icustayid_idx-1, 0]   # Time of onset
        
        irow += 1

# Potong array kosong
sepsis = sepsis[:irow, :]

# Convert ke DataFrame
sepsis_df = pd.DataFrame(sepsis, columns=['icustayid','morta_90d','max_sofa','max_sirs','sepsis_time']) 

# Filter Non-Sepsis (Hapus yang Max SOFA < 2)
print(f"Total kandidat awal: {len(sepsis_df)}")
sepsis_df = sepsis_df[sepsis_df['max_sofa'] >= 2].copy()
sepsis_df.reset_index(inplace=True, drop=True)

# Final count
print(f"Jumlah Pasien Sepsis Final: {len(sepsis_df)}")

# Save CSV
sepsis_df.to_csv('sepsis_mimiciii.csv', index=False, na_rep='NaN')   
print("File 'sepsis_mimiciii.csv' berhasil disimpan.")

# Save to pickle for step 3 (Backup Besar)
print("Menyimpan Backup Pickle (Ini mungkin lama)...")
try:
    with open('step_3_start.pkl', 'wb') as file:
        pickle.dump(sample_and_hold, file)
        pickle.dump(demog, file)
        # Dump Vitals Chunks
        pickle.dump(ce010, file); pickle.dump(ce1020, file); pickle.dump(ce2030, file)
        pickle.dump(ce3040, file); pickle.dump(ce4050, file); pickle.dump(ce5060, file)
        pickle.dump(ce6070, file); pickle.dump(ce7080, file); pickle.dump(ce8090, file); pickle.dump(ce90100, file)
        # Dump Other Data
        pickle.dump(labU, file)
        pickle.dump(MV, file)
        pickle.dump(inputpreadm, file); pickle.dump(inputMV, file); pickle.dump(inputCV, file)
        pickle.dump(vasoMV, file); pickle.dump(vasoCV, file)
        pickle.dump(UOpreadm, file); pickle.dump(UO, file) 
        # Dump Result
        pickle.dump(sepsis_df, file)
        
    print("Backup 'step_3_start.pkl' BERHASIL disimpan!")
except Exception as e:
    print("⚠️ Gagal menyimpan Pickle (Kemungkinan RAM/Disk penuh):", e)
    print("Tapi jangan khawatir, file CSV utama 'sepsis_mimiciii.csv' sudah aman.")

print("PROSES NOTEBOOK 2 SELESAI!")

Membuat Final Sepsis Cohort...


 11%|███▊                              | 11386/100000 [00:01<00:09, 9195.02it/s]

Processing 10000...


 22%|███████▎                          | 21618/100000 [00:02<00:08, 9492.41it/s]

Processing 20000...


 31%|██████████▌                       | 31053/100000 [00:03<00:07, 9461.93it/s]

Processing 30000...


 42%|██████████████▏                   | 41735/100000 [00:04<00:06, 9596.44it/s]

Processing 40000...


 51%|█████████████████▎                | 51074/100000 [00:05<00:05, 9150.94it/s]

Processing 50000...


 61%|████████████████████▊             | 61249/100000 [00:06<00:04, 9186.01it/s]

Processing 60000...


 71%|████████████████████████▏         | 71303/100000 [00:07<00:03, 8953.70it/s]

Processing 70000...


 82%|███████████████████████████▋      | 81521/100000 [00:08<00:01, 9306.77it/s]

Processing 80000...


 92%|███████████████████████████████▏  | 91738/100000 [00:10<00:00, 9150.08it/s]

Processing 90000...


100%|█████████████████████████████████| 100000/100000 [00:10<00:00, 9137.12it/s]


Processing 100000...
Total kandidat awal: 18712
Jumlah Pasien Sepsis Final: 16583
File 'sepsis_mimiciii.csv' berhasil disimpan.
Menyimpan Backup Pickle (Ini mungkin lama)...
Backup 'step_3_start.pkl' BERHASIL disimpan!
PROSES NOTEBOOK 2 SELESAI!
