# Notebook to Convert AmsterdamUMCdb to MIMIC-III Format

Here we convert the [AmsterdamUMCdb](https://github.com/AmsterdamUMC/AmsterdamUMCdb) data files to the MIMIC-III data file format as generated by [MIMIC-Code](https://github.com/MIT-LCP/mimic-code). This allows the exact same preprocessing pipeline to be applied to both MIMIC-III and AmsterdamUMCdb.

In [None]:
import re
import numpy as np
import pandas as pd
from tqdm import tqdm

CHUNK_SIZE = 20000
MAX_CHUNKS = 100  # for debugging
OUT_DIR = 'final/'

### Convenience functions

As the files are on the order of gigabytes, we read and process them incrementally:

In [None]:
# reads file from path in chunks of size `chunksize`
def read_csv(path, usecols, chunksize=CHUNK_SIZE):
    for i, chunk in enumerate(pd.read_csv(path, usecols=usecols, encoding='latin1', engine='python', chunksize=chunksize)):
        yield i, chunk.reset_index(drop=True) # resets index so that indices range from 0 to chunksize - 1
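
To illustrate what chunked reading does, here is a minimal sketch on a tiny in-memory CSV. The data and the chunk size of 2 are made up for the example; only `pd.read_csv(..., chunksize=...)` mirrors the helper above.

```python
import io
import pandas as pd

# Toy stand-in for a large AmsterdamUMCdb file (values are made up)
csv_text = "admissionid,itemid,value\n1,6640,80\n1,6640,85\n2,6640,90\n2,6640,95\n3,6640,70\n"

chunk_lengths = []
reader = pd.read_csv(io.StringIO(csv_text), usecols=['admissionid', 'value'], chunksize=2)
for i, chunk in enumerate(reader):
    chunk = chunk.reset_index(drop=True)  # row labels restart at 0 in every chunk
    chunk_lengths.append(len(chunk))
    print(i, list(chunk.index), list(chunk.admissionid))
```

Note that an admission can be split across two chunks, so any per-admission aggregate computed inside one chunk only sees the rows of that chunk.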

We also use a progress bar that shows the number of admissions processed so far, how many remain, and an estimated processing time:

In [None]:
def pbar(iterator, total_admissions=23106):
    # Keep track of admissions already seen
    processed_admissions = set()
    
    with tqdm(total=total_admissions) as progress_bar:
        for i, chunk in iterator:
            
            # count number of new admissions not yet seen
            # we do it only if last admission is new (saves time checking)
            if chunk.admissionid.values[-1] not in processed_admissions:
                new_admissions = set(chunk.admissionid) - processed_admissions
                processed_admissions.update(new_admissions)
            
                # update progress bar
                progress_bar.update(len(new_admissions))
                
            yield i, chunk
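
The bookkeeping behind the progress bar boils down to counting admission ids that have not been seen in earlier chunks. A minimal sketch in plain Python, where `count_new_admissions` and the toy chunks are hypothetical names used only for illustration:

```python
def count_new_admissions(chunks):
    """Yield, per chunk, how many admission ids were not seen in earlier chunks."""
    seen = set()
    for ids in chunks:
        new = set(ids) - seen
        seen.update(new)
        yield len(new)

# Each inner list plays the role of the `admissionid` column of one chunk
toy_chunks = [[1, 1, 2], [2, 2, 3], [3, 4, 5]]
new_per_chunk = list(count_new_admissions(toy_chunks))
print(new_per_chunk, sum(new_per_chunk))
```

The shortcut in `pbar` (only computing the set difference when the last admission of a chunk is new) presumably relies on rows being grouped by admission in the file; the sketch above does not make that assumption.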

---
### Times
Admission start times are given in milliseconds relative to the first admission of the same patient; we therefore convert all times to hours relative to the start of each admission.

In [None]:
# Start times of admissions relative to the first admission (with start = 0)
start_of_admission = dict()
for _, chunk in read_csv(r"D:/AmsterdamUMCdb-v1.0.2/admissions.csv", usecols=['admissionid', 'admittedat']):
    # admissionid + admission time
    start_of_admission.update(dict(zip(chunk.admissionid, chunk.admittedat)))

print('Time since first admission: %.2f years' % (start_of_admission[5489] / (1000 * 60 * 60 * 24 * 365)))

In [None]:
# convenience function to compute time from start of admission in hours
# Note: original times are relative to first admission of same patient!
def time_within_admission(times, admissionids):
    starttimes = pd.Series([start_of_admission[a] for a in admissionids])
    return (times.reset_index(drop=True) - starttimes) / (60*60*1000)  # hours


# create a timestamp by adding integer hours to some starttime
def hours_to_absolute_time(starttime, hours):
    return starttime + pd.to_timedelta(arg=(hours * 3600 * 1000).astype(int), unit='ms')
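
As a worked example of the conversion, the sketch below uses made-up start times; `toy_start_of_admission` and `hours_since_admission` are hypothetical stand-ins for the dictionary and helper above.

```python
MS_PER_HOUR = 60 * 60 * 1000

# Hypothetical admission start times in ms, relative to the patient's first admission
toy_start_of_admission = {1: 0, 2: 7_200_000}

def hours_since_admission(measuredat_ms, admissionid):
    """Time of a measurement in hours, relative to the start of its own admission."""
    return (measuredat_ms - toy_start_of_admission[admissionid]) / MS_PER_HOUR

first = hours_since_admission(1_800_000, 1)   # 30 minutes into admission 1
second = hours_since_admission(9_000_000, 2)  # 30 minutes into admission 2
print(first, second)
```

Both measurements end up at 0.5 hours, even though their raw timestamps differ by two hours.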

---

## Patient cohort

In [None]:
# the start date of admissions is not known so we assume it is now for all admissions
admission_start = pd.Timestamp.now()
    
cohort_df = []
for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/admissions.csv", usecols=['admissionid', 'admittedat', 'destination'])):
    # infer whether patient died (1) or survived (0)
    expire_flag = chunk.destination == 'Overleden'
    
    # determine start- and end-time of admission in hours
    window_start = time_within_admission(chunk.admittedat, chunk.admissionid)
    window_end = window_start + 72
    
    # reuse the single `admission_start` defined above so all chunks share one reference time
    
    chunk_df = pd.DataFrame({
        'icustay_id': chunk.admissionid,
        'window_start': hours_to_absolute_time(admission_start, hours=window_start),
        'window_end': hours_to_absolute_time(admission_start, hours=window_end),
        'hospital_expire_flag': expire_flag.astype(int),
    })
    cohort_df.append(chunk_df)
    
# merge
cohort_df = pd.concat(cohort_df, axis=0).reset_index(drop=True)
cohort_df.head()

In [None]:
# save!
cohort_df.to_csv(OUT_DIR + 'cohort.csv', index=False)

----

## Demographics

#### Systemic inflammatory response syndrome (SIRS) score
Sources:
- Temperature: [temperature.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/temperature.sql)
- Heart rate: [heart_rate.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/heart_rate.sql)
- Respiratory rate: [resp_rate.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/resp_rate.sql)
- WBC/Leukocytes: [wbc.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/wbc.sql)
- SIRS definition: https://www.mdcalc.com/calc/1096/sirs-sepsis-septic-shock-criteria

In [None]:
TEMPERATURE = [
    8658,  # --Temp Bloed
    8659,  # --Temperatuur Perifeer 2
    8662,  # --Temperatuur Perifeer 1
    13058, # --Temp Rectaal
    13059, # --Temp Lies
    13060, # --Temp Axillair
    13061, # --Temp Oraal
    13062, # --Temp Oor
    13063, # --Temp Huid
    13952, # --Temp Blaas
    16110  # --Temp Oesophagus
]

PCO2 = [
    6846,  # --PCO2
    9990,  # --pCO2 (bloed)
    21213  # --PCO2 (bloed) - kPa
]

HEARTRATE = [
    6640   # --Hartfrequentie
]

RESP_RATE = [
    8873,  # --Ademfrequentie Evita: measurement by Evita ventilator, most accurate
    12266, # --Ademfreq.: measurement by Servo-i/Servo-U ventilator, most accurate
    8874   # --Ademfrequentie Monitor: measurement by patient monitor using ECG-impedance, less accurate  
]

WBC = [
    6779,  # --Leucocyten 10^9/l
    9965   # --Leuco's (bloed) 10^9/l
]

BANDS = [
    11586  # -- Staaf % (bloed)
]

For each chunk of `numericitems.csv` (a chunk typically covers several admissions), we extract the above parameters (temperature, heart rate, pCO2, leukocyte count, etc.) and take their earliest values within the admission. We only consider values recorded at most 8 hours after the start of admission (see Roggeveen et al. (2021)) and then compute the SIRS score from them.

In [None]:
def compute_sirs(resp_rate, paco2, temp, heart_rate, wbc, bands):
    """ Computes Systemic inflammatory response syndrome (SIRS) score given vital parameters """
    # which admissions have available the data to estimate SIRS?
    admissionids = temp.keys() & heart_rate.keys() & (resp_rate.keys() | paco2.keys()) & (wbc.keys() | bands.keys())
    
    # compute SIRS for eligible admissions
    out = dict()
    for admissionid in admissionids:
        # For details, see: http://www.mdcalc.com/calc/1096/sirs-sepsis-septic-shock-criteria
        score = int(temp[admissionid] < 36 or temp[admissionid] > 38)
        score += int(heart_rate[admissionid] > 90)
        score += int(resp_rate[admissionid] > 20) if admissionid in resp_rate else int(paco2[admissionid] < 32)
        score += int(wbc[admissionid] > 12 or wbc[admissionid] < 4) if admissionid in wbc else int(bands[admissionid] > 10)
        
        out[admissionid] = score
    return out
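
To make the scoring rule concrete, here is a minimal single-patient sketch. `sirs_score` is a hypothetical helper that covers only the respiratory-rate and WBC variants of the criteria; the input values are made up.

```python
def sirs_score(temp_c, heart_rate, resp_rate, wbc):
    """SIRS score (0-4) for a single set of measurements."""
    score = int(temp_c < 36 or temp_c > 38)   # temperature in degrees C
    score += int(heart_rate > 90)             # beats/min
    score += int(resp_rate > 20)              # breaths/min
    score += int(wbc > 12 or wbc < 4)         # 10^9/l
    return score

febrile = sirs_score(38.6, 112, 24, 9.0)  # fever, tachycardia, tachypnea, normal WBC
normal = sirs_score(37.0, 75, 14, 7.5)    # all criteria within range
print(febrile, normal)
```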

In [None]:
sirs_on_admission = dict()

for i, chunk in pbar(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/numericitems.csv", usecols=['admissionid', 'itemid', 'value', 'measuredat', 'unitid'])):
    
    # which observations are within 8h of start admission?
    times = time_within_admission(chunk.measuredat, chunk.admissionid)
    within_8h = (times > -0.5) & (times < 8) # Allow 30 minute margin before admission
    chunk = chunk[within_8h]
    
    if len(chunk) == 0:
        continue
    
    # Respiratory rate (unit = breaths/min)
    resp_rate = chunk[chunk.itemid.isin(RESP_RATE) & (chunk.unitid == 15)]
    resp_rate = resp_rate.groupby('admissionid', sort=False).value.first().to_dict()
    
    # PaCO2 (mmHg)
    paco2 = chunk[chunk.itemid.isin(PCO2)].copy()
    paco2.loc[paco2.unitid == 152, 'value'] *= 7.50061683  # conversion rate kPa to mmHg
    paco2 = paco2.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Temperature (degrees C)
    temp = chunk[chunk.itemid.isin(TEMPERATURE) & (chunk.unitid == 59)]
    temp = temp.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Heart rate (beats/min)
    heart_rate = chunk[chunk.itemid.isin(HEARTRATE) & (chunk.unitid == 15)]
    heart_rate = heart_rate.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Leukocytes (10^9/l)
    wbc = chunk[chunk.itemid.isin(WBC) & (chunk.unitid == 101)]
    wbc = wbc.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Bands (bloed %)
    bands = chunk[chunk.itemid.isin(BANDS)]
    bands = bands.groupby('admissionid', sort=False).value.first().to_dict()
        
    # SIRS score
    sirs_on_admission.update(compute_sirs(resp_rate, paco2, temp, heart_rate, wbc, bands))
    
    if i == MAX_CHUNKS:
        break

In [None]:
sirs_on_admission[11]  # SIRS of admission 11 in first 8h

#### Mechanical ventilation

Source: [mechanical_ventilation.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/lifesupport/mechanical_ventilation.sql)

In [None]:
# Combinations of itemids and valueids in listitems.csv
# indicating use of (non-)invasive mechanical ventilation
MECH_VENT_SETTINGS = [
    (9534, (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)),                    # --Type beademing Evita 1
    (6685, (1, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 20, 22)),                 # --Type Beademing Evita 4
    (8189, (16,)),                                                          # --Toedieningsweg O2
    (12290, (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)),  # --Ventilatie Mode (Set) Servo-I and Servo-U ventilators
    (12347, (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)),  # --Ventilatie Mode (Set) (2) Servo-I and Servo-U ventilators
    (12376, (1, 2)),                                                        # --Mode (Bipap Vision)
]    

In [None]:
class Ventilation():
    """
    Convenience class to check for mechanical ventilation
    """
    def __init__(self, mech_vent):
        self._vent_settings = set()
        for itemid, valueids in mech_vent:
            self._vent_settings.update([(itemid, valueid) for valueid in valueids])
            
    def __call__(self, itemids, valueids):
        """ Checks if (itemid, valueid) pair is a valid ventilator setting """
        return np.array([x in self._vent_settings for x in list(zip(itemids, valueids))])
    
is_mech_vent = Ventilation(MECH_VENT_SETTINGS)
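
The idea behind the class is a set-membership test on `(itemid, valueid)` pairs. A minimal sketch in plain Python, using two entries copied from `MECH_VENT_SETTINGS` and a few made-up rows:

```python
TOY_SETTINGS = [
    (8189, (16,)),    # --Toedieningsweg O2
    (12376, (1, 2)),  # --Mode (Bipap Vision)
]

# Flatten into a set of (itemid, valueid) pairs for constant-time lookups
valid_pairs = {(itemid, valueid) for itemid, valueids in TOY_SETTINGS for valueid in valueids}

rows = [(8189, 16), (8189, 3), (12376, 2), (6640, 1)]
flags = [pair in valid_pairs for pair in rows]
print(flags)
```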

For each admission, we check whether mechanical ventilation was used within the first 8 hours of admission (see Roggeveen et al., 2021).

In [None]:
admissions_with_vent = set()

for _, chunk in pbar(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/listitems.csv", usecols=['admissionid', 'itemid', 'valueid', 'measuredat'])):
    # which observations are within 8h of start admission?
    times = time_within_admission(chunk.measuredat, chunk.admissionid)
    within_8h = (times > -0.5) & (times < 8) # Allow 30 minute margin before admission
    
    # check whether (itemid, valueid) in listitems is mechanical ventilation and whether it's in the first 8 hours
    chunk['has_vent'] = is_mech_vent(chunk.itemid, chunk.valueid) & within_8h
    
    # admissionids with mechanical ventilation
    admissions_with_vent.update(chunk.admissionid[chunk.has_vent])
        
print('Number of admissions with ventilation:', len(admissions_with_vent))

#### Glasgow Coma Scale
Sources: 
- Glasgow Coma Scale: [gcs.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/gcs.sql)
- GCS definition: https://www.mdcalc.com/calc/64/glasgow-coma-scale-score-gcs

In [None]:
## credits Patrick Thoral

def gcs_eyes_score(df):
    df = df.copy()
    df.loc[df.itemid == 6732, 'eyes_score'] = 5 - df.valueid   # --Actief openen van de ogen
    df.loc[df.itemid == 13077, 'eyes_score'] = df.valueid      # --A_Eye
    df.loc[df.itemid == 14470, 'eyes_score'] = df.valueid - 4  # --RA_Eye
    df.loc[df.itemid == 16628, 'eyes_score'] = df.valueid - 4  # --MCA_Eye
    df.loc[df.itemid == 19635, 'eyes_score'] = df.valueid - 4  # --E_EMV_NICE_24uur
    df.loc[df.itemid == 19638, 'eyes_score'] = df.valueid - 8  # --E_EMV_NICE_Opname
    return df
    
def gcs_motor_score(df):
    df = df.copy()
    df.loc[df.itemid == 6734, 'motor_score'] = 7 - df.valueid   # --Beste motore reactie van de armen
    df.loc[df.itemid == 13072, 'motor_score'] = df.valueid      # --A_Motoriek
    df.loc[df.itemid == 14476, 'motor_score'] = df.valueid - 6  # --RA_Motoriek
    df.loc[df.itemid == 16634, 'motor_score'] = df.valueid - 6  # --MCA_Motoriek
    df.loc[df.itemid == 19636, 'motor_score'] = df.valueid - 6  # --M_EMV_NICE_24uur
    df.loc[df.itemid == 19639, 'motor_score'] = df.valueid - 12 # --M_EMV_NICE_Opname
    return df

def gcs_verbal_score(df):
    df = df.copy()
    df.loc[df.itemid == 6735, 'verbal_score'] = 6 - df.valueid   # --Beste verbale reactie
    df.loc[df.itemid == 13066, 'verbal_score'] = df.valueid      # --A_Verbal
    df.loc[df.itemid == 14482, 'verbal_score'] = df.valueid - 5  # --RA_Verbal
    df.loc[df.itemid == 16640, 'verbal_score'] = df.valueid - 5  # --MCA_Verbal
    df.loc[df.itemid == 19637, 'verbal_score'] = df.valueid - 9  # --V_EMV_NICE_24uur
    df.loc[df.itemid == 19640, 'verbal_score'] = df.valueid - 15 # --V_EMV_NICE_Opname
    
    # clip verbal score to a minimum of 1
    df.loc[df['verbal_score'] < 1, 'verbal_score'] = 1
    return df
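
As a worked example, the three oldest items (6732, 6734, 6735) are scored as a fixed offset minus `valueid`. The sketch below assumes the encoding implied by those formulas, namely that `valueid` 1 is the best response; `OFFSETS` and `gcs_total` are hypothetical names for illustration.

```python
# score = offset - valueid for these three items (see the functions above)
OFFSETS = {
    6732: 5,  # eyes:   5 - valueid
    6734: 7,  # motor:  7 - valueid
    6735: 6,  # verbal: 6 - valueid
}

def gcs_total(rows):
    """Sum the component scores of (itemid, valueid) pairs."""
    return sum(OFFSETS[itemid] - valueid for itemid, valueid in rows)

best = gcs_total([(6732, 1), (6734, 1), (6735, 1)])   # 4 + 6 + 5
worst = gcs_total([(6732, 4), (6734, 6), (6735, 5)])  # 1 + 1 + 1
print(best, worst)
```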

In [None]:
gcs_on_admission = dict()

for i, chunk in pbar(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/listitems.csv", usecols=['admissionid', 'itemid', 'valueid', 'value', 'measuredat'])): 
    # which observations are within 8h of start admission?
    times = time_within_admission(chunk.measuredat, chunk.admissionid)
    within_8h = (times > -0.5) & (times < 8) # Allow 30 minute margin before admission
    chunk = chunk[within_8h]  # the mask was previously computed but never applied
    
    if len(chunk) == 0:
        continue
    
    # component scores
    chunk = gcs_eyes_score(chunk)
    chunk = gcs_motor_score(chunk)
    chunk = gcs_verbal_score(chunk)
        
    # GCS = eye + motor + verbal score
    gcs = chunk.groupby('admissionid').first()[['eyes_score', 'motor_score', 'verbal_score']].sum(axis=1, min_count=1)
        
    # drop admissions where score could not be computed
    gcs = gcs[gcs.notna()].to_dict()
    
    gcs_on_admission.update(gcs)
    
    if i % 1000 == 0:
        print('(%d/23106) number of GCS computed: %s' % (i, len(gcs_on_admission)))

In [None]:
gcs_on_admission[11]

#### Sequential Organ Failure Assessment (SOFA) score

Sources:

- PaO2/FiO2: [pO2_pCO2_FiO2.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/pO2_pCO2_FiO2.sql)
- Ventilators: [mechanical_ventilation.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/lifesupport/mechanical_ventilation.sql)
- Platelets: [platelets.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/platelets.sql)
- Glasgow Coma Scale: [gcs.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/gcs.sql)
- Bilirubin: [bilirubin.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/bilirubin.sql)
- SOFA definition: https://www.mdcalc.com/calc/691/sequential-organ-failure-assessment-sofa-score

In [None]:
PO2 = [
    7433,  # --PO2
    9996,  # --PO2 (bloed)
    21214  # --PO2 (bloed) - kPa
]

# TODO: estimate FiO2 if not known from ventilator settings
FIO2 = [
    6699,  # --FiO2 %: setting on Evita ventilator
    12279, # --O2 concentratie --measurement by Servo-i/Servo-U ventilator
    12369, # --SET %O2: used with BiPap Vision ventilator
    16246  # --Zephyros FiO2: Non-invasive ventilation
]

PLATELETS = [
    9964,  # --Thrombo's (bloed)
    6797,  # --Thrombocyten
    10409, # --Thrombo's citr. bloed (bloed)
    14252  # --Thrombo CD61 (bloed)   
]

BILIRUBIN = [
    6813,  # --Bili Totaal
    9945   # --Bilirubine (bloed)
]

MEANBP = [
    6642,  # --ABP gemiddeld
    6679,  # --Niet invasieve bloeddruk gemiddeld
    8843   # --ABP gemiddeld II
]

CREATININE = [
    6836,  # --Kreatinine µmol/l (erroneously documented as µmol)
    9941,  # --Kreatinine (bloed) µmol/l
    14216  # --KREAT enzym. (bloed) µmol/l   
]

In [None]:
def compute_sofa(po2, fio2, vent, platelets, gcs, bilirubin, meanbp, creatinine):
    """ Computes Sequential Organ Failure Assessment (SOFA) score given vital parameters 
    For details, see: https://files.asprtracie.hhs.gov/documents/aspr-tracie-sofa-score-fact-sheet.pdf
    """
    # which admissions have available the data to estimate SOFA?
    admissionids = po2.keys() & fio2.keys() & vent.keys() & platelets.keys() & bilirubin.keys() & meanbp.keys() & creatinine.keys() & gcs.keys()
    
    # compute SOFA for eligible admissions
    out = dict()
    for admissionid in admissionids:
        score = 0
        
        # respiratory
        pf_ratio = po2[admissionid] / (fio2[admissionid] / 100)  # FiO2 is recorded in %, the ratio needs a fraction
        if pf_ratio < 100 and vent[admissionid]:
            score += 4
        elif pf_ratio < 200 and vent[admissionid]:
            score += 3
        elif pf_ratio < 300:
            score += 2
        elif pf_ratio < 400:
            score += 1
            
        # coagulation
        coag = platelets[admissionid]
        if coag < 20:
            score += 4
        elif coag < 50:
            score += 3
        elif coag < 100:
            score += 2
        elif coag < 150:
            score += 1
            
        # liver
        bili = bilirubin[admissionid] / 17.1  # µmol/l -> mg/dl (thresholds below are in mg/dl)
        if bili >= 12:
            score += 4
        elif 6 <= bili < 12:
            score += 3
        elif 2 <= bili < 6:
            score += 2
        elif 1.2 <= bili < 2:
            score += 1
            
        # cardiovascular
        # TODO: add vasopressor agents
        blood_pressure = meanbp[admissionid] 
        if blood_pressure < 70:
            score += 1
        
        # central nervous system
        gcs_score = gcs[admissionid]
        if gcs_score < 6:
            score += 4
        elif 6 <= gcs_score <= 9:
            score += 3
        elif 10 <= gcs_score <= 12:
            score += 2
        elif 13 <= gcs_score <= 14:
            score += 1
            
        # renal
        creat = creatinine[admissionid]
        if creat > 440:
            score += 4
        elif 300 <= creat <= 440:
            score += 3
        elif 170 <= creat < 300:
            score += 2
        elif 110 <= creat < 170:
            score += 1
        
        out[admissionid] = score
    return out
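
The SOFA respiratory thresholds are defined on PaO2 in mmHg divided by FiO2 as a fraction, so a FiO2 recorded in percent has to be divided by 100 first. A minimal sketch of that one component, where `respiratory_sofa` is a hypothetical helper and the inputs are made up:

```python
def respiratory_sofa(pao2_mmhg, fio2_percent, ventilated):
    """Respiratory SOFA subscore from PaO2 (mmHg) and FiO2 (%)."""
    pf_ratio = pao2_mmhg / (fio2_percent / 100)
    if pf_ratio < 100 and ventilated:
        return 4
    if pf_ratio < 200 and ventilated:
        return 3
    if pf_ratio < 300:
        return 2
    if pf_ratio < 400:
        return 1
    return 0

on_vent = respiratory_sofa(80, 50, ventilated=True)    # P/F = 160, ventilated
no_vent = respiratory_sofa(80, 50, ventilated=False)   # P/F = 160, not ventilated
room_air = respiratory_sofa(95, 21, ventilated=False)  # P/F is roughly 452
print(on_vent, no_vent, room_air)
```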
    

In [None]:
sofa_on_admission = dict()

for i, chunk in pbar(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/numericitems.csv", usecols=['admissionid', 'itemid', 'value', 'measuredat', 'unitid', 'unit'])):
    # which observations are within 8h after start admission?
    times = time_within_admission(chunk.measuredat, chunk.admissionid)
    chunk = chunk[(times > -0.5) & (times < 8)].copy()  # Allow 30 minute margin before admission
    
    if len(chunk) == 0:
        continue
    
    # PO2 (mmHg)
    po2 = chunk[chunk.itemid.isin(PO2)].copy()
    po2.loc[po2.unitid == 152, 'value'] *= 7.50061683  # conversion rate kPa to mmHg
    po2 = po2.groupby('admissionid', sort=False).value.first().to_dict()
    if len(po2) == 0:
        continue
    
    # FiO2 (%)
    fio2 = chunk[chunk.itemid.isin(FIO2)]
    fio2 = fio2.groupby('admissionid', sort=False).value.first().to_dict()
    if len(fio2) == 0:
        continue
    
    # Mechanical ventilation
    vent = {a: (a in admissions_with_vent) for a in chunk.admissionid.unique()}  # keyed by admissionid, not by row index
    
    # Platelets (10^9/l)
    platelets = chunk[chunk.itemid.isin(PLATELETS)]
    platelets = platelets.groupby('admissionid', sort=False).value.first().to_dict()
    if len(platelets) == 0:
        continue
    
    # Bilirubin (µmol/l)
    bilirubin = chunk[chunk.itemid.isin(BILIRUBIN)]
    bilirubin = bilirubin.groupby('admissionid', sort=False).value.first().to_dict()
    if len(bilirubin) == 0:
        continue
    
    # MAP (mmHg)
    meanbp = chunk[chunk.itemid.isin(MEANBP)]
    meanbp = meanbp.groupby('admissionid', sort=False).value.first().to_dict()
    if len(meanbp) == 0:
        continue
    
    # Creatinine (µmol/l)
    creatinine = chunk[chunk.itemid.isin(CREATININE)]
    creatinine = creatinine.groupby('admissionid', sort=False).value.first().to_dict()
    
    sofa_on_admission.update(compute_sofa(po2, fio2, vent, platelets, gcs_on_admission, bilirubin, meanbp, creatinine))
    
    if i % 500 == 0:
        print('(%d/23106) number of SOFA computed: %s' % (i, len(sofa_on_admission)))

#### Age, Height, Weight, Gender

In [None]:
# Age/weight/height groups to their centroids (best guess)
AGE_GROUPS = {
    '18-39': 30,
    '40-49': 45,
    '50-59': 55,
    '60-69': 65,
    '70-79': 75, 
    '80+': 85  # Arbitrary
}

WEIGHT_GROUPS = {
    '59-': 55,
    '60-69': 65, 
    '70-79': 75,
    '80-89': 85,
    '90-99': 95,
    '100-109': 105,
    '110+': 115,
    'N/A': -1  # To be replaced with Dutch average weight (for men and women separately)
}
  
HEIGHT_GROUPS = {
    '159-': 155,
    '160-169': 165,
    '170-179': 175,
    '180-189': 185, 
    '190+': 195, 
    'N/A': -1  # Average height of men and women in NL
}

Using the above mappings we can convert the grouped demographics (`agegroup`, `heightgroup`, `weightgroup`) to numeric values. Unknown heights and weights are marked with -1 so that they can later be replaced by per-gender estimates, which is better than a naive average over the whole population.
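
One possible way to handle the -1 placeholders is sketched below. It is an assumption, not the notebook's method: unknown weights are filled with the mean of the known weights of the same gender in the data itself, rather than with national averages. The gender label `'Vrouw'` and the toy rows are also assumptions made for the example.

```python
import numpy as np
import pandas as pd

TOY_WEIGHT_GROUPS = {'60-69': 65, '70-79': 75, 'N/A': -1}

toy = pd.DataFrame({
    'gender': ['Man', 'Man', 'Vrouw', 'Vrouw', 'Man'],
    'weightgroup': ['70-79', 'N/A', '60-69', 'N/A', '70-79'],
})

# Map groups to centroids and turn the -1 placeholder into a proper missing value
weight = toy.weightgroup.map(TOY_WEIGHT_GROUPS).replace(-1, np.nan)

# Assumption: fill unknowns with the mean of the known values for the same gender
toy['weight'] = weight.fillna(weight.groupby(toy.gender).transform('mean'))
print(toy)
```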

### Demographics DataFrame

In [None]:
demo_cols = ['admissionid', 'gender', 'agegroup', 'heightgroup', 'weightgroup']

demo_df = []
for i, chunk in read_csv(r"D:/AmsterdamUMCdb-v1.0.2/admissions.csv", usecols=demo_cols, chunksize=100000000):
    demo_chunk = pd.DataFrame({
        'icustay_id': chunk.admissionid,
        'is_male': chunk.gender == 'Man',
        'age': chunk.agegroup.map(AGE_GROUPS),  # NaN if the group is unknown
        'height': chunk.heightgroup.map(HEIGHT_GROUPS),
        'weight': chunk.weightgroup.map(WEIGHT_GROUPS),
        'vent': chunk.admissionid.isin(admissions_with_vent),
        'sirs': chunk.admissionid.map(sirs_on_admission),  # NaN if couldn't be estimated
        'sofa': chunk.admissionid.map(sofa_on_admission)
    })
    demo_df.append(demo_chunk)
    
demo_df = pd.concat(demo_df, axis=0).reset_index(drop=True)

In [None]:
# save!
demo_df.to_csv(OUT_DIR + 'demographics_cohort.csv')

---

## Vitals

Sources:
- Vital parameters: [amsterdamumcdb/sql/common](https://github.com/AmsterdamUMC/AmsterdamUMCdb/tree/master/amsterdamumcdb/sql/common)

In [None]:
VITALS = {
    'HeartRate':  [6640],                                      # -- 'Hartfrequentie'
    'SysBP':      [6641, 6678, 8841],                          # -- 'ABP systolisch', 'Niet invasieve bloeddruk systolisch', 'ABP systolisch II', 
    'DiasBP':     [6643, 6680, 8842],                          # -- 'ABP diastolisch', 'Niet invasieve bloeddruk diastolisch', 'ABP diastolisch II'
    'MeanBP':     [6642, 6679, 8843],                          # -- 'ABP gemiddeld', 'Niet invasieve bloeddruk gemiddeld', 'ABP gemiddeld II'
    'Glucose':    [6833],                                      # -- 'Glucose Bloed'
    'SpO2':       [12311],                                     # -- 'O2-Saturatie (bloed)'
    'TempC':      [8658, 8659, 8662, 13058, 13059,             # -- 'Temp Bloed', 'Temperatuur Perifeer 2', 'Temperatuur Perifeer 1', 'Temp Rectaal', 'Temp Lies',
                   13060, 13061, 13062, 13063, 13952, 16110],  # -- 'Temp Axillair', 'Temp Oraal', 'Temp Oor', 'Temp Huid', 'Temp Blaas', 'Temp Oesophagus' 
    'RespRate':   [8874, 8873, 9654, 7726, 12266],             # -- 'Ademfrequentie Monitor', 'Ademfrequentie Evita', 'Ademfreq.'
    'SvO2':       [8507, 12534, 12311],                        # -- 'SVO2', 'SvO2 Vigilance', 'O2-Saturatie (bloed)'
}


In [None]:
vitals_df = []

for i, chunk in pbar(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/numericitems.csv", usecols=['admissionid', 'itemid', 'value', 'measuredat'])):
    
    for vital, vital_ids in VITALS.items():
        # Mask entries of vital
        is_vital = chunk.itemid.isin(vital_ids)
        
        if is_vital.any():
            
            # Store measurements of vital in new DataFrame
            vital_df = pd.DataFrame({
                'icustay_id': chunk.admissionid[is_vital].values,
                'charttime': time_within_admission(chunk.measuredat[is_vital], chunk.admissionid[is_vital]),
                'vital_id': vital,
                'valuenum': chunk.value[is_vital].values,
            })
            
            vitals_df.append(vital_df)
        
    if i > MAX_CHUNKS:
        break
        
# Merge vital DataFrames
vitals_df = pd.concat(vitals_df, axis=0).reset_index(drop=True)
vitals_df.head()

In [None]:
vitals_df.value_counts('vital_id')

In [None]:
# save!
vitals_df.to_csv(OUT_DIR + 'vitals_cohort.csv')

---

## FiO2 (Fraction of Inspired Oxygen)

Sources:

- FiO2: [pO2_FiO2_estimated.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/amsterdamumcdb/sql/common/pO2_FiO2_estimated.sql)

In [None]:
# FiO2 settings on respiratory support
OXYGEN_DEVICES = [
    6699,  # --FiO2 %: setting on Evita ventilator
    12279, # --O2 concentratie --measurement by Servo-i/Servo-U ventilator
    12369, # --SET %O2: used with BiPap Vision ventilator
    16246  # --Zephyros FiO2: Non-invasive ventilation
]

# Oxygen Flow settings without respiratory support
FIO2_ESTIMATED = [
    8845,  # -- O2 l/min
    10387, # --Zuurstof toediening (bloed)
    18587, # --Zuurstof toediening
]

To estimate FiO2, we must first extract the ventilator settings for each admission over time from `listitems`:

In [None]:
vent_settings = dict()

for i, chunk in pbar(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/listitems.csv", usecols=['admissionid', 'itemid', 'valueid', 'measuredat'])):
    # limit rows to oxygen devices
    chunk = chunk[chunk.itemid == 8189].copy() # -- Toedieningsweg (Oxygen device)
    
    # Add (admissionid, measuredat):valueid pairs to vent_settings
    vent_settings.update(chunk.set_index(['admissionid', 'measuredat'])['valueid'].to_dict())  

In [None]:
vent_settings[(7, 33480000)]
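
Because the dictionary is keyed by `(admissionid, measuredat)` tuples, a measurement only finds a setting when both the admission and the exact timestamp match. A minimal sketch with made-up entries, using `dict.get` with -1 as the "unknown" marker:

```python
# Made-up settings: (admissionid, measuredat in ms) -> valueid
toy_vent_settings = {(7, 33480000): 16, (7, 33540000): 2}

rows = [(7, 33480000), (7, 99999999), (8, 33480000)]

# Exact match on both admission and timestamp, -1 when no setting is recorded
valueids = [toy_vent_settings.get(key, -1) for key in rows]
print(valueids)
```

Whether an exact timestamp match is sufficient, or whether the most recent earlier setting should be carried forward instead, depends on how the settings are charted and is left open here.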

Knowing the ventilator settings, we can estimate FiO2 where needed:

In [None]:
for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/numericitems.csv", usecols=['admissionid', 'itemid', 'value', 'measuredat'])):
    # limit to oxygen flow items
    chunk = chunk[chunk.itemid.isin(OXYGEN_DEVICES + FIO2_ESTIMATED)]
    
    # Easy case: use oxygen settings set by ventilator
    fio2_device = chunk.value[chunk.itemid.isin(OXYGEN_DEVICES)].fillna(21) # FiO2 is recorded in %; 21% -> room air
    
    # Harder case: estimate FiO2
    # 1) query `vent_settings` to get the setting's valueid (-1 if unknown)
    keys = zip(chunk.admissionid, chunk.measuredat)
    valueids = pd.Series([vent_settings.get(key, -1) for key in keys], index=chunk.index)
    
    option1 = valueids.isin([2, 7])
    
    # TODO: https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/amsterdamumcdb/sql/common/pO2_FiO2_estimated.sql
    
    break

---

## Lab Results
Sources: 

- Kalium, Platelets, etc.: [AmsterdamUMCdb/common](https://github.com/AmsterdamUMC/AmsterdamUMCdb/tree/master/amsterdamumcdb/sql/common)
- lots of hard work, blood, sweat and tears

In [None]:
# Missing: PT
LABRESULTS = {
    'Calcium':     [6817, 9933],                     # -- 'Calcium', 'Calcium totaal (bloed)'
    'IonCalcium':  [10267],                          # -- 'Ca-ion (7.4) (bloed)'
    'ASAT':        [11990],                          # -- 'ASAT (bloed)'
    # Remark: PTT is not equivalent to aPTT nor PT, however used as a proxy in Roggeveen et al.
    'PTT':         [11944, 11948, 17982],            # -- 'APTT (bloed)', 'APTT Gecorrigeerd (bloed)', 'APTT (bloed)'
    'Potassium':   [6835, 9927, 9556, 10285],        # -- 'Kalium', 'Kalium (bloed)', 'Kalium Astrup', 'K (onv.ISE) (bloed)'
    'Platelet':    [9964, 6797, 10409, 14252],       # -- "Thrombo's (bloed)", 'Thrombocyten', "Thrombo's citr. bloed (bloed)", 'Thrombo CD61 (bloed)'
    'AnionGap':    [9559, 8492],                     # -- 'Anion-Gap (bloed)', 'AnGap'
    'PaO2':        [7433, 9996, 21214],              # -- 'PaO2', 'PaO2 (bloed)', 'PaO2 (bloed) - kPa'
    'ALAT':        [6800, 11978],                    # -- 'ALAT', 'ALAT (bloed)'
    'WBC':         [6779, 9965],                     # -- 'Leucocyten', "Leuco's (bloed)"
    'Bilirubin':   [9945, 6813],                     # -- 'Bilirubine (bloed)', 'Bili Totaal'
    'Sodium':      [12233, 9555, 9924, 10284],       # -- 'Natrium (overig)', 'Natrium Astrup', 'Natrium (bloed)', 'Na (onv.ISE) (bloed)'
    'Chloride':    [14413],                          # -- Cl (onv.ISE) (bloed)
    'Magnesium':   [9952],                           # -- 'Magnesium (bloed)'
    'Lactate':     [10053],                          # -- 'Lactaat (bloed)'
    'PaCO2':       [6846, 9990, 21213],              # -- 'PCO2', 'PCO2 (bloed)', 'PCO2 (bloed) - kPa'
    'Glucose':     [6833, 9947],                     # -- 'Glucose Bloed', 'Glucose (bloed)'
    'Creatinine':  [9941, 14216],                    # -- 'Kreatinine (bloed)', 'KREAT enzym. (bloed)'
    'Bicarbonate': [6810, 9992],                     # -- 'HCO3 (astrups)', 'Act.HCO3 (bloed)'
    # Remark: ureum is not equivalent to BUN, but used as a proxy in Roggeveen et al.
    'BUN':         [9943],                           # -- 'Ureum (bloed)'
    'pH':          [6848, 12310],                    # -- 'pH', 'pH (bloed)'
    'Albumin':     [9937],                           # -- 'Alb.Chem (bloed)'
    'Bands':       [11586],                          # -- Staaf % (bloed)
    'Hemoglobin':  [9960, 10286, 6778],              # -- 'Hb (bloed)', 'Hb(v.Bgs) (bloed)', 'Hemoglobine'
    'BaseExcess':  [9994],                           # -- 'B.E. (bloed)'
    'Hematocrit':  [6777, 11423, 11545],             # -- 'Hematocriet', 'Ht (bloed)', 'Ht(v.Bgs) (bloed)'
    'DDimer':      [10393, 12163]                    # -- 'D-dimeren', 'D-DIMEER (bloed)'
}

In [None]:
labs_df = []

for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/numericitems.csv", usecols=['admissionid', 'itemid', 'value', 'measuredat', 'unitid'])):
    
    # Convert kPa to mmHg once per chunk (inside the loop below, values would be rescaled repeatedly)
    chunk.loc[chunk.unitid == 152, 'value'] *= 7.50061683
    
    for lab, lab_ids in LABRESULTS.items():
        # Mask entries of one lab test
        is_lab = chunk.itemid.isin(lab_ids)
        
        if is_lab.any():
            
            # Store lab measurements in new DataFrame
            lab_df = pd.DataFrame({
                'icustay_id': chunk.admissionid[is_lab].values,
                'charttime': time_within_admission(chunk.measuredat[is_lab], chunk.admissionid[is_lab]),
                'lab_id': lab,
                'valuenum': chunk.value[is_lab].values,
            })
            
            labs_df.append(lab_df)
        
    if i > MAX_CHUNKS:
        break
        
# Merge lab result DataFrames
labs_df = pd.concat(labs_df, axis=0).reset_index(drop=True)
labs_df.head()

In [None]:
labs_df.value_counts('lab_id')

In [None]:
# save!
labs_df.to_csv(OUT_DIR + 'labs_cohort.csv')
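
The kPa to mmHg conversion used for the lab values amounts to multiplying the affected rows by a constant factor (roughly 7.5 mmHg per kPa). A minimal sketch on a made-up chunk is given below; the values and the second unit id (173) are invented for illustration, only unit id 152 and the factor are taken from the cell above.

```python
import pandas as pd

KPA_UNIT_ID = 152          # unit id for kPa, as used in the cell above
KPA_TO_MMHG = 7.50061683   # 1 kPa expressed in mmHg

# Toy chunk (made-up values): two readings in kPa, one in another unit
toy_chunk = pd.DataFrame({
    'unitid': [152, 152, 173],   # 173 is a hypothetical non-kPa unit id
    'value': [10.0, 13.3, 95.0],
})

# Scale only the kPa rows, in place, exactly once
is_kpa = toy_chunk.unitid == KPA_UNIT_ID
toy_chunk.loc[is_kpa, 'value'] *= KPA_TO_MMHG

print(toy_chunk)
```

Rows with any other unit id are left untouched, so the conversion is safe to run on a chunk that mixes units as long as it is run only once per chunk.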

---
## Urine Output

In [None]:
URINE_OUTPUTS = [
    8794,  # UrineCAD
    8796,  # UrineSupraPubis
    8798,  # UrineSpontaan
    8800,  # UrineIncontinentie
    8803,  # UrineUP
    10743, # Nefrodrain li Uit
    10745, # Nefrodrain re Uit
    19921, # UrineSplint Li
    19922  # UrineSplint Re
]

In [None]:
urine_outputs_df = []

for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/numericitems.csv", usecols=['admissionid', 'itemid', 'value', 'measuredat'])):
    
    # which entries have urine output
    is_urine_output = chunk.itemid.isin(URINE_OUTPUTS)

    if is_urine_output.any():
        
        # Store urine output measurements in new DataFrame
        urine_output_df = pd.DataFrame({
            'icustay_id': chunk.admissionid[is_urine_output].values,
            'charttime': time_within_admission(chunk.measuredat[is_urine_output], chunk.admissionid[is_urine_output]),
            'value': chunk.value[is_urine_output].values,
        })

        urine_outputs_df.append(urine_output_df)
        
    if i > MAX_CHUNKS:
        break
        
# Merge urine output DataFrames
urine_outputs_df = pd.concat(urine_outputs_df, axis=0).reset_index(drop=True)
urine_outputs_df.head()

---

## Vasopressors

Sources:
- Overview common vasopressor drugs for treatment of sepsis: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7333107/
- Vasopressors in UMCdb: [vasopressors_inotropes.ipynb](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/concepts/lifesupport/vasopressors_inotropes.ipynb), [vasopressors_inotropes.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/amsterdamumcdb/sql/common/vasopressors_inotropes.sql)
- Norepinephrine-equivalents: [get_vassopressor_mv.sql](https://github.com/LucaMD/SRL/blob/master/SEPSIS/MIMIC_sql/get_vassopressor_mv.sql)

We exclude pure *inotropes* (commented out below), which act mainly on the heart, e.g. to increase heart rate, rather than on vascular tone.

In [None]:
VASOPRESSORS = [
    7179,  # -- Dopamine (Inotropin)
    # 7178, -- Dobutamine (Dobutrex)           -> inotropic
    6818,  # -- Adrenaline (Epinefrine)        -> inotropic but also known to induce vasoconstriction
    7229,  # -- Noradrenaline (Norepinefrine)   
    # 7135, -- Isoprenaline (Isuprel)          -> inotropic
    # 7196, -- Enoximon (Perfan)               -> inotropic
    12467, # -- Terlipressine (Glypressin)
    13490, # -- Methyleenblauw IV (Methylthionide chloride)
    19929  # -- Fenylefrine
]

#### Extracting weight of patients
To compute the vasopressor dose in mcg/kg/min we need to know the weight of the patient in kg. We determine the approximate weight of the patient using the `weightgroup` entry in `admissions.csv` and the weight definitions in `WEIGHT_GROUPS` above.

In [None]:
patient_weight = dict()

for i, chunk in read_csv(r"D:/AmsterdamUMCdb-v1.0.2/admissions.csv", usecols=['admissionid', 'weightgroup'], chunksize=10000000): # -> do all at once, this file is super small
    # Convert weight groups, e.g. '70-79', to an approximate weight (75)
    weights = chunk.weightgroup.map(lambda group: WEIGHT_GROUPS.get(group, 80))  # Assume 80 kg if unknown
    patient_weight.update(zip(chunk.admissionid, weights))  # update, so earlier chunks are not overwritten

In [None]:
patient_weight[2] # in kgs
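
As an illustration of the weight lookup, the sketch below maps weight groups to approximate weights on a toy table. `EXAMPLE_WEIGHT_GROUPS` is a hypothetical subset invented for this example (the notebook's own `WEIGHT_GROUPS`, defined earlier, is the one actually used), and the admissions are made up.

```python
import pandas as pd

# Hypothetical subset of a weight-group mapping; the notebook's own
# WEIGHT_GROUPS (defined earlier) is the authoritative version
EXAMPLE_WEIGHT_GROUPS = {'60-69': 65, '70-79': 75, '80-89': 85}
DEFAULT_WEIGHT_KG = 80  # fallback for unknown or missing groups, as above

# Made-up admissions; the second one has no weight group recorded
toy_admissions = pd.DataFrame({
    'admissionid': [0, 1, 2],
    'weightgroup': ['70-79', None, '80-89'],
})

# Look up each group, falling back to the default when it is not in the mapping
weights = toy_admissions.weightgroup.map(
    lambda group: EXAMPLE_WEIGHT_GROUPS.get(group, DEFAULT_WEIGHT_KG)
)
example_patient_weight = dict(zip(toy_admissions.admissionid, weights))
print(example_patient_weight)
```

Missing weight groups silently become 80 kg, so it may be worth counting how many admissions hit the fallback before relying on per-kg doses.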

#### Computing vasopressor rate in mcg/kg/min

- ordercategoryid
    - 65 -> continuous IV perfusor
- doserateperkg:
    - Whether dose is already per kg -> 0/1
- doseunitid:
    - 10 -> mg
    - 11 -> µg
- doserateunitid:
    - 4 -> min
    - 5 -> hour
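
The flags above can be combined into a single conversion to mcg/kg/min. The helper below is a minimal sketch of that arithmetic for one record; `to_mcg_per_kg_min` is a hypothetical name and the example dose is made up.

```python
def to_mcg_per_kg_min(dose, weight_kg, per_kg, is_mg, per_hour):
    """Convert a dose rate to mcg/kg/min (illustrative helper)."""
    if not per_kg:
        dose = dose / weight_kg   # normalise by body weight
    if is_mg:
        dose = dose * 1000        # mg -> mcg
    if per_hour:
        dose = dose / 60          # per hour -> per minute
    return dose

# Made-up example: 0.6 mg/hour (not per kg) in an 80 kg patient
rate = to_mcg_per_kg_min(0.6, weight_kg=80, per_kg=False, is_mg=True, per_hour=True)
print(round(rate, 3))  # 0.125

# A dose already given in mcg/kg/min passes through unchanged
unchanged = to_mcg_per_kg_min(0.2, weight_kg=80, per_kg=True, is_mg=False, per_hour=False)
```

The three steps are independent multiplications and divisions, so their order does not affect the result.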

In [None]:
vaso_cols = ['admissionid', 'itemid', 'start', 'stop', 'ordercategoryid', 'administeredunit', 'dose', 'doserateperkg', 'doseunitid', 'doserateunitid', 'rate']

vasopressors_df = []

for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/drugitems.csv", usecols=vaso_cols)):
    # convert starttime and endtime to hours since start of admission
    chunk['start'] = time_within_admission(chunk.start, chunk.admissionid)
    chunk['stop'] = time_within_admission(chunk.stop, chunk.admissionid)
    
    # extract drugitems corresponding to vasopressors via continuous IV
    chunk = chunk[chunk.itemid.isin(VASOPRESSORS) & (chunk.rate > 0) & (chunk.ordercategoryid == 65)].copy()   
      
    # rate conversion
    per_kg = chunk.doserateperkg == 1
    is_mg = chunk.doseunitid == 10 
    per_hour = chunk.doserateunitid == 5
    weight_kg = chunk.admissionid.transform(lambda adm: patient_weight[adm] if adm in patient_weight else 80)
        
    chunk.loc[~per_kg, 'dose'] = chunk.dose / weight_kg
    chunk.loc[is_mg, 'dose'] = 1000 * chunk.dose
    chunk.loc[per_hour, 'dose'] = chunk.dose / 60
    
    # convert to norepinephrine equivalent dose
    # TODO: Terlipressine
    # TODO: Methyleenblauw IV
    chunk.loc[chunk.itemid == 7179, 'dose'] = 0.01 * chunk.dose   # Dopamine
    chunk.loc[chunk.itemid == 19929, 'dose'] = 0.45 * chunk.dose  # Fenylefrine
    
    # absolute start of admission is not known -> assume now; set once on the
    # first chunk so that all chunks share the same reference time
    if i == 0:
        admission_start = pd.Timestamp.now()

    # convert to DataFrame
    vasopressor_df = pd.DataFrame({
        'icustay_id': chunk.admissionid,
        'starttime': hours_to_absolute_time(admission_start, hours=chunk.start),
        'endtime': hours_to_absolute_time(admission_start, hours=chunk.stop),
        'mcgkgmin': chunk.dose
    })
    vasopressors_df.append(vasopressor_df)
    
    if i > MAX_CHUNKS:
        break
    
# merge
vasopressors_df = pd.concat(vasopressors_df, axis=0).reset_index(drop=True)
vasopressors_df.head()

In [None]:
# save!
vasopressors_df.to_csv(OUT_DIR + 'vassopressors_mv_cohort.csv')
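
The norepinephrine-equivalent scaling used above can be written as a lookup of per-drug factors. The sketch below uses only the two factors that appear in the conversion cell; treating every other drug as factor 1.0 mirrors what that cell does in practice (Terlipressine and Methyleenblauw are still marked TODO there), and `norepinephrine_equivalent` is a hypothetical helper name.

```python
# Conversion factors taken from the cell above; drugs not listed are left
# unchanged (factor 1.0), which is an assumption of this sketch
NOREPINEPHRINE_EQUIVALENCE = {
    7179: 0.01,   # Dopamine
    19929: 0.45,  # Fenylefrine
}

def norepinephrine_equivalent(itemid, mcgkgmin):
    """Scale a dose in mcg/kg/min to its norepinephrine equivalent."""
    return NOREPINEPHRINE_EQUIVALENCE.get(itemid, 1.0) * mcgkgmin

# Made-up doses in mcg/kg/min
print(norepinephrine_equivalent(7179, 10.0))   # dopamine
print(norepinephrine_equivalent(7229, 0.2))    # noradrenaline, unchanged
```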

---

## Fluids

In [None]:
IV_CATEGORIES = [
    # 25, # Injecties Haematologisch
    55, # Infuus - Crystalloid
    # 27, # Injecties Overig
    # 23, # Injecties CZS/Sedatie/Analgetica
    # 67, # Injecties Hormonen/Vitaminen/Mineralen
    65, # 2. Spuitpompen
    # 24, # Injecties Circulatie/Diuretica
    15, # Injecties Antimicrobiele middelen
    17, # Infuus - Colloid
    61, # Infuus - Bloedproducten
]

In [None]:
fluid_cols = ['admissionid', 'itemid', 'item', 'start', 'stop', 'ordercategory', 'ordercategoryid', 'administered', 'administeredunitid']

# absolute start of admission is not known, so we assume it is now :)
admission_start = pd.Timestamp.now()

iv_fluids_df = []

for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/drugitems.csv", usecols=fluid_cols)):
    # convert starttime and endtime to hours since start of admission
    chunk['start'] = time_within_admission(chunk.start, chunk.admissionid)
    chunk['stop'] = time_within_admission(chunk.stop, chunk.admissionid)
        
    # extract drugitems corresponding to intravenous infusion in ml
    chunk = chunk[chunk.ordercategoryid.isin(IV_CATEGORIES) & (chunk.administeredunitid == 6)].copy()
        
    # TODO: tonicity

    # convert to DataFrame
    iv_fluid_df = pd.DataFrame({
        'icustay_id': chunk.admissionid,
        'starttime': hours_to_absolute_time(admission_start, hours=chunk.start),
        'endtime': hours_to_absolute_time(admission_start, hours=chunk.stop),
        'itemid': -1, # -1 as we'll be using the MIMIC preprocessing
        'ordercategoryname': chunk.ordercategory,
        'amountuom': 'ml',
        'amount': chunk.administered
    })
    iv_fluids_df.append(iv_fluid_df)
    
    if i > MAX_CHUNKS:
        break
    
# merge
iv_fluids_df = pd.concat(iv_fluids_df, axis=0).reset_index(drop=True)
iv_fluids_df.head()

In [None]:
# save!
iv_fluids_df.to_csv(OUT_DIR + 'inputevents_mv_cohort.csv')
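
Because only relative times are available, absolute timestamps are obtained by shifting a reference time by the number of hours since admission. The sketch below shows one way such a shift could be done with pandas; `relative_hours_to_timestamps` is a hypothetical stand-in and not necessarily how the notebook's own `hours_to_absolute_time` is implemented, and the fixed reference date is arbitrary.

```python
import pandas as pd

def relative_hours_to_timestamps(reference, hours):
    """Shift a fixed reference time by a Series of hour offsets."""
    return reference + pd.to_timedelta(hours, unit='h')

reference = pd.Timestamp('2020-01-01 00:00:00')  # arbitrary fixed reference
offsets = pd.Series([0.0, 1.5, 24.0])            # made-up hours since admission
stamps = relative_hours_to_timestamps(reference, offsets)
print(stamps)
```

Using one fixed reference for every table keeps the relative ordering of events within an admission intact, which is what the downstream preprocessing depends on.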

--- 
### Statistics

In [None]:
print('Mortality rate: %.3f' % cohort_df.hospital_expire_flag.mean())