# Notebook to Convert AmsterdamUMCdb to MIMIC-III Format

Here we will be converting the [AmsterdamUMCdb](https://github.com/AmsterdamUMC/AmsterdamUMCdb) data files to the MIMIC-III data file format as generated by [MIMIC-Code](https://github.com/MIT-LCP/mimic-code). We do this as to allow the exact same preprocessing pipeline to be applied to both MIMIC and the AmsterdamUMCdb.

In [1]:
import re
import numpy as np
import pandas as pd
from tqdm import tqdm

# for debugging
MAX_CHUNKS = 100
CHUNK_SIZE = 10000

### Convenience functions

As the files are in the orders of GBs we read in and process the files incrementally;

In [2]:
# reads file from path in chunks of size `chunksize`
def read_csv(path, usecols, chunksize=CHUNK_SIZE):
    for i, chunk in enumerate(pd.read_csv(path, usecols=usecols, encoding='latin1', engine='python', chunksize=chunksize)):
        yield i, chunk.reset_index(drop=True) # resets index so that indices range from 0 to chunksize - 1

---

## Sepsis-3 cohort
#### Start of admission
Start of admission is in milliseconds relative to the first admission of that same patient; therefore we convert to hours 

In [3]:
# Start times of admissions relative to the first admission (with start = 0)
start_of_admission = dict()
for _, chunk in read_csv(r"D:/AmsterdamUMCdb-v1.0.2/admissions.csv", usecols=['admissionid', 'admittedat']):
    # admissionid + admission time
    start_of_admission.update(dict(zip(chunk.admissionid, chunk.admittedat)))

print('Time since last admission: %.2f years' % (start_of_admission[14] / (1000 * 60 * 60 * 24 * 365)))

Time since last admission: 6.62 years


In [4]:
# convenience function to compute time in current admission in hours
def time_within_admission(times, admissionids):
    starttimes = pd.Series([start_of_admission[a] for a in admissionids])
    return (times.reset_index(drop=True) - starttimes) / (60*60*1000)  # hours

### Suspicion of infection
#### Definitions

In [5]:
# see: https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/amsterdamumcdb/sql/diagnosis/reason_for_admission.sql
IS_SURGICAL = [
    13116, # D_Thoraxchirurgie_CABG en Klepchirurgie
    16671, # DMC_Thoraxchirurgie_CABG en Klepchirurgie
    13117, # D_Thoraxchirurgie_Cardio anders
    16672, # DMC_Thoraxchirurgie_Cardio anders
    13118, # D_Thoraxchirurgie_Aorta chirurgie
    16670, # DMC_Thoraxchirurgie_Aorta chirurgie
    13119, # D_Thoraxchirurgie_Pulmonale chirurgie
    16673, # DMC_Thoraxchirurgie_Pulmonale chirurgie
    13121, # D_Algemene chirurgie_Buikchirurgie
    16643, # DMC_Algemene chirurgie_Buikchirurgie
    13123, # D_Algemene chirurgie_Endocrinologische chirurgie
    16644, # DMC_Algemene chirurgie_Endocrinologische chirurgie
    13145, # D_Algemene chirurgie_KNO/Overige
    16645, # DMC_Algemene chirurgie_KNO/Overige
    13125, # D_Algemene chirurgie_Orthopedische chirurgie
    16646, # DMC_Algemene chirurgie_Orthopedische chirurgie
    13122, # D_Algemene chirurgie_Transplantatie chirurgie
    16647, # DMC_Algemene chirurgie_Transplantatie chirurgie
    13124, # D_Algemene chirurgie_Trauma
    16648, # DMC_Algemene chirurgie_Trauma
    13126, # D_Algemene chirurgie_Urogenitaal
    16649, # DMC_Algemene chirurgie_Urogenitaal
    13120, # D_Algemene chirurgie_Vaatchirurgie
    16650, # DMC_Algemene chirurgie_Vaatchirurgie
    13128, # D_Neurochirurgie _Vasculair chirurgisch
    16661, # DMC_Neurochirurgie _Vasculair chirurgisch
    13129, # D_Neurochirurgie _Tumor chirurgie
    16660, # DMC_Neurochirurgie _Tumor chirurgie
    13130, # D_Neurochirurgie_Overige
    16662, # DMC_Neurochirurgie_Overige
    18596, # Apache II Operatief  Gastr-intenstinaal
    18597, # Apache II Operatief Cardiovasculair
    18598, # Apache II Operatief Hematologisch
    18599, # Apache II Operatief Metabolisme
    18600, # Apache II Operatief Neurologisch
    18601, # Apache II Operatief Renaal
    18602, # Apache II Operatief Respiratoir
    17008, # APACHEIV Post-operative cardiovascular
    17009, # APACHEIV Post-operative gastro-intestinal
    17010, # APACHEIV Post-operative genitourinary
    17011, # APACHEIV Post-operative hematology
    17012, # APACHEIV Post-operative metabolic
    17013, # APACHEIV Post-operative musculoskeletal /skin
    17014, # APACHEIV Post-operative neurologic
    17015, # APACHEIV Post-operative respiratory
    17016, # APACHEIV Post-operative transplant
    17017  # APACHEIV Post-operative trauma
]

SEPSIS_ANTIBIOTICS = [
    6834, # Amikacine (Amukin)
    6847, # Amoxicilline (Clamoxyl/Flemoxin)
    6871, # Benzylpenicilline (Penicilline)
    6917, # Ceftazidim (Fortum)
    # 6919, # Cefotaxim (Claforan) -> prophylaxis
    6948, # Ciprofloxacine (Ciproxin)
    6953, # Rifampicine (Rifadin)
    6958, # Clindamycine (Dalacin)
    7044, # Tobramycine (Obracin)
    # 7064, # Vancomycine -> prophylaxis for valve surgery
    7123, # Imipenem (Tienam)
    7185, # Doxycycline (Vibramycine)
    # 7187, # Metronidazol (Flagyl) -> often used for GI surgical prophylaxis
    # 7208, # Erythromycine (Erythrocine) -> often used for gastroparesis
    7227, # Flucloxacilline (Stafoxil/Floxapen)
    7231, # Fluconazol (Diflucan)
    7232, # Ganciclovir (Cymevene)
    7233, # Flucytosine (Ancotil)
    7235, # Gentamicine (Garamycin)
    7243, # Foscarnet trinatrium (Foscavir)
    7450, # Amfotericine B (Fungizone)
    # 7504, # X nader te bepalen # non-stock medication
    8127, # Meropenem (Meronem)
    8229, # Myambutol (ethambutol)
    8374, # Kinine dihydrocloride
    # 8375, # Immunoglobuline (Nanogam) -> not anbiotic
    # 8394, # Co-Trimoxazol (Bactrimel) -> often prophylactic (unless high dose)
    8547, # Voriconazol(VFEND)
    # 9029, # Amoxicilline/Clavulaanzuur (Augmentin) -> often used for ENT surgical prophylaxis
    9030, # Aztreonam (Azactam)
    9047, # Chlooramfenicol
    # 9075, # Fusidinezuur (Fucidin) -> prophylaxis
    9128, # Piperacilline (Pipcil)
    9133, # Ceftriaxon (Rocephin)
    # 9151, # Cefuroxim (Zinacef) -> often used for GI/transplant surgical prophylaxis
    # 9152, # Cefazoline (Kefzol) -> prophylaxis for cardiac surgery
    9458, # Caspofungine
    9542, # Itraconazol (Trisporal)
    # 9602, # Tetanusimmunoglobuline -> prophylaxis/not antibiotic
    12398, # Levofloxacine (Tavanic)
    12772, # Amfotericine B lipidencomplex  (Abelcet)
    15739, # Ecalta (Anidulafungine)
    16367, # Research Anidulafungin/placebo
    16368, # Research Caspofungin/placebo
    18675, # Amfotericine B in liposomen (Ambisome )
    19137, # Linezolid (Zyvoxid)
    19764, # Tigecycline (Tygacil)
    19773, # Daptomycine (Cubicin)
    20175  # Colistine
]

# Other antibiotics may be administered during sepsis treatment
# (but drug culture must be drawn!)
OTHER_ANTIBIOTICS = [
    7064, # Vancomycine -> prophylaxis for valve surgery
    7187, # Metronidazol (Flagyl) -> often used for GI surgical prophylaxis
    8394, # Co-Trimoxazol (Bactrimel) -> often prophylactic (unless high dose)
    9029, # Amoxicilline/Clavulaanzuur (Augmentin) -> often used for ENT surgical prophylaxis
    9151, # Cefuroxim (Zinacef) -> often used for GI surgical prophylaxis
    9152  # Cefazoline (Kefzol) -> prophylaxis
]

# Culture tests
CULTURE_TESTS = [
    9189, # Bloedkweken afnemen
    9190, # Cathetertipkweek afnemen
    9198, # Rectumkweek afnemen -> often used routinely
    9200, # Wondkweek afnemen
    9202, # Ascitesvochtkweek afnemen
    9205 # Legionella sneltest (urine)
]

# surgical admissions vs non-surgical admissions
re_sepsis_surg = r'sepsis|pneumoni|GI perforation|perforation/rupture|infection|abscess|GI Vascular ' \
                 r'ischemia|diverticular|appendectomy|peritonitis '

re_sepsis_med = r'sepsis|septic|infect|pneumoni|cholangitis|pancr|endocarditis|meningitis|GI ' \
                r'perforation|abces|abscess|darm ischaemie|GI vascular|fasciitis' + \
                r'|inflammatory|peritonitis'

#### 1. Were antibiotics administered during first 24h of admission?

In [6]:
antibiotics_df = []
for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/drugitems.csv", usecols=['admissionid', 'itemid', 'start'])):
    # were antibiotics administered within 24 hours?
    # remark: reason_for_admission.sql assumed admission to be the first 
    chunk['sepsis_antibiotics_bool'] = chunk.itemid.isin(SEPSIS_ANTIBIOTICS) & (time_within_admission(chunk.start, chunk.admissionid) < 24)  # first 24 hours
    chunk['other_antibiotics_bool'] = chunk.itemid.isin(OTHER_ANTIBIOTICS) & (time_within_admission(chunk.start, chunk.admissionid) < 24)    # Note to self: Check also for culture!
    
    # drop rows without antibiotics to save memory :)
    chunk = chunk.loc[chunk.sepsis_antibiotics_bool | chunk.other_antibiotics_bool, :]
    antibiotics_df.append(chunk)
    
    if i > MAX_CHUNKS:
        break

antibiotics_df = pd.concat(antibiotics_df, axis=0)

251it [00:26,  9.53it/s]


#### 2. Was a positive blood culture found within first 6h?

In [7]:
culture_df = []
for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/procedureorderitems.csv", usecols=['admissionid', 'itemid', 'registeredat'])):
    # culture found?
    chunk['positive_blood_culture'] = chunk.itemid.isin(CULTURE_TESTS) & (time_within_admission(chunk.registeredat, chunk.admissionid) < 6)  # first 6 hours
    
    # drop rows with all False to save memory :)
    chunk = chunk.loc[chunk.positive_blood_culture, :]
    culture_df.append(chunk)
    
    if i > MAX_CHUNKS:
        break

culture_df = pd.concat(culture_df, axis=0)

219it [00:14, 14.80it/s]


#### 3. Was there an infection/sepsis-related diagnosis?

In [8]:
# Based on code base from the AmsterdamUMCdb repository
# see: https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/amsterdamumcdb/sql/diagnosis/reason_for_admission.sql

def get_diagnosis(df):
    # drop prefixes from diagnosis name if APCAHE II/IV
    mask = df.itemid.isin([18669, 18671])
    df.loc[mask, 'value'] = df[mask].value.str.split(' - ').str[1]
    df['diagnosis'] = df.value

def is_surgical(df):
    df['surgical_adm'] = df.itemid.isin(IS_SURGICAL) | ((df.itemid == 18669) & ((df.valueid > 0) & (df.valueid <= 27))) | ((df.itemid == 18671) & ((df.valueid > 221) & (df.valueid <= 452)))

def sepsis_at_admission(df):
    df['known_no_sepsis'] = (df.itemid == 15808) & (df.valueid == 2)  # 15808 = 'opname sepsis'; 2 = explicitly no sepsis (to drop later)
    df['sepsis_on_admission'] = (df.itemid == 15808) & (df.valueid == 1)

In [10]:
diagnoses_df = []
for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/listitems.csv", usecols=['admissionid', 'itemid', 'valueid', 'value', 'updatedat'])):   
    # was sepsis explicitly diagnosed?
    sepsis_at_admission(chunk) 
    
    # diagnosis related to infection/sepsis?
    get_diagnosis(chunk)
    is_surgical(chunk)
    surg_diag = chunk.surgical_adm & chunk.diagnosis.str.contains(re_sepsis_surg, na=False, flags=re.IGNORECASE)  # surgical admission with sepsis
    med_diag = ~chunk.surgical_adm & chunk.diagnosis.str.contains(re_sepsis_med, na=False, flags=re.IGNORECASE)   # medical admission with sepsis
    chunk['suspected_infection'] = surg_diag | med_diag
    
    # drop rows that aren't needed to save :)
    chunk = chunk.loc[chunk.suspected_infection | chunk.known_no_sepsis | chunk.sepsis_on_admission, :]
    diagnoses_df.append(chunk)
    
    if i > MAX_CHUNKS:
        break

diagnoses_df = pd.concat(diagnoses_df, axis=0)

251it [00:30,  8.21it/s]


<br>

#### Get admissions with suspected infections

In [11]:
# admission ids with antibiotics
sepsis_antibiotics = set(antibiotics_df.admissionid[antibiotics_df.sepsis_antibiotics_bool].unique())
other_antibiotics = set(antibiotics_df.admissionid[antibiotics_df.other_antibiotics_bool].unique())

# admission ids with positive blood cultures
pos_blood_culture = set(culture_df.admissionid)

# admission ids with explicit diagnosis or suspicion of sepsis
sepsis_on_admission = set(diagnoses_df.admissionid[diagnoses_df.sepsis_on_admission].unique())
susp_infection = set(diagnoses_df.admissionid[diagnoses_df.suspected_infection].unique())
known_no_sepsis = set(diagnoses_df.admissionid[diagnoses_df.known_no_sepsis].unique())

# see: https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/amsterdamumcdb/cohorts.py
# suspicion of infection is quantified as either a known diagnosis of a major injection after/during a medical/surgical 
# admission OR when sepsis-related antibiotics are administered OR when a positive blood culture is found
adm_susp_infection = (susp_infection | sepsis_on_admission | sepsis_antibiotics | (other_antibiotics & pos_blood_culture)) - known_no_sepsis

In [12]:
print('With suspected infection:', len(adm_susp_infection))

With suspected infection: 2649


#### 4. Estimate infection time

We define the suspected infection time as the earliest moment of administering antibiotics, diagnosis of sepsis, or where a positive blood culture is found:

In [14]:
susp_infection_time_poe = {}
for admissionid in tqdm(adm_susp_infection):
    # determine earliest time where infection was suspected
    # i.e. earliest time of antibiotics OR culture found OR sepsis diagnosis
    times = antibiotics_df.start[antibiotics_df.admissionid == admissionid].tolist()  # empty if no antibiotics received
    times += culture_df.registeredat[culture_df.admissionid == admissionid].tolist()  # empty if no positive culture found
    times += diagnoses_df.updatedat[diagnoses_df.admissionid == admissionid].tolist() # empty if no explicit documented diagnosis
    min_time = min(times)
    
    # Compute suspected infection time in hours relative to start of admission
    starttime = start_of_admission[admissionid]
    susp_infection_time_poe[admissionid] = (min_time - starttime) / (1000 * 60 * 60)
    
susp_infection_time_poe[8216]

100%|████████████████████████████████████████████████████████████████████████████| 2649/2649 [00:01<00:00, 1871.53it/s]


14.366666666666667

### qSOFA $\geq2$

TODO: SOFA (instead of qSOFA)

#### Definitions

In [15]:
def gcs_eye_score(df):
    df = df.copy()
    df.loc[df.itemid == 6732, 'eyes_score'] = 5 - df.valueid   # --Actief openen van de ogen
    df.loc[df.itemid == 13077, 'eyes_score'] = df.valueid      # --A_Eye
    df.loc[df.itemid == 14470, 'eyes_score'] = df.valueid - 4  # --RA_Eye
    df.loc[df.itemid == 16628, 'eyes_score'] = df.valueid - 4  # --MCA_Eye
    df.loc[df.itemid == 19635, 'eyes_score'] = df.valueid - 4  # --E_EMV_NICE_24uur
    df.loc[df.itemid == 19638, 'eyes_score'] = df.valueid - 8  # --E_EMV_NICE_Opname
    return df
    
def gcs_motor_score(df):
    df = df.copy()
    df.loc[df.itemid == 6734, 'motor_score'] = 7 - df.valueid   # --Beste motore reactie van de armen
    df.loc[df.itemid == 13072, 'motor_score'] = df.valueid      # --A_Motoriek
    df.loc[df.itemid == 14476, 'motor_score'] = df.valueid - 6  # --RA_Motoriek
    df.loc[df.itemid == 16634, 'motor_score'] = df.valueid - 6  # --MCA_Motoriek
    df.loc[df.itemid == 19636, 'motor_score'] = df.valueid - 6  # --M_EMV_NICE_24uur
    df.loc[df.itemid == 19639, 'motor_score'] = df.valueid - 12 # --M_EMV_NICE_Opname
    return df

def gcs_verbal_score(df):
    df = df.copy()
    df.loc[df.itemid == 6735, 'verbal_score'] = 6 - df.valueid   # --Beste verbale reactie
    df.loc[df.itemid == 13066, 'verbal_score'] = df.valueid      # --A_Verbal
    df.loc[df.itemid == 14482, 'verbal_score'] = df.valueid - 5  # --RA_Verbal
    df.loc[df.itemid == 16640, 'verbal_score'] = df.valueid - 5  # --MCA_Verbal
    df.loc[df.itemid == 19637, 'verbal_score'] = df.valueid - 9  # --V_EMV_NICE_24uur
    df.loc[df.itemid == 19640, 'verbal_score'] = df.valueid - 15 # --M_EMV_NICE_Opname
    
    # cap verbal score at 1
    df.loc[df['verbal_score'] < 1, 'verbal_score'] = 1
    return df

#### Glasgow Coma Scale

In [19]:
# see: https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/gcs.sql
gcs = {}
for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/listitems.csv", usecols=['admissionid', 'itemid', 'valueid', 'value', 'measuredat', 'registeredby'])): 
    # Compute score with motor/verbal/eye scores in first 24 hours
    measurement_times = time_within_admission(chunk.measuredat, chunk.admissionid)
    chunk = chunk[(measurement_times >= 0) & (measurement_times <= 24)]
    
    # component scores: eye / motor / verbal
    chunk = gcs_eye_score(chunk)
    chunk = gcs_motor_score(chunk)
    chunk = gcs_verbal_score(chunk)
    
    # GCS = eye-score + motor-score + verbal-score (when possible)
    scores = chunk.groupby(['admissionid', 'measuredat']).mean().reset_index() # Aligns rows of component scores (there is only one value per measuredat for each score)
    chunk['gcs'] = scores.eyes_score + scores.motor_score + scores.verbal_score
    
    # drop admissions where score could not be computed
    chunk = chunk[chunk['gcs'].notna()]
    
    gcs.update(dict(zip(chunk.admissionid, chunk['gcs'])))
    
    if i > MAX_CHUNKS:
        break
    

251it [00:22, 11.01it/s]


In [21]:
[gcs[i] for i in list(gcs.keys())[:10]]

[3.0, 15.0, 15.0, 15.0, 14.0, 15.0, 15.0, 3.0, 15.0, 15.0]

#### Respiratory rate

In [None]:
# itemids of respiratory rate measurements
RESP_RATE = [
    8874, 
    8873, 
    9654, 
    7726, 
    12266
]

In [None]:
for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/numericitems.csv", usecols=['admissionid', 'itemid', 'value', 'measuredat'])):
    # Compute average resp rate in first 24 hours
    measurement_times = time_within_admission(chunk.measuredat, chunk.admissionid)
    chunk = chunk[(measurement_times >= 0) & (measurement_times <= 24)]
    
    chunk.value[chunk['itemid'].isin(RESP_RATE)]
    
    if i > MAX_CHUNKS:
        break



#### Blood pressure

### Putting it together

----

## Demographics

#### Systemic inflammatory response syndrome (SIRS) score
Sources:
- Temperature: [temperature.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/temperature.sql)
- Heart rate: [heart_rate.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/heart_rate.sql)
- Respiratory rate: [resp_rate.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/resp_rate.sql)
- WBC/Leukocytes: [wbc.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/wbc.sql)
- SIRS definition: https://www.mdcalc.com/calc/1096/sirs-sepsis-septic-shock-criteria

In [5]:
TEMPERATURE = [
    8658,  # --Temp Bloed
    8659,  # --Temperatuur Perifeer 2
    8662,  # --Temperatuur Perifeer 1
    13058, # --Temp Rectaal
    13059, # --Temp Lies
    13060, # --Temp Axillair
    13061, # --Temp Oraal
    13062, # --Temp Oor
    13063, # --Temp Huid
    13952, # --Temp Blaas
    16110  # --Temp Oesophagus
]

PCO2 = [
    6846,  # --PCO2
    9990,  # --pCO2 (bloed)
    21213  # --PCO2 (bloed) - kPa
]

HEARTRATE = [
    6640   # --Hartfrequentie
]

RESP_RATE = [
    8873,  # --Ademfrequentie Evita: measurement by Evita ventilator, most accurate
    12266, # --Ademfreq.: measurement by Servo-i/Servo-U ventilator, most accurate
    8874   # --Ademfrequentie Monitor: measurement by patient monitor using ECG-impedance, less accurate  
]

WBC = [
    6779,  # --Leucocyten 10^9/l
    9965   # --Leuco's (bloed) 10^9/l
]

BANDS = [
    11586  # -- Staaf % (bloed)
]

In [35]:
def compute_sirs(resp_rate, paco2, temp, heart_rate, wbc, bands):
    """ Computes Systemic inflammatory response syndrome (SIRS) score given vital parameters """
    # which admissions have available the data to estimate SIRS?
    admissionids = temp.keys() & heart_rate.keys() & (resp_rate.keys() | paco2.keys()) & (wbc.keys() | bands.keys())
    
    # compute SIRS for eligible admissions
    out = dict()
    for admissionid in admissionids:
        # For details, see: http://www.mdcalc.com/calc/1096/sirs-sepsis-septic-shock-criteria
        score = int(temp[admissionid] < 36 or temp[admissionid] > 38)
        score += int(heart_rate[admissionid] > 90)
        score += int(resp_rate[admissionid] > 20) if admissionid in resp_rate else int(paco2[admissionid] < 32)
        score += int(wbc[admissionid] > 12 or wbc[admissionid] < 4) if admissionid in wbc else int(bands[admissionid] > 10)
        
        out[admissionid] = score
    return out

In [25]:
sirs_on_admission = dict()

for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/numericitems.csv", usecols=['admissionid', 'itemid', 'value', 'measuredat', 'unitid'])):
    # which observations are within 24h of start admission?
    times = time_within_admission(chunk.measuredat, chunk.admissionid)
    within_24h = (times > -0.5) & (times < 24) # Allow 30 minute margin before admission
    
    # Respiratory rate (unit = breaths/min)
    resp_rate = chunk[chunk.itemid.isin(RESP_RATE) & (chunk.unitid == 15) & within_24h]
    resp_rate = resp_rate.groupby('admissionid', sort=False).value.first().to_dict()
    
    # PaCO2 (mmHg)
    paco2 = chunk[chunk.itemid.isin(PCO2) & within_24h].copy()
    paco2.loc[paco2.unitid == 152, 'value'] *= paco2.value * 7.50061683  # conversion rate kPa to mmHg
    paco2 = paco2.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Temperature (degrees C)
    temp = chunk[chunk.itemid.isin(TEMPERATURE) & (chunk.unitid == 59) & within_24h]
    temp = temp.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Heart rate (beats/min)
    heart_rate = chunk[chunk.itemid.isin(HEARTRATE) & (chunk.unitid == 15) & within_24h]
    heart_rate = heart_rate.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Leukocytes (10^9/l)
    wbc = chunk[chunk.itemid.isin(WBC) & (chunk.unitid == 101) & within_24h]
    wbc = wbc.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Bands (bloed %)
    bands = chunk[chunk.itemid.isin(BANDS) & within_24h]
    bands = bands.groupby('admissionid', sort=False).value.first().to_dict()
        
    # SIRS score
    sirs_on_admission.update(compute_sirs(resp_rate, paco2, temp, heart_rate, wbc, bands))
        
    if i > MAX_CHUNKS:
        break

101it [00:08, 11.40it/s]


In [26]:
sirs_on_admission[11]  # SIRS of admission 11 in first 24h

2

#### Mechanical ventilation

Source: [mechanical_ventilation.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/lifesupport/mechanical_ventilation.sql)

In [27]:
# Combinations of itemids and valueids in listitems.csv
# indicating use of (non-)invasive mechanical ventilation
MECH_VENT_SETTINGS = [
    (9534, (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)),                    # --Type beademing Evita 1
    (6685, (1, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 20, 22)),                 # --Type Beademing Evita 4
    (8189, (16,)),                                                          # --Toedieningsweg O2
    (12290, (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)),  # --Ventilatie Mode (Set) Servo-I and Servo-U ventilators
    (12347, (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)),  # --Ventilatie Mode (Set) (2) Servo-I and Servo-U ventilators
    (12376, (1, 2)),                                                        # --Mode (Bipap Vision)
]    

In [28]:
class Ventilation():
    """
    Convenience class to check for mechanical ventilation
    """
    def __init__(self, mech_vent):
        self._vent_settings = set()
        for itemid, valueids in mech_vent:
            self._vent_settings.update([(itemid, valueid) for valueid in valueids])
            
    def __call__(self, itemids, valueids):
        """ Checks if (itemid, valueid) pair is a valid ventilator settings """
        return np.array([x in self._vent_settings for x in list(zip(itemids, valueids))])
    
is_mech_vent = Ventilation(MECH_VENT_SETTINGS)

For each admission, see if mechanical ventilation is used in the first 24 hours (see Roggeveen et al., 2021)

In [29]:
admissions_with_vent = set()

for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/listitems.csv", usecols=['admissionid', 'itemid', 'valueid', 'measuredat'])):
    # which observations are within 24h of start admission?
    times = time_within_admission(chunk.measuredat, chunk.admissionid)
    within_24h = (times > -0.5) & (times < 24) # Allow 30 minute margin before admission
    
    # check whether (itemid, valueid) in listitems is mechanical ventilation and whether it's in the first 24 hours
    chunk['has_vent'] = is_mech_vent(chunk.itemid, chunk.valueid) & within_24h
    
    # admissionids with mechanical ventilation
    admissions_with_vent.update(chunk.admissionid[chunk.has_vent])
    
    if i > MAX_CHUNKS:
        break
        
print('Number of admissions with ventilation:', len(admissions_with_vent))

101it [00:07, 13.02it/s]

Number of admissions with ventilation: 526





#### Sequential Organ Failure Assessment (SOFA) score

Sources:

- PaO2/FiO2: [pO2_pCO2_FiO2.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/pO2_pCO2_FiO2.sql)
- Mechanical ventilation: [mechanical_ventilation.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/lifesupport/mechanical_ventilation.sql)
- Platelets: [platelets.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/platelets.sql)
- Glasgow Coma Scale: [gcs.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/gcs.sql)
- Bilirubin: [bilirubin.sql](https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/b6f295fbfbd9a1f8f9cfa71dab58c670120543d8/amsterdamumcdb/sql/common/bilirubin.sql)
- SOFA definition: https://www.mdcalc.com/calc/691/sequential-organ-failure-assessment-sofa-score

In [30]:
PO2 = [
    7433,  # --PO2
    9996,  # --PO2 (bloed)
    21214  # --PO2 (bloed) - kPa
]

# TODO: estimate FiO2 if not known from ventilator settings
FIO2 = [
    6699,  # --FiO2 %: setting on Evita ventilator
    12279, # --O2 concentratie --measurement by Servo-i/Servo-U ventilator
    12369, # --SET %O2: used with BiPap Vision ventilator
    16246  # --Zephyros FiO2: Non-invasive ventilation
]

PLATELETS = [
    9964,  # --Thrombo's (bloed)
    6797,  # --Thrombocyten
    10409, # --Thrombo's citr. bloed (bloed)
    14252  # --Thrombo CD61 (bloed)   
]

BILIRUBIN = [
    6813,  # --Bili Totaal
    9945   # --Bilirubine (bloed)
]

MEANBP = [
    6642,  # --ABP gemiddeld
    6679,  # --Niet invasieve bloeddruk gemiddeld
    8843   # --ABP gemiddeld II
]

CREATININE = [
    6836,  # --Kreatinine µmol/l (erroneously documented as µmol)
    9941,  # --Kreatinine (bloed) µmol/l
    14216  # --KREAT enzym. (bloed) µmol/l   
]

In [38]:
def compute_sofa(po2, fio2, vent, platelets, gcs, bilirubin, meanbp, creatinine):
    """ Computes Sequential Organ Failure Assessment (SOFA) score given vital parameters """
    # which admissions have available the data to estimate SOFA?
    #admissionids = temp.keys() & heart_rate.keys() & (resp_rate.keys() | paco2.keys()) & (wbc.keys() | bands.keys())
    
    # compute SIRS for eligible admissions
    out = dict()
    #TODO
#     for admissionid in admissionids:
#         # For details, see: http://www.mdcalc.com/calc/1096/sirs-sepsis-septic-shock-criteria
#         score = int(temp[admissionid] < 36 or temp[admissionid] > 38)
#         score += int(heart_rate[admissionid] > 90)
#         score += int(resp_rate[admissionid] > 20) if admissionid in resp_rate else int(paco2[admissionid] < 32)
#         score += int(wbc[admissionid] > 12 or wbc[admissionid] < 4) if admissionid in wbc else int(bands[admissionid] > 10)
        
#         out[admissionid] = score
    return out
    

In [39]:
sofa_on_admission = dict()

for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/numericitems.csv", usecols=['admissionid', 'itemid', 'value', 'measuredat', 'unitid', 'unit'])):
    # which observations are within 24h of start admission?
    times = time_within_admission(chunk.measuredat, chunk.admissionid)
    within_24h = (times > -0.5) & (times < 24) # Allow 30 minute margin before admission
    
    # PO2 (mmHg)
    po2 = chunk[chunk.itemid.isin(PO2) & within_24h].copy()
    po2.loc[po2.unitid == 152, 'value'] *= po2.value * 7.50061683  # conversion rate kPa to mmHg
    po2 = po2.groupby('admissionid', sort=False).value.first().to_dict()
    
    # FiO2 (%)
    fio2 = chunk[chunk.itemid.isin(FIO2) & within_24h]
    fio2 = fio2.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Mechanical ventilation
    vent = chunk.admissionid.isin(admissions_with_vent).to_dict()
    
    # Platelets
    platelets = chunk[chunk.itemid.isin(PLATELETS) & within_24h]
    platelets = platelets.groupby('admissionid', sort=False).value.first().to_dict()
    
    # GCS: TODO
    gcs = None
    
    # Bilirubin
    bilirubin = chunk[chunk.itemid.isin(BILIRUBIN) & within_24h]
    bilirubin = bilirubin.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Mean arterial blood pressure
    meanbp = chunk[chunk.itemid.isin(MEANBP) & within_24h]
    meanbp = meanbp.groupby('admissionid', sort=False).value.first().to_dict()
    
    # Creatinine
    creatinine = chunk[chunk.itemid.isin(CREATININE) & within_24h]
    creatinine = creatinine.groupby('admissionid', sort=False).value.first().to_dict()
    
    sofa_on_admission.update(compute_sofa(po2, fio2, vent, platelets, gcs, bilirubin, meanbp, creatinine))
    
    if i > MAX_CHUNKS:
        break

101it [00:09, 10.89it/s]


#### Age, Height, Weight, Gender

In [40]:
# Age/weight/height groups to their centroids (best guess)
AGE_GROUPS = {
    '18-39': 30,
    '40-49': 45,
    '50-59': 55,
    '60-69': 65,
    '70-79': 75, 
    '80+': 85  # Arbitrary
}

WEIGHT_GROUPS = {
    '59-': 55,
    '60-69': 65, 
    '70-79': 75,
    '80-89': 85,
    '90-99': 95,
    '100-109': 105,
    '110+': 115,
    'N/A': -1  # To be replaced with Dutch average weight (for men and women separately)
}
  
HEIGHT_GROUPS = {
    '159-': 155,
    '160-169': 165,
    '170-179': 175,
    '180-189': 185, 
    '190+': 195, 
    'N/A': -1  # Average height of men and women in NL
}

Using the above *mapper* functions we can now map from grouped demographics (`agegroup`, `heightgroup`, `weightgroup`) to a real value and when unknown use per-gender guesses of weight and height (better than a naive average over the population)

### Demographics DataFrame

In [43]:
demo_cols = ['admissionid', 'gender', 'agegroup', 'heightgroup', 'weightgroup']

demo_df = []
for i, chunk in read_csv(r"D:/AmsterdamUMCdb-v1.0.2/admissions.csv", usecols=demo_cols, chunksize=100000000):
    demo_chunk = pd.DataFrame({
        'icustay_id': chunk.admissionid,
        'is_male': chunk.gender == 'Man',
        'age': chunk.agegroup.transform(lambda x: AGE_GROUPS[x] if x in AGE_GROUPS else np.NaN),
        'height': chunk.heightgroup.transform(lambda x: HEIGHT_GROUPS[x] if x in HEIGHT_GROUPS else np.NaN),
        'weight': chunk.weightgroup.transform(lambda x: WEIGHT_GROUPS[x] if x in WEIGHT_GROUPS else np.NaN),
        'vent': chunk.admissionid.isin(admissions_with_vent),
        'sirs': chunk.admissionid.transform(lambda x: sirs_on_admission[x] if x in sirs_on_admission else np.NaN),  # NaN if couldn't be estimated
        'sofa': chunk.admissionid.transform(lambda x: sofa_on_admission[x] if x in sofa_on_admission else np.NaN)
    })
    demo_df.append(demo_chunk)
    
demo_df = pd.concat(demo_df, axis=0).reset_index(drop=True)

In [44]:
demo_df

Unnamed: 0,icustay_id,is_male,age,height,weight,vent,sirs,sofa
0,0,False,85,165.0,65.0,True,2.0,
1,1,True,65,175.0,75.0,True,1.0,
2,2,True,65,185.0,95.0,True,1.0,
3,3,True,55,185.0,95.0,True,2.0,
4,4,True,75,175.0,75.0,True,,
...,...,...,...,...,...,...,...,...
23101,23548,False,45,185.0,75.0,False,,
23102,23549,False,45,165.0,65.0,False,,
23103,23550,True,85,175.0,55.0,False,,
23104,23551,False,45,165.0,65.0,False,,
