# Notebook to Convert AmsterdamUMCdb to MIMIC-III Format

Here we convert the [AmsterdamUMCdb](https://github.com/AmsterdamUMC/AmsterdamUMCdb) data files to the MIMIC-III data file format as generated by [MIMIC-Code](https://github.com/MIT-LCP/mimic-code). This allows the exact same preprocessing pipeline to be applied to both MIMIC-III and AmsterdamUMCdb.

In [1]:
import re
import numpy as np
import pandas as pd
from tqdm import tqdm

MAX_CHUNKS = 250 # for debugging

### Convenience functions

As the files are on the order of gigabytes, we read them incrementally in chunks:

In [2]:
# reads file from path in chunks of size `chunksize`
def read_csv(path, usecols, chunksize=10000):
    for i, chunk in enumerate(pd.read_csv(path, usecols=usecols, encoding='latin1', engine='python', chunksize=chunksize)):
        yield i, chunk
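
To see the chunked-reading pattern on something small, the sketch below runs `pd.read_csv` with `chunksize` over an in-memory CSV. The data and the names `toy_csv`, `chunk_sizes` and `chunk_indices` are made up for illustration and are not part of AmsterdamUMCdb. Note that each chunk keeps its row positions from the full file as its index rather than restarting at 0.

```python
import io
import pandas as pd

# Toy CSV standing in for a large data file (values are made up)
toy_csv = io.StringIO(
    "admissionid,itemid,start\n"
    "0,6834,10\n"
    "0,7064,20\n"
    "1,6834,30\n"
    "1,9152,40\n"
    "2,6917,50\n"
)

chunk_sizes = []
chunk_indices = []
for i, chunk in enumerate(pd.read_csv(toy_csv, usecols=['admissionid', 'start'], chunksize=2)):
    chunk_sizes.append(len(chunk))            # number of rows in this chunk
    chunk_indices.append(list(chunk.index))   # row labels carried over from the full file

print(chunk_sizes)     # [2, 2, 1]
print(chunk_indices)   # [[0, 1], [2, 3], [4]]
```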

---

## Sepsis-3 cohort
#### Start time of admissions
Admission times are stored relative to the patient's first admission (the first admission has start time 0):

In [3]:
time_of_admission = dict()
for _, chunk in read_csv(r"D:/AmsterdamUMCdb-v1.0.2/admissions.csv", usecols=['admissionid', 'admittedat']):
    # add new admissionid:admitted-at pairs
    time_of_admission.update(dict(zip(chunk.admissionid, chunk.admittedat)))

print('Time since first admission of the patient of admission 14: %.2f years' % (time_of_admission[14] / (1000 * 60 * 60 * 24 * 365)))

Time since first admission of the patient of admission 14: 6.62 years
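
The division in the cell above implies that timestamps are stored in milliseconds. A minimal sketch of the unit conversions used throughout this notebook follows; the helper names `ms_to_hours` and `ms_to_years` are hypothetical and only serve to make the arithmetic explicit.

```python
MS_PER_HOUR = 1000 * 60 * 60          # milliseconds in one hour
MS_PER_DAY = MS_PER_HOUR * 24         # milliseconds in one day
MS_PER_YEAR = MS_PER_DAY * 365        # milliseconds in a 365-day year


def ms_to_hours(ms):
    # hypothetical helper: milliseconds -> hours
    return ms / MS_PER_HOUR


def ms_to_years(ms):
    # hypothetical helper: milliseconds -> years (365-day years, as in the cell above)
    return ms / MS_PER_YEAR


print(ms_to_hours(24 * 60 * 60 * 1000))   # 24.0
print(ms_to_years(2 * MS_PER_YEAR))       # 2.0
```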


In [4]:
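# convenience function to look up the admission time for a Series (or list) of admission ids
def start_of_admission(times):
    # `map` keeps the index of `times`, so the result stays aligned with the rows of
    # the chunk it came from (chunks after the first do not start at index 0)
    return pd.Series(times).map(time_of_admission)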
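
When admission times are subtracted from a column of a chunk, pandas aligns the two operands on their index, so the looked-up times need to carry the same index as the chunk. The sketch below illustrates this with `Series.map` on made-up data; `toy_time_of_admission` and the chunk contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical admission times in ms (made-up values)
toy_time_of_admission = {10: 0, 11: 5_000, 12: 90_000}

# A chunk taken from the middle of a file: its index does not start at 0
chunk = pd.DataFrame(
    {'admissionid': [11, 12, 11], 'start': [6_000, 90_500, 9_000]},
    index=[100, 101, 102],
)

# map() returns a Series with the same index as chunk.admissionid
admitted = chunk.admissionid.map(toy_time_of_admission)
since_admission = chunk.start - admitted

print(since_admission.tolist())       # [1000, 500, 4000]
print(list(since_admission.index))    # [100, 101, 102]
```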

### Suspicion of infection
#### Definitions

In [5]:
# see: https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/amsterdamumcdb/sql/diagnosis/reason_for_admission.sql
IS_SURGICAL = [
    13116, # D_Thoraxchirurgie_CABG en Klepchirurgie
    16671, # DMC_Thoraxchirurgie_CABG en Klepchirurgie
    13117, # D_Thoraxchirurgie_Cardio anders
    16672, # DMC_Thoraxchirurgie_Cardio anders
    13118, # D_Thoraxchirurgie_Aorta chirurgie
    16670, # DMC_Thoraxchirurgie_Aorta chirurgie
    13119, # D_Thoraxchirurgie_Pulmonale chirurgie
    16673, # DMC_Thoraxchirurgie_Pulmonale chirurgie
    13121, # D_Algemene chirurgie_Buikchirurgie
    16643, # DMC_Algemene chirurgie_Buikchirurgie
    13123, # D_Algemene chirurgie_Endocrinologische chirurgie
    16644, # DMC_Algemene chirurgie_Endocrinologische chirurgie
    13145, # D_Algemene chirurgie_KNO/Overige
    16645, # DMC_Algemene chirurgie_KNO/Overige
    13125, # D_Algemene chirurgie_Orthopedische chirurgie
    16646, # DMC_Algemene chirurgie_Orthopedische chirurgie
    13122, # D_Algemene chirurgie_Transplantatie chirurgie
    16647, # DMC_Algemene chirurgie_Transplantatie chirurgie
    13124, # D_Algemene chirurgie_Trauma
    16648, # DMC_Algemene chirurgie_Trauma
    13126, # D_Algemene chirurgie_Urogenitaal
    16649, # DMC_Algemene chirurgie_Urogenitaal
    13120, # D_Algemene chirurgie_Vaatchirurgie
    16650, # DMC_Algemene chirurgie_Vaatchirurgie
    13128, # D_Neurochirurgie _Vasculair chirurgisch
    16661, # DMC_Neurochirurgie _Vasculair chirurgisch
    13129, # D_Neurochirurgie _Tumor chirurgie
    16660, # DMC_Neurochirurgie _Tumor chirurgie
    13130, # D_Neurochirurgie_Overige
    16662, # DMC_Neurochirurgie_Overige
    18596, # Apache II Operatief  Gastr-intenstinaal
    18597, # Apache II Operatief Cardiovasculair
    18598, # Apache II Operatief Hematologisch
    18599, # Apache II Operatief Metabolisme
    18600, # Apache II Operatief Neurologisch
    18601, # Apache II Operatief Renaal
    18602, # Apache II Operatief Respiratoir
    17008, # APACHEIV Post-operative cardiovascular
    17009, # APACHEIV Post-operative gastro-intestinal
    17010, # APACHEIV Post-operative genitourinary
    17011, # APACHEIV Post-operative hematology
    17012, # APACHEIV Post-operative metabolic
    17013, # APACHEIV Post-operative musculoskeletal /skin
    17014, # APACHEIV Post-operative neurologic
    17015, # APACHEIV Post-operative respiratory
    17016, # APACHEIV Post-operative transplant
    17017  # APACHEIV Post-operative trauma
]

SEPSIS_ANTIBIOTICS = [
    6834, # Amikacine (Amukin)
    6847, # Amoxicilline (Clamoxyl/Flemoxin)
    6871, # Benzylpenicilline (Penicilline)
    6917, # Ceftazidim (Fortum)
    # 6919, # Cefotaxim (Claforan) -> prophylaxis
    6948, # Ciprofloxacine (Ciproxin)
    6953, # Rifampicine (Rifadin)
    6958, # Clindamycine (Dalacin)
    7044, # Tobramycine (Obracin)
    # 7064, # Vancomycine -> prophylaxis for valve surgery
    7123, # Imipenem (Tienam)
    7185, # Doxycycline (Vibramycine)
    # 7187, # Metronidazol (Flagyl) -> often used for GI surgical prophylaxis
    # 7208, # Erythromycine (Erythrocine) -> often used for gastroparesis
    7227, # Flucloxacilline (Stafoxil/Floxapen)
    7231, # Fluconazol (Diflucan)
    7232, # Ganciclovir (Cymevene)
    7233, # Flucytosine (Ancotil)
    7235, # Gentamicine (Garamycin)
    7243, # Foscarnet trinatrium (Foscavir)
    7450, # Amfotericine B (Fungizone)
    # 7504, # X nader te bepalen # non-stock medication
    8127, # Meropenem (Meronem)
    8229, # Myambutol (ethambutol)
    8374, # Kinine dihydrocloride
    # 8375, # Immunoglobuline (Nanogam) -> not antibiotic
    # 8394, # Co-Trimoxazol (Bactrimel) -> often prophylactic (unless high dose)
    8547, # Voriconazol(VFEND)
    # 9029, # Amoxicilline/Clavulaanzuur (Augmentin) -> often used for ENT surgical prophylaxis
    9030, # Aztreonam (Azactam)
    9047, # Chlooramfenicol
    # 9075, # Fusidinezuur (Fucidin) -> prophylaxis
    9128, # Piperacilline (Pipcil)
    9133, # Ceftriaxon (Rocephin)
    # 9151, # Cefuroxim (Zinacef) -> often used for GI/transplant surgical prophylaxis
    # 9152, # Cefazoline (Kefzol) -> prophylaxis for cardiac surgery
    9458, # Caspofungine
    9542, # Itraconazol (Trisporal)
    # 9602, # Tetanusimmunoglobuline -> prophylaxis/not antibiotic
    12398, # Levofloxacine (Tavanic)
    12772, # Amfotericine B lipidencomplex  (Abelcet)
    15739, # Ecalta (Anidulafungine)
    16367, # Research Anidulafungin/placebo
    16368, # Research Caspofungin/placebo
    18675, # Amfotericine B in liposomen (Ambisome )
    19137, # Linezolid (Zyvoxid)
    19764, # Tigecycline (Tygacil)
    19773, # Daptomycine (Cubicin)
    20175  # Colistine
]

# Other antibiotics may be administered during sepsis treatment
# (but a culture must also be drawn!)
OTHER_ANTIBIOTICS = [
    7064, # Vancomycine -> prophylaxis for valve surgery
    7187, # Metronidazol (Flagyl) -> often used for GI surgical prophylaxis
    8394, # Co-Trimoxazol (Bactrimel) -> often prophylactic (unless high dose)
    9029, # Amoxicilline/Clavulaanzuur (Augmentin) -> often used for ENT surgical prophylaxis
    9151, # Cefuroxim (Zinacef) -> often used for GI surgical prophylaxis
    9152  # Cefazoline (Kefzol) -> prophylaxis
]

# Culture tests
CULTURE_TESTS = [
    9189, # Bloedkweken afnemen
    9190, # Cathetertipkweek afnemen
    9198, # Rectumkweek afnemen -> often used routinely
    9200, # Wondkweek afnemen
    9202, # Ascitesvochtkweek afnemen
    9205 # Legionella sneltest (urine)
]

# surgical admissions vs non-surgical admissions
re_sepsis_surg = r'sepsis|pneumoni|GI perforation|perforation/rupture|infection|abscess|GI Vascular ' \
                 r'ischemia|diverticular|appendectomy|peritonitis'

re_sepsis_med = r'sepsis|septic|infect|pneumoni|cholangitis|pancr|endocarditis|meningitis|GI ' \
                r'perforation|abces|abscess|darm ischaemie|GI vascular|fasciitis' + \
                r'|inflammatory|peritonitis'

APACHE = [18669, 18671]
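
To show how such a pattern is applied to diagnosis strings, the sketch below uses a shortened pattern and made-up diagnosis values; `toy_pattern` and `toy_diagnoses` are illustrative names, not part of the dataset. Missing values are treated as non-matches via `na=False`, and matching is case-insensitive.

```python
import re
import pandas as pd

# Shortened, illustrative pattern (the real patterns above contain more alternatives)
toy_pattern = r'sepsis|pneumoni|abscess|peritonitis'

# Made-up diagnosis strings, including a missing value
toy_diagnoses = pd.Series(['Sepsis, pulmonary', 'CABG', 'Bacterial pneumonia', None, 'Peritonitis'])

# Case-insensitive substring match; missing diagnoses count as "no match"
matches = toy_diagnoses.str.contains(toy_pattern, na=False, flags=re.IGNORECASE)

print(matches.tolist())   # [True, False, True, False, True]
```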

#### 1. Were antibiotics administered during the first 24h of admission?

In [6]:
antibiotics_df = []
for i, df in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/drugitems.csv", usecols=['admissionid', 'itemid', 'start'])):
    # were antibiotics administered within 24 hours?
    # Remark: reason_for_admission.sql assumed admission to be the first 
    df['sepsis_antibiotics_bool'] = df.itemid.isin(SEPSIS_ANTIBIOTICS) & ((df.start - start_of_admission(df.admissionid)) < 24*60*60*1000)  # first 24 hours (in ms)
    df['other_antibiotics_bool'] = df.itemid.isin(OTHER_ANTIBIOTICS) & ((df.start - start_of_admission(df.admissionid)) < 24*60*60*1000)    # Note to self: Check also for culture!
    
    # drop rows without antibiotics to save memory :)
    df = df.loc[df.sepsis_antibiotics_bool | df.other_antibiotics_bool, :]
    antibiotics_df.append(df)
    
    if i > MAX_CHUNKS:
        break

antibiotics_df = pd.concat(antibiotics_df, axis=0)

251it [00:31,  7.99it/s]
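
As a small worked example of the 24-hour window, the sketch below flags made-up drug records that are both a sepsis-related antibiotic and started less than 24 hours after admission, and then collects the matching admission ids. The frame `drugs`, the dictionary `toy_admitted` and the short list `toy_antibiotics` are assumptions for illustration.

```python
import pandas as pd

DAY_MS = 24 * 60 * 60 * 1000          # 24 hours in milliseconds

toy_antibiotics = [6834, 8127]        # small subset of item ids, for illustration
toy_admitted = {1: 0, 2: 1_000_000}   # made-up admission times in ms

drugs = pd.DataFrame({
    'admissionid': [1, 1, 2, 2],
    'itemid': [6834, 8127, 6834, 9999],
    'start': [3_600_000, DAY_MS + 1, 1_000_000 + 7_200_000, 1_500_000],
})

# time between admission and start of the drug, aligned row by row
since = drugs.start - drugs.admissionid.map(toy_admitted)
drugs['sepsis_antibiotics_bool'] = drugs.itemid.isin(toy_antibiotics) & (since < DAY_MS)

flagged_admissions = set(drugs.admissionid[drugs.sepsis_antibiotics_bool])

print(drugs.sepsis_antibiotics_bool.tolist())   # [True, False, True, False]
print(sorted(flagged_admissions))               # [1, 2]
```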


#### 2. Was a positive blood culture found within the first 6h?

In [None]:
culture_df = []
for i, df in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/procedureorderitems.csv", usecols=['admissionid', 'itemid', 'registeredat'])):
    # culture found?
    df['positive_blood_culture'] = df.itemid.isin(CULTURE_TESTS) & ((df.registeredat - start_of_admission(df.admissionid)) < 6*60*60*1000)  # first 6 hours (in ms)
    
    # drop rows with all False to save memory :)
    df = df.loc[df.positive_blood_culture, :]
    culture_df.append(df)
    
    if i > MAX_CHUNKS:
        break

culture_df = pd.concat(culture_df, axis=0)

178it [00:13, 13.07it/s]

#### 3. Was there an infection/sepsis-related diagnosis?

In [None]:
# Based on code base from the AmsterdamUMCdb repository
# see: https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/amsterdamumcdb/sql/diagnosis/reason_for_admission.sql

def get_diagnosis(df):
    # drop prefixes from diagnosis name if APACHE II/IV
    mask = df.itemid.isin(APACHE)
    df.loc[mask, 'value'] = df[mask].value.str.split(' - ').str[1]
    df['diagnosis'] = df.value

def is_surgical(df):
    df['surgical_adm'] = df.itemid.isin(IS_SURGICAL) | ((df.itemid == 18669) & ((df.valueid > 0) & (df.valueid <= 27))) | ((df.itemid == 18671) & ((df.valueid > 221) & (df.valueid <= 452)))

def sepsis_at_admission(df):
    # itemid 15808 = 'opname sepsis' (sepsis at admission); valueid 1 = sepsis, 2 = explicitly no sepsis (dropped later)
    df['known_no_sepsis'] = (df.itemid == 15808) & (df.valueid == 2)
    df['sepsis_on_admission'] = (df.itemid == 15808) & (df.valueid == 1)

In [None]:
diagnoses_df = []
for i, df in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/listitems.csv", usecols=['admissionid', 'itemid', 'valueid', 'value', 'updatedat'])):   
    # was sepsis explicitly diagnosed?
    sepsis_at_admission(df) 
    
    # diagnosis related to infection/sepsis?
    get_diagnosis(df)
    is_surgical(df)
    surg_diag = df.surgical_adm & df.diagnosis.str.contains(re_sepsis_surg, na=False, flags=re.IGNORECASE)  # surgical admission with sepsis
    med_diag = ~df.surgical_adm & df.diagnosis.str.contains(re_sepsis_med, na=False, flags=re.IGNORECASE)   # medical admission with sepsis
    df['suspected_infection'] = surg_diag | med_diag
    
    # drop rows with all False to save memory :)
    df = df.loc[df.suspected_infection | df.known_no_sepsis | df.sepsis_on_admission, :]
    diagnoses_df.append(df)
    
    if i > MAX_CHUNKS:
        break

diagnoses_df = pd.concat(diagnoses_df, axis=0)

#### Get admissions with suspected infections

In [None]:
# admission ids with antibiotics
sepsis_antibiotics = set(antibiotics_df.admissionid[antibiotics_df.sepsis_antibiotics_bool].unique())
other_antibiotics = set(antibiotics_df.admissionid[antibiotics_df.other_antibiotics_bool].unique())

# admission ids with positive blood cultures
pos_blood_culture = set(culture_df.admissionid)

# admission ids with explicit diagnosis or suspicion of sepsis
sepsis_on_admission = set(diagnoses_df.admissionid[diagnoses_df.sepsis_on_admission].unique())
susp_infection = set(diagnoses_df.admissionid[diagnoses_df.suspected_infection].unique())
known_no_sepsis = set(diagnoses_df.admissionid[diagnoses_df.known_no_sepsis].unique())

# see: https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/amsterdamumcdb/cohorts.py
# suspicion of infection is defined as either a known diagnosis of a major infection after/during a medical/surgical 
# admission OR sepsis-related antibiotics being administered OR other antibiotics being administered together with a blood culture
adm_susp_infection = (susp_infection | sepsis_on_admission | sepsis_antibiotics | (other_antibiotics & pos_blood_culture)) - known_no_sepsis
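
The set expression above can be checked by hand on a few made-up admission ids. In the sketch below all `toy_*` sets are invented for illustration: admission 4 received sepsis-related antibiotics but is explicitly marked as not septic, and admission 6 is the only one with both another antibiotic and a culture.

```python
# Made-up admission ids for each criterion
toy_susp_infection = {1, 2}
toy_sepsis_on_admission = {3}
toy_sepsis_antibiotics = {2, 4}
toy_other_antibiotics = {5, 6}
toy_pos_blood_culture = {6, 7}
toy_known_no_sepsis = {4}

# union of the criteria, where "other" antibiotics only count together with a culture,
# minus admissions explicitly marked as not septic
toy_adm_susp_infection = (
    toy_susp_infection
    | toy_sepsis_on_admission
    | toy_sepsis_antibiotics
    | (toy_other_antibiotics & toy_pos_blood_culture)
) - toy_known_no_sepsis

print(sorted(toy_adm_susp_infection))   # [1, 2, 3, 6]
```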

In [None]:
print('Total admissions:', len(time_of_admission))
print('With suspected infection:', len(adm_susp_infection))

#### Estimate infection time

We define the suspected infection time as the earliest of the following events: administration of antibiotics, diagnosis of sepsis, or a positive blood culture:

In [None]:
susp_infection_time_poe = {}
for adm_id in tqdm(adm_susp_infection):
    # determine earliest time where infection was suspected 
    # i.e. earliest time of antibiotics OR culture found OR diagnosis
    times = antibiotics_df.start[antibiotics_df.admissionid == adm_id].tolist()  # empty if no antibiotics received
    times += culture_df.registeredat[culture_df.admissionid == adm_id].tolist()  # empty if no positive culture found
    times += diagnoses_df.updatedat[diagnoses_df.admissionid == adm_id].tolist() # empty if no explicit documented diagnosis
    susp_infection_time_poe[adm_id] = min(times)
    
susp_infection_time_poe[8216]
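
Looping over admissions and filtering three frames per admission can be slow on the full dataset. One possible alternative, sketched here on made-up data, stacks the three event tables into a single frame and takes the minimum time per admission with `groupby`. The `toy_*` frames and the common column name `time` are assumptions for illustration, not the notebook's actual variables.

```python
import pandas as pd

# Made-up event tables mirroring the columns used above
toy_antibiotics_df = pd.DataFrame({'admissionid': [1, 1, 2], 'start': [500, 200, 900]})
toy_culture_df = pd.DataFrame({'admissionid': [2], 'registeredat': [700]})
toy_diagnoses_df = pd.DataFrame({'admissionid': [1, 3], 'updatedat': [300, 50]})

# Give every table the same time column and stack them
events = pd.concat([
    toy_antibiotics_df.rename(columns={'start': 'time'}),
    toy_culture_df.rename(columns={'registeredat': 'time'}),
    toy_diagnoses_df.rename(columns={'updatedat': 'time'}),
])[['admissionid', 'time']]

# Earliest event per admission
earliest = events.groupby('admissionid')['time'].min().to_dict()

print(earliest)   # {1: 200, 2: 700, 3: 50}
```

In the notebook itself the result would still need to be restricted to the admissions in `adm_susp_infection`.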

### qSOFA $\geq2$

TODO: SOFA (instead of qSOFA)
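
For reference, qSOFA assigns one point each for a respiratory rate of at least 22/min, a systolic blood pressure of at most 100 mmHg, and altered mentation (GCS below 15). A minimal sketch on made-up values follows; the column names `resp_rate`, `sys_bp` and `gcs` are hypothetical and do not correspond to AmsterdamUMCdb fields.

```python
import pandas as pd

# Made-up vital signs, one row per patient
toy_vitals = pd.DataFrame({
    'resp_rate': [24, 18, 30],   # breaths per minute
    'sys_bp': [95, 120, 110],    # systolic blood pressure in mmHg
    'gcs': [15, 15, 13],         # total Glasgow Coma Scale
})

# One point per criterion met
toy_vitals['qsofa'] = (
    (toy_vitals.resp_rate >= 22).astype(int)
    + (toy_vitals.sys_bp <= 100).astype(int)
    + (toy_vitals.gcs < 15).astype(int)
)
toy_vitals['qsofa_positive'] = toy_vitals.qsofa >= 2

print(toy_vitals.qsofa.tolist())            # [2, 0, 2]
print(toy_vitals.qsofa_positive.tolist())   # [True, False, True]
```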

#### Definitions

In [12]:
def gcs_eye_score(df):
    df['eyes_score'] = np.nan
    df.loc[df.itemid == 6732, 'eyes_score'] = 5 - df.valueid   # --Actief openen van de ogen
    df.loc[df.itemid == 13077, 'eyes_score'] = df.valueid      # --A_Eye
    df.loc[df.itemid == 14470, 'eyes_score'] = df.valueid - 4  # --RA_Eye
    df.loc[df.itemid == 16628, 'eyes_score'] = df.valueid - 4  # --MCA_Eye
    df.loc[df.itemid == 19635, 'eyes_score'] = df.valueid - 4  # --E_EMV_NICE_24uur
    df.loc[df.itemid == 19638, 'eyes_score'] = df.valueid - 8  # --E_EMV_NICE_Opname

def gcs_motor_score(df):
    df['motor_score'] = np.nan
    df.loc[df.itemid == 6734, 'motor_score'] = 7 - df.valueid   # --Beste motore reactie van de armen
    df.loc[df.itemid == 13072, 'motor_score'] = df.valueid      # --A_Motoriek
    df.loc[df.itemid == 14476, 'motor_score'] = df.valueid - 6  # --RA_Motoriek
    df.loc[df.itemid == 16634, 'motor_score'] = df.valueid - 6  # --MCA_Motoriek
    df.loc[df.itemid == 19636, 'motor_score'] = df.valueid - 6  # --M_EMV_NICE_24uur
    df.loc[df.itemid == 19639, 'motor_score'] = df.valueid - 12 # --M_EMV_NICE_Opname

def gcs_verbal_score(df):
    df['verbal_score'] = np.nan
    df.loc[df.itemid == 6735, 'verbal_score'] = 6 - df.valueid   # --Beste verbale reactie
    df.loc[df.itemid == 13066, 'verbal_score'] = df.valueid      # --A_Verbal
    df.loc[df.itemid == 14482, 'verbal_score'] = df.valueid - 5  # --RA_Verbal
    df.loc[df.itemid == 16640, 'verbal_score'] = df.valueid - 5  # --MCA_Verbal
    df.loc[df.itemid == 19637, 'verbal_score'] = df.valueid - 9  # --V_EMV_NICE_24uur
    df.loc[df.itemid == 19640, 'verbal_score'] = df.valueid - 15 # --V_EMV_NICE_Opname
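
The total Glasgow Coma Scale is the sum of the eye (1 to 4), motor (1 to 6) and verbal (1 to 5) components, giving a range of 3 to 15. The sketch below assumes the three component scores have already been aligned onto the same row (for example per admission and measurement time), which the functions above do not do by themselves; `toy_scores` holds made-up values.

```python
import numpy as np
import pandas as pd

# Made-up component scores, one row per (already aligned) measurement
toy_scores = pd.DataFrame({
    'eyes_score': [4, 1, np.nan],
    'motor_score': [6, 1, 5],
    'verbal_score': [5, 1, 4],
})

# Sum the components; skipna=False leaves the total missing if any component is missing
components = ['eyes_score', 'motor_score', 'verbal_score']
toy_scores['gcs'] = toy_scores[components].sum(axis=1, skipna=False)

print(toy_scores.gcs.tolist())   # [15.0, 3.0, nan]
```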

#### Glasgow Coma Scale

In [None]:
gcs_df = []
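for i, chunk in tqdm(read_csv(r"D:/AmsterdamUMCdb-v1.0.2/listitems.csv", usecols=['admissionid', 'itemid', 'valueid', 'value', 'measuredat'])): 
    # keep only measurements taken within the first 24 hours of admission (in ms)
    time_since_admission = chunk.measuredat - chunk.admissionid.map(time_of_admission)
    chunk = chunk.loc[(time_since_admission >= 0) & (time_since_admission <= 1000*60*60*24)].copy()
    
    gcs_eye_score(chunk)
    gcs_motor_score(chunk)
    gcs_verbal_score(chunk)

    # keep only rows that carry at least one GCS component to save memory
    gcs_df.append(chunk.dropna(subset=['eyes_score', 'motor_score', 'verbal_score'], how='all'))

    if i > MAX_CHUNKS:
        break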

### Putting it together