# Preprocess MIMIC-III

This notebook will preprocess MIMIC-III for use with [AllenNLP-multi-label](https://github.com/semantic-health/allennlp-multi-label) and [Deep Patient Cohorts](https://github.com/semantic-health/deep-patient-cohorts).

In [1]:
import random
import numpy as np

RANDOM_STATE = 42
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)

## Load the data

First, download the `NOTEEVENTS.csv` and `DIAGNOSES_ICD.csv` files from https://physionet.org/content/mimiciii, then set their filepaths below.

In [2]:
noteevents_filepath = ""
diagnoses_icd_filepath = ""

From these files, we will create a table containing the text of each discharge summary and its corresponding ICD9 codes.

In [None]:
import pandas as pd
from deep_patient_cohorts.common.utils import reformat_icd_code

noteevents = pd.read_csv(
    noteevents_filepath,
    index_col="HADM_ID",
    usecols=["HADM_ID", "TEXT", "CATEGORY"],
    dtype={"HADM_ID": str, "CATEGORY": str},
    # Strip leading/trailing whitespace and collapse runs of spaces, newlines and tabs into single spaces.
    converters={"TEXT": lambda text: " ".join(str(text).strip().split())},
)
# Filter out anything that isn't a discharge summary
noteevents = noteevents[noteevents["CATEGORY"] == "Discharge summary"].drop("CATEGORY", axis=1)

diagnoses_icd = pd.read_csv(
    diagnoses_icd_filepath,
    index_col="HADM_ID",
    usecols=["HADM_ID", "ICD9_CODE"],
    dtype={"HADM_ID": str},
    # MIMIC-III ICD9 codes need to be formatted
    converters={"ICD9_CODE": reformat_icd_code}
)
# Convert long format data to wide-ish format data
diagnoses_icd = diagnoses_icd.groupby('HADM_ID')['ICD9_CODE'].apply(list)

df = noteevents.join(diagnoses_icd)
df
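The loading step above can be tried on toy data before pointing it at the real files. This is only a sketch: the admission IDs, note text and codes below are made up, and the names `toy_notes`, `toy_codes`, `toy_wide` and `toy_df` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for NOTEEVENTS.csv and DIAGNOSES_ICD.csv (all values are made up).
toy_notes = pd.DataFrame(
    {"HADM_ID": ["100", "101"], "TEXT": ["  first\n\tnote ", "second   note"]}
).set_index("HADM_ID")
# Same normalization as the TEXT converter: collapse whitespace runs to one space.
toy_notes["TEXT"] = toy_notes["TEXT"].map(lambda text: " ".join(str(text).strip().split()))

toy_codes = pd.DataFrame(
    {"HADM_ID": ["100", "100", "101"], "ICD9_CODE": ["401.9", "428.0", "250.00"]}
).set_index("HADM_ID")

# Long to wide-ish: one row per admission, holding the list of all its codes.
toy_wide = toy_codes.groupby("HADM_ID")["ICD9_CODE"].apply(list)

# Left join on the shared HADM_ID index.
toy_df = toy_notes.join(toy_wide)
print(toy_df)
```

Because the join is a left join, a note whose admission has no diagnosis rows would end up with a missing value rather than an empty list.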

### Map ICD9 codes to SNOMED CT Identifiers

Then, download the [ICD-9-CM Diagnostic Codes to SNOMED CT Map](https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html) and set the paths below to its "1-to-1" and "1-to-many" text files.

In [None]:
icd9cm_to_snowmed_1to1_filepath = ""
icd9cm_to_snowmed_1tom_filepath = ""

We will use these files to create a mapping from ICD9 codes to their corresponding SNOMED CIDs.

In [None]:
import itertools
from pprint import pprint

snowmed_1to1 = pd.read_csv(
    icd9cm_to_snowmed_1to1_filepath,
    sep="\t",
    usecols=["ICD_CODE", "SNOMED_CID"],
    dtype={"ICD_CODE": str, "SNOMED_CID": str}
).dropna().groupby('ICD_CODE')['SNOMED_CID'].apply(list)

snowmed_1tom = pd.read_csv(
    icd9cm_to_snowmed_1tom_filepath,
    sep="\t",
    usecols=["ICD_CODE", "SNOMED_CID"],
    dtype={"ICD_CODE": str, "SNOMED_CID": str}
).dropna().groupby('ICD_CODE')['SNOMED_CID'].apply(list)

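# Merge the two maps into one dictionary of ICD9 code -> list of SNOMED CIDs.
# Note: `list.extend` takes the list itself; `**value` would raise a TypeError.
icd_to_snowmed_map = snowmed_1to1.to_dict()
for key, value in snowmed_1tom.items():
    icd_to_snowmed_map.setdefault(key, []).extend(value)
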
pprint(dict(itertools.islice(icd_to_snowmed_map.items(), 10)))
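The merge of the "1-to-1" and "1-to-many" maps can be sketched on plain dictionaries. The codes and CIDs below are placeholders rather than real mappings, and `toy_1to1`, `toy_1tom` and `merged` are hypothetical names.

```python
# Placeholder maps: ICD9 code -> list of SNOMED CIDs (values are not real mappings).
toy_1to1 = {"401.9": ["111"], "250.00": ["222"]}
toy_1tom = {"401.9": ["333"], "428.0": ["444", "555"]}

# Copy the 1-to-1 lists so the inputs are left untouched.
merged = {code: list(cids) for code, cids in toy_1to1.items()}

# Append the 1-to-many CIDs, creating an entry when the code is new.
for code, cids in toy_1tom.items():
    merged.setdefault(code, []).extend(cids)

print(merged)
```

A code present in both files ends up with the CIDs from both; a code present in only one keeps its list as-is.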

In [None]:
from typing import List

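def icd9_to_snowmed(icd9_codes: List[str]) -> List[str]:
    # Collect the SNOMED CIDs of every mappable code, then drop duplicates.
    mapped = (icd_to_snowmed_map[code] for code in icd9_codes if code in icd_to_snowmed_map)
    return list(set(itertools.chain.from_iterable(mapped)))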

df['SNOMED_CID'] = df['ICD9_CODE'].map(icd9_to_snowmed)
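As a small illustration of the lookup above, the sketch below maps a list of codes through a placeholder dictionary. It differs from the notebook's function in one respect, which is an assumption about what is desirable: it sorts the result, since `list(set(...))` does not guarantee a stable order between runs. `toy_map` and `to_cids` are hypothetical names.

```python
import itertools
from typing import Dict, List

# Placeholder map: ICD9 code -> list of SNOMED CIDs (values are not real mappings).
toy_map: Dict[str, List[str]] = {"401.9": ["111", "333"], "428.0": ["333", "444"]}


def to_cids(icd9_codes: List[str], mapping: Dict[str, List[str]]) -> List[str]:
    # Skip codes with no mapping, flatten the per-code lists, deduplicate, sort.
    hits = (mapping[code] for code in icd9_codes if code in mapping)
    return sorted(set(itertools.chain.from_iterable(hits)))


result = to_cids(["401.9", "428.0", "999.9"], toy_map)
print(result)
```

The unmapped code `"999.9"` is silently dropped, which matches the behaviour of the function above.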

### Roll up the SNOMED CT Identifiers

Next, download a SNOMED CT release and set the filepath below to its "Relationship_Full" text file.

In [None]:
relationships_filepath = ""

In [None]:
from pathlib import Path

IS_A_CID = "116680003"

snowmed_relationships = pd.read_csv(
    relationships_filepath,
    sep="\t",
    usecols=["sourceId", "destinationId", "typeId"],
    dtype={"sourceId": str, "destinationId": str, "typeId": str},
)

snowmed_relationships = snowmed_relationships[snowmed_relationships["typeId"] == IS_A_CID]
snowmed_relationships
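One caveat worth checking: a "Full" release file keeps the history of each relationship, so the same relationship can appear in several rows, some of them inactive. The sketch below assumes the standard RF2 columns `id`, `effectiveTime` and `active` (which the cell above does not load) and keeps only the most recent, still-active is-a rows. The toy rows and the names `toy_rels`, `latest` and `current_is_a` are hypothetical.

```python
import pandas as pd

IS_A = "116680003"

# Toy relationship rows; r1 was created in 2002 and inactivated in 2010.
toy_rels = pd.DataFrame(
    {
        "id": ["r1", "r1", "r2", "r3"],
        "effectiveTime": ["20020131", "20100131", "20020131", "20020131"],
        "active": ["1", "0", "1", "1"],
        "sourceId": ["B", "B", "C", "D"],
        "destinationId": ["A", "A", "A", "A"],
        "typeId": [IS_A, IS_A, IS_A, "other"],
    }
)

# Keep the most recent version of each relationship (YYYYMMDD strings sort chronologically).
latest = toy_rels.sort_values("effectiveTime").groupby("id").tail(1)

# Keep rows that are still active and of type "is a".
current_is_a = latest[(latest["active"] == "1") & (latest["typeId"] == IS_A)]
print(current_is_a[["sourceId", "destinationId"]])
```

Whether this filtering changes the rollup in practice depends on the release used; a "Snapshot" relationship file, if available, already contains only the latest version of each row.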

This is a hack to get the IDs of the direct children of the Clinical Finding > Disease concept in SNOMED CT.

In [None]:
# This was copy/pasted from: https://uts.nlm.nih.gov/snomedctBrowser.html;jsessionid=59F3153C7129757C034B75CE3340662C#A2880798;1;0;AUI;undefined;SNOMEDCT_US;undefined;false;
# A much more desirable solution would be to automatically determine children from the SNOMED release files.
disease_concept_children = """
AIDS-associated disorder [
	420721002	
]
Abscess [
	128477000	
]
Acute disease [
	2704003	
]
Allosensitization [
	702642003	
]
Angioedema and/or urticaria [
	404177007	
]
Biphasic disease [
	409701001	
]
Chronic disease [
	27624003	
]
Communication disorder [
	278919001	
]
Complication [
	116223007	
]
Congenital disease [
	66091009	
]
Cyst [
	441457006	
]
Degenerative disorder [
	362975008	
]
Developmental disorder [
	5294002	
]
Disease due to Annelida [
	105714009	
]
Disease due to Arthropod [
	68843000	
]
Disease of presumed infectious origin [
	78885002	
]
Disorder associated with menstruation AND/OR menopause [
	106002000	
]
Disorder by body site [
	123946008	
]
Disorder caused by synthetic cathinone [
	762503004	
]
Disorder characterized by edema [
	118654009	
]
Disorder characterized by fever [
	416113008	
]
Disorder characterized by pain [
	373673007	
]
Disorder due to infection [
	40733004	
]
Disorder due to medically assisted reproduction [
	724462007	
]
Disorder in remission [
	765205004	
]
Disorder of cellular component of blood [
	414022008	
]
Disorder of fetus or newborn [
	414025005	
]
Disorder of hematopoietic morphology [
	414026006	
]
Disorder of hemostatic system [
	362970003	
]
Disorder of immune function [
	414029004	
]
Disorder of labor / delivery [
	362972006	
]
Disorder of pigmentation [
	414032001	
]
Disorder of pregnancy [
	173300003	
]
Disorder of puerperium [
	362973001	
]
Disorder related to transplantation [
	429054002	
]
Drug-related disorder [
	87858002	
]
Effect of foreign body [
	105620001	
]
Endometriosis (clinical) [
	129103003	
]
Endosalpingiosis [
	55850004	
]
Endotoxicosis [
	370515002	
]
Enterotoxemia [
	370514003	
]
Environment related disease [
	8504008	
]
Epigenetic disorder [
	761868000	
]
Exsanguination [
	48149007	
]
Familial disease [
	111941005	
]
Fibromatosis [
	723976005	
]
Fistula [
	428794004	
]
Foreign body [
	125670008	
]
Functional disorder [
	386585008	
]
Functional quadriplegia [
	123391000119106	
]
Genetic disease [
	782964007	
]
Hematoma [
	385494008	
]
Hyperproteinemia [
	37064009	
]
Hypertensive complication [
	449759005	
]
Hyperviscosity syndrome [
	11888009	
]
Idiopathic disease [
	41969006	
]
Inflammatory disorder [
	128139000	
]
Ingestion of toxic substance [
	32941000119104	
]
Late onset cerebellar ataxia [
	230232005	
]
Lipomatosis [
	402693001	
]
Maltreatment syndromes [
	213015009	
]
Mental disorder [
	74732009	
]
Metabolic disease [
	75934005	
]
Morning vomiting [
	12860001000004105	
]
Neoplasm and/or hamartoma [
	399981008	
]
Nutritional deficiency associated condition [
	363246002	
]
Nutritional disorder [
	2492009	
]
Obesity [
	414916001	
]
Obesity associated disorder [
	363247006	
]
Paraneoplastic syndrome [
	49783001	
]
Poisoning [
	75478009	
]
Polyp [
	441456002	
]
Post intensive care unit syndrome [
	456211000124100	
]
Post vaccination fever [
	123471000119103	
]
Post-infectious disorder [
	123976001	
]
Proteus syndrome [
	23150001	
]
Radiation-induced disorder [
	85983004	
]
Scar [
	275322007	
]
Self-induced disease [
	77434001	
]
Sequela [
	362977000	
]
Seroma [
	715068009	
]
Sexual disorder [
	231532002	
]
Sleep disorder [
	39898005	
]
Spontaneous hemorrhage [
	405538007	
]
Subacute disease [
	19342008	
]
Substance abuse [
	66214007	
]
Systemic disease [
	56019007	
]
Traumatic AND/OR non-traumatic injury [
	417163006	
]
Traumatic hemorrhage [
	110149000	
]
Travel-related illness [
	459771000124107	
]
Ulcer [
	429040005	
]
Vertiginous syndrome [
	87118001	
]"""

In [None]:
disease_ids = [line.split("\t")[1] for line in disease_concept_children.split("\n") if line and len(line.split("\t")) == 3]

rollup_map = {}
for cid in disease_ids:
    children = snowmed_relationships[snowmed_relationships["destinationId"] == cid]["sourceId"].unique().tolist()
    rollup_map.update(dict.fromkeys(children, cid))
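The comment in the copy/pasted cell notes that deriving the children from the release files would be preferable. A minimal sketch of that idea follows, using toy child-to-parent pairs in place of the filtered `sourceId`/`destinationId` columns. Unlike the cell above, which maps only the direct children of each category, this sketch walks all descendants; that is an alternative, not the notebook's method. All IDs and the names `toy_is_a`, `children`, `descendants` and `toy_rollup` are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

# (sourceId, destinationId) is-a pairs, i.e. (child, parent). Toy IDs only.
toy_is_a: List[Tuple[str, str]] = [
    ("cat1", "disease"), ("cat2", "disease"),
    ("x", "cat1"), ("y", "x"), ("z", "cat2"),
]

# Index the pairs as parent -> list of direct children.
children: Dict[str, List[str]] = defaultdict(list)
for child, parent in toy_is_a:
    children[parent].append(child)


def descendants(root: str) -> List[str]:
    # Breadth-first walk over the is-a hierarchy, excluding the root itself.
    seen, queue = set(), deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Direct children of the root replace the hand-copied list of categories.
top_level = sorted(children["disease"])

# Map every descendant to the top-level category it falls under.
toy_rollup = {cid: top for top in top_level for cid in descendants(top)}
print(toy_rollup)
```

SNOMED CT allows a concept to have several parents, so a descendant may sit under more than one category; a plain dictionary keeps only the last one assigned, as `rollup_map.update` does above.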

Finally, we will roll up all the SNOMED CIDs to the level of Clinical Finding > Disease.

In [None]:
def snowmed_cid_rollup(cids: List[str]):
    return [rollup_map[cid] for cid in cids if cid in rollup_map]

df['SNOMED_DISEASE_CID'] = df['SNOMED_CID'].map(snowmed_cid_rollup)
df

Below, we can check the percentage of notes that were mapped to at least one SNOMED Disease CID.

> Note: 100% of notes are labelled with one or more ICD9 codes.

In [None]:
print(f"{sum(bool(labels) for labels in df['SNOMED_DISEASE_CID']) / len(df) * 100:.2f}%")
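The same coverage check can be written with pandas operations. This is a sketch on a made-up label column; `toy_labels` and `fraction` are hypothetical names.

```python
import pandas as pd

# Made-up label lists: two notes have at least one label, two have none.
toy_labels = pd.Series([["a"], [], ["b", "c"], []])

# An empty list is falsy, so the mean of the boolean column is the covered fraction.
fraction = toy_labels.map(bool).mean()
print(f"{fraction * 100:.2f}%")
```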

## Save the data

First, we partition the dataset into stratified train, validation and test splits. Then, we save it in a format that is suitable for use with [AllenNLP-multi-label](https://github.com/semantic-health/allennlp-multi-label) and [Deep Patient Cohorts](https://github.com/semantic-health/deep-patient-cohorts).

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.model_selection import iterative_train_test_split

text = df[["TEXT"]].values
# Binarize the multi-label outputs.
labels = df["SNOMED_DISEASE_CID"].values
lb = MultiLabelBinarizer()
labels = lb.fit_transform(labels)
print(f"There are {labels.shape[1]} unique SNOMED Disease CIDs.")

train_size = 0.7
valid_size = 0.1
test_size = 0.2

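# Two-stage split, see: https://datascience.stackexchange.com/a/53161
# Stage 1: hold out everything that is not training data (valid + test).
X_train, y_train, X_test, y_test = iterative_train_test_split(text, labels, test_size=1 - train_size)
# Stage 2: divide the held-out data into validation and test sets.
X_valid, y_valid, X_test, y_test = iterative_train_test_split(X_test, y_test, test_size=test_size/(test_size + valid_size))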

X_train = X_train.ravel()
X_valid = X_valid.ravel()
X_test = X_test.ravel()

# Save the data with the literal SNOMED CIDs as labels
y_train = lb.inverse_transform(y_train)
y_valid = lb.inverse_transform(y_valid)
y_test = lb.inverse_transform(y_test)
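The `test_size` arguments of the two-stage split can be checked with a little arithmetic. The sketch below only reproduces the fractions; it does not call the splitter, and `held_out`, `second_test`, `final_valid` and `final_test` are hypothetical names.

```python
train_size, valid_size, test_size = 0.7, 0.1, 0.2

# Stage 1: fraction of the whole dataset held out from training.
held_out = 1 - train_size

# Stage 2: share of the held-out data that goes to the test set.
second_test = test_size / (test_size + valid_size)

# Resulting fractions of the whole dataset.
final_valid = held_out * (1 - second_test)
final_test = held_out * second_test
print(round(final_valid, 3), round(final_test, 3))
```

With these settings the second split sends about two thirds of the held-out 30% to the test set, which recovers the intended 10% validation and 20% test proportions (approximately, since iterative stratification balances labels rather than exact counts).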

Finally, set the path of the directory where the data will be saved.

In [None]:
output_dir = ""

In [None]:
import json

output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)

# Write one JSON object per line: {"text": ..., "labels": [...]}
for split, X, y in (("train", X_train, y_train), ("valid", X_valid, y_valid), ("test", X_test, y_test)):
    with open(output_dir / f"{split}.jsonl", "w") as f:
        for t, l in zip(X, y):
            f.write(f"{json.dumps({'text': t, 'labels': l})}\n")
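To confirm that the saved files are readable, one option is to load a file back line by line. The sketch below round-trips two made-up rows through a temporary directory; the file name `toy.jsonl` and the names `rows` and `loaded` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Made-up rows in the same shape as the saved data.
rows = [{"text": "note a", "labels": ["111"]}, {"text": "note b", "labels": []}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.jsonl"
    # Write one JSON object per line.
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Read the file back, parsing each line independently.
    with open(path) as f:
        loaded = [json.loads(line) for line in f]

print(loaded == rows)
```

Note that label tuples, such as those returned by `inverse_transform`, are serialized as JSON arrays and come back as lists.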