# Preprocess MIMIC-III

This notebook will preprocess MIMIC-III for use with [AllenNLP-multi-label](https://github.com/semantic-health/allennlp-multi-label) and [Deep Patient Cohorts](https://github.com/semantic-health/deep-patient-cohorts).

In [None]:
import random
import numpy as np

RANDOM_STATE = 42
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)

## Load the data

First, download the `"NOTEEVENTS.csv"` and `"DIAGNOSES_ICD.csv"` files from https://physionet.org/content/mimiciii, then set their filepaths below.

In [None]:
noteevents_filepath = ""
diagnoses_icd_filepath = ""

From these files, we will create a table containing the text of each discharge summary and its corresponding ICD9 codes.

In [None]:
import pandas as pd
from deep_patient_cohorts.common.utils import reformat_icd_code

noteevents = pd.read_csv(
    noteevents_filepath,
    index_col="HADM_ID",
    usecols=["HADM_ID", "TEXT", "CATEGORY"],
    dtype={"HADM_ID": str, "CATEGORY": str},
    # Strip leading/trailing whitespace and collapse runs of whitespace, newlines and tabs into single spaces.
    converters={"TEXT": lambda text: " ".join(str(text).strip().split())},
)

diagnoses_icd = pd.read_csv(
    diagnoses_icd_filepath,
    index_col="HADM_ID",
    usecols=["HADM_ID", "ICD9_CODE"],
    dtype={"HADM_ID": str},
    # MIMIC-III ICD9 codes need to be formatted
    converters={"ICD9_CODE": reformat_icd_code}
)

# Postprocessing steps...
# ...filter out anything that isn't a discharge summary
noteevents = noteevents[noteevents["CATEGORY"] == "Discharge summary"].drop("CATEGORY", axis=1)
# ...convert long format data to wide-ish format data
noteevents = noteevents.groupby('HADM_ID')['TEXT'].apply(list).to_frame()
diagnoses_icd = diagnoses_icd.groupby('HADM_ID')['ICD9_CODE'].apply(list).to_frame()
# ...if there is more than one note per HADM_ID, take the longest
noteevents["TEXT"] = noteevents["TEXT"].map(lambda x: max(x, key=len))

df = noteevents.join(diagnoses_icd)
df
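
The long-to-wide step and the "keep the longest note" rule can be easier to follow on toy data. The following is a minimal plain-Python sketch of the same logic, not the pandas code itself; the admission IDs and note texts are made up for illustration.

```python
from collections import defaultdict

# Toy (HADM_ID, TEXT) rows in "long" format; the IDs and notes are made up.
rows = [
    ("100", "short note"),
    ("100", "a much longer discharge summary"),
    ("200", "only note"),
]

# Group notes by admission, mirroring groupby("HADM_ID")["TEXT"].apply(list).
notes_by_admission = defaultdict(list)
for hadm_id, text in rows:
    notes_by_admission[hadm_id].append(text)

# Keep the longest note per admission, mirroring max(x, key=len).
longest_note = {
    hadm_id: max(notes, key=len) for hadm_id, notes in notes_by_admission.items()
}
print(longest_note)
```

Each admission ends up with exactly one note, which is what allows the notes to be joined with the per-admission list of ICD9 codes.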

In [None]:
from itertools import chain
labels = list(chain.from_iterable(df["ICD9_CODE"]))

In [None]:
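from collections import Counter

# Count how often each ICD9 code occurs across all admissions.
counts = Counter(labels)
# Store the counts under their own name so the main `df` table is preserved.
label_counts = pd.DataFrame.from_dict(counts, orient='index')
# label_counts.plot(kind='bar')

# Number of unique ICD9 codes...
print(len(counts))
# ...and how many of them occur 10 times or fewer
print(len([value for value in counts.values() if value <= 10]))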

Next, we remove sections that aren't relevant for predicting the patient's diagnosis. Note that this may take some time.

In [None]:
from clinical_sectionizer import TextSectionizer
from tqdm import tqdm

blacklist = {
    'allergy',
    'family_history',
    'hiv_screening',
    'other',
    'education',
    'past_medical_history',
    'patient_instructions',
    'sexual_and_social_history',
    'signature',
    #'labs_and_studies'
}
sectionizer = TextSectionizer()
processed_text = []
for text in tqdm(df["TEXT"].tolist()):
    processed_text.append(
        "\n".join([section for (title, _, section) in sectionizer(text) if title not in blacklist])
    )
df["TEXT"] = processed_text
df
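
As a rough illustration of the filtering step, the sketch below applies the same comprehension to hand-written `(title, header, body)` tuples. These tuples only stand in for the sectionizer's output; the section titles and texts are assumptions chosen for the example, not real sectionizer results.

```python
# Hypothetical (title, header, body) tuples standing in for the sectionizer's output.
sections = [
    ("chief_complaint", "Chief Complaint:", "Chest pain."),
    ("family_history", "Family History:", "Father had MI."),
    ("hospital_course", "Hospital Course:", "Admitted to CCU."),
]
# A small, made-up set of section titles to drop.
excluded_titles = {"family_history", "signature"}

# Keep the body of every section whose title is not excluded.
kept = "\n".join(body for title, _, body in sections if title not in excluded_titles)
print(kept)
```

Only the section bodies are kept, so the section headers themselves do not appear in the processed text.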

### Map ICD9 codes to SNOMED CT Identifiers

Next, download the [ICD-9-CM Diagnostic Codes to SNOMED CT Map](https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html) and set the paths below to the `ICD9CM_SNOMED_MAP_1TO1_20****.txt` and `ICD9CM_SNOMED_MAP_1TOM_20****.txt` text files.

In [None]:
icd9cm_to_snowmed_1to1_filepath = ""
icd9cm_to_snowmed_1tom_filepath = ""

We will use these files to create a mapping from ICD9 codes to their corresponding SNOMED CIDs.

In [None]:
import itertools
from pprint import pprint

snowmed_1to1 = pd.read_csv(
    icd9cm_to_snowmed_1to1_filepath,
    sep="\t",
    usecols=["ICD_CODE", "SNOMED_CID"],
    dtype={"ICD_CODE": str, "SNOMED_CID": str}
).dropna().groupby('ICD_CODE')['SNOMED_CID'].apply(list)

snowmed_1tom = pd.read_csv(
    icd9cm_to_snowmed_1tom_filepath,
    sep="\t",
    usecols=["ICD_CODE", "SNOMED_CID"],
    dtype={"ICD_CODE": str, "SNOMED_CID": str}
).dropna().groupby('ICD_CODE')['SNOMED_CID'].apply(list)

# Merge the one-to-many mappings into the one-to-one mappings.
icd_to_snowmed_map = snowmed_1to1.to_dict()
for key, value in snowmed_1tom.items():
    icd_to_snowmed_map.setdefault(key, []).extend(value)

pprint(dict(itertools.islice(icd_to_snowmed_map.items(), 10)))

In [None]:
from typing import List

def icd9_to_snowmed(icd9_codes: List[str]):
    return list(set(itertools.chain.from_iterable(icd_to_snowmed_map[code] for code in icd9_codes if code in icd_to_snowmed_map)))

df['SNOMED_CID'] = df['ICD9_CODE'].map(icd9_to_snowmed)
df
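
To make the merge-and-lookup logic concrete, here is a small sketch on made-up codes and identifiers. The names `one_to_one`, `one_to_many`, `merged` and `lookup` are hypothetical, and the result is sorted here only to make the output deterministic (the notebook's function returns the identifiers in arbitrary order).

```python
from itertools import chain
from typing import Dict, List

# Made-up codes and identifiers, for illustration only.
one_to_one: Dict[str, List[str]] = {"A": ["1"], "B": ["2"]}
one_to_many: Dict[str, List[str]] = {"B": ["3", "4"], "C": ["5"]}

# Start from a copy of the one-to-one map, then append the one-to-many targets.
merged: Dict[str, List[str]] = {code: list(cids) for code, cids in one_to_one.items()}
for code, cids in one_to_many.items():
    merged.setdefault(code, []).extend(cids)


def lookup(codes: List[str]) -> List[str]:
    """Return the sorted, de-duplicated identifiers for the codes that have a mapping."""
    return sorted(set(chain.from_iterable(merged[c] for c in codes if c in merged)))


print(merged)
print(lookup(["B", "C", "Z"]))
```

Codes without an entry in the map (such as `"Z"` above) are silently skipped, so an admission whose codes are all unmapped ends up with an empty list.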

### Roll up the SNOMED CT Identifiers

Next, you will need to download a SNOMED CT release and set the filepath below to the "Relationship_Full" text file.

In [None]:
relationships_filepath = ""

In [None]:
from pathlib import Path

IS_A_CID = "116680003"

snowmed_relationships = pd.read_csv(
    relationships_filepath,
    sep="\t",
    usecols=["sourceId", "destinationId", "typeId"],
    dtype={"sourceId": str, "destinationId": str, "typeId": str},
)

snowmed_relationships = snowmed_relationships[snowmed_relationships["typeId"] == IS_A_CID]
snowmed_relationships

This is a hack to get the IDs of the direct children of the "Heart disease" concept, which sits under the Clinical Finding > Disease branch of SNOMED.

In [None]:
# This was copy/pasted from: https://uts.nlm.nih.gov/snomedctBrowser.html;jsessionid=59F3153C7129757C034B75CE3340662C#A2880798;1;0;AUI;undefined;SNOMEDCT_US;undefined;false;
# A much more desirable solution would be to automatically determine children from the SNOMED release files.
heart_disease_children = """
Abnormal fetal heart beat first noted during labor AND/OR delivery in liveborn infant [
	22271007	
]
Acute heart disease [
	127337006	
]
Anomalous bands of heart [
	204407002	
]
Athlete's heart [
	233931008	
]
Cardiac abnormality due to heart abscess [
	461089003	
]
Cardiac arrhythmia [
	698247007	
]
Cardiac complication [
	40172005	
]
Cardiac complication of anesthesia during the puerperium [
	717959008	
]
Cardiac complication of procedure [
	362999008	
]
Cardiac disease in pregnancy [
	274121004	
]
Cardiac disorder due to typhoid fever [
	1084791000119106	
]
Cardiac glycogen phosphorylase kinase deficiency [
	297253000	
]
Chronic heart disease [
	128238001	
]
Cyanotic congenital heart disease [
	12770006	
]
Disorder of atrium following procedure [
	735584007	
]
Disorder of cardiac function [
	105981003	
]
Disorder of cardiac ventricle [
	415991003	
]
Disorder of coronary artery [
	414024009	
]
Disorder of left atrium [
	472763005	
]
Disorder of right atrium [
	472762000	
]
Disorder of transplanted heart [
	429257001	
]
Endocardial disease [
	123596001	
]
Fetal heart disorder [
	430901004	
]
Healed bacterial endocarditis [
	123597005	
]
Heart disease co-occurrent with human immunodeficiency virus infection [
	713298006	
]
Heart disease due to ionizing radiation [
	430401005	
]
Heart disease in mother complicating pregnancy, childbirth AND/OR puerperium [
	78381004	
]
Heart valve disorder [
	368009	
]
Hypertensive heart disease [
	64715009	
]
Induced termination of pregnancy complicated by cardiac arrest and/or failure [
	609460008	
]
Infectious disease of heart [
	128403000	
]
Mechanical complication due to heart valve prosthesis [
	43873004	
]
Myocardial disease [
	57809008	
]
Myxedema heart disease [
	4641009	
]
Outflow tract abnormality in solitary indeterminate ventricle [
	448898002	
]
Pulmonary heart disease [
	274096000	
]
Rheumatic heart disease [
	23685000	
]
Sinistrocardia [
	424889004	
]
Structural disorder of heart [
	128599005	
]
Thyrotoxic heart disease [
	60446003	
]
Ventricular tachycardia [
	25569003	
]"""

In [None]:
heart_disease_ids = [line.split("\t")[1] for line in heart_disease_children.split("\n") if line and len(line.split("\t")) == 3]

rollup_map = {}
for cid in heart_disease_ids:
    children = snowmed_relationships[snowmed_relationships["destinationId"] == cid]["sourceId"].unique().tolist()
    rollup_map.update(dict.fromkeys(children, cid))
pprint(dict(itertools.islice(rollup_map.items(), 10)))
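
Note that the map above only covers concepts one level below each listed child. One possible way to derive children (and deeper descendants) directly from IS-A relationships is a breadth-first traversal. The sketch below uses made-up concept IDs, and the helper names `build_children_index` and `descendants` are hypothetical; it is an illustration of the idea, not the method used in this notebook.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def build_children_index(is_a_edges: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each parent concept to its direct children, given (child, parent) pairs."""
    children = defaultdict(list)
    for child, parent in is_a_edges:
        children[parent].append(child)
    return children


def descendants(root: str, children: Dict[str, List[str]]) -> Set[str]:
    """Collect every concept reachable from `root` by following IS-A edges downwards."""
    seen: Set[str] = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Made-up concept IDs: "heart" <- "valve" <- "mitral", and "heart" <- "rhythm".
edges = [("valve", "heart"), ("mitral", "valve"), ("rhythm", "heart"), ("lung", "body")]
index = build_children_index(edges)
print(sorted(descendants("heart", index)))
```

With the relationships table loaded above, the `(child, parent)` pairs would presumably come from `zip(snowmed_relationships["sourceId"], snowmed_relationships["destinationId"])`, since the filtering code treats `sourceId` as the child of `destinationId`.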

Finally, we will roll up the SNOMED CIDs to the level of "Heart disease" and label each note according to whether any of its CIDs fall under it.

In [None]:
def snowmed_cid_rollup(cids: List[str]):
    return [rollup_map[cid] for cid in cids if cid in rollup_map]

df['SNOMED_HEART_DISEASE_CID'] = df['SNOMED_CID'].map(snowmed_cid_rollup)
df['IS_HEART_DISEASE'] = df['SNOMED_HEART_DISEASE_CID'].map(lambda x: 1 if x else -1)
df

Below, we can check the percentage of notes that were mapped to at least one SNOMED heart disease CID.

> Note that 100% of notes are labelled with one or more ICD9 codes.

In [None]:
print(f"{sum(bool(cids) for cids in df['SNOMED_HEART_DISEASE_CID']) / len(df) * 100:.2f}%")

## Save the Data

First, we partition the dataset into stratified train, validation and test splits. Then we save it in a format that is suitable for use with [AllenNLP-multi-label](https://github.com/semantic-health/allennlp-multi-label) and [Deep Patient Cohorts](https://github.com/semantic-health/deep-patient-cohorts).

Set the filepath to save the data as stratified train/valid/test folds (70%/10%/20% splits).
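
The 70%/10%/20% proportions are obtained with two successive splits: first the training set is separated from a 30% hold-out, then the hold-out is divided into test and validation. The arithmetic can be checked with a short sketch (variable names other than `train_size`, `valid_size` and `test_size` are made up for illustration):

```python
train_size, valid_size, test_size = 0.7, 0.1, 0.2

# Step 1: hold out everything that is not training data (30% of the dataset).
holdout_fraction = 1 - train_size

# Step 2: the share of the hold-out that becomes the test set (two thirds).
test_share_of_holdout = test_size / (test_size + valid_size)

# Resulting fractions of the full dataset.
final_test = holdout_fraction * test_share_of_holdout
final_valid = holdout_fraction * (1 - test_share_of_holdout)
print(round(train_size, 2), round(final_valid, 2), round(final_test, 2))
```

Because the second split is applied to the hold-out only, its `test_size` has to be expressed relative to the hold-out rather than to the full dataset.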

In [None]:
output_dir = "../data/MIMIC-III-HEART-DISEASE"

In [None]:
import json

from sklearn.model_selection import train_test_split

def save_data(text: np.ndarray, labels: np.ndarray, output_dir: str, use_string_labels: bool = False) -> None:
    # `text` and `labels` are taken from the arguments rather than from the global `df`.
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    if use_string_labels:
        labels = np.where(labels==1, "heart_disease", "no_heart_disease")

    train_size = 0.7
    valid_size = 0.1
    test_size = 0.2

    # https://datascience.stackexchange.com/a/53161
    X_train, X_test, y_train, y_test = train_test_split(
        text, labels, test_size=1 - train_size, stratify=labels, random_state=RANDOM_STATE
    )
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_test, y_test, test_size=test_size/(test_size + valid_size), stratify=y_test, random_state=RANDOM_STATE
    )

    X_train = X_train.ravel()
    X_valid = X_valid.ravel()
    X_test = X_test.ravel()
    
    y_train = y_train.tolist()
    y_valid = y_valid.tolist()
    y_test = y_test.tolist()

    with open(output_dir / "train.jsonl", "w") as f:
        for t, l in zip(X_train, y_train):
            f.write(f"{json.dumps({'text': t, 'label': l})}\n")
    with open(output_dir / "valid.jsonl", "w") as f:
        for t, l in zip(X_valid, y_valid):
            f.write(f"{json.dumps({'text': t, 'label': l})}\n")
    with open(output_dir / "test.jsonl", "w") as f:
        for t, l in zip(X_test, y_test):
            f.write(f"{json.dumps({'text': t, 'label': l})}\n")

In [None]:
text = df[["TEXT"]].values
labels = df["IS_HEART_DISEASE"].values

save_data(text, labels, output_dir, use_string_labels=False)
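
Each output file is in JSON Lines format: one JSON object per line, with a `text` and a `label` key. As a quick sanity check of that format, the sketch below writes two made-up records to an in-memory buffer and reads them back; with real data, the buffer would be replaced by one of the saved `.jsonl` files.

```python
import io
import json

# Made-up records in the same shape as the saved data.
records = [
    {"text": "Patient admitted with chest pain.", "label": 1},
    {"text": "Routine follow-up, no complaints.", "label": -1},
]

# Write one JSON object per line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read the lines back and parse each one independently.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded == records)
```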