# CS598 Deep Learning for Healthcare Final Project
## Reproduction of Deepr: A Convolutional Net for Medical Records
### Juan Alvarez Martinez, Shane Sepac

In [None]:
### TODO: Include summary and report of findings here [200 words]

## Load MIMIC-III Dataset
Four CSV files are needed from the MIMIC-III dataset: ADMISSIONS, PATIENTS, DIAGNOSES_ICD, and PROCEDURES_ICD. These files can be downloaded automatically from S3, or you can place them in `<project_root>/mimic3`.
- If loading from S3, make sure every environment variable listed in `.env.sample` is copied into a `.env` file and given a value.
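
Before attempting the S3 download, it can help to confirm that the environment actually provides the expected variables. The sketch below uses only the standard library. The variable names are hypothetical placeholders (the real ones are whatever `.env.sample` lists), and `missing_env_vars` is an illustrative helper, not part of the project's `utils` module. With python-dotenv installed, calling `load_dotenv()` first reads the `.env` file into `os.environ`.

```python
import os

# Hypothetical names for illustration only; use the names listed in .env.sample.
REQUIRED_ENV_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET_NAME"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this check before the download cell turns a missing or half-filled `.env` file into an explicit message rather than a credentials error from deep inside the S3 client.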

In [1]:
# install the required dependencies
%pip install boto3 python-dotenv pandas pyhealth

Note: you may need to restart the kernel to use updated packages.


### Get MIMIC-III Data
Attempt to download the MIMIC-III data from S3 if the required CSV files are not already in the `mimic3` folder at the project root.

In [2]:
import pandas as pd
import os
from utils import copy_file_from_s3

data_folder = "mimic3"
required_files = ["ADMISSIONS.csv", "PATIENTS.csv", "DIAGNOSES_ICD.csv", "PROCEDURES_ICD.csv"]

for fn in required_files:
  if not os.path.exists(os.path.join(data_folder, fn)):
    print(f"Cannot find {fn} in {data_folder}, trying to download from S3...")
    copy_file_from_s3(fn, data_folder)
  else:
    print(f"Found {fn}...")

Found ADMISSIONS.csv...
Found PATIENTS.csv...
Found DIAGNOSES_ICD.csv...
Found PROCEDURES_ICD.csv...


In [1]:
from pyhealth.datasets import MIMIC3Dataset

# pyhealth does not support mapping ICD-9 to ICD-10 codes.
mimic3_ds = MIMIC3Dataset("./mimic3/", ["DIAGNOSES_ICD", "PROCEDURES_ICD"])

mimic3_ds.info()
mimic3_ds.stat()



dataset.patients: patient_id -> <Patient>

<Patient>
    - visits: visit_id -> <Visit> 
    - other patient-level info
    
    <Visit>
        - event_list_dict: table_name -> List[Event]
        - other visit-level info
    
        <Event>
            - code: str
            - other event-level info


Statistics of base dataset (dev=False):
	- Dataset: MIMIC3Dataset
	- Number of patients: 46520
	- Number of visits: 58976
	- Number of visits per patient: 1.2678
	- Number of events per visit in DIAGNOSES_ICD: 11.0384
	- Number of events per visit in PROCEDURES_ICD: 4.0711




## Sequencing EMR: Creating Sentences representing patient episodes
Per Deepr, an EMR must be translated into a sentence before it is fed to the model. An EMR is a sequence of time-stamped visit episodes. Each episode involves a series of diagnoses and treatments, called a phrase. Consecutive phrases are separated by a special word: either the time gap between the two visits, discretized into one of the month intervals `(0–1]`, `(1–3]`, `(3–6]`, `(6–12]`, or `12+`, or `TRANSFER`, which indicates a transfer between care providers (separate departments within the same hospital, or between hospitals). Infrequent words, meaning those that appear fewer than 100 times, are replaced with `RAREWORD`. Per the Deepr paper, an example sentence looks as follows:

```
1910 Z83 911 1008 D12 K31 1-3m R94 RAREWORD H53 Y83 M62 Y92 E87 T81 RAREWORD RAREWORD 1893 D12 S14 738 1910 1916 Z83 0-1m T91 RAREWORD Y83 Y92 K91 M10 E86 6-12m K31 1008 1910 Z13 Z83.
```

Note: In the sentence above, diagnoses are in ICD-10 format (a character followed by digits) and procedures are in digits. 
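
The interval words can be produced by bucketing the gap between two consecutive visits. The project's own `timedelta_to_interval` lives in `utils` and is not shown here, so the following is only a sketch of one possible bucketing. `bucket_interval` is a hypothetical helper, the 30-day month is an assumption, the token spellings follow the example sentence above, and `TRANSFER` is not handled.

```python
from datetime import timedelta

DAYS_PER_MONTH = 30  # assumption: the notebook does not say how a month is measured


def bucket_interval(delta: timedelta) -> str:
    """Map the gap between two visits to a Deepr-style interval word."""
    months = delta.total_seconds() / (DAYS_PER_MONTH * 24 * 3600)
    # Upper bounds are inclusive, matching the (a, b] intervals in the text
    for upper, word in [(1, "0-1m"), (3, "1-3m"), (6, "3-6m"), (12, "6-12m")]:
        if months <= upper:
            return word
    return "12+m"


gaps = [timedelta(days=10), timedelta(days=45), timedelta(days=400)]
print([bucket_interval(g) for g in gaps])
# prints ['0-1m', '1-3m', '12+m']
```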

The MIMIC-III dataset provides ICD-9 codes. These will be used in their level-3 variant for consistency with the original paper. Note also that the encounter and discharge datetimes of visits fall between the years 2100 and 2200 in order to de-identify patients; the time intervals between a patient's visits are nevertheless preserved.
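
As a rough illustration of what a level-3 code looks like, the sketch below truncates ICD-9 diagnosis codes to their category. It assumes the codes are stored without a decimal point (for example `4019` for 401.9) and covers diagnosis codes only; procedure codes are categorized differently and are left out. `icd9_diag_level3` is a hypothetical helper and not necessarily how this notebook performs the truncation.

```python
def icd9_diag_level3(code: str) -> str:
    """Truncate an ICD-9 diagnosis code (stored without a decimal point) to its category."""
    code = code.strip().upper()
    # E codes have a four-character category (E + 3 digits); all others use three characters
    return code[:4] if code.startswith("E") else code[:3]


print([icd9_diag_level3(c) for c in ["4019", "V3001", "E8788", "25000"]])
# prints ['401', 'V30', 'E878', '250']
```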


In [2]:
'''
Count how often each word (diagnosis or procedure code) occurs across all visits,
so that rare words (counts of less than 100) can be identified
'''
word_cnts = {}
for p in mimic3_ds.patients.values():
  for v in p.visits.values():
    # Collect every diagnosis and procedure code recorded for this visit
    words = [e.code for e in v.get_event_list('DIAGNOSES_ICD')]
    words += [e.code for e in v.get_event_list('PROCEDURES_ICD')]

    # Count inside the visit loop so that every visit contributes,
    # not only the last visit of each patient
    for word in words:
      word_cnts[word] = word_cnts.get(word, 0) + 1

In [7]:

from utils import timedelta_to_interval
import random
import json

'''
Translate EMRs into sentences as outlined by the paper. A sentence consists of phrases, which are randomly shuffled diagnosis and procedure codes,
separated by the time interval between visits, if one exists. Sentences have at most 100 words.

While looping over each patient:
  1. Sort visits by encounter_time
  2. Find the time interval between each pair of consecutive visits and generate its interval word
  3. Build arrays of diagnosis and procedure codes for each visit, replacing ICD-9 codes with fewer than 100 usages with RAREWORD
  4. Merge and randomly shuffle each visit's diagnosis and procedure codes, then append the time interval word if available. This represents a phrase.
    Collect the phrases in an array, which is joined to form the final sentence. If adding a phrase would make the sentence longer
    than 100 words, the phrase is truncated so that the sentence has exactly 100 words, and later visits are dropped.
'''
sentences = []
for i, p in enumerate(mimic3_ds.patients.values()):

  # Sort patient visits by encounter_time in order to gauge the interval between visits
  sorted_visits = sorted(p.visits.items(), key=lambda v: v[1].encounter_time)

  # Generate timestamps in between visits
  timestamps = list(map(lambda visit: visit[1].encounter_time, sorted_visits))
  time_intervals = [
    t2 - t1
    for t1, t2 in zip(timestamps[:-1], timestamps[1:])
  ]
  # Convert timestamps to month intervals as specified in paper
  time_interval_strs = timedelta_to_interval(time_intervals)

  # event_diagnoses_ls = (visit, diagnoses_codes)
  event_diagnoses_ls = []
  # event_procedures_ls = (visit, procedure_codes)
  event_procedures_ls = []

  # Helper function to create arrays with RAREWORD using list comprehension
  def handle_event(event_list, word_cnts):
      return ["RAREWORD" if e.code in word_cnts and word_cnts[e.code] < 100 else e.code for e in event_list]

  # build arrays of diagnoses and procedures on a visit level, add to event_diagnoses_ls or event_procedures_ls
  for _, v in sorted_visits:
      visit_diagnoses = handle_event(v.get_event_list('DIAGNOSES_ICD'), word_cnts)
      event_diagnoses_ls.append(visit_diagnoses)

      visit_procedures = handle_event(v.get_event_list('PROCEDURES_ICD'), word_cnts)
      event_procedures_ls.append(visit_procedures)


  # Randomly shuffle diagnosis and procedure codes and append a time interval after, if available. Ensure the output sentence will not be more than 100 words.
  arrs = []
  word_cnt = 0
  for i, vd in enumerate(event_diagnoses_ls):
      arr = vd + event_procedures_ls[i]
      random.shuffle(arr)
      if i < len(time_interval_strs):
          arr.append(time_interval_strs[i])

      new_word_cnt = word_cnt + len(arr)

      if new_word_cnt > 100:
          # Calculate the number of elements needed to reach exactly 100 words
          elements_needed = 100 - word_cnt
          # Take a subset of arr to make new_word_cnt equal 100
          arr = arr[:elements_needed]
          arrs.append(arr)
          break

      arrs.append(arr)
      word_cnt = new_word_cnt

  # Combine all codes and time interval to create a phrase, representing a visit
  phrases = [" ".join(arr) for arr in arrs]
  # Combine all phrases to create a sentence, representing a sequence as outlined by the paper
  sentence = " ".join(phrases)
  sentences.append(sentence)

# output to json file
output_dir = "data"
output_filename = "sentences.json"

os.makedirs(output_dir, exist_ok=True)

with open(os.path.join(output_dir, output_filename), "w") as json_file:
  json.dump(sentences, json_file)
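
The rare-word replacement and the 100-word cap in the cell above can also be factored into a small function that is easy to exercise on toy data. This is a sketch rather than the notebook's implementation: `build_sentence` and its parameters are hypothetical names, visits are passed in as plain lists of codes (oldest first), and the interval words are assumed to be precomputed.

```python
import random

MAX_WORDS = 100       # sentence length cap used in this notebook
RARE_THRESHOLD = 100  # codes seen fewer times than this become RAREWORD


def build_sentence(visits, interval_words, word_cnts, max_words=MAX_WORDS, rng=None):
    """Join per-visit phrases into one sentence of at most max_words words.

    visits: list of code lists, oldest visit first.
    interval_words: one word per gap between consecutive visits.
    """
    rng = rng or random.Random()
    sentence = []
    for idx, codes in enumerate(visits):
        # Codes missing from word_cnts are kept, mirroring the cell above
        phrase = [
            "RAREWORD" if word_cnts.get(c, RARE_THRESHOLD) < RARE_THRESHOLD else c
            for c in codes
        ]
        rng.shuffle(phrase)
        if idx < len(interval_words):
            phrase.append(interval_words[idx])
        sentence.extend(phrase)
        if len(sentence) >= max_words:
            break  # later visits are dropped once the cap is reached
    return " ".join(sentence[:max_words])


toy_counts = {"401": 500, "250": 300, "9999": 3}
print(build_sentence([["401", "9999"], ["250"]], ["1-3m"], toy_counts, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the shuffle reproducible, which is convenient when comparing sentences between runs.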
  

### Test that output sentences satisfy the following conditions:
- There is one sentence for each patient
- Each sentence is capped at 100 words
- Multi-visit patients have their visits separated by a time-interval word
- No ICD-9 code used fewer than 100 times appears in its original form (it should be replaced with RAREWORD)
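
A stricter separator check could match whole words instead of searching for a `-` or `+` character anywhere in the sentence. The sketch below assumes the separator spellings `0-1m`, `1-3m`, `3-6m`, `6-12m`, `12+m`, and `TRANSFER`; the actual output of `timedelta_to_interval` may differ, and `count_separators` is a hypothetical helper.

```python
import re

# Assumed token spellings; adjust to whatever timedelta_to_interval actually emits
SEPARATOR = re.compile(r"0-1m|1-3m|3-6m|6-12m|12\+m|TRANSFER")


def count_separators(sentence: str) -> int:
    """Count the words in a sentence that are exactly one of the separator tokens."""
    return sum(1 for word in sentence.split() if SEPARATOR.fullmatch(word))


print(count_separators("401 250 1-3m 4019 TRANSFER 428 12+m 250"))
# prints 3
```

For a patient with `n` visits and an untruncated sentence, the count would be expected to equal `n - 1`.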


In [8]:
import re

# There should be one sentence per patient
num_patients = len(mimic3_ds.patients)
num_sentences = len(sentences)
assert(num_patients == num_sentences)

# There should be max 100 words per sentence
word_lengths = map(lambda s: len(s.split()), sentences)
assert(max(list(word_lengths)) <= 100)

# No sentence should contain a word that is present fewer than 100 times
for sentence in sentences:
  rareword_violations = [w for w in sentence.split() if w in word_cnts and word_cnts[w] < 100]
  assert len(rareword_violations) == 0


# Patients with multiple visits should have time-interval words separating their visits, e.g. 1-3m or 12+m
# TODO: Add TRANSFER to regex
pattern = re.compile(r"[-+]")
for i, p in enumerate(mimic3_ds.patients.values()):
    if len(p.visits) > 1:
        assert pattern.search(sentences[i]), f"Failed assertion for sentences[{i}]: '{sentences[i]}'"
