<a href="https://colab.research.google.com/github/thotran2015/6.871/blob/master/Problem_1_Skeleton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The Unified Medical Language System (UMLS) is a set of standardized medical vocabularies used across many settings. MetaMap is a useful (although flawed) tool that extracts clinical concepts from raw clinical text and maps them to UMLS vocabularies. Here we will explore a subset of the MIMIC-III discharge summaries and use the extracted clinical concepts to build relationships between diseases and symptoms.



# 0. Load Data

In [0]:
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt

import numpy as np
import pickle

from google.colab import auth

We have pulled concepts from about 2k discharge summaries.


In [0]:
auth.authenticate_user()

In [0]:
!gsutil cp gs://hst-956/pset2/adult_dc_concepts.csv ./
!gsutil cp gs://hst-956/pset2/adult_dc_summaries.csv ./
!gsutil cp gs://hst-956/pset2/cooccurrence_info.p ./
!gsutil cp gs://hst-956/pset2/male_cooccurrence_info.p ./
!gsutil cp gs://hst-956/pset2/female_cooccurrence_info.p ./
!gsutil cp gs://hst-956/pset2/disease_symptom_names.p ./


concepts = pd.read_csv('adult_dc_concepts.csv')
discharge = pd.read_csv('adult_dc_summaries.csv')

Because of how we extract clinical concepts, the index from the discharge summaries corresponds to the index column of the clinical concepts.



In [0]:
discharge['index'] = discharge.index
df = discharge.merge(concepts, on='index', how='right')
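
To see what the right merge does, here is a minimal sketch on toy data. The frames `toy_discharge` and `toy_concepts` and their values are made up for illustration and are not taken from MIMIC-III; only the column names `index`, `icustay_id`, `dc_chart`, and `preferred_name` come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the two CSVs (hypothetical values, not real data).
toy_discharge = pd.DataFrame({
    "icustay_id": [101, 102, 103],
    "dc_chart": ["note A", "note B", "note C"],
})
toy_concepts = pd.DataFrame({
    "index": [0, 0, 2],  # row position of the summary each concept came from
    "preferred_name": ["Fever", "Coughing", "Pain"],
})

# Same two steps as in the cell above.
toy_discharge["index"] = toy_discharge.index
toy_df = toy_discharge.merge(toy_concepts, on="index", how="right")

# One row per extracted concept, with the summary's columns repeated.
# Summary 1 has no concepts, so it does not appear in the result.
print(toy_df[["index", "icustay_id", "preferred_name"]])
```

Because the merge is a right join on the concepts table, a summary with no extracted concepts drops out of the merged frame.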

# Part 1: What are clinical concepts? How do they work?

We can do some basic statistics for our data.


In [0]:
print(len(df['index'].value_counts()), 'discharge summaries')
print(len(df), 'extracted concepts')

In [0]:
# TODO: 1.1 How many extracted concepts per discharge summary on average?
# Plot the histogram of number of concepts per discharge summary.
concept_per_discharge = len(df)/len(df['index'].value_counts())
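
One possible way to get the per-summary counts needed for the histogram is sketched below on toy data. The frame `toy_df` is hypothetical; it only assumes the merged table has one row per extracted concept and an `index` column identifying the summary, as above.

```python
import numpy as np
import pandas as pd

# Toy merged table: summaries 0, 1, 2 with 3, 1, and 2 concepts (made-up values).
toy_df = pd.DataFrame({"index": [0, 0, 0, 1, 2, 2]})

# Number of concept rows per discharge summary.
counts = toy_df.groupby("index").size()

# The mean of these counts equals len(toy_df) / number of summaries.
mean_concepts = counts.mean()

# Bin the counts; the same Series can also be plotted directly.
hist, bin_edges = np.histogram(counts, bins=3)
print(mean_concepts, hist)
```

In the notebook, `counts.plot.hist(bins=50)` would draw the histogram; the bin count of 50 is an arbitrary choice.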


Let's take a look at the table we have now.



In [0]:
df.head(1)

In particular, we see a few columns of interest:

* trigger: the source word(s) from which the clinical concept was extracted
* semtypes: group of clinical concept extracted (more info)
* preferred_name: explanation of concept in human-readable form
* score: MetaMap-assigned score of the extracted concept; larger is more confident
* dc_chart: raw discharge summary from which concept was extracted
* cui: the concept unique identifier for each extracted concept


Let's take a look at the patient with icustay_id = 232593

In [0]:
print(len(df[(df['icustay_id'] == 232593)]), 'extracted concepts for icustay_id = 232593')

In [0]:
df[df['icustay_id'] == 232593][['trigger', 'semtypes', 'preferred_name','cui']].head()


If we look above, we can examine the trigger or source word and examine what clinical concepts were extracted. For example, we see in row 4, the clinical concept Saccharomyces cerevisiae or cui = C0036025 is extracted. By looking at the [Semantic Type guide](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt), we see that semtypes=[fngs] meaning this CUI is a fungus.



### Difficulty of Concept Extraction

Clinical concept extraction is incredibly difficult due to the ambiguity and diversity of language. We will explore that below.

In [0]:
# TODO: 1.2 Give a list of 3 additional words that indicate CUI C0015967 (Fever)
# in our dataset.
df[df['cui'] == 'C0015967']['trigger'].apply(lambda x : x.split('-')[0].strip('["')).unique()

Hyperthermia, Increased temperature, temperature elevation, high body temperature, feverish, hyperthermic, Febrile

In [0]:
# TODO: 1.3 Give a list of at least 3 CUIs that include Cold in 
# their preferred_name. Why may these be hard to disambiguate?
# na=False treats rows with a missing preferred_name as non-matches
df[df['preferred_name'].str.contains(pat='cold', case=False, na=False)]['preferred_name'].unique()

In [0]:
# TODO: 1.4 For icustay_id = 232593, explain what triggered one concept to
# include "Fruit" in the preferred_name column. 


# 2. Relating Symptoms and Diseases

In the first lecture, we discussed how synthesis of a lot of medical knowledge is done manually. The INTERNIST-1/QMR model was a probabilistic model relating 500+ diseases and 4000+ symptoms, but developing it took over 15 person-years of work. Here, you will do a short analysis of symptom-disease co-occurrence based on mentions in notes. 

#### TODO:
  - maybe remove wholesale? Maybe keep, given it is short?
  - Use p(disease | symptom) directly? Or p(symptom | disease)?

In [0]:
# TODO: 1.5 Which semtype corresponds to a disease and which semtype corresponds to a symptom?
# Use the MetaMap documentation to guide you: https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt

In [0]:
# TODO: 1.6 What are the 5 most frequent diseases and the 5 most frequent 
# symptoms in the dataset? 
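
A minimal sketch of counting the most frequent concepts of a given semantic type follows. It assumes that `semtypes` is stored as a string such as `[dsyn]` (as in the `[fngs]` example above) and uses the abbreviations `dsyn` (Disease or Syndrome) and `sosy` (Sign or Symptom) from the MetaMap semantic type list. The helper `most_frequent` and the toy rows are hypothetical, and the sketch counts mentions rather than distinct ICU stays.

```python
import pandas as pd

# Toy rows (made-up values) in the assumed "[abbrev]" string format.
toy_df = pd.DataFrame({
    "semtypes": ["[dsyn]", "[sosy]", "[dsyn]", "[sosy]", "[dsyn]", "[fngs]"],
    "preferred_name": ["Pneumonia", "Fever", "Pneumonia", "Coughing",
                       "Sepsis", "Saccharomyces cerevisiae"],
})

def most_frequent(df, semtype, k=5):
    """Return the k most-mentioned preferred names whose semtypes contain `semtype`."""
    # Plain substring match; rows with missing semtypes count as non-matches.
    mask = df["semtypes"].str.contains(semtype, regex=False, na=False)
    return df.loc[mask, "preferred_name"].value_counts().head(k)

top_diseases = most_frequent(toy_df, "dsyn")
top_symptoms = most_frequent(toy_df, "sosy")
print(top_diseases)
print(top_symptoms)
```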

Some of the diseases/symptoms we've extracted are too general for our use case, or just incorrect, as you learned with the "Fruit" example in the previous case. Therefore, we will need to do some data cleaning and ignore some diseases and symptoms. We provide a list of diseases/symptoms that we ignore below. 

In [0]:
## Preferred names of diseases and symptoms to ignore

# Each comment notes the trigger text that was mis-mapped, or why the name is excluded.
ignore_diseases = ['Disease', # condition of any kind
                   'Communicable Diseases', # too broad
                   'Infantile Neuroaxonal Dystrophy', # plan
                   'SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME', # soft??
                   'SYNOVITIS, GRANULOMATOUS, WITH UVEITIS AND CRANIAL NEUROPATHIES (disorder)', # EOS -> eosinophil
                   'Pneumocystis jiroveci pneumonia', # PCP
                   'Ventricular Fibrillation, Paroxysmal Familial, 1', # IVF -> intravenous fluids
                   'Nuclear non-senile cataract', # NS??
                   'Macrophage Activation Syndrome', # mass
                   'MYOTONIC DYSTROPHY 1', # DM
                   'MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME', # MEDS
                   'Illness (finding)', # illness
                   'Oculocutaneous albinism type 1A', # ATN
                   'POLYARTERITIS NODOSA, CHILDHOOD-ONSET', # PAN
                  ]

ignore_symptoms = ['Discharge, body substance', # discharge date
                   'Mass of body region', # "no masses"
                   'Clubbing', # No clubbing
                   'Symptoms', # overly vague
                   'Signs and Symptoms', # too vague
]

## Disease-Symptom Co-occurrence 

We now build a co-occurrence matrix of diseases and symptoms using the function provided below. We provide the output of this function as a pickle file, since it can take 20-30 minutes to run in Colab. For each of the ~2k ICU stays, we add 1 to a disease-symptom entry if both the disease and the symptom are mentioned in a note during the stay, and 0 otherwise. Therefore, each entry of the co-occurrence matrix is a count between 0 and ~2k.

In [0]:
# Function for creating a cooccurrence matrix
def create_cooccurrence_matrix(df, diseases, symptoms):
  cooccur = np.zeros((len(diseases), len(symptoms)))
  disease_count = np.zeros((len(diseases)))
  symptom_count = np.zeros((len(symptoms)))
  stay_ids = df['icustay_id'].unique()
  for i, stay in enumerate(stay_ids):
      sub_df = df[df['icustay_id'] == stay]
      # Update disease counts
      for d_idx, d in enumerate(diseases):   
          d_in_uid = (d in sub_df['preferred_name'].values)
          if d_in_uid:
              disease_count[d_idx] += 1
      # Update symptom counts
      for s_idx, s in enumerate(symptoms):
          s_in_uid = (s in sub_df['preferred_name'].values)
          if s_in_uid:
              symptom_count[s_idx] += 1
      # Update combined counts
      for d_idx, d in enumerate(diseases):   
          d_in_uid = (d in sub_df['preferred_name'].values)
          for s_idx, s in enumerate(symptoms):
              s_in_uid = (s in sub_df['preferred_name'].values)
              if d_in_uid and s_in_uid:
                  cooccur[d_idx][s_idx] += 1
  return cooccur, disease_count, symptom_count, len(stay_ids)
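
The same counts can also be obtained with matrix operations, which avoids the nested loops. The sketch below is an alternative under stated assumptions, not the function used to build the provided pickle files: the name `cooccurrence_via_indicators` is hypothetical, and it assumes `icustay_id` and `preferred_name` have no missing values and that `diseases` and `symptoms` contain no duplicates.

```python
import numpy as np
import pandas as pd

def cooccurrence_via_indicators(df, diseases, symptoms):
    # Stay-by-name table; clip turns mention counts into 0/1 "mentioned at least once".
    present = pd.crosstab(df["icustay_id"], df["preferred_name"]).clip(upper=1)
    # Select the requested names; names never mentioned become all-zero columns.
    d_ind = present.reindex(columns=diseases, fill_value=0).to_numpy(dtype=float)
    s_ind = present.reindex(columns=symptoms, fill_value=0).to_numpy(dtype=float)
    # Entry (d, s) sums, over stays, the product of the two indicators.
    cooccur = d_ind.T @ s_ind
    n_stays = df["icustay_id"].nunique()
    return cooccur, d_ind.sum(axis=0), s_ind.sum(axis=0), n_stays

# Toy data (made-up values): three stays, with a repeated mention in stay 1.
toy_df = pd.DataFrame({
    "icustay_id": [1, 1, 1, 2, 2, 3],
    "preferred_name": ["Pneumonia", "Fever", "Fever", "Pneumonia", "Coughing", "Fever"],
})
toy_cooccur, toy_d_ct, toy_s_ct, toy_N = cooccurrence_via_indicators(
    toy_df, ["Pneumonia", "Sepsis"], ["Fever", "Coughing"]
)
print(toy_cooccur)
```

The repeated "Fever" mention in stay 1 is counted once, matching the per-stay membership test in the loop version.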

In [0]:
# cooccur is a matrix of shape (num_diseases, num_symptoms), holding the number of ICU stays where both
# a disease and a symptom were mentioned
# disease_counts has length num_diseases, equaling the number of ICU stays where each disease was mentioned
# symptom_counts has length num_symptoms, equaling the number of ICU stays where each symptom was mentioned
# N is the number of ICU stays in the cohort
cooccur, disease_counts, symptom_counts, N = pickle.load(open('./cooccurrence_info.p', 'rb')) 

In [0]:
disease_names, symptom_names = pickle.load(open('disease_symptom_names.p', 'rb'))

To quantify how much more often a disease and a symptom co-occur than would be expected by chance, we will use [lift](https://en.wikipedia.org/wiki/Lift_(data_mining)). In other words, for each disease-symptom pair, we calculate $\frac{P(\text{disease}, \text{symptom})}{P(\text{disease})\,P(\text{symptom})}$, where each probability is estimated as the corresponding count divided by $N$. A lift above 1 means the pair co-occurs more often than independence would predict. This is purely correlational, but it can still be somewhat informative as to how diseases present. For a more sophisticated approach to quantifying disease-symptom relationships, see the Noisy-OR model in [past work on creating a health knowledge graph](https://www.nature.com/articles/s41598-017-05778-z).
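
As a worked illustration of the lift formula, here is a minimal sketch on toy counts. The function name `lift_sketch` and the numbers are hypothetical. It estimates each probability as a count divided by `N`, and returning NaN for pairs where the disease or symptom never occurs is one possible choice, not a requirement of the definition.

```python
import numpy as np

def lift_sketch(cooccur, disease_ct, symptom_ct, N):
    cooccur = np.asarray(cooccur, dtype=float)
    # Joint count expected under independence: N * P(disease) * P(symptom).
    expected = np.outer(disease_ct, symptom_ct) / N
    # Observed / expected; pairs with a zero expected count stay NaN.
    lift = np.full(cooccur.shape, np.nan)
    np.divide(cooccur, expected, out=lift, where=expected > 0)
    return lift

# Toy counts (made-up values): 2 diseases, 2 symptoms, 3 stays.
toy_lift = lift_sketch(
    cooccur=[[1, 1], [0, 0]],
    disease_ct=[2, 0],
    symptom_ct=[2, 1],
    N=3,
)
print(toy_lift)
```

For the first pair, the observed joint count is 1 and the expected count is 2 * 2 / 3, giving a lift of 0.75.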






In [0]:
# TODO 1.7 Fill in the function below to calculate lift. You should return a 
# matrix of size (disease_ct, symptom_ct) containing the lift for each pair.
# What disease-symptom pair has the highest lift?

def calculate_lift(cooccur, disease_ct, symptom_ct, N):
  ## TODO: fill in
  return None

One drawback of hand-crafted disease-symptom models is that they are brittle. They don't necessarily transfer well to new populations or work well on subpopulations, in which disease prevalence and presentation may be different. However, this could be a chance to leverage EHR data computationally to allow for more personalized disease-symptom models.

For example, it has been found that men and women often display [different symptoms for heart attacks](https://www.heart.org/en/health-topics/heart-attack/warning-signs-of-a-heart-attack/heart-attack-symptoms-in-women). However, the "prototypical" heart attack symptoms are those men suffer from, and as a result, heart attacks in women are often incorrectly diagnosed.  Let us see if we can observe these differences in disease presentation in our data! 

First, we load the co-occurrence matrices we made separately for men and women.

In [0]:
male_cooccur, male_disease_ct, male_symptom_ct, male_N = pickle.load(open('./male_cooccurrence_info.p', 'rb')) 
female_cooccur, female_disease_ct, female_symptom_ct, female_N = pickle.load(open('./female_cooccurrence_info.p', 'rb')) 

In [0]:
# TODO: 1.8 Rerun your lift calculation on the male and female cohorts separately.
# Then, look at the lifts between each of the symptoms and heart attack
# (denoted Myocardial Infarction in the disease names list). Separately, for both 
# men and women, list the 5 symptoms with the highest lift. Discuss your findings 
# in 1-2 sentences, as they relate to the information linked above from the 
# American Heart Association.
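
Once a lift matrix exists for each cohort, ranking the symptoms for one disease is a matter of sorting a single row. The sketch below uses a hypothetical helper `top_symptoms_for_disease` and made-up lift values with placeholder symptom names; it does not report any result from the actual data.

```python
import numpy as np

def top_symptoms_for_disease(lift, disease_names, symptom_names, disease, k=5):
    """Return up to k (symptom, lift) pairs with the highest lift for `disease`."""
    row = np.asarray(lift, dtype=float)[list(disease_names).index(disease)]
    # Ignore symptoms whose lift is undefined (NaN), then sort descending.
    valid = np.flatnonzero(~np.isnan(row))
    order = valid[np.argsort(row[valid])[::-1][:k]]
    return [(symptom_names[j], float(row[j])) for j in order]

# Made-up lift rows for illustration only.
toy_diseases = ["Myocardial Infarction"]
toy_symptoms = ["Symptom A", "Symptom B", "Symptom C"]
toy_male_lift = np.array([[0.5, 2.0, 1.5]])
toy_female_lift = np.array([[1.8, 0.9, np.nan]])

male_top = top_symptoms_for_disease(
    toy_male_lift, toy_diseases, toy_symptoms, "Myocardial Infarction", k=2)
female_top = top_symptoms_for_disease(
    toy_female_lift, toy_diseases, toy_symptoms, "Myocardial Infarction", k=2)
print(male_top)
print(female_top)
```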