# QuickUMLS and cui2vec

This notebook uses [QuickUMLS](https://github.com/Georgetown-IR-Lab/QuickUMLS) together with the [pre-trained CUI embeddings](https://figshare.com/s/00d69861786cd0156d81) (cui2vec) to build an embedding that represents a clinical note.

## Imports and Inits

In [1]:
import pandas as pd
import numpy as np
import pickle

from collections import OrderedDict
from pathlib import Path

from QuickUMLS.quickumls import QuickUMLS

In [2]:
PATH = Path('data')
QUMLS_PATH = Path('/storage/UMLS-Storage/quickumls')

## Functions

In [3]:
def total_len(l):
    return sum([len(sl) for sl in l])

`QuickUMLS` returns a **list of lists of dicts**:
1. The outer list holds all the concepts that were found in the note
2. Each inner list contains all the *variations* of a single concept that QuickUMLS found
3. Each dictionary contains information about the concept itself with the following keys:
    * start: starting position of the concept in the note
    * end: ending position of the concept in the note
    * ngram: the group of words that were used to identify the concept. Its length in words is <= the `window` param of the matcher. **All the dictionaries within an inner list have the same ngram**.
    * term: all the variations within the ngram words that are individual medical concepts with their own CUIs. For example, with ngram "type 2 diabetes" we get terms "type 2 diabetes" and "type diabetes", each with its own CUI. **The inner list contains dicts with different term keys**.
    * cui: the CUI of the concept
    * similarity: similarity measure based on the metric of the matcher (jaccard, dice, cosine). This value is >= the `threshold` param of the matcher
    * semtypes: the UMLS semantic types associated with the concept
    * preferred: whether the matched term is the preferred name for the concept
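
As a rough illustration of this nesting, the sketch below walks a hand-built stand-in for the matcher output and groups CUIs by ngram. The `toy_matches` data and the `cuis_by_ngram` helper are hypothetical, and the CUIs are placeholders rather than real UMLS lookups.

```python
# Hand-built stand-in for QuickUMLS output. The CUIs and similarity scores
# are placeholders for illustration, not real UMLS lookups.
toy_matches = [
    [  # one inner list per matched ngram
        {'start': 10, 'end': 25, 'ngram': 'type 2 diabetes', 'term': 'type 2 diabetes',
         'cui': 'C0000001', 'similarity': 1.0, 'semtypes': {'T047'}, 'preferred': 1},
        {'start': 10, 'end': 25, 'ngram': 'type 2 diabetes', 'term': 'type diabetes',
         'cui': 'C0000002', 'similarity': 0.9, 'semtypes': {'T047'}, 'preferred': 0},
    ],
    [
        {'start': 40, 'end': 47, 'ngram': 'dyspnea', 'term': 'dyspnea',
         'cui': 'C0000003', 'similarity': 1.0, 'semtypes': {'T184'}, 'preferred': 1},
    ],
]


def cuis_by_ngram(matches):
    """Map each ngram to the CUIs of its candidate terms."""
    out = {}
    for candidates in matches:
        ngram = candidates[0]['ngram']  # shared by every dict in the inner list
        out.setdefault(ngram, []).extend(c['cui'] for c in candidates)
    return out


grouped = cuis_by_ngram(toy_matches)
print(grouped)
```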

The pre-trained CUI vectors are of size 500. The basic idea is to represent each note with a 500-dimension vector that captures the unique medical concepts pertaining to that note by combining these pre-trained vectors. Not all the CUIs returned by the matcher are found in the pre-trained vector list; those that are not found are discarded. This is how the combination is done:
1. First the matcher is defined and the note is run through the matcher
2. If no matches are found, return a 500-dimension zero vector
3. Initialize two dictionaries:
    1. $ngram\_counts$, to hold how many times each ngram was matched (for the final weighted averaging)
    2. $ngram\_avgvecs$, to hold the average CUI vector of a given ngram concept
4. Iterate over the matches (the outer list) and grab the ngram
5. If it is already present in $ngram\_avgvecs$, we have seen it before, so just update the count and continue
6. If not, initialize a variable $n\_valid\_cuis$ to hold the number of valid CUIs (i.e., CUIs for which a pre-trained vector is available) and a 500-dimension zero vector $cui\_avgvec$ to hold the average
7. Iterate over all the matched CUIs (the inner list) and check whether the current CUI is valid
8. If it is, increment $n\_valid\_cuis$, grab the pre-trained vector for the current CUI, and add it to $cui\_avgvec$
9. If there are valid CUIs, set the count of the ngram to 1 and compute the average by dividing the sum from the previous step by $n\_valid\_cuis$
10. Convert both the counts and the average vectors into numpy arrays
11. Use numpy's `average` function to compute the weighted average over the concepts, whereby repeated concepts receive a higher weight (their count)
12. Return the resulting 500-dimension vector, which represents the note
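
The final weighting step can be checked by hand on a small case. The sketch below uses made-up 3-dimensional vectors (the real vectors have 500 dimensions) for two ngrams, one seen twice and one seen once.

```python
import numpy as np

# Two ngrams with made-up 3-dimensional vectors; values are illustrative only.
ngram_avgvecs = {
    'myocardial infarction': np.array([1.0, 0.0, 2.0]),
    'dyspnea': np.array([0.0, 3.0, 2.0]),
}
ngram_counts = {'myocardial infarction': 2, 'dyspnea': 1}

weights = np.array(list(ngram_counts.values()))
vectors = np.array(list(ngram_avgvecs.values()))

# Weighted mean over ngrams: (2 * v1 + 1 * v2) / (2 + 1)
note_vec = np.average(vectors, axis=0, weights=weights)
print(note_vec)
```

The ngram that was matched twice pulls the note vector twice as hard as the one matched once.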

Limitations:
1. Negations are not taken into account. A concept that is repeated many times but is negated still receives a higher weight, which could lead to problems
2. Abbreviations of concepts are not handled
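
The negation issue noted above could be reduced by dropping negated matches before they are averaged. The sketch below is only one possible approach and is not part of this notebook's pipeline: it assumes some negation detector has already produced character spans of negated text (`negated_spans` is hypothetical input), and `drop_negated` and `toy_match` are hypothetical helpers.

```python
def drop_negated(matches, negated_spans):
    """Remove match groups whose character range overlaps any negated span."""
    kept = []
    for candidates in matches:
        # every dict in an inner list shares the same ngram, hence the same span
        start, end = candidates[0]['start'], candidates[0]['end']
        overlaps = any(start < n_end and n_start < end
                       for n_start, n_end in negated_spans)
        if not overlaps:
            kept.append(candidates)
    return kept


note = "no chest pain but reports dyspnea"


def toy_match(ngram):
    """Build a one-element match group for `ngram`; the CUI is a placeholder."""
    start = note.find(ngram)
    return [{'start': start, 'end': start + len(ngram),
             'ngram': ngram, 'cui': 'C0000000'}]


toy_matches = [toy_match('chest pain'), toy_match('dyspnea')]
negated_spans = [(0, 13)]  # assumed detector output covering "no chest pain"

kept = drop_negated(toy_matches, negated_spans)
print([group[0]['ngram'] for group in kept])
```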

In [4]:
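def cui_repr(matches):
    """Return (ngram_counts, note_vector) for the output of `QuickUMLS.match`."""
    ngram_counts = OrderedDict()
    ngram_avgvecs = OrderedDict()

    for cs in matches:
        # every dict in an inner list shares the same ngram
        ngram = cs[0]['ngram']
        if ngram in ngram_avgvecs:
            # seen before: only bump its weight
            ngram_counts[ngram] += 1
            continue
        n_valid_cuis = 0
        cui_avgvec = np.zeros(cui_emb_sz)
        for c in cs:
            # skip CUIs that have no pre-trained vector
            if c['cui'] in cui2idx:
                n_valid_cuis += 1
                cui_avgvec += pretrained_cuis[cui2idx[c['cui']]]
        if n_valid_cuis != 0:
            ngram_counts[ngram] = 1
            ngram_avgvecs[ngram] = cui_avgvec / n_valid_cuis

    # No matches, or no matched CUI has a pre-trained vector: return a zero
    # vector, keeping the same (counts, vector) return shape as below
    if not ngram_avgvecs:
        return ngram_counts, np.zeros(cui_emb_sz)

    weights = np.array(list(ngram_counts.values()))
    vectors = np.array(list(ngram_avgvecs.values()))

    return ngram_counts, np.average(vectors, axis=0, weights=weights)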

## Load and save pre-trained CUI Vectors

In [None]:
cui_vecs_df = pd.read_csv('/storage/pre-trained/cui2vec_pretrained.csv')
cui_vecs_df.head()

In [None]:
idx2cui = cui_vecs_df['cui'].to_dict()
pretrained_cuis = cui_vecs_df.loc[:, cui_vecs_df.columns != 'cui'].values
cui2idx = {c: i for i, c in idx2cui.items()}

In [None]:
cui_vecs_df.sample()

In [None]:
cui2idx['C2895975']

In [None]:
with open(PATH/'idx2cui.pkl', 'wb') as f:
    pickle.dump(idx2cui, f)
np.save(PATH/'pretrained_cuis.npy', pretrained_cuis)
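
To make the two index maps concrete, here is a small sketch on a toy frame that mimics the layout assumed above (a `cui` column followed by embedding columns). The CUIs, column names `V1`/`V2`, and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the pre-trained CSV: a 'cui' column plus embedding columns.
toy_df = pd.DataFrame({
    'cui': ['C0000001', 'C0000002', 'C0000003'],
    'V1': [0.1, 0.2, 0.3],
    'V2': [1.0, 2.0, 3.0],
})

idx2cui = toy_df['cui'].to_dict()              # row index -> CUI
cui2idx = {c: i for i, c in idx2cui.items()}   # CUI -> row index
vectors = toy_df.loc[:, toy_df.columns != 'cui'].values

# Looking up a CUI's vector goes through cui2idx
vec = vectors[cui2idx['C0000002']]
print(vec)
```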

## Extract CUIs

In [5]:
with open(PATH/'idx2cui.pkl', 'rb') as f:
    idx2cui = pickle.load(f)
cui2idx = {c: i for i, c in idx2cui.items()}

pretrained_cuis = np.load(PATH/'pretrained_cuis.npy')
cui_emb_sz = pretrained_cuis.shape[1]

In [6]:
if 'qu' in globals():
    del qu
qu = QuickUMLS(QUMLS_PATH, threshold=0.85, similarity_name='jaccard', window=5, overlapping_criteria='score')
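
For intuition about what `threshold=0.85` with `similarity_name='jaccard'` means, the sketch below computes Jaccard similarity over character trigrams. This is an assumption-laden illustration: QuickUMLS delegates approximate matching to Simstring, whose exact feature extraction may differ, and `char_ngrams` and `jaccard` are hypothetical helpers, not part of the QuickUMLS API.

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams of `text` (the whole string if shorter than n)."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b, n=3):
    """Jaccard similarity |X & Y| / |X | Y| over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return len(x & y) / len(x | y)


same = jaccard('dyspnea', 'Dyspnea')      # identical after lowercasing
variant = jaccard('dyspnea', 'dyspnoea')  # spelling variant shares only 3 of 8 trigrams
print(same, variant)
```

Under this simplified measure, a spelling variant such as "dyspnoea" would fall below a 0.85 threshold, which shows how strict that setting is.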

In [7]:
test_note = '''
Joseph R. Smith

1234567-8

4/5/2006

REASON FOR VISIT:Here to get a new primary care physician.

HPI: Mr. Smith is a 56-year-old gentleman formally followed at Carolina Premier who presents to obtain a new primary care physician secondary to insurance changes. He has a past medical history significant for a myocardial infarction in 1994. His cholesterol has been fine. Catheterization showed a possible "kink" in one of his vessels and it was thought that he had a possible "eddy" of current which led to a clot. He has been on Coumadin since then as well as a calcium channel blocker with the thought that there may have been a superimposed spasm. He has had several unremarkable stress tests since then. He works out on a Nordic-Track three times a week without any chest pain or shortness of breath. He has a history of possible peptic ulcer disease in 1981. He was treated with H2 blockers and his symptoms resolved. He has never had any bleeding to his knowledge. He had a hernia repair bilaterally in 1989 and surgery for a right knee cyst in 1999, all of which went well. He also has about a year long history of right buttock pain. This happens only when he is sitting for some time and does not change position. It does not happen when he is walking or exercising. He wonders if it might be pyriformis syndrome. If he changes positions frequently or stretches his legs, this seems to help. He has no acute complaints today and is here to get plugged into the system. He does wonder if there is anything else that can be done about his buttock pain.

PAST MEDICAL HISTORY: As above. In addition, he had a flexible sigmoidoscopy in 2003 which was okay.

MEDICATIONS: Tylenol p.r.n., baby aspirin p.o. q.d., Coumadin 3.5 mg p.o. q.d., Adalat Time Released tablets 30 mg one p.o. q.d., multivitamin, glucosamine and chondroitin sulfate.

ALLERGIES: No Known Drug Allergies

FAMILY HISTORY: His mother had diabetes developed at age 55 and coronary artery disease in her mid sixties. His father had CAD as well but not until his 70s. A paternal aunt had breast cancer. There is no history of colon or prostate cancer.

SOCIAL HISTORY: He lives in Durham with his wife and mother. He works for a biotech company. He does not smoke. He drinks two beers per night and reports no trouble with alcohol in the past. No history of drug use.

REVIEW OF SYSTEMS: As per his personal health summary and is significant only for his buttock pain as listed above, but is otherwise essentially unremarkable.

PHYSICAL EXAM:

VITAL SIGNS: Weight 175.2 pounds which is 79.3 kg, blood pressure 142/96 by the nurse, 140/92 by me, pulse is 64.

GENERAL: A healthy-appearing middle-aged gentleman.

HEENT: Pupils equal, round, reactive to light. Conjunctivaepink. Sclerae anicteric. Tympanic membranes clear. Oropharynx clear.

NECK: No lymphadenopathy or thyromegaly or JVD.

LUNGS: Clear to auscultation and percussion.

HEART: Regular rate and rhythm without murmur, rub, or gallop.

ABDOMEN: Normal bowel sounds. Soft, nontender. No hepatosplenomegaly.

EXTREMITIES: No cyanosis, clubbing, or edema. 2+ peripheral pulses.

NEUROLOGIC: Motor and sensation grossly intact.

PSYCHIATRIC: Normal affect and behavior.

DERMATOLOGIC: He has a small whitish papule at the upper borderof his mustache on the left.

MUSCULOSKELETAL: Full range of motion of his legs and hips bilaterally with no tenderness to palpation over his buttocks.

ASSESSMENT/PLAN:

Status post myocardial infarction. No evidence for actual CAD. He will continue his aspirin and Coumadin. Unclear to exactly how long he should be on Coumadin or if this is really needed. At some point, may discuss this with cardiology. We will check an INR today and get him plugged into our Coumadin clinic. Check cholesterol.
High blood pressure. He will continue with diet and exercise changes. At next visit, may go up on his calcium channel blocker versus add another agent. Check creatinine and potassium today.
Buttock pain. Will try some physical therapy.
Will send him to dermatology for the papule above his mustache.
Possible h/o peptic ulcer disease. As he has not had any known GI bleed, do not feel a need to check H. pylori serology today.
Health maintenance. Tetanus shot today. Flex sig as above. Will discuss prostate cancer screening next visit.
Return to clinic in four months.
John Student, MS3

Seen with Joe Doctor, MD
'''

In [8]:
%%time
matches = qu.match(test_note, best_match=True, ignore_syntax=False)
l = len(matches)
tl = total_len(matches)
if l:
    al = tl/l
else:
    al = 0
print(l, tl, al)

146 200 1.36986301369863
CPU times: user 1.14 s, sys: 92.4 ms, total: 1.24 s
Wall time: 341 ms


In [9]:
counts, note_vector = cui_repr(matches)

In [11]:
counts

OrderedDict([('calcium channel blocker', 2),
             ('coronary artery disease', 1),
             ('flexible sigmoidoscopy', 1),
             ('myocardial infarction', 2),
             ('past medical history', 1),
             ('peptic ulcer disease', 2),
             ('PAST MEDICAL HISTORY', 1),
             ('shortness of breath', 1),
             ('chondroitin sulfate', 1),
             ('Normal bowel sounds', 1),
             ('High blood pressure', 1),
             ('hepatosplenomegaly', 1),
             ('REVIEW OF SYSTEMS', 1),
             ('REASON FOR VISIT', 1),
             ('physical therapy', 1),
             ('Catheterization', 1),
             ('change position', 1),
             ('prostate cancer', 1),
             ('lymphadenopathy', 1),
             ('range of motion', 1),
             ('Drug Allergies', 1),
             ('FAMILY HISTORY', 1),
             ('blood pressure', 1),
             ('hernia repair', 1),
             ('breast cancer', 1),
             ('

## Grab sample data from MIMIC

In [11]:
import psycopg2

cats = pd.read_csv('cats.csv')
max_limit = 5

queries = []
for category, n_notes in zip(cats['category'], cats['number_of_notes']):
    limit = min(max_limit, n_notes) if max_limit > 0 else n_notes
    if limit == max_limit:
        q = f"""
        select category, text from correctnotes where category=\'{category}\' order by random() limit {limit};
        """
    else:
        q = f"""
        select category, text from correctnotes where category=\'{category}\';
        """
    queries.append(q)

dfs = []

con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')
for q in queries:
    df = pd.read_sql_query(q, con)
    dfs.append(df)
con.close()
    
df = pd.concat(dfs)
# df.set_index('row_id', inplace=True)
df.shape

(70, 2)
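
The queries above interpolate the category name directly into the SQL string. A possible alternative, sketched here under the assumption that the same `correctnotes` table and sampling rule are wanted, is to build parameterized queries and let the driver handle quoting. `build_query` is a hypothetical helper and the category names in the example are only illustrative.

```python
def build_query(category, n_notes, max_limit=5):
    """Return (sql, params) that samples up to `max_limit` notes of one category.

    Mirrors the rule above: sample randomly only when the category has at
    least `max_limit` notes and `max_limit` is positive, else take them all.
    """
    sql = "select category, text from correctnotes where category = %s"
    params = [category]
    if 0 < max_limit <= n_notes:
        sql += " order by random() limit %s"
        params.append(max_limit)
    return sql + ";", tuple(params)


sampled = build_query('Nursing', 100)
everything = build_query('Consult', 3)
print(sampled)
print(everything)

# With an open psycopg2 connection `con`, each pair could then be run as:
#   pd.read_sql_query(sql, con, params=params)
```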

In [55]:
test_note = df['text'].iloc[3]
print(test_note)

TITLE:  Case Management Initial Assessment
   The patient is a 24 yo female with h/o SLE, ESRD on HD, malignant HTN,
   h/o SVC syndrome, PRES, prior ICH, and recent SBO, initially admitted
   with dyspnea, and hypertensive urgency and called out to medicine, now
   readmitted to the ICU with stridor and dyspnea.
   The patient is well known to this NCM from multiple previous [**Hospital1 54**]
   admissions.  She is known to several home care agencies, most recently
   receiving services from the VNA of [**Location (un) 223**].   The patient also is
   scheduled for hemodialysis T-Th-Sat at [**Location (un) **], though she does not
   always keep her scheduled appointments.  Transportation has long been
   arranged for these dialysis appointments and is not a barrier to the
   patient attending them.
   This NCM had a long conversation with [**First Name8 (NamePattern2) **] [**Doctor Last Name **], the liaison nurse from
   the VNA of [**Location (un) 223**] (VNAB) who knows the patie

In [56]:
%%time
matches = qu.match(test_note, best_match=True, ignore_syntax=False)
l = len(matches)
tl = total_len(matches)
if l:
    al = tl/l
else:
    al = 0
print(l, tl, al)

49 89 1.816326530612245
CPU times: user 923 ms, sys: 56.1 ms, total: 980 ms
Wall time: 344 ms


In [57]:
counts, note_vector = cui_repr(matches)
OrderedDict(sorted(counts.items(), key=lambda x: x[1], reverse=True))

OrderedDict([('Last Name', 2),
             ('dyspnea', 2),
             ('assist', 2),
             ('hypertensive urgency', 1),
             ('discharge planning', 1),
             ('Case Management', 1),
             ('stabilization', 1),
             ('SVC syndrome', 1),
             ('hemodialysis', 1),
             ('Assessment', 1),
             ('admissions', 1),
             ('Identifier', 1),
             ('condition', 1),
             ('discharge', 1),
             ('admitted', 1),
             ('dialysis', 1),
             ('stridor', 1),
             ('likely', 1),
             ('ESRD', 1),
             ('able', 1),
             ('HTN', 1),
             ('discharged', 1)])

## pyContext

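To use pyConText, domain knowledge has to be encoded in the modifiers and targets. The cells below load the example knowledge bases from the pyConTextNLP repository and mark up the test note with them.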

In [None]:
import networkx as nx
from pyConTextNLP.pyConTextNLP import pyConText
from pyConTextNLP.pyConTextNLP import itemData

In [3]:
modifiers = itemData.get_items(
    "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_05042016.yml")
targets = itemData.get_items(
    "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/utah_crit.yml")

In [4]:
markup = pyConText.ConTextMarkup()

In [7]:
markup.setRawText(test_note.lower())
markup.cleanText()
markup.markItems(modifiers, mode='modifier')
markup.markItems(targets, mode='target')

In [None]:
for node in markup.nodes(data=True):
    print(node)