# Cui2Vec on MIMIC Notes

This project extracts the CUIs (UMLS Concept Unique Identifiers) of the concepts identified in each MIMIC note using [QuickUMLS](https://github.com/Georgetown-IR-Lab/QuickUMLS), then uses the [pre-trained CUI embeddings](https://figshare.com/s/00d69861786cd0156d81) to build an embedding that represents the note. The goal is to do this for all MIMIC notes.

## Imports and Inits

In [1]:
import pandas as pd
import numpy as np
import pickle
import psycopg2

from pathlib import Path

from QuickUMLS.quickumls import QuickUMLS

In [2]:
PATH = Path('data')
QUMLS_PATH = Path('/storage/UMLS-Storage/quickumls')

## Functions

In [7]:
def total_len(l):
    # Total number of items across all sub-lists (e.g. candidates across matched spans)
    return sum(len(sl) for sl in l)

## Load and save pre-trained CUI Vectors

In [None]:
cui_vecs_df = pd.read_csv('/storage/pre-trained/cui2vec_pretrained.csv')

In [None]:
idx2cui = cui_vecs_df['cui'].to_dict()
cui_vecs = cui_vecs_df.loc[:, cui_vecs_df.columns != 'cui'].values

with open(PATH/'idx2cui.pkl', 'wb') as f:
    pickle.dump(idx2cui, f)
np.save(PATH/'cui-vecs.npy', cui_vecs)

## Extract CUIs

In [3]:
with open(PATH/'idx2cui.pkl', 'rb') as f:
    idx2cui = pickle.load(f)
cui2idx = {c: i for i, c in idx2cui.items()}

cui_vecs = np.load(PATH/'cui-vecs.npy')
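
With a CUI-to-row lookup and the vector matrix in hand, one plausible way to turn the CUIs of a note into a single note embedding is mean pooling. The sketch below is only an assumption about how the vectors could be combined; the notebook does not state the pooling method. The function name `note_embedding`, the zero-vector fallback, and the toy vectors are hypothetical and used for illustration only.

```python
import numpy as np

def note_embedding(cuis, cui2idx, cui_vecs):
    """Mean-pool the vectors of the CUIs found in one note (assumed pooling)."""
    # Keep only CUIs that have a pre-trained vector
    rows = [cui2idx[c] for c in cuis if c in cui2idx]
    if not rows:
        # Assumption: represent a note with no known CUI as a zero vector
        return np.zeros(cui_vecs.shape[1])
    return cui_vecs[rows].mean(axis=0)

# Toy stand-ins for the real lookup table and embedding matrix
toy_cui2idx = {'C0011849': 0, 'C0011847': 1}
toy_vecs = np.array([[1.0, 2.0], [3.0, 4.0]])

# 'C0000000' has no vector in the toy table and is skipped
emb = note_embedding(['C0011849', 'C0011847', 'C0000000'], toy_cui2idx, toy_vecs)
empty = note_embedding(['C0000000'], toy_cui2idx, toy_vecs)
```

Mean pooling ignores how often a concept occurs and whether it is negated; weighting or filtering could be layered on top if that matters.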

In [4]:
qu = QuickUMLS(QUMLS_PATH, threshold=0.8, similarity_name='cosine')

In [118]:
test_note = "patient has no pneumothorax"

## Checkpoint

In [124]:
matches = qu.match(test_note, best_match=True, ignore_syntax=False)
l = len(matches)
tl = total_len(matches)
if l:
    al = tl/l
else:
    al = 0
l, tl, al

(2, 11, 5.5)

In [125]:
cuis = []
terms = {}
for cs in matches:
    for c in cs:
        print(c['ngram'])
        break

pneumothorax
patient


In [90]:
matches[0]

[{'start': 3,
  'end': 11,
  'ngram': 'diabetes',
  'term': 'diabetes',
  'cui': 'C0011849',
  'similarity': 1.0,
  'semtypes': {'T047'},
  'preferred': 1},
 {'start': 3,
  'end': 11,
  'ngram': 'diabetes',
  'term': 'Diabetes',
  'cui': 'C0011847',
  'similarity': 0.8333333333333334,
  'semtypes': {'T047'},
  'preferred': 1},
 {'start': 3,
  'end': 11,
  'ngram': 'diabetes',
  'term': 'Diabetes',
  'cui': 'C0011849',
  'similarity': 0.8333333333333334,
  'semtypes': {'T047'},
  'preferred': 1},
 {'start': 3,
  'end': 11,
  'ngram': 'diabetes',
  'term': 'prediabetes',
  'cui': 'C0271650',
  'similarity': 0.816496580927726,
  'semtypes': {'T047'},
  'preferred': 1},
 {'start': 3,
  'end': 11,
  'ngram': 'diabetes',
  'term': 'prediabetes',
  'cui': 'C0362046',
  'similarity': 0.816496580927726,
  'semtypes': {'T047'},
  'preferred': 1}]
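
Each element of `matches` is a list of candidate concepts for one matched span, as the output above shows. A minimal sketch of reducing that structure to one CUI per span, assuming the candidate with the highest `similarity` is the one to keep. The helper name `best_cuis` and the toy input are hypothetical; only the dictionary keys are taken from the output above.

```python
def best_cuis(matches):
    """Return one CUI per matched span, without duplicates, in order of appearance."""
    cuis = []
    for candidates in matches:
        if not candidates:
            continue
        # Assumption: the highest-similarity candidate is the best one
        best = max(candidates, key=lambda c: c['similarity'])
        if best['cui'] not in cuis:
            cuis.append(best['cui'])
    return cuis

# Toy input shaped like the QuickUMLS output shown above
toy_matches = [
    [{'ngram': 'diabetes', 'cui': 'C0011849', 'similarity': 1.0},
     {'ngram': 'diabetes', 'cui': 'C0011847', 'similarity': 0.83}],
    [{'ngram': 'prediabetes', 'cui': 'C0271650', 'similarity': 0.82}],
]

selected = best_cuis(toy_matches)
```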

In [None]:
test_note = '''
Joseph R. Smith

1234567-8

4/5/2006

REASON FOR VISIT:Here to get a new primary care physician.

HPI: Mr. Smith is a 56-year-old gentleman formally followed at Carolina Premier who presents to obtain a new primary care physician secondary to insurance changes. He has a past medical history significant for a myocardial infarction in 1994. His cholesterol has been fine. Catheterization showed a possible "kink" in one of his vessels and it was thought that he had a possible "eddy" of current which led to a clot. He has been on Coumadin since then as well as a calcium channel blocker with the thought that there may have been a superimposed spasm. He has had several unremarkable stress tests since then. He works out on a Nordic-Track three times a week without any chest pain or shortness of breath. He has a history of possible peptic ulcer disease in 1981. He was treated with H2 blockers and his symptoms resolved. He has never had any bleeding to his knowledge. He had a hernia repair bilaterally in 1989 and surgery for a right knee cyst in 1999, all of which went well. He also has about a year long history of right buttock pain. This happens only when he is sitting for some time and does not change position. It does not happen when he is walking or exercising. He wonders if it might be pyriformis syndrome. If he changes positions frequently or stretches his legs, this seems to help. He has no acute complaints today and is here to get plugged into the system. He does wonder if there is anything else that can be done about his buttock pain.

PAST MEDICAL HISTORY: As above. In addition, he had a flexible sigmoidoscopy in 2003 which was okay.

MEDICATIONS: Tylenol p.r.n., baby aspirin p.o. q.d., Coumadin 3.5 mg p.o. q.d., Adalat Time Released tablets 30 mg one p.o. q.d., multivitamin, glucosamine and chondroitin sulfate.

ALLERGIES: No Known Drug Allergies

FAMILY HISTORY: His mother had diabetes developed at age 55 and coronary artery disease in her mid sixties. His father had CAD as well but not until his 70s. A paternal aunt had breast cancer. There is no history of colon or prostate cancer.

SOCIAL HISTORY: He lives in Durham with his wife and mother. He works for a biotech company. He does not smoke. He drinks two beers per night and reports no trouble with alcohol in the past. No history of drug use.

REVIEW OF SYSTEMS: As per his personal health summary and is significant only for his buttock pain as listed above, but is otherwise essentially unremarkable.

PHYSICAL EXAM:

VITAL SIGNS: Weight 175.2 pounds which is 79.3 kg, blood pressure 142/96 by the nurse, 140/92 by me, pulse is 64.

GENERAL: A healthy-appearing middle-aged gentleman.

HEENT: Pupils equal, round, reactive to light. Conjunctivaepink. Sclerae anicteric. Tympanic membranes clear. Oropharynx clear.

NECK: No lymphadenopathy or thyromegaly or JVD.

LUNGS: Clear to auscultation and percussion.

HEART: Regular rate and rhythm without murmur, rub, or gallop.

ABDOMEN: Normal bowel sounds. Soft, nontender. No hepatosplenomegaly.

EXTREMITIES: No cyanosis, clubbing, or edema. 2+ peripheral pulses.

NEUROLOGIC: Motor and sensation grossly intact.

PSYCHIATRIC: Normal affect and behavior.

DERMATOLOGIC: He has a small whitish papule at the upper borderof his mustache on the left.

MUSCULOSKELETAL: Full range of motion of his legs and hips bilaterally with no tenderness to palpation over his buttocks.

ASSESSMENT/PLAN:

Status post myocardial infarction. No evidence for actual CAD. He will continue his aspirin and Coumadin. Unclear to exactly how long he should be on Coumadin or if this is really needed. At some point, may discuss this with cardiology. We will check an INR today and get him plugged into our Coumadin clinic. Check cholesterol.
High blood pressure. He will continue with diet and exercise changes. At next visit, may go up on his calcium channel blocker versus add another agent. Check creatinine and potassium today.
Buttock pain. Will try some physical therapy.
Will send him to dermatology for the papule above his mustache.
Possible h/o peptic ulcer disease. As he has not had any known GI bleed, do not feel a need to check H. pylori serology today.
Health maintenance. Tetanus shot today. Flex sig as above. Will discuss prostate cancer screening next visit.
Return to clinic in four months.
John Student, MS3

Seen with Joe Doctor, MD
'''

## Grab sample data from MIMIC

In [None]:
cats = pd.read_csv('cats.csv')
max_limit = 5

queries = []
for category, n_notes in zip(cats['category'], cats['number_of_notes']):
    # Sample at most max_limit notes per category; max_limit <= 0 means take all notes
    limit = min(max_limit, n_notes) if max_limit > 0 else n_notes
    if limit == max_limit:
        q = f"""
        select category, text from correctnotes where category='{category}' order by random() limit {limit};
        """
    else:
        q = f"""
        select category, text from correctnotes where category='{category}';
        """
    queries.append(q)

dfs = []

con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')
try:
    for q in queries:
        df = pd.read_sql_query(q, con)
        dfs.append(df)
finally:
    # Close the connection even if a query fails
    con.close()
    
df = pd.concat(dfs)
# df.set_index('row_id', inplace=True)
df.shape

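## pyConText

To use pyConText, we need to encode domain knowledge in its modifiers and targets. The modifiers and targets used below are loaded from the knowledge-base files published in the pyConTextNLP repository.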

In [119]:
import pyConTextNLP.pyConText as pyConText
import pyConTextNLP.itemData as itemData
import networkx as nx

In [120]:
modifiers = itemData.get_items(
    "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_05042016.yml")
targets = itemData.get_items(
    "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/utah_crit.yml")

In [121]:
markup = pyConText.ConTextMarkup()

In [122]:
markup.setRawText(test_note.lower())
markup.cleanText()
markup.markItems(modifiers, mode='modifier')
markup.markItems(targets, mode='target')

In [123]:
for node in markup.nodes(data=True):
    print(node)

(<id> 13099433910707665954603210464753359458 </id> <phrase> no </phrase> <category> ['definite_negated_existence'] </category> , {'category': 'modifier'})
(<id> 13100669077761263335626293814939097698 </id> <phrase> pneumothorax </phrase> <category> ['pneumothorax'] </category> , {'category': 'target'})
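
The output above shows a negation modifier ("no") and a target ("pneumothorax") in the same sentence. A toy sketch of how such a modifier could be used to drop negated concepts before building a note embedding is given below. It is not the pyConTextNLP API and not the ConText algorithm itself: the character-window rule, the `drop_negated` helper, and the hand-built spans are all assumptions made for illustration.

```python
def drop_negated(concepts, negation_spans, window=40):
    """Keep concepts that do not start within `window` characters after a negation cue."""
    kept = []
    for c in concepts:
        # Assumption: a cue negates any concept starting shortly after it
        negated = any(0 <= c['start'] - end <= window for _, end in negation_spans)
        if not negated:
            kept.append(c)
    return kept

text = "patient has no pneumothorax"

# Hand-built character spans, computed from the text
neg_start = text.find(" no ") + 1
negation_spans = [(neg_start, neg_start + len("no"))]
concepts = [
    {'ngram': 'patient', 'start': text.find('patient')},
    {'ngram': 'pneumothorax', 'start': text.find('pneumothorax')},
]

kept = [c['ngram'] for c in drop_negated(concepts, negation_spans)]
```

A fixed character window is crude: it ignores sentence boundaries and terminating phrases such as "but", which rule-based ConText implementations handle explicitly.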
