In [212]:
# Import libraries
import os
import re
import random
import pickle
import subprocess
import numpy as np
import pandas as pd
import datetime as dt

from tqdm import tqdm
from datetime import datetime
from collections import Counter

# 1. Setup concept extractors

We considered two options: [MetaMap](https://metamap.nlm.nih.gov/) and [spaCy](https://spacy.io/).

[MetaMap](https://metamap.nlm.nih.gov/) is specific to recognizing UMLS concepts. There is a [Python wrapper](https://github.com/AnthonyMRios/pymetamap), but it is known to be slow and cumbersome to use.

[spaCy](https://spacy.io/) is a popular NLP Python package with an extensive library for named entity recognition. It has a wide variety of [extensions](https://spacy.io/universe) and models to choose from. We're going with the following.

* [scispaCy](https://spacy.io/universe/project/scispacy) contains spaCy models for processing biomedical, scientific or clinical text. It seems easy to use and has a wide variety of concepts it can recognize, including UMLS, RxNorm, etc.

* [negspaCy](https://spacy.io/universe/project/negspacy) identifies negated entities using an extension of the regex-based NegEx algorithm. This is probably useful for cases like "this pt is diabetic" v. "this pt is not diabetic." [TODO: medspaCy's negation detection might be better, https://github.com/medspacy/medspacy]

* [Med7](https://github.com/kormilitzin/med7) is a model trained for recognizing entities in prescription text, e.g. identifies drug name, dosage, duration, etc., which could be useful stuff to check for conflicts. 

We're going with spaCy for this, and we will need a coherent way to integrate the entities picked up by these three extensions/models.
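To make the negation example concrete, here is a minimal NegEx-style sketch using only Python's `re` module. It is not negspaCy's implementation: the trigger list `NEGATION_TRIGGERS`, the helper `is_negated`, and the 5-token window are all assumptions made for illustration, and a real trigger list would be much longer.

```python
import re

# Hypothetical, deliberately tiny trigger list (assumption, not negspaCy's list)
NEGATION_TRIGGERS = ["no", "not", "denies", "without", "negative for"]

def is_negated(sentence, concept, window=5):
    """Return True if a negation trigger occurs within `window` tokens before `concept`."""
    text = sentence.lower()
    match = re.search(r"\b" + re.escape(concept.lower()) + r"\b", text)
    if match is None:
        return False  # concept not mentioned at all
    # Keep only the last `window` tokens that precede the concept mention
    preceding = re.findall(r"[a-z']+", text[:match.start()])[-window:]
    context = " ".join(preceding)
    return any(re.search(r"\b" + re.escape(trigger) + r"\b", context)
               for trigger in NEGATION_TRIGGERS)

print(is_negated("This pt is diabetic.", "diabetic"))      # False
print(is_negated("This pt is not diabetic.", "diabetic"))  # True
```

A sketch like this ignores scope terminators ("but", "however") and post-concept triggers ("was ruled out"), which is part of why a dedicated component is preferable.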

## i) Installations

In [2]:
import sys; sys.executable

'/opt/conda/envs/opennotes/bin/python'

In [3]:
import spacy
import scispacy

from pprint import pprint
from collections import OrderedDict

from spacy import displacy
# from scispacy.abbreviation import AbbreviationDetector # UMLS already contains abbrev. detect
from scispacy.umls_linking import UmlsEntityLinker

# should be 2.3.5 and >=0.3.0
spacy.__version__, scispacy.__version__

('2.3.5', '0.3.0')

## ii) Setting up the model

The model is used to form word/sentence embeddings for the NER task. Thus, it's important to choose a model that has been tuned for our specific use case (e.g. clinical text, prescription information) so the embeddings are useful for naming the entity.

[Note to self: if we have time remaining, look into using a custom model for the spaCy pipeline (could we do something with the Romanov models, since they've been trained specifically for conflict detection?) -- https://spacy.io/usage/v3]

### a) scispaCy

For scispaCy, we set up one of their models that has been trained on biomedical data. Other models can be found [here](https://allenai.github.io/scispacy/). 

We load the model twice because we will later attach a different entity linker (a component that links text mentions to named entities in a knowledge base) to each copy.

In [4]:
## uncomment to install model if not already installed
# !/opt/conda/envs/opennotes/bin/python -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz

In [5]:
# for umls (general biomedical concepts)
umls_nlp   = spacy.load("en_core_sci_sm")

# for rxnorm (prescriptions)
rxnorm_nlp = spacy.load("en_core_sci_sm")

### b) Med7

For Med7, we set up their model that has been trained specifically for NER of medication-related concepts: dosage, drug names, duration, form, frequency, route of administration, and strength. The model is trained on MIMIC-III, so it should work well for us.

In [6]:
# # installs Med7 model
# !pip install https://www.dropbox.com/s/xbgsy6tyctvrqz3/en_core_med7_lg.tar.gz?dl=1

In [7]:
med7_nlp = spacy.load("en_core_med7_lg")

## iii) Adding an entity linker

The EntityLinker is a spaCy pipeline component that links entity mentions to a knowledge base. The linker compares each mention with the concepts in the specified knowledge base (e.g. scispaCy's UMLS linker does a form of character overlap-based nearest neighbor search, with an option to resolve abbreviations first).

[Note: A mention generally resolves to a list of candidate entities. This [blog post](http://sujitpal.blogspot.com/2020/08/disambiguating-scispacy-umls-entities.html) describes one potential way to disambiguate by figuring out the "most likely" set of entities. We'll start off by resolving to the 1st candidate and see whether that's sufficient.]
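The "resolve to the 1st candidate" plan can be sketched as follows, assuming the candidates for a mention are available as `(concept_id, score)` pairs. The helper `top_candidate` and the placeholder IDs (`CUI_A`, etc.) are hypothetical and only illustrate the selection rule; the scores are made up.

```python
def top_candidate(candidates, threshold=0.7):
    """Return the highest-scoring (concept_id, score) pair at or above `threshold`, else None."""
    eligible = [c for c in candidates if c[1] >= threshold]
    if not eligible:
        return None  # nothing confident enough to link
    return max(eligible, key=lambda c: c[1])

# Made-up candidates for a single mention (placeholder IDs, not real CUIs)
candidates = [("CUI_A", 0.82), ("CUI_B", 0.97), ("CUI_C", 0.64)]
print(top_candidate(candidates))         # ('CUI_B', 0.97)
print(top_candidate([("CUI_C", 0.64)]))  # None
```

Taking only the top candidate is cheap but ignores context, which is the gap the disambiguation approach in the blog post tries to close.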

### a) scispaCy

#### UMLS Linker

The UMLS linker maps entities to UMLS concepts. The main parts we'll be interested in are the semantic type and the concept (mainly the common name; the CUI might become important later).

* _Semantic type_ is the broader category that the entity falls under, e.g. disease, pharmacologic substance, etc. See [this](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt) for a full list.

* _Concepts_ refer to the more fundamental entity itself, e.g. pneumothorax, ventilator, etc. Many concepts can fall under a semantic type.

More info on `UmlsEntityLinker` ([source code](https://github.com/allenai/scispacy/blob/4ade4ec897fa48c2ecf3187caa08a949920d126d/scispacy/linking.py#L9))

See source code for `.jsonl` file with the knowledge base.

In [8]:
from scispacy.umls_linking import UmlsEntityLinker

# abbreviation_pipe = AbbreviationDetector(nlp) # automatically included with UMLS linker
# nlp.add_pipe(abbreviation_pipe)
umls_linker = UmlsEntityLinker(k=10,                          # number of nearest neighbors to look up from
                               threshold=0.7,                 # confidence threshold to be added as candidate
                               max_entities_per_mention=1,    # number of entities returned per mention (todo: tune)
                               filter_for_definitions=False,  # no definition is OK
                               resolve_abbreviations=True)    # resolve abbreviations before linking
umls_nlp.add_pipe(umls_linker)



#### RxNorm Linker

The RxNorm linker maps entities to RxNorm, an ontology for clinical drug names. It contains about 100k concepts for normalized names of clinical drugs. It draws on several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.

More info on `RxNorm` ([NIH page](https://www.nlm.nih.gov/research/umls/rxnorm/index.html), [source code](https://github.com/allenai/scispacy/blob/2290a80cfe0948e48d8ecfbd60064019d57a6874/scispacy/linking_utils.py#L120))

See source code for `.jsonl` file with the knowledge base.

In [9]:
from scispacy.linking import EntityLinker

# rxnorm_linker = EntityLinker(resolve_abbreviations=True, name="rxnorm")
rxnorm_linker = EntityLinker(k=10,                          # number of nearest neighbors to look up from
                             threshold=0.7,                 # confidence threshold to be added as candidate
                             max_entities_per_mention=1,    # number of entities returned per mention (todo: tune)
                             filter_for_definitions=False,  # no definition is OK
                             resolve_abbreviations=True,    # resolve abbreviations before linking
                             name="rxnorm")                 # RxNorm ontology

rxnorm_nlp.add_pipe(rxnorm_linker)



### b) Med7 

No entity linker is needed for Med7.

### c) Negspacy [TODO]

# 2. Setup data structures

## Categorizing type of conflict

The first larger task is to categorize pairs by the type of conflict to check for, since our detection method will likely differ by type (at least for the rule-based approach). We wrote up a short list [here](https://docs.google.com/document/d/1fEBk0JHeyQWshYWW5w_VTkaYyRfm9MBxJ9DAGoVa8Yw/edit?usp=sharing).

To do this, we're using the semantic type that is identified by the UMLS linker. Here's a table of the semantic types we're filtering for, and which conflict they'll be used for.

Here's a [full list](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt) of semantic types. You can look up definitions of semantic types [here](http://linkedlifedata.com/resource/umls-semnetwork/T033).

| Conflict | Semantic Type |
| --- | ----------- |
| Diagnoses-related errors | Disease or Syndrome (T047), Diagnostic Procedure (T060) |
| Inaccurate description of medical history (symptoms) | Sign or Symptom (T184) |
| Inaccurate description of medical history (operations) | Therapeutic or Preventive Procedure (T061) |
| Inaccurate description of medical history (other) | [all of the above and below] |
| Medication or allergies | Clinical Drug (T200), Pharmacologic Substance (T121) |
| Test procedures or results | Laboratory Procedure (T059), Laboratory or Test Result (T034) | 


For clarity, the concepts we'll keep from the UMLS linker are anything falling into these semantic types (which we will then categorize by type of conflict using the table above):

* T047 - Disease or Syndrome
* T121 - Pharmacologic Substance
* T023 - Body Part, Organ, or Organ Component
* T061 - Therapeutic or Preventive Procedure 
* T060 - Diagnostic Procedure
* T059 - Laboratory Procedure
* T034 - Laboratory or Test Result 
* T184 - Sign or Symptom 
* T200 - Clinical Drug

We'll store this info into a dictionary now.

<!-- Some useful def's 
Finding - 
That which is discovered by direct observation or measurement of an organism attribute or condition, including the clinical history of the patient. The history of the presence of a disease is a 'Finding' and is distinguished from the disease itself.  -->

In [10]:
SEMANTIC_TYPES = ['T047', 'T121', 'T023', 'T061', 'T060', 'T059', 'T034', 'T184', 'T200']
SEMANTIC_NAMES = ['Disease or Syndrome', 'Pharmacologic Substance', 'Body Part, Organ, or Organ Component', \
                  'Therapeutic or Preventive Procedure', 'Diagnostic Procedure', 'Laboratory Procedure', \
                  'Laboratory or Test Result', 'Sign or Symptom', 'Clinical Drug']
SEMANTIC_TYPE_TO_NAME = dict(zip(SEMANTIC_TYPES, SEMANTIC_NAMES))

SEMANTIC_TYPE_TO_NAME

{'T047': 'Disease or Syndrome',
 'T121': 'Pharmacologic Substance',
 'T023': 'Body Part, Organ, or Organ Component',
 'T061': 'Therapeutic or Preventive Procedure',
 'T060': 'Diagnostic Procedure',
 'T059': 'Laboratory Procedure',
 'T034': 'Laboratory or Test Result',
 'T184': 'Sign or Symptom',
 'T200': 'Clinical Drug'}

In [11]:
CONFLICT_TO_SEMANTIC_TYPE = {
    "diagnosis": {'T047', 'T060'},
    "med_history_symptom": {'T184'},
    "med_history_operation": {'T061'},
    "med_history_other": set(SEMANTIC_TYPES),
    "med_allergy": {'T200', 'T121'},
    "test_results": {'T059', 'T034'}
}

CONFLICT_TO_SEMANTIC_TYPE

{'diagnosis': {'T047', 'T060'},
 'med_history_symptom': {'T184'},
 'med_history_operation': {'T061'},
 'med_history_other': {'T023',
  'T034',
  'T047',
  'T059',
  'T060',
  'T061',
  'T121',
  'T184',
  'T200'},
 'med_allergy': {'T121', 'T200'},
 'test_results': {'T034', 'T059'}}
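Since a semantic type can belong to more than one conflict category (everything is also in `med_history_other`), a reverse lookup is handy when labeling an extracted entity. The sketch below redefines the two structures from the cells above so it runs on its own; the helper name `conflicts_for` is hypothetical.

```python
SEMANTIC_TYPES = ['T047', 'T121', 'T023', 'T061', 'T060', 'T059', 'T034', 'T184', 'T200']

CONFLICT_TO_SEMANTIC_TYPE = {
    "diagnosis": {'T047', 'T060'},
    "med_history_symptom": {'T184'},
    "med_history_operation": {'T061'},
    "med_history_other": set(SEMANTIC_TYPES),
    "med_allergy": {'T200', 'T121'},
    "test_results": {'T059', 'T034'},
}

def conflicts_for(semantic_type):
    """List every conflict category whose semantic-type set contains `semantic_type`."""
    return [conflict for conflict, types in CONFLICT_TO_SEMANTIC_TYPE.items()
            if semantic_type in types]

print(conflicts_for("T121"))  # ['med_history_other', 'med_allergy']
print(conflicts_for("T023"))  # ['med_history_other']
```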

In [67]:
from data_structures import Patient,\
                            Note, PrescriptionOrders, LabResults,\
                            Sentence, Prescription, Lab

In [234]:
# from importlib import reload # python 2.7 does not require this
# import data_structures
# reload(data_structures)
# from data_structures import Patient,\
#                             Note, PrescriptionOrders, LabResults,\
#                             Sentence, Prescription, Lab

# 3. Load and process data

In [13]:
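# Load MIMIC tables
# Note: `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0; on newer pandas, use `on_bad_lines="skip"` instead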
notes_df  = pd.read_csv('NOTEEVENTS.csv.gz',    compression='gzip', error_bad_lines=False)
drug_df   = pd.read_csv('PRESCRIPTIONS.csv.gz', compression='gzip', error_bad_lines=False)
lab_df    = pd.read_csv('LABEVENTS.csv.gz',     compression='gzip', error_bad_lines=False)
d_lab_df  = pd.read_csv('D_LABITEMS.csv.gz',    compression='gzip', error_bad_lines=False)



#### Updated script for processing HADM IDs with consecutive physician notes (does not count the autosaves)

In [14]:
# Load HADM ID's with consecutive physician notes
if os.path.exists("hadm_ids.pkl"):
    with open("hadm_ids.pkl", "rb") as f:
        hadm_ids = pickle.load(f)
else:
    hadm_ids = []
    for hadm_id in tqdm(notes_df.HADM_ID.unique()):
        hadm_data = notes_df.loc[notes_df.HADM_ID == hadm_id]
        hadm_phys_notes = hadm_data.loc[hadm_data.CATEGORY == "Physician "]

        if len(hadm_phys_notes.CHARTTIME.unique()) > 1: # ensure > 1 unique notes (not counting autosave)
            hadm_ids.append(hadm_id)

    with open("hadm_ids.pkl", "wb") as f:
        pickle.dump(hadm_ids, f)
        
print(f"There are {len(hadm_ids)} patients with consecutive physician notes.")

There are 8158 patients with consecutive physician notes.
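The loop above re-filters the whole notes table once per admission, which is slow on a table the size of NOTEEVENTS. The same selection can be done in a single pass by collecting the distinct chart times per admission. Here is a pure-Python sketch of that logic on toy rows; the function name `admissions_with_multiple_notes` and the toy data are assumptions. In pandas, a `groupby("HADM_ID")` followed by `nunique()` on `CHARTTIME` would express the same idea.

```python
from collections import defaultdict

def admissions_with_multiple_notes(rows, category="Physician "):
    """rows: iterable of (hadm_id, category, charttime) tuples.

    Returns admissions with more than one distinct chart time in `category`,
    so autosaved copies (same chart time) are not counted twice.
    """
    times_by_admission = defaultdict(set)
    for hadm_id, cat, charttime in rows:
        if cat == category:
            times_by_admission[hadm_id].add(charttime)
    return [hadm_id for hadm_id, times in times_by_admission.items() if len(times) > 1]

# Toy rows (made up): admission 100 has two distinct physician notes,
# admission 200 only has an autosaved duplicate, admission 300 has nursing notes.
rows = [
    (100, "Physician ", "2131-12-23 22:56:00"),
    (100, "Physician ", "2131-12-23 23:51:00"),
    (200, "Physician ", "2131-12-24 08:00:00"),
    (200, "Physician ", "2131-12-24 08:00:00"),
    (300, "Nursing", "2131-12-24 09:00:00"),
    (300, "Nursing", "2131-12-25 09:00:00"),
]
print(admissions_with_multiple_notes(rows))  # [100]
```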


## Example: Extracting similar topic sentence pairs for 1 patient

In [126]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [237]:
# test an example
hadm_id = hadm_ids[0] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# Create patient instance -- processes all the data
pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df, \
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker, \
              physician_only=True)

  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]

### Inserting contradictions -- @Diana, start reading here

Types of contradictions to generate, and considerations for each.

* Lab test results (Yuria)
* Diagnoses (Diana)
* Medication (Diana)

For each type of contradiction, generate note-to-note AND note-to-structured table examples.

<!-- Comments: For evaluation purposes, we should maybe pick out  -->

In [215]:
# Iterate through the sentence pairs -- replace sentence with contradiction

#### Please ignore this section

**[TODO, need for full pipeline] Method 1: Going from Sentence instance to Pat.notes_df**

In [216]:
"""
data = pat.notes[4].datas[10]

## todo: we need a way to map from tokenized sentences back to original note
# given a data instance, access the data (sentence) in original notes_df dataframe
notes_df     = data.dailydata.patient.notes_df
note_row_id  = data.dailydata.row_id
sentence_idx = data.sentence_idx

# notes_df.loc[notes_df.ROW_ID == note_row_id, "TEXT"]

note_index = notes_df.ROW_ID.values.tolist().index(note_row_id)
notes_df.at[notes_df.index[note_index], "TEXT"] # = new text here
"""

print("Todo: we need a way to map from tokenized sentences back to original note\n" +\
      "we are only going back to tokenized sentences for now")

Todo: we need a way to map from tokenized sentences back to original note
we are only going back to tokenized sentences for now
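One possible way to approach this TODO, sketched under the assumption that a tokenized sentence differs from the original note text only in whitespace (line breaks, repeated spaces): build a whitespace-tolerant regex from the sentence and substitute it in the raw note. The helper `replace_sentence_in_note` is hypothetical and is not part of `data_structures.py`.

```python
import re

def replace_sentence_in_note(note_text, sentence, new_text):
    """Replace the first occurrence of `sentence` in `note_text`, ignoring whitespace differences."""
    tokens = sentence.split()
    if not tokens:
        return note_text
    # Allow any run of whitespace between consecutive tokens of the sentence
    pattern = r"\s+".join(re.escape(tok) for tok in tokens)
    # A function replacement keeps backslashes in `new_text` from being interpreted
    return re.sub(pattern, lambda m: new_text, note_text, count=1)

note = "Assessment:\nPatient with chronic\n   renal failure. Plan: continue dialysis."
updated = replace_sentence_in_note(
    note,
    "Patient with chronic renal failure.",
    "Patient has normal renal function.",
)
print(updated)
```

If sentence splitting also drops or rewrites characters, this will silently leave the note unchanged, so storing character offsets at tokenization time would be more robust.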


**[Temporary solution for now] Method 2: Going from Sentence instance to Note.sentences**

This has now been included in `Sentence.update_text(new_text, run_reprocessing)`.

In [217]:
"""
# Given this sentence, say I want to change this sentence to something else (inserting contradiction)
sentence = pat.notes[4].datas[10]

# First, trace back to the DailyData instance (in this case, a Note instance)
note = sentence.dailydata

# Since we want to update note.sentences, get the index of this sentence
# then, update the sentence.
note.sentences[sentence.sentence_idx] = "CONTRADICTION??" # = something new

# Once we make all of our updates, re-run the processing from note.sentences
note.process_sentences(note.sentences)

# Check that this worked
pat.notes[4].datas[10].txt
"""

print("Included in data_structures.py")

Included in data_structures.py


#### Process patient data and iterate over pairs of Data instances to get pairs

##### Step 1 [TODO]: Select a patient

In [246]:
from importlib import reload # python 2.7 does not require this
import data_structures
reload(data_structures)
from data_structures import Patient,\
                            Note, PrescriptionOrders, LabResults,\
                            Sentence, Prescription, Lab

In [247]:
hadm_id = hadm_ids[0] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# Create patient instance -- processes all the data
pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df, \
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker, \
              physician_only=True)

  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]

In [248]:
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

In [260]:
def is_comparable_type(data_i, data_j):
    """ We only want to compare note-to-note OR note-to-structured data. 
    
    Comparable types:
    - sentence v. sentence
    - sentence v. prescription
    - sentence v. lab
    
    Uncomparable types:
    - lab v. lab 
    - lab v. prescription
    - prescription v. prescription
    """
    return (data_i.type == "sentence"     and data_j.type == "sentence") or \
           (data_i.type == "sentence"     and data_j.type == "prescription") or \
           (data_i.type == "prescription" and data_j.type == "sentence") or \
           (data_i.type == "sentence"     and data_j.type == "lab") or \
           (data_i.type == "lab"          and data_j.type == "sentence")
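The five cases above boil down to "both types are known, and at least one side is a note sentence". A compact equivalent, shown as a sketch with `SimpleNamespace` objects standing in for the real Data instances (`is_comparable_type_compact` and `ALLOWED_TYPES` are hypothetical names):

```python
from types import SimpleNamespace

ALLOWED_TYPES = {"sentence", "prescription", "lab"}

def is_comparable_type_compact(data_i, data_j):
    """True for sentence v. sentence, sentence v. prescription, and sentence v. lab (either order)."""
    types = {data_i.type, data_j.type}
    return types <= ALLOWED_TYPES and "sentence" in types

sentence = SimpleNamespace(type="sentence")
prescription = SimpleNamespace(type="prescription")
lab = SimpleNamespace(type="lab")

print(is_comparable_type_compact(sentence, lab))      # True
print(is_comparable_type_compact(lab, prescription))  # False
```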

##### Step 2: Generate pairs for this patient

In [261]:
processed_pairs = []  # for dataframe + csv
data_inst_pairs = []  # for pipeline, list of tuples: ((Data 1, Data 2), label)
pair_idx = 0

# Iterate over all of the patient's DailyData instances (e.g. note, prescription order, lab results for same day)
## pat.dailydata = {[date]: [DailyData instance from that date], ...}
for day, pat_dailydatas in pat.dailydata.items(): # pat_dailydatas is list of all DailyData instances for `day`
    print(f"********** Processing data for {day} **********")
    # Collect all the daily datas (note, prescription orders, lab results) for current day
    current_dds = []
    current_dds_features = []
    current_dds_txts = []
    current_dds_sem_types = []
    current_dds_sem_names = []
    for dd in pat_dailydatas: # iterating over DailyData instances, e.g. dd=physician note taken on `day`
        current_dds.extend(dd.datas)
        current_dds_features.extend(dd.datas_features)
        current_dds_txts.extend(dd.datas_txts)
        current_dds_sem_types.extend(dd.datas_semantic_types)
        current_dds_sem_names.extend(dd.datas_semantic_names)

    current_dds           = np.array(current_dds)
    current_dds_features  = np.array(current_dds_features)
    current_dds_txts      = np.array(current_dds_txts)
    current_dds_sem_types = np.array(current_dds_sem_types)
    current_dds_sem_names = np.array(current_dds_sem_names)

    # extract similar sentences for each semantic type
    for sem_type in SEMANTIC_TYPES:
        # data for this semantic type
        sem_type_bools   = [sem_type in x for x in current_dds_sem_types]
        sem_type_indices = np.where(sem_type_bools)[0]
        indices_map = dict(
                        zip(range(len(sem_type_indices)), 
                            sem_type_indices)
                      )  # maps regular indices in sem_type_current_dds_* lists to indices in current_dds_* lists

        sem_type_current_dds           = current_dds[sem_type_indices]
        sem_type_current_dds_features  = current_dds_features[sem_type_indices]
        sem_type_current_dds_txts      = current_dds_txts[sem_type_indices]
        sem_type_current_dds_sem_types = current_dds_sem_types[sem_type_indices]
        sem_type_current_dds_sem_names = current_dds_sem_names[sem_type_indices]

        # bag-of-words vectors over the features (umls + rxnorm concepts)
        vectorizer = CountVectorizer()
        corpus = list(map(lambda x: ' '.join(x), sem_type_current_dds_features))
        if len(corpus) == 0: # skip rest if no candidate sentences exist
            continue
        X = vectorizer.fit_transform(corpus)
        X = X.toarray()

        # get cosine similarity using umls + rxnorm concepts
        similarity = cosine_similarity(X)     # larger=more similar
        sim_is, sim_js = np.where(similarity>0.5) # all pairs with at least 0.5 similarity

        for i, j in zip(sim_is, sim_js):
            data_i = sem_type_current_dds[i]
            data_j = sem_type_current_dds[j]
            # keep each unordered pair once (i > j also drops self-pairs) and only comparable types
            if i>j and is_comparable_type(data_i, data_j):
                print(f"***** PAIR INDEX {pair_idx} *****")
                print(f"Cosine similarity: {similarity[i, j]}")
                print(f"----- Data i -----")
                print(f">> Time: {data_i.time}\n" +\
                      f">> Type: {data_i.type}\n" +\
                      f">> Concepts: {data_i.features}\n" +\
                      f">> {data_i.txt}")
                print(f"----- Data j -----")
                print(f">> Time: {data_j.time}\n" +\
                      f">> Type: {data_j.type}\n" +\
                      f">> Concepts: {data_j.features}\n" +\
                      f">> {data_j.txt}")
                print("**********************************")

                # save
                processed_pairs.append([data_i.txt,      data_j.txt, \
                                        data_i.time,     data_j.time, \
                                        data_i.type,     data_j.type, \
                                        data_i.features, data_j.features, \
                                        similarity[i, j], SEMANTIC_TYPE_TO_NAME[sem_type]])
        #                                 SEMANTIC_TYPE_TO_NAME[semantic_type]])
                
                data_inst_pairs.append(((data_i, data_j), None))
                pair_idx += 1

###############
#### Final ####
###############        
df = \
pd.DataFrame(np.array(processed_pairs), \
             columns=["sentence 1", "sentence 2", \
                      "time 1", "time 2", \
                      "type 1", "type 2", \
                      "concepts 1", "concepts 2", \
                      "cosine similarity", "semantic type"])

# df.to_csv(pt_csv)

# print("Data has been saved!")

********** Processing data for 2131-12-23 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.7071067811865477
----- Data i -----
>> Time: 2131-12-23 23:51:00
>> Type: sentence
>> Concepts: {'Acute respiratory failure', 'Chronic Kidney Diseases'}
>> Altered mental status, acute respiratory failure,    chronic renal failure, hypotension    I saw and examined the patient, and was physically present with the ICU    Resident for key portions of the services provided.'
----- Data j -----
>> Time: 2131-12-23 23:51:00
>> Type: sentence
>> Concepts: {'Chronic Kidney Diseases'}
>> Patient with chronic renal failure.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.5031152949374527
----- Data i -----
>> Time: 2131-12-23 23:51:00
>> Type: sentence
>> Concepts: {'Chronic obstructive pulmonary disease of horses', 'Pneumonia', 'Structure of left lower lobe of lung'}
>> HPI:     [**Age over 90 382**]  year old woman hx of COPD, recent admit in early  [**Month (on

In [262]:
df

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,"Altered mental status, acute respiratory failu...",Patient with chronic renal failure.',2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Acute respiratory failure, Chronic Kidney Dis...",{Chronic Kidney Diseases},0.707107,Disease or Syndrome
1,HPI: [**Age over 90 382**] year old woman...,"Assessment and Plan PNEUMONIA, BACTERIAL, COM...",2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,{Chronic obstructive pulmonary disease of hors...,"{Infantile Neuroaxonal Dystrophy, Kidney, Pneu...",0.503115,Disease or Syndrome
2,"Compared to previous CXR, LLL infiltrate not ...",HPI: [**Age over 90 382**] year old woman...,2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Lung hy...",{Chronic obstructive pulmonary disease of hors...,0.744208,Disease or Syndrome
3,"CXR with LLL infiltrate, which may be persi...",HPI: [**Age over 90 382**] year old woman...,2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Pneumonia}",{Chronic obstructive pulmonary disease of hors...,0.848528,Disease or Syndrome
4,"CXR with LLL infiltrate, which may be persi...","Compared to previous CXR, LLL infiltrate not ...",2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Pneumonia}","{Structure of left lower lobe of lung, Lung hy...",0.877058,Disease or Syndrome
...,...,...,...,...,...,...,...,...,...,...
177,Height: 65 Inch Total In: ...,Height: 65 Inch Total In: ...,2131-12-26 10:04:00,2131-12-26 07:42:00,sentence,sentence,"{Blood Stop Topical Product, Blood product, Ur...","{Blood Stop Topical Product, Blood product, Ur...",1.0,Therapeutic or Preventive Procedure
178,TSH 1.8 on this admission.',"TSH, B12, RPR - normal/non-reactive - st...",2131-12-26 07:42:00,2131-12-26 07:42:00,sentence,sentence,{Thyroid stimulating hormone measurement},{Thyroid stimulating hormone measurement},1.0,Laboratory Procedure
179,ABG: 7.38/36/128/23/-2',ABG: 7.38/36/128/23/-2',2131-12-26 10:04:00,2131-12-26 07:42:00,sentence,sentence,{Analysis of arterial blood gases and pH},{Analysis of arterial blood gases and pH},1.0,Laboratory Procedure
180,Ve: 7.7 L/min PaO2 / FiO2: 320 Physic...,Ve: 7.7 L/min PaO2 / FiO2: 320 Physic...,2131-12-26 10:04:00,2131-12-26 07:42:00,sentence,sentence,{PO2 measurement},{PO2 measurement},1.0,Laboratory Procedure
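It helps to see where a similarity value such as 0.7071 for pair 0 comes from. `CountVectorizer` with default settings lowercases the text and counts word tokens of two or more characters, so the concept names are compared word by word rather than concept by concept. The pure-Python sketch below mimics that tokenization for pair 0; the helpers `bag_of_words` and `cosine` are written for illustration and are not the scikit-learn code path.

```python
import math
import re
from collections import Counter

def bag_of_words(concepts):
    """Lowercased word counts over all concept names (tokens of 2+ word characters)."""
    return Counter(re.findall(r"\b\w\w+\b", " ".join(concepts).lower()))

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

concepts_i = {"Acute respiratory failure", "Chronic Kidney Diseases"}
concepts_j = {"Chronic Kidney Diseases"}

# 3 shared words / (sqrt(6) * sqrt(3)) = 0.7071...
print(round(cosine(bag_of_words(concepts_i), bag_of_words(concepts_j)), 4))  # 0.7071
```

One consequence of word-level counting is that unrelated concepts sharing a common word (e.g. "failure") receive partial similarity, which may explain some of the looser pairs near the 0.5 threshold.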


#### Inserting contradictions to Sentence instances

IMPORTANT: We should only insert contradictions if it is a sentence from a note ("type" should be sentence, not lab or prescription)! 
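A small guard can enforce this rule so that a lab or prescription entry is never modified by accident. This is a sketch only: `FakeData` is a hypothetical stand-in for the notebook's Sentence/Prescription/Lab instances, and `insert_contradiction` is an illustrative wrapper around the `update_text` method.

```python
class FakeData:
    """Hypothetical stand-in exposing the `type`, `txt`, and `update_text` members used here."""
    def __init__(self, type_, txt):
        self.type = type_
        self.txt = txt

    def update_text(self, new_text):
        self.txt = new_text

def insert_contradiction(data, new_text):
    """Update the text only if `data` is a note sentence; refuse structured data."""
    if data.type != "sentence":
        raise ValueError(f"Refusing to modify data of type {data.type!r}; only note sentences may change.")
    data.update_text(new_text)
    return data

note_sentence = FakeData("sentence", "DVT: (Systemic anticoagulation: Heparin gtt)")
insert_contradiction(note_sentence, "CONTRADICTION!!")
print(note_sentence.txt)  # CONTRADICTION!!
```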

**Prescription Order Contradictions**

##### Step 3 [TODO]: Get all the pairs about prescriptions OR diagnoses 

I've already done the prescriptions below. You would probably need to write new code for diagnoses.

In [265]:
# Grab all the pairs that are about prescriptions
semantic_type_ids = CONFLICT_TO_SEMANTIC_TYPE['med_allergy']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_prescription = df['semantic type'].apply(lambda x: x in semantic_type_names)
prescription_pairs_df = df.loc[(df['type 1'] == "prescription") | (df['type 2'] == "prescription") | is_prescription]

prescription_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
23,DVT: (Systemic anticoagulation: Coumadin) ...,DVT: (Systemic anticoagulation: Heparin gtt...,2131-12-23 22:56:00,2131-12-23 23:51:00,sentence,sentence,"{Deep Vein Thrombosis, Stress ulcer of stomach...","{Deep Vein Thrombosis, Anticoagulation Therapy...",0.923077,Pharmacologic Substance
24,Skin: Warm',"Skin: Warm, No(t) Rash: , No(t) Jaundice'",2131-12-23 22:56:00,2131-12-23 23:51:00,sentence,sentence,{Felis catus skin extract},{Felis catus skin extract},1.0,Pharmacologic Substance


##### Step 4 [TODO]: Insert contradictions

Read through the comments below.

We should probably aim for 1-2 contradictions per patient. 
So basically, copy/paste the code for Steps 1-4 for each patient, and push to GitHub.

Small heads up -- for a given patient, try not to insert contradictions into two sentences that look nearly identical. They may refer to the same underlying Sentence instance, in which case the second update would overwrite the contradiction you previously inserted.
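One way to check for this before editing is to look for other pairs that contain the very same object (identity, not just equal text). The sketch below assumes `data_inst_pairs` has the `((data_i, data_j), label)` layout used in this notebook; the helper name `pairs_sharing_instance` and the toy objects are hypothetical.

```python
def pairs_sharing_instance(data_inst_pairs, target):
    """Indices of all pairs in which `target` (the same object) appears on either side."""
    return [idx for idx, ((data_i, data_j), _label) in enumerate(data_inst_pairs)
            if data_i is target or data_j is target]

# Toy objects standing in for Sentence instances
s1, s2, s3 = object(), object(), object()
toy_pairs = [((s1, s2), None), ((s3, s1), None), ((s2, s3), None)]

print(pairs_sharing_instance(toy_pairs, s1))  # [0, 1]
```

If the sentence you are about to modify shows up in a pair you already edited, pick a different pair.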

In [178]:
# Look through the sentence pairs by going through `prescription_pairs_df`.
# If you find a good one you want to insert a contradiction for, 
# make note of the row index (i.e. the number at the left), 
# and set this to `pair_idx` below. 
# Also make note of which sentence (i.e. sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag appropriately.

pair_idx = 23
is_sentence2 = True

sentence_to_modify = data_inst_pairs[pair_idx][0][is_sentence2]

In [179]:
# Set `contradicting_txt` to the new contradicting sentence.
# This will just update the text for now.

contradicting_txt = "CONTRADICTION!!"
sentence_to_modify.update_text(contradicting_txt)

In [180]:
# Repeat this for all the contradictions you want to insert for this patient. 

pair_idx = 24
is_sentence2 = False

sentence_to_modify = data_inst_pairs[pair_idx][0][is_sentence2]

contradicting_txt = "ANOTHER CONTRADICTION!!"
sentence_to_modify.update_text(contradicting_txt)

# @Diana -- end

In [190]:
pat.d_lab_df

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE
0,546,51346,Blasts,Cerebrospinal Fluid (CSF),Hematology,26447-3
1,547,51347,Eosinophils,Cerebrospinal Fluid (CSF),Hematology,26451-5
2,548,51348,"Hematocrit, CSF",Cerebrospinal Fluid (CSF),Hematology,30398-2
3,549,51349,Hypersegmented Neutrophils,Cerebrospinal Fluid (CSF),Hematology,26506-6
4,550,51350,Immunophenotyping,Cerebrospinal Fluid (CSF),Hematology,
...,...,...,...,...,...,...
748,749,51551,VOIDED SPECIMEN,OTHER BODY FLUID,HEMATOLOGY,
749,750,51552,VOIDED SPECIMEN,STOOL,CHEMISTRY,
750,751,51553,VOIDED SPECIMEN,URINE,CHEMISTRY,
751,752,51554,VOIDED SPECIMEN,JOINT FLUID,HEMATOLOGY,


In [189]:
pat.lab_df

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ITEMID,CHARTTIME,VALUE,VALUENUM,VALUEUOM,FLAG
16083806,16462787,26601,155131.0,50868,2131-12-23 16:50:00,15,15.00,mEq/L,
16083807,16462788,26601,155131.0,50882,2131-12-23 16:50:00,28,28.00,mEq/L,
16083808,16462789,26601,155131.0,50893,2131-12-23 16:50:00,7.8,7.80,mg/dL,abnormal
16083809,16462790,26601,155131.0,50902,2131-12-23 16:50:00,94,94.00,mEq/L,abnormal
16083810,16462791,26601,155131.0,50912,2131-12-23 16:50:00,6.3,6.30,mg/dL,abnormal
...,...,...,...,...,...,...,...,...,...
16290809,16463068,26601,155131.0,51274,2131-12-29 06:07:00,28.9,28.90,sec,abnormal
16290810,16463069,26601,155131.0,51275,2131-12-29 06:07:00,54.6,54.60,sec,abnormal
16290811,16463070,26601,155131.0,51277,2131-12-29 06:07:00,18.2,18.20,%,abnormal
16290812,16463071,26601,155131.0,51279,2131-12-29 06:07:00,3.18,3.18,m/uL,abnormal


In [192]:
#     def process_prescriptions(self):
#         start, end = self._get_prescription_start_end_dt()  # get start/end dates
#         self._process_prescription_sents()                  # get prescription info in sentence form

#         # for each date, get all the prescriptions given and construct PrescriptionOrders
#         delta = dt.timedelta(days=1)
#         current = start
#         prescriptions = []
#         while current <= end:
#             current_prescription_df = self.prescription_df.apply(lambda x: x.START_DT <= current and x.END_DT >= current, axis=1)
#             if current_prescription_df.sum() > 0: # if there is at least 1
#                 prescription_order = PrescriptionOrders(self, current_prescription_df, current)
#                 prescriptions.append(prescription_order)

#             current += delta # go to next date
#         self.prescriptions = prescriptions

#     def _get_prescription_start_end_dt(self):
#         # get datetimes for start and end dates
#         start_dt = self.prescription_df.STARTDATE.apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").date())
#         end_dt   = self.prescription_df.ENDDATE.apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").date())

#         self.prescription_df[['START_DT']] = start_dt
#         self.prescription_df[['END_DT']]   = end_dt

#         # get earliest and latest dates
#         start = min(start_dt)
#         end   = max(end_dt)
        
#         return start, end

def _get_lab_start_end_dt():
    # parse chart times into dates, then return the earliest and latest
    time_dt = pat.lab_df.CHARTTIME.apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").date())
    start = min(time_dt)
    end   = max(time_dt)
    return start, end

16083806    2131-12-23
16083807    2131-12-23
16083808    2131-12-23
16083809    2131-12-23
16083810    2131-12-23
               ...    
16290809    2131-12-29
16290810    2131-12-29
16290811    2131-12-29
16290812    2131-12-29
16290813    2131-12-29
Name: CHARTTIME, Length: 286, dtype: object
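
The date range can also be computed without a per-row `strptime`. A minimal sketch, assuming `CHARTTIME` holds strings in the `%Y-%m-%d %H:%M:%S` format shown above; `toy_lab_df` and `get_start_end_dates` are hypothetical names used only for illustration.

```python
import datetime as dt
import pandas as pd

# Toy frame standing in for pat.lab_df (only the column we need)
toy_lab_df = pd.DataFrame({"CHARTTIME": ["2131-12-23 16:50:00",
                                         "2131-12-29 06:07:00",
                                         "2131-12-25 08:00:00"]})

def get_start_end_dates(df, col="CHARTTIME"):
    # parse the whole column at once, then take the earliest and latest day
    ts = pd.to_datetime(df[col], format="%Y-%m-%d %H:%M:%S")
    return ts.min().date(), ts.max().date()

start, end = get_start_end_dates(toy_lab_df)
print(start, end)
```

The returned values are plain `datetime.date` objects, so they can be stepped through with `dt.timedelta(days=1)` the same way the prescription code does.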

In [211]:
#     def _process_prescription_sents(self):
#         # process sentence for each prescription
#         # get_prescription_sent = lambda row: f"Patient was prescribed {row.DRUG.item()} {row.PROD_STRENGTH.item()} {row.ROUTE.item()} of total {row.DOSE_VAL_RX.item()} {row.DOSE_UNIT_RX.item()}"
#         get_prescription_sent = lambda row: f"Patient was prescribed {row.DRUG} {row.PROD_STRENGTH} {row.ROUTE} of total {row.DOSE_VAL_RX} {row.DOSE_UNIT_RX}"
#         prescription_sents = self.prescription_df.apply(get_prescription_sent, axis=1)
#         self.prescription_df[['Sentence']] = prescription_sents

pat.lab_df.head()

# look up by index label in the patient's own lab table
row = pat.lab_df.loc[16083806]

def get_label(row):
    return pat.d_lab_df.loc[pat.d_lab_df.ITEMID == row.ITEMID].LABEL.item()

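def get_flag(row):
    # FLAG is either missing (NaN) or a string such as "abnormal";
    # pd.isna handles both, whereas np.isnan raises TypeError on strings
    if pd.isna(row.FLAG): # no flag info if no flag
        return ""
    return f", which is {row.FLAG}"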

get_label(row)
get_flag(row)

get_lab_sent = lambda row: f"Patient's {get_label(row)} lab came back {row.VALUE} {row.VALUEUOM}{get_flag(row)}."

get_lab_sent(row)

"Patient's Anion Gap lab came back 15 mEq/L."

In [None]:
#   normal_template = "{} lab came back normal."
#   abnormal_template = "{} lab came back abnormal."
#   id_data = lab_df.loc[lab_df["HADM_ID"] == id]
#   daily_labs = {}
#   for i, row in id_data.iterrows():
#     item = row["ITEMID"]
#     day = row["CHARTTIME"].split()[0]
#     if day not in daily_labs:
#       daily_labs[day] = set()
#     label = find_label(item, d_lab_df)
#     if label is None:
#       continue
#     if row["FLAG"] == "abnormal":
#       daily_labs[day].add(abnormal_template.format(label))
#     else:
#       daily_labs[day].add(normal_template.format(label))
#   return daily_labs
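
The commented-out loop above could be written with a merge and a groupby instead. This is a rough sketch, not the notebook's final method: `toy_d_lab`, `toy_lab`, and `daily_lab_sentences` are hypothetical, the item IDs 99999 and 11111 and the label "Example Lab" are made up, and the merge stands in for `find_label`.

```python
import pandas as pd

# Toy stand-ins for d_lab_df and lab_df
toy_d_lab = pd.DataFrame({"ITEMID": [50868, 99999],
                          "LABEL": ["Anion Gap", "Example Lab"]})
toy_lab = pd.DataFrame({
    "ITEMID": [50868, 99999, 50868, 11111],
    "CHARTTIME": ["2131-12-23 16:50:00", "2131-12-23 16:50:00",
                  "2131-12-29 06:07:00", "2131-12-29 06:07:00"],
    "FLAG": [None, "abnormal", None, None],
})

def daily_lab_sentences(lab_df, d_lab_df):
    # the inner merge drops labs with no known label (the `continue` branch)
    merged = lab_df.merge(d_lab_df[["ITEMID", "LABEL"]], on="ITEMID", how="inner")
    # anything not flagged "abnormal" is treated as normal
    status = merged["FLAG"].eq("abnormal").map({True: "abnormal", False: "normal"})
    sents = merged["LABEL"] + " lab came back " + status + "."
    # the day is the part of CHARTTIME before the space
    days = merged["CHARTTIME"].str.split().str[0]
    return sents.groupby(days).apply(set).to_dict()

daily_labs = daily_lab_sentences(toy_lab, toy_d_lab)
```

One difference from the loop: a day whose labs all lack labels gets no key here, whereas the loop would leave an empty set for it.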

### Getting History + Allergy Information

In [116]:
# todo: 
# - DONE function to re-process all data from Patient instance -- pat.process_notes(); pat.process_by_date()
# - function to update Note -- should update dataframe of patient directly
#   - can go back to dataframe, but can't map tokenized sentence to original note in df -- todo
#   - function to update tokenized sentence
# - later: function to update original dataframe from patient dataframe

import re

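def get_section(regex_dict, txt):
    """ Given a dictionary of start and end regexes for a
        particular section, finds where the section starts and
        ends in the text and returns the two indices.
        Returns (None, None) if either marker is missing.
    """
    start_match = re.search(regex_dict["start"], txt)
    if start_match is None:
        return None, None

    # look for the end marker only after the start marker, so an
    # earlier occurrence of the end pattern can't produce end < start
    end_match = re.search(regex_dict["end"], txt[start_match.end():])
    if end_match is None:
        return None, None

    return start_match.start(), start_match.end() + end_match.start()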

note = pat.notes[4]

# Sections to store
# note: most of these sections have already been removed from the notes;
#       if they haven't, we might have to remove them and then
#       reprocess everything
allergy_regex = {"start": "Allergies:",
                 "end":   "Last dose of Antibiotics:"}
history_regex = {"start": "Past medical history:",
                 "end":   "Other:"}

allergy_start, allergy_end = get_section(allergy_regex, note.txt)
history_start, history_end = get_section(history_regex, note.txt)

pt_allergies = "" if allergy_start is None else note.txt[allergy_start:allergy_end]
pt_histories = "" if history_start is None else note.txt[history_start:history_end]

print("******** Allergies ********")
print(pt_allergies[:100])
print("******** Histories ********")
print(pt_histories[:100])

******** Allergies ********
Allergies:
   Bactrim (Oral) (Sulfamethoxazole/Trimethoprim)
   Nausea/Vomiting
   Amiodarone
   Ras
******** Histories ********
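
The history section came back empty for this note. One possible reason is a heading that differs in case or is missing its end marker, though that has not been checked here. Below is a hedged sketch of a more forgiving extractor: `extract_section`, `toy_note`, and `sections` are hypothetical, and both the case-insensitive matching and the fallback to the end of the note are assumptions about what would be useful, not what the notes require.

```python
import re

def extract_section(txt, start_pat, end_pat):
    """Return the text from start_pat up to end_pat, or "" if start_pat is absent."""
    start = re.search(start_pat, txt, flags=re.IGNORECASE)
    if start is None:
        return ""
    # search for the end marker only in the text after the start marker
    end = re.search(end_pat, txt[start.end():], flags=re.IGNORECASE)
    # assumption: with no end marker, the section runs to the end of the note
    stop = len(txt) if end is None else start.end() + end.start()
    return txt[start.start():stop]

# Toy note; the real notes come from pat.notes
toy_note = ("Chief complaint: cough\n"
            "Allergies:\n   Amiodarone\n"
            "Last dose of Antibiotics: none\n")

sections = {"allergies": ("Allergies:", "Last dose of Antibiotics:"),
            "history":   ("Past medical history:", "Other:")}

found = {name: extract_section(toy_note, s, e) for name, (s, e) in sections.items()}
```

Keeping the markers in one dictionary means another section can be added without writing a new start/end pair of variables.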

