In [1]:
# Import libraries
import os
import re
import random
import pickle
import subprocess
import numpy as np
import pandas as pd
import datetime as dt

from tqdm import tqdm
from datetime import datetime
from collections import Counter

# 1. Setup concept extractors

Some options were [MetaMap](https://metamap.nlm.nih.gov/) and [spaCy](https://spacy.io/). 

[MetaMap](https://metamap.nlm.nih.gov/) is specific to recognizing UMLS concepts. There is a [Python wrapper](https://github.com/AnthonyMRios/pymetamap), but known to be slow and bad.

[spaCy](https://spacy.io/) is a popular NLP Python package with an extensive library for named entity recognition. It has a wide variety of [extensions](https://spacy.io/universe) and models to choose from. We're going with the following.

* [scispaCy](https://spacy.io/universe/project/scispacy) contains spaCy models for processing biomedical, scientific or clinical text. It seems easy to use and has a wide variety of concepts it can recognize, including UMLS, RxNorm, etc.

* [negspaCy](https://spacy.io/universe/project/negspacy) identifies negations using some extension of regEx. Probably useful for things like, "this pt is diabetic" v. "this pt is not diabetic." [todo: negation identification of medspacy might be better, https://github.com/medspacy/medspacy]

* [Med7](https://github.com/kormilitzin/med7) is a model trained for recognizing entities in prescription text, e.g. identifies drug name, dosage, duration, etc., which could be useful stuff to check for conflicts. 

We're going with spaCy for this.. and coming up with a coherent way to integrate entities picked up by these three extensions/models.

## i) Installations

In [2]:
import sys; sys.executable

'/opt/conda/envs/opennotes/bin/python'

In [3]:
import spacy
import scispacy

from pprint import pprint
from collections import OrderedDict

from spacy import displacy
# from scispacy.abbreviation import AbbreviationDetector # UMLS already contains abbrev. detect
from scispacy.umls_linking import UmlsEntityLinker

# should be 2.3.5 and >=0.3.0
spacy.__version__, scispacy.__version__

('2.3.5', '0.3.0')

## ii) Setting up the model

The model is used to form word/sentence embeddings for the NER task. Thus, it's important to choose model that has been tuned for our specific use case (e.g. clinical text, prescription information) so the embeddings are useful for naming the entity.

[Note to self:] one potential idea to look into if we have time remaining, something about using custom model for spacy pipeline (could we do smth with the romanov models since they've been trained specifically for conflict detection?) -- https://spacy.io/usage/v3

### a) scispaCy

For scispaCy, we set up one of their models that has been trained on biomedical data. Other models can be found [here](https://allenai.github.io/scispacy/). 

We load two models since we will be linking different entity linkers (knowledge bases that link text to named entites) later.

In [4]:
## uncomment to install model if not already installed
# !/opt/conda/envs/opennotes/bin/python -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz

In [5]:
# for umls (general biomedical concepts)
umls_nlp   = spacy.load("en_core_sci_sm")

# for rxnorm (prescriptions)
rxnorm_nlp = spacy.load("en_core_sci_sm")

### b) Med7

For Med7, we set up their model that has been trained specifically for NER of medication-related concepts: dosage, drug names, duration, form, frequency, route of administration, and strength. The model is trained on MIMIC-III, so it should work well for us.

In [6]:
# # installs Med7 model
# !pip install https://www.dropbox.com/s/xbgsy6tyctvrqz3/en_core_med7_lg.tar.gz?dl=1

In [6]:
med7_nlp = spacy.load("en_core_med7_lg")

## iii) Adding an entity linker

The EntityLinker is a spaCy component that links to a knowledge base. The linker compares words with the concepts in the specified knowledge base (e.g. scispaCy's UMLS does some form of character overlap-based nearest neighbor search, has option to resolve abbreviations first).

[Note: Entities generally get resolved to a list of different entities. This [blog post](http://sujitpal.blogspot.com/2020/08/disambiguating-scispacy-umls-entities.html) describes one potential way to disambiguate this by figuring out "most likely" set of entities. Gonna start off with just resolving to the 1st entity tho... hopefully that's sufficient.]

### a) scispaCy

#### UMLS Linker

UMLS linker maps entities to the UMLS concept. Main parts we'll be interested in are: semantic type and concept (mainly the common name, maybe the CUI might become important later).

* _Semantic type_ is the broader category that the entity falls under, e.g. disease, pharmacologic substance, etc. See [this](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt) for a full list.

* _Concepts_ refer to the more fundamental entity itself, e.g. pneumothorax, ventillator, etc. Many concepts can fall under a semantic type.

More info on `UmlsEntityLinker` ([source code](https://github.com/allenai/scispacy/blob/4ade4ec897fa48c2ecf3187caa08a949920d126d/scispacy/linking.py#L9))

See source code for `.jsonl` file with the knowledge base.

In [7]:
from scispacy.umls_linking import UmlsEntityLinker

# abbreviation_pipe = AbbreviationDetector(nlp) # automatically included with UMLS linker
# nlp.add_pipe(abbreviation_pipe)
umls_linker = UmlsEntityLinker(k=10,                          # number of nearest neighbors to look up from
                               threshold=0.7,                 # confidence threshold to be added as candidate
                               max_entities_per_mention=1,    # number of entities returned per concept (todo: tune)
                               filter_for_definitions=False,  # no definition is OK
                               resolve_abbreviations=True)    # resolve abbreviations before linking
umls_nlp.add_pipe(umls_linker)

https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/linkers/2020-10-09/umls/tfidf_vectors_sparse.npz not found in cache, downloading to /tmp/tmphpaafo5y
Finished download, copying /tmp/tmphpaafo5y to cache at /home/yutsumi/.scispacy/datasets/e9f7327283e43f0482f7c0c71b71dec278a58ccb3ffdd03c2c2350159e7ef146.f2a350ad19015b2591545f7feeed6a6d6d2fffcd635d868a5d7fc0dfc3cadfd8.tfidf_vectors_sparse.npz
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/linkers/2020-10-09/umls/nmslib_index.bin not found in cache, downloading to /tmp/tmpts22037q
Finished download, copying /tmp/tmpts22037q to cache at /home/yutsumi/.scispacy/datasets/f48455d6c79262057cce66b4619123c2b558b21092d42fac97f47bb99a5b8f9f.dd70d3dffe7d90d7ac8914460e16a48375dab32485fb6313a34e6fbcaf53218b.nmslib_index.bin
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/linkers/2020-10-09/umls/tfidf_vectorizer.joblib not found in cache, downloading to /tmp/tmpmc211imo
Finished download, copying /tmp/tmpmc211imo to cache at /h



https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/linkers/2020-10-09/umls/concept_aliases.json not found in cache, downloading to /tmp/tmp8zx8quxl
Finished download, copying /tmp/tmp8zx8quxl to cache at /home/yutsumi/.scispacy/datasets/1428ec15d3b1061731ea273c03699130b3d6b90948993e74bda66af605ff8e2a.aeb7a686c654df6bccb6c2c23d3eda3eb381daaefda4592b58158d0bee53b352.concept_aliases.json
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/kbs/2020-10-09/umls_2020_aa_cat0129.jsonl not found in cache, downloading to /tmp/tmpjj37n3ic
Finished download, copying /tmp/tmpjj37n3ic to cache at /home/yutsumi/.scispacy/datasets/4d7fb8fcae1035d1e0a47d9072b43d5a628057d35497fbfb2499b4b7b2dd4dd7.05ec7eef12f336d4666da85b7fa69b9401883a7dd4244473f7b88b413ccbba03.umls_2020_aa_cat0129.jsonl


#### RxNorm Linker

RxNorm linker maps entities to RxNorm, an ontology for clinical drug names. It contains about 100k concepts for normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.

More info on `RxNorm` ([NIH page](https://www.nlm.nih.gov/research/umls/rxnorm/index.html), [source code](https://github.com/allenai/scispacy/blob/2290a80cfe0948e48d8ecfbd60064019d57a6874/scispacy/linking_utils.py#L120))

See source code for `.jsonl` file with the knowledge base.

In [8]:
from scispacy.linking import EntityLinker

# rxnorm_linker = EntityLinker(resolve_abbreviations=True, name="rxnorm")
rxnorm_linker = EntityLinker(k=10,                          # number of nearest neighbors to look up from
                             threshold=0.7,                 # confidence threshold to be added as candidate
                             max_entities_per_mention=1,    # number of entities returned per concept (todo: tune)
                             filter_for_definitions=False,  # no definition is OK
                             resolve_abbreviations=True,    # resolve abbreviations before linking
                             name="rxnorm")                 # RxNorm ontology

rxnorm_nlp.add_pipe(rxnorm_linker)



### b) Med7 

No need for entity linker

### c) Negspacy [TODO]

# 2. Setup data structures

## Categorizing type of conflict

The first larger task is to categorize by the type of conflict to check for since our method will likely be different (at least for the rule based). We wrote up a short list [here](https://docs.google.com/document/d/1fEBk0JHeyQWshYWW5w_VTkaYyRfm9MBxJ9DAGoVa8Yw/edit?usp=sharing). 

To do this, we're using the semantic type that is identified by the UMLS linker. Here's a table of the semantic types we're filtering for, and which conflict they'll be used for.

Here's a [full list](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt) of semantic types. You can look up definitions of semantic types [here](http://linkedlifedata.com/resource/umls-semnetwork/T033).

| Conflict | Semantic Type |
| --- | ----------- |
| Diagnoses-related errors | Disease or Syndrome (T047), Diagnostic Procedure(T060) |
| Inaccurate description of medical history (symptoms) | Sign or Symptom (T184) |
| Inaccurate description of medical history (operations) | Therapeutic or Preventive Procedure (T061) |
| Inaccurate description of medical history (other) | [all of the above and below] |
| Medication or allergies | Clinical Drug (T200), Pharmacologic Substance (T121) |
| Test procedures or results | Laboratory Procedure (T059), Laboratory or Test Result (T034) | 


For clarity, the concepts we'll keep from the UMLS linker are anything falling into these semantic types (which we will then categorize by type of conflict using the table above):

* T047 - Disease or Syndrome
* T121 - Pharmacologic Substance
* T023 - Body Part, Organ, or Organ Component
* T061 - Therapeutic or Preventive Procedure 
* T060 - Diagnostic Procedure
* T059 - Laboratory Procedure
* T034 - Laboratory or Test Result 
* T184 - Sign or Symptom 
* T200 - Clinical Drug

We'll store this info into a dictionary now.

<!-- Some useful def's 
Finding - 
That which is discovered by direct observation or measurement of an organism attribute or condition, including the clinical history of the patient. The history of the presence of a disease is a 'Finding' and is distinguished from the disease itself.  -->

In [10]:
SEMANTIC_TYPES = ['T047', 'T121', 'T023', 'T061', 'T060', 'T059', 'T034', 'T184', 'T200']
SEMANTIC_NAMES = ['Disease or Syndrome', 'Pharmacologic Substance', 'Body Part, Organ, or Organ Component', \
                  'Therapeutic or Preventive Procedure', 'Diagnostic Procedure', 'Laboratory Procedure', \
                  'Laboratory or Test Result', 'Sign or Symptom', 'Clinical Drug']
SEMANTIC_TYPE_TO_NAME = dict(zip(SEMANTIC_TYPES, SEMANTIC_NAMES))

SEMANTIC_TYPE_TO_NAME

{'T047': 'Disease or Syndrome',
 'T121': 'Pharmacologic Substance',
 'T023': 'Body Part, Organ, or Organ Component',
 'T061': 'Therapeutic or Preventive Procedure',
 'T060': 'Diagnostic Procedure',
 'T059': 'Laboratory Procedure',
 'T034': 'Laboratory or Test Result',
 'T184': 'Sign or Symptom',
 'T200': 'Clinical Drug'}

In [11]:
CONFLICT_TO_SEMANTIC_TYPE = {
    "diagnosis": {'T047', 'T060'},
    "med_history_symptom": {'T184'},
    "med_history_operation": {'T061'},
    "med_history_other": set(SEMANTIC_TYPES),
    "med_allergy": {'T200', 'T121'},
    "test_results": {'T059', 'T034'}
}

CONFLICT_TO_SEMANTIC_TYPE

{'diagnosis': {'T047', 'T060'},
 'med_history_symptom': {'T184'},
 'med_history_operation': {'T061'},
 'med_history_other': {'T023',
  'T034',
  'T047',
  'T059',
  'T060',
  'T061',
  'T121',
  'T184',
  'T200'},
 'med_allergy': {'T121', 'T200'},
 'test_results': {'T034', 'T059'}}

In [12]:
# # Patient -- get minimum/maximum date
# notes_dates = list(map(lambda x: x.time.date(), pat.notes))
# prescriptions_dates = list(map(lambda x: x.date, pat.prescriptions))
# start = min(notes_dates + prescriptions_dates)
# end   = max(notes_dates + prescriptions_dates)

# # Patient -- get all Note & PrescriptionOrder instances for date
# current = start

# # get all items for current date
# def get_current_items(items, dates, current):
#     items_and_dates = zip(items, dates)
#     current_items_and_dates = filter(lambda x: x[1] == current, items_and_dates)
#     current_items, current_dates = list(zip(*current_items_and_dates))
    
#     return current_items
    
# items = pat.notes
# dates = notes_dates
# current_notes = get_current_items(items, dates, current)

# items = pat.prescriptions
# dates = prescriptions_dates
# current_prescriptions = get_current_items(items, dates, current)

# current_data = current_notes + current_prescriptions

In [179]:
class Data(object):
    def __init__(self, dailydata, txt, filter_map=None, conflict_map=None):
        self.dailydata = dailydata
        self.txt       = txt
        self.time      = dailydata.time
        
        self.umls_cui_map   = dailydata.umls_linker.umls.cui_to_entity # maps CUI to UMLS knowledge base
        self.rxnorm_cui_map = dailydata.rxnorm_linker.kb.cui_to_entity # maps CUI to RxNorm knowledge base
        self.filter_map   = filter_map
        self.conflict_map = conflict_map
        self.is_filter    = (filter_map is not None)
        self.is_conflict  = (conflict_map is not None)
        
        self.umls_doc    = dailydata.umls(self.txt)
        self.rxnorm_doc  = dailydata.rxnorm(self.txt)
        self.med7_doc    = dailydata.med7(self.txt)
        
        self.semantic_types = []
        self.semantic_names = []  # names of categories of entities
        
        self.umls_concepts   = [] # names of types of entities (UMLS)
        self.get_umls_info()
        
        self.rxnorm_concepts = [] # names of types of entities (RxNorm)
        self.get_rxnorm_info()
        
        self.semantic_types  = set(self.semantic_types)
        self.semantic_names  = set(self.semantic_names)
        self.umls_concepts   = set(self.umls_concepts)
        self.rxnorm_concepts = set(self.rxnorm_concepts)
        
        self.med7_entities = []   # list of tuples with (entity word, entity label), e.g. (aspirin, drug)
        self.get_med7_info()
        
    @property
    def features(self):
        """ Returns canonical names of extracted concepts. Used to get cosine similarities. """
        return self.umls_concepts | self.rxnorm_concepts
    
    def get_med7_info(self):
        # list of tuples with (entity word, entity label), e.g. (aspirin, drug)
        self.med7_entities = [(ent.text, ent.label_) for ent in self.med7_doc.ents]
        
    def get_umls_info(self):
        for ent in self.umls_doc.ents: # extract info (umls) for each entity
            # todo: look into this bug, ent._.umls_ents sometimes empty list
            try:
                cui, _ = ent._.umls_ents[0] # assuming `max_entites_per_mention=1` for now
            except IndexError:
                continue
            cui_info = self.umls_cui_map[cui]
            
            if not self.is_filter:
                ent_valid_type_list = [True for _ in cui_info.types] # add everything if no filter
            else: 
                ent_valid_type_list = [t in self.filter_map for t in cui_info.types]
            ent_valid_type = any(ent_valid_type_list) # checks if entity is a valid type
            
            if ent_valid_type: # only add to list if we're not filtering of it's valid
                self.umls_concepts.append(cui_info.canonical_name)
                for (stype, keep) in zip(cui_info.types, ent_valid_type_list):
                    if keep:
                        self.semantic_types.append(stype)
                        self.semantic_names.append(self.filter_map[stype])
                
    def get_rxnorm_info(self):
        for ent in self.rxnorm_doc.ents: # extract info for each rxnorm entity
            try:
                cui, _ = ent._.kb_ents[0] # assuming `max_entites_per_mention=1` for now
            except IndexError:
                continue 
            cui_info = self.rxnorm_cui_map[cui]

            if not self.is_filter:
                ent_valid_type_list = [True for _ in cui_info.types] # add everything if no filter
            else: 
                ent_valid_type_list = [t in self.filter_map for t in cui_info.types]
            ent_valid_type = any(ent_valid_type_list) # checks if entity is a valid type
            
            if ent_valid_type: # only add to list if we're not filtering of it's valid
                self.rxnorm_concepts.append(cui_info.canonical_name)
                for (stype, keep) in zip(cui_info.types, ent_valid_type_list):
                    if keep:
                        self.semantic_types.append(stype)
                        self.semantic_names.append(self.filter_map[stype])
            
    def is_ctype(self, ctype):
        """ Given a conflict type (e.g. "diagnosis"),
            returns True if this sentence falls into that category, False otherwise.
            Returns None if conflict_map is undefined.
        """
        if self.is_conflict: 
            ctype_stypes = self.conflict_map[ctype] # get list of semantic types for this conflict
            return any([stype in ctype_stypes for stype in self.semantic_types])
        return None
    
class Sentence(Data):
    def __init__(self, note, sentence_idx, filter_map=None, conflict_map=None):
        """
        Extracts important information and stores them as attributes. 
        """
        txt = note.sentences[sentence_idx]       
        self.sentence_idx = sentence_idx

        super(Sentence, self).__init__(note, txt, filter_map, conflict_map)
        
class Prescription(Data):
    def __init__(self, prescription_order, prescription_idx, filter_map=None, conflict_map=None):
        txt = prescription_order.sentences[prescription_idx]
        self.prescription_idx = prescription_idx
        
        super(Prescription, self).__init__(prescription_order, txt, filter_map, conflict_map)
        
class Lab(Data):
    def __init__(self, lab_result, lab_idx):
        super(Lab, self).__init__(lab_result)
        self.lab_idx = lab_idx

In [180]:

class DailyData(object):
    """ Collection of data from same day. e.g. clinical notes, lab tests, prescription orders. """
    def __init__(self, patient):
        self.patient  = patient   # patient this data is for
    
    def __getitem__(self, idx):
        return self.datas[idx]

    @property
    def hadm_id(self):
        return self.patient.hadm_id
    
    @property
    def datas_txts(self):
        return list(map(lambda x: x.txt, self.datas))
    
    @property
    def datas_features(self):
        return list(map(lambda x: x.features, self.datas))
    
    @property
    def datas_semantic_types(self):
        return list(map(lambda x: x.semantic_types, self.datas))

    @property
    def datas_semantic_names(self):
        return list(map(lambda x: x.semantic_names, self.datas))

    @property
    def med7(self):
        return self.patient.med7
    
    @property
    def umls(self):
        return self.patient.umls
    
    @property
    def rxnorm(self):
        return self.patient.rxnorm
    
    @property
    def umls_linker(self):
        return self.patient.umls_linker

    @property
    def rxnorm_linker(self):
        return self.patient.rxnorm_linker

class Note(DailyData):
    def __init__(self, patient, row_id):
        super(Note, self).__init__(patient)
        
        self.note_row = patient.notes_df.loc[patient.notes_df.ROW_ID == row_id]   # df row for this note
        self.txt      = self.note_row.TEXT.item()                                       # note in string format
        self.cat      = self.note_row.CATEGORY.item()                                   # note category
        
        # Get datetime
        if type(self.note_row.CHARTTIME.item()) == str:
            self.time = datetime.strptime(self.note_row.CHARTTIME.item(), "%Y-%m-%d %H:%M:%S")
        elif type(self.note_row.CHARTDATE.item()) == str:
            self.time = datetime.strptime(self.note_row.CHARTDATE.item(), "%Y-%m-%d")
        else:
            self.time = None
            
        # Tokenize note
        sents = !python mimic-tokenize/heuristic-tokenize.py "{self.txt}"
        sentences = sents[0].split(", \'")
#         # For python script: runs command and returns stdout as bytes, convert to utf-8, list of sentences
#         sents = subprocess.check_output(f"python mimic-tokenize/heuristic-tokenize.py {self.txt}".split(" "))
#         sents = sents.decode("utf-8")
#         sentences = sents.split(", \'")

        # Remove lab tables, remove titles
        sentences = self._delete_copied_lab_tables(sentences)
        sentences = self._remove_titles(sentences)
        
        self.sentences = sentences # todo: process each sentence
        
        # Process each sentence
        sentence_reps = []
        for idx, sent in enumerate(sentences):
            sent_rep = Sentence(self, idx,
                                filter_map=SEMANTIC_TYPE_TO_NAME,
                                conflict_map=CONFLICT_TO_SEMANTIC_TYPE)
            sentence_reps.append(sent_rep)
            
        self.datas = sentence_reps
        
    def _diff_list(self, li1, li2):
        return list(set(li1) - set(li2)) + list(set(li2) - set(li1))

    def _delete_copied_lab_tables(self, ind_sentences):
        # [**yyyy-mm-dd**], 02:10
#         rgx_list = ["[\*\*\d{4}\-\d{1,2}\-\d{1,2}\*\*]", "\d{1,2}\-\d{1,2}"]
#         rgx_list = ["[\*\*[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}\*\*] *[0-9]{1,2}-[0-9]{1,2}"]
#         rgx_list = ["[\*\*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\*\*]   [0-9][0-9]-[0-9][0-9]"]
        rgx_list = ["[\*\*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\*\*]"]
#         rgx_list = ["[\d{4}\-\d{1,2}\-\d{1,2}][^\S]+\d{1,2}\-\d{1,2}"]
        
        delete_list = []
        # ind_sentences is list of strings
        for sentence in ind_sentences:
            for rgx_match in rgx_list:
                match = re.search(rgx_match, sentence)
                if match and sentence not in delete_list:
                    delete_list.append(sentence)
        return self._diff_list(ind_sentences, delete_list)
    
    def _remove_titles(self, sentences):
        """ Omits anything that has ':' in last two entries of the string. 
        e.g. "...Results:"
        """
        return list(filter(lambda x: ':' not in x[-2:], sentences))
        
class PrescriptionOrders(DailyData):
    def __init__(self, patient, daily_bools, date):
        """ Patient instance and boolean Series for selecting daily rows. """
        super(PrescriptionOrders, self).__init__(patient)
        self.date = date
        self.time = datetime.combine(date, datetime.min.time())
        
        self.prescription_df = self.patient.prescription_df[daily_bools]  # order dataframe
        self.sentences       = self.prescription_df.Sentence.values       # array of orders in sentence form
        
        # Process each prescription
        prescription_data = []
        for idx, prescript in enumerate(self.prescription_df):
            prescript_rep = Prescription(self, idx,
                                         filter_map=SEMANTIC_TYPE_TO_NAME,
                                         conflict_map=CONFLICT_TO_SEMANTIC_TYPE)
            prescription_data.append(prescript_rep)
        
        self.datas = prescription_data

class LabResults(DailyData):
    def __init__(self, patient):
        super(LabResults, self).__init__(patient)

In [181]:
class Patient(object):
    def __init__(self, hadm_id, notes_df, prescription_df, lab_df, d_lab_df, \
                 med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker, \
                 physician_only=True):
        """ Patient representation
        
        med7_nlp:      spacy model from Med7
        umls_nlp:      spacy model with UMLS entity linker
        rxnorm_nlp:    spacy model with RxNorm entity linker
        umls_linker:   entity linker for UMLS, should already be linked to umls_nlp
        rxnorm_linker: entity linker for RxNorm, should already be linked to rxnorm_nlp
        """
        self.hadm_id = hadm_id
        self.physician_only = physician_only
        
        # this patient's data
        self.notes_df = self.filter_notes(notes_df.loc[notes_df['HADM_ID'] == hadm_id])
        self.prescription_df  = prescription_df.loc[prescription_df['HADM_ID'] == hadm_id]
        self.lab_df   = lab_df.loc[lab_df['HADM_ID'] == hadm_id]
        
        self.d_lab_df = d_lab_df # lab ditems df
        
        # spaCy models & entity linkers
        self.med7   = med7_nlp
        self.umls   = umls_nlp
        self.rxnorm = rxnorm_nlp
        self.umls_linker   = umls_linker
        self.rxnorm_linker = rxnorm_linker
        
        # A. Process notes
        notes = []
        for row_id in self.notes_df.ROW_ID:
            note = Note(self, row_id)
            notes.append(note)
        self.notes = notes
                
        # B. Process prescription info
        start, end = self._get_prescription_start_end_dt()  # get start/end dates
        self._process_prescription_sents()                  # get prescription info in sentence form

        # for each date, get all the prescriptions given and construct PrescriptionOrders
        delta = dt.timedelta(days=1)
        current = start
        prescriptions = []
        while current <= end:
            current_prescription_df = self.prescription_df.apply(lambda x: x.START_DT <= current and x.END_DT >= current, axis=1)
            if current_prescription_df.sum() > 0: # if there is at least 1
                prescription_order = PrescriptionOrders(self, current_prescription_df, current)
                prescriptions.append(prescription_order)

            current += delta # go to next date
        self.prescriptions = prescriptions

        # todo: process labs
        
        # Final. Process all data (notes, prescriptions, labs), map by date
#         start, end, (notes_dates, prescriptions_dates) = self._get_start_end_dt()
        start, end, all_dates = self._get_start_end_dt()
        delta = dt.timedelta(days=1)
        current = start
        dailydata = {}
        while current <= end:
            current_dailydata = self._get_current_dailydata(current, all_dates)
            if len(current_dailydata) > 0: 
                dailydata[current] = current_dailydata
            current += delta
        self.dailydata = dailydata
        
    def filter_notes(self, pat_notes_df):
        if self.physician_only: pat_notes_df = self._filter_physician(pat_notes_df)
        pat_notes_df = self._filter_duplicates(pat_notes_df)
        
        return pat_notes_df
    
    def _get_prescription_start_end_dt(self):
        # get datetimes for start and end dates
        start_dt = self.prescription_df.STARTDATE.apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").date())
        end_dt   = self.prescription_df.ENDDATE.apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").date())

        self.prescription_df[['START_DT']] = start_dt
        self.prescription_df[['END_DT']]   = end_dt

        # get earliest and latest dates
        start = min(start_dt)
        end   = max(end_dt)
        
        return start, end
    
    def _process_prescription_sents(self):
        # process sentence for each prescription
        # get_prescription_sent = lambda row: f"Patient was prescribed {row.DRUG.item()} {row.PROD_STRENGTH.item()} {row.ROUTE.item()} of total {row.DOSE_VAL_RX.item()} {row.DOSE_UNIT_RX.item()}"
        get_prescription_sent = lambda row: f"Patient was prescribed {row.DRUG} {row.PROD_STRENGTH} {row.ROUTE} of total {row.DOSE_VAL_RX} {row.DOSE_UNIT_RX}"
        prescription_sents = self.prescription_df.apply(get_prescription_sent, axis=1)
        self.prescription_df[['Sentence']] = prescription_sents
    
    def _filter_physician(self, pat_notes_df):
        # Filter for only physician notes
        return pat_notes_df.loc[pat_notes_df.CATEGORY == "Physician "]
        
    def _filter_duplicates(self, pat_notes_df):
        # Filtering out duplicate / autosave's -- only take the longest
        for cat in pat_notes_df.CATEGORY.unique(): 
            cat_notes_df = pat_notes_df.loc[pat_notes_df.CATEGORY == cat]
            for time in cat_notes_df.CHARTTIME.unique():
                time_notes_df = cat_notes_df.loc[cat_notes_df.CHARTTIME == time]
                if len(time_notes_df) > 1:
                    # get indices of first N-1 shortest rows
                    idx_to_drop = time_notes_df.TEXT.apply(lambda x: len(x)).sort_index().index[:-1]
                    pat_notes_df = pat_notes_df.drop(idx_to_drop) # drop by row index
                    
        return pat_notes_df

    def _get_start_end_dt(self, return_all_dates=True):
        """ Gets start and end datetimes across all data. todo: add labs"""
        notes_dates         = list(map(lambda x: x.time.date(), self.notes))
        prescriptions_dates = list(map(lambda x: x.date,        self.prescriptions))
        start = min(notes_dates + prescriptions_dates)
        end   = max(notes_dates + prescriptions_dates)
        
        return start, end, (notes_dates, prescriptions_dates)

    def _get_current_items(self, items, dates, current):
        """ Get items for current date """
        items_and_dates = zip(items, dates)
        current_items_and_dates = filter(lambda x: x[1] == current, items_and_dates)
        
        try:
            current_items, current_dates = list(zip(*current_items_and_dates))
            return list(current_items)
        except ValueError: # if current items don't exist
            return []
        
    def _get_current_dailydata(self, current, all_dates):
        """ Gets DailyData instances for current date.
        
        all_dates: iterable of dates, corresponding to self.[DailyData list]
        """
        notes_dates, prescriptions_dates = all_dates
        
        current_notes         = self._get_current_items(self.notes,         notes_dates,         current)
        current_prescriptions = self._get_current_items(self.prescriptions, prescriptions_dates, current)
        
        current_data = current_notes + current_prescriptions
        
        return current_data

# 3. Load and process data

In [14]:
# Load MIMIC tables
notes_df  = pd.read_csv('NOTEEVENTS.csv.gz',    compression='gzip', error_bad_lines=False)
drug_df   = pd.read_csv('PRESCRIPTIONS.csv.gz', compression='gzip', error_bad_lines=False)
lab_df    = pd.read_csv('LABEVENTS.csv.gz',     compression='gzip', error_bad_lines=False)
d_lab_df  = pd.read_csv('D_LABITEMS.csv.gz',    compression='gzip', error_bad_lines=False)

  interactivity=interactivity, compiler=compiler, result=result)
  interactivity=interactivity, compiler=compiler, result=result)


In [15]:
# Load HADM ID's with consecutive physician notes
if os.path.exists("hadm_ids.pkl"):
    with open("hadm_ids.pkl", "rb") as f:
        hadm_ids = pickle.load(f)
else:
    hadm_ids = []
    for hadm_id in tqdm(data.HADM_ID.unique()):
        hadm_data = data.loc[data.HADM_ID == hadm_id]
        hadm_phys_notes = hadm_data.loc[hadm_data.CATEGORY == "Physician "]

        if len(hadm_phys_notes) > 1:
            hadm_ids.append(hadm_id)

    with open("hadm_ids.pkl", "wb") as f:
        pickle.dump(hadm_ids, f)
        
print(f"There are {len(hadm_ids)} patients with consecutive physician notes.")

There are 8733 patients with consecutive physician notes.


## Example: Extracting similar topic sentence pairs for 1 patient

In [214]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [217]:
# test an example
hadm_id = hadm_ids[10] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# Create patient instance -- processes all the data
pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df, \
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker, \
              physician_only=True)

  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col

In [218]:
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

In [219]:
processed_pairs = []

# Iterate over all of the patient's DailyData instances (e.g. note, prescription order, lab results for same day)
for day, pat_dailydatas in pat.dailydata.items(): # pat_dailydatas is list of all DailyData instances for `day`
    print(f"********** Processing data for {day} **********")
    for dd in pat_dailydatas: # iterating over DailyData instances, e.g. dd=physician note taken on `day`
        # Collect all the daily datas (note, prescription orders, lab results) for current day
        current_dds = []
        current_dds_features = []
        current_dds_txts = []
        current_dds_sem_types = []
        current_dds_sem_names = []
        for dd in current_dailydatas:
            current_dds.extend(dd.datas)
            current_dds_features.extend(dd.datas_features)
            current_dds_txts.extend(dd.datas_txts)
            current_dds_sem_types.extend(dd.datas_semantic_types)
            current_dds_sem_names.extend(dd.datas_semantic_names)

        current_dds           = np.array(current_dds)
        current_dds_features  = np.array(current_dds_features)
        current_dds_txts      = np.array(current_dds_txts)
        current_dds_sem_types = np.array(current_dds_sem_types)
        current_dds_sem_names = np.array(current_dds_sem_names)

        # extract similar sentences for each semantic type
        for sem_type in SEMANTIC_TYPES:
            # data for this semantic type
            sem_type_bools   = [sem_type in x for x in current_dds_sem_types]
            sem_type_indices = np.where(sem_type_bools)[0]
            indices_map = dict(
                            zip(range(len(sem_type_indices)), 
                                sem_type_indices)
                          )  # maps regular indices in sem_type_current_dds_* lists to indices in current_dds_* lists

            sem_type_current_dds           = current_dds[sem_type_indices]
            sem_type_current_dds_features  = current_dds_features[sem_type_indices]
            sem_type_current_dds_txts      = current_dds_txts[sem_type_indices]
            sem_type_current_dds_sem_types = current_dds_sem_types[sem_type_indices]
            sem_type_current_dds_sem_names = current_dds_sem_names[sem_type_indices]

            # current_dds_featuresfor features (umls + rxnorm concepts)
            vectorizer = CountVectorizer()
            corpus = list(map(lambda x: ' '.join(x), sem_type_current_dds_features))
            X = vectorizer.fit_transform(corpus)
            X = X.toarray()

            # get cosine similarity using umls + rxnorm concepts
            similarity = cosine_similarity(X)     # larger=more similar
            sim_is, sim_js = np.where(similarity>0.5) # all pairs with at least 0.5 similarity

            for i, j in zip(sim_is, sim_js):
                # removing same sentence pairs, checking dates
                if i>j:
                    print(f"Cosine similarity: {similarity[i, j]}")
                    print("----- SENTENCE 1 -----")
                    print(f">> Time: {sem_type_current_dds[i].time}\n" +\
                          f">> Concepts: {sem_type_current_dds_features[i]}\n" +\
                          f">> {sem_type_current_dds_txts[i]}")
                    print("----- SENTENCE 2 -----")
                    print(f">> Time: {sem_type_current_dds[j].time}\n" +\
                          f">> Concepts: {sem_type_current_dds_features[j]}\n" +\
                          f">> {sem_type_current_dds_txts[j]}")
                    print("**********************************")

                    # save
                    processed_pairs.append([sem_type_current_dds_txts[i],     sem_type_current_dds_txts[j], \
                                            sem_type_current_dds[i].time,     sem_type_current_dds[j].time, \
                                            sem_type_current_dds_features[i], sem_type_current_dds_features[j], \
                                            similarity[i, j], SEMANTIC_TYPE_TO_NAME[sem_type]])
            #                                 SEMANTIC_TYPE_TO_NAME[semantic_type]])

###############
#### Final ####
###############        
df = \
pd.DataFrame(np.array(processed_pairs), \
             columns=["sentence 1", "sentence 2", \
                      "time 1", "time 2", \
                      "concepts 1", "concepts 2", \
                      "cosine similarity", "semantic type"])

df.to_csv(pt_csv)

print("Data has been saved!")

********** Processing data for 2143-04-21 **********
Cosine similarity: 0.8770580193070292
----- SENTENCE 1 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Lung hyperinflation', 'Structure of left lower lobe of lung'}
>> Compared to previous CXR, LLL infiltrate not    significantly changed; right lung hyperinflation more impressive.'
----- SENTENCE 2 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Pneumonia', 'Structure of left lower lobe of lung'}
>> CXR with LLL infiltrate, which may be    persitent radiograph manifestation of her previous pneumonia (may take    6-8 weeks to resolve).'
**********************************
Cosine similarity: 0.9999999999999999
----- SENTENCE 1 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Pneumonia', 'Structure of left lower lobe of lung'}
>> HPI:     [**Age over 90 382**]  year old woman hx of COPD, recent admit in early  [**Month (only) 102**]  with LLL    pneumonia.'
----- SENTENCE 2 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Pneumonia

# IGNORE

In [199]:
dd = current_dailydatas[0]

dd.datas
dd.datas_features
dd.datas_txts

# all the daily datas (note, prescription orders, lab results) for current day
current_dds = []
current_dds_features = []
current_dds_txts = []
current_dds_sem_types = []
current_dds_sem_names = []
for dd in current_dailydatas:
    current_dds.extend(dd.datas)
    current_dds_features.extend(dd.datas_features)
    current_dds_txts.extend(dd.datas_txts)
    current_dds_sem_types.extend(dd.datas_semantic_types)
    current_dds_sem_names.extend(dd.datas_semantic_names)
    
current_dds           = np.array(current_dds)
current_dds_features  = np.array(current_dds_features)
current_dds_txts      = np.array(current_dds_txts)
current_dds_sem_types = np.array(current_dds_sem_types)
current_dds_sem_names = np.array(current_dds_sem_names)

In [208]:
# extract similar sentences for each semantic type
processed_pairs = []
for sem_type in SEMANTIC_TYPES:
    # data for this semantic type
    sem_type_bools   = [sem_type in x for x in current_dds_sem_types]
    sem_type_indices = np.where(sem_type_bools)[0]
    indices_map = dict(
                    zip(range(len(sem_type_indices)), 
                        sem_type_indices)
                  )  # maps regular indices in sem_type_current_dds_* lists to indices in current_dds_* lists

    sem_type_current_dds           = current_dds[sem_type_indices]
    sem_type_current_dds_features  = current_dds_features[sem_type_indices]
    sem_type_current_dds_txts      = current_dds_txts[sem_type_indices]
    sem_type_current_dds_sem_types = current_dds_sem_types[sem_type_indices]
    sem_type_current_dds_sem_names = current_dds_sem_names[sem_type_indices]

    # current_dds_featuresfor features (umls + rxnorm concepts)
    vectorizer = CountVectorizer()
    corpus = list(map(lambda x: ' '.join(x), sem_type_current_dds_features))
    X = vectorizer.fit_transform(corpus)
    X = X.toarray()

    # get cosine similarity using umls + rxnorm concepts
    similarity = cosine_similarity(X)     # larger=more similar
    sim_is, sim_js = np.where(similarity>0.5) # all pairs with at least 0.5 similarity

    for i, j in zip(sim_is, sim_js):
        # removing same sentence pairs, checking dates
        if i>j:
            print(f"Cosine similarity: {similarity[i, j]}")
            print("----- SENTENCE 1 -----")
            print(f">> Time: {sem_type_current_dds[i].time}\n" +\
                  f">> Concepts: {sem_type_current_dds_features[i]}\n" +\
                  f">> {sem_type_current_dds_txts[i]}")
            print("----- SENTENCE 2 -----")
            print(f">> Time: {sem_type_current_dds[j].time}\n" +\
                  f">> Concepts: {sem_type_current_dds_features[j]}\n" +\
                  f">> {sem_type_current_dds_txts[j]}")
            print("**********************************")

            # save
            processed_pairs.append([sem_type_current_dds_txts[i],     sem_type_current_dds_txts[j], \
                                    sem_type_current_dds[i].time,     sem_type_current_dds[j].time, \
                                    sem_type_current_dds_features[i], sem_type_current_dds_features[j], \
                                    similarity[i, j], SEMANTIC_TYPE_TO_NAME[sem_type]])
    #                                 SEMANTIC_TYPE_TO_NAME[semantic_type]])

###############
#### Final ####
###############        
df = \
pd.DataFrame(np.array(processed_pairs), \
             columns=["sentence 1", "sentence 2", \
                      "time 1", "time 2", \
                      "concepts 1", "concepts 2", \
                      "cosine similarity", "semantic type"])


Cosine similarity: 0.8770580193070292
----- SENTENCE 1 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Lung hyperinflation', 'Structure of left lower lobe of lung'}
>> Compared to previous CXR, LLL infiltrate not    significantly changed; right lung hyperinflation more impressive.'
----- SENTENCE 2 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Pneumonia', 'Structure of left lower lobe of lung'}
>> CXR with LLL infiltrate, which may be    persitent radiograph manifestation of her previous pneumonia (may take    6-8 weeks to resolve).'
**********************************
Cosine similarity: 0.9999999999999999
----- SENTENCE 1 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Pneumonia', 'Structure of left lower lobe of lung'}
>> HPI:     [**Age over 90 382**]  year old woman hx of COPD, recent admit in early  [**Month (only) 102**]  with LLL    pneumonia.'
----- SENTENCE 2 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Pneumonia', 'Structure of left lower lobe of lung'}
>> CXR wit

In [187]:
processed_pairs = []

# current_dds_featuresfor features (umls + rxnorm concepts)
vectorizer = CountVectorizer()
corpus = list(map(lambda x: ' '.join(x), current_dds_features))
X = vectorizer.fit_transform(corpus)
X = X.toarray()

# get cosine similarity using umls + rxnorm concepts
similarity = cosine_similarity(X)     # larger=more similar
sim_is, sim_js = np.where(similarity>0.5) # all pairs with at least 0.5 similarity

for i, j in zip(sim_is, sim_js):
    # removing same sentence pairs, checking dates
    if i>j:
        print(f"Cosine similarity: {similarity[i, j]}")
        print("----- SENTENCE 1 -----")
        print(f">> Time: {current_dds[i].time}\n" +\
              f">> Concepts: {current_dds_features[i]}\n" +\
              f">> {current_dds_txts[i]}")
        print("----- SENTENCE 2 -----")
        print(f">> Time: {current_dds[j].time}\n" +\
              f">> Concepts: {current_dds_features[j]}\n" +\
              f">> {current_dds_txts[j]}")
        print("**********************************")

        # save
        processed_pairs.append([current_dds_txts[i], current_dds_txts[j], \
                                current_dds[i].time, current_dds[j].time, \
                                current_dds_features[i], current_dds_features[j], \
                                similarity[i, j], None])
#                                 SEMANTIC_TYPE_TO_NAME[semantic_type]])
        
###############
#### Final ####
###############        
df = \
pd.DataFrame(np.array(processed_pairs), \
             columns=["sentence 1", "sentence 2", \
                      "time 1", "time 2", \
                      "concepts 1", "concepts 2", \
                      "cosine similarity", "semantic type"])


Cosine similarity: 0.8770580193070292
----- SENTENCE 1 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Lung hyperinflation', 'Structure of left lower lobe of lung'}
>> Compared to previous CXR, LLL infiltrate not    significantly changed; right lung hyperinflation more impressive.'
----- SENTENCE 2 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Pneumonia', 'Structure of left lower lobe of lung'}
>> CXR with LLL infiltrate, which may be    persitent radiograph manifestation of her previous pneumonia (may take    6-8 weeks to resolve).'
**********************************
Cosine similarity: 0.9999999999999999
----- SENTENCE 1 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Pneumonia', 'Structure of left lower lobe of lung'}
>> HPI:     [**Age over 90 382**]  year old woman hx of COPD, recent admit in early  [**Month (only) 102**]  with LLL    pneumonia.'
----- SENTENCE 2 -----
>> Time: 2131-12-23 23:51:00
>> Concepts: {'Pneumonia', 'Structure of left lower lobe of lung'}
>> CXR wit

In [None]:
# for semantic_type in SEMANTIC_TYPE_TO_NAME:
#     # filter sreps by semantic type
#     # semantic_type = "T034"
#     semantic_sreps = list(filter(lambda x: semantic_type in x.semantic_types, all_sreps))

#     # filter sreps & corresponding datetimes by conflict type
#     all_conflict_sreps_and_times = list(zip(all_sreps, all_srep_times))
#     semantic_sreps_and_times = list(filter(lambda x: semantic_type in x[0].semantic_types, \
#                                            all_conflict_sreps_and_times))
    
#     if len(semantic_sreps_and_times) == 0:
#         continue
#     semantic_sreps, semantic_sreps_times = list(zip(*semantic_sreps_and_times)) # unzip

#     print(f"We have {len(semantic_sreps)} sentences that are \"{SEMANTIC_TYPE_TO_NAME[semantic_type]}\" related.")

#     # get canonical names for these sentences
#     semantic_sreps_canon_names = list(map(lambda x: x.canonical_names, semantic_sreps))

#     # get cosine sim
#     vectorizer = CountVectorizer()
#     corpus = list(map(lambda x: ' '.join(x), semantic_sreps_canon_names))
#     X = vectorizer.fit_transform(corpus)
#     X = X.toarray()

#     sim_X = cosine_similarity(X) # larger=more similar
#     simx, simy = np.where(sim_X>0.5)

#     for x, y in zip(simx, simy):
#         # removing same sentence pairs, checking dates
#         if x>y and is_comparable_time(semantic_sreps_times[x], semantic_sreps_times[y], semantic_type):
#             print(f"Cosine similarity: {sim_X[x, y]}")
#             print("----- SENTENCE 1 -----")
#             print(f">> Time: {semantic_sreps_times[x]}\n" +\
#                   f">> Concepts: {semantic_sreps_canon_names[x]}\n" +\
#                   f">> {semantic_sreps[x].doc}")
#             print("----- SENTENCE 2 -----")
#             print(f">> Time: {semantic_sreps_times[y]}\n" +\
#                   f">> Concepts: {semantic_sreps_canon_names[y]}\n" +\
#                   f">> {semantic_sreps[y].doc}")
#             print("**********************************")

#             # save
#             processed_pairs.append([semantic_sreps[x].doc, semantic_sreps[y].doc, \
#                                     semantic_sreps_times[x], semantic_sreps_times[y], \
#                                     semantic_sreps_canon_names[x], semantic_sreps_canon_names[y], \
#                                     sim_X[x, y], SEMANTIC_TYPE_TO_NAME[semantic_type]])
        
        
# df = \
# pd.DataFrame(np.array(processed_pairs), \
#              columns=["sentence 1", "sentence 2", \
#                       "time 1", "time 2", \
#                       "concepts 1", "concepts 2", \
#                       "cosine similarity", "semantic type"])

# df.to_csv(pt_csv)

In [27]:
# def get_umls_info(self):
#     for ent in self.umls_doc.ents: # extract info (umls) for each entity
#         # todo: look into this bug, ent._.umls_ents sometimes empty list
#         try:
#             cui, _ = ent._.umls_ents[0] # assuming `max_entites_per_mention=1` for now
#         except IndexError:
#             continue
#         cui_info = self.umls_cui_map[cui]

#         ent_valid_type_list = [t in self.filter_map for t in cui_info.types]
#         ent_valid_type = any(ent_valid_type_list) # checks if entity is a valid type

#         if not self.is_filter or ent_valid_type: # only add to list if we're not filtering of it's valid
#             self.canonical_names.append(cui_info.canonical_name)
#             for (stype, keep) in zip(cui_info.types, ent_valid_type_list):
#                 if keep:
#                     self.semantic_types.append(stype)
#                     self.semantic_names.append(self.filter_map[stype])
#
# self.umls_cui_map = dailydata.umls_linker.umls.cui_to_entity # maps CUI to entity information


rxnorm_doc = rxnorm_nlp(prescript.txt)

rxnorm_doc.ents

(Patient, prescribed, Warfarin 1, Tablet, PO)

In [50]:
cui, _ = rxnorm_doc.ents[2]._.kb_ents[0]

cui_info = rxnorm_linker.kb.cui_to_entity[cui]

cui_info

CUI: C0043031, Name: Warfarin
Definition: An anticoagulant that acts by inhibiting the synthesis of vitamin K-dependent coagulation factors. Warfarin is indicated for the prophylaxis and/or treatment of venous thrombosis and its extension, pulmonary embolism, and atrial fibrillation with embolization. It is also used as an adjunct in the prophylaxis of systemic embolism after myocardial infarction. Warfarin is also used as a rodenticide.
TUI(s): T109, T121, T131
Aliases: (total: 0): 
	 

In [52]:
cui_info.types

['T109', 'T121', 'T131']

In [53]:
cui_info.canonical_name

'Warfarin'

In [None]:
#         cui_info = self.umls_cui_map[cui]

#         ent_valid_type_list = [t in self.filter_map for t in cui_info.types]
#         ent_valid_type = any(ent_valid_type_list) # checks if entity is a valid type

#         if not self.is_filter or ent_valid_type: # only add to list if we're not filtering of it's valid
#             self.canonical_names.append(cui_info.canonical_name)
#             for (stype, keep) in zip(cui_info.types, ent_valid_type_list):
#                 if keep:
#                     self.semantic_types.append(stype)
#                     self.semantic_names.append(self.filter_map[stype])


In [148]:
scispacy.__version__

'0.2.5'

In [149]:
sys.executable

'/opt/conda/envs/opennotes/bin/python'

In [152]:
!/opt/conda/envs/opennotes/bin/python -m pip install scispacy==0.3.0
!pip install scispacy==0.3.0



In [153]:
import scispacy
scispacy.__version__

'0.2.5'

In [141]:
# sci_nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "name": "rxnorm"})
# sci_nlp.add_pipe("rxnorm")

from scispacy.linking import EntityLinker
rxnorm_linker = EntityLinker(resolve_abbreviations=True, name="rxnorm")

https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/rxnorm/tfidf_vectors_sparse.npz not found in cache, downloading to /tmp/tmpdts9w6h0
Finished download, copying /tmp/tmpdts9w6h0 to cache at /home/yutsumi/.scispacy/datasets/2836c7529452f6f1c62f0339c20ddd2ae3284efbde00ccdb59dd98934e31e141.54c2128ad43025b0d3470c71188312dec38ad8255e9ea90b3c21af636ef505a8.tfidf_vectors_sparse.npz
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/rxnorm/nmslib_index.bin not found in cache, downloading to /tmp/tmptbeqdctj
Finished download, copying /tmp/tmptbeqdctj to cache at /home/yutsumi/.scispacy/datasets/3c88f806626ee8a4f9dcf5204b566c1400f53b07cd29782af99d868671fe73cf.75b9290d0d47f342fa0d81f79a90588798c509c20f83ddbeb72524da86078982.nmslib_index.bin
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/rxnorm/tfidf_vectorizer.joblib not found in cache, downloading to /tmp/tmpz3sggt4n
Finished download, copying /tmp/tmpz3sggt4n to cache at /home/yutsumi/.scispacy/datasets/fa1b6c28de8239ff52c7



https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/rxnorm/concept_aliases.json not found in cache, downloading to /tmp/tmpn0nckpfo
Finished download, copying /tmp/tmpn0nckpfo to cache at /home/yutsumi/.scispacy/datasets/8ced845e6b800c149ddd9ac4553c691aaba66fead4cd1db6fefc90469d3039db.0c782b6639e832579f5004f8025b95c976b8fba32a121f8205e8295e17515e13.concept_aliases.json
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/umls_2020_human_phenotype_ontology.jsonl not found in cache, downloading to /tmp/tmpnasnna5y
Finished download, copying /tmp/tmpnasnna5y to cache at /home/yutsumi/.scispacy/datasets/3f1640731187be93f569b85c7832da11b7a3d5c14595d30a03cfe375ec1b757b.b76cee53de23bd99dde8721a79bbc3fae46c47f2c6b27afcb7b6e30d737cd0f9.umls_2020_human_phenotype_ontology.jsonl


In [154]:
prescript.txt

'Patient was prescribed Warfarin 1mg Tablet PO of total 1 mg'

In [144]:
sci_nlp = spacy.load("en_core_sci_sm")

In [145]:
sci_nlp.add_pipe(rxnorm_linker)

In [147]:
sci_nlp(prescript.txt)

KeyError: 'Lamictal XR Blue Patient Titration Kit (for Patients Taking Valproate)'

In [126]:
list(map(lambda x: x.features, current_dailydata[0].sentence_reps))
# list(map(lambda x: x.features, current_dailydata[2].prescription_data))
list(map(lambda x: x.get_med7_info(), current_dailydata[2].prescription_data))

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]

In [79]:
# # Patient -- get minimum/maximum date
# notes_dates = list(map(lambda x: x.time.date(), pat.notes))
# prescriptions_dates = list(map(lambda x: x.date, pat.prescriptions))
# start = min(notes_dates + prescriptions_dates)
# end   = max(notes_dates + prescriptions_dates)

# # Patient -- get all Note & PrescriptionOrder instances for date
# current = start

# # get all items for current date
# def get_current_items(items, dates, current):
#     items_and_dates = zip(items, dates)
#     current_items_and_dates = filter(lambda x: x[1] == current, items_and_dates)
#     current_items, current_dates = list(zip(*current_items_and_dates))
    
#     return current_items
    
# items = pat.notes
# dates = notes_dates
# current_notes = get_current_items(items, dates, current)

# items = pat.prescriptions
# dates = prescriptions_dates
# current_prescriptions = get_current_items(items, dates, current)

# current_data = current_notes + current_prescriptions

In [26]:
note = pat.notes[0]

for sent_rep in note.sentence_reps:
    sent_rep.canonical_names

In [22]:
def scripts_to_sentence(id, script_df):
  """
  id: HADM_ID for patient
  script_df: data frame of prescription table

  returns: dictionary: day -> set of sentences describing prescriptions given that day
  """
  template = "Patient given {}."
  # for each patient admission, create map of drug to its dosage info
  daily_drugs = {}
  # get patient specific data
  patient_data = script_df.loc[script_df['HADM_ID']==id]
  unique_drugs = patient_data.DRUG.unique()
  # record info for each unique drug the patient is on
  for drug in unique_drugs:
    drug_data = patient_data.loc[patient_data['DRUG']==drug]
    for i, row in drug_data.iterrows():
      starttime = row['STARTDATE']
      startday = starttime.split()[0]
      if startday not in daily_drugs:
        daily_drugs[startday] = set()
      daily_drugs[startday].add(template.format(drug.lower()))
  return daily_drugs

In [91]:
## Processing patient's drug info

# get datetimes for start and end dates
start_dt = pat.drug_df.STARTDATE.apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").date())
end_dt   = pat.drug_df.ENDDATE.apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").date())

pat.drug_df[['START_DT']] = start_dt
pat.drug_df[['END_DT']]   = end_dt

# get earliest and latest dates
start = min(start_dt)
end   = max(end_dt)

# process sentence for each prescription
# get_drug_sent = lambda row: f"Patient was prescribed {row.DRUG.item()} {row.PROD_STRENGTH.item()} {row.ROUTE.item()} of total {row.DOSE_VAL_RX.item()} {row.DOSE_UNIT_RX.item()}"
get_drug_sent = lambda row: f"Patient was prescribed {row.DRUG} {row.PROD_STRENGTH} {row.ROUTE} of total {row.DOSE_VAL_RX} {row.DOSE_UNIT_RX}"
drug_sents = pat.drug_df.apply(get_drug_sent, axis=1)
pat.drug_df[['Sentence']] = drug_sents

# for each date, get all the prescriptions given and construct PrescriptionOrders
delta = dt.timedelta(days=1)
current = start
while current <= end:
    current_drug_df = pat.drug_df.apply(lambda x: x.START_DT <= current and x.END_DT >= current, axis=1)
    prescription_order = PrescriptionOrders(pat, current_drug_df)
    
    current += delta # go to next date

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in t

<__main__.PrescriptionOrders at 0x7f3196e269d0>

In [34]:
# Same day comparisons only for now
'A patient was prescribed Magnesium hydroxide 400mg/5ml suspension PO of total 30ml bid for the next 5 days.'


'A patient was prescribed Magnesium hydroxide 400mg/5ml suspension PO of total 30ml bid for the next 5 days.'

In [40]:
f"Patient was prescribed {row.DRUG.item()} {row.PROD_STRENGTH.item()} {row.ROUTE.item()} of total {row.DOSE_VAL_RX.item()} {row.DOSE_UNIT_RX.item()}"

'Patient was prescribed Midazolam 2mg/2mL Vial IV of total 0.25-1.5 mg'

In [32]:
row = pat.drug_df.loc[pat.drug_df.ROW_ID == 899285]

row.STARTDATE.item()
row.ENDDATE.item()

datetime.strptime(row.STARTDATE.item(), "%Y-%m-%d %H:%M:%S").date()

datetime.date(2131, 12, 24)

In [23]:
scripts_to_sentence(pat.hadm_id, pat.drug_df)

{'2131-12-24': {'Patient given 0.9% sodium chloride.',
  'Patient given 5% dextrose.',
  'Patient given chlorhexidine gluconate 0.12% oral rinse.',
  'Patient given clotrimazole cream.',
  'Patient given d5w.',
  'Patient given fentanyl citrate.',
  'Patient given heparin sodium.',
  'Patient given heparin.',
  'Patient given insulin.',
  'Patient given iso-osmotic dextrose.',
  'Patient given magnesium sulfate.',
  'Patient given midazolam.',
  'Patient given norepinephrine.',
  'Patient given nystatin oral suspension.',
  'Patient given piperacillin-tazobactam na.',
  'Patient given sw.',
  'Patient given vancomycin.',
  'Patient given vasopressin.'},
 '2131-12-26': {'Patient given 5% dextrose.',
  'Patient given albuterol 0.083% neb soln.',
  'Patient given ceftriaxone.',
  'Patient given ipratropium bromide neb.',
  'Patient given iso-osmotic dextrose.',
  'Patient given pantoprazole.',
  'Patient given sodium chloride 0.9%  flush.',
  'Patient given vancomycin.'},
 '2131-12-25': {

In [21]:
pat.drug_df

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ICUSTAY_ID,STARTDATE,ENDDATE,DRUG_TYPE,DRUG,DRUG_NAME_POE,DRUG_NAME_GENERIC,FORMULARY_DRUG_CD,GSN,NDC,PROD_STRENGTH,DOSE_VAL_RX,DOSE_UNIT_RX,FORM_VAL_DISP,FORM_UNIT_DISP,ROUTE
1942733,899285,26601,155131,276330.0,2131-12-24 00:00:00,2131-12-27 00:00:00,MAIN,Midazolam,Midazolam,Midazolam,MIDA2I,003779,1.001900e+10,2mg/2mL Vial,0.25-1.5,mg,0.125-0.75,VIAL,IV
1942734,900007,26601,155131,276330.0,2131-12-24 00:00:00,2131-12-27 00:00:00,BASE,5% Dextrose,,,D5W250,001972,3.380017e+08,250mL Bag,250,mL,250,mL,IV DRIP
1942735,899284,26601,155131,276330.0,2131-12-24 00:00:00,2131-12-29 00:00:00,MAIN,Nystatin Oral Suspension,Nystatin Oral Suspension,Nystatin Oral Suspension,NYST5L,009537,4.725004e+08,"500,000 Unit UDCUP",5,mL,1,UDCUP,PO
1942736,899286,26601,155131,276330.0,2131-12-24 00:00:00,2131-12-29 00:00:00,MAIN,Clotrimazole Cream,Clotrimazole Cream,Clotrimazole Cream,CLOT130C,007361,9.047822e+08,1% 30g Cream,1,Appl,1,TUBE,TP
1942737,899281,26601,155131,276330.0,2131-12-24 00:00:00,2131-12-29 00:00:00,MAIN,Insulin,Insulin,Insulin - Sliding Scale,INSULIN,027413,0.000000e+00,Dummy Package for Sliding Scale,0,UNIT,0,VIAL,SC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1945160,900022,26601,155131,276330.0,2131-12-24 00:00:00,2131-12-26 00:00:00,MAIN,Vancomycin,,,VANC1F,043952,3.383552e+08,1g Frozen Bag,1000,mg,1,BAG,IV
1945161,900021,26601,155131,276330.0,2131-12-24 00:00:00,2131-12-26 00:00:00,MAIN,Piperacillin-Tazobactam Na,,,ZOSY2.25I,021185,2.068852e+08,2.25 g Frozen Bag,2.25,g,1,BAG,IV
1945162,900020,26601,155131,276330.0,2131-12-24 00:00:00,2131-12-26 00:00:00,MAIN,Norepinephrine,,,LEVO4I,028633,7.031153e+08,4mg/4mL Amp,8,mg,2,AMP,IV DRIP
1945163,900018,26601,155131,276330.0,2131-12-24 00:00:00,2131-12-26 00:00:00,MAIN,Fentanyl Citrate,,,FENT2.5I,041385,1.001900e+10,2.5mg/50mL Vial,2.5,mg,50,mL,IV DRIP


In [19]:
pat.lab_df

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ITEMID,CHARTTIME,VALUE,VALUENUM,VALUEUOM,FLAG
16083806,16462787,26601,155131.0,50868,2131-12-23 16:50:00,15,15.00,mEq/L,
16083807,16462788,26601,155131.0,50882,2131-12-23 16:50:00,28,28.00,mEq/L,
16083808,16462789,26601,155131.0,50893,2131-12-23 16:50:00,7.8,7.80,mg/dL,abnormal
16083809,16462790,26601,155131.0,50902,2131-12-23 16:50:00,94,94.00,mEq/L,abnormal
16083810,16462791,26601,155131.0,50912,2131-12-23 16:50:00,6.3,6.30,mg/dL,abnormal
...,...,...,...,...,...,...,...,...,...
16290809,16463068,26601,155131.0,51274,2131-12-29 06:07:00,28.9,28.90,sec,abnormal
16290810,16463069,26601,155131.0,51275,2131-12-29 06:07:00,54.6,54.60,sec,abnormal
16290811,16463070,26601,155131.0,51277,2131-12-29 06:07:00,18.2,18.20,%,abnormal
16290812,16463071,26601,155131.0,51279,2131-12-29 06:07:00,3.18,3.18,m/uL,abnormal


In [107]:
# for each note/lab/prescription:
    

<__main__.Note at 0x7f0b258f3ed0>

In [None]:
# get cosine sim
vectorizer = CountVectorizer()
corpus = list(map(lambda x: ' '.join(x), semantic_sreps_canon_names))
X = vectorizer.fit_transform(corpus)
X = X.toarray()


In [96]:
for note in pat.notes:
    print(note.time)

2131-12-23 23:51:00
2131-12-23 22:56:00
2131-12-24 11:44:00
2131-12-24 07:33:00
2131-12-25 09:37:00
2131-12-25 07:56:00
2131-12-26 07:42:00
2131-12-26 10:04:00


In [62]:
pat.notes

[<__main__.Note at 0x7f0b1a142e50>,
 <__main__.Note at 0x7f0b258f1d10>,
 <__main__.Note at 0x7f0b08c12c10>,
 <__main__.Note at 0x7f0b07e5e350>,
 <__main__.Note at 0x7f0b07c36290>,
 <__main__.Note at 0x7f0b04897ed0>,
 <__main__.Note at 0x7f0b066ec190>,
 <__main__.Note at 0x7f0b041956d0>]

In [66]:
pat.notes[0].sentence_reps

[<__main__.Sentence at 0x7f0b258f1410>,
 <__main__.Sentence at 0x7f0b258f1310>,
 <__main__.Sentence at 0x7f0b229830d0>,
 <__main__.Sentence at 0x7f0b2597d190>,
 <__main__.Sentence at 0x7f0b258ed050>,
 <__main__.Sentence at 0x7f0b0b7c47d0>,
 <__main__.Sentence at 0x7f0b2593d5d0>,
 <__main__.Sentence at 0x7f0b258da390>,
 <__main__.Sentence at 0x7f0b0a13ecd0>,
 <__main__.Sentence at 0x7f0b09b14450>,
 <__main__.Sentence at 0x7f0b1b805490>,
 <__main__.Sentence at 0x7f0b0b902a50>,
 <__main__.Sentence at 0x7f0b09a38810>,
 <__main__.Sentence at 0x7f0b0a13ee10>,
 <__main__.Sentence at 0x7f0b0ac2c7d0>,
 <__main__.Sentence at 0x7f0b0abbebd0>,
 <__main__.Sentence at 0x7f0b0a13efd0>,
 <__main__.Sentence at 0x7f0b1b20ca50>,
 <__main__.Sentence at 0x7f0b0aad3390>,
 <__main__.Sentence at 0x7f0b0aad3210>,
 <__main__.Sentence at 0x7f0b0aad33d0>,
 <__main__.Sentence at 0x7f0b09718c10>,
 <__main__.Sentence at 0x7f0b25917090>,
 <__main__.Sentence at 0x7f0b0abbea90>,
 <__main__.Sentence at 0x7f0b09718f10>,


In [81]:
sent_rep = pat.notes[0].sentence_reps[4]
sent_rep.txt

"Potassium up slightly.'"