In [1]:
# Import libraries
import os
import re
import random
import pickle
import subprocess
import numpy as np
import pandas as pd
import datetime as dt

from tqdm import tqdm
from datetime import datetime
from collections import Counter

# 1. Setup concept extractors

The two options we considered were [MetaMap](https://metamap.nlm.nih.gov/) and [spaCy](https://spacy.io/).

[MetaMap](https://metamap.nlm.nih.gov/) is specific to recognizing UMLS concepts. There is a [Python wrapper](https://github.com/AnthonyMRios/pymetamap), but it has a reputation for being slow and hard to work with.

[spaCy](https://spacy.io/) is a popular NLP Python package with an extensive library for named entity recognition. It has a wide variety of [extensions](https://spacy.io/universe) and models to choose from. We're going with the following.

* [scispaCy](https://spacy.io/universe/project/scispacy) contains spaCy models for processing biomedical, scientific or clinical text. It seems easy to use and has a wide variety of concepts it can recognize, including UMLS, RxNorm, etc.

* [negspaCy](https://spacy.io/universe/project/negspacy) identifies negated entities using an extension of the NegEx algorithm, which is based on regular expressions. It should be useful for distinguishing, e.g., "this pt is diabetic" from "this pt is not diabetic." [todo: medspaCy's negation detection might be better, https://github.com/medspacy/medspacy]

* [Med7](https://github.com/kormilitzin/med7) is a model trained for recognizing entities in prescription text, e.g. identifies drug name, dosage, duration, etc., which could be useful stuff to check for conflicts. 

We're going with spaCy for this, and will come up with a coherent way to integrate the entities picked up by these three extensions/models.

## i) Installations

In [2]:
import sys; sys.executable

'/opt/conda/envs/opennotes/bin/python'

In [3]:
import spacy
import scispacy

from pprint import pprint
from collections import OrderedDict

from spacy import displacy
# from scispacy.abbreviation import AbbreviationDetector # required in the pipeline for resolve_abbreviations to have any effect
from scispacy.umls_linking import UmlsEntityLinker

# should be 2.3.5 and >=0.3.0
spacy.__version__, scispacy.__version__

('2.3.5', '0.3.0')

## ii) Setting up the model

The model is used to form word/sentence embeddings for the NER task. It is therefore important to choose a model that has been tuned for our specific use case (e.g. clinical text, prescription information), so that the embeddings are useful for recognizing entities.

[Note to self: if time permits, look into using a custom model in the spaCy pipeline. Could we do something with the Romanov models, since they were trained specifically for conflict detection? See https://spacy.io/usage/v3]

### a) scispaCy

For scispaCy, we set up one of their models that has been trained on biomedical data. Other models can be found [here](https://allenai.github.io/scispacy/). 

We load the model twice because we will attach a different entity linker (a knowledge base that links text to named entities) to each pipeline later.

In [4]:
## uncomment to install model if not already installed
# !/opt/conda/envs/opennotes/bin/python -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz

In [5]:
# for umls (general biomedical concepts)
umls_nlp   = spacy.load("en_core_sci_sm")

# for rxnorm (prescriptions)
rxnorm_nlp = spacy.load("en_core_sci_sm")

### b) Med7

For Med7, we set up their model that has been trained specifically for NER of medication-related concepts: dosage, drug names, duration, form, frequency, route of administration, and strength. The model is trained on MIMIC-III, so it should work well for us.

In [6]:
# # installs Med7 model
# !/opt/conda/envs/opennotes/bin/python3.7 -m pip install https://www.dropbox.com/s/xbgsy6tyctvrqz3/en_core_med7_lg.tar.gz?dl=1

In [7]:
sys.executable

'/opt/conda/envs/opennotes/bin/python'

In [8]:
med7_nlp = spacy.load("en_core_med7_lg")

## iii) Adding an entity linker

The EntityLinker is a pipeline component that links each detected entity to concepts in a knowledge base. It compares the entity text with the concept names in the specified knowledge base; scispaCy's UMLS linker, for example, runs an approximate nearest-neighbor search over character n-gram (TF-IDF) vectors and can optionally resolve abbreviations first.

[Note: Each mention generally resolves to a list of candidate entities. This [blog post](http://sujitpal.blogspot.com/2020/08/disambiguating-scispacy-umls-entities.html) describes one way to disambiguate them by finding the "most likely" set of entities. We'll start by simply resolving to the first candidate and hope that is sufficient.]
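
To make the linker parameters concrete, here is a toy sketch of candidate generation by character-trigram overlap. It is not scispaCy's implementation: the tiny `KB` dictionary with placeholder ids, the Jaccard score, and the `link` helper are all hypothetical, chosen only to illustrate what `k`, `threshold`, and `max_entities_per_mention` control.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Overlap between two n-gram sets, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical mini knowledge base: placeholder id -> canonical name
KB = {
    "ID1": "pneumonia",
    "ID2": "pneumothorax",
    "ID3": "chronic kidney disease",
}


def link(mention, k=10, threshold=0.7, max_entities_per_mention=1):
    """Score every KB entry, keep the k nearest, drop those below
    `threshold`, and return at most `max_entities_per_mention` of them."""
    grams = char_ngrams(mention)
    scored = sorted(
        ((jaccard(grams, char_ngrams(name)), kb_id) for kb_id, name in KB.items()),
        reverse=True,
    )
    candidates = [(kb_id, score) for score, kb_id in scored[:k] if score >= threshold]
    return candidates[:max_entities_per_mention]


print(link("pneumonia"))   # exact name match
print(link("pneumonias"))  # near match
print(link("fracture"))    # nothing similar enough, so no candidates
```

Lowering `threshold` or raising `max_entities_per_mention` returns more (and noisier) candidates per mention, which is the trade-off flagged by the "todo: tune" comments in the linker setup.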

### a) scispaCy

#### UMLS Linker

The UMLS linker maps each entity to a UMLS concept. The parts we are mainly interested in are the semantic type and the concept itself (mostly its common name; the CUI may become important later).

* _Semantic type_ is the broader category that the entity falls under, e.g. disease, pharmacologic substance, etc. See [this](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt) for a full list.

* _Concepts_ refer to the more fundamental entity itself, e.g. pneumothorax, ventilator, etc. Many concepts can fall under a semantic type.

More info on `UmlsEntityLinker` ([source code](https://github.com/allenai/scispacy/blob/4ade4ec897fa48c2ecf3187caa08a949920d126d/scispacy/linking.py#L9))

See source code for `.jsonl` file with the knowledge base.

In [9]:
from scispacy.umls_linking import UmlsEntityLinker

# NOTE: resolve_abbreviations=True only has an effect if an AbbreviationDetector is in the pipeline.
# To enable it, import AbbreviationDetector and uncomment the next two lines.
# abbreviation_pipe = AbbreviationDetector(umls_nlp)
# umls_nlp.add_pipe(abbreviation_pipe)
umls_linker = UmlsEntityLinker(k=10,                          # number of nearest neighbors to look up
                               threshold=0.7,                 # minimum similarity for a candidate to be kept
                               max_entities_per_mention=1,    # number of entities returned per mention (todo: tune)
                               filter_for_definitions=False,  # keep entities that lack a definition
                               resolve_abbreviations=True)    # resolve abbreviations before linking
umls_nlp.add_pipe(umls_linker)



#### RxNorm Linker

The RxNorm linker maps entities to RxNorm, an ontology of clinical drug names. It contains about 100k concepts for normalized names of clinical drugs. It comprises several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.

More info on `RxNorm` ([NIH page](https://www.nlm.nih.gov/research/umls/rxnorm/index.html), [source code](https://github.com/allenai/scispacy/blob/2290a80cfe0948e48d8ecfbd60064019d57a6874/scispacy/linking_utils.py#L120))

See source code for `.jsonl` file with the knowledge base.

In [10]:
from scispacy.linking import EntityLinker

# rxnorm_linker = EntityLinker(resolve_abbreviations=True, name="rxnorm")
rxnorm_linker = EntityLinker(k=10,                          # number of nearest neighbors to look up from
                             threshold=0.7,                 # confidence threshold to be added as candidate
                             max_entities_per_mention=1,    # number of entities returned per mention (todo: tune)
                             filter_for_definitions=False,  # no definition is OK
                             resolve_abbreviations=True,    # resolve abbreviations before linking
                             name="rxnorm")                 # RxNorm ontology

rxnorm_nlp.add_pipe(rxnorm_linker)



### b) Med7 

Med7 does not need an entity linker.

### c) Negspacy [TODO]
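
Until this step is filled in, here is a rough sketch of what NegEx-style negation detection does: it looks for a trigger phrase shortly before a concept mention. The trigger list, the word window, and the `is_negated` helper are hypothetical simplifications for illustration, not negspaCy's API.

```python
import re

# Hypothetical, deliberately short trigger list; real NegEx lists are much longer
NEGATION_TRIGGERS = ["no", "not", "denies", "without", "negative for"]
TRIGGER_RE = re.compile(r"\b(" + "|".join(NEGATION_TRIGGERS) + r")\b", re.IGNORECASE)


def is_negated(sentence, concept, window=5):
    """Return True if a trigger occurs within `window` words before `concept`."""
    match = re.search(re.escape(concept), sentence, re.IGNORECASE)
    if match is None:
        return False  # concept not mentioned at all
    preceding_words = sentence[:match.start()].split()[-window:]
    return TRIGGER_RE.search(" ".join(preceding_words)) is not None


print(is_negated("this pt is diabetic", "diabetic"))      # False
print(is_negated("this pt is not diabetic", "diabetic"))  # True
```

A rule this simple misses scope terminators ("no fever but has cough") and post-concept triggers, which is part of why a dedicated library is preferable.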

# 2. Setup data structures

## Categorizing type of conflict

The first larger task is to categorize the data by the type of conflict to check for, since our method will likely differ by type (at least for the rule-based approach). We wrote up a short list [here](https://docs.google.com/document/d/1fEBk0JHeyQWshYWW5w_VTkaYyRfm9MBxJ9DAGoVa8Yw/edit?usp=sharing).

To do this, we're using the semantic type that is identified by the UMLS linker. Here's a table of the semantic types we're filtering for, and which conflict they'll be used for.

Here's a [full list](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt) of semantic types. You can look up definitions of semantic types [here](http://linkedlifedata.com/resource/umls-semnetwork/T033).

| Conflict | Semantic Type |
| --- | ----------- |
| Diagnoses-related errors | Disease or Syndrome (T047), Diagnostic Procedure (T060) |
| Inaccurate description of medical history (symptoms) | Sign or Symptom (T184) |
| Inaccurate description of medical history (operations) | Therapeutic or Preventive Procedure (T061) |
| Inaccurate description of medical history (other) | [all of the above and below] |
| Medication or allergies | Clinical Drug (T200), Pharmacologic Substance (T121) |
| Test procedures or results | Laboratory Procedure (T059), Laboratory or Test Result (T034) | 


For clarity, the concepts we'll keep from the UMLS linker are anything falling into these semantic types (which we will then categorize by type of conflict using the table above):

* T047 - Disease or Syndrome
* T121 - Pharmacologic Substance
* T023 - Body Part, Organ, or Organ Component
* T061 - Therapeutic or Preventive Procedure 
* T060 - Diagnostic Procedure
* T059 - Laboratory Procedure
* T034 - Laboratory or Test Result 
* T184 - Sign or Symptom 
* T200 - Clinical Drug

We'll store this info into a dictionary now.

<!-- Some useful def's 
Finding - 
That which is discovered by direct observation or measurement of an organism attribute or condition, including the clinical history of the patient. The history of the presence of a disease is a 'Finding' and is distinguished from the disease itself.  -->

In [11]:
SEMANTIC_TYPES = ['T047', 'T121', 'T023', 'T061', 'T060', 'T059', 'T034', 'T184', 'T200']
SEMANTIC_NAMES = ['Disease or Syndrome', 'Pharmacologic Substance', 'Body Part, Organ, or Organ Component',
                  'Therapeutic or Preventive Procedure', 'Diagnostic Procedure', 'Laboratory Procedure',
                  'Laboratory or Test Result', 'Sign or Symptom', 'Clinical Drug']
SEMANTIC_TYPE_TO_NAME = dict(zip(SEMANTIC_TYPES, SEMANTIC_NAMES))

SEMANTIC_TYPE_TO_NAME

{'T047': 'Disease or Syndrome',
 'T121': 'Pharmacologic Substance',
 'T023': 'Body Part, Organ, or Organ Component',
 'T061': 'Therapeutic or Preventive Procedure',
 'T060': 'Diagnostic Procedure',
 'T059': 'Laboratory Procedure',
 'T034': 'Laboratory or Test Result',
 'T184': 'Sign or Symptom',
 'T200': 'Clinical Drug'}

In [12]:
CONFLICT_TO_SEMANTIC_TYPE = {
    "diagnosis": {'T047', 'T060'},
    "med_history_symptom": {'T184'},
    "med_history_operation": {'T061'},
    "med_history_other": set(SEMANTIC_TYPES),
    "med_allergy": {'T200', 'T121'},
    "test_results": {'T059', 'T034'}
}

CONFLICT_TO_SEMANTIC_TYPE

{'diagnosis': {'T047', 'T060'},
 'med_history_symptom': {'T184'},
 'med_history_operation': {'T061'},
 'med_history_other': {'T023',
  'T034',
  'T047',
  'T059',
  'T060',
  'T061',
  'T121',
  'T184',
  'T200'},
 'med_allergy': {'T121', 'T200'},
 'test_results': {'T034', 'T059'}}
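
As a sketch of how this mapping might be used downstream, the hypothetical helper `conflicts_for` returns every conflict category whose semantic types overlap with those of an entity. The dictionaries are repeated so the snippet runs on its own; the helper name is an assumption and is not part of `data_structures`.

```python
SEMANTIC_TYPES = ['T047', 'T121', 'T023', 'T061', 'T060', 'T059', 'T034', 'T184', 'T200']

CONFLICT_TO_SEMANTIC_TYPE = {
    "diagnosis": {'T047', 'T060'},
    "med_history_symptom": {'T184'},
    "med_history_operation": {'T061'},
    "med_history_other": set(SEMANTIC_TYPES),
    "med_allergy": {'T200', 'T121'},
    "test_results": {'T059', 'T034'},
}


def conflicts_for(entity_semantic_types):
    """Return the sorted conflict categories matching any of the given types."""
    types = set(entity_semantic_types)
    return sorted(conflict
                  for conflict, sem_types in CONFLICT_TO_SEMANTIC_TYPE.items()
                  if types & sem_types)


print(conflicts_for(['T047']))  # ['diagnosis', 'med_history_other']
print(conflicts_for(['T023']))  # ['med_history_other']
print(conflicts_for(['T999']))  # [] (placeholder id for a type we do not filter for)
```

Because `med_history_other` covers every kept semantic type, each kept entity falls into at least that category.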

In [13]:
from data_structures import Patient,\
                            Note, PrescriptionOrders, LabResults,\
                            Sentence, Prescription, Lab

In [14]:
# from importlib import reload # python 2.7 does not require this
# import data_structures
# reload(data_structures)
# from data_structures import Patient,\
#                             Note, PrescriptionOrders, LabResults,\
#                             Sentence, Prescription, Lab

# 3. Load and process data

In [15]:
# Load MIMIC tables
notes_df  = pd.read_csv('NOTEEVENTS.csv.gz',    compression='gzip', error_bad_lines=False)
drug_df   = pd.read_csv('PRESCRIPTIONS.csv.gz', compression='gzip', error_bad_lines=False)
lab_df    = pd.read_csv('LABEVENTS.csv.gz',     compression='gzip', error_bad_lines=False)
d_lab_df  = pd.read_csv('D_LABITEMS.csv.gz',    compression='gzip', error_bad_lines=False)



#### Updated script for processing HADM ID's with consecutive physician notes (does not count the autosaves)

In [16]:
# Load HADM ID's with consecutive physician notes
if os.path.exists("hadm_ids.pkl"):
    with open("hadm_ids.pkl", "rb") as f:
        hadm_ids = pickle.load(f)
else:
    hadm_ids = []
    for hadm_id in tqdm(notes_df.HADM_ID.unique()):
        hadm_data = notes_df.loc[notes_df.HADM_ID == hadm_id]
        hadm_phys_notes = hadm_data.loc[hadm_data.CATEGORY == "Physician "]

        if len(hadm_phys_notes.CHARTTIME.unique()) > 1: # ensure > 1 unique notes (not counting autosave)
            hadm_ids.append(hadm_id)

    with open("hadm_ids.pkl", "wb") as f:
        pickle.dump(hadm_ids, f)
        
print(f"There are {len(hadm_ids)} patients with consecutive physician notes.")

There are 8158 patients with consecutive physician notes.


# 4. Generating Contradictions

Generate 25-50 examples each of contradictions (positive) and non-contradictions (negative).

For lab values: 

* Find 50-100 total data pairs (about 2-4 per patient) and insert contradiction, or label as not a contradiction

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [18]:
pd.set_option("display.max_colwidth", None) # prints full text (-1 is deprecated)


In [19]:
from importlib import reload # python 2.7 does not require this
import data_structures
reload(data_structures)
from data_structures import Patient,\
                            Note, PrescriptionOrders, LabResults,\
                            Sentence, Prescription, Lab

In [20]:
def is_comparable_type(data_i, data_j):
    """ We only want to compare note-to-note OR note-to-structured data. 
    
    Comparable types:
    - sentence v. sentence
    - sentence v. prescription
    - sentence v. lab
    
    Uncomparable types:
    - lab v. lab 
    - lab v. prescription
    - prescription v. prescription
    """
    return (data_i.type == "sentence"     and data_j.type == "sentence") or \
           (data_i.type == "sentence"     and data_j.type == "prescription") or \
           (data_i.type == "prescription" and data_j.type == "sentence") or \
           (data_i.type == "sentence"     and data_j.type == "lab") or \
           (data_i.type == "lab"          and data_j.type == "sentence")

In [21]:
def generate_data_pairs(pat):
    processed_pairs = []  # for dataframe + csv
    data_inst_pairs = []  # for pipeline, list of tuples: ((Data 1, Data 2), label)
    pair_idx = 0

    # Iterate over all of the patient's DailyData instances (e.g. note, prescription order, lab results for same day)
    ## pat.dailydata = {[date]: [DailyData instance from that date], ...}
    for day, pat_dailydatas in pat.dailydata.items(): # pat_dailydatas is list of all DailyData instances for `day`
        print(f"********** Processing data for {day} **********")
        # Collect all the daily datas (note, prescription orders, lab results) for current day
        current_dds = []
        current_dds_features = []
        current_dds_txts = []
        current_dds_sem_types = []
        current_dds_sem_names = []
        for dd in pat_dailydatas: # iterating over DailyData instances, e.g. dd=physician note taken on `day`
            current_dds.extend(dd.datas)
            current_dds_features.extend(dd.datas_features)
            current_dds_txts.extend(dd.datas_txts)
            current_dds_sem_types.extend(dd.datas_semantic_types)
            current_dds_sem_names.extend(dd.datas_semantic_names)

        current_dds           = np.array(current_dds)
        current_dds_features  = np.array(current_dds_features)
        current_dds_txts      = np.array(current_dds_txts)
        current_dds_sem_types = np.array(current_dds_sem_types)
        current_dds_sem_names = np.array(current_dds_sem_names)

        # extract similar sentences for each semantic type
        for sem_type in SEMANTIC_TYPES:
            # data for this semantic type
            sem_type_bools   = [sem_type in x for x in current_dds_sem_types]
            sem_type_indices = np.where(sem_type_bools)[0]
            indices_map = dict(
                            zip(range(len(sem_type_indices)), 
                                sem_type_indices)
                          )  # maps regular indices in sem_type_current_dds_* lists to indices in current_dds_* lists

            sem_type_current_dds           = current_dds[sem_type_indices]
            sem_type_current_dds_features  = current_dds_features[sem_type_indices]
            sem_type_current_dds_txts      = current_dds_txts[sem_type_indices]
            sem_type_current_dds_sem_types = current_dds_sem_types[sem_type_indices]
            sem_type_current_dds_sem_names = current_dds_sem_names[sem_type_indices]

            # bag-of-words count vectors over the features (UMLS + RxNorm concepts)
            vectorizer = CountVectorizer()
            corpus = list(map(lambda x: ' '.join(x), sem_type_current_dds_features))
            if len(corpus) == 0: # skip rest if no candidate sentences exist
                continue
            X = vectorizer.fit_transform(corpus)
            X = X.toarray()

            # get cosine similarity using umls + rxnorm concepts
            similarity = cosine_similarity(X)     # larger=more similar
            sim_is, sim_js = np.where(similarity>0.5) # all pairs with at least 0.5 similarity

            for i, j in zip(sim_is, sim_js):
                data_i = sem_type_current_dds[i]
                data_j = sem_type_current_dds[j]
                # keep each unordered pair once (i > j also drops self-pairs) and only comparable types
                if i>j and is_comparable_type(data_i, data_j):
                    print(f"***** PAIR INDEX {pair_idx} *****")
                    print(f"Cosine similarity: {similarity[i, j]}")
                    print(f"----- Data i -----")
                    print(f">> Time: {data_i.time}\n" +\
                          f">> Type: {data_i.type}\n" +\
                          f">> Concepts: {data_i.features}\n" +\
                          f">> {data_i.txt}")
                    print(f"----- Data j -----")
                    print(f">> Time: {data_j.time}\n" +\
                          f">> Type: {data_j.type}\n" +\
                          f">> Concepts: {data_j.features}\n" +\
                          f">> {data_j.txt}")
                    print("**********************************")

                    # save
                    processed_pairs.append([data_i.txt,      data_j.txt, \
                                            data_i.time,     data_j.time, \
                                            data_i.type,     data_j.type, \
                                            data_i.features, data_j.features, \
                                            similarity[i, j], SEMANTIC_TYPE_TO_NAME[sem_type]])
            #                                 SEMANTIC_TYPE_TO_NAME[semantic_type]])

                    data_inst_pairs.append(((data_i, data_j), None))
                    pair_idx += 1

    ###############
    #### Final ####
    ###############        
    # Build from the list directly so an empty result still gets the expected columns
    df = pd.DataFrame(processed_pairs,
                      columns=["sentence 1", "sentence 2",
                               "time 1", "time 2",
                               "type 1", "type 2",
                               "concepts 1", "concepts 2",
                               "cosine similarity", "semantic type"])
    
    return df, data_inst_pairs
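
The similarity step can be reproduced without scikit-learn to see where the scores come from. This is a sketch under the assumption that `CountVectorizer` runs with its defaults (lowercasing, tokens of two or more word characters); `tokenize` and `concept_cosine` are hypothetical helpers, not part of the pipeline.

```python
import math
import re
from collections import Counter


def tokenize(concepts):
    """Join concept names and split into lowercase tokens of 2+ word
    characters, mimicking CountVectorizer's default token pattern."""
    return re.findall(r"\b\w\w+\b", " ".join(concepts).lower())


def concept_cosine(concepts_a, concepts_b):
    """Cosine similarity between the word-count vectors of two concept sets."""
    a, b = Counter(tokenize(concepts_a)), Counter(tokenize(concepts_b))
    dot = sum(a[tok] * b[tok] for tok in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


pair_a = ({'Chronic Kidney Diseases', 'Acute respiratory failure'},
          {'Chronic Kidney Diseases'})
pair_b = ({'Structure of left lower lobe of lung', 'Pneumonia'},
          {'Structure of left lower lobe of lung', 'Lung hyperinflation'})

print(round(concept_cosine(*pair_a), 4))  # 0.7071
print(round(concept_cosine(*pair_b), 4))  # 0.8771
```

Note that similarity is computed over the words of the concept names, not over whole concepts, so shared words such as "lung" or "failure" raise the score even when the concepts themselves differ.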

## README: Store generated data here

In [22]:
generated_data_dict = {}

## Patient 1

In [23]:
#### Process patient data and iterate over pairs of Data instances to get pairs
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[0] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df, \
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker, \
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 155131



********** Processing data for 2131-12-23 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.7071067811865477
----- Data i -----
>> Time: 2131-12-23 23:51:00
>> Type: sentence
>> Concepts: {'Chronic Kidney Diseases', 'Acute respiratory failure'}
>> Altered mental status, acute respiratory failure,    chronic renal failure, hypotension    I saw and examined the patient, and was physically present with the ICU    Resident for key portions of the services provided.'
----- Data j -----
>> Time: 2131-12-23 23:51:00
>> Type: sentence
>> Concepts: {'Chronic Kidney Diseases'}
>> Patient with chronic renal failure.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.8770580193070292
----- Data i -----
>> Time: 2131-12-23 23:51:00
>> Type: sentence
>> Concepts: {'Structure of left lower lobe of lung', 'Pneumonia'}
>> HPI:     [**Age over 90 382**]  year old woman hx of COPD, recent admit in early  [**Month (only) 102**]  with LLL    pneumonia.'
----- Data j --

In [24]:
#### Inserting contradictions to Sentence instances
# IMPORTANT: We should only insert contradictions if it is a sentence from a note ("type" should be sentence, not lab or prescription)! 

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,"Altered mental status, acute respiratory failure, chronic renal failure, hypotension I saw and examined the patient, and was physically present with the ICU Resident for key portions of the services provided.'",Patient with chronic renal failure.',2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Chronic Kidney Diseases, Acute respiratory failure}",{Chronic Kidney Diseases},0.707107,Disease or Syndrome
1,"HPI: [**Age over 90 382**] year old woman hx of COPD, recent admit in early [**Month (only) 102**] with LLL pneumonia.'","Compared to previous CXR, LLL infiltrate not significantly changed; right lung hyperinflation more impressive.'",2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Pneumonia}","{Structure of left lower lobe of lung, Lung hyperinflation}",0.877058,Disease or Syndrome


In [25]:
diagnosis_pairs_df.head(20)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,"Altered mental status, acute respiratory failure, chronic renal failure, hypotension I saw and examined the patient, and was physically present with the ICU Resident for key portions of the services provided.'",Patient with chronic renal failure.',2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Chronic Kidney Diseases, Acute respiratory failure}",{Chronic Kidney Diseases},0.707107,Disease or Syndrome
1,"HPI: [**Age over 90 382**] year old woman hx of COPD, recent admit in early [**Month (only) 102**] with LLL pneumonia.'","Compared to previous CXR, LLL infiltrate not significantly changed; right lung hyperinflation more impressive.'",2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Pneumonia}","{Structure of left lower lobe of lung, Lung hyperinflation}",0.877058,Disease or Syndrome
2,"I agree with his / her note above, including assessment and plan.'","Assessment and Plan PNEUMONIA, BACTERIAL, COMMUNITY ACQUIRED (CAP) ALTERED MENTAL STATUS (NOT DELIRIUM) RENAL FAILURE, CHRONIC (CHRONIC RENAL FAILURE, CRF, CHRONIC KIDNEY DISEASE) HYPOTENSION (NOT SHOCK) COPD ACIDOSIS Patient with severe COPD and recent pneumonia.'",2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,{Infantile Neuroaxonal Dystrophy},"{Dial Antibacterial, Pneumonia, Infantile Neuroaxonal Dystrophy, Kidney}",0.654654,Disease or Syndrome
3,"CXR with LLL infiltrate, which may be persitent radiograph manifestation of her previous pneumonia (may take 6-8 weeks to resolve).'","Compared to previous CXR, LLL infiltrate not significantly changed; right lung hyperinflation more impressive.'",2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Pneumonia}","{Structure of left lower lobe of lung, Lung hyperinflation}",0.877058,Disease or Syndrome
4,"CXR with LLL infiltrate, which may be persitent radiograph manifestation of her previous pneumonia (may take 6-8 weeks to resolve).'","HPI: [**Age over 90 382**] year old woman hx of COPD, recent admit in early [**Month (only) 102**] with LLL pneumonia.'",2131-12-23 23:51:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Pneumonia}","{Structure of left lower lobe of lung, Pneumonia}",1.0,Disease or Syndrome
5,difficile if produces # Chronic renal failure: Secondary to ANCA vasculitis.',Patient with chronic renal failure.',2131-12-23 22:56:00,2131-12-23 23:51:00,sentence,sentence,"{Chronic Kidney Diseases, Clostridium difficile colitis}",{Chronic Kidney Diseases},0.707107,Disease or Syndrome
6,difficile if produces # Chronic renal failure: Secondary to ANCA vasculitis.',"Altered mental status, acute respiratory failure, chronic renal failure, hypotension I saw and examined the patient, and was physically present with the ICU Resident for key portions of the services provided.'",2131-12-23 22:56:00,2131-12-23 23:51:00,sentence,sentence,"{Chronic Kidney Diseases, Clostridium difficile colitis}","{Chronic Kidney Diseases, Acute respiratory failure}",0.5,Disease or Syndrome
7,"PTT: 57.4', "" INR: 2.0 [**2129-1-3**] 2:33 A12/21/ [**2131**] 09:24 PM [**2129-1-7**] 10:20 P [**2129-1-8**] 1:20 P [**2129-1-9**] 11:50 P [**2129-1-10**] 1:20 A [**2129-1-11**] 7:20 P 1//11/006 1:23 P [**2129-2-3**] 1:20 P [**2129-2-3**] 11:20 P [**2129-2-3**] 4:20 P TC02 28 Other labs: Lactic Acid:0.9 mmol/L Assessment and Plan A [**Age over 90 382**] year-old female with past medical history of chronic obstructive pulmonary disease, Wegener's granulomatosis, recent admission [**Date range (1) 7130**] for acute on chronic renal failure with decision to initiate hemodialysis at that time and hospital stay complicated by left lower lobe Moraxella pneumonia presenting with altered mental status.""","Compared to previous CXR, LLL infiltrate not significantly changed; right lung hyperinflation more impressive.'",2131-12-23 22:56:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Lactic acid, Chronic Kidney Diseases, Granulomatosis with polyangiitis, Activated Partial Thromboplastin Time measurement, Lung diseases, Infantile Neuroaxonal Dystrophy, Laboratory test finding, Integrated Neuromusculoskeletal Release}","{Structure of left lower lobe of lung, Lung hyperinflation}",0.547153,Disease or Syndrome
8,"PTT: 57.4', "" INR: 2.0 [**2129-1-3**] 2:33 A12/21/ [**2131**] 09:24 PM [**2129-1-7**] 10:20 P [**2129-1-8**] 1:20 P [**2129-1-9**] 11:50 P [**2129-1-10**] 1:20 A [**2129-1-11**] 7:20 P 1//11/006 1:23 P [**2129-2-3**] 1:20 P [**2129-2-3**] 11:20 P [**2129-2-3**] 4:20 P TC02 28 Other labs: Lactic Acid:0.9 mmol/L Assessment and Plan A [**Age over 90 382**] year-old female with past medical history of chronic obstructive pulmonary disease, Wegener's granulomatosis, recent admission [**Date range (1) 7130**] for acute on chronic renal failure with decision to initiate hemodialysis at that time and hospital stay complicated by left lower lobe Moraxella pneumonia presenting with altered mental status.""","HPI: [**Age over 90 382**] year old woman hx of COPD, recent admit in early [**Month (only) 102**] with LLL pneumonia.'",2131-12-23 22:56:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Lactic acid, Chronic Kidney Diseases, Granulomatosis with polyangiitis, Activated Partial Thromboplastin Time measurement, Lung diseases, Infantile Neuroaxonal Dystrophy, Laboratory test finding, Integrated Neuromusculoskeletal Release}","{Structure of left lower lobe of lung, Pneumonia}",0.519875,Disease or Syndrome
9,"PTT: 57.4', "" INR: 2.0 [**2129-1-3**] 2:33 A12/21/ [**2131**] 09:24 PM [**2129-1-7**] 10:20 P [**2129-1-8**] 1:20 P [**2129-1-9**] 11:50 P [**2129-1-10**] 1:20 A [**2129-1-11**] 7:20 P 1//11/006 1:23 P [**2129-2-3**] 1:20 P [**2129-2-3**] 11:20 P [**2129-2-3**] 4:20 P TC02 28 Other labs: Lactic Acid:0.9 mmol/L Assessment and Plan A [**Age over 90 382**] year-old female with past medical history of chronic obstructive pulmonary disease, Wegener's granulomatosis, recent admission [**Date range (1) 7130**] for acute on chronic renal failure with decision to initiate hemodialysis at that time and hospital stay complicated by left lower lobe Moraxella pneumonia presenting with altered mental status.""","CXR with LLL infiltrate, which may be persitent radiograph manifestation of her previous pneumonia (may take 6-8 weeks to resolve).'",2131-12-23 22:56:00,2131-12-23 23:51:00,sentence,sentence,"{Structure of left lower lobe of lung, Lactic acid, Chronic Kidney Diseases, Granulomatosis with polyangiitis, Activated Partial Thromboplastin Time measurement, Lung diseases, Infantile Neuroaxonal Dystrophy, Laboratory test finding, Integrated Neuromusculoskeletal Release}","{Structure of left lower lobe of lung, Pneumonia}",0.519875,Disease or Syndrome


In [209]:
# Step 4: Insert contradictions

# We should probably aim for 1-2 contradictions per patient. 
# So basically, copy/paste the code for Steps 1-4 for each patient, and push to GitHub.
# Small heads up -- for a given patient, try not to insert contradictions 
# into two sentences that look really really similar. 
# There's a chance this might refer to the same underlying Sentence instance, 
# which could overwrite a contradiction you previously inserted. 

# Look through the sentence pairs by going through `diagnosis_pairs_df`.
# If you find a good one you want to insert a contradiction for, 
# make note of the row index (i.e. the number at the left), 
# and set this to `pair_idx` below. 
# Also make note of which sentence (i.e. sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag appropriately.

In [26]:
pair_idx = 11
is_sentence2 = True

data_1 = data_inst_pairs[pair_idx][0][0]
data_2 = data_inst_pairs[pair_idx][0][1]

print(f"{data_1.type} 1:\t{data_1.txt}")
print(f"{data_2.type} 2:\t{data_2.txt}")

sentence_to_modify = data_inst_pairs[pair_idx][0][int(is_sentence2)]  # 0 -> sentence 1, 1 -> sentence 2

# Set `contradicting_txt` to the new contradicting sentence.
# This will just update the text for now.

contradicting_txt = "Compared to previous CXR, no LLL infiltrate; right lung hyperinflation more impressive."
sentence_to_modify.update_text(contradicting_txt)

print(f"\nNew contradicting sentence: {contradicting_txt}")

# Store conflict
generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

sentence 1:	Recent    admission for acute on chronic renal failure with tunnelled catheter    placement, complicated by left lower lobe Moraxella pneumonia    discharged on cefpodoxime.'
sentence 2:	Compared to previous CXR, LLL infiltrate not    significantly changed; right lung hyperinflation more impressive.'

New contradicting sentence: Compared to previous CXR, no LLL infiltrate; right lung hyperinflation more impressive.
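
Since this cell gets copy/pasted for every patient, the same steps could be wrapped in a small helper. The following is a minimal sketch, not the notebook's actual code: `FakeSentence` is a hypothetical stand-in for the real `Sentence` class (assumed to expose `txt`, `type` and `update_text`), `insert_contradiction` is a made-up name, and the toy pair list only mimics the `data_inst_pairs[pair_idx][0]` indexing used above.

```python
from dataclasses import dataclass


@dataclass
class FakeSentence:
    """Hypothetical stand-in for the notebook's Sentence class."""
    txt: str
    type: str = "sentence"

    def update_text(self, new_txt: str) -> None:
        self.txt = new_txt


def insert_contradiction(data_inst_pairs, store, hadm_id, pair_idx,
                         is_sentence2, contradicting_txt):
    """Overwrite one side of a pair and record the edit in `store`."""
    target = data_inst_pairs[pair_idx][0][int(is_sentence2)]
    # Only note sentences should be edited, never lab or prescription entries.
    if target.type != "sentence":
        raise ValueError(f"pair {pair_idx}: refusing to edit a {target.type!r} instance")
    target.update_text(contradicting_txt)
    entry = store.setdefault(int(hadm_id), {"contradiction": {}, "none": []})
    entry["contradiction"][pair_idx] = (is_sentence2, contradicting_txt)
    return target


# Toy data in the assumed shape: each entry is ((data_1, data_2), ...).
pairs = [((FakeSentence("Left lower lobe pneumonia on prior admission."),
           FakeSentence("Compared to previous CXR, LLL infiltrate not significantly changed.")),)]
store = {}
insert_contradiction(pairs, store, hadm_id=1, pair_idx=0, is_sentence2=True,
                     contradicting_txt="Compared to previous CXR, no LLL infiltrate.")
print(pairs[0][0][1].txt)
print(store)
```

The type check makes the "sentences only" rule explicit, so a lab or prescription entry cannot be overwritten by accident.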


In [27]:
no_contradiction_pair_idx = [10, 17]

print("Examples of non-contradictions")
print("*****************************")
for pair_idx in no_contradiction_pair_idx:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")
    
# Store negative examples
generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

Examples of non-contradictions
*****************************
sentence 1:	Recent    admission for acute on chronic renal failure with tunnelled catheter    placement, complicated by left lower lobe Moraxella pneumonia    discharged on cefpodoxime.'
sentence 2:	Patient with chronic renal failure.'
*****************************
sentence 1:	- Vancomycin, zosyn for hospital-acquired pneumonia    - Sputum, blood, ?influenza/viral, urine cultures pending; stool    culture/C.'
sentence 2:	- Vancomycin, zosyn for hospital-acquired pneumonia    - Sputum, blood cultures pending    - Standing albuterol and ipratropium with additional albuterol PRN    # Bandemia/left shift: Most likely sources are pulmonary versus line    infection as above.'
*****************************


## Patient 2

In [28]:
#### Process patient data and build pairs of Data instances
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[1] # Note: `hadm_ids` is a list of all HADM IDs with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df,
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker,
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 129414


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.prescription_df[['START_DT']] = start_dt

********** Processing data for 2174-02-12 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.75
----- Data i -----
>> Time: 2174-02-12 19:14:00
>> Type: sentence
>> Concepts: {'Asthma', 'Vitamin B 12 Deficiency'}
>> Vitamin B12 deficiency:  not on b12 currently    Twin sister with asthma'
----- Data j -----
>> Time: 2174-02-12 19:14:00
>> Type: sentence
>> Concepts: {'Anemia', 'Vitamin B 12 Deficiency'}
>> #Anemia: Slightly lower than recent baseline, likely secondary to known    Vitamin B12 deficiency.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.9086882225022428
----- Data i -----
>> Time: 2174-02-12 19:14:00
>> Type: sentence
>> Concepts: {'Structure of middle lobe of right lung', 'Pneumonia'}
>> He had a CXR which was    inconclusive, possible RML pneumonia.'
----- Data j -----
>> Time: 2174-02-12 19:14:00
>> Type: sentence
>> Concepts: {'Structure of right lower lobe of lung', 'Pneumonia', 'Structure of right upper lobe of lung'}
>> Got a

In [29]:
#### Inserting contradictions into Sentence instances
# IMPORTANT: Only insert a contradiction if the instance is a sentence from a note ("type" should be sentence, not lab or prescription)!

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,Vitamin B12 deficiency: not on b12 currently Twin sister with asthma',"#Anemia: Slightly lower than recent baseline, likely secondary to known Vitamin B12 deficiency.'",2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Asthma, Vitamin B 12 Deficiency}","{Anemia, Vitamin B 12 Deficiency}",0.75,Disease or Syndrome
1,"He had a CXR which was inconclusive, possible RML pneumonia.'","Got a CTA for concern of PE-but no PE, but some increased interstitial markings in RUL and RLL, more likely chronic process vs. pneumonia.'",2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Structure of middle lobe of right lung, Pneumonia}","{Structure of right lower lobe of lung, Pneumonia, Structure of right upper lobe of lung}",0.908688,Disease or Syndrome


In [31]:
diagnosis_pairs_df

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,Vitamin B12 deficiency: not on b12 currently Twin sister with asthma',"#Anemia: Slightly lower than recent baseline, likely secondary to known Vitamin B12 deficiency.'",2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Asthma, Vitamin B 12 Deficiency}","{Anemia, Vitamin B 12 Deficiency}",0.75,Disease or Syndrome
1,"He had a CXR which was inconclusive, possible RML pneumonia.'","Got a CTA for concern of PE-but no PE, but some increased interstitial markings in RUL and RLL, more likely chronic process vs. pneumonia.'",2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Structure of middle lobe of right lung, Pneumonia}","{Structure of right lower lobe of lung, Pneumonia, Structure of right upper lobe of lung}",0.908688,Disease or Syndrome
2,Asthma: on Advair and Spiriva.',Patient only mildly wheezing on exam and unlikely asthma exacerbation.',2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Advair, Spiriva, Asthma}",{Asthma},0.57735,Disease or Syndrome
3,Obesity: Glucose intolerance: Obstructive sleep apnea: declined CPAP therapy.',Glucose intolerance: SSI while on steroids -SSI.',2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Sleep Apnea, Obstructive, Glucose Intolerance (disease)}","{Steroids, Glucose Intolerance (disease)}",0.612372,Disease or Syndrome
4,"#Tachycardia: Sinus Tach on EKG from ED, likely due to hypoxia.'",The ED resident noted that he was not particularly wheezy on exam.',2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Electrocardiography, Erectile dysfunction, Sinutab Sinus}",{Erectile dysfunction},0.632456,Disease or Syndrome
5,"Hypertension: on hydrochlorothiazide, lisinopril, and Norvasc.'","Hypertension: -continue hydrochlorothiazide, lisinopril and Norvasc.'",2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Lisinopril, Hypertet, Norvasc, Hypertensive disease, Hydrochlorothiazide, hydrochlorothiazide, lisinopril}","{Lisinopril, Hypertet, Norvasc, Hypertensive disease, Hydrochlorothiazide, hydrochlorothiazide, lisinopril}",1.0,Disease or Syndrome
6,Obstructive sleep apnea: Declined CPAP therapy in the past but will likely benefit from positive pressure as he is desatting when he falls asleep.',Obesity: Glucose intolerance: Obstructive sleep apnea: declined CPAP therapy.',2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Positive pressure therapy, Sleep Apnea, Obstructive}","{Sleep Apnea, Obstructive, Glucose Intolerance (disease)}",0.5,Disease or Syndrome
7,"In the ED, initial vs were: T: 99.5 P109 BP101/56 R24 79% on RA on presentation, 100% on NRB.'",The ED resident noted that he was not particularly wheezy on exam.',2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Refractory anemias, Erectile dysfunction}",{Erectile dysfunction},0.707107,Disease or Syndrome
20,"#Tachycardia: Sinus Tach on EKG from ED, likely due to hypoxia.'","Also be very tachycardic, EKG shows sinus tach.'",2174-02-12 19:14:00,2174-02-12 19:14:00,sentence,sentence,"{Electrocardiography, Erectile dysfunction, Sinutab Sinus}","{Electrocardiography, Sinutab Sinus}",0.774597,Diagnostic Procedure
27,"HPI: 51 yo history of asthma and hypoplasia of right lower lobe, frequent exacerbations of lung disease, sleep apnea - both central and OSA.'",CT did show consolidation in RML.',2174-02-13 10:57:00,2174-02-13 10:57:00,sentence,sentence,"{Lung diseases, Structure of right lower lobe of lung, Kcentra, Sleep Apnea Syndromes, Asthma}","{Lung consolidation, Structure of middle lobe of right lung}",0.719092,Disease or Syndrome


In [215]:
# Step 4: Insert contradictions

# We should aim for 1-2 contradictions per patient.
# In practice: copy/paste the code for Steps 1-4 for each patient, and push to GitHub.
# Heads up: for a given patient, avoid inserting contradictions
# into two sentences that look nearly identical.
# They may refer to the same underlying Sentence instance,
# in which case the second edit would overwrite the contradiction you inserted first.

# Look through the sentence pairs in `diagnosis_pairs_df`.
# When you find a pair you want to insert a contradiction for,
# note its row index (the number at the left)
# and set `pair_idx` to it below.
# Also note which sentence (sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag accordingly.

In [32]:
pair_idx = 6
is_sentence2 = True

data_1 = data_inst_pairs[pair_idx][0][0]
data_2 = data_inst_pairs[pair_idx][0][1]

print(f"{data_1.type} 1:\t{data_1.txt}")
print(f"{data_2.type} 2:\t{data_2.txt}")

sentence_to_modify = data_inst_pairs[pair_idx][0][int(is_sentence2)]  # 0 -> sentence 1, 1 -> sentence 2

# Set `contradicting_txt` to the new contradicting sentence.
# This will just update the text for now.

contradicting_txt = "Obesity: Glucose intolerance: Obstructive sleep apnea: mitigated with CPAP therapy"
sentence_to_modify.update_text(contradicting_txt)

print(f"\nNew contradicting sentence: {contradicting_txt}")

# Store conflict
generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

sentence 1:	Obstructive sleep apnea:  Declined CPAP therapy in the past but    will likely benefit from positive pressure as he is desatting when he    falls asleep.'
sentence 2:	Obesity:    Glucose intolerance:    Obstructive sleep apnea: declined CPAP therapy.'

New contradicting sentence: Obesity: Glucose intolerance: Obstructive sleep apnea: mitigated with CPAP therapy
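
The same sentence text can show up in more than one pair (here, the "declined CPAP therapy" sentence appears in rows 3 and 6), which is where the overwrite risk comes from. One way to guard against it is to track edited objects by identity before calling `update_text`. This is only a sketch under assumptions: `ModifiedTracker` and `Toy` are hypothetical names, and whether two rows really share one `Sentence` instance in the real data is not confirmed here.

```python
class ModifiedTracker:
    """Track already-edited objects by identity (not by text equality)."""

    def __init__(self):
        # id(obj) -> (obj, pair_idx); holding obj keeps it alive so ids are not reused.
        self._seen = {}

    def first_editor(self, obj, pair_idx):
        """Record `obj` as edited by `pair_idx`.

        Returns the earlier pair index if the same object was already edited,
        otherwise None.
        """
        entry = self._seen.get(id(obj))
        if entry is not None:
            return entry[1]
        self._seen[id(obj)] = (obj, pair_idx)
        return None


class Toy:
    """Hypothetical stand-in for a Sentence instance."""
    def __init__(self, txt):
        self.txt = txt


shared = Toy("Obstructive sleep apnea: declined CPAP therapy.")
other = Toy("Glucose intolerance: SSI while on steroids.")

tracker = ModifiedTracker()
print(tracker.first_editor(shared, pair_idx=6))  # None: first edit of this object
print(tracker.first_editor(other, pair_idx=3))   # None: a different object
print(tracker.first_editor(shared, pair_idx=3))  # 6: this would overwrite pair 6's edit
```

Comparing by identity rather than by text means two distinct instances with identical text are still treated as separate, which matches the concern in the Step 4 comment.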


In [33]:
no_contradiction_pair_idx = [3, 5]

print("Examples of non-contradictions")
print("*****************************")
for pair_idx in no_contradiction_pair_idx:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")

# Store negative examples
generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

Examples of non-contradictions
*****************************
sentence 1:	Obesity:    Glucose intolerance:    Obstructive sleep apnea: declined CPAP therapy.'
sentence 2:	Glucose intolerance:  SSI while on steroids    -SSI.'
*****************************
sentence 1:	   Hypertension:  on hydrochlorothiazide, lisinopril, and Norvasc.'
sentence 2:	Hypertension:    -continue hydrochlorothiazide, lisinopril and Norvasc.'
*****************************


In [218]:
# TODO: ask Dr. Saenz
potential_contradiction_pair_indices = [21]

print("Potential examples of contradictions")
print("*****************************")
for pair_idx in potential_contradiction_pair_indices:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")

Potential examples of contradictions
*****************************
lab 1:	Patient's White Blood Cells lab came back 6.0 K/uL.
sentence 2:	Labs notable for WBC of 5, HCT 34.5,    sodium of 131 and creatinine of 1.0.'
*****************************
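
For lab-versus-note pairs like this one, the open question is how far apart two reported values must be before they count as a contradiction. The sketch below only illustrates how the answer depends on the tolerance; it is not part of the pipeline. `extract_value`, `values_disagree` and both tolerance values are assumptions made for illustration, and the clinically meaningful threshold is exactly what needs to be asked.

```python
import re


def extract_value(text, label_pattern):
    """Return the first number appearing after `label_pattern` in `text`, or None."""
    match = re.search(rf"(?:{label_pattern})\D*?(\d+(?:\.\d+)?)", text, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None


def values_disagree(a, b, rel_tol):
    """True if a and b differ by more than `rel_tol` relative to the larger value."""
    return abs(a - b) > rel_tol * max(abs(a), abs(b))


lab_txt = "Patient's White Blood Cells lab came back 6.0 K/uL."
note_txt = "Labs notable for WBC of 5, HCT 34.5, sodium of 131 and creatinine of 1.0."

lab_wbc = extract_value(lab_txt, r"White Blood Cells")
note_wbc = extract_value(note_txt, r"WBC")
print(lab_wbc, note_wbc)                          # 6.0 5.0
print(values_disagree(lab_wbc, note_wbc, 0.10))   # True
print(values_disagree(lab_wbc, note_wbc, 0.25))   # False
```

With a 10% tolerance the pair is flagged, with 25% it is not, so the label for this example hinges on a threshold that has not been decided yet.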


## Patient 3

In [34]:
#### Process patient data and iterate over pairs of Data instances to get pairs
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[2] # Note: `hadm_ids` is a list of all HADM IDs with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df,
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker,
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 133623


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.prescription_df[['START_DT']] = start_dt
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
A value is trying to be set on a copy of a slice from a

********** Processing data for 2145-11-30 **********
********** Processing data for 2145-12-01 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.5773502691896258
----- Data i -----
>> Time: 2145-12-01 03:47:00
>> Type: sentence
>> Concepts: {'Dilt', 'Ting AF', 'Atrial Fibrillation', 'Glucose tolerance test'}
>> #: AF with RVR: HR at 110s upon arrival to the floor on Dilt gtt.'
----- Data j -----
>> Time: 2145-12-01 03:47:00
>> Type: sentence
>> Concepts: {'Physical therapy', 'Ting AF', 'Atrial Fibrillation'}
>> Pt currently in AF with rates in 120s.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.6324555320336758
----- Data i -----
>> Time: 2145-12-01 03:47:00
>> Type: sentence
>> Concepts: {'Erectile dysfunction'}
>> The pt subsequently re-developed chest pain while in the ED.'
----- Data j -----
>> Time: 2145-12-01 03:47:00
>> Type: sentence
>> Concepts: {'ethanol', 'Erectile dysfunction', 'Chest Pain'}
>> Chest Pain, ETOH Intoxication    HPI:

In [35]:
#### Inserting contradictions into Sentence instances
# IMPORTANT: Only insert a contradiction if the instance is a sentence from a note ("type" should be sentence, not lab or prescription)!

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,#: AF with RVR: HR at 110s upon arrival to the floor on Dilt gtt.',Pt currently in AF with rates in 120s.',2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Dilt, Ting AF, Atrial Fibrillation, Glucose tolerance test}","{Physical therapy, Ting AF, Atrial Fibrillation}",0.57735,Disease or Syndrome
1,The pt subsequently re-developed chest pain while in the ED.',"Chest Pain, ETOH Intoxication HPI: 54M with hx of ETOH abuse, HCV, presented to the ED this evening intoxicated.'",2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,{Erectile dysfunction},"{ethanol, Erectile dysfunction, Chest Pain}",0.632456,Disease or Syndrome


In [36]:
diagnosis_pairs_df 

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,#: AF with RVR: HR at 110s upon arrival to the floor on Dilt gtt.',Pt currently in AF with rates in 120s.',2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Dilt, Ting AF, Atrial Fibrillation, Glucose tolerance test}","{Physical therapy, Ting AF, Atrial Fibrillation}",0.57735,Disease or Syndrome
1,The pt subsequently re-developed chest pain while in the ED.',"Chest Pain, ETOH Intoxication HPI: 54M with hx of ETOH abuse, HCV, presented to the ED this evening intoxicated.'",2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,{Erectile dysfunction},"{ethanol, Erectile dysfunction, Chest Pain}",0.632456,Disease or Syndrome
2,# Chest Pain: Pt reported left sided chest pain while in the ED.',"Chest Pain, ETOH Intoxication HPI: 54M with hx of ETOH abuse, HCV, presented to the ED this evening intoxicated.'",2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Physical therapy, Erectile dysfunction, Chest Pain}","{ethanol, Erectile dysfunction, Chest Pain}",0.730297,Disease or Syndrome
3,# Chest Pain: Pt reported left sided chest pain while in the ED.',The pt subsequently re-developed chest pain while in the ED.',2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Physical therapy, Erectile dysfunction, Chest Pain}",{Erectile dysfunction},0.57735,Disease or Syndrome
4,# Chest Pain: Pt reported left sided chest pain while in the ED.',"Pt has received a total of 100mg of Valium during his ED course, Diltiazem 30mg IV of dilt x3, Dilt 30mg PO and subsequently placed on a Diltiazem drip at 15mg/hr.'",2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Physical therapy, Erectile dysfunction, Chest Pain}","{Diltiazem, Dilt, Valium, Physical therapy, Erectile dysfunction, diltiazem}",0.516398,Disease or Syndrome
5,Pt evaluated by Cards while in ED.',# Chest Pain: Pt reported left sided chest pain while in the ED.',2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Physical therapy, Erectile dysfunction, Respiratory Distress Syndrome, Adult}","{Physical therapy, Erectile dysfunction, Chest Pain}",0.57735,Disease or Syndrome
6,Upon assessment the pt reportedly became combative with a HR revealing AF with RVR with rates in 160-170s.',"ASSESSMENT AND PLAN: 54M with hx of ETOH abuse, HCV, presenting with AF with RVR, Chest Pain in the setting of ETOH intoxication.'",2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Ting AF, Atrial Fibrillation}","{ethanol, methionine, Ting AF, Atrial Fibrillation, Chest Pain, Infantile Neuroaxonal Dystrophy}",0.603023,Disease or Syndrome
7,Upon assessment the pt reportedly became combative with a HR revealing AF with RVR with rates in 160-170s.',Pt currently in AF with rates in 120s.',2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Ting AF, Atrial Fibrillation}","{Physical therapy, Ting AF, Atrial Fibrillation}",0.816497,Disease or Syndrome
8,Upon assessment the pt reportedly became combative with a HR revealing AF with RVR with rates in 160-170s.',#: AF with RVR: HR at 110s upon arrival to the floor on Dilt gtt.',2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Ting AF, Atrial Fibrillation}","{Dilt, Ting AF, Atrial Fibrillation, Glucose tolerance test}",0.707107,Disease or Syndrome
9,Upon assessment the pt reportedly became combative with a HR revealing AF with RVR with rates in 160-170s.',RR: 29 (13 - 29) insp/min SpO2: 95% Heart rhythm: AF (Atrial Fibrillation) Wgt (current): 117.4 kg (admission): 117.4 kg Total In: 558 mL PO: 480 mL TF: IVF: 78 mL Blood products: Total out: 0 mL 0 mL Urine: NG: Stool: Drains: Balance: 0 mL 558 mL Respiratory O2 Delivery Device: None SpO2: 95%',2145-12-01 03:47:00,2145-12-01 03:47:00,sentence,sentence,"{Ting AF, Atrial Fibrillation}","{Atrial Fibrillation Sotalol Hydrochloride 80 MG Oral Tablet [Betapace], Atrial Fibrillation, Blood Stop Topical Product, Uridine, Systane Balance, Drainage procedure, Blood product, Ting AF, Assisted Reproductive Technologies}",0.507093,Disease or Syndrome


In [222]:
# Step 4: Insert contradictions

# We should aim for 1-2 contradictions per patient.
# In practice: copy/paste the code for Steps 1-4 for each patient, and push to GitHub.
# Heads up: for a given patient, avoid inserting contradictions
# into two sentences that look nearly identical.
# They may refer to the same underlying Sentence instance,
# in which case the second edit would overwrite the contradiction you inserted first.

# Look through the sentence pairs in `diagnosis_pairs_df`.
# When you find a pair you want to insert a contradiction for,
# note its row index (the number at the left)
# and set `pair_idx` to it below.
# Also note which sentence (sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag accordingly.

In [37]:
pair_idx = 1
is_sentence2 = True

data_1 = data_inst_pairs[pair_idx][0][0]
data_2 = data_inst_pairs[pair_idx][0][1]

print(f"{data_1.type} 1:\t{data_1.txt}")
print(f"{data_2.type} 2:\t{data_2.txt}")

sentence_to_modify = data_inst_pairs[pair_idx][0][int(is_sentence2)]  # 0 -> sentence 1, 1 -> sentence 2

# Set `contradicting_txt` to the new contradicting sentence.
# This will just update the text for now.

contradicting_txt = "No reported Chest Pain, ETOH Intoxication HPI: 54M with hx of ETOH abuse, HCV, presented to the ED this evening intoxicated"
sentence_to_modify.update_text(contradicting_txt)

print(f"\nNew contradicting sentence: {contradicting_txt}")

# Store conflict
generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

sentence 1:	The pt subsequently re-developed chest pain while in the ED.'
sentence 2:	Chest Pain, ETOH Intoxication    HPI:    54M with hx of ETOH abuse, HCV, presented to the ED this evening    intoxicated.'

New contradicting sentence: No reported Chest Pain, ETOH Intoxication HPI: 54M with hx of ETOH abuse, HCV, presented to the ED this evening intoxicated
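
With several patients processed, it may be worth checking the "1-2 contradictions per patient" target directly on the stored dictionary. A minimal sketch, assuming the `{"contradiction": {...}, "none": [...]}` layout used for `generated_data_dict`; the helper names and the toy entries (with placeholder text) are made up for illustration.

```python
def contradiction_counts(generated):
    """Map each HADM id to the number of contradictions recorded for it."""
    return {hadm_id: len(entry["contradiction"]) for hadm_id, entry in generated.items()}


def outside_target(generated, low=1, high=2):
    """Return the HADM ids whose contradiction count is not within [low, high]."""
    return [h for h, n in contradiction_counts(generated).items() if not low <= n <= high]


# Toy dictionary in the same shape as `generated_data_dict`.
toy = {
    129414: {"contradiction": {6: (True, "<new text>")}, "none": [3, 5]},
    133623: {"contradiction": {1: (True, "<new text>")}, "none": []},
    197325: {"contradiction": {}, "none": []},  # nothing inserted yet
}
print(contradiction_counts(toy))  # {129414: 1, 133623: 1, 197325: 0}
print(outside_target(toy))        # [197325]
```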


In [38]:
no_contradiction_pair_idx = [63, 64]

print("Examples of non-contradictions")
print("*****************************")
for pair_idx in no_contradiction_pair_idx:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")
    
# Store negative examples
generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

Examples of non-contradictions
*****************************
sentence 1:	Repeat ECG unchanged.'
sentence 2:	ECG    unchanged.'
*****************************
sentence 1:	   Peripheral Vascular: (Right radial pulse: Present), (Left radial pulse:    Present), (Right DP pulse: Not assessed), (Left DP pulse: Not assessed)    Respiratory / Chest: (Expansion: Symmetric), (Percussion: Resonant : ),    (Breath Sounds: Clear : )'
sentence 2:	   Peripheral Vascular: (Right radial pulse: Present), (Left radial pulse:    Present), (Right DP pulse: Present), (Left DP pulse: Present)    Respiratory / Chest: (Expansion: Symmetric), (Percussion: Resonant : ),    (Breath Sounds: Clear : ), Reproducible Left sided chest pain'
*****************************
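
The stored contradictions and negative examples live only in memory until they are written somewhere. One option is `pickle`, which is already imported at the top of the notebook. The sketch below is an assumption about how that could look, not the notebook's actual saving step: the file name is hypothetical, a temporary directory stands in for a real output folder, and the dictionary is a toy with placeholder text.

```python
import os
import pickle
import tempfile

# Toy dictionary in the same shape as `generated_data_dict`.
toy = {133623: {"contradiction": {1: (True, "<new text>")}, "none": [63, 64]}}

out_dir = tempfile.mkdtemp()                             # stand-in for a real output directory
out_path = os.path.join(out_dir, "generated_data.pkl")   # hypothetical file name

with open(out_path, "wb") as f:
    pickle.dump(toy, f)

with open(out_path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy)  # True
```

Pickle round-trips the integer keys and the `(is_sentence2, text)` tuples unchanged, whereas JSON would turn the keys into strings and the tuples into lists.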


## Patient 4

In [39]:
#### Process patient data and iterate over pairs of Data instances to get pairs
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[3] # Note: `hadm_ids` is a list of all HADM IDs with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df,
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker,
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 197325


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/

********** Processing data for 2157-02-01 **********
********** Processing data for 2157-02-02 **********
***** PAIR INDEX 0 *****
Cosine similarity: 1.0000000000000002
----- Data i -----
>> Time: 2157-02-02 01:37:00
>> Type: sentence
>> Concepts: {'Gastroesophageal reflux disease'}
>> GERD - ASymptomatic.'
----- Data j -----
>> Time: 2157-02-02 01:37:00
>> Type: sentence
>> Concepts: {'Gastroesophageal reflux disease'}
>> Gastroesophageal reflux disease.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.6546536707079772
----- Data i -----
>> Time: 2157-02-02 01:37:00
>> Type: sentence
>> Concepts: {'Hypertet', 'Hypertensive disease'}
>> Hypertension.'
----- Data j -----
>> Time: 2157-02-02 01:37:00
>> Type: sentence
>> Concepts: {'Hypertet', 'Hypertensive disease', 'Cholecystectomy procedure', 'Abdominal Pain'}
>> HPI:    Mrs.  [**Known firstname 12011**]   [**Known lastname 12012**]  is a very nice 85 year-old woman with    significant past medical his

In [40]:
#### Inserting contradictions into Sentence instances
# IMPORTANT: Only insert a contradiction if the instance is a sentence from a note ("type" should be sentence, not lab or prescription)!

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,GERD - ASymptomatic.',Gastroesophageal reflux disease.',2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,{Gastroesophageal reflux disease},{Gastroesophageal reflux disease},1.0,Disease or Syndrome
1,Hypertension.',"HPI: Mrs. [**Known firstname 12011**] [**Known lastname 12012**] is a very nice 85 year-old woman with significant past medical history of hypertension, cholecystectomy and ampullar stenosis wh ocomes with abdominal pain.'",2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{Hypertet, Hypertensive disease}","{Hypertet, Hypertensive disease, Cholecystectomy procedure, Abdominal Pain}",0.654654,Disease or Syndrome


In [41]:
diagnosis_pairs_df 

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,GERD - ASymptomatic.',Gastroesophageal reflux disease.',2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,{Gastroesophageal reflux disease},{Gastroesophageal reflux disease},1.0,Disease or Syndrome
1,Hypertension.',"HPI: Mrs. [**Known firstname 12011**] [**Known lastname 12012**] is a very nice 85 year-old woman with significant past medical history of hypertension, cholecystectomy and ampullar stenosis wh ocomes with abdominal pain.'",2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{Hypertet, Hypertensive disease}","{Hypertet, Hypertensive disease, Cholecystectomy procedure, Abdominal Pain}",0.654654,Disease or Syndrome
2,Agree with plan to manage acute cholangitis with obstructing CBD stone with broad abx coverage with vanco / zosyn / flagyl while awaiting BCx and continuing hydration based on MAP / UOP.',Remainder of plan as outlined above.',2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{Mapap, cannabidiol, Acute cholangitis, Infantile Neuroaxonal Dystrophy, BCX 34, Flagyl}",{Infantile Neuroaxonal Dystrophy},0.547723,Disease or Syndrome
3,Hypertension - Pt with normal renal function coming with cholangitis.',"HPI: Mrs. [**Known firstname 12011**] [**Known lastname 12012**] is a very nice 85 year-old woman with significant past medical history of hypertension, cholecystectomy and ampullar stenosis wh ocomes with abdominal pain.'",2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{Physical therapy, Hypertet, Hypertensive disease}","{Hypertet, Hypertensive disease, Cholecystectomy procedure, Abdominal Pain}",0.507093,Disease or Syndrome
4,Hypertension - Pt with normal renal function coming with cholangitis.',Hypertension.',2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{Physical therapy, Hypertet, Hypertensive disease}","{Hypertet, Hypertensive disease}",0.774597,Disease or Syndrome
5,"Will monitor FSG with RISS while in ICU; no prior hx DM but FSG >200 in the ED in patient with chronic pancreatic disease.', ""Will continue Parkinson's and GERD meds at home doses.""",Gastroesophageal reflux disease.',2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{parkinson's disease and parkinsonism, Pancreatic Diseases, MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME, Erectile dysfunction, Gastroesophageal reflux disease}",{Gastroesophageal reflux disease},0.516398,Disease or Syndrome
6,"Will monitor FSG with RISS while in ICU; no prior hx DM but FSG >200 in the ED in patient with chronic pancreatic disease.', ""Will continue Parkinson's and GERD meds at home doses.""",GERD - ASymptomatic.',2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{parkinson's disease and parkinsonism, Pancreatic Diseases, MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME, Erectile dysfunction, Gastroesophageal reflux disease}",{Gastroesophageal reflux disease},0.516398,Disease or Syndrome
7,"Elevated lactate - Pt with stable VS, cholangitis as above.'","Cholangitis - Pt with RUQ pain and boas sign found to have 13 WBC with 14% bands, AST 979, ALT 723, AP 120, TB 3.5 and direct of 2.4 found to have cholangitis on ERCP.'",2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{Physical therapy, Cholangitis}","{Tuberculosis, Physical therapy, Right upper quadrant pain, Endoscopic Retrograde Cholangiopancreatography, Cholangitis}",0.522233,Disease or Syndrome
8,"Elevated lactate - Pt with stable VS, cholangitis as above.'",Hypertension - Pt with normal renal function coming with cholangitis.',2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{Physical therapy, Cholangitis}","{Physical therapy, Hypertet, Hypertensive disease}",0.516398,Disease or Syndrome
9,"Left: Carotid 2+ Femoral 2+ Popliteal 2+ DP 2+ PT 2+ Labs / Radiology [image002.jpg] Other labs: Lactic Acid:4.3 mmol/L Assessment and Plan Mrs. [**Known firstname 12011**] [**Known lastname 12012**] is a very nice 85 year-old woman with significant past medical history of hypertension, cholecystectomy and ampullar stenosis wh ocomes with cholangitis now s/p stent removal.'",Hypertension - Pt with normal renal function coming with cholangitis.',2157-02-02 01:37:00,2157-02-02 01:37:00,sentence,sentence,"{Hypertet, Femur, carotid, Hypertensive disease, Cholangitis, removal technique, Physical therapy, Infantile Neuroaxonal Dystrophy, Laboratory test finding, Cholecystectomy procedure}","{Physical therapy, Hypertet, Hypertensive disease}",0.527046,Disease or Syndrome


In [228]:
# Step 4: Insert contradictions

# We should probably aim for 1-2 contradictions per patient. 
# So basically, copy/paste code for Steps 1-4 for each patient, and push to GitHub.
# Small heads up -- for a given patient, try not to insert contradictions 
# into two sentences that look really similar. 
# Two pairs may refer to the same underlying Sentence instance, 
# so a later edit could overwrite a contradiction you previously inserted. 

# Look through the sentence pairs by going through `diagnosis_pairs_df`.
# If you find a good one you want to insert a contradiction for, 
# make note of the row index (i.e. the number at the left), 
# and set this to `pair_idx` below. 
# Also make note of which sentence (i.e. sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag appropriately.
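
The overwrite risk described in the comments above can be illustrated with a small sketch. It assumes that pairs hold references to shared objects rather than copies. The `Sentence` class below is a hypothetical stand-in (only `txt` and `update_text` are modeled), and the `None` in each entry is a placeholder for whatever else the real `data_inst_pairs` entries carry.

```python
# Hypothetical stand-in for the notebook's Sentence class (assumption:
# only `txt` and `update_text` matter for this illustration).
class Sentence:
    def __init__(self, txt):
        self.txt = txt

    def update_text(self, new_txt):
        self.txt = new_txt


shared = Sentence("Hypertension.")
other_a = Sentence("HPI: history of hypertension.")
other_b = Sentence("Hypertension - Pt with normal renal function.")

# Same indexing layout as the notebook: pairs[idx][0] is the (data_1, data_2) tuple.
pairs = [
    ((shared, other_a), None),
    ((other_b, shared), None),
]

# Editing sentence 1 of pair 0 ...
pairs[0][0][0].update_text("Hypotension.")

# ... also changes sentence 2 of pair 1, because both slots hold the same object.
same_object = pairs[0][0][0] is pairs[1][0][1]
print(pairs[1][0][1].txt)
```

If the real pairs behave this way, checking `is` identity between the two candidate sentences before a second edit is a cheap way to catch the collision.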

In [42]:
pair_idx = 1
is_sentence2 = False

data_1 = data_inst_pairs[pair_idx][0][0]
data_2 = data_inst_pairs[pair_idx][0][1]

print(f"{data_1.type} 1:\t{data_1.txt}")
print(f"{data_2.type} 2:\t{data_2.txt}")

# `is_sentence2` is a bool, so it selects index 0 (sentence 1) or 1 (sentence 2).
sentence_to_modify = data_inst_pairs[pair_idx][0][int(is_sentence2)]

# Set `contradicting_txt` to the new contradicting sentence.
# This will just update the text for now.

contradicting_txt = "Hypotension."
sentence_to_modify.update_text(contradicting_txt)

print(f"\nNew contradicting sentence: {contradicting_txt}")

# Store conflict
generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

sentence 1:	Hypertension.'
sentence 2:	HPI:    Mrs.  [**Known firstname 12011**]   [**Known lastname 12012**]  is a very nice 85 year-old woman with    significant past medical history of hypertension, cholecystectomy and    ampullar stenosis wh ocomes with abdominal pain.'

New contradicting sentence: Hypotension.


In [43]:
no_contradiction_pair_idx = [26, 95]

print("Examples of non-contradictions")
print("*****************************")
for pair_idx in no_contradiction_pair_idx:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")
    
# Store negative examples
generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

Examples of non-contradictions
*****************************
sentence 1:	GERD:  Asymptomatic: Continue home-dose PPI.'
sentence 2:	Will    monitor FSG with RISS while in ICU; no prior hx DM but FSG >200 in the    ED in patient with chronic pancreatic disease.', "Will continue    Parkinson's and GERD meds at home doses."
*****************************
sentence 1:	She underwent ERCP that    shwoed 1 cm stone in the common bile duct.'
sentence 2:	ERCP team notes transient episode of desaturation during    procedure, resolved with time and supplemental oxygen.'
*****************************
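
Since `generated_data_dict` accumulates the manual choices for every patient, it may be worth persisting it between sessions. The sketch below shows one way to do that with `pickle`, which is already imported at the top of the notebook. The admission ID `100001`, the file name, and the temporary directory are all hypothetical; only the dictionary layout follows the cells above.

```python
import os
import pickle
import tempfile

# Same layout as in the cells above:
# {hadm_id: {"contradiction": {pair_idx: (is_sentence2, text)}, "none": [pair_idx, ...]}}
# The admission ID 100001 is a made-up placeholder.
example_dict = {
    100001: {
        "contradiction": {1: (False, "Hypotension.")},
        "none": [26, 95],
    }
}

# Hypothetical output location; a temporary directory keeps the sketch side-effect free.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "generated_data_dict.pkl")

with open(out_path, "wb") as f:
    pickle.dump(example_dict, f)

with open(out_path, "rb") as f:
    restored = pickle.load(f)

print(restored == example_dict)
```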


## Patient 5

In [44]:
#### Process patient data and iterate over pairs of Data instances to get pairs
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[4] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df,
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker,
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 186291


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.prescription_df[['START_DT']] = start_dt
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/us

********** Processing data for 2151-09-21 **********
********** Processing data for 2151-09-22 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.5477225575051663
----- Data i -----
>> Time: 2151-09-22 05:00:00
>> Type: sentence
>> Concepts: {'Hypertet', 'Constipation', 'Infantile Neuroaxonal Dystrophy', 'Commit Lozenge', 'Hypertensive disease', 'Nausea'}
>>    Imaging: CXR ( [**2151-9-21**] ) no acute infiltrates    Assessment and Plan    NAUSEA, VOMMITTING -- unclear etiology; unclear if preceeded    hypertension or whether hypertension is cause or contibuting;  Chronic    constipation may also be contributing (and evidence radiographically).'
----- Data j -----
>> Time: 2151-09-22 05:00:00
>> Type: sentence
>> Concepts: {'Infantile Neuroaxonal Dystrophy'}
>> I agree with his /    her note above, including assessment and plan.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.7071067811865477
----- Data i -----
>> Time: 2151-09-22 05:00:00
>> Typ

In [45]:
#### Inserting contradictions to Sentence instances
# IMPORTANT: We should only insert contradictions if it is a sentence from a note ("type" should be sentence, not lab or prescription)! 

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,"Imaging: CXR ( [**2151-9-21**] ) no acute infiltrates Assessment and Plan NAUSEA, VOMMITTING -- unclear etiology; unclear if preceeded hypertension or whether hypertension is cause or contibuting; Chronic constipation may also be contributing (and evidence radiographically).'","I agree with his / her note above, including assessment and plan.'",2151-09-22 05:00:00,2151-09-22 05:00:00,sentence,sentence,"{Hypertet, Constipation, Infantile Neuroaxonal Dystrophy, Commit Lozenge, Hypertensive disease, Nausea}",{Infantile Neuroaxonal Dystrophy},0.547723,Disease or Syndrome
1,"COntrol with iv meds, and resume PO meds as tolerates.'","Continue usual meds, and low threshold for antimicrobials if evidence for COPD exacerbation.'",2151-09-22 05:00:00,2151-09-22 05:00:00,sentence,sentence,"{Ehlers-Danlos Syndrome, Type IV, MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME}","{MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME, Microbicides}",0.707107,Disease or Syndrome
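
The Step 3 filter can also be written with `Series.isin`, which gives the same boolean mask as the `apply`/`lambda` form. The sketch below uses a toy DataFrame; the list of semantic type names is an assumption standing in for what `CONFLICT_TO_SEMANTIC_TYPE` and `SEMANTIC_TYPE_TO_NAME` produce in the notebook.

```python
import pandas as pd

# Assumed stand-in for `semantic_type_names`; the real list comes from
# CONFLICT_TO_SEMANTIC_TYPE['diagnosis'] and SEMANTIC_TYPE_TO_NAME.
toy_type_names = ["Disease or Syndrome", "Diagnostic Procedure"]

toy_df = pd.DataFrame({
    "sentence 1": ["Sister: HTN", "Aspirin EC5.", "COPD: FEV1 of 0.5"],
    "semantic type": [
        "Disease or Syndrome",
        "Pharmacologic Substance",
        "Diagnostic Procedure",
    ],
})

# Vectorized membership test.
mask_isin = toy_df["semantic type"].isin(toy_type_names)

# Form used in the notebook, for comparison.
mask_apply = toy_df["semantic type"].apply(lambda x: x in toy_type_names)

toy_pairs_df = toy_df.loc[mask_isin]
print(toy_pairs_df.index.tolist())
```

Because `.loc` with a boolean mask keeps the original row labels, the surviving index values can still be used as `pair_idx`, provided the DataFrame rows line up with `data_inst_pairs`, as the notebook assumes.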


In [46]:
diagnosis_pairs_df 

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,"Imaging: CXR ( [**2151-9-21**] ) no acute infiltrates Assessment and Plan NAUSEA, VOMMITTING -- unclear etiology; unclear if preceeded hypertension or whether hypertension is cause or contibuting; Chronic constipation may also be contributing (and evidence radiographically).'","I agree with his / her note above, including assessment and plan.'",2151-09-22 05:00:00,2151-09-22 05:00:00,sentence,sentence,"{Hypertet, Constipation, Infantile Neuroaxonal Dystrophy, Commit Lozenge, Hypertensive disease, Nausea}",{Infantile Neuroaxonal Dystrophy},0.547723,Disease or Syndrome
1,"COntrol with iv meds, and resume PO meds as tolerates.'","Continue usual meds, and low threshold for antimicrobials if evidence for COPD exacerbation.'",2151-09-22 05:00:00,2151-09-22 05:00:00,sentence,sentence,"{Ehlers-Danlos Syndrome, Type IV, MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME}","{MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME, Microbicides}",0.707107,Disease or Syndrome
2,"NECK: No JVD, carotid pulses brisk, no bruits, no cervical lymphadenopathy, trachea midline'","Lymphatic: No(t) Cervical WNL, No(t) Supraclavicular WNL, No(t) Cervical adenopathy'",2151-09-22 03:58:00,2151-09-22 05:00:00,sentence,sentence,{Cervical lymphadenopathy},{Lymphadenopathy},0.707107,Disease or Syndrome
3,"He does not have lab evidence of pancreatitis, but does have h/o gastritis.'","No evience for cholycystitis, hepatitis or pancreatitis.'",2151-09-22 03:58:00,2151-09-22 05:00:00,sentence,sentence,"{Pancreatitis, Pancreatin}","{Pancreatin, Pancreatitis, Hepatitis}",0.816497,Disease or Syndrome
4,['[**Hospital Unit Name 44**] Resident Admission Note Reason for MICU Admission: Hypertension',HYPERTENSION -- unclear whether due to consequence of poor PO intake (medications) or primary problem.',2151-09-22 03:58:00,2151-09-22 05:00:00,sentence,sentence,"{Hypertet, Hypertensive disease}","{Hypertet, Hypertensive disease, Pharmaceutical Preparations}",0.774597,Disease or Syndrome
5,['[**Hospital Unit Name 44**] Resident Admission Note Reason for MICU Admission: Hypertension',"Imaging: CXR ( [**2151-9-21**] ) no acute infiltrates Assessment and Plan NAUSEA, VOMMITTING -- unclear etiology; unclear if preceeded hypertension or whether hypertension is cause or contibuting; Chronic constipation may also be contributing (and evidence radiographically).'",2151-09-22 03:58:00,2151-09-22 05:00:00,sentence,sentence,"{Hypertet, Hypertensive disease}","{Hypertet, Constipation, Infantile Neuroaxonal Dystrophy, Commit Lozenge, Hypertensive disease, Nausea}",0.547723,Disease or Syndrome
6,"HPI: This is a 66 yo M w/h/o HIV(last CD4 307 [**2151-9-10**] , VL 187 [**2151-9-15**] ), HTN, and severe COPD on 3L oxygen at home who presents w/nausea and emesis x 2 days.'",['[**Hospital Unit Name 44**] Resident Admission Note Reason for MICU Admission: Hypertension',2151-09-22 03:58:00,2151-09-22 03:58:00,sentence,sentence,"{Oxygen, Hypertensive disease, ASP2151}","{Hypertet, Hypertensive disease}",0.57735,Disease or Syndrome
7,Sister: HTN',HYPERTENSION -- unclear whether due to consequence of poor PO intake (medications) or primary problem.',2151-09-22 03:58:00,2151-09-22 05:00:00,sentence,sentence,{Hypertensive disease},"{Hypertet, Hypertensive disease, Pharmaceutical Preparations}",0.632456,Disease or Syndrome
8,Sister: HTN',['[**Hospital Unit Name 44**] Resident Admission Note Reason for MICU Admission: Hypertension',2151-09-22 03:58:00,2151-09-22 03:58:00,sentence,sentence,{Hypertensive disease},"{Hypertet, Hypertensive disease}",0.816497,Disease or Syndrome
9,Sister: HTN',"HPI: This is a 66 yo M w/h/o HIV(last CD4 307 [**2151-9-10**] , VL 187 [**2151-9-15**] ), HTN, and severe COPD on 3L oxygen at home who presents w/nausea and emesis x 2 days.'",2151-09-22 03:58:00,2151-09-22 03:58:00,sentence,sentence,{Hypertensive disease},"{Oxygen, Hypertensive disease, ASP2151}",0.707107,Disease or Syndrome


In [233]:
# Step 4: Insert contradictions

# We should probably aim for 1-2 contradictions per patient. 
# So basically, copy/paste code for Steps 1-4 for each patient, and push to GitHub.
# Small heads up -- for a given patient, try not to insert contradictions 
# into two sentences that look really similar. 
# Two pairs may refer to the same underlying Sentence instance, 
# so a later edit could overwrite a contradiction you previously inserted. 

# Look through the sentence pairs by going through `diagnosis_pairs_df`.
# If you find a good one you want to insert a contradiction for, 
# make note of the row index (i.e. the number at the left), 
# and set this to `pair_idx` below. 
# Also make note of which sentence (i.e. sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag appropriately.

In [47]:
pair_idx = 2
is_sentence2 = False

data_1 = data_inst_pairs[pair_idx][0][0]
data_2 = data_inst_pairs[pair_idx][0][1]

print(f"{data_1.type} 1:\t{data_1.txt}")
print(f"{data_2.type} 2:\t{data_2.txt}")

# `is_sentence2` is a bool, so it selects index 0 (sentence 1) or 1 (sentence 2).
sentence_to_modify = data_inst_pairs[pair_idx][0][int(is_sentence2)]

# Set `contradicting_txt` to the new contradicting sentence.
# This will just update the text for now.

contradicting_txt = "NECK: No JVD, carotid pulses brisk, no bruits, cervical lymphadenopathy, trachea midline'"

sentence_to_modify.update_text(contradicting_txt)

print(f"\nNew contradicting sentence: {contradicting_txt}")

# Store conflict
generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

sentence 1:	   NECK: No JVD, carotid pulses brisk, no bruits, no cervical    lymphadenopathy, trachea midline'
sentence 2:	   Lymphatic: No(t) Cervical WNL, No(t) Supraclavicular WNL, No(t)    Cervical adenopathy'

New contradicting sentence: NECK: No JVD, carotid pulses brisk, no bruits, cervical lymphadenopathy, trachea midline'
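
Because Step 4 is copied for every patient, the edit-and-record logic could be wrapped in a small helper that also enforces the "sentence only" rule from the Step 3 cell. This is a sketch only: `insert_contradiction` is a hypothetical function, the `Sentence` class is a stand-in for the notebook's Data instances, and the `None` in each toy entry is a placeholder for whatever else the real `data_inst_pairs` entries carry.

```python
class Sentence:
    """Hypothetical stand-in for the notebook's Data/Sentence objects."""

    def __init__(self, txt, type="sentence"):
        self.txt = txt
        self.type = type

    def update_text(self, new_txt):
        self.txt = new_txt


def insert_contradiction(data_inst_pairs, generated_data_dict, hadm_id,
                         pair_idx, is_sentence2, contradicting_txt):
    """Update one side of a pair and record the edit (illustrative helper)."""
    target = data_inst_pairs[pair_idx][0][int(is_sentence2)]
    # Only note sentences may be edited, never lab or prescription entries.
    if target.type != "sentence":
        raise ValueError(f"Pair {pair_idx}: cannot edit a {target.type} entry")
    target.update_text(contradicting_txt)
    entry = generated_data_dict.setdefault(
        int(hadm_id), {"contradiction": {}, "none": []})
    entry["contradiction"][pair_idx] = (is_sentence2, contradicting_txt)
    return target


# Toy data mirroring the notebook's indexing layout.
demo_pairs = [
    ((Sentence("no cervical lymphadenopathy"),
      Sentence("No(t) Cervical adenopathy")), None),
    ((Sentence("Patient was prescribed ...", type="prescription"),
      Sentence("Dextrose 50% .")), None),
]
demo_store = {}
insert_contradiction(demo_pairs, demo_store, 186291, 0, False,
                     "cervical lymphadenopathy")
print(demo_store)
```

Calling the helper on the first element of the second toy pair raises `ValueError`, which is the intended guard against editing prescription entries.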


In [48]:
no_contradiction_pair_idx = [13, 81]

print("Examples of non-contradictions")
print("*****************************")
for pair_idx in no_contradiction_pair_idx:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")
    
# Store negative examples
generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

Examples of non-contradictions
*****************************
sentence 1:	He does not have lab evidence of pancreatitis, but does have    h/o gastritis.'
sentence 2:	No evience for cholycystitis, hepatitis or    pancreatitis.'
*****************************
sentence 1:	CT abdomen w/o evidence of any acute abdominal process    and KUB w/o evidence of SBO.'
sentence 2:	   KUB: Paucity of bowel gas, however, no radiographic evidence for bowel    obstruction.'
*****************************


## Patient 6

In [49]:
#### Process patient data and iterate over pairs of Data instances to get pairs
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[5] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df,
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker,
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 180836


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of

********** Processing data for 2152-02-15 **********
********** Processing data for 2152-02-16 **********
********** Processing data for 2152-02-17 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.5393598899705937
----- Data i -----
>> Time: 2152-02-17 10:25:00
>> Type: sentence
>> Concepts: {'Holding patient', 'ipratropium', 'Biphasic Positive Airway Pressure', 'Ipratropium', 'Spiriva'}
>> - holding spiriva    - receiving ipratropium and albuterol nebluizers    - azithromycin day 3    - pulse dose solumedrol    - BIPAP'
----- Data j -----
>> Time: 2152-02-17 10:25:00
>> Type: sentence
>> Concepts: {'oxygen', 'Biphasic Positive Airway Pressure'}
>>    Vitals T: 97.0 HR: 113 BP: 153/96  RR: 19 O2: 100% on BIPAP    General Thin elderly man, tachypneic, using accessory muscles for    respiration    HEENT sclera anicteric, conjunctiva pink, mucous membranes moist, no    lymphadenopathy'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.507092552837109

In [50]:
#### Inserting contradictions to Sentence instances
# IMPORTANT: We should only insert contradictions if it is a sentence from a note ("type" should be sentence, not lab or prescription)! 

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
23,"COPD: As above, end stage disease with FEV1 of 0.5 which is 20% predicted.'","Respiratory Distress HPI: Mr. [**Known lastname **] is a 67M with HIV (Cd4 183, VL 96 copies/mL) and end stage COPD on 3-4L home O2 with a FEV1 of 0.5 who presented to the emergency room on [**2152-2-15**] with increased shortness of breath.'",2152-02-17 10:25:00,2152-02-17 10:25:00,sentence,sentence,"{Terminal illness, Pulmonary Function Test/Forced Expiratory Volume 1}","{Respiratory distress, Human Immunodeficiency Virus Measurement, Ventral Lateral Thalamic Nucleus, Pulmonary Function Test/Forced Expiratory Volume 1}",0.53033,Diagnostic Procedure
46,"Abdominal: Soft, non-tender, non-distended, +BS'","Decreased BS throughout, hyperexpanded.'",2152-02-19 07:02:00,2152-02-19 14:04:00,sentence,sentence,"{Bloom Syndrome, Sea Soft}",{Bloom Syndrome},0.707107,Disease or Syndrome


In [51]:
diagnosis_pairs_df 

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
23,"COPD: As above, end stage disease with FEV1 of 0.5 which is 20% predicted.'","Respiratory Distress HPI: Mr. [**Known lastname **] is a 67M with HIV (Cd4 183, VL 96 copies/mL) and end stage COPD on 3-4L home O2 with a FEV1 of 0.5 who presented to the emergency room on [**2152-2-15**] with increased shortness of breath.'",2152-02-17 10:25:00,2152-02-17 10:25:00,sentence,sentence,"{Terminal illness, Pulmonary Function Test/Forced Expiratory Volume 1}","{Respiratory distress, Human Immunodeficiency Virus Measurement, Ventral Lateral Thalamic Nucleus, Pulmonary Function Test/Forced Expiratory Volume 1}",0.53033,Diagnostic Procedure
46,"Abdominal: Soft, non-tender, non-distended, +BS'","Decreased BS throughout, hyperexpanded.'",2152-02-19 07:02:00,2152-02-19 14:04:00,sentence,sentence,"{Bloom Syndrome, Sea Soft}",{Bloom Syndrome},0.707107,Disease or Syndrome
66,"COPD: As above, end stage disease with FEV1 of 0.5 which is 20% predicted.'","Attending 67M HIV (CD4 183/VL 96), severe COPD c FEV1 0.5 on 3-4L at home p/w cough, SOB and rising oxygen requirements.'",2152-02-19 07:02:00,2152-02-19 14:04:00,sentence,sentence,"{Terminal illness, Pulmonary Function Test/Forced Expiratory Volume 1}","{Human Immunodeficiency Virus Measurement, Dyspnea, Pulmonary Function Test/Forced Expiratory Volume 1}",0.639602,Diagnostic Procedure
76,"I agree with his note above, including assessment and plan.'","Disposition: plan call-out to floor today, much improved after short course intubatin.'",2152-02-20 07:22:00,2152-02-20 07:22:00,sentence,sentence,{Infantile Neuroaxonal Dystrophy},"{Intubation, Infantile Neuroaxonal Dystrophy}",0.866025,Disease or Syndrome


In [239]:
# Step 4: Insert contradictions

# We should probably aim for 1-2 contradictions per patient. 
# So basically, copy/paste code for Steps 1-4 for each patient, and push to GitHub.
# Small heads up -- for a given patient, try not to insert contradictions 
# into two sentences that look really similar. 
# Two pairs may refer to the same underlying Sentence instance, 
# so a later edit could overwrite a contradiction you previously inserted. 

# Look through the sentence pairs by going through `diagnosis_pairs_df`.
# If you find a good one you want to insert a contradiction for, 
# make note of the row index (i.e. the number at the left), 
# and set this to `pair_idx` below. 
# Also make note of which sentence (i.e. sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag appropriately.

In [52]:
pair_idx = 66
is_sentence2 = True

data_1 = data_inst_pairs[pair_idx][0][0]
data_2 = data_inst_pairs[pair_idx][0][1]

print(f"{data_1.type} 1:\t{data_1.txt}")
print(f"{data_2.type} 2:\t{data_2.txt}")

# `is_sentence2` is a bool, so it selects index 0 (sentence 1) or 1 (sentence 2).
sentence_to_modify = data_inst_pairs[pair_idx][0][int(is_sentence2)]

# Set `contradicting_txt` to the new contradicting sentence.
# This will just update the text for now.

contradicting_txt = "Attending 67M HIV (CD4 183/VL 96), severe COPD c FEV1 1.5 on 3-4L at home p/w cough, SOB and rising oxygen requirements."
sentence_to_modify.update_text(contradicting_txt)

print(f"\nNew contradicting sentence: {contradicting_txt}")

# Store conflict
generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

sentence 1:	   COPD:  As above, end stage disease with FEV1 of 0.5 which is 20%    predicted.'
sentence 2:	Attending    67M HIV (CD4 183/VL 96), severe COPD c FEV1 0.5 on 3-4L at home p/w    cough, SOB and rising oxygen requirements.'

New contradicting sentence: Attending 67M HIV (CD4 183/VL 96), severe COPD c FEV1 1.5 on 3-4L at home p/w cough, SOB and rising oxygen requirements.


In [53]:
no_contradiction_pair_idx = [23, 76]

print("Examples of non-contradictions")
print("*****************************")
for pair_idx in no_contradiction_pair_idx:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")
    
# Store negative examples
generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

Examples of non-contradictions
*****************************
sentence 1:	   COPD:  As above, end stage disease with FEV1 of 0.5 which is 20%    predicted.'
sentence 2:	Respiratory Distress    HPI:    Mr.  [**Known lastname **]  is a 67M with HIV (Cd4 183, VL 96 copies/mL) and end stage    COPD on 3-4L home O2 with a FEV1 of 0.5 who presented to the emergency    room on  [**2152-2-15**]  with increased shortness of breath.'
*****************************
sentence 1:	I agree with his    note above, including assessment and plan.'
sentence 2:	   Disposition:  plan call-out to floor today, much improved after short    course intubatin.'
*****************************


## Patient 7

In [54]:
#### Process patient data and iterate over pairs of Data instances to get pairs
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[6] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df,
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker,
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 154802


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.prescription_df[['START_DT']] = start_dt
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
A value is trying to be set on a copy of a slice from a

********** Processing data for 2121-03-31 **********
********** Processing data for 2121-04-01 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.6030226891555273
----- Data i -----
>> Time: 2121-04-01 09:29:00
>> Type: sentence
>> Concepts: {'Adrenergic beta-Antagonists', 'aspirin', 'Hydroxymethylglutaryl-CoA Reductase Inhibitors', 'Aspirin'}
>>    Cardiovascular: Aspirin, Beta-blocker, Statins'
----- Data j -----
>> Time: 2121-04-01 09:29:00
>> Type: sentence
>> Concepts: {'aspirin', 'Aspirin'}
>> Aspirin EC5.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.9999999999999997
----- Data i -----
>> Time: 2121-04-01 00:00:00
>> Type: prescription
>> Concepts: {'glucose', 'Glucose 50 MG/ML Oral Solution'}
>> Patient was prescribed 5% Dextrose 250mL Bag IV DRIP of total 250 mL
----- Data j -----
>> Time: 2121-04-01 09:29:00
>> Type: sentence
>> Concepts: {'glucose', 'Glucose 50 MG/ML Oral Solution'}
>> Dextrose    50% .'
*****************************

In [55]:
#### Inserting contradictions to Sentence instances
# IMPORTANT: We should only insert contradictions if it is a sentence from a note ("type" should be sentence, not lab or prescription)! 

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type


In [56]:
diagnosis_pairs_df

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
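
This patient has no diagnosis pairs, so Step 4 has nothing to work with. A small guard along the following lines could make that explicit instead of relying on commented-out cells. It is a sketch with a toy empty DataFrame; `empty_pairs_df` and `candidate_pair_idx` are hypothetical names.

```python
import pandas as pd

# Toy empty result with the same kind of columns as the pairs table.
empty_pairs_df = pd.DataFrame(
    columns=["sentence 1", "sentence 2", "cosine similarity", "semantic type"])

if empty_pairs_df.empty:
    # Nothing to choose from: skip contradiction insertion for this patient.
    candidate_pair_idx = []
    print("No diagnosis pairs for this patient; skipping Step 4.")
else:
    candidate_pair_idx = empty_pairs_df.index.tolist()
```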


In [244]:
# Step 4: Insert contradictions

# We should probably aim for 1-2 contradictions per patient. 
# So basically, copy/paste code for Steps 1-4 for each patient, and push to GitHub.
# Small heads up -- for a given patient, try not to insert contradictions 
# into two sentences that look really similar. 
# Two pairs may refer to the same underlying Sentence instance, 
# so a later edit could overwrite a contradiction you previously inserted. 

# Look through the sentence pairs by going through `diagnosis_pairs_df`.
# If you find a good one you want to insert a contradiction for, 
# make note of the row index (i.e. the number at the left), 
# and set this to `pair_idx` below. 
# Also make note of which sentence (i.e. sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag appropriately.

In [57]:
# pair_idx = 10
# is_sentence2 = True

# data_1 = data_inst_pairs[pair_idx][0][0]
# data_2 = data_inst_pairs[pair_idx][0][1]

# print(f"{data_1.type} 1:\t{data_1.txt}")
# print(f"{data_2.type} 2:\t{data_2.txt}")

# sentence_to_modify = data_inst_pairs[pair_idx][0][is_sentence2]

# # Set `contradicting_txt` to the new contradicting sentence.
# # This will just update the text for now.

# contradicting_txt = "Metoclopramide held."
# sentence_to_modify.update_text(contradicting_txt)

# print(f"\nNew contradicting sentence: {contradicting_txt}")

# # Store conflict
# generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

In [58]:
# no_contradiction_pair_idx = [13, 14, 39, 44]

# print("Examples of non-contradictions")
# print("*****************************")
# for pair_idx in no_contradiction_pair_idx:
#     data_1 = data_inst_pairs[pair_idx][0][0]
#     data_2 = data_inst_pairs[pair_idx][0][1]
    
#     print(f"{data_1.type} 1:\t{data_1.txt}")
#     print(f"{data_2.type} 2:\t{data_2.txt}")
#     print("*****************************")
    
# # Store negative examples
# generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

## Patient 8

In [59]:
#### Process patient data and iterate over pairs of Data instances to get pairs
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[7] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df,
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker,
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 133857


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]

********** Processing data for 2175-03-12 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.6123724356957946
----- Data i -----
>> Time: 2175-03-12 22:31:00
>> Type: sentence
>> Concepts: {'Nimodipine', 'Infantile Neuroaxonal Dystrophy', 'Dilantin', 'nimodipine'}
>>    Neurologic: Goal SBP < 140, Plan for angio tommorow, Nimodipine,    Dilantin, Hob >30 degrees.'
----- Data j -----
>> Time: 2175-03-12 22:31:00
>> Type: sentence
>> Concepts: {'Infantile Neuroaxonal Dystrophy'}
>>    Assessment And Plan: 69 year old male admitted with SAH'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 1.0
----- Data i -----
>> Time: 2175-03-12 00:00:00
>> Type: prescription
>> Concepts: {'niCARdipine', 'nicardipine'}
>> Patient was prescribed NiCARdipine IV 2.5mg/mL;10mL Amp IV DRIP of total 125 mg
----- Data j -----
>> Time: 2175-03-12 22:31:00
>> Type: sentence
>> Concepts: {'niCARdipine', 'nicardipine'}
>> Nicardipine gtt'
**********************************
****

In [60]:
#### Inserting contradictions to Sentence instances
# IMPORTANT: We should only insert contradictions if it is a sentence from a note ("type" should be sentence, not lab or prescription)! 

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,"Neurologic: Goal SBP < 140, Plan for angio tommorow, Nimodipine, Dilantin, Hob >30 degrees.'",Assessment And Plan: 69 year old male admitted with SAH',2175-03-12 22:31:00,2175-03-12 22:31:00,sentence,sentence,"{Nimodipine, Infantile Neuroaxonal Dystrophy, Dilantin, nimodipine}",{Infantile Neuroaxonal Dystrophy},0.612372,Disease or Syndrome
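The Step 3 filter can be sketched without pandas to show what it selects. This is only an illustration: the contents of `CONFLICT_TO_SEMANTIC_TYPE` and `SEMANTIC_TYPE_TO_NAME` below are assumed (the notebook defines the real mappings elsewhere), and `select_pairs` and the toy `rows` are hypothetical. The original row index is kept because `pair_idx` in Step 4 refers to it.

```python
# Assumed contents; the real mappings are defined earlier in the notebook.
CONFLICT_TO_SEMANTIC_TYPE = {"diagnosis": ["T047", "T060"]}
SEMANTIC_TYPE_TO_NAME = {
    "T047": "Disease or Syndrome",
    "T060": "Diagnostic Procedure",
    "T121": "Pharmacologic Substance",
}


def select_pairs(rows, conflict_type):
    """Return (original_index, row) for rows whose semantic type matches."""
    wanted = {
        SEMANTIC_TYPE_TO_NAME[st_id]
        for st_id in CONFLICT_TO_SEMANTIC_TYPE[conflict_type]
    }
    return [(i, row) for i, row in enumerate(rows) if row["semantic type"] in wanted]


# Toy rows standing in for the pairs DataFrame (made-up content).
rows = [
    {"sentence 1": "A", "sentence 2": "B", "semantic type": "Disease or Syndrome"},
    {"sentence 1": "C", "sentence 2": "D", "semantic type": "Pharmacologic Substance"},
    {"sentence 1": "E", "sentence 2": "F", "semantic type": "Diagnostic Procedure"},
]

selected = select_pairs(rows, "diagnosis")
print([i for i, _ in selected])  # indices are the original row positions
```

Because the filter keeps the original indices, a value read from the left-hand column of the filtered table can be used directly as `pair_idx`.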


In [61]:
diagnosis_pairs_df 

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,"Neurologic: Goal SBP < 140, Plan for angio tommorow, Nimodipine, Dilantin, Hob >30 degrees.'",Assessment And Plan: 69 year old male admitted with SAH',2175-03-12 22:31:00,2175-03-12 22:31:00,sentence,sentence,"{Nimodipine, Infantile Neuroaxonal Dystrophy, Dilantin, nimodipine}",{Infantile Neuroaxonal Dystrophy},0.612372,Disease or Syndrome


In [250]:
# Step 4: Insert contradictions

# We should probably aim for 1-2 contradictions per patient. 
# So basically, copy/paste the code for Steps 1-4 for each patient, and push to GitHub.
# Small heads-up: for a given patient, try not to insert contradictions 
# into two sentences that look nearly identical. 
# They might refer to the same underlying Sentence instance, 
# in which case the second edit would overwrite the first. 

# Look through the sentence pairs in `diagnosis_pairs_df`.
# If you find a good one you want to insert a contradiction for, 
# make note of the row index (i.e. the number at the left), 
# and set this to `pair_idx` below. 
# Also make note of which sentence (i.e. sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag appropriately.
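The overwrite risk mentioned in the heads-up comes from two pairs holding a reference to the same object. The sketch below uses a hypothetical stand-in `Sentence` class and made-up sentences (the real class is defined earlier in the notebook and its pairs are nested one level deeper), and a hypothetical helper `shared_instances` that checks object identity before editing.

```python
class Sentence:
    """Hypothetical stand-in for the notebook's Sentence class."""

    def __init__(self, txt):
        self.txt = txt

    def update_text(self, new_txt):
        self.txt = new_txt


s_a = Sentence("Hyponatremia: likely due to heart failure.")
s_b = Sentence("Concern that confusion was due to hyponatremia.")
s_c = Sentence("Na=113 on admission.")

# Two pairs that share the same underlying object (s_a).
pairs = [(s_a, s_b), (s_a, s_c)]


def shared_instances(pairs, idx_a, idx_b):
    """Return True if the two pairs share any object (by identity)."""
    ids_a = {id(s) for s in pairs[idx_a]}
    return any(id(s) in ids_a for s in pairs[idx_b])


# Editing through pair 0 is also visible through pair 1.
pairs[0][0].update_text("Hypernatremia: likely due to heart failure.")
print(pairs[1][0].txt)
print(shared_instances(pairs, 0, 1))
```

Running a check like this on two candidate `pair_idx` values before inserting a second contradiction would flag the case where the second edit silently replaces the first.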

In [76]:
# pair_idx = 10
# is_sentence2 = True

# data_1 = data_inst_pairs[pair_idx][0][0]
# data_2 = data_inst_pairs[pair_idx][0][1]

# print(f"{data_1.type} 1:\t{data_1.txt}")
# print(f"{data_2.type} 2:\t{data_2.txt}")

# sentence_to_modify = data_inst_pairs[pair_idx][0][is_sentence2]

# # Set `contradicting_txt` to the new contradicting sentence.
# # This will just update the text for now.

# contradicting_txt = "Phenytoin not given."
# sentence_to_modify.update_text(contradicting_txt)

# print(f"\nNew contradicting sentence: {contradicting_txt}")

# # Store conflict
# generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

prescription 1:	Patient was prescribed Phenytoin Sodium 250mg/5mL Vial IV of total 1000 mg
sentence 2:	Phenytoin'

New contradicting sentence: Phenytoin not given.


In [62]:
no_contradiction_pair_idx = [0]

print("Examples of non-contradictions")
print("*****************************")
for pair_idx in no_contradiction_pair_idx:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")
    
# Store negative examples
generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

Examples of non-contradictions
*****************************
sentence 1:	   Neurologic: Goal SBP < 140, Plan for angio tommorow, Nimodipine,    Dilantin, Hob >30 degrees.'
sentence 2:	   Assessment And Plan: 69 year old male admitted with SAH'
*****************************


## Patient 9

In [63]:
#### Process patient data and iterate over pairs of Data instances to get pairs
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[8] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df, \
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker, \
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 166389


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.prescription_df[['START_DT']] = start_dt

********** Processing data for 2196-10-13 **********
********** Processing data for 2196-10-14 **********
***** PAIR INDEX 0 *****
Cosine similarity: 1.0
----- Data i -----
>> Time: 2196-10-14 22:36:00
>> Type: sentence
>> Concepts: {'Hyponatremia'}
>> # Hyponatremia: No clear baseline.'
----- Data j -----
>> Time: 2196-10-14 22:36:00
>> Type: sentence
>> Concepts: {'Hyponatremia'}
>> Given hyponatremia, this is concerning for    new mental status changes, however, attention intact.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.5773502691896258
----- Data i -----
>> Time: 2196-10-14 22:36:00
>> Type: sentence
>> Concepts: {'Refractory anemias', 'Erectile dysfunction'}
>> ED course: Vitals: T 98 80 134/90 12 100% on RA.'
----- Data j -----
>> Time: 2196-10-14 22:36:00
>> Type: sentence
>> Concepts: {'potassium', 'Erectile dysfunction'}
>> # Hypokalemia/hypophosphatemia: K of 3 in the ED, now 2.9 after    reportedly receiving K in the ED.'
************

In [64]:
#### Inserting contradictions to Sentence instances
# IMPORTANT: We should only insert contradictions if it is a sentence from a note ("type" should be sentence, not lab or prescription)! 

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,# Hyponatremia: No clear baseline.',"Given hyponatremia, this is concerning for new mental status changes, however, attention intact.'",2196-10-14 22:36:00,2196-10-14 22:36:00,sentence,sentence,{Hyponatremia},{Hyponatremia},1.0,Disease or Syndrome
1,ED course: Vitals: T 98 80 134/90 12 100% on RA.',"# Hypokalemia/hypophosphatemia: K of 3 in the ED, now 2.9 after reportedly receiving K in the ED.'",2196-10-14 22:36:00,2196-10-14 22:36:00,sentence,sentence,"{Refractory anemias, Erectile dysfunction}","{potassium, Erectile dysfunction}",0.57735,Disease or Syndrome


In [65]:
diagnosis_pairs_df 

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,# Hyponatremia: No clear baseline.',"Given hyponatremia, this is concerning for new mental status changes, however, attention intact.'",2196-10-14 22:36:00,2196-10-14 22:36:00,sentence,sentence,{Hyponatremia},{Hyponatremia},1.0,Disease or Syndrome
1,ED course: Vitals: T 98 80 134/90 12 100% on RA.',"# Hypokalemia/hypophosphatemia: K of 3 in the ED, now 2.9 after reportedly receiving K in the ED.'",2196-10-14 22:36:00,2196-10-14 22:36:00,sentence,sentence,"{Refractory anemias, Erectile dysfunction}","{potassium, Erectile dysfunction}",0.57735,Disease or Syndrome
26,"Given that, renal was consulted with plan for hypertonic saline.'",Imaging: CXR: clear lungs Assessment and Plan 89 y.o.',2196-10-15 09:05:00,2196-10-15 09:05:00,sentence,sentence,"{Infantile Neuroaxonal Dystrophy, Kidney}","{Lung, Infantile Neuroaxonal Dystrophy}",0.75,Disease or Syndrome
27,Pt was admitted to medical floor and plan for fluid restriction.',Imaging: CXR: clear lungs Assessment and Plan 89 y.o.',2196-10-15 09:05:00,2196-10-15 09:05:00,sentence,sentence,"{Physical therapy, Infantile Neuroaxonal Dystrophy, Fluid restriction}","{Lung, Infantile Neuroaxonal Dystrophy}",0.566947,Disease or Syndrome
28,Pt was admitted to medical floor and plan for fluid restriction.',"Given that, renal was consulted with plan for hypertonic saline.'",2196-10-15 09:05:00,2196-10-15 09:05:00,sentence,sentence,"{Physical therapy, Infantile Neuroaxonal Dystrophy, Fluid restriction}","{Infantile Neuroaxonal Dystrophy, Kidney}",0.566947,Disease or Syndrome
29,"Given hyponatremia, this is concerning for new mental status changes, however, attention intact.'","Hyponatremia HPI: 89M p/w yesterday with confusion, Na=113.'",2196-10-15 06:46:00,2196-10-15 09:05:00,sentence,sentence,{Hyponatremia},{Hyponatremia},1.0,Disease or Syndrome
30,# Hyponatremia: No clear baseline.',"Hyponatremia HPI: 89M p/w yesterday with confusion, Na=113.'",2196-10-15 06:46:00,2196-10-15 09:05:00,sentence,sentence,{Hyponatremia},{Hyponatremia},1.0,Disease or Syndrome
31,# Hyponatremia: No clear baseline.',"Given hyponatremia, this is concerning for new mental status changes, however, attention intact.'",2196-10-15 06:46:00,2196-10-15 06:46:00,sentence,sentence,{Hyponatremia},{Hyponatremia},1.0,Disease or Syndrome


In [97]:
# Step 4: Insert contradictions

# We should probably aim for 1-2 contradictions per patient. 
# So basically, copy/paste the code for Steps 1-4 for each patient, and push to GitHub.
# Small heads-up: for a given patient, try not to insert contradictions 
# into two sentences that look nearly identical. 
# They might refer to the same underlying Sentence instance, 
# in which case the second edit would overwrite the first. 

# Look through the sentence pairs in `diagnosis_pairs_df`.
# If you find a good one you want to insert a contradiction for, 
# make note of the row index (i.e. the number at the left), 
# and set this to `pair_idx` below. 
# Also make note of which sentence (i.e. sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag appropriately.

In [98]:
# pair_idx = 28
# is_sentence2 = True

# data_1 = data_inst_pairs[pair_idx][0][0]
# data_2 = data_inst_pairs[pair_idx][0][1]

# print(f"{data_1.type} 1:\t{data_1.txt}")
# print(f"{data_2.type} 2:\t{data_2.txt}")

# sentence_to_modify = data_inst_pairs[pair_idx][0][is_sentence2]

# # Set `contradicting_txt` to the new contradicting sentence.
# # This will just update the text for now.

# contradicting_txt = "He received IVFand was free water restricted, KCL stopped."
# sentence_to_modify.update_text(contradicting_txt)

# print(f"\nNew contradicting sentence: {contradicting_txt}")

# # Store conflict
# generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

prescription 1:	Patient was prescribed Potassium Chloride 2mEq/mL-20mL IV of total 40 mEq
sentence 2:	He received IVF, 60    mEq of KCL, and was free water restricted.'

New contradicting sentence: He received IVFand was free water restricted, KCL stopped.


In [66]:
no_contradiction_pair_idx = [29]

print("Examples of non-contradictions")
print("*****************************")
for pair_idx in no_contradiction_pair_idx:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")
    
# Store negative examples
generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

Examples of non-contradictions
*****************************
sentence 1:	Given hyponatremia, this is concerning for    new mental status changes, however, attention intact.'
sentence 2:	Hyponatremia    HPI:    89M p/w yesterday with confusion, Na=113.'
*****************************


## Patient 10

In [67]:
#### Process patient data and iterate over pairs of Data instances to get pairs
# Step 1: Select a patient -- processes all the data
hadm_id = hadm_ids[9] # Note: `hadm_ids` is a list of all HADM id's with consecutive physician notes

# for storing data
generated_data_dict[int(hadm_id)] = {"contradiction": {}, "none": []}

print(f"Patient {int(hadm_id)}")

pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df, \
              med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker, \
              physician_only=True)

# Making data directory
processed_dir = 'processed'
os.makedirs(processed_dir, exist_ok=True)

pt_csv = os.path.join(processed_dir, f"{int(hadm_id)}.csv")

# Step 2: Generate pairs for this patient
df, data_inst_pairs = generate_data_pairs(pat)

# df.to_csv(pt_csv)
# print("Data has been saved!")

Patient 196357


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.prescription_df[['START_DT']] = start_dt

********** Processing data for 2143-04-20 **********
********** Processing data for 2143-04-21 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.6546536707079772
----- Data i -----
>> Time: 2143-04-21 09:19:00
>> Type: sentence
>> Concepts: {'Heart failure', 'Hyponatremia'}
>>    Hyponatremia:  Likely due to heart failure.'
----- Data j -----
>> Time: 2143-04-21 09:19:00
>> Type: sentence
>> Concepts: {'Heart failure', 'Kidney Failure, Acute'}
>>    Acute Renal Failure:   [**Month (only) 60**]  be due to decompensated heart failure.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.7071067811865475
----- Data i -----
>> Time: 2143-04-21 01:24:00
>> Type: sentence
>> Concepts: {'Erectile dysfunction'}
>> Nephrology not consulted in ED.'
----- Data j -----
>> Time: 2143-04-21 09:19:00
>> Type: sentence
>> Concepts: {'Refractory anemias', 'Erectile dysfunction'}
>> In ED sat 88%    on RA, felt to be fluid overloaded.'
********************************
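The "Cosine similarity" values printed for each pair come from `generate_data_pairs`, whose implementation is not shown in this section. As a rough illustration only, here is one way a cosine similarity over concept counts could be computed; the function `cosine_similarity` and the count-based representation are assumptions, not the notebook's actual method.

```python
import math
from collections import Counter


def cosine_similarity(concepts_a, concepts_b):
    """Cosine similarity between two bags of concept names (assumed scheme)."""
    a, b = Counter(concepts_a), Counter(concepts_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no concepts on one side: treat as dissimilar
    return dot / (norm_a * norm_b)


# One shared concept against a two-concept sentence.
sim = cosine_similarity(
    ["Erectile dysfunction"],
    ["Refractory anemias", "Erectile dysfunction"],
)
print(round(sim, 6))
```

Under this assumption repeated mentions of a concept change the score, so two sentences with the same concept sets need not score identically.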

In [68]:
#### Inserting contradictions to Sentence instances
# IMPORTANT: We should only insert contradictions if it is a sentence from a note ("type" should be sentence, not lab or prescription)! 

# Step 3: Get all the pairs about diagnoses
semantic_type_ids   = CONFLICT_TO_SEMANTIC_TYPE['diagnosis']
semantic_type_names = [SEMANTIC_TYPE_TO_NAME[st_id] for st_id in semantic_type_ids]

is_diagnosis = df['semantic type'].apply(lambda x: x in semantic_type_names)
diagnosis_pairs_df = df.loc[is_diagnosis]

diagnosis_pairs_df.head(2)

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,Hyponatremia: Likely due to heart failure.',Acute Renal Failure: [**Month (only) 60**] be due to decompensated heart failure.',2143-04-21 09:19:00,2143-04-21 09:19:00,sentence,sentence,"{Heart failure, Hyponatremia}","{Heart failure, Kidney Failure, Acute}",0.654654,Disease or Syndrome
1,Nephrology not consulted in ED.',"In ED sat 88% on RA, felt to be fluid overloaded.'",2143-04-21 01:24:00,2143-04-21 09:19:00,sentence,sentence,{Erectile dysfunction},"{Refractory anemias, Erectile dysfunction}",0.707107,Disease or Syndrome


In [69]:
diagnosis_pairs_df 

Unnamed: 0,sentence 1,sentence 2,time 1,time 2,type 1,type 2,concepts 1,concepts 2,cosine similarity,semantic type
0,Hyponatremia: Likely due to heart failure.',Acute Renal Failure: [**Month (only) 60**] be due to decompensated heart failure.',2143-04-21 09:19:00,2143-04-21 09:19:00,sentence,sentence,"{Heart failure, Hyponatremia}","{Heart failure, Kidney Failure, Acute}",0.654654,Disease or Syndrome
1,Nephrology not consulted in ED.',"In ED sat 88% on RA, felt to be fluid overloaded.'",2143-04-21 01:24:00,2143-04-21 09:19:00,sentence,sentence,{Erectile dysfunction},"{Refractory anemias, Erectile dysfunction}",0.707107,Disease or Syndrome
2,ARF: Picture consistent with pre-renal picture.',Acute Renal Failure: [**Month (only) 60**] be due to decompensated heart failure.',2143-04-21 01:24:00,2143-04-21 09:19:00,sentence,sentence,{Acute respiratory failure},"{Heart failure, Kidney Failure, Acute}",0.654654,Disease or Syndrome
3,"ED Course Notables: brought in by family after being found confused.', ""She appeared grossly overloaded on exam, 88% RA, improved to mid-90's on 2L NC.""","In ED sat 88% on RA, felt to be fluid overloaded.'",2143-04-21 01:24:00,2143-04-21 09:19:00,sentence,sentence,"{Refractory anemias, Erectile dysfunction}","{Refractory anemias, Erectile dysfunction}",1.0,Disease or Syndrome
4,"ED Course Notables: brought in by family after being found confused.', ""She appeared grossly overloaded on exam, 88% RA, improved to mid-90's on 2L NC.""",Nephrology not consulted in ED.',2143-04-21 01:24:00,2143-04-21 01:24:00,sentence,sentence,"{Refractory anemias, Erectile dysfunction}",{Erectile dysfunction},0.707107,Disease or Syndrome
5,"CTA thought about, but not done due to renal failure.'",Acute Renal Failure: [**Month (only) 60**] be due to decompensated heart failure.',2143-04-21 01:24:00,2143-04-21 09:19:00,sentence,sentence,{Kidney Failure},"{Heart failure, Kidney Failure, Acute}",0.801784,Disease or Syndrome
6,Concern that confusion was due symptomatic hyponatremia.',Hyponatremia: Likely due to heart failure.',2143-04-21 01:24:00,2143-04-21 09:19:00,sentence,sentence,{Hyponatremia},"{Heart failure, Hyponatremia}",0.57735,Disease or Syndrome
7,"Assessment and Plan 72F with metastatic ovarian and pancreatic cancer with lung involvement with recent admission for CAP, who is brought in by family for concerns of confusion, lethargy, DOE, poor PO intake and found to have hyponatremia.'","I agree with his / her note above, including assessment and plan.'",2143-04-21 01:24:00,2143-04-21 09:19:00,sentence,sentence,"{Hyponatremia, Infantile Neuroaxonal Dystrophy, Dyspnea on exertion, Lethargy, Take Action, Pancreatin}",{Infantile Neuroaxonal Dystrophy},0.522233,Disease or Syndrome
49,EKG with ?',ECG: poor R wave progression.',2143-04-21 01:24:00,2143-04-21 09:19:00,sentence,sentence,{Electrocardiography},{Electrocardiography},1.0,Diagnostic Procedure
50,She has had LE edema in the past and was recently diagnosed with lymphangitis carcinomatosis.',"Patient without nausea or vomiting, but if this ensues, would quickly entertain diagnosis of obstruction due to underlying disease.'",2143-04-21 01:24:00,2143-04-21 01:24:00,sentence,sentence,{Diagnosis},"{Nausea, Diagnosis}",0.707107,Diagnostic Procedure


In [261]:
# Step 4: Insert contradictions

# We should probably aim for 1-2 contradictions per patient. 
# So basically, copy/paste the code for Steps 1-4 for each patient, and push to GitHub.
# Small heads-up: for a given patient, try not to insert contradictions 
# into two sentences that look nearly identical. 
# They might refer to the same underlying Sentence instance, 
# in which case the second edit would overwrite the first. 

# Look through the sentence pairs in `diagnosis_pairs_df`.
# If you find a good one you want to insert a contradiction for, 
# make note of the row index (i.e. the number at the left), 
# and set this to `pair_idx` below. 
# Also make note of which sentence (i.e. sentence 1 or sentence 2)
# you want to modify, and set the `is_sentence2` flag appropriately.

In [70]:
pair_idx = 6
is_sentence2 = False

data_1 = data_inst_pairs[pair_idx][0][0]
data_2 = data_inst_pairs[pair_idx][0][1]

print(f"{data_1.type} 1:\t{data_1.txt}")
print(f"{data_2.type} 2:\t{data_2.txt}")

sentence_to_modify = data_inst_pairs[pair_idx][0][is_sentence2]

# Set `contradicting_txt` to the new contradicting sentence.
# This will just update the text for now.

contradicting_txt = "Concern that confusion was due symptomatic hypernatremia."
sentence_to_modify.update_text(contradicting_txt)

print(f"\nNew contradicting sentence: {contradicting_txt}")

# Store conflict
generated_data_dict[int(hadm_id)]['contradiction'][pair_idx] = (is_sentence2, contradicting_txt)

sentence 1:	Concern that    confusion was due symptomatic hyponatremia.'
sentence 2:	   Hyponatremia:  Likely due to heart failure.'

New contradicting sentence: Concern that confusion was due symptomatic hypernatremia.
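The cell above indexes the pair with the boolean `is_sentence2` and stores `(is_sentence2, contradicting_txt)` under `pair_idx`. The sketch below shows why the boolean index works and how a stored record can be replayed later. The `Sentence` class is a hypothetical stand-in, `apply_contradictions` is an illustrative helper, and the pairs are flattened relative to the notebook's `data_inst_pairs[pair_idx][0]`.

```python
class Sentence:
    """Hypothetical stand-in; the real class is defined earlier in the notebook."""

    def __init__(self, txt):
        self.txt = txt

    def update_text(self, new_txt):
        self.txt = new_txt


def apply_contradictions(pairs, contradictions):
    """Replay stored edits of the form {pair_idx: (is_sentence2, new_text)}."""
    for pair_idx, (is_sentence2, new_txt) in contradictions.items():
        # bool is a subclass of int: False selects element 0, True element 1.
        pairs[pair_idx][int(is_sentence2)].update_text(new_txt)


# Made-up pair keyed by its pair index.
pairs = {6: (Sentence("first sentence"), Sentence("second sentence"))}
stored = {6: (False, "first sentence, contradicted")}

apply_contradictions(pairs, stored)
print(pairs[6][0].txt)
print(pairs[6][1].txt)
```

Writing `int(is_sentence2)` is optional, since `pair[False]` already means `pair[0]`, but it makes the intent explicit.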


In [71]:
no_contradiction_pair_idx = [68, 96]

print("Examples of non-contradictions")
print("*****************************")
for pair_idx in no_contradiction_pair_idx:
    data_1 = data_inst_pairs[pair_idx][0][0]
    data_2 = data_inst_pairs[pair_idx][0][1]
    
    print(f"{data_1.type} 1:\t{data_1.txt}")
    print(f"{data_2.type} 2:\t{data_2.txt}")
    print("*****************************")
    
# Store negative examples
generated_data_dict[int(hadm_id)]['none'] = no_contradiction_pair_idx

Examples of non-contradictions
*****************************
sentence 1:	Acute Renal Failure:   [**Month (only) 60**]  be due to decompensated heart failure.'
sentence 2:	   ARF: Picture consistent with pre-renal picture.'
*****************************
sentence 1:	KUB: Single supine view shows non-obstructive bowel gas pattern.'
sentence 2:	Had nausea overnight -> KUB unrevealing.'
*****************************


In [72]:
generated_data_dict

{155131: {'contradiction': {11: (True,
    'Compared to previous CXR, no LLL infiltrate; right lung hyperinflation more impressive.')},
  'none': [10, 17]},
 129414: {'contradiction': {6: (True,
    'Obesity: Glucose intolerance: Obstructive sleep apnea: mitigated with CPAP therapy')},
  'none': [3, 5]},
 133623: {'contradiction': {1: (True,
    'No reported Chest Pain, ETOH Intoxication HPI: 54M with hx of ETOH abuse, HCV, presented to the ED this evening intoxicated')},
  'none': [63, 64]},
 197325: {'contradiction': {1: (False, 'Hypotension.')}, 'none': [26, 95]},
 186291: {'contradiction': {2: (False,
    "NECK: No JVD, carotid pulses brisk, no bruits, cervical lymphadenopathy, trachea midline'")},
  'none': [13, 81]},
 180836: {'contradiction': {66: (True,
    'Attending 67M HIV (CD4 183/VL 96), severe COPD c FEV1 1.5 on 3-4L at home p/w cough, SOB and rising oxygen requirements.')},
  'none': [23, 76]},
 154802: {'contradiction': {}, 'none': []},
 133857: {'contradiction': {}, 'n

In [104]:
import pickle
data_dict_file = "generated_data_dict_diagnosis.pkl"
with open(data_dict_file, "wb") as f:
    pickle.dump(generated_data_dict, f)
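A quick round trip can confirm that the dictionary survives pickling with its structure intact. The sketch below uses a single entry copied from the Patient 10 cells above and writes to a temporary directory; the file name and the `is_valid_entry` helper are hypothetical.

```python
import os
import pickle
import tempfile

# One entry in the same shape as `generated_data_dict`.
generated = {
    196357: {
        "contradiction": {
            6: (False, "Concern that confusion was due symptomatic hypernatremia.")
        },
        "none": [68, 96],
    },
}


def is_valid_entry(entry):
    """Check the expected shape: {'contradiction': {int: (bool, str)}, 'none': [int]}."""
    if set(entry) != {"contradiction", "none"}:
        return False
    for pair_idx, value in entry["contradiction"].items():
        if not isinstance(pair_idx, int) or len(value) != 2:
            return False
        if not isinstance(value[0], bool) or not isinstance(value[1], str):
            return False
    return all(isinstance(i, int) for i in entry["none"])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "generated_data_dict_example.pkl")  # hypothetical name
    with open(path, "wb") as f:
        pickle.dump(generated, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == generated)
print(all(is_valid_entry(e) for e in restored.values()))
```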

# 5. Loading contradictions data for pipeline [skip 4 if pickle file already created]

If `generated_data_dict_lab.pkl` has already been created, skip part 4. You should still run the initial cells of that section (those above "README").

Reprocessing takes about 2 minutes per HADM_ID, so roughly 20 minutes in total.

In [105]:
# 9 - positive examples
# 16 - negative examples

In [106]:
import pickle
data_dict_file = "generated_data_dict_lab.pkl"
with open(data_dict_file, "rb") as f:
    generated_data_dict = pickle.load(f)

In [107]:
generated_dataset = [] # list of tuples, ((data 1, data 2), label)

for hadm_id in hadm_ids[:10]:
    print("***********************************")
    print(f"Patient {int(hadm_id)}")
    try:
        hadm_generated_dict = generated_data_dict[int(hadm_id)]
    except KeyError:
        print("This patient does not exist in contradiction set.")
        continue
        
    # Step 1: Select a patient -- process all data
    pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df, \
                  med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker, \
                  physician_only=True)

    # Step 2: Generate pairs for this patient
    df, data_inst_pairs = generate_data_pairs(pat)

    # Step 3A: Insert contradictions 
    print("+++++ Inserting contradictions +++++")
    for pair_idx, (is_sentence2, contradicting_txt) in hadm_generated_dict['contradiction'].items():
        data_1 = data_inst_pairs[pair_idx][0][0]
        data_2 = data_inst_pairs[pair_idx][0][1]

        print(f"{data_1.type} 1:\t{data_1.txt}")
        print(f"{data_2.type} 2:\t{data_2.txt}")

        sentence_to_modify = data_inst_pairs[pair_idx][0][is_sentence2]

        # Set `contradicting_txt` to the new contradicting sentence.
        # Update text and reprocess features.
        sentence_to_modify.update_text(contradicting_txt, True)

        print(f"\nNew contradicting sentence: {contradicting_txt}")
        print("+++++++++++++++++++++++++++++++++++")
        
        # Add example to dataset
        if is_sentence2:
            sentences = (data_1, sentence_to_modify)
        else:
            sentences = (sentence_to_modify, data_2)
        generated_dataset.append((sentences, 1)) # these are all contradictions
    
    # Step 3B: Insert negative examples (not contradictions)
    print("+++++ Inserting negative examples +++++")
    for pair_idx in hadm_generated_dict['none']:
        data_1 = data_inst_pairs[pair_idx][0][0]
        data_2 = data_inst_pairs[pair_idx][0][1]

        print(f"{data_1.type} 1:\t{data_1.txt}")
        print(f"{data_2.type} 2:\t{data_2.txt}")
        
        generated_dataset.append(((data_1, data_2), 0))
        print("+++++++++++++++++++++++++++++++++++")

***********************************
Patient 155131


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]

KeyboardInterrupt: 

In [None]:
n = len(generated_dataset)
n_negatives = len(list(filter(lambda x: x[1]==0, generated_dataset)))
n_positives = len(list(filter(lambda x: x[1]==1, generated_dataset)))

print(f"We have {n} total examples\n\t- {n_negatives} negative examples\n\t- {n_positives} positive examples")
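The same tally can be written with `collections.Counter`, which counts every label in one pass. The dataset below is a made-up stand-in in the same `((data 1, data 2), label)` shape.

```python
from collections import Counter

# Stand-in dataset: ((data 1, data 2), label), label 1 = contradiction.
generated_dataset = [
    (("s1", "s2"), 1),
    (("s3", "s4"), 0),
    (("s5", "s6"), 0),
]

label_counts = Counter(label for _, label in generated_dataset)
n = sum(label_counts.values())

print(f"We have {n} total examples")
print(f"\t- {label_counts[0]} negative examples")
print(f"\t- {label_counts[1]} positive examples")
```

A `Counter` returns 0 for a label that never occurs, so the summary still prints when one class is empty.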

# 6. Generating evaluation data (unlabeled) from MIMIC

We'll skip the first 10 patients, since they were used to generate the contradictions.
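The two slices `hadm_ids[:10]` and `hadm_ids[10:20]` only keep labeled and evaluation patients apart if `hadm_ids` contains no repeated IDs. A small check, using placeholder IDs rather than real HADM IDs:

```python
# Placeholder IDs; in the notebook `hadm_ids` comes from the MIMIC notes.
hadm_ids = list(range(1000, 1030))

labeled_ids = hadm_ids[:10]    # used for generated contradictions
eval_ids = hadm_ids[10:20]     # used for unlabeled evaluation data

has_duplicates = len(set(hadm_ids)) != len(hadm_ids)
overlap = set(labeled_ids) & set(eval_ids)

print(f"duplicates in hadm_ids: {has_duplicates}")
print(f"patients in both splits: {sorted(overlap)}")
```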

In [55]:
processed_dir = "processed"
os.makedirs(processed_dir, exist_ok=True)

In [56]:
per_pat_dataset_dict = {} # maps HADMID to patient's dataset in the form [((data 1, data 2), label), ...]
df_list = []
for hadm_id in hadm_ids[10:20]:
    print("***********************************")
    print(f"Patient {int(hadm_id)}")
        
    # Step 1: Select a patient -- process all data
    pat = Patient(hadm_id, notes_df, drug_df, lab_df, d_lab_df, \
                  med7_nlp, umls_nlp, rxnorm_nlp, umls_linker, rxnorm_linker, \
                  physician_only=True)

    # Step 2: Generate pairs for this patient
    df, data_inst_pairs = generate_data_pairs(pat)
    df['HADM_ID'] = hadm_id
    per_pat_dataset_dict[hadm_id] = data_inst_pairs
    df_list.append(df)
    
    df.to_csv(f"{processed_dir}/{int(hadm_id)}.csv")

***********************************
Patient 162197


  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.prescription_df[['START_DT']] = start_dt
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/us

********** Processing data for 2189-09-07 **********
********** Processing data for 2189-09-08 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.9999999999999998
----- Data i -----
>> Time: 2189-09-08 00:49:00
>> Type: sentence
>> Concepts: {'Communicable Diseases'}
>> Most likely with ascending GU infection.'
----- Data j -----
>> Time: 2189-09-08 00:49:00
>> Type: sentence
>> Concepts: {'Communicable Diseases'}
>> Also sepsis criteria given identification of GU infection.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 1.0
----- Data i -----
>> Time: 2189-09-08 00:49:00
>> Type: sentence
>> Concepts: {'Pyelonephritis'}
>> Findings c/w pyelonephritis with    ureteritis bilaterally.'
----- Data j -----
>> Time: 2189-09-08 00:49:00
>> Type: sentence
>> Concepts: {'Pyelonephritis'}
>> She had a grossly positive U/A, and CT Abd/Pelvis    showed evidence of bilateral pyelonephritis.'
**********************************
***** PAIR INDEX 2 *****
Cosine s

********** Processing data for 2119-04-06 **********
********** Processing data for 2119-04-07 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.6324555320336758
----- Data i -----
>> Time: 2119-04-07 02:21:00
>> Type: sentence
>> Concepts: {'X-Ray Computed Tomography', 'Communicable Diseases'}
>> Infection cannot be ruled out based on CT scan of    the neck.'
----- Data j -----
>> Time: 2119-04-07 02:21:00
>> Type: sentence
>> Concepts: {'Communicable Diseases'}
>> In    the ED she was given Azithromycin, Vancomycin, and Ceftriaxone for    presumed infection.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.9999999999999998
----- Data i -----
>> Time: 2119-04-07 02:21:00
>> Type: sentence
>> Concepts: {'Communicable Diseases'}
>> No infection of orbit or sella found on prelim read.'
----- Data j -----
>> Time: 2119-04-07 02:21:00
>> Type: sentence
>> Concepts: {'Communicable Diseases'}
>> In    the ED she was given Azithromycin, Vancomycin, and 

********** Processing data for 2132-04-10 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.9999999999999998
----- Data i -----
>> Time: 2132-04-10 19:21:00
>> Type: sentence
>> Concepts: {'Refractory anemias'}
>>    Vitals: T 98.7, P 91, BP 155/88, RR 12, O2 sat 98 RA'
----- Data j -----
>> Time: 2132-04-10 19:21:00
>> Type: sentence
>> Concepts: {'Refractory anemias'}
>> In the ED, initial VS were: T 97.1, P 84, BP 126/90, RR 18, O2sat 95    RA.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.8451542547285165
----- Data i -----
>> Time: 2132-04-10 19:21:00
>> Type: sentence
>> Concepts: {'Infantile Neuroaxonal Dystrophy', 'Gingi Med'}
>> Two    18 gauge PIVs placed for access, and patient typed & crossed for 2    units; given 2.5 L NS  GI evaluated him with a plan to scope him in ICU    while intubated.'
----- Data j -----
>> Time: 2132-04-10 19:21:00
>> Type: sentence
>> Concepts: {'Infantile Neuroaxonal Dystrophy', 'Gingi Med', 'Duodenal Ulc

********** Processing data for 2173-06-30 **********
********** Processing data for 2173-07-01 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.816496580927726
----- Data i -----
>> Time: 2173-07-01 05:14:00
>> Type: sentence
>> Concepts: {'Kidney Failure, Chronic'}
>>    Fluids: ESRD.'
----- Data j -----
>> Time: 2173-07-01 05:14:00
>> Type: sentence
>> Concepts: {'Kidney Failure, Chronic', 'Kidney', 'Huntington Disease'}
>>    Renal: Foley, ESRD on HD (MWF).'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 1.0
----- Data i -----
>> Time: 2173-07-01 05:14:00
>> Type: sentence
>> Concepts: {'Americium'}
>> Re-attempt access in AM.'
----- Data j -----
>> Time: 2173-07-01 05:14:00
>> Type: sentence
>> Concepts: {'Americium'}
>> OR in Am for second look'
**********************************
***** PAIR INDEX 2 *****
Cosine similarity: 0.5773502691896258
----- Data i -----
>> Time: 2173-07-01 05:14:00
>> Type: sentence
>> Concepts: {'Dialysis procedure',

********** Processing data for 2173-07-24 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.7647058823529411
----- Data i -----
>> Time: 2173-07-24 13:17:00
>> Type: sentence
>> Concepts: {'Atrial Fibrillation', 'Hematochezia', 'Hypertensive disease', 'Huntington Disease', 'International Normalized Ratio', 'Hematocrit procedure', 'Atrial Fibrillation Sotalol Hydrochloride 80 MG Oral Tablet [Betapace]', 'Kidney Failure, Chronic', 'Coumadin', 'Small bowel obstruction'}
>> Chief Complaint:  BRBPR    HPI:    78 year old male with a past medical history significant for DM, HTN,    atrial fibrillation on coumadin, hx tachy-brady s/p pacemaker, ESRD on    HD s/p recent ex-lap *2 for small bowel obstruction night prior to    admission BRBPR with INR 2.7, HCT 28 at rehab.'
----- Data j -----
>> Time: 2173-07-24 13:17:00
>> Type: sentence
>> Concepts: {'Atrial Fibrillation', 'Infantile Neuroaxonal Dystrophy', 'Hypertensive disease', 'Huntington Disease', 'Atrial Fibrillation Sotalol Hydro

********** Processing data for 2173-08-27 **********
********** Processing data for 2173-08-28 **********
********** Processing data for 2173-08-29 **********
********** Processing data for 2173-08-30 **********
********** Processing data for 2173-08-31 **********
********** Processing data for 2173-09-01 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.8660254037844387
----- Data i -----
>> Time: 2173-09-01 22:13:00
>> Type: sentence
>> Concepts: {'Morphine Sulfate 10 MG Oral Tablet'}
>> Adequate MS  [**First Name (Titles) **]   [**Last Name (Titles) 10259**] way protection.'
----- Data j -----
>> Time: 2173-09-01 22:13:00
>> Type: sentence
>> Concepts: {'Morphine Sulfate 10 MG Oral Tablet', 'Acute Cholecystitis'}
>>    HPI: 78 male admitted to the west 1 service for acute cholecystitis now    s.p  perchole  [**8-30**]   presents with  evolving L posterior temp stroke,    worsening mental status, a flutter/ fib transferred to the unit due to    poor MS and concern for airway p

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
A value is trying to be set on a copy of a slice from a

********** Processing data for 2199-03-18 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.7071067811865475
----- Data i -----
>> Time: 2199-03-18 23:57:00
>> Type: sentence
>> Concepts: {'Hypertensive disease', 'Antihypertensive Agents'}
>> She does have a history of HTN and is compliant with her    antihypertensives.'
----- Data j -----
>> Time: 2199-03-18 23:57:00
>> Type: sentence
>> Concepts: {'Hypertensive disease'}
>> Father  [**Name (NI) 6008**]  HTN.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.9999999999999999
----- Data i -----
>> Time: 2199-03-18 23:57:00
>> Type: sentence
>> Concepts: {'Diabetes Mellitus, Non-Insulin-Dependent'}
>> NIDDM.'
----- Data j -----
>> Time: 2199-03-18 23:57:00
>> Type: sentence
>> Concepts: {'Diabetes Mellitus, Non-Insulin-Dependent'}
>> # Type 2 Diabetes: Hold po medications for now.'
**********************************
***** PAIR INDEX 2 *****
Cosine similarity: 0.6708203932499369
----- Data i -----
>

********** Processing data for 2164-11-23 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.5345224838248487
----- Data i -----
>> Time: 2164-11-23 11:19:00
>> Type: sentence
>> Concepts: {'Seizures', 'Lennox-Gastaut syndrome', 'Feverall', 'Morning After'}
>> Chief Complaint:  Fever, seizure    HPI:    49 year-old man with a history of presumed  [**Location (un) 6993**]  Gastaut Syndrome and    with a recent complicated medical history presents this morning for a    seizure in the setting of fever.'
----- Data j -----
>> Time: 2164-11-23 11:19:00
>> Type: sentence
>> Concepts: {'Lennox-Gastaut syndrome', 'Infantile Neuroaxonal Dystrophy', 'Seizures', 'Epilepsy'}
>> Assessment and Plan    49 year-old man with a history of presumed  [**Location (un) 6993**]  Gastaut Syndrome and    epilepsy presents with a seizure episode for greater than 10 minutes.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 1.0000000000000002
----- Data i -----
>> Time: 2164-

********** Processing data for 2102-09-27 **********
***** PAIR INDEX 0 *****
Cosine similarity: 1.0000000000000002
----- Data i -----
>> Time: 2102-09-27 16:25:00
>> Type: sentence
>> Concepts: {'Chronic multifocal osteomyelitis'}
>> CMO but currently full code'
----- Data j -----
>> Time: 2102-09-27 16:25:00
>> Type: sentence
>> Concepts: {'Chronic multifocal osteomyelitis'}
>>    General: It has been reported to the nursing staff that the patient is    to be made CMO.'
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 1.0000000000000002
----- Data i -----
>> Time: 2102-09-27 16:25:00
>> Type: sentence
>> Concepts: {'Infantile Neuroaxonal Dystrophy'}
>> Given this situation the above plan is certainly    subject to change.'
----- Data j -----
>> Time: 2102-09-27 16:25:00
>> Type: sentence
>> Concepts: {'Infantile Neuroaxonal Dystrophy'}
>>    Assessment And Plan: 76yo female, retired nun presents from  [**Hospital 5417**]     Hospital s/p fall down down  

********** Processing data for 2195-11-23 **********
***** PAIR INDEX 0 *****
Cosine similarity: 0.7453559924999298
----- Data i -----
>> Time: 2195-11-23 17:43:00
>> Type: sentence
>> Concepts: {'MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME'}
>> Continue home    meds.'
----- Data j -----
>> Time: 2195-11-23 17:43:00
>> Type: sentence
>> Concepts: {'Sugars', 'Hypoglycemia', 'hyperglycemic agent', 'MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME'}
>>  HYPOGLYCEMIA:  Unclear etiology, taking po meds, poor po intake (down    from baseline) albeit family notes he was taking juices with low    sugars.', "Was taking his oral hyperglycemic agents, doesn't recall    taking extra."
**********************************
***** PAIR INDEX 1 *****
Cosine similarity: 0.6324555320336758
----- Data i -----
>> Time: 2195-11-23 17:43:00
>> Type: sentence
>> Concepts: {'Chronic anemia', 'Anemia'}
>>  ANEMIA:  Chronic anemia, hct stable.'
----- Data j -----
>> Time: 2195-11-23 17:43:00
>> Type: sentence
>> Co



KeyboardInterrupt: 
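The `SettingWithCopyWarning` messages in the output above point at an assignment of the form `self.prescription_df[['START_DT']] = start_dt`. The `Patient` class is not shown in this section, so the following is only a sketch of the usual remedy, under the assumption that `prescription_df` is created by filtering a larger DataFrame: take an explicit `.copy()` of the filtered rows, then assign through `.loc`. The toy frame and its values are made up.

```python
import pandas as pd

# Toy stand-in for the full prescriptions table (values are made up).
drug_df = pd.DataFrame({
    "HADM_ID": [1, 1, 2],
    "DRUG": ["Vancomycin", "Ceftriaxone", "Coumadin"],
    "STARTDATE": ["2189-09-07 00:00:00", "2189-09-08 00:00:00", "2173-07-24 00:00:00"],
})

# An explicit copy makes the per-patient frame independent of drug_df,
# so later assignments are unambiguous and should not trigger the warning.
prescription_df = drug_df[drug_df["HADM_ID"] == 1].copy()

# Assign the new column through .loc, as the warning text suggests.
start_dt = pd.to_datetime(prescription_df["STARTDATE"]).dt.date
prescription_df.loc[:, "START_DT"] = start_dt

print(prescription_df[["DRUG", "START_DT"]])
```

Because of the copy, `drug_df` itself is left untouched; if the original table is meant to be updated as well, that has to be done as a separate, explicit step.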

### Getting History + Allergy Information - @Sharon, you can ignore everything below

In [116]:
# todo: 
# - DONE function to re-process all data from Patient instance -- pat.process_notes(); pat.process_by_date()
# - function to update Note -- should update dataframe of patient directly
#   - can go back to dataframe, but can't map tokenized sentence to original note in df -- todo
#   - function to update tokenized sentence
# - later: function to update original dataframe from patient dataframe

import re

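def get_section(regex_dict, txt):
    """ Given a dictionary of start and end regexes for a
        particular section, gets the start and end indices
        of that section in the text.
        Returns (None, None) if either marker is missing.
    """
    start_match = re.search(regex_dict["start"], txt)
    if start_match is None:
        return None, None

    # Look for the end marker only after the start marker, so an
    # earlier occurrence of the end pattern cannot give end < start.
    end_match = re.compile(regex_dict["end"]).search(txt, start_match.end())
    if end_match is None:
        return None, None

    return start_match.start(), end_match.start()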

note = pat.notes[4]

# Sections to store
# note: most of these sections have already been removed from the notes,
#       but if they haven't, we might have to remove them and then
#       reprocess everything
allergy_regex = {"start": "Allergies:",
                 "end":   "Last dose of Antibiotics:"}
history_regex = {"start": "Past medical history:",
                 "end":   "Other:"}

allergy_start, allergy_end = get_section(allergy_regex, note.txt)
history_start, history_end = get_section(history_regex, note.txt)

pt_allergies = "" if allergy_start is None else note.txt[allergy_start:allergy_end]
pt_histories = "" if history_start is None else note.txt[history_start:history_end]

print("******** Allergies ********")
print(pt_allergies[:100])
print("******** Histories ********")
print(pt_histories[:100])

******** Allergies ********
Allergies:
   Bactrim (Oral) (Sulfamethoxazole/Trimethoprim)
   Nausea/Vomiting
   Amiodarone
   Ras
******** Histories ********
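
Since `get_section` depends on `pat.notes`, it is hard to try in isolation. The sketch below applies the same start/end-marker idea to a made-up note; `extract_section`, its `flags` argument, and the note text are all illustrative assumptions. It searches for the end marker only after the start marker and matches case-insensitively, which may matter if header capitalization varies between notes.

```python
import re

def extract_section(start_pattern, end_pattern, txt, flags=re.IGNORECASE):
    """Return the text from start_pattern up to (not including) end_pattern,
    or "" if either marker is missing. Hypothetical helper for illustration."""
    start_match = re.search(start_pattern, txt, flags)
    if start_match is None:
        return ""
    # Search for the end marker only in the text after the start marker.
    end_match = re.compile(end_pattern, flags).search(txt, start_match.end())
    if end_match is None:
        return ""
    return txt[start_match.start():end_match.start()]

# Made-up note text, loosely shaped like the headers used above.
toy_note = (
    "Chief Complaint: fever\n"
    "Allergies:\n   Penicillins\n   Rash\n"
    "Last dose of Antibiotics:\n   none\n"
    "Past Medical History:\n   HTN, DM\n"
    "Other:\n   n/a\n"
)

allergies = extract_section(r"Allergies:", r"Last dose of Antibiotics:", toy_note)
history = extract_section(r"Past medical history:", r"Other:", toy_note)
print(allergies)
print(history)
```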

