# Learning to Extract Pain Outcomes from Clinical Text without Labeled Data
## I: Preprocessing Clinical Notes

This pipeline extracts present positive indications of pain and its anatomical location from patient clinical notes. To train our machine reading models, we use [*weak supervision*](https://hazyresearch.github.io/snorkel/blog/weak_supervision.html), a technique that enables training deep learning models using large collections of unlabeled clinical documents. All weak supervision models are trained using [Snorkel](https://hazyresearch.github.io/snorkel/).

This demo uses notes from [MIMIC-III](https://mimic.physionet.org/), a collection of electronic health record (EHR) data for ~40,000 critical care patients. For validating extraction performance, we manually annotated a subset of MIMIC clinical notes released as part of the [ShARe/CLEF eHealth Evaluation Lab 2014](https://link.springer.com/chapter/10.1007/978-3-319-11382-1_17).

<img align="center" src="pain-anatomy-relations.jpg" width="650px" style="border:1px solid black; margin-bottom:15px">
**Figure 1** Example pain/anatomy mention types. This demo shows a binary classifier for discriminating present positive mentions from negated, hypothetical, and historical mentions.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import os
import re
import sys
import bz2
import glob
import codecs
import random
import shutil
import numpy as np

from snorkel import SnorkelSession

from rwe.extractlib.utils import *
from rwe.extractlib.corpora import ClefCorpus

session = SnorkelSession()

## 1. Parse Document Collection
All clinical documents live in the `training/` subdirectory of the corpus root (`../data/corpora/clef/2014ShAReCLEFeHealthTasks2/` in the code below).
We preprocess these documents using off-the-shelf natural language processing (NLP) tools to generate a collection of tokenized sentences.

In [None]:
from snorkel.models import Document, Sentence, Candidate
from snorkel.parser import Spacy, RegexTokenizer, RuleBasedParser
from snorkel.parser import CorpusParser, CorpusParserUDF, StanfordCoreNLPServer

In [None]:
from snorkel.parser import TextDocPreprocessor

doc_root = "../data/corpora/clef/2014ShAReCLEFeHealthTasks2/"
corpus   = ClefCorpus(doc_root + "training/")

filelist = "{}/*.txt".format(corpus.cachedir)

doc_preprocessor = TextDocPreprocessor(path=filelist, encoding=corpus.encoding)

### Selecting a Parser
You can use [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) or [spaCy](https://spacy.io/) for preprocessing. On some datasets, CoreNLP detects sentence boundaries in clinical text more robustly, though it is significantly slower than spaCy. You may also combine regular expressions with spaCy for finer control over sentence boundary detection and tokenization.
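
To give a rough sense of what regex-based sentence boundary detection involves, here is a minimal sketch using only the standard library. The boundary pattern and the function name `split_sentences` are illustrative assumptions; they are not the rules used by `RegexTokenizer` or `RuleBasedParser`.

```python
import re

# Hypothetical boundary rule: split on whitespace that follows ., ?, or !
# and precedes an uppercase letter, or on a run of two or more newlines.
BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z])|\n{2,}")

def split_sentences(text):
    """Return non-empty, whitespace-trimmed sentence strings."""
    return [s.strip() for s in BOUNDARY.split(text) if s and s.strip()]

note = "Pt reports chest pain. Denies abdominal pain.\n\nPlan: follow up in 2 wks."
sentences = split_sentences(note)
```

A rule this simple would wrongly split after abbreviations such as "Dr. Smith", which is one reason clinical text often needs a more careful parser.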

### MIMIC-II/CLEF 2014 Corpus Summary
- Documents: 299
- Sentences: 16164

### Parse Time Performance Comparisons

| Parser                  | Wall Time (secs) |
|-------------------------|------------------|
| spaCy                   | 17.5             |
| CoreNLP                 | 64.0             |
| RuleBased (spaCy+regex) | 12.3             |
| RuleBased (regex+regex) | 11.3             |
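
To put the wall times in perspective, the short calculation below converts them into slowdown relative to spaCy and into approximate sentences per second. The numbers are copied from the table and corpus summary above; the variable names are illustrative only.

```python
# Wall times (seconds) copied from the table above.
wall_time = {
    "spaCy": 17.5,
    "CoreNLP": 64.0,
    "RuleBased (spaCy+regex)": 12.3,
    "RuleBased (regex+regex)": 11.3,
}
n_sentences = 16164  # from the corpus summary

baseline = wall_time["spaCy"]
# Ratio > 1 means slower than spaCy, < 1 means faster.
relative = {name: round(t / baseline, 2) for name, t in wall_time.items()}
# Approximate throughput in whole sentences per second.
throughput = {name: int(n_sentences / t) for name, t in wall_time.items()}
```

On this corpus CoreNLP takes roughly 3.7 times as long as spaCy, while the rule-based configurations are somewhat faster than spaCy alone.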


In [None]:
# custom CoreNLP configuration
annotator_opts = {}
annotator_opts['ssplit']   = {"newlineIsSentenceBreak": "two"}
annotator_opts['tokenize'] = {"invertible": True,
                              "normalizeFractions": False,
                              "normalizeParentheses": False,
                              "normalizeOtherBrackets": False,
                              "normalizeCurrency": False,
                              "asciiQuotes": False,
                              "latexQuotes": False,
                              "ptb3Ellipsis": False,
                              "ptb3Dashes": False,
                              "escapeForwardSlashAsterisk": False,
                              "strictTreebank3": True}

In [None]:
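# Choose the parsing backend: "corenlp", "rgx" (regex only), or anything else for spaCy
PARSER = "corenlp"

if PARSER == "corenlp":
    # CoreNLP server configured with the custom annotator options defined above
    parser = StanfordCoreNLPServer(annotator_opts=annotator_opts,
                                   verbose=False, version='3.6.0',
                                   split_newline=False,
                                   num_threads=4)
elif PARSER == "rgx":
    parser = RuleBasedParser(tokenizer=RegexTokenizer())
else:
    parser = Spacy(lang="en")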

corpus_parser = CorpusParser(parser)
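
The `annotator_opts` dictionary above groups options by CoreNLP annotator. As a hedged illustration of how such a nested dictionary could map onto CoreNLP-style property names, the sketch below packs tokenizer options into a single `tokenize.options` string and flattens everything else into `annotator.option` keys. The helper `to_corenlp_properties` is hypothetical, and whether `StanfordCoreNLPServer` builds its configuration this way is an assumption.

```python
def to_corenlp_properties(annotator_opts):
    """Flatten {annotator: {option: value}} into CoreNLP-style properties.

    Assumption: tokenizer options are packed into one `tokenize.options`
    string; other annotators use `annotator.option` keys.
    """
    def fmt(value):
        # Render Python booleans as lowercase "true"/"false" strings.
        return str(value).lower() if isinstance(value, bool) else str(value)

    props = {}
    for annotator, opts in sorted(annotator_opts.items()):
        if annotator == "tokenize":
            packed = ",".join("{}={}".format(k, fmt(v))
                              for k, v in sorted(opts.items()))
            props["tokenize.options"] = packed
        else:
            for key, value in sorted(opts.items()):
                props["{}.{}".format(annotator, key)] = fmt(value)
    return props

# Small example mirroring the structure of the configuration cell.
example_opts = {
    "ssplit": {"newlineIsSentenceBreak": "two"},
    "tokenize": {"invertible": True, "ptb3Dashes": False},
}
props = to_corenlp_properties(example_opts)
```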

In [None]:
# sqlite3 database instances do not support parallelism > 1 
%time corpus_parser.apply(doc_preprocessor, parallelism=1)

In [None]:
# str.format keeps the output identical under Python 2 and Python 3
print("Documents: {}".format(session.query(Document).count()))
print("Sentences: {}".format(session.query(Sentence).count()))

## 2. Candidate Extraction

### Create Matchers

Candidates are *possible* positive instances of a pain outcome. They are defined as all `(pain, anatomy)` entity pairs in a given sentence, i.e., the Cartesian product of the sentence's pain mentions and its anatomy mentions. Generating these pairs requires creating two `Matcher` objects, which use dictionary string matching and (optionally) regular expressions to identify entities.
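
The Cartesian-product idea can be sketched without Snorkel. The toy lexicons and helper names below are assumptions made for illustration; the real matchers work on n-gram spans and much larger dictionaries.

```python
from itertools import product

# Toy dictionaries standing in for the curated pain and anatomy lexicons.
PAIN_TERMS = {"pain", "ache", "tenderness"}
ANATOMY_TERMS = {"chest", "abdomen", "knee"}

def find_mentions(tokens, lexicon):
    """Return (token_index, token) pairs whose lowercased form is in the lexicon."""
    return [(i, t) for i, t in enumerate(tokens) if t.lower() in lexicon]

def candidate_pairs(tokens):
    """Cartesian product of pain mentions and anatomy mentions in one sentence."""
    pain = find_mentions(tokens, PAIN_TERMS)
    anatomy = find_mentions(tokens, ANATOMY_TERMS)
    return list(product(pain, anatomy))

tokens = "Patient denies chest pain but reports knee tenderness".split()
pairs = candidate_pairs(tokens)
```

Two pain mentions and two anatomy mentions yield four candidates, but only "knee tenderness" is a present positive mention here ("chest pain" is negated, and the cross pairs are spurious). Separating these cases is the job of the classifier trained later in the pipeline.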

In [None]:
from snorkel.matchers import *
from rwe.extractlib.custom_matchers import *

dict_root       = "../data/supervision/dicts/"
dict_anatomy    = load_dict("{}anatomy/fma_human_anatomy.bz2".format(dict_root))
anatomy_matcher = AnatomicalSiteMatcher(dict_anatomy)

dict_pain       = load_dict("{}nociception/nociception.curated.txt".format(dict_root))
pain_matcher    = PainMatcher(dict_pain)

### Define Candidate Schema
This defines a relation type that Snorkel uses behind the scenes to generate candidates.

In [None]:
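from snorkel.models import candidate_subclass

# candidate_subclass can fail if PainLocation was already defined
# (e.g., when this cell is re-run); in that case keep the existing class.
try:
    PainLocation = candidate_subclass('PainLocation', ['pain','anatomy'])
except Exception:
    pass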

### Extract Candidate Pain/Anatomy Relations

The `Ngrams` object defines the maximum token length for matching entities. 
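
As a rough illustration of what a maximum n-gram length means, the sketch below enumerates every contiguous token span of up to `n_max` tokens. The function `ngram_spans` is hypothetical and does not model the `split_tokens` behavior of Snorkel's `Ngrams` class.

```python
def ngram_spans(tokens, n_max=5):
    """Yield (start, end, text) for every contiguous span of 1..n_max tokens."""
    for start in range(len(tokens)):
        # The span end is exclusive and capped by both n_max and sentence length.
        for end in range(start + 1, min(start + n_max, len(tokens)) + 1):
            yield start, end, " ".join(tokens[start:end])

tokens = ["left", "lower", "quadrant", "pain"]
spans = list(ngram_spans(tokens, n_max=2))
texts = [text for _, _, text in spans]
```

A larger `n_max` lets the matchers find longer multi-word entities at the cost of more spans to check per sentence.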

Using CoreNLP for parsing, you should get:

```
Development Set Candidates: 288
Test Set Candidates:        168
```

In [None]:
from snorkel.candidates import Ngrams, CandidateExtractor

ngrams = Ngrams(n_max=5, split_tokens=["-","/",":",";",","])

cand_extractor = CandidateExtractor(PainLocation, 
                                    [ngrams, ngrams], [pain_matcher, anatomy_matcher],
                                    symmetric_relations=True, nested_relations=False, 
                                    self_relations=False)

In [None]:
from snorkel.models import Document
from rwe.extractlib.utils import split_training_test_dev

def filter_by_sentence_length(docs, max_length=85):
    '''
    Skip overly long sentences, which can explode the number of
    candidate relations. This pathological case happens often with
    incorrectly parsed clinical text.

    :param docs: iterable of Document objects
    :param max_length: maximum number of tokens allowed in a sentence
    :return: generator of (sentence, document stable_id) tuples
    '''
    for doc in docs:
        for s in doc.sentences:
            if len(s.words) <= max_length:
                yield (s, doc.stable_id)
                
docs = session.query(Document).all()
sentences = {fold:set() for fold in set(corpus.fold_idx.values())}

for s,doc_id in filter_by_sentence_length(docs):
    fold = corpus.fold_idx[doc_id]
    sentences[fold].add(s)

In [None]:
for i,fold in enumerate(sentences):
    # %time is the line magic; the cell magic %%time is not valid inside a loop
    %time cand_extractor.apply(sentences[fold], split=i, parallelism=1)

In [None]:
train_cands = session.query(PainLocation).filter(PainLocation.split == 0).all()
print("Training set candidates:    {}".format(len(train_cands)))

dev_cands = session.query(PainLocation).filter(PainLocation.split == 1).all()
print("Development set candidates: {}".format(len(dev_cands)))

test_cands = session.query(PainLocation).filter(PainLocation.split == 2).all()
print("Test set candidates:        {}".format(len(test_cands)))

print("Total candidates:           {}".format(len(train_cands) + len(dev_cands) + len(test_cands)))

## 3. Loading Gold Labels
Weakly supervised models require a small amount of hand-labeled data to tune the hyperparameters of the model and to validate end model performance.

Gold labels can be generated using the [BRAT](http://brat.nlplab.org/) annotation tool or Snorkel's internal candidate annotator (see supplemental notebooks).

We've provided annotations for 452 pain/anatomy candidate mentions using MIMIC documents released as part of the [ShARe/CLEF eHealth Evaluation Lab 2014](https://link.springer.com/chapter/10.1007/978-3-319-11382-1_17).
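
To make the validation role of gold labels concrete, here is a minimal sketch of scoring binary predictions against gold annotations. It assumes labels are encoded as `1` (present positive) and `-1` (everything else); the function name and the toy label lists are illustrative and not part of the pipeline.

```python
def precision_recall_f1(predicted, gold):
    """Score binary predictions (1 = present positive, -1 = other) against gold."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g != 1)
    fn = sum(1 for p, g in zip(predicted, gold) if p != 1 and g == 1)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for five candidates.
gold      = [1, 1, -1, -1, 1]
predicted = [1, -1, -1, 1, 1]
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

The same kind of score computed on the development split can guide hyperparameter choices, with the test split held back for the final evaluation.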


In [None]:
import pandas as pd
from snorkel.models import StableLabel
from snorkel.annotations import load_gold_labels
from snorkel.db_helpers import reload_annotator_labels

gold_fpath = "../data/annotations/clef.gold.2017.5.22.tsv"

def load_external_labels(session, df, annotator_name):
    """
    Load a pandas DataFrame of label annotations. Rows are imported as StableLabel
    objects, which are used to create labeled gold candidates.

    :param session:
    :param df:  pandas data frame containing labels
    :param annotator_name:
    :return:
    """
    for index, row in df.iterrows():
        # We check if the label already exists, in case this cell was already executed
        context_stable_ids = row['context_stable_ids'] if 'context_stable_ids' in row else "~~".join([row['pain'], row['anatomy']])
        query = session.query(StableLabel).filter(StableLabel.context_stable_ids == context_stable_ids)
        query = query.filter(StableLabel.annotator_name == annotator_name)
        if query.count() == 0:
            session.add(StableLabel(context_stable_ids=context_stable_ids, annotator_name=annotator_name, value=row['label']))
    session.commit()

    
gold_labels = pd.read_csv(gold_fpath, sep="\t")
load_external_labels(session, gold_labels, annotator_name="gold")

# Attach the gold labels to the candidates in each split (0, 1, 2)
for split in (0, 1, 2):
    reload_annotator_labels(session, PainLocation, "gold", split=split,
                            filter_label_split=False, create_missing_cands=False)
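
The loading logic above is idempotent: each label is keyed by its stable IDs plus the annotator name, so re-running the cell adds nothing new. The sketch below reproduces that pattern with a plain dictionary instead of a database session. The TSV content, the stable ID strings, and the `load_labels` helper are all made up for illustration.

```python
import csv
import io

# Hypothetical two-row TSV; the stable ID strings are invented examples.
TSV = (u"pain\tanatomy\tlabel\n"
       u"doc1::span:10:13\tdoc1::span:4:8\t1\n"
       u"doc1::span:40:43\tdoc1::span:30:36\t-1\n")

def load_labels(store, tsv_text, annotator_name):
    """Add one entry per (context_stable_ids, annotator) key, skipping existing keys."""
    added = 0
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Same fallback as above: join the two span IDs with "~~" when
        # there is no combined ID column.
        ids = row.get("context_stable_ids") or "~~".join([row["pain"], row["anatomy"]])
        key = (ids, annotator_name)
        if key not in store:
            store[key] = int(row["label"])
            added += 1
    return added

store = {}
first = load_labels(store, TSV, "gold")
second = load_labels(store, TSV, "gold")  # re-running adds nothing
```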