# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/).  At its core, the task is to build a model that classifies _candidate_ CDR mentions as either true or false.

## Part II: Candidate Extraction

In [1]:
%load_ext autoreload
%autoreload 2
import os

# Note: We run automated tests on this tutorial to make sure that it is always up to date! 
# However, certain interactive components cannot currently be tested automatically, and will 
# be skipped with if-then statements using the variable below
AUTOMATED_TESTING = os.environ.get('TESTING') is not None

from snorkel import SnorkelSession
session = SnorkelSession()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Loading the Corpus

First, we will load the corpus that we preprocessed in Part I:

In [2]:
from snorkel.models import Corpus
corpus = session.query(Corpus).filter(Corpus.name == 'CDR_corpus').one()
corpus

Corpus (CDR_corpus)

## Defining the schema
We now define the schema of the relation mention we want to extract (which is also the schema of the candidates).  This must be a subclass of `Candidate`; we can manually define this class, or use a helper function (similar in spirit to the `collections.namedtuple` function).

Here we'll define a _chemical disease relation mention candidate class_ which is composed of two named contexts, corresponding to a _chemical mention_ and a _disease mention_.  Note that this function will create the table if it does not exist:

In [None]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', 'chemical_disease', ['chemical', 'disease'])
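To make the `collections.namedtuple` analogy concrete, here is a minimal plain-Python sketch of a record with two named slots. The `ChemicalDiseaseTuple` name and the example strings are made up for illustration; unlike `candidate_subclass`, nothing here is backed by a database table.

```python
from collections import namedtuple

# Hypothetical stand-in: a lightweight record with two named slots,
# mirroring the ['chemical', 'disease'] argument names used above.
ChemicalDiseaseTuple = namedtuple('ChemicalDiseaseTuple', ['chemical', 'disease'])

# Illustrative strings, not taken from the corpus
pair = ChemicalDiseaseTuple(chemical='lithium', disease='tremor')

print(pair.chemical)
print(pair.disease)
```

Each slot is reachable by name, which is the same convenience the candidate class gives us for its `chemical` and `disease` contexts.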

## Writing a basic candidate extractor

Next, we'll write a basic function to extract **candidate relation mentions** from the corpus.  For this first attempt, we'll just write a function that checks for matches against several dictionaries at the _entity mention level_--i.e. looking for candidate chemical and disease mentions--and then considers any co-occurring pairs in the same sentence as candidate relation mentions.

We'll use some precomputed disease and chemical dictionaries (see `tutorial/data/dicts/compile_dictionaries.py` for details):

In [3]:
# Load the dictionaries (one term per line)
ROOT = '%s/tutorial/data/dicts/' % os.environ['SNORKELHOME']

def load_terms(fname):
    # Close the file promptly and skip blank lines (e.g. after a trailing newline)
    with open(ROOT + fname, 'rb') as f:
        return [t for t in f.read().split('\n') if t]

disease_phrases   = load_terms('disease_phrases.txt')
disease_acronyms  = load_terms('disease_acronyms.txt')
chemical_phrases  = load_terms('chemical_phrases.txt')
chemical_acronyms = load_terms('chemical_acronyms.txt')

In [None]:
from snorkel.candidates import Ngrams
from snorkel.matchers import DictionaryMatch, Union

# Define a candidate space
ngrams = Ngrams(n_max=3)

# Define a matcher for diseases
disease_matcher = Union(
    DictionaryMatch(d=disease_phrases, ignore_case=True),
    DictionaryMatch(d=disease_acronyms, ignore_case=False))

# Define a matcher for chemicals
chem_matcher = Union(
    DictionaryMatch(d=chemical_phrases, ignore_case=True),
    DictionaryMatch(d=chemical_acronyms, ignore_case=False))

Note that passing `longest_match_only=False` to a `DictionaryMatch` (as the `disease_matcher_simple` defined later in this notebook does) means that we _will_ consider subsequences of phrases that match our dictionary.
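The effect of `longest_match_only` can be illustrated with a toy matcher in plain Python. This is only a sketch of the idea, assuming that "longest match only" means dropping any match contained in a longer match; it is not Snorkel's implementation, and `word_ngrams`, `dictionary_matches` and the example dictionary are hypothetical.

```python
def word_ngrams(words, n_max=3):
    """Yield (start, end) word offsets for every n-gram of up to n_max words."""
    for start in range(len(words)):
        for end in range(start + 1, min(start + n_max, len(words)) + 1):
            yield (start, end)

def dictionary_matches(words, dictionary, longest_match_only):
    """Toy dictionary matcher over word n-grams (illustration only)."""
    spans = [(s, e) for (s, e) in word_ngrams(words)
             if ' '.join(words[s:e]).lower() in dictionary]
    if longest_match_only:
        # Drop any span that is contained in a different matching span
        spans = [(s, e) for (s, e) in spans
                 if not any(s2 <= s and e <= e2 and (s2, e2) != (s, e)
                            for (s2, e2) in spans)]
    return [' '.join(words[s:e]) for (s, e) in spans]

words = 'acute renal failure was observed'.split()
dictionary = {'renal failure', 'acute renal failure', 'failure'}

all_matches = dictionary_matches(words, dictionary, longest_match_only=False)
longest_only = dictionary_matches(words, dictionary, longest_match_only=True)
print(all_matches)   # ['acute renal failure', 'renal failure', 'failure']
print(longest_only)  # ['acute renal failure']
```

Keeping the shorter matches enlarges the candidate set, which tends to help recall at the cost of more candidates to classify later.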

The `Ngrams` operator is applied over our `Sentence` objects and returns `Ngram` objects, which the matchers then filter.  We apply these operators over the sentences in the corpus, storing the results in a `Candidates` object for convenience:

In [None]:
from snorkel.candidates import CandidateExtractor
from snorkel.models import Sentence, Document

ce = CandidateExtractor(ChemicalDisease, [ngrams, ngrams], [chem_matcher, disease_matcher])
%time c = ce.extract([sent for doc in corpus for sent in doc.sentences], 'all', session)
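The pairing step described above (every chemical mention paired with every disease mention found in the same sentence) can be sketched in plain Python. This is an illustration of the idea only, not how `CandidateExtractor` is implemented; `find_mentions`, `pair_candidates` and the toy dictionaries are hypothetical.

```python
from itertools import product

def find_mentions(words, dictionary, n_max=3):
    """Return (start, end) word offsets of n-grams found in the dictionary."""
    mentions = []
    for start in range(len(words)):
        for end in range(start + 1, min(start + n_max, len(words)) + 1):
            if ' '.join(words[start:end]).lower() in dictionary:
                mentions.append((start, end))
    return mentions

def pair_candidates(sentence, chemicals, diseases):
    """Pair every chemical mention with every disease mention in one sentence."""
    words = sentence.split()
    chem_spans = find_mentions(words, chemicals)
    dis_spans = find_mentions(words, diseases)
    return [(' '.join(words[c0:c1]), ' '.join(words[d0:d1]))
            for (c0, c1), (d0, d1) in product(chem_spans, dis_spans)]

# Toy dictionaries and sentence, made up for illustration
toy_chemicals = {'lithium', 'haloperidol'}
toy_diseases = {'tremor', 'renal failure'}
sentence = 'Lithium and haloperidol were associated with tremor and renal failure'

toy_pairs = pair_candidates(sentence, toy_chemicals, toy_diseases)
for pair in toy_pairs:
    print(pair)
```

Two chemical mentions and two disease mentions yield four candidate pairs, so the candidate set grows multiplicatively with the number of mentions per sentence.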

## Testing on unary candidates

In [4]:
from snorkel.models import candidate_subclass
from snorkel.candidates import Ngrams
from snorkel.matchers import DictionaryMatch, Union

# Define a candidate space
ngrams = Ngrams(n_max=3)

Disease = candidate_subclass('Disease', ['disease'])

disease_matcher_simple = DictionaryMatch(d=disease_phrases, ignore_case=True, longest_match_only=False)

In [5]:
from snorkel.candidates import CandidateExtractor
from snorkel.models import Sentence, Document

ce = CandidateExtractor(Disease, [ngrams], [disease_matcher_simple])
%time c_small = ce.extract([sent for doc in corpus for sent in doc.sentences][:100], 'diseases_small', session)

CPU times: user 2.05 s, sys: 51.8 ms, total: 2.1 s
Wall time: 2.1 s


In [6]:
c_small

Candidate Set (diseases_small)

## Testing `CandidateAnnotator`

In [None]:
from snorkel.models import CandidateSet, candidate_subclass

Disease = candidate_subclass('Disease', ['disease'])

c_small = session.query(CandidateSet).filter(CandidateSet.name == 'diseases_small').first()
c_small

In [7]:
import re
from random import random
from snorkel.annotations import LabelFunctionAnnotator

lfs = LabelFunctionAnnotator()

In [8]:
def LF_random_labeler(c):
    return 1 if random() < 0.1 else 0

def LF_has_caps(c):
    return -1 if re.search(r'[A-Z]', c.disease.get_span()) is not None else 0

def LF_ends_with_itis(c):
    return 1 if re.search(r'itis$', c.disease.get_span()) is not None else 0

LFs = [LF_random_labeler, LF_has_caps, LF_ends_with_itis]
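As a rough picture of what applying a list of labeling functions produces, the sketch below runs two of the functions above over plain strings and collects one row of labels per candidate. It is a toy stand-in, assuming nothing about how `LabelFunctionAnnotator` stores its results; the `toy_*` names and example strings are hypothetical.

```python
import re

# Toy labeling functions over plain strings (stand-ins for candidate objects)
def lf_has_caps(span):
    return -1 if re.search(r'[A-Z]', span) is not None else 0

def lf_ends_with_itis(span):
    return 1 if re.search(r'itis$', span) is not None else 0

toy_lfs = [lf_has_caps, lf_ends_with_itis]
toy_spans = ['hepatitis', 'Patients', 'nephritis', 'tremor']

# One row per candidate, one column per labeling function
label_matrix = [[lf(span) for lf in toy_lfs] for span in toy_spans]

for span, row in zip(toy_spans, label_matrix):
    print('%s %s' % (span, row))
```

A `0` means the labeling function abstains, so most rows are sparse.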

In [14]:
lfs.create(c_small, LFs, session, 'LF1')

Creating new key set


In [13]:
session.rollback()

## Profiling...

In [None]:
%time c3 = ce2.extract([sent for doc in corpus for sent in doc.sentences][:500], 'disease_mentions_500', session)

In [None]:
%time c4 = ce2.extract([sent for doc in corpus for sent in doc.sentences], 'disease_mentions_full', session)

In [None]:
len([sent for doc in corpus for sent in doc.sentences])

In [None]:
session.rollback()

In [None]:
ce2 = CandidateExtractor(Disease, [ngrams], [disease_matcher])
%time c = ce2.extract([sent for doc in corpus for sent in doc.sentences], 'disease_mentions', session)

### Saving the extracted candidates

In [None]:
session.add(c)
session.commit()

### Reloading the candidates

In [None]:
session.rollback()

In [None]:
from snorkel.models import CandidateSet
c = session.query(CandidateSet).filter(CandidateSet.name == 'first-100').one()
c

## Evaluating our candidate recall on gold annotations

Next, we'll test our _candidate recall_--in other words, how many of the true disease mentions we picked up in our candidate set--using the gold annotations in our dataset.

The XML documents that we loaded using the `XMLDocParser` also contained annotations (this is why we kept the full XML tree using `keep_xml_tree=True`).  We'll load these annotations and map them to `Ngram` objects over our parsed sentences, so that we can easily compare our extracted candidate set with the gold annotations.  The code is fairly simple (see `tutorial/util.py`); note that we filter to keep only _disease_ annotations, and that the candidates should be uniquely identified by their `id` attribute:

In [None]:
from utils import collect_pubtator_annotations
gold = []
for doc, sents in corpus:
    gold += [a for a in collect_pubtator_annotations(doc, sents) if a.meta['type'] == 'Disease']
gold = frozenset(gold)

In [None]:
list(gold)[:5]

Now, we have a set of gold annotations of the same type as our candidates (`Ngram`), and can use set operations (where candidate objects are hashed by their `id` attribute), e.g.:

In [None]:
len(gold.intersection(c.candidates))
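To show why hashing by `id` makes these set operations work, here is a small self-contained sketch. `ToySpan` is a hypothetical stand-in for the real span objects, and the ids and strings are invented for illustration.

```python
class ToySpan(object):
    """Hypothetical stand-in for a span that is hashed and compared by `id`."""
    def __init__(self, id, text):
        self.id = id
        self.text = text

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        return isinstance(other, ToySpan) and self.id == other.id

gold_toy = frozenset([ToySpan(1, 'tremor'), ToySpan(2, 'renal failure'),
                      ToySpan(3, 'hepatitis'), ToySpan(4, 'AKI')])
# Separate objects with matching ids still compare equal
extracted_toy = frozenset([ToySpan(1, 'tremor'), ToySpan(3, 'hepatitis'),
                           ToySpan(9, 'failure')])

true_positives = gold_toy.intersection(extracted_toy)
recall = float(len(true_positives)) / len(gold_toy)
precision = float(len(true_positives)) / len(extracted_toy)

print('recall = %.2f, precision = %.2f' % (recall, precision))
```

Recall is the share of gold spans that appear in the extracted set, and precision is the share of extracted spans that are gold.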

For convenience, we'll use a basic helper function, `gold_stats`, from `snorkel.candidates`:

In [None]:
from snorkel.candidates import gold_stats
gold_stats(c, gold)

We note that our focus in this stage is on **achieving high candidate recall, without considering an impractically large candidate set**.  Our main focus after this stage will be on training a classifier to select which candidates are true; this will raise precision while hopefully keeping recall high.  _Note however that candidate recall is an upper bound for the recall of this classifier!_

So, we have some work to do.

## Using the `Viewer` to inspect data

Next, we'll use the `Viewer` class--here, specifically, the `SentenceNgramViewer`--to inspect the data.

To start, we'll assemble a random set of all the sentences where there are gold annotations _not in our candidate set_, i.e. where we missed something, and then inspect these in the `Viewer`:

In [None]:
from collections import defaultdict
from random import shuffle
from snorkel.models import Span

# Index the gold annotations by sentence id
gold_by_sid = defaultdict(list)
for g in gold:
    gold_by_sid[g.context.id].append(g)

# Get sentences with fewer extracted spans than gold annotations
view_sents = [s for s in corpus.get_sentences()
              if session.query(Span).filter(Span.context == s).count()
                  < len(gold_by_sid[s.id])]
shuffle(view_sents)
view_sents = view_sents[:50]
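The same "which sentences did we miss something in?" question can be sketched without a database. Note that this toy version uses a set difference per sentence rather than the count comparison in the cell above, so it is a related illustration and not a reimplementation; all names and data are hypothetical.

```python
from collections import defaultdict

# (sentence_id, span_text) pairs; all values are made up for illustration
toy_gold = [(0, 'tremor'), (0, 'AKI'), (1, 'hepatitis'), (2, 'renal failure')]
toy_extracted = [(0, 'tremor'), (1, 'hepatitis')]

def index_by_sentence(pairs):
    """Group span texts by their sentence id."""
    index = defaultdict(set)
    for sid, text in pairs:
        index[sid].add(text)
    return index

gold_idx = index_by_sentence(toy_gold)
extracted_idx = index_by_sentence(toy_extracted)

# Keep sentences where at least one gold span was not extracted
missed = {sid: sorted(spans - extracted_idx[sid])
          for sid, spans in gold_idx.items()
          if spans - extracted_idx[sid]}

print(missed)
```

Comparing sets instead of counts also catches sentences where the number of extracted spans matches the number of gold spans but the spans themselves differ.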

Now, we instantiate and render the `Viewer` object; note we're being a bit sloppy, passing in _all_ the candidates and gold labels, but the `Viewer` object will take care of indexing them by sentence, and will only render the sentences we pass in:

In [None]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
if not AUTOMATED_TESTING:
    sv = SentenceNgramViewer(c.candidates[:300], session)
else:
    sv = None

And, now we render the `Viewer`:

In [None]:
sv

In [None]:
sv._labels_serialized

In [None]:
sv.get_labels()

In [None]:
sv.cids

Note that we can **navigate using the provided buttons**, or **using the keyboard (hover over buttons to see controls)**, highlight candidates (even if they overlap), and also **apply binary labels** (more on where to use this later!).  In particular, note that **the Viewer is synced dynamically with the notebook**, so that we can for example get the `id` of the candidate that is currently selected, this candidate object itself, as well as any labels we've applied.  Try it out!

In [None]:
if not AUTOMATED_TESTING:
    print sv.get_selected()
    print sv.get_labels()

## Composing a better candidate extractor

Now, let's try to increase our candidate recall using more of the `Matcher` operators and their functionalities.  First, let's turn on **Porter stemming** in our dictionary matcher; Porter stemming is an aggressive rules-based method for normalizing word endings.

In [None]:
# Define a new matcher
matcher = DictionaryMatch(d=diseases, longest_match_only=False, stemmer='porter')

# Extract a new set of candidates
ce = CandidateExtractor(ngrams, matcher)
%time c = ce.extract(corpus.get_sentences())
#gold_stats(c, gold)
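To see why normalizing word endings can raise recall, consider the toy sketch below. It deliberately uses a crude suffix stripper as a stand-in and is NOT the Porter algorithm; `crude_stem`, `normalize` and the toy dictionary are hypothetical names introduced only for this illustration.

```python
def crude_stem(word):
    """Strip a plural ending (a crude stand-in, NOT the Porter algorithm)."""
    word = word.lower()
    if word.endswith('ies') and len(word) > 5:
        return word[:-3] + 'y'
    if word.endswith('s') and len(word) > 3:
        return word[:-1]
    return word

def normalize(phrase):
    """Apply the crude stemmer to every word of a phrase."""
    return ' '.join(crude_stem(w) for w in phrase.split())

toy_dictionary = {'seizure', 'cardiomyopathy'}
# Stem the dictionary with the same function used for the mentions
stemmed_dictionary = set(normalize(p) for p in toy_dictionary)

results = {}
for mention in ['Seizures', 'cardiomyopathies', 'tremor']:
    exact = mention.lower() in toy_dictionary
    stemmed = normalize(mention) in stemmed_dictionary
    results[mention] = (exact, stemmed)
    print('%s exact=%s stemmed=%s' % (mention, exact, stemmed))
```

The plural mentions are missed by exact lookup but found once both sides are normalized. A real stemmer is more aggressive, so it can also introduce false matches.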

Next, note that *`Matcher` objects are compositional*. Observing in the `Viewer` that we are missing all of the acronyms, let's start with the `Union` operator, to integrate a dictionary for this:

In [None]:
from snorkel.matchers import Union
from load_dictionaries import load_acronym_dictionary

# Load the disease acronym dictionary
acronyms = load_acronym_dictionary()
print "Loaded %s acronyms!" % len(acronyms)

# Define a new matcher
matcher = Union(
    DictionaryMatch(d=diseases, longest_match_only=False, stemmer='porter'),
    DictionaryMatch(d=acronyms, ignore_case=False))

# Extract a new set of candidates
ce = CandidateExtractor(ngrams, matcher)
%time c = ce.extract(corpus.get_sentences())
gold_stats(c, gold)
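The compositional idea behind `Union` can be sketched with plain Python predicates: each matcher is a function from a span to a boolean, and a union accepts a span if any child does. This is an illustration under that assumption, not Snorkel's implementation; `dictionary_match`, `union` and the toy terms are hypothetical.

```python
def dictionary_match(dictionary, ignore_case=True):
    """Build a predicate that tests whether a span is in the dictionary."""
    if ignore_case:
        lowered = set(term.lower() for term in dictionary)
        return lambda span: span.lower() in lowered
    exact = set(dictionary)
    return lambda span: span in exact

def union(*matchers):
    """Build a predicate that accepts a span if any child matcher accepts it."""
    return lambda span: any(m(span) for m in matchers)

toy_matcher = union(
    dictionary_match(['renal failure', 'tremor'], ignore_case=True),
    dictionary_match(['AKI', 'MS'], ignore_case=False))

for span in ['Renal Failure', 'AKI', 'aki', 'ms']:
    print('%s %s' % (span, toy_matcher(span)))
```

Matching acronyms case-sensitively keeps common lowercase words such as "ms" out of the candidate set, while phrases are still matched regardless of case.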

Next, we try using the `Concat` and `RegexMatchEach` operators to find candidate mentions composed of an _adjective followed by a term matching our diseases dictionary_.  Note in particular that we set `left_required=False` so that exact matches to our dictionary (with no adjective prepended) will still work:

In [None]:
print corpus.get_sentences()[0].words
print corpus.get_sentences()[0].poses

In [None]:
from snorkel.matchers import Concat, RegexMatchEach
matcher = Union(
    Concat(
        RegexMatchEach(rgx=r'JJ*', attrib='poses'),
        DictionaryMatch(d=diseases, longest_match_only=False, stemmer='porter'),
        left_required=False),
    DictionaryMatch(d=acronyms, ignore_case=False))

# Extract a new set of candidates
ce = CandidateExtractor(ngrams, matcher)
%time c = ce.extract(corpus.get_sentences())
#gold_stats(c, gold)
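The role of `left_required=False` can be illustrated with a toy concatenation check in plain Python. This is a sketch of the idea only, assuming the left part is tested word by word against a part-of-speech pattern and the right part against a dictionary; `concat_match` and its arguments are hypothetical and do not mirror Snorkel's internals.

```python
import re

def concat_match(words, poses, left_rgx, right_dictionary, left_required=True):
    """Toy check: do `words` split into left words whose POS tags all match
    `left_rgx`, followed by a right part found in `right_dictionary`?"""
    if not left_required and ' '.join(words).lower() in right_dictionary:
        return True  # the right-hand matcher alone accepts the whole span
    for split in range(1, len(words)):
        left_ok = all(re.match(left_rgx, p) for p in poses[:split])
        right_ok = ' '.join(words[split:]).lower() in right_dictionary
        if left_ok and right_ok:
            return True
    return False

toy_diseases = {'renal failure'}

with_adjective = concat_match(['acute', 'renal', 'failure'], ['JJ', 'JJ', 'NN'],
                              r'JJ', toy_diseases, left_required=False)
bare_optional = concat_match(['renal', 'failure'], ['JJ', 'NN'],
                             r'JJ', toy_diseases, left_required=False)
bare_required = concat_match(['renal', 'failure'], ['JJ', 'NN'],
                             r'JJ', toy_diseases, left_required=True)
print(with_adjective, bare_optional, bare_required)
```

With the left part optional, the bare dictionary term still matches; with it required, only spans that carry a leading adjective are accepted.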

In [None]:
matcher = DictionaryMatch(d=diseases, longest_match_only=False, stemmer='porter')
ce = CandidateExtractor(ngrams, matcher)
%time c = ce.extract(corpus.get_sentences())
c[:10]

In [None]:
from snorkel.matchers import Concat, RegexMatchSpan

matcher = Concat(
    RegexMatchSpan(rgx=r'acute'),
    DictionaryMatch(d=diseases, longest_match_only=False, stemmer='porter'))
ce = CandidateExtractor(ngrams, matcher)
%time c = ce.extract(corpus.get_sentences())
c[:10]

In [None]:
from snorkel.matchers import SlotFillMatch

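# Positional matchers must come before the keyword argument;
# `proteins` is assumed to be a dictionary loaded elsewhere
matcher = SlotFillMatch(
    DictionaryMatch(d=proteins),
    RegexMatchSpan(rgx=r"\d+"),
    pattern="wt{0}-{1}")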

### Running candidate extraction in parallel

Note that the candidate extraction procedure can be **parallelized across multiple cores** using the optional `parallelism=N` argument to the `CandidateExtractor` object.

### More coming here...

We've increased the candidate recall (on the development set) by ~9% using some simple compositional `Matcher` operators.  We'll be adding more here soon!