# Tutorial, Part I: Candidate Extraction

In this example, we'll write an application to extract **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/).  At its core, we will construct a model to classify _candidate chemical-disease (C-D) relation mentions_ as either true or false.  To do this, we first need a set of such candidates; in this notebook, we'll use `DDLite` utilities to extract them.

## Loading the Corpus

First, we will load and pre-process the corpus, storing it for convenience in a `Corpus` object.

### Configuring a document parser

We'll start by defining a `DocParser` class to read in PubMed abstracts from [PubTator](http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/index.cgi), where they are stored along with "gold" (i.e. hand-annotated by experts) *chemical* and *disease mention* annotations. We'll use the `XMLDocParser` class, which allows us to use [XPath queries](https://en.wikipedia.org/wiki/XPath) to specify the relevant sections of the XML format.

_Note that we are newline-concatenating text from the title and abstract together for simplicity, but if we wanted to, we could easily extend the `DocParser` classes to preserve information about document structure._

In [1]:
from ddlite_parser import XMLDocParser
xml_parser = XMLDocParser(
    path='data/CDR_DevelopmentSet.xml',
    doc='.//document',
    text='.//passage/text/text()',
    id='.//id/text()',
    keep_xml_tree=True)
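
To make the three queries concrete, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree` on a small hand-written XML snippet. The snippet's layout is an assumption modeled on the queries above, not an excerpt of the corpus file. `ElementTree` supports only a subset of XPath (it has no `text()` step), so the sketch reads `.text` on the matched elements instead. It is not meant to show how `XMLDocParser` is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the corpus layout (assumed for illustration only)
SAMPLE = """
<collection>
  <document>
    <id>1445986</id>
    <passage><text>Cefotetan-induced immune hemolytic anemia.</text></passage>
    <passage><text>Immune hemolytic anemia due to a drug-adsorption mechanism.</text></passage>
  </document>
</collection>
"""

root = ET.fromstring(SAMPLE)
docs = []
for doc in root.findall('.//document'):          # plays the role of doc='.//document'
    doc_id = doc.find('.//id').text              # plays the role of id='.//id/text()'
    # Newline-concatenate all passage texts (title and abstract)
    text = '\n'.join(t.text for t in doc.findall('.//passage/text'))
    docs.append((doc_id, text))

print(docs[0][0])
print(docs[0][1])
```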

### Selecting a sentence parser

Next, we'll use an NLP preprocessing tool to split the `Document` objects into sentences and tokens, and to annotate these sentences with part-of-speech tags, dependency parse structure, lemmatized word forms, etc.  Here we use the default `SentenceParser` class.

In [2]:
from ddlite_parser import SentenceParser
sent_parser = SentenceParser()

### Pre-processing & loading the corpus

Finally, we'll put this all together using a `Corpus` object, which will execute the parsers and store the results as an iterator:

In [3]:
from ddlite_parser import Corpus
%time corpus = Corpus(xml_parser, sent_parser)

Parsing documents...
Parsing sentences...
CPU times: user 4.28 s, sys: 130 ms, total: 4.41 s
Wall time: 29.8 s


In [4]:
doc  = corpus.get_docs()[2]
print doc

Document(id='1445986', file='CDR_DevelopmentSet.xml', text="Cefotetan-induced immune hemolytic anemia.\nImmune hemolytic anemia due to a drug-adsorption mechanism has been described primarily in patients receiving penicillins and first-generation cephalosporins. We describe a patient who developed anemia while receiving intravenous cefotetan. Cefotetan-dependent antibodies were detected in the patient's serum and in an eluate prepared from his red blood cells. The eluate also reacted weakly with red blood cells in the absence of cefotetan, suggesting the concomitant formation of warm-reactive autoantibodies. These observations, in conjunction with clinical and laboratory evidence of extravascular hemolysis, are consistent with drug-induced hemolytic anemia, possibly involving both drug-adsorption and autoantibody formation mechanisms. This case emphasizes the need for increased awareness of hemolytic reactions to all cephalosporins.", attribs={'root': <Element document at 0x10b85fe60>}

In [5]:
sent = corpus.get_sentences_in(doc.id)[0]
print sent

Sentence(id='1445986-0', words=[u'Cefotetan-induced', u'immune', u'hemolytic', u'anemia', u'.'], lemmas=[u'cefotetan-induced', u'immune', u'hemolytic', u'anemia', u'.'], poses=[u'JJ', u'JJ', u'JJ', u'NN', u'.'], dep_parents=[4, 4, 4, 0, 4], dep_labels=[u'amod', u'amod', u'amod', u'ROOT', u'punct'], sent_id=0, doc_id='1445986', text=u'Cefotetan-induced immune hemolytic anemia.', char_offsets=[0, 18, 25, 35, 41], doc_name='CDR_DevelopmentSet.xml')
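
The fields of the printed `Sentence` line up position by position. The sketch below copies the values from the output above into plain Python lists and checks two readings of them: that each entry of `char_offsets` is the start position of the corresponding word in `text`, and that `dep_parents` uses 1-based word indices with `0` marking the root. Both readings are inferred from this one example rather than taken from the `SentenceParser` documentation.

```python
# Values copied from the Sentence printed above
words = ['Cefotetan-induced', 'immune', 'hemolytic', 'anemia', '.']
char_offsets = [0, 18, 25, 35, 41]
dep_parents = [4, 4, 4, 0, 4]
text = 'Cefotetan-induced immune hemolytic anemia.'

# Slice the text at each offset; this should recover the words themselves
spans = [text[o:o + len(w)] for w, o in zip(words, char_offsets)]

# Resolve each word's dependency head, treating 0 as the root marker
heads = [None if p == 0 else words[p - 1] for p in dep_parents]

for w, h in zip(words, heads):
    print('%s <- %s' % (w, h if h is not None else 'ROOT'))
```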


## Writing a basic candidate extractor

Next, we'll write a basic function to extract **candidate disease mentions** from the corpus.  For this first attempt, we'll just write a function that checks for matches against a list (or _"dictionary"_) of disease phrases, constructed using some pre-compiled ontologies ([UMLS](https://www.nlm.nih.gov/research/umls/), [ORDO](http://www.orphadata.org/cgi-bin/inc/ordo_orphanet.inc.php), [DOID](http://www.obofoundry.org/ontology/doid.html), [NCBI Diseases](http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/); see `tutorial/data/diseases.py`).

We'll do this using a `CandidateSpace` object--which defines the basic candidates we consider, in this case n-grams up to a certain length--and a `Matcher` object, which filters this candidate space down.

In [6]:
from load_dictionaries import load_disease_dictionary
from ddlite_candidates import Ngrams
from ddlite_matchers import DictionaryMatch

# Load the disease phrase dictionary
diseases = load_disease_dictionary()
print "Loaded %s disease phrases!" % len(diseases)

# Define a candidate space
ngrams = Ngrams(n_max=3)

# Define a matcher
matcher = DictionaryMatch(d=diseases, longest_match_only=False)

Loaded 507899 disease phrases!


Note that we set `longest_match_only=False`, which means that we _will_ consider subsequences of phrases that match our dictionary.
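
As a rough illustration of what the candidate space and the matcher do together, the sketch below enumerates n-grams over a tokenized sentence and keeps those found in a tiny dictionary. The helpers `ngram_spans` and `dictionary_match`, the toy dictionary, and the lowercasing step are all assumptions made for this example; they are not the `Ngrams` or `DictionaryMatch` implementations.

```python
def ngram_spans(words, n_max=3):
    """Yield (start, end) index pairs for every n-gram of up to n_max words."""
    for start in range(len(words)):
        for end in range(start + 1, min(start + n_max, len(words)) + 1):
            yield (start, end)

def dictionary_match(words, phrases, n_max=3, longest_match_only=False):
    """Keep the n-grams whose lowercased text is in `phrases`."""
    matches = [(s, e) for s, e in ngram_spans(words, n_max)
               if ' '.join(words[s:e]).lower() in phrases]
    if longest_match_only:
        # Drop any match that is contained in a different, larger match
        matches = [m for m in matches
                   if not any(o != m and o[0] <= m[0] and m[1] <= o[1]
                              for o in matches)]
    return [' '.join(words[s:e]) for s, e in matches]

words = ['Cefotetan-induced', 'immune', 'hemolytic', 'anemia', '.']
toy_dictionary = {'anemia', 'hemolytic anemia', 'immune hemolytic anemia'}

all_matches = dictionary_match(words, toy_dictionary)
longest_only = dictionary_match(words, toy_dictionary, longest_match_only=True)
print(all_matches)   # nested phrases are all kept
print(longest_only)  # only the longest enclosing phrase is kept
```

Keeping the nested matches enlarges the candidate set, which favors recall at this stage.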

The `Ngrams` operator is applied over our `Sentence` objects and returns `Ngram` objects, which the `Matcher` then filters.  We apply both operators over the sentences in the corpus, storing the results in a `Candidates` object for convenience:

In [None]:
from multiprocessing import Queue, JoinableQueue, Process
from Queue import Empty
from time import sleep

In [None]:
# Create Process subclass
class IncrementerProcess(Process):
    def __init__(self, qin, qout, inc):
        Process.__init__(self)
        self.qin  = qin
        self.qout = qout
        self.inc  = inc
    
    def run(self):
        while not self.qin.empty():
            sleep(5)
            try:
                n = self.qin.get(False)
                n += self.inc
                self.qout.put(n)
                self.qin.task_done()
            except Empty:
                break

# Initialize the queues
nums_out = Queue()
nums_in  = JoinableQueue()
for i in range(0,20):
    nums_in.put(i)

# Start the processes
PARALLELISM = 4
ps = []
for i in range(PARALLELISM):
    p = IncrementerProcess(nums_in, nums_out, 10)
    p.start()
    ps.append(p)

# Join on the tasks queue
nums_in.join()
    
# Collect from the out queue and return
result = []
while not nums_out.empty():
    result.append(nums_out.get(False))
print result

In [26]:
from ddlite_candidates import Candidates
%time c = Candidates(ngrams, matcher, corpus.get_sentences(), parallelism=4)
c.get_candidates()[:5]
print len(c.get_candidates())

Extracting candidates...
CPU times: user 2.13 s, sys: 365 ms, total: 2.49 s
Wall time: 13 s
6944


## Evaluating our candidate recall on gold annotations

Next, we'll test our _candidate recall_--in other words, how many of the true disease mentions we picked up in our candidate set--using the gold annotations in our dataset.

The XML documents that we loaded using the `XMLDocParser` also contained annotations (this is why we kept the full XML tree using `keep_xml_tree=True`).  We'll load these annotations and map them to `Ngram` objects over our parsed sentences, so that we can easily compare our extracted candidate set with the gold annotations.  The code is fairly simple (see `tutorial/util.py`); note that we filter to only keep _disease_ annotations, and that the candidates should be uniquely identified by their `id` attribute:

In [31]:
from utils import collect_pubtator_annotations
gold = []
for doc, sents in corpus:
    gold += [a for a in collect_pubtator_annotations(doc, sents) if a.metadata['type'] == 'Disease']
gold = frozenset(gold)

Now, we have a set of gold annotations of the same type as our candidates (`Ngram`), and can use set operations (where candidate objects are hashed by their `id` attribute), e.g.:

In [32]:
len(gold.intersection(c.get_candidates()))

3039
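
Set intersection works here because two objects that describe the same span compare equal and hash alike. A minimal sketch of that idea follows, using a hypothetical `Span` class (not the `Ngram` class itself) whose equality and hash depend only on an `id` string. The id format is made up for illustration.

```python
class Span(object):
    """Hypothetical stand-in for a candidate, identified solely by its id."""
    def __init__(self, id, metadata=None):
        self.id = id
        self.metadata = metadata or {}

    def __eq__(self, other):
        return isinstance(other, Span) and self.id == other.id

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # Hash on the id only, so metadata differences do not matter
        return hash(self.id)

gold_toy = frozenset([Span('1445986-0:35-40', {'type': 'Disease'}),
                      Span('1445986-0:25-40', {'type': 'Disease'})])
extracted_toy = [Span('1445986-0:35-40'), Span('1445986-0:18-40')]

overlap = gold_toy.intersection(extracted_toy)
print(len(overlap))
```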

For convenience, we'll use a basic helper method of the `Candidates` object:

In [33]:
c.gold_stats(gold)

# of gold annotations	= 4244
# of candidates		= 6944
Candidate recall	= 0.716
Candidate precision	= 0.438


We note that our focus in this stage is on **achieving high candidate recall, without considering an impractically large candidate set**.  Our main focus after this stage will be on training a classifier to select which candidates are true; this will raise precision while hopefully keeping recall high.  _Note however that candidate recall is an upper bound for the recall of this classifier!_

So, we have some work to do.
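
The figures reported by `gold_stats` are consistent with the counts shown above: 3039 candidates also appear in the gold set, out of 4244 gold annotations and 6944 candidates. A small sketch of the arithmetic (the helper `recall_precision` is a hypothetical name, not part of `DDLite`):

```python
def recall_precision(n_overlap, n_gold, n_candidates):
    """Return (recall, precision) from raw counts."""
    recall = float(n_overlap) / n_gold            # share of gold mentions found
    precision = float(n_overlap) / n_candidates   # share of candidates that are gold
    return recall, precision

recall, precision = recall_precision(3039, 4244, 6944)
print('Candidate recall    = %.3f' % recall)
print('Candidate precision = %.3f' % precision)
```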

## Using the `Viewer` to inspect data

Next, we'll use the `Viewer` class--here, specifically, the `SentenceNgramViewer`--to inspect the data.

To start, we'll assemble a random sample of the sentences where there are gold annotations _not in our candidate set_, i.e. where we missed something (approximated below by sentences that have fewer candidates than gold annotations), and then inspect these in the `Viewer`:

In [34]:
from collections import defaultdict
from random import shuffle

# Index the gold annotations by sentence id
gold_by_sid = defaultdict(list)
for g in gold:
    gold_by_sid[g.sent_id].append(g)

# Get sentences
view_sents = [s for s in corpus.get_sentences() if len(c.get_candidates_in(s.id)) < len(gold_by_sid[s.id])]
shuffle(view_sents)
view_sents = view_sents[:50]

Now, we instantiate and render the `Viewer` object; note we're being a bit sloppy, passing in _all_ the candidates and gold labels, but the `Viewer` object will take care of indexing them by sentence, and will only render the sentences we pass in:

In [35]:
from ddlite_viewer import SentenceNgramViewer
sv = SentenceNgramViewer(view_sents, c.get_candidates(), gold=gold)
sv.render(n_per_page=3, height=225)

## Composing a better candidate extractor

Let's try to increase our candidate recall using more of the `Matcher` operators and their functionalities.  First, let's turn on **Porter stemming** in our dictionary matcher; Porter stemming is an aggressive rules-based method for normalizing word endings.

In [36]:
# Define a new matcher
matcher = DictionaryMatch(d=diseases, longest_match_only=False, stemmer='porter')

# Extract a new set of candidates
%time c = Candidates(ngrams, matcher, corpus.get_sentences())
c.gold_stats(gold)

Extracting candidates...
CPU times: user 10.1 s, sys: 295 ms, total: 10.4 s
Wall time: 10.2 s
# of gold annotations	= 4244
# of candidates		= 9266
Candidate recall	= 0.749
Candidate precision	= 0.343
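
To see why normalizing word endings can raise recall, consider the toy sketch below. It uses a deliberately crude suffix stripper (`crude_stem`, a hypothetical helper that is far simpler than the Porter algorithm and is not what `stemmer='porter'` runs) and applies it to both the dictionary and the text before comparing. The phrases are made up for illustration.

```python
def crude_stem(word):
    """Lowercase a word and strip a trailing plural 's' (NOT Porter stemming)."""
    word = word.lower()
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

def normalize(phrase):
    """Apply crude_stem to every whitespace-separated word in a phrase."""
    return ' '.join(crude_stem(w) for w in phrase.split())

toy_dictionary = {'seizure', 'cardiac arrhythmia'}
stemmed_dictionary = set(normalize(p) for p in toy_dictionary)

mentions = ['seizures', 'Cardiac arrhythmias', 'hypotension']
exact_hits = [m for m in mentions if m in toy_dictionary]
stemmed_hits = [m for m in mentions if normalize(m) in stemmed_dictionary]
print(exact_hits)    # exact string lookup misses the plural forms
print(stemmed_hits)  # normalized lookup recovers them
```

The same loosening that recovers plural forms also admits more spurious matches, which fits the drop in candidate precision seen above.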



Next, note that *`Matcher` objects are compositional*. Observing in the `Viewer` that we are missing all of the acronyms, let's start with the `Union` operator, to integrate a dictionary for this:

In [37]:
from ddlite_matchers import Union
from load_dictionaries import load_acronym_dictionary

# Load the acronym dictionary
acronyms = load_acronym_dictionary()
print "Loaded %s acronyms!" % len(acronyms)

# Define a new matcher
matcher = Union(
    DictionaryMatch(d=diseases, longest_match_only=False, stemmer='porter'),
    DictionaryMatch(d=acronyms, ignore_case=False))

# Extract a new set of candidates
%time c = Candidates(ngrams, matcher, corpus.get_sentences())
c.gold_stats(gold)

Loaded 36904 acronyms!
Extracting candidates...
CPU times: user 10.8 s, sys: 112 ms, total: 11 s
Wall time: 10.9 s
# of gold annotations	= 4244
# of candidates		= 9811
Candidate recall	= 0.778
Candidate precision	= 0.336
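
The compositional idea can be sketched with two small classes that share a `matches` method. `ToyDictionaryMatch` and `ToyUnion` are hypothetical stand-ins written for this example, and both the default `ignore_case=True` and the acronym `ALS` are assumptions; none of this is taken from the `ddlite_matchers` source.

```python
class ToyDictionaryMatch(object):
    """Hypothetical matcher: accepts a phrase if it is in a set of phrases."""
    def __init__(self, d, ignore_case=True):
        self.ignore_case = ignore_case
        self.d = set(p.lower() for p in d) if ignore_case else set(d)

    def matches(self, phrase):
        return (phrase.lower() if self.ignore_case else phrase) in self.d

class ToyUnion(object):
    """Hypothetical matcher: accepts a phrase if any child matcher accepts it."""
    def __init__(self, *children):
        self.children = children

    def matches(self, phrase):
        return any(c.matches(phrase) for c in self.children)

toy_matcher = ToyUnion(
    ToyDictionaryMatch(['hemolytic anemia']),
    ToyDictionaryMatch(['ALS'], ignore_case=False))  # acronyms stay case-sensitive

phrases = ['Hemolytic anemia', 'ALS', 'als', 'cefotetan']
accepted = [p for p in phrases if toy_matcher.matches(p)]
print(accepted)
```

Matching acronyms case-sensitively keeps short lowercase strings from being picked up as acronyms.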


Next, we try using the `Concat` and `RegexMatchEach` operators to find candidate mentions composed of an _adjective followed by a term matching our diseases dictionary_.  Note in particular that we set `left_required=False` so that exact matches to our dictionary (with no adjective prepended) will still work:

In [40]:
from ddlite_matchers import Concat, RegexMatchEach
matcher = Union(
    Concat(
        RegexMatchEach(rgx=r'JJ*', attrib='poses'),
        DictionaryMatch(d=diseases, longest_match_only=False, stemmer='porter'),
        left_required=False),
    DictionaryMatch(d=acronyms, ignore_case=False))

# Extract a new set of candidates
%time c = Candidates(ngrams, matcher, corpus.get_sentences())
c.gold_stats(gold)

Extracting candidates...
CPU times: user 20.6 s, sys: 558 ms, total: 21.1 s
Wall time: 20.8 s
# of gold annotations	= 4244
# of candidates		= 12292
Candidate recall	= 0.821
Candidate precision	= 0.284
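
One plausible reading of the adjective-plus-dictionary pattern is sketched below: a span is accepted if it can be split into a run of adjective-tagged words followed by a dictionary phrase, and `left_required=False` allows the adjective run to be empty. The function `adjective_prefixed_matches`, the example sentence, and the toy dictionary are assumptions for illustration; the actual `Concat` semantics may differ in detail.

```python
import re

def adjective_prefixed_matches(words, poses, phrases, left_required=False):
    """Return spans made of optional leading adjectives plus a dictionary phrase.

    A span words[s:e] is accepted if, for some split point k, every tag in
    poses[s:k] starts with 'JJ' and words[k:e] is a dictionary phrase.
    With left_required=False the adjective part may be empty (k == s).
    """
    is_adj = re.compile(r'JJ').match
    found = []
    for s in range(len(words)):
        for e in range(s + 1, len(words) + 1):
            first_k = s + 1 if left_required else s
            for k in range(first_k, e):
                left_ok = all(is_adj(t) for t in poses[s:k])
                right_ok = ' '.join(words[k:e]).lower() in phrases
                if left_ok and right_ok:
                    found.append(' '.join(words[s:e]))
                    break  # one valid split is enough for this span
    return found

words = ['severe', 'hemolytic', 'anemia', 'developed']
poses = ['JJ', 'JJ', 'NN', 'VBD']
toy_dictionary = {'anemia', 'hemolytic anemia'}

with_optional_left = adjective_prefixed_matches(words, poses, toy_dictionary)
with_required_left = adjective_prefixed_matches(words, poses, toy_dictionary,
                                                left_required=True)
print(with_optional_left)
print(with_required_left)
```

With the adjective part required, the bare dictionary match `anemia` is lost, which is why the tutorial sets `left_required=False`.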


In [41]:
from ddlite_matchers import Concat, RegexMatchEach
matcher = Union(
    Concat(
        RegexMatchEach(rgx=r'JJ*', attrib='poses'),
        DictionaryMatch(d=diseases, longest_match_only=False, stemmer='porter'),
        left_required=False),
    DictionaryMatch(d=acronyms, ignore_case=False))

# Extract a new set of candidates
%time c = Candidates(ngrams, matcher, corpus.get_sentences(), parallelism=4)
c.gold_stats(gold)

Extracting candidates...
CPU times: user 2.83 s, sys: 435 ms, total: 3.27 s
Wall time: 29.2 s
# of gold annotations	= 4244
# of candidates		= 12292
Candidate recall	= 0.821
Candidate precision	= 0.284


### More coming here...

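We've increased the candidate recall (on the development set) from 0.716 to 0.821, a gain of about 10.5 percentage points, using some simple compositional `Matcher` operators; over the same steps, candidate precision fell from 0.438 to 0.284.  We'll be adding more here soon!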

## Connecting with the rest of the `DDLite` workflow

We are in the process of a big code refactor!  For now, to connect the candidates derived as above to the rest of the `DDLite` workflow, create a `CandidateExtractor` object as follows:

In [None]:
from ddlite_matchers import CandidateExtractor
ce = CandidateExtractor(ngrams, matcher)

This object can now be used in place of the candidate extractors in the other tutorials!