# Tutorial, Part I: Candidate Extraction

In this example, we'll write an application to extract **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/).  At its core, we will construct a model that classifies _candidate chemical-disease (C-D) relation mentions_ as either true or false.  To do this, we first need a set of such candidates; in this notebook, we'll use `DDLite` utilities to extract them.

## Loading the Corpus

First, we will load and pre-process the corpus, storing it for convenience in a `Corpus` object.

### Configuring a document parser

We'll start by defining a `DocParser` class to read in PubMed abstracts from [Pubtator](http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/index.cgi), where they are stored along with "gold" (i.e. hand-annotated by experts) *chemical* and *disease mention* annotations. We'll use the `XMLDocParser` class, which allows us to use [XPath queries](https://en.wikipedia.org/wiki/XPath) to specify the relevant sections of the XML format.

_Note that we are newline-concatenating text from the title and abstract together for simplicity, but if we wanted to, we could easily extend the `DocParser` classes to preserve information about document structure._

In [1]:
from ddlite_parser import XMLDocParser
xml_parser = XMLDocParser(
    path='data/CDR_DevelopmentSet.xml',
    doc='.//document',
    text='.//passage/text/text()',
    id='.//id/text()',
    keep_xml_tree=True)
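
To make the three XPath queries above concrete, here is a minimal sketch that uses only Python's standard `xml.etree.ElementTree` module instead of `XMLDocParser`. The inline XML is a made-up, heavily shortened stand-in for the real file (only the document id and title are taken from the output shown later in this tutorial), and the helper name `parse_documents` is hypothetical. Because `ElementTree` supports only a subset of XPath, the `text()` step is replaced by reading each element's `.text` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for data/CDR_DevelopmentSet.xml
SAMPLE_XML = """
<collection>
  <document>
    <id>6794356</id>
    <passage>
      <text>Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant.</text>
    </passage>
    <passage>
      <text>A newborn with massive tricuspid regurgitation is described.</text>
    </passage>
  </document>
</collection>
"""

def parse_documents(xml_string):
    """Yield (doc_id, text) pairs, newline-joining all passages of a document."""
    root = ET.fromstring(xml_string)
    for doc in root.findall('.//document'):          # same role as doc='.//document'
        doc_id = doc.find('.//id').text              # same role as id='.//id/text()'
        passages = [node.text for node in doc.findall('.//passage/text')]
        yield doc_id, '\n'.join(passages)            # title and abstract on separate lines

docs = list(parse_documents(SAMPLE_XML))
print(docs[0][0])
print(docs[0][1])
```

The newline join is what produces the `\n` between title and abstract in the `Document` text printed further below.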

### Selecting a sentence parser

Next, we'll use an NLP preprocessing tool to split the `Document` objects into sentences and tokens, and to annotate these sentences with part-of-speech tags, dependency parse structure, lemmatized word forms, etc.  Here we use the default `SentenceParser` class.

In [2]:
from ddlite_parser import SentenceParser
sent_parser = SentenceParser()
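
As a rough illustration of the first two steps only (sentence splitting and tokenization), here is a naive regex-based sketch. It is not how `SentenceParser` works: a real NLP pipeline also produces part-of-speech tags, dependency parses, and lemmas, and handles abbreviations and other edge cases that this sketch ignores. The function names and the inclusive character-offset convention are assumptions chosen for illustration.

```python
import re

def split_sentences(text):
    """Naively split on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def tokenize(sentence):
    """Return (token, char_start, char_end) triples; char_end is inclusive."""
    return [(m.group(), m.start(), m.end() - 1)
            for m in re.finditer(r'\w+|[^\w\s]', sentence)]

text = ('Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant.\n'
        'Sixty-three percent of these infants had tricuspid valve involvement.')

toy_sentences = split_sentences(text)
print(len(toy_sentences))              # 2
print(tokenize(toy_sentences[0])[:3])  # [('Tricuspid', 0, 8), ('valve', 10, 14), ('regurgitation', 16, 28)]
```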

### Pre-processing & loading the corpus

Finally, we'll put this all together using a `Corpus` object, which will execute the parsers and store the results as an iterator:

In [3]:
from ddlite_parser import Corpus
corpus = Corpus(xml_parser, sent_parser)

Parsing documents...
Parsing sentences...

In [4]:
for doc, sentences in corpus:
    print doc
    print "\n"
    print sentences[0]
    break

Document(id='6794356', file='CDR_DevelopmentSet.xml', text='Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant.\nA newborn with massive tricuspid regurgitation, atrial flutter, congestive heart failure, and a high serum lithium level is described. This is the first patient to initially manifest tricuspid regurgitation and atrial flutter, and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy. Sixty-three percent of these infants had tricuspid valve involvement. Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy. It also causes neurologic depression, cyanosis, and cardiac arrhythmia when consumed prior to delivery.', attribs={'root': <Element document at 0x10fab0a50>})


Sentence(id='6794356-0', words=[u'Tricuspid', u'valve', u'regurgitation', u'and', u'lithium', u'carbonate', u'toxicity', u'in', u'a', u'newbor

## Writing a basic candidate extractor

Next, we'll write a basic function to extract **candidate disease mentions** from the corpus.  For this first attempt, we'll just write a function that checks for matches against a list (or _"dictionary"_) of disease phrases, constructed using some pre-compiled ontologies ([UMLS](https://www.nlm.nih.gov/research/umls/), [ORDO](http://www.orphadata.org/cgi-bin/inc/ordo_orphanet.inc.php), [DOID](http://www.obofoundry.org/ontology/doid.html), [NCBI Diseases](http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/); see `tutorial/data/diseases.py`).

We'll do this using a `CandidateSpace` object--which defines the basic candidates we consider, in this case n-grams up to a certain length--and a `Matcher` object, which filters this candidate space down.

In [5]:
from load_dictionaries import load_disease_dictionary
from ddlite_candidates import Ngrams
from ddlite_matchers import DictionaryMatch

# Load the disease phrase dictionary
diseases = load_disease_dictionary()
print "Loaded %s disease phrases!" % len(diseases)

# Define a candidate space
ngrams = Ngrams(n_max=3)

# Define a matcher
matcher = DictionaryMatch(d=diseases, longest_match_only=True)

Loaded 507899 disease phrases!


Note that we set `longest_match_only=True`, which means that when a dictionary match lies inside a longer dictionary match, we keep only the longer one.
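
The interplay between an n-gram candidate space and a dictionary matcher can be sketched in plain Python. This is only an illustration of the idea, not the `Ngrams` or `DictionaryMatch` implementation: the function names, the case-insensitive comparison, the toy sentence, and the toy dictionary are all assumptions, and the longest-match rule here simply drops any match contained in another match.

```python
def ngrams(words, n_max=3):
    """Yield (start, end) word-index spans (end inclusive) of up to n_max words."""
    for start in range(len(words)):
        for end in range(start, min(start + n_max, len(words))):
            yield start, end

def dictionary_match(words, phrases, n_max=3, longest_match_only=True):
    """Return the spans whose lower-cased text is in `phrases`."""
    matches = [(s, e) for s, e in ngrams(words, n_max)
               if ' '.join(words[s:e + 1]).lower() in phrases]
    if longest_match_only:
        # Drop any span that lies inside a different matched span
        matches = [(s, e) for s, e in matches
                   if not any(s2 <= s and e <= e2 and (s2, e2) != (s, e)
                              for s2, e2 in matches)]
    return matches

# Toy sentence and toy dictionary (hypothetical, for illustration only)
words = 'Mean arterial pressure fell during anaesthesia'.split()
phrases = {'arterial pressure', 'mean arterial pressure', 'anaesthesia'}

print(dictionary_match(words, phrases, longest_match_only=True))   # [(0, 2), (5, 5)]
print(dictionary_match(words, phrases, longest_match_only=False))  # [(0, 2), (1, 2), (5, 5)]
```

With `longest_match_only=False`, the shorter span `(1, 2)` ("arterial pressure") is returned alongside the longer span `(0, 2)` that contains it.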

The `Ngrams` operator is applied over our `Sentence` objects and returns `Ngram` objects, and the `Matcher` then filters these, so we apply our operators over the corpus (also storing the results in our `Corpus` object):

In [10]:
corpus.extract_candidates(ngrams, matcher)
candidates = list(corpus.iter_candidates())
candidates[:5]

[<Ngram("blood pressure", id=1969772-5:640-653, chars=[640,653], words=[2,3]),
 <Ngram("anaesthesia", id=1969772-5:925-935, chars=[925,935], words=[44,44]),
 <Ngram("hypotension", id=1969772-4:562-572, chars=[562,572], words=[9,9]),
 <Ngram("Mean arterial pressure", id=1969772-6:938-959, chars=[938,959], words=[0,2]),
 <Ngram("hypotension", id=1969772-1:129-139, chars=[129,139], words=[7,7])]

## Evaluating our candidate recall on gold annotations

Next, we'll test our _candidate recall_--in other words, how many of the true disease mentions we picked up in our candidate set--using the gold annotations in our dataset.

The XML documents that we loaded using the `XMLDocParser` also contained annotations (this is why we kept the full XML tree using `keep_xml_tree=True`).  We'll load these annotations and map them to `Ngram` objects over our parsed sentences, so that we can easily compare our extracted candidate set with the gold annotations.  The code is fairly simple (see `tutorial/util.py`); note that we filter to only keep _disease_ annotations:

In [11]:
from utils import collect_pubtator_annotations
gold = []
for doc, sents in corpus:
    gold += [a for a in collect_pubtator_annotations(doc, sents) if a.metadata['type'] == 'Disease']
gold = frozenset(gold)

Now, we can get candidate recall using simple set operations (note that the candidates are identified based on their `id` attribute):

In [12]:
print "Candidate recall = %0.3f" % (len(gold.intersection(candidates)) / float(len(gold)),)

Candidate recall = 0.676
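
The recall computation can be tried on toy data without loading the corpus. The sketch below assumes mentions are compared by string ids; the ids are made up, written in the `<sentence id>:<char start>-<char end>` style of the `Ngram` output shown earlier, and the helper name `candidate_recall` is hypothetical.

```python
def candidate_recall(gold_ids, candidate_ids):
    """Fraction of gold mentions that also appear among the candidates."""
    gold_ids = frozenset(gold_ids)
    if not gold_ids:
        raise ValueError('recall is undefined for an empty gold set')
    return len(gold_ids & frozenset(candidate_ids)) / float(len(gold_ids))

# Made-up ids for illustration only
toy_gold = ['1-0:10-20', '1-0:30-35', '1-1:5-9', '1-2:0-3']
toy_candidates = ['1-0:10-20', '1-1:5-9', '1-1:40-44']

recall = candidate_recall(toy_gold, toy_candidates)
print('Candidate recall = %0.3f' % recall)  # Candidate recall = 0.500
```

Two of the four gold mentions are found, so recall is 0.5; the extra candidate `1-1:40-44` lowers precision but does not affect recall.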


## Using the `Viewer` to inspect data

Our candidate recall of 0.676 is fairly low, especially since the **candidate recall is an upper bound for our total system recall**: a true mention that is never extracted as a candidate can never be classified as true.  Next, we'll use the `Viewer` object to do error analysis and get ideas of how to build a better candidate extractor!

In [24]:
from ddlite_viewer import SentenceViewer
sv = SentenceViewer(corpus)
sv.render()

### TO-DO:

1. Handle span insertion & overlapping candidates!
2. Easier way to input custom sets (i.e.: DECOUPLE!)
3. Handle multiple sets of candidates / annotations
4. Display id / caption
5. L/R/U/D navigation / pagination capability!

## Testing `CandidateSpace` and `Matcher`

In [None]:
from ddlite_candidates import Ngrams
from ddlite_matchers import DictionaryMatch, Union, Concat

In [None]:
cs = Ngrams(n_max=3)

In [None]:
matcher = DictionaryMatch(d=diseases, longest_match_only=True)
matches = []
for match in matcher.apply(cs.apply(sents[0])):
    matches.append(match)
    print match

In [None]:
matches[1][:5]

In [None]:
matcher = DictionaryMatch(d=diseases, longest_match_only=False)
for match in matcher.apply(cs.apply(sents[0])):
    print match

In [None]:
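dA = ['tricuspid valve', 'lithium']
dB = ['regurgitation', 'carbonate']
matcher = Concat(DictionaryMatch(d=dA), DictionaryMatch(d=dB))
# First sentence of the first document (`sentences` is left over from the loop in cell 4)
sent = sentences[0]
for match in matcher.apply(cs.apply(sent)):
    print match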

### Comparing against gold candidate set

#### TODO: DictionaryMatch accepts either list or dict; in latter case, assumes vals are the IDs!

#### ALSO: Add estimate_size method to CandidateExtraction operators