# Tutorial: Chemical-Disease Extraction

## Part I: Candidate Extraction

In this example, we'll write an application to extract **chemical-disease** relationships from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/).  At its core, the application is a model that classifies _candidate chemical-disease (C-D) relation mentions_ as either true or false.  To do this, we first need a set of such candidates.

In this notebook, we'll use `DDLite` utilities to extract these candidates.  In _part II_, we'll start with the gold set of candidates and just focus on the core candidate classification task.

### Parsing from XML format

We'll start by using `DDLite`'s `DocParser` class to read in PubMed abstracts from [PubTator](http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/index.cgi), where they are stored along with gold chemical and disease mention annotations.

We'll use the `XMLDocParser` class, which allows us to use XPath queries to specify the relevant sections of the XML format.  Note that, for simplicity, we newline-concatenate the title and abstract text into a single string.

In [32]:
from ddlite_parser import XMLDocParser
xml_parser = XMLDocParser(path='data/CDR_DevelopmentSet.xml',
    doc='.//document', text='.//passage/text/text()',
    id='.//id/text()', keep_tree=True)
documents = list(xml_parser.parse())
documents[0]

Document(id='6794356', file='CDR_DevelopmentSet.xml', text='Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant.\nA newborn with massive tricuspid regurgitation, atrial flutter, congestive heart failure, and a high serum lithium level is described. This is the first patient to initially manifest tricuspid regurgitation and atrial flutter, and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy. Sixty-three percent of these infants had tricuspid valve involvement. Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy. It also causes neurologic depression, cyanosis, and cardiac arrhythmia when consumed prior to delivery.', attribs={'root': <Element document at 0x10db9fc30>})
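
To make the XPath arguments concrete, here is a minimal sketch of the same document/id/text selection using only the standard library. It is not how `XMLDocParser` is implemented; `SAMPLE` is an abridged, hand-written stand-in for the corpus file, and `iter_documents` is a hypothetical helper. Note that `xml.etree.ElementTree` supports only a subset of XPath and has no `text()` step, so the sketch reads the `.text` attribute of the matched elements instead.

```python
import xml.etree.ElementTree as ET

# Abridged, hand-written sample in the same shape as the corpus file (assumption).
SAMPLE = """<collection>
  <document>
    <id>6794356</id>
    <passage><text>Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant.</text></passage>
    <passage><text>A newborn with massive tricuspid regurgitation is described.</text></passage>
  </document>
</collection>"""

def iter_documents(xml_string):
    """Yield (doc_id, text) pairs, newline-joining all passage texts per document."""
    root = ET.fromstring(xml_string)
    for doc in root.findall('.//document'):
        doc_id = doc.findtext('.//id')
        passages = [t.text for t in doc.findall('.//passage/text') if t.text]
        yield doc_id, '\n'.join(passages)

toy_docs = list(iter_documents(SAMPLE))
print(toy_docs[0][0])
```

The title and abstract end up in one string separated by `\n`, which is why the `Document` text above contains a newline after the title.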

### Pre-processing the sentences

Next, we'll use an NLP preprocessing tool to split the `Document` objects into sentences and tokens, and to annotate each sentence with part-of-speech tags, dependency parse structure, lemmatized word forms, etc.

In [33]:
from ddlite_parser import SentenceParser
parser = SentenceParser()
sentences = parser.parse_docs(documents)
sentences[0]

Sentence(id='6794356-0', words=[u'Tricuspid', u'valve', u'regurgitation', u'and', u'lithium', u'carbonate', u'toxicity', u'in', u'a', u'newborn', u'infant', u'.'], lemmas=[u'tricuspid', u'valve', u'regurgitation', u'and', u'lithium', u'carbonate', u'toxicity', u'in', u'a', u'newborn', u'infant', u'.'], poses=[u'JJ', u'NN', u'NN', u'CC', u'NN', u'NN', u'NN', u'IN', u'DT', u'JJ', u'NN', u'.'], dep_parents=[3, 3, 0, 3, 7, 7, 3, 11, 11, 11, 3, 3], dep_labels=[u'amod', u'compound', u'ROOT', u'cc', u'compound', u'compound', u'conj', u'case', u'det', u'amod', u'nmod', u'punct'], sent_id=0, doc_id='6794356', text=u'Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant.', char_offsets=[0, 10, 16, 30, 34, 42, 52, 61, 64, 66, 74, 80], doc_name='CDR_DevelopmentSet.xml')
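
The fields of a `Sentence` are parallel lists, one entry per token. The sketch below shows how they line up, using values copied from the output above. Two things here are assumptions: the regular expression is only a toy stand-in for the real tokenizer, and the reading of `dep_parents` as 1-indexed positions (with `0` marking the root) is inferred from the printed values rather than taken from `DDLite`'s documentation.

```python
import re

text = ('Tricuspid valve regurgitation and lithium carbonate toxicity '
        'in a newborn infant.')

# Toy tokenizer (assumption): runs of word characters, or single punctuation marks.
tokens = [(m.group(), m.start()) for m in re.finditer(r'\w+|[^\w\s]', text)]
words = [w for w, _ in tokens]
char_offsets = [o for _, o in tokens]

# Copied from the Sentence printed above.
dep_parents = [3, 3, 0, 3, 7, 7, 3, 11, 11, 11, 3, 3]
dep_labels = ['amod', 'compound', 'ROOT', 'cc', 'compound', 'compound',
              'conj', 'case', 'det', 'amod', 'nmod', 'punct']

def head_of(i):
    """Return the head word of token i, treating parent 0 as the root."""
    parent = dep_parents[i]
    return 'ROOT' if parent == 0 else words[parent - 1]

for i, word in enumerate(words):
    print('%-14s offset=%-3d %-9s head=%s'
          % (word, char_offsets[i], dep_labels[i], head_of(i)))
```

Under this reading, "lithium" and "carbonate" both attach to "toxicity", and "regurgitation" is the root of the sentence.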

#### Disease mentions
We'll build our disease mention extractor using some pre-compiled ontologies ([UMLS](https://www.nlm.nih.gov/research/umls/), [ORDO](http://www.orphadata.org/cgi-bin/inc/ordo_orphanet.inc.php), [DOID](http://www.obofoundry.org/ontology/doid.html), [NCBI Diseases](http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/); see `tutorial/data/diseases.py`) and `DDLite`'s `CandidateExtractor` operators.

In [34]:
from load_dictionaries import load_disease_dictionary, \
                              load_acronym_dictionary
diseases = load_disease_dictionary()
len(diseases)

507899

### Testing `CandidateSpace` and `Matcher`

In [None]:
from ddlite_candidates import Ngrams
from ddlite_matchers import DictionaryMatch, Union, Concat

In [None]:
cs = Ngrams(n_max=3)

In [None]:
matcher = DictionaryMatch(d=diseases, longest_match_only=True)
matches = []
for match in matcher.apply(cs.apply(sentences[0])):
    matches.append(match)
    print match

In [None]:
matches[1][:5]

In [None]:
matcher = DictionaryMatch(d=diseases, longest_match_only=False)
for match in matcher.apply(cs.apply(sentences[0])):
    print match
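
The two cells above differ only in `longest_match_only`. As a rough illustration of what that flag could mean, here is a self-contained sketch of n-gram enumeration followed by dictionary lookup. It is not `DDLite`'s implementation: the function names, the case-insensitive comparison, the space-joined phrases, and the three-entry dictionary are all assumptions made for the example.

```python
def ngram_spans(words, n_max=3):
    """Yield inclusive (start, end) word spans, longest n-grams first."""
    for n in range(n_max, 0, -1):
        for start in range(len(words) - n + 1):
            yield (start, start + n - 1)

def dictionary_match(words, dictionary, n_max=3, longest_match_only=True):
    """Return the spans whose text is in the dictionary.

    With longest_match_only=True, a span nested inside an already
    accepted (longer) span is dropped.
    """
    entries = set(e.lower() for e in dictionary)
    accepted = []
    for start, end in ngram_spans(words, n_max):
        phrase = ' '.join(words[start:end + 1]).lower()
        if phrase not in entries:
            continue
        nested = any(s <= start and end <= e for s, e in accepted)
        if longest_match_only and nested:
            continue
        accepted.append((start, end))
    return sorted(accepted)

words = ['Tricuspid', 'valve', 'regurgitation', 'and', 'lithium',
         'carbonate', 'toxicity', 'in', 'a', 'newborn', 'infant', '.']
toy_dict = ['tricuspid valve regurgitation', 'regurgitation', 'toxicity']

longest = dictionary_match(words, toy_dict, longest_match_only=True)
everything = dictionary_match(words, toy_dict, longest_match_only=False)
print(longest)
print(everything)
```

In this toy run the nested span for "regurgitation" is kept only when `longest_match_only=False`; the longer "Tricuspid valve regurgitation" span is returned in both cases.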

In [None]:
dA = ['tricuspid valve', 'lithium']
dB = ['regurgitation','carbonate']
matcher = Concat(DictionaryMatch(d=dA), DictionaryMatch(d=dB))
for match in matcher.apply(cs.apply(sentences[0])):
    print match
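
One plausible reading of `Concat` is that a phrase matches when it can be split into a left part accepted by the first matcher and a right part accepted by the second. The sketch below implements that reading for two plain dictionaries; the helper name `concat_match`, the single-space separator, and the case-insensitive comparison are assumptions, and the actual `Concat` operator may behave differently.

```python
def concat_match(words, dict_a, dict_b, sep=' '):
    """True if words split into a prefix in dict_a and a suffix in dict_b."""
    a = set(x.lower() for x in dict_a)
    b = set(x.lower() for x in dict_b)
    for split in range(1, len(words)):
        left = sep.join(words[:split]).lower()
        right = sep.join(words[split:]).lower()
        if left in a and right in b:
            return True
    return False

dA = ['tricuspid valve', 'lithium']
dB = ['regurgitation', 'carbonate']

print(concat_match(['Tricuspid', 'valve', 'regurgitation'], dA, dB))
print(concat_match(['lithium', 'carbonate'], dA, dB))
print(concat_match(['valve', 'regurgitation'], dA, dB))
```

Under this reading, the two dictionaries act as a cross product: any entry of `dA` followed by any entry of `dB` is accepted, without having to list every combination explicitly.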

### Writing a `CDR` candidate extractor

We'll start by writing a _disease mention extractor_, then a _chemical mention extractor_, and finally take all _C-D_ pairs co-occurring in a sentence as our C-D relation candidates.

#### Loading a gold candidate set
First, we'll load the gold annotations from the CDR development set so that we can test our _candidate recall_, that is, the fraction of gold mentions that our candidate extractor recovers.

In [4]:
doc = documents[0]
doc

Document(id='6794356', file='CDR_DevelopmentSet.xml', text='Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant.\nA newborn with massive tricuspid regurgitation, atrial flutter, congestive heart failure, and a high serum lithium level is described. This is the first patient to initially manifest tricuspid regurgitation and atrial flutter, and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy. Sixty-three percent of these infants had tricuspid valve involvement. Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy. It also causes neurologic depression, cyanosis, and cardiac arrhythmia when consumed prior to delivery.', attribs={'root': <Element document at 0x1092ebe60>})

In [13]:
from utils import collect_pubtator_annotations
annos = collect_pubtator_annotations(doc, [s for s in sentences if s.doc_id == doc.id])
annos = [a for a in annos if a.metadata['type'] == 'Disease']
annos

[<Ngram("Tricuspid valve regurgitation", id=6794356-0:0-28, chars=[0,28], words=[0,2]),
 <Ngram("toxicity", id=6794356-0:52-59, chars=[52,59], words=[6,6]),
 <Ngram("tricuspid regurgitation", id=6794356-1:105-127, chars=[105,127], words=[4,5]),
 <Ngram("atrial flutter", id=6794356-1:130-143, chars=[130,143], words=[7,8]),
 <Ngram("congestive heart failure", id=6794356-1:146-169, chars=[146,169], words=[10,12]),
 <Ngram("tricuspid regurgitation", id=6794356-2:265-287, chars=[265,287], words=[8,9]),
 <Ngram("atrial flutter", id=6794356-2:293-306, chars=[293,306], words=[11,12]),
 <Ngram("cardiac disease", id=6794356-2:345-359, chars=[345,359], words=[20,21]),
 <Ngram("congenital heart disease", id=6794356-4:576-599, chars=[576,599], words=[11,13]),
 <Ngram("neurologic depression", id=6794356-5:651-671, chars=[651,671], words=[3,4]),
 <Ngram("cyanosis", id=6794356-5:674-681, chars=[674,681], words=[6,6]),
 <Ngram("cardiac arrhythmia", id=6794356-5:688-705, chars=[688,705], words=[9,10])]

In [14]:
from ddlite_candidates import Ngrams
cs = Ngrams(n_max=5)

from ddlite_matchers import DictionaryMatch, Union, Concat
matcher = DictionaryMatch(d=diseases, longest_match_only=True)
matches = []
for sent in sentences[:6]:
    for match in matcher.apply(cs.apply(sent)):
        matches.append(match)
matches

[<Ngram("Tricuspid valve regurgitation", id=6794356-0:0-28, chars=[0,28], words=[0,2]),
 <Ngram("toxicity", id=6794356-0:52-59, chars=[52,59], words=[6,6]),
 <Ngram("congestive heart failure", id=6794356-1:146-169, chars=[146,169], words=[10,12]),
 <Ngram("serum lithium level", id=6794356-1:183-201, chars=[183,201], words=[17,19]),
 <Ngram("tricuspid regurgitation", id=6794356-1:105-127, chars=[105,127], words=[4,5]),
 <Ngram("atrial flutter", id=6794356-1:130-143, chars=[130,143], words=[7,8]),
 <Ngram("tricuspid regurgitation", id=6794356-2:265-287, chars=[265,287], words=[8,9]),
 <Ngram("atrial flutter", id=6794356-2:293-306, chars=[293,306], words=[11,12]),
 <Ngram("cardiac disease", id=6794356-2:345-359, chars=[345,359], words=[20,21]),
 <Ngram("congenital heart disease", id=6794356-4:576-599, chars=[576,599], words=[11,13]),
 <Ngram("early pregnancy", id=6794356-4:619-633, chars=[619,633], words=[17,18]),
 <Ngram("cardiac arrhythmia", id=6794356-5:688-705, chars=[688,705], words=

In [25]:
print "Total gold candidates: %s" % len(annos)
print "Total candidate matches: %s" % len(matches)
print "CANDIDATE RECALL: %s" % (len(set(annos).intersection(matches)) / float(len(annos)),)
print "Missing matches: %s" % list(set(annos).difference(matches))

Total gold candidates: 12
Total candidate matches: 14
CANDIDATE RECALL: 0.916666666667
Missing matches: [<Ngram("neurologic depression", id=6794356-5:651-671, chars=[651,671], words=[3,4])]
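
The recall computation above is a set intersection over spans. Here is a small standalone sketch of the same arithmetic, with each mention reduced to a `(sentence_id, char_start, char_end)` tuple. The tuples are a hand-copied subset of sentences 4 and 5 from the outputs above, so the resulting number is for illustration only, and `candidate_recall` is a hypothetical helper. The real cell relies on `Ngram` objects comparing equal when their spans are equal, which this sketch assumes rather than verifies.

```python
def candidate_recall(gold, candidates):
    """Fraction of gold spans that also appear among the candidates."""
    if not gold:
        return 0.0
    return len(gold & candidates) / float(len(gold))

# Hand-copied subset of the gold annotations (sentences 4 and 5 only).
gold = set([
    ('6794356-4', 576, 599),  # congenital heart disease
    ('6794356-5', 651, 671),  # neurologic depression
    ('6794356-5', 674, 681),  # cyanosis
    ('6794356-5', 688, 705),  # cardiac arrhythmia
])

# Hand-copied subset of the dictionary matches for the same sentences.
candidates = set([
    ('6794356-4', 576, 599),  # congenital heart disease
    ('6794356-4', 619, 633),  # early pregnancy (not a gold mention)
    ('6794356-5', 674, 681),  # cyanosis
    ('6794356-5', 688, 705),  # cardiac arrhythmia
])

recall = candidate_recall(gold, candidates)
missing = sorted(gold - candidates)
print('recall: %s' % recall)
print('missing: %s' % missing)
```

Extra candidates such as "early pregnancy" do not affect recall; they only lower precision, which the downstream classifier is meant to recover.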


In [45]:
from ddlite_candidates import Ngrams
from ddlite_matchers import DictionaryMatch, Union, Concat

cs = Ngrams(n_max=5)
matcher = DictionaryMatch(d=diseases, longest_match_only=True)

disease_annotations = []
disease_candidates  = []
for i,doc in enumerate(documents):
    if i % 100 == 0:
        print i
    
    # Get the Sentences associated with each Document
    sents = [s for s in sentences if s.doc_id == doc.id]
    
    # Get the gold annotations
    disease_annotations += [a for a in collect_pubtator_annotations(doc, sents) if a.metadata['type'] == 'Disease']
    
    # Extract Candidates using our Matchers
    for sent in sents:
        disease_candidates += list(matcher.apply(cs.apply(sent)))
disease_annotations = set(disease_annotations)

N = len(disease_annotations)
print "Total gold candidates: %s" % N
print "Total candidate matches: %s" % len(disease_candidates)
print "CANDIDATE RECALL: %s" % (len(disease_annotations.intersection(disease_candidates)) / float(N),)

0
100
200
300
400
Total gold candidates: 4244
Total candidate matches: 5403
CANDIDATE RECALL: 0.678840716305
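
The loop above rescans the full `sentences` list once per document. If that becomes slow on larger corpora, one option is to group the sentences by `doc_id` in a single pass first. The sketch below shows the idea on toy data; `Sent` is a stand-in namedtuple rather than `DDLite`'s `Sentence` class, and `group_by_doc` is a hypothetical helper.

```python
from collections import defaultdict, namedtuple

# Stand-in for DDLite's Sentence; only the fields needed here (assumption).
Sent = namedtuple('Sent', ['id', 'doc_id'])

toy_sentences = [
    Sent('6794356-0', '6794356'),
    Sent('6794356-1', '6794356'),
    Sent('17285209-0', '17285209'),
]

def group_by_doc(sentences):
    """Map each doc_id to its sentences, preserving the original order."""
    by_doc = defaultdict(list)
    for s in sentences:
        by_doc[s.doc_id].append(s)
    return by_doc

sents_by_doc = group_by_doc(toy_sentences)
print([s.id for s in sents_by_doc['6794356']])
```

Inside the per-document loop, `sents_by_doc[doc.id]` would then replace the list comprehension over all sentences.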


In [38]:
documents[131]

Document(id='17285209', file='CDR_DevelopmentSet.xml', text='Angiotensin-converting enzyme (ACE) inhibitor-associated angioedema of the stomach and small intestine: a case report.\nThis is a case report on a 45-year old African-American female with newly diagnosed hypertension, who was started on a combination pill of amlodipine/benazapril 10/5 mg. The very next day, she presented at the emergency room (ER) with abdominal pain, nausea and vomiting. Physical exam, complete metabolic panel, and hemogram were in the normal range. She was discharged from the ER after a few hours of treatment with fluid and analgesics. However, she returned to the ER the next day with the same complaints. This time the physical exam was significant for a distended abdomen with dullness to percussion. CT scan of the abdomen revealed markedly thickened antrum of the stomach, duodenum and jejunum, along with fluid in the abdominal and pelvic cavity. Angiotensin-converting enzyme inhibitor (ACEI)-induced angioe

### Comparing against gold candidate set

#### TODO: DictionaryMatch accepts either list or dict; in latter case, assumes vals are the IDs!

#### ALSO: Add estimate_size method to CandidateExtraction operators

### Saving to disk

# TODO: Write tests!