# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/).  At its core, we will construct a model to classify _candidate_ CDR mentions as either true or false.

## Part I: Preprocessing

In this notebook, we'll preprocess several documents using `Snorkel` utilities. We'll parse them into a simple hierarchy of component parts of our input data, which we refer to as _contexts_, and extract standard linguistic features from each context.

In this example, we will extract two types of contexts, represented as `Context` subclasses: `Document` objects and their constituent `Sentence` objects.  We'll do this using [CoreNLP](http://stanfordnlp.github.io/CoreNLP/), which also extracts a number of standard linguistic features that will be used downstream.
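
The context hierarchy described above can be pictured with a small sketch. The class and field names below are hypothetical stand-ins chosen for illustration, not Snorkel's actual class definitions, and the sketch is written for Python 3 rather than the Python 2 used in the notebook cells.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Context:
    """Hypothetical base class: any unit of input data."""
    name: str


@dataclass
class Sentence(Context):
    """Hypothetical sentence context carrying per-token features."""
    position: int = 0
    words: List[str] = field(default_factory=list)
    poses: List[str] = field(default_factory=list)


@dataclass
class Document(Context):
    """Hypothetical document context that owns its sentences."""
    sentences: List[Sentence] = field(default_factory=list)


# Build a one-sentence document by hand to show the parent/child structure.
doc = Document(name='example-doc')
doc.sentences.append(
    Sentence(name='example-doc:0', position=0,
             words=['Neuroactive', 'steroids'], poses=['JJ', 'NNS']))

print(doc.name, len(doc.sentences), doc.sentences[0].words)
```

The point is only the shape of the data: a document holds an ordered list of sentences, and each sentence holds parallel lists of token-level features.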

All of this preprocessed input data will be saved to a database.  In Snorkel, if no database is specified, a SQLite database is created by default, so no setup is needed here!

In [1]:
%load_ext autoreload
%autoreload 2
import os

# Note: We run automated tests on this tutorial to make sure that it is always up to date! 
# However, certain interactive components cannot currently be tested automatically, and will 
# be skipped with if-then statements using the variable below
AUTOMATED_TESTING = os.environ.get('TESTING') is not None

from snorkel import SnorkelSession
session = SnorkelSession()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Loading the Corpus

First, we will load and preprocess the corpus, storing it for convenience in a `Corpus` object.

### Configuring a document parser

We'll start by defining a `DocParser` class to read in PubMed abstracts from [PubTator](http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/index.cgi), where they are stored along with "gold" (i.e. hand-annotated by experts) *chemical* and *disease mention* annotations. We'll use the `XMLMultiDocParser` class, which allows us to use [XPath queries](https://en.wikipedia.org/wiki/XPath) to specify the relevant sections of the XML format.

_Note that we are newline-concatenating text from the title and abstract together for simplicity, but if we wanted to, we could easily extend the `DocParser` classes to preserve information about document structure._
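
To make the XPath selection and the newline-concatenation concrete, here is a minimal sketch that uses only Python's standard `xml.etree.ElementTree` module instead of Snorkel's parser. The XML snippet, the IDs, and the `iter_documents` helper are made up for illustration and only loosely mimic the structure of the CDR files. `ElementTree` supports a limited XPath subset without `text()`, so the sketch reads each element's `.text` attribute instead. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the XML in the CDR files.
SAMPLE_XML = """
<collection>
  <document>
    <id>0000001</id>
    <passage><text>A made-up title.</text></passage>
    <passage><text>A made-up abstract sentence.</text></passage>
  </document>
  <document>
    <id>0000002</id>
    <passage><text>Another made-up title.</text></passage>
    <passage><text>Another made-up abstract.</text></passage>
  </document>
</collection>
"""


def iter_documents(xml_string):
    """Yield (doc_id, text) pairs, joining passage texts with newlines."""
    root = ET.fromstring(xml_string)
    for doc in root.findall('.//document'):
        doc_id = doc.findtext('.//id')
        passages = [node.text for node in doc.findall('.//passage/text')]
        yield doc_id, '\n'.join(passages)


docs = list(iter_documents(SAMPLE_XML))
for doc_id, text in docs:
    print(doc_id, repr(text))
```

Each yielded pair corresponds to one `<document>` element, with the title and abstract passages flattened into a single string, which is the simplification the note above describes.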

In [2]:
from snorkel.parser import XMLMultiDocParser

xml_parser = XMLMultiDocParser(
    path='data/CDR_DevelopmentSet.xml',
    doc='.//document',
    text='.//passage/text/text()',
    id='.//id/text()',
    keep_xml_tree=True)

### Selecting a sentence parser

Next, we'll use an NLP preprocessing tool to split the `Document` objects into sentences and tokens, and to provide annotations--part-of-speech tags, dependency parse structure, lemmatized word forms, etc.--for these sentences.  Here we use the default `SentenceParser` class.

In [3]:
from snorkel.parser import SentenceParser

sent_parser = SentenceParser()

### Pre-processing & loading the corpus

Finally, we'll put this all together using a `CorpusParser` object, which will execute the parsers and store the results as a `Corpus`:

In [4]:
from snorkel.parser import CorpusParser

cp = CorpusParser(xml_parser, sent_parser)
%time corpus = cp.parse_corpus(name='CDR_corpus', session=session)

CPU times: user 7.07 s, sys: 299 ms, total: 7.36 s
Wall time: 30.8 s


In [5]:
doc = corpus.documents[0]
doc

Document 9121607

In [6]:
sent = doc.sentences[0]
print sent
print sent.words
print sent.poses

Sentence(Document 9121607, 0, u'Neuroactive steroids protect against pilocarpine- and kainic acid-induced limbic seizures and status epilepticus in mice.')
[u'Neuroactive', u'steroids', u'protect', u'against', u'pilocarpine', u'-', u'and', u'kainic', u'acid-induced', u'limbic', u'seizures', u'and', u'status', u'epilepticus', u'in', u'mice', u'.']
[u'JJ', u'NNS', u'VBP', u'IN', u'NN', u':', u'CC', u'JJ', u'JJ', u'JJ', u'NNS', u'CC', u'NN', u'NN', u'IN', u'NNS', u'.']
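
The two lists printed above are parallel: the tag at index `i` in `poses` belongs to the token at index `i` in `words`. As a small sketch of how such parallel features can be consumed, the snippet below copies the printed values into plain Python lists (so it does not need a Snorkel session) and pairs them up with `zip`. The variable names `tagged` and `nouns` are illustrative, and the code is written for Python 3.

```python
# Tokens and part-of-speech tags copied from the cell output above.
words = ['Neuroactive', 'steroids', 'protect', 'against', 'pilocarpine', '-',
         'and', 'kainic', 'acid-induced', 'limbic', 'seizures', 'and',
         'status', 'epilepticus', 'in', 'mice', '.']
poses = ['JJ', 'NNS', 'VBP', 'IN', 'NN', ':', 'CC', 'JJ', 'JJ', 'JJ', 'NNS',
         'CC', 'NN', 'NN', 'IN', 'NNS', '.']

# The lists must line up one-to-one for the pairing to be meaningful.
assert len(words) == len(poses)

# Pair each token with its tag.
tagged = list(zip(words, poses))

# Keep only tokens whose tag starts with 'NN' (the noun tags in this output).
nouns = [word for word, pos in tagged if pos.startswith('NN')]
print(nouns)
```

Filtering on tag prefixes like this is one simple way such linguistic features could be used downstream.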


### Saving the corpus
With the corpus parsed, we persist it in Snorkel's database backend:

In [7]:
session.add(corpus)
session.commit()