# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll be writing an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This tutorial will show off some of the more advanced features of Snorkel, so we'll assume you've followed the Intro tutorial.

### Task Description

The CDR task consists of three sets of 500 documents each, called training, development, and test. A document consists of the title and abstract of an article from [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/), an archive of biomedical and life sciences journal literature. The documents have been hand-annotated with:
* Mentions of chemicals and diseases along with their [MeSH](https://meshb.nlm.nih.gov/#/fieldSearch) IDs, which are canonical IDs for medical entities. For example, mentions of "warfarin" in two different documents will have the same ID.
* Chemical-disease relations at the document-level. That is, if some piece of text in the document implies that a chemical with MeSH ID `X` induces a disease with MeSH ID `Y`, the document will be annotated with `Relation(X, Y)`.

The goal is to extract the document-level relations on the test set (without accessing the entity or relation annotations). For this tutorial, we make the following assumptions and alterations to the task:
* We discard all of the entity mention annotations and assume we have access to a state-of-the-art entity tagger (see Part I) to identify chemical and disease mentions, and link them to their canonical IDs.
* We shuffle the training and development sets a bit, producing a new training set with 900 documents and a new development set with 100 documents. We discard the training set relation annotations, but keep the development set to evaluate our labeling functions and extraction model.
* We evaluate the task at the mention-level, rather than the document-level. We convert the document-level relation annotations to mention-level by simply saying that a mention pair `(X, Y)` in document `D` is a true relation if `Relation(X, Y)` was hand-annotated at the document-level for `D`.
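
The mention-level conversion in the last bullet can be sketched in a few lines. This is only an illustration: the function name, the tuple layout, and the IDs `D`, `X`, `Y`, `Z` are placeholders, not part of Snorkel or the CDR data.

```python
# Hypothetical document-level gold relations, stored as
# (document ID, chemical ID, disease ID) tuples. The IDs are placeholders.
gold_relations = {
    ("D", "X", "Y"),
}

def label_mention_pair(doc_id, chemical_cid, disease_cid, gold):
    """Return True if the mention pair's canonical IDs match a relation
    that was annotated at the document level for the same document."""
    return (doc_id, chemical_cid, disease_cid) in gold

# A pair whose IDs match a document-level relation is labeled true...
matching = label_mention_pair("D", "X", "Y", gold_relations)
# ...while a pair with a different disease ID in the same document is not.
non_matching = label_mention_pair("D", "X", "Z", gold_relations)
```

Every mention pair with matching IDs inherits the document-level label, even in sentences that do not themselves state the relation, so the converted labels are somewhat noisy.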

In effect, the only inputs to this application are the plain text of the documents, a pre-trained entity tagger, and a small development set of annotated documents. This is representative of many information extraction tasks, and Snorkel is well suited to bootstrapping the extraction process with weak supervision. Let's get going.

# Part 0: Initial Prep

In your shell, download the raw data by running:
```bash
cd tutorials/cdr
./download_data.sh
```

If you've previously run this tutorial using SQLite, you can delete the old database by running the following in the same directory as above:
```bash
rm snorkel.db
```

# Part I: Corpus Preprocessing

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
#os.environ['SNORKELDB'] = "postgres://localhost:4554/biocorpus"
os.environ['SNORKELDB'] = "postgres:///biocorpus"

from snorkel import SnorkelSession
session = SnorkelSession()

### Configuring a `DocPreprocessor`

We'll start by defining a `DocPreprocessor` object to read in PubMed abstracts from [PubTator](http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/index.cgi). There is some extra annotation information in the file, which we'll skip for now. We'll use the `XMLMultiDocPreprocessor` class, which allows us to use [XPath queries](https://en.wikipedia.org/wiki/XPath) to specify the relevant sections of the XML format.

Note that we are newline-concatenating text from the title and abstract together for simplicity, but if we wanted to, we could easily extend the `DocPreprocessor` classes to preserve information about document structure.
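
To make the XPath selection and the newline concatenation concrete, here is a minimal standard-library sketch. It is not Snorkel's implementation: the sample XML is made up to resemble the BioC layout, and `iter_docs` is a hypothetical helper. Python's `xml.etree.ElementTree` supports only a subset of XPath and has no `text()` step, so the sketch selects the `text` elements and reads their `.text` attribute instead.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal BioC-style snippet (not taken from the CDR files).
SAMPLE = """
<collection>
  <document>
    <id>12345</id>
    <passage><infon key="type">title</infon><text>A title.</text></passage>
    <passage><infon key="type">abstract</infon><text>An abstract.</text></passage>
  </document>
</collection>
"""

def iter_docs(xml_string):
    """Yield (doc_id, text) pairs, joining all passage texts with newlines."""
    root = ET.fromstring(xml_string.strip())
    for doc in root.iterfind(".//document"):
        doc_id = doc.findtext("id")
        passages = [node.text or "" for node in doc.iterfind(".//passage/text")]
        yield doc_id, "\n".join(passages)

docs = list(iter_docs(SAMPLE))
```

Preserving document structure would amount to keeping the `passages` list (or the `infon` type of each passage) instead of joining it into one string.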

In [4]:
import os
from snorkel.models import SequenceTag, Document
from snorkel.parser import XMLMultiDocPreprocessor

doc_preprocessor = XMLMultiDocPreprocessor(
    path="",
    doc='.//document',
    text='.//passage/text/text()',
    id='.//id/text()'
)

# create standard CDR folds
folds = {}
fold_defs = {0:'CDR_TrainingSet.BioC.xml', 1:'CDR_DevelopmentSet.BioC.xml', 2:'CDR_TestSet.BioC.xml'}

for split,file_name in fold_defs.items():
    folds[split] = [doc.name for doc,text in doc_preprocessor.parse_file("data/" + file_name, "")]

### Creating a `CorpusParser`

Similar to the Intro tutorial, we'll now construct a `CorpusParser` using the preprocessor we just defined. However, this one has an extra ingredient: an entity tagger. [TaggerOne](https://www.ncbi.nlm.nih.gov/pubmed/27283952) is a popular entity tagger for PubMed, so we went ahead and preprocessed its tags on the CDR corpus for you. The function `TaggerOneTagger.tag` (in `utils.py`) tags sentences with mentions of chemicals and diseases. We'll use these tags to extract candidates in Part II. The tags are stored in `Sentence.entity_cids` and `Sentence.entity_types`, which are analogous to `Sentence.words`.

Recall that in the wild, we wouldn't have the manual labels included with the CDR data, and we'd have to use an automated tagger (like TaggerOne) to tag entity mentions. That's what we're doing here.
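
As a rough illustration of what "analogous to `Sentence.words`" means, the sketch below builds two token-aligned lists from character-offset tags. The helper `align_tags`, the end-exclusive offsets, and the `"O"` placeholder for untagged tokens are assumptions made for this example, not a description of `TaggerOneTagger.tag`.

```python
def align_tags(words, char_offsets, tags, default="O"):
    """Build per-token entity type and canonical ID lists.

    words        -- list of tokens in one sentence
    char_offsets -- start character offset of each token
    tags         -- (start, end, entity_type, cid) tuples, end exclusive
    """
    entity_types = [default] * len(words)
    entity_cids = [default] * len(words)
    for i, (word, start) in enumerate(zip(words, char_offsets)):
        end = start + len(word)
        for t_start, t_end, etype, cid in tags:
            # A token is tagged when it lies fully inside the tagged span.
            if start >= t_start and end <= t_end:
                entity_types[i] = etype
                entity_cids[i] = cid
                break
    return entity_types, entity_cids

# Tokens of the sentence "Warfarin caused bleeding." with placeholder IDs.
words = ["Warfarin", "caused", "bleeding", "."]
char_offsets = [0, 9, 16, 24]
tags = [(0, 8, "Chemical", "X"), (16, 24, "Disease", "Y")]

entity_types, entity_cids = align_tags(words, char_offsets, tags)
```

Both output lists have one entry per token, so index `i` in `words`, `entity_types`, and `entity_cids` always refers to the same token.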

In [5]:
from snorkel.models import Sentence
from snorkel.parser import Spacy

parser = Spacy()

for split in folds:
    # Check if documents already exist in database.
    # If not, manually parse and add to database
    documents = session.query(Document).filter(Document.name.in_(folds[split])).all()
    missing = set(folds[split]).difference([doc.name for doc in documents])
    if missing:
        print("Parsing {} documents...".format(len(missing)))
        documents = [(doc, text) for doc, text in doc_preprocessor.parse_file("data/" + fold_defs[split], "")
                     if doc.name in missing]
        for doc, text in documents:
            # Each parsed sentence is a dict of Sentence column values
            for parts in parser.parse(doc, text):
                session.add(Sentence(**parts))

session.commit()

### Load TaggerOne Labels

In [6]:
name2id = dict(session.query(Document.name, Document.id).all())

# Load TaggerOne tags: one tab-separated row per tagged mention
seq_tags = []
with open("data/taggerone.tags.tsv", "r") as f:
    taggerone = [line.split("\t") for line in f.read().splitlines()]
for row in taggerone:
    pmid, start, end, concept_type, concept_uid, source = row
    # Character offsets are read as strings, so cast them to integers
    seq_tags.append(SequenceTag(document_id=name2id[pmid], abs_char_start=int(start), abs_char_end=int(end),
                                concept_type=concept_type, concept_uid=concept_uid, source=source))

session.bulk_save_objects(seq_tags)
session.commit()
print(len(seq_tags))

27807
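
The tag file is read as six tab-separated fields per row, in the order unpacked in the loop above. A small standalone sketch of parsing and validating one such row follows. The `TagRow` tuple, the `parse_tag_line` helper, and the sample line are hypothetical and are not part of Snorkel or the data files.

```python
from collections import namedtuple

# Field order follows the unpacking in the loading cell.
TagRow = namedtuple("TagRow", "pmid start end concept_type concept_uid source")

def parse_tag_line(line):
    """Parse one tab-separated tag row, casting the offsets to integers."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got {}".format(len(fields)))
    pmid, start, end, concept_type, concept_uid, source = fields
    return TagRow(pmid, int(start), int(end), concept_type, concept_uid, source)

# A made-up row with placeholder values.
row = parse_tag_line("12345\t0\t8\tChemical\tX\tTaggerOne\n")
```

Failing early on malformed rows makes it easier to spot a truncated or mis-delimited file before anything is written to the database.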


In [7]:
from snorkel.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

('Documents:', 2151L)
('Sentences:', 20552L)


# Part II: Candidate Extraction

With the TaggerOne entity tags, candidate extraction is pretty easy! We first bin sentences into the preset training, development, and test folds. Then we'll use `SequenceTagCandidateExtractor` (from `custom_cand_generator`) to extract candidates using the TaggerOne entity tags.

## CDR Train/Dev/Test Folds

In [13]:
import itertools
from custom_cand_generator import SequenceTagCandidateExtractor

# bin sentences by CDR fold
sentences = {}
for split in folds:
    documents = session.query(Document).filter(Document.name.in_(folds[split])).all() 
    sentences[split] = list(itertools.chain.from_iterable([doc.sentences for doc in documents]))

In [14]:
from snorkel.models import Candidate, candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

In [15]:
candidate_extractor1 = SequenceTagCandidateExtractor(
    ChemicalDisease, ['Disease', 'Chemical'], tag_sources=['TaggerOne']
)

for split, sents in sentences.items():
    candidate_extractor1.apply(sents, clear=True, split=split)
    print("Number of candidates:", session.query(ChemicalDisease).filter(ChemicalDisease.split == split).count())

Clearing existing...
Running UDF...

('Number of candidates:', 4465L)
Clearing existing...
Running UDF...

('Number of candidates:', 4920L)
Clearing existing...
Running UDF...

('Number of candidates:', 4727L)


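The tutorial's reference counts are 8268 candidates in the training set, 888 candidates in the development set, and 4620 candidates in the test set. The counts printed above (4465, 4920, and 4727) do not match them, most likely because this run uses the standard CDR folds rather than the shuffled 900/100 split described in the task description.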
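
Conceptually, a candidate is a pairing of one chemical mention with one disease mention in the same sentence. The sketch below shows that pairing on token-level entity types. It is a simplified stand-in, not the logic of `SequenceTagCandidateExtractor`: the helper names are hypothetical, and adjacent tokens with the same type are assumed to form a single mention.

```python
from itertools import groupby, product

def mention_spans(entity_types, wanted):
    """Return (start, end) token index spans, end inclusive, for each run
    of consecutive tokens whose entity type equals `wanted`."""
    spans = []
    idx = 0
    for etype, group in groupby(entity_types):
        n = len(list(group))
        if etype == wanted:
            spans.append((idx, idx + n - 1))
        idx += n
    return spans

def candidate_pairs(entity_types):
    """Pair every chemical mention with every disease mention in a sentence."""
    chemicals = mention_spans(entity_types, "Chemical")
    diseases = mention_spans(entity_types, "Disease")
    return list(product(chemicals, diseases))

# Two chemical mentions and one two-token disease mention.
types = ["Chemical", "O", "Disease", "Disease", "O", "Chemical"]
pairs = candidate_pairs(types)
```

Because the pairing is a cross product, a sentence with `c` chemical mentions and `d` disease mentions yields `c * d` candidates, which is why candidate counts are much larger than document counts.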

## PubTator Random/Query Folds
We generate two sets of roughly 100K documents each via reservoir sampling (the PMID lists loaded below contain 99,986 and 99,997 entries). Random is sampled uniformly from all documents; query is sampled from the set of documents matching any NER term found in the CDR training data.
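
For readers unfamiliar with reservoir sampling, here is a minimal sketch of the classic single-pass version (often called Algorithm R). It only illustrates the technique and is not the script that produced the PMID lists; the function name and the toy stream are made up.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Draw k items uniformly at random from an iterable of unknown length,
    using a single pass and O(k) memory."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by drawing a
            # slot in [0, i] and replacing only if it falls in the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Toy stream of placeholder document IDs, with a fixed seed for repeatability.
sample = reservoir_sample(("pmid{}".format(n) for n in range(1000)), 10, random.Random(0))
```

The single pass matters here because the full set of PubMed documents is too large to load and shuffle in memory.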

In [17]:
unlabeled = {
    # splitlines() already handles all newline styles, so plain "r" mode suffices
    'random': open("data/random.pmids.txt", "r").read().splitlines(),
    'query': open("data/query.pmids.txt", "r").read().splitlines()
}

random_docs = session.query(Document).filter(Document.name.in_(unlabeled['random'])).all()
query_docs  = session.query(Document).filter(Document.name.in_(unlabeled['query'])).all()

print("Found {}/{} random sampled documents".format(len(random_docs), len(unlabeled['random'])))
print("Found {}/{} query sampled documents".format(len(query_docs), len(unlabeled['query'])))

random_sents = list(itertools.chain.from_iterable([doc.sentences for doc in random_docs]))
query_sents = list(itertools.chain.from_iterable([doc.sentences for doc in query_docs]))

Found 0/99986 random sampled documents
Found 0/99997 query sampled documents


In [None]:
candidate_extractor2 = SequenceTagCandidateExtractor(
    ChemicalDisease, ['Disease', 'Chemical'], tag_sources=['PubTator']
)

for split,sents in [(3,random_sents), (4,query_sents)]:
    candidate_extractor2.apply(sents, clear=True, split=split)
    print("Number of candidates:", session.query(ChemicalDisease).filter(ChemicalDisease.split == split).count())