# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This tutorial shows off some of the more advanced features of Snorkel, so we'll assume you've followed the Intro tutorial.

### Task Description

The CDR task comprises three sets of 500 documents each, called training, development, and test. A document consists of the title and abstract of an article from [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/), an archive of biomedical and life sciences journal literature. The documents have been hand-annotated with:
* Mentions of chemicals and diseases along with their [MESH](https://meshb.nlm.nih.gov/#/fieldSearch) IDs, canonical IDs for medical entities. For example, mentions of "warfarin" in two different documents will have the same ID.
* Chemical-disease relations at the document-level. That is, if some piece of text in the document implies that a chemical with MESH ID `X` induces a disease with MESH ID `Y`, the document will be annotated with `Relation(X, Y)`.

The goal is to extract the document-level relations on the test set (without accessing the entity or relation annotations). For this tutorial, we make the following assumptions and alterations to the task:
* We discard all of the entity mention annotations and assume we have access to a state-of-the-art entity tagger (see Part I) to identify chemical and disease mentions, and link them to their canonical IDs.
* We shuffle the training and development sets a bit, producing a new training set with 900 documents and a new development set with 100 documents. We discard the training set relation annotations, but keep the development set to evaluate our labeling functions and extraction model.
* We evaluate the task at the mention level, rather than the document level. We convert the document-level relation annotations to mention-level labels by simply saying that a mention pair `(X, Y)` in document `D` is a true relation if `Relation(X, Y)` was hand-annotated at the document level for `D`.
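
The document-to-mention conversion in the last bullet can be sketched in a few lines. This is a minimal illustration and not the tutorial's own loading code: the document names, the placeholder entity IDs (`CHEM_1`, `DIS_1`, ...), and the helper `mention_label` are all hypothetical.

```python
# Hypothetical document-level gold annotations:
# document name -> set of (chemical ID, disease ID) relations.
doc_relations = {
    "doc1": {("CHEM_1", "DIS_1")},
}

# Hypothetical mention pairs produced by an entity tagger:
# (document name, chemical ID, disease ID).
mention_pairs = [
    ("doc1", "CHEM_1", "DIS_1"),  # annotated for doc1 -> true
    ("doc1", "CHEM_1", "DIS_2"),  # not annotated for doc1 -> false
    ("doc2", "CHEM_1", "DIS_1"),  # doc2 has no annotations -> false
]

def mention_label(doc_name, chem_id, dis_id, doc_relations):
    """Return True if the pair was annotated at the document level for this document."""
    return (chem_id, dis_id) in doc_relations.get(doc_name, set())

labels = [mention_label(d, c, s, doc_relations) for d, c, s in mention_pairs]
print(labels)
```

Note that the same `(X, Y)` pair can be true in one document and false in another, because the lookup is keyed on the document.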

In effect, the only inputs to this application are the plain text of the documents, a pre-trained entity tagger, and a small development set of annotated documents. This is representative of many information extraction tasks, and Snorkel is well suited to bootstrapping the extraction process with weak supervision. Let's get going.

# Part I: Corpus Preprocessing

### Before starting, make sure to run the `download_data.sh` script!

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
os.environ['SNORKELDB'] = 'postgres:///cdr-structure-learning-2'

from snorkel import SnorkelSession
session = SnorkelSession()

### Configuring a `DocPreprocessor`

We'll start by defining a `DocPreprocessor` object to read in PubMed abstracts from [PubTator](http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/index.cgi). There is some extra annotation information in the file, which we'll skip for now. We'll use the `XMLMultiDocPreprocessor` class, which allows us to use [XPath queries](https://en.wikipedia.org/wiki/XPath) to specify the relevant sections of the XML format.

Note that we are newline-concatenating text from the title and abstract together for simplicity, but if we wanted to, we could easily extend the `DocPreprocessor` classes to preserve information about document structure.
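
To make the XPath selection and the newline-concatenation concrete, here is a rough stand-in built on the standard library's `xml.etree.ElementTree`. It is not how `XMLMultiDocPreprocessor` is implemented; the miniature XML string and the helper `iter_docs` are hypothetical, and since `ElementTree` does not support the `text()` step, the sketch selects the elements and reads their `.text` attribute instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature input mimicking the layout implied by the XPath queries.
xml_text = """<collection>
  <document>
    <id>0001</id>
    <passage><text>A title.</text></passage>
    <passage><text>An abstract sentence.</text></passage>
  </document>
</collection>"""

def iter_docs(xml_string):
    """Yield (doc_id, text) pairs, joining all passage texts with newlines."""
    root = ET.fromstring(xml_string)
    for doc in root.iterfind('.//document'):
        doc_id = doc.findtext('.//id')
        passages = [t.text for t in doc.iterfind('.//passage/text') if t.text]
        yield doc_id, '\n'.join(passages)

parsed = list(iter_docs(xml_text))
print(parsed)
```

Preserving document structure would amount to yielding the title and abstract passages separately rather than joining them.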

In [None]:
import os
from snorkel.parser import XMLMultiDocPreprocessor

# The following line is for testing only. Feel free to ignore it.
file_path = 'data/CDR.BioC.small.xml' if 'CI' in os.environ else 'data/CDR.BioC.xml'

doc_preprocessor = XMLMultiDocPreprocessor(
    path=file_path,
    doc='.//document',
    text='.//passage/text/text()',
    id='.//id/text()'
)

### Creating a `CorpusParser`

Similar to the Intro tutorial, we'll now construct a `CorpusParser` using the preprocessor we just defined. However, this one has an extra ingredient: an entity tagger. [TaggerOne](https://www.ncbi.nlm.nih.gov/pubmed/27283952) is a popular entity tagger for PubMed, so we went ahead and preprocessed its tags on the CDR corpus for you. The function `TaggerOneTagger.tag` (in `utils.py`) tags sentences with mentions of chemicals and diseases. We'll use these tags to extract candidates in Part II. The tags are stored in `Sentence.entity_cids` and `Sentence.entity_types`, which are analogous to `Sentence.words`.

Recall that in the wild, we wouldn't have the manual labels included with the CDR data, and we'd have to use an automated tagger (like TaggerOne) to tag entity mentions. That's what we're doing here.
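
As an illustration of what "analogous to `Sentence.words`" means, the sketch below treats the three attributes as parallel, token-aligned lists and groups adjacent tokens that share a type and ID into mentions. The sentence, the placeholder IDs, the use of `"O"` as the tag for untagged tokens, and the helper `entity_spans` are all assumptions made for this example, not taken from `utils.py`.

```python
# Hypothetical token-aligned lists, one entry per word.
words        = ["Warfarin", "can", "induce", "artery", "calcification", "."]
entity_types = ["Chemical", "O",   "O",      "Disease", "Disease",      "O"]
entity_cids  = ["CHEM_1",   "O",   "O",      "DIS_1",   "DIS_1",        "O"]

def entity_spans(words, entity_types, entity_cids, null_tag="O"):
    """Group adjacent tokens with the same (type, ID) into (type, ID, text) mentions."""
    spans = []
    start = None
    for i in range(len(words) + 1):
        cur = (entity_types[i], entity_cids[i]) if i < len(words) else None
        prev = (entity_types[start], entity_cids[start]) if start is not None else None
        # Close the open span when the tag changes or the sentence ends.
        if start is not None and cur != prev:
            spans.append((prev[0], prev[1], " ".join(words[start:i])))
            start = None
        # Open a new span on any tagged token.
        if start is None and cur is not None and cur[0] != null_tag:
            start = i
    return spans

spans = entity_spans(words, entity_types, entity_cids)
print(spans)
```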

In [None]:
from snorkel.parser import CorpusParser
from utils import TaggerOneTagger

tagger_one = TaggerOneTagger()
corpus_parser = CorpusParser(fn=tagger_one.tag)
corpus_parser.apply(list(doc_preprocessor), clear=False)

In [2]:
from snorkel.models import Document, Sentence

print "Documents:", session.query(Document).count()
print "Sentences:", session.query(Sentence).count()

Documents: 143381
Sentences: 1543259


# Integrating data from `snorkel-biocorpus-new`

First, run the script `transfer_tables.sh`.

Then follow the steps below. This is admittedly not the cleanest or most efficient way to do this, but it works.

### Step 1: Create a mapping of old ids -> new ids in memory

This will allow us to (a) use `bulk_save_objects` and (b) insert documents and sentences separately.

In [None]:
res = session.execute("""SELECT id FROM context_filtered_kw""")
old_ids = [r[0] for r in res.fetchall()]
print len(old_ids)

In [None]:
old_to_new_ids = {}
for i, old_id in enumerate(old_ids):
    old_to_new_ids[old_id] = 15502 + i

### Step 2: Insert `Documents`

In [None]:
%%time
res = session.execute("""
    SELECT d.id, d.name, c.stable_id, d.meta
    FROM document_filtered_kw d, context_filtered_kw c
    WHERE d.id = c.id""")

docs = [Document(id=old_to_new_ids[r.id], name=r.name, stable_id=r.stable_id, meta=r.meta) for r in res.fetchall()]
print "Fetched %s docs..." % len(docs)

session.bulk_save_objects(docs)
session.commit()

### Step 3: Insert `Sentences`

In [None]:
%%time
res = session.execute("""
    SELECT s.*, c.stable_id
    FROM sentence_filtered_kw s, context_filtered_kw c
    WHERE s.id = c.id""")

rows = res.fetchall()
print len(rows)

In [None]:
sents = []
for r in rows:
    sent = Sentence(
                id=old_to_new_ids[r.id],
                document_id=old_to_new_ids[r.document_id],
                position=r.position,
                text=r.text,
                words=r.words,
                char_offsets=r.char_offsets,
                lemmas=r.lemmas,
                pos_tags=r.pos_tags,
                ner_tags=r.ner_tags,
                dep_parents=r.dep_parents,
                dep_labels=r.dep_labels,
                entity_cids=r.entity_cids,
                entity_types=r.entity_types,
                stable_id=r.stable_id)
    sents.append(sent)

In [None]:
session.bulk_save_objects(sents)
session.commit()

In [None]:
print "Documents:", session.query(Document).count()
print "Sentences:", session.query(Sentence).count()

# Part II: Candidate Extraction

With the TaggerOne entity tags, candidate extraction is pretty easy! We first split the sentences into preset training, development, and test sets. Then we'll use `PretaggedCandidateExtractor` to extract candidates using the TaggerOne entity tags.
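
Conceptually, extracting candidates from pre-tagged sentences means pairing every chemical mention with every disease mention in the same sentence. The sketch below shows that idea only; it is not the `PretaggedCandidateExtractor` implementation, and the mention tuples and the helper `candidate_pairs` are hypothetical.

```python
from itertools import product

# Hypothetical mentions found in one sentence: (entity type, entity ID, text).
sentence_mentions = [
    ("Chemical", "CHEM_1", "warfarin"),
    ("Disease", "DIS_1", "artery calcification"),
    ("Disease", "DIS_2", "hemorrhage"),
]

def candidate_pairs(mentions, types=("Chemical", "Disease")):
    """Return every (first-type mention, second-type mention) pair in the sentence."""
    first = [m for m in mentions if m[0] == types[0]]
    second = [m for m in mentions if m[0] == types[1]]
    return list(product(first, second))

pairs = candidate_pairs(sentence_mentions)
for chem, dis in pairs:
    print("%s -> %s" % (chem[2], dis[2]))
```

Because pairing happens within a sentence, relations expressed across sentences never become candidates.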

In [None]:
import cPickle

with open('data/doc_ids.pkl', 'rb') as f:
    train_ids, dev_ids, test_ids = cPickle.load(f)
train_ids, dev_ids, test_ids = set(train_ids), set(dev_ids), set(test_ids)

train_sents, dev_sents, test_sents = set(), set(), set()
docs = session.query(Document).order_by(Document.name).all()
for i, doc in enumerate(docs):
    for s in doc.sentences:
        if doc.name in train_ids:
            train_sents.add(s)
        elif doc.name in dev_ids:
            dev_sents.add(s)
        elif doc.name in test_ids:
            test_sents.add(s)
        
        # New docs we added from the scale-up set (their ids start at 15502)
        elif doc.id >= 15502:
            train_sents.add(s)
        else:
            raise Exception('ID <{0}> not found in any id set'.format(doc.name))

In [3]:
from snorkel.models import Candidate, candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

In [None]:
from snorkel.candidates import PretaggedCandidateExtractor

candidate_extractor = PretaggedCandidateExtractor(ChemicalDisease, ['Chemical', 'Disease'])

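On the original CDR splits, we should get 8268 candidates in the training set, 888 candidates in the development set, and 4620 candidates in the test set. Because the training sentences here also include the documents added above and are then subsampled, the training count will differ from 8268.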

In [None]:
print len(train_sents)
print len(dev_sents)
print len(test_sents)

### Subsample the training sentences

In [None]:
import numpy as np
train_sents = list(train_sents)
np.random.shuffle(train_sents)
train_sents_ss = train_sents[:int(0.1*len(train_sents))]
print len(train_sents_ss)

### Update `context.id` increment value

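We inserted rows with explicit ids above, so the `context_id_seq` sequence still lags behind the largest `context.id`. Unless we reset it, inserting new contexts raises `IntegrityError`s.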

In [None]:
res = session.execute("SELECT MAX(id) FROM context;")
max_cid = res.fetchall()[0][0]
max_cid

In [None]:
session.execute("ALTER SEQUENCE context_id_seq RESTART WITH %s;" % (max_cid+1,))

In [None]:
for k, sents in enumerate([train_sents_ss, dev_sents, test_sents]):
    candidate_extractor.apply(sents, split=k, check_for_existing=False)
    print "Number of candidates:", session.query(ChemicalDisease).filter(ChemicalDisease.split == k).count()

### Candidate Recall
We will briefly discuss the issue of candidate recall. The end-recall of the extraction is effectively upper-bounded by our candidate set: any chemical-disease pair that is present in a document but not identified as a candidate cannot be extracted by our end extraction model. Below are some example reasons for missing a candidate<sup>1</sup>.
* The tagger is imperfect, and may miss a chemical or disease mention.
* The tagger is imperfect, and may attach an incorrect entity ID to a correctly identified chemical or disease mention. For example, "stomach pain" might get attached to the entity ID for "digestive tract infection" rather than "stomach illness".
* A relation occurs across multiple sentences. For example, "**Artery calcification** is more prominent in older populations. It can be induced by **warfarin**."

If we just look at the set of extractions at the end of this tutorial, we won't be able to account for some false negatives that we missed at the candidate extraction stage. For simplicity, we ignore candidate recall in this tutorial and evaluate our extraction model just on the set of extractions made by the end model. However, when you're developing information extraction applications in the future, it's important to keep candidate recall in mind.

<sup>1</sup>Note that these specific issues can be combatted with advanced techniques like noun-phrase chunking to expand the entity mention set, or coreference parsing for cross-sentence candidates. We don't employ these here in order to focus on weak supervision.
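
Although we ignore candidate recall in this tutorial, it is easy to estimate when gold relations are available. The following sketch computes it as the fraction of gold relations that appear in the candidate set. The tuples, placeholder IDs, and the helper `candidate_recall` are hypothetical and do not use Snorkel's data model.

```python
# Hypothetical gold relations and extracted candidates,
# each keyed as (document name, chemical ID, disease ID).
gold = {
    ("doc1", "CHEM_1", "DIS_1"),
    ("doc1", "CHEM_2", "DIS_3"),
    ("doc2", "CHEM_1", "DIS_2"),
}
candidates = {
    ("doc1", "CHEM_1", "DIS_1"),
    ("doc2", "CHEM_1", "DIS_2"),
    ("doc2", "CHEM_4", "DIS_2"),
}

def candidate_recall(gold, candidates):
    """Fraction of gold relations covered by the candidate set."""
    if not gold:
        return 1.0
    return len(gold & candidates) / float(len(gold))

recall = candidate_recall(gold, candidates)
missed = gold - candidates  # relations no downstream model can recover
print("Candidate recall: %.3f" % recall)
print("Missed:", sorted(missed))
```

Inspecting `missed` is usually the quickest way to tell which of the failure modes listed above is responsible.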

In [4]:
from snorkel.models import Candidate
print "Training candidates:", session.query(ChemicalDisease).filter(Candidate.split == 0).count()
print "Dev candidates:", session.query(ChemicalDisease).filter(Candidate.split == 1).count()
print "Test candidates:", session.query(ChemicalDisease).filter(Candidate.split == 2).count()

Training candidates: 34283
Dev candidates: 888
Test candidates: 4620


In [7]:
session.query(Candidate).count()

70828L

In [8]:
session.query(ChemicalDisease).count()

39791L

# Featurization

In [6]:
session.query(ChemicalDisease).filter(ChemicalDisease.id == 36331).first()

In [5]:
from snorkel.models import Candidate

c = session.query(Candidate).filter(Candidate.id == 36331).first()
c

ObjectDeletedError: Instance '<ChemicalDisease at 0x7f647e8f5450>' has been deleted, or its row is otherwise not present.

In [None]:
session.rollback()

In [13]:
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator()

In [None]:
%time F_train = featurizer.apply(split=0, cand_class=ChemicalDisease, parallelism=20)
F_train

Clearing existing...
Running UDF...


In [None]:
%%time
F_dev  = featurizer.apply_existing(split=1)
F_test = featurizer.apply_existing(split=2)