# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll be writing an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). At its core, we will be constructing a model to classify _candidate_ CDR mentions as either true or false.

## Part III: Creating or Loading Evaluation Labels

In [1]:
%load_ext autoreload
%autoreload 2
import os

# Note: We run automated tests on this tutorial to make sure that it is always up to date! 
# However, certain interactive components cannot currently be tested automatically, and will 
# be skipped with if-then statements using the variable below
AUTOMATED_TESTING = os.environ.get('TESTING') is not None

from snorkel import SnorkelSession
session = SnorkelSession()



## Part III(a): Creating Evaluation Labels in the `Viewer`

We repeat our definition of the `ChemicalDisease` `Candidate` subclass from Part II.

In [2]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

## Loading the development `CandidateSet`

We will start by viewing the development `CandidateSet` we created in Part II in the `Viewer`.

First we reload the development `CandidateSet`.

In [3]:
from snorkel.models import CandidateSet

cs = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates').one()
cs

Candidate Set (CDR Development Candidates)

## Labeling the `CandidateSet` in the `Viewer`

We create a `Viewer` over the first 300 candidates so that we can annotate them manually.

In [None]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
if not AUTOMATED_TESTING:
    sv = SentenceNgramViewer(cs[:300], session, annotator_name="Tutorial Part III User")
else:
    sv = None

We now open the Viewer.

You can mark each `Candidate` as true or false. Remember that <span style="color:red">red</span> denotes the first argument (chemical) and <span style="color:blue">blue</span> denotes the second (disease). Try it!

These labels are automatically saved in the database backend, and can be accessed using the annotator's name (`'Tutorial Part III User'`) as the `AnnotationKey`.

In [None]:
sv


## Part III(b): Loading External Evaluation Labels

Loading in external annotations can be a bit messier, since these external annotations could be in any format.  Here, we'll provide an example of how to use the `ExternalAnnotationsLoader` helper class to make this a bit simpler.

**Note that most of the code below is custom code specific to this example's external annotation format.** We start, however, by creating the loader helper, which we use to create a `CandidateSet` (named "CDR Development Candidates -- Gold") and an `AnnotationKey` (named "CDR Development Labels -- Gold") for the annotations we load.

Note in particular that we need to define a new candidate set because _the external annotations we load might be over candidates not in our candidate set._
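To illustrate why a separate candidate set is needed, here is a minimal sketch in plain Python. The tuples are hypothetical stand-ins for candidates (the sentence IDs and character offsets are made up), and none of this is Snorkel API; it only shows that the gold annotations and our extracted candidates need not coincide.

```python
# Hypothetical stand-ins: each candidate is identified by
# (sentence_id, chemical_span, disease_span), with spans as (char_start, char_end).
extracted = {
    ("doc1:s0", (0, 8), (20, 31)),
    ("doc1:s1", (5, 12), (40, 52)),
}
gold = {
    ("doc1:s0", (0, 8), (20, 31)),   # also found by our candidate extractor
    ("doc1:s2", (3, 10), (15, 27)),  # annotated externally, but never extracted by us
}

# Gold annotations with no matching extracted candidate
missing = gold - extracted
# Gold annotations that line up with an extracted candidate
covered = gold & extracted

print(sorted(missing))
print(sorted(covered))
```

Storing the gold annotations in their own candidate set keeps entries like the one in `missing` from being silently dropped.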

In [4]:
from snorkel.loaders import ExternalAnnotationsLoader

dev_loader = ExternalAnnotationsLoader(session, ChemicalDisease, 
                                       'CDR Development Candidates -- Gold',
                                       'CDR Development Labels -- Gold')

Next, we use custom scripts to extract this particular type of annotation. **The details of these scripts are mostly left out, as they are particular to this example (see `tutorial/utils.py`).**

The key part is that we need to form a _dictionary of `TemporaryContexts`_ to pass into the loader:

In [5]:
from utils import get_docs_xml, get_CID_relations
from snorkel.models import Document, TemporarySpan
import os
ROOT = os.environ['SNORKELHOME'] + '/tutorial/data/'

def load_BioC_CDR_labels(loader, file_name):
    # Get all the annotated Pubtator documents as XML trees
    doc_xmls = get_docs_xml(ROOT + file_name)
    for doc_id, doc_xml in doc_xmls.items():
    
        # Get the corresponding Document object
        stable_id = "%s::document:0:0" % doc_id
        doc       = session.query(Document).filter(Document.stable_id == stable_id).one()
    
        # Use custom script to extract the annotations as (sentence, char_start, char_end, text) tuples
        for c, d in get_CID_relations(doc_xml, doc):
            sent, c_char_start, c_char_end, _ = c
            _, d_char_start, d_char_end, _    = d
        
            # Create a dictionary of TemporarySpans
            temp_contexts = {
                'chemical' : TemporarySpan(parent=sent, char_start=c_char_start, char_end=c_char_end),
                'disease'  : TemporarySpan(parent=sent, char_start=d_char_start, char_end=d_char_end)
            }
        
            # Add using the loader
            loader.add(temp_contexts)

load_BioC_CDR_labels(dev_loader, 'CDR_DevelopmentSet.BioC.xml')
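To make the shape of that dictionary concrete, here is a self-contained sketch that runs without Snorkel. `Span` is a hypothetical stand-in for `TemporarySpan`, the sentence and offsets are made up, and treating `char_end` as inclusive is an assumption of this sketch. The point is only that the dictionary is keyed by the role names (`'chemical'`, `'disease'`) used when defining `ChemicalDisease`.

```python
from collections import namedtuple

# Hypothetical stand-in for snorkel.models.TemporarySpan, for illustration only
Span = namedtuple("Span", ["parent", "char_start", "char_end"])

def to_temp_contexts(chemical, disease):
    """Turn two (sentence, char_start, char_end, text) tuples into a role-keyed dict."""
    sent, c_start, c_end, _ = chemical
    _, d_start, d_end, _ = disease
    return {
        "chemical": Span(parent=sent, char_start=c_start, char_end=c_end),
        "disease": Span(parent=sent, char_start=d_start, char_end=d_end),
    }

# Made-up example sentence and offsets
sentence = "Lithium induced tremor in two patients."
chem = (sentence, 0, 6, "Lithium")
dis = (sentence, 16, 21, "tremor")

contexts = to_temp_contexts(chem, dis)
for role in sorted(contexts):
    span = contexts[role]
    # char_end is treated as inclusive in this sketch
    print(role, span.parent[span.char_start:span.char_end + 1])
```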

We can check how many annotations were loaded for the development set:

In [6]:
from snorkel.models import Label

session.query(Label).filter(Label.key == dev_loader.annotation_key).count()

1879

Finally, since they're available, we'll also load the external annotations for the Train and Test sets:

In [7]:
# Training set
train_loader = ExternalAnnotationsLoader(session, ChemicalDisease, 
                                       'CDR Training Candidates -- Gold',
                                       'CDR Training Labels -- Gold')
load_BioC_CDR_labels(train_loader, 'CDR_TrainingSet.BioC.xml')
print(session.query(Label).filter(Label.key == train_loader.annotation_key).count())

# Test set
test_loader = ExternalAnnotationsLoader(session, ChemicalDisease, 
                                       'CDR Test Candidates -- Gold',
                                       'CDR Test Labels -- Gold')
load_BioC_CDR_labels(test_loader, 'CDR_TestSet.BioC.xml')
print(session.query(Label).filter(Label.key == test_loader.annotation_key).count())

1745
1857


### A Note on the BioCreative CDR Task Annotations

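The BioCreative CDR annotations mix two levels of granularity: the unary spans (chemical and disease mentions) are annotated at the mention level, whereas the relations are annotated at the entity level. This mismatch is potentially problematic, since our model classifies candidate relation *mentions*.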
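As one illustration of the gap between mention- and entity-level annotations, the sketch below projects an entity-level relation onto pairs of mentions. All data, identifiers, and the `project_to_mentions` helper are made up for this example, and this is not necessarily what `get_CID_relations` does. It shows how a single document-level relation can turn into several mention pairs, some of which may not actually express the relation in their sentence.

```python
from itertools import product

# Made-up mention-level annotations: (sentence_index, text, entity_id)
chemical_mentions = [(0, "lithium", "C1"), (2, "lithium carbonate", "C1")]
disease_mentions = [(0, "tremor", "D1"), (1, "tremors", "D1")]

# Made-up entity-level relation: a document-level assertion that C1 induces D1
entity_relations = {("C1", "D1")}

def project_to_mentions(chemicals, diseases, relations, same_sentence=True):
    """Collect (chemical text, disease text) pairs whose entity IDs are related."""
    pairs = []
    for c, d in product(chemicals, diseases):
        if (c[2], d[2]) not in relations:
            continue
        # Optionally keep only pairs that co-occur in one sentence
        if same_sentence and c[0] != d[0]:
            continue
        pairs.append((c[1], d[1]))
    return pairs

in_sentence = project_to_mentions(chemical_mentions, disease_mentions, entity_relations)
all_pairs = project_to_mentions(chemical_mentions, disease_mentions, entity_relations,
                                same_sentence=False)
print(in_sentence)
print(len(all_pairs))
```

Restricting to same-sentence pairs reduces the number of projected labels, but even those pairs are only assumed, not verified, to express the relation in that sentence.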