# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). At its core, the task is to construct a model that classifies _candidate_ CDR mentions as either true or false.

## Part III: Creating or Loading Evaluation Labels

In [1]:
%load_ext autoreload
%autoreload 2
import os

# Note: We run automated tests on this tutorial to make sure that it is always up to date! 
# However, certain interactive components cannot currently be tested automatically, and will 
# be skipped with if-then statements using the variable below
AUTOMATED_TESTING = os.environ.get('TESTING') is not None

from snorkel import SnorkelSession
session = SnorkelSession()



## Part III(a): Creating Evaluation Labels in the `Viewer`

We repeat our definition of the `ChemicalDisease` `Candidate` subclass from Part II.

In [2]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

## Loading the development `CandidateSet`

We will start by opening the development `CandidateSet` that we created in Part II in the `Viewer`.

First, we reload the development `CandidateSet` from the database.

In [3]:
from snorkel.models import CandidateSet

cs = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates').one()
cs

Candidate Set (CDR Development Candidates)

## Labeling the `CandidateSet` in the `Viewer`

Next, we create a `Viewer` over the first 300 candidates so that we can annotate them manually.

In [4]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
if not AUTOMATED_TESTING:
    sv = SentenceNgramViewer(cs[:300], session, annotator_name="Tutorial Part III User")
else:
    sv = None


We now open the `Viewer`.

You can mark each `Candidate` as true or false. Remember that <span style="color:red">red</span> denotes the first argument (chemical) and <span style="color:blue">blue</span> denotes the second (disease). Try it!

These labels are saved automatically in the database backend and can be accessed using the annotator's name (`'Tutorial Part III User'`) as the `AnnotationKey`.
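
To make the idea of labels keyed by annotator name concrete, here is a minimal in-memory sketch. It is an illustration only and does not use the Snorkel API: `LabelStore`, `set_label`, `labels_for`, and the integer candidate IDs are all hypothetical names invented for this example.

```python
from collections import defaultdict


class LabelStore(object):
    """Hypothetical stand-in for a label table; NOT the Snorkel API."""

    def __init__(self):
        # Maps annotator name -> {candidate id -> True/False label}
        self._labels = defaultdict(dict)

    def set_label(self, annotator_name, candidate_id, is_true):
        # Re-labeling the same candidate overwrites the earlier label
        self._labels[annotator_name][candidate_id] = bool(is_true)

    def labels_for(self, annotator_name):
        # Return a copy so callers cannot mutate the store by accident
        return dict(self._labels.get(annotator_name, {}))


store = LabelStore()
store.set_label("Tutorial Part III User", 1, True)
store.set_label("Tutorial Part III User", 2, False)
store.set_label("Another Annotator", 1, False)

# Look up only the labels created under one annotator's name
user_labels = store.labels_for("Tutorial Part III User")
print(sorted(user_labels.items()))
```

Keying by annotator name keeps labels from different annotators (or from different labeling sessions under different names) separate, even when they cover the same candidates.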

In [5]:
sv

## Loading CDR Labels

Resume here.

## Part III(b): Loading External Evaluation Labels

In [14]:
import os

from snorkel.models import Span, Document
from utils import get_docs_xml, get_CID_relations

ROOT = os.environ['SNORKELHOME']

# Get all the annotated PubTator documents as XML trees
doc_xmls = get_docs_xml(os.path.join(ROOT, 'tutorial', 'data', 'CDR_DevelopmentSet.BioC.xml'))

# Create a new CandidateSet
gold_dev = CandidateSet(name='Gold Development Set 3')
session.add(gold_dev)
session.commit()

# Iterate through; note that all docs are in our set by definition
chem_added         = 0
disease_added      = 0
chem_disease_added = 0
for doc_id, doc_xml in doc_xmls.items():
    doc = session.query(Document).filter(Document.stable_id == doc_id).one()
    for cid in get_CID_relations(doc_xml, doc):
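        # Each relation is a (chemical, disease) pair; each element is indexed
        # below as (sentence, char_start, char_end)
        c, d = cid
        sent = c[0]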
        
        # Create a Span for the chemical, add to DB if it does not exist
        chem_stable_id = "%s:%s-%s" % (sent.stable_id, c[1], c[2])
        chem = session.query(Span).filter(Span.stable_id == chem_stable_id).first()
        if chem is None:
            chem = Span(stable_id=chem_stable_id, parent=sent, char_start=c[1], char_end=c[2])
            chem_added += 1
         
        # Create a Span for the disease, add to DB if it does not exist
        disease_stable_id = "%s:%s-%s" % (sent.stable_id, d[1], d[2])
        disease = session.query(Span).filter(Span.stable_id == disease_stable_id).first()
        if disease is None:
            disease = Span(stable_id=disease_stable_id, parent=sent, char_start=d[1], char_end=d[2])
            disease_added += 1
        
        # Create a Candidate for ChemicalDisease, add to DB if it does not exist
        chem_disease = session.query(ChemicalDisease).filter(ChemicalDisease.chemical == chem)\
                              .filter(ChemicalDisease.disease == disease).first()
        if chem_disease is None:
            chem_disease = ChemicalDisease(chemical=chem, disease=disease)
            chem_disease_added += 1
        
        # Add to dev set
        gold_dev.append(chem_disease)
        
        # Add annotation!
        # TODO
        
session.commit()
print(len(gold_dev))
print(chem_added)
print(disease_added)
print(chem_disease_added)

291
0
0
0
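
The three counters only increase when a new object is created, so the zeros above mean that on this run every gold span and candidate was already in the database. The look-up-then-create pattern behind those counters can be sketched without a database. In this minimal example a plain dict stands in for the `Span` table; `make_stable_id`, `get_or_create`, the sentence ID, and the offsets are hypothetical and chosen only for illustration.

```python
def make_stable_id(sentence_stable_id, char_start, char_end):
    """Build a span ID in the "<sentence>:<start>-<end>" form used in the cell above."""
    return "%s:%s-%s" % (sentence_stable_id, char_start, char_end)


def get_or_create(table, stable_id, factory):
    """Return (record, created); call factory() only if stable_id is absent."""
    record = table.get(stable_id)
    if record is not None:
        return record, False
    record = factory()
    table[stable_id] = record
    return record, True


spans = {}   # hypothetical stand-in for the Span table, keyed by stable ID
added = 0    # counts only records that did not exist yet

# Made-up mentions; the third repeats the first, so it must not be re-created
mentions = [
    ("doc1::sentence:0", 13, 21),
    ("doc1::sentence:0", 42, 56),
    ("doc1::sentence:0", 13, 21),
]
for sent_id, start, end in mentions:
    sid = make_stable_id(sent_id, start, end)
    _, created = get_or_create(
        spans, sid, lambda: {"char_start": start, "char_end": end})
    added += int(created)

print(len(spans))
print(added)
```

Running the same loop a second time over the same mentions would leave `added` unchanged, which mirrors the zero counts printed above.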


In [15]:
for cd in gold_dev[:10]:
    print(cd)

ChemicalDisease(Span("Methadone", parent=8973, chars=[13,21], words=[2,2]), Span("QT prolongation", parent=8973, chars=[42,56], words=[6,7]))
ChemicalDisease(Span("Methadone", parent=8973, chars=[13,21], words=[2,2]), Span("syncope", parent=8973, chars=[82,88], words=[12,12]))
ChemicalDisease(Span("PAN", parent=8701, chars=[229,231], words=[38,38]), Span("nephrotic syndrome", parent=8701, chars=[241,258], words=[39,40]))
ChemicalDisease(Span("etoposide", parent=7502, chars=[84,92], words=[11,11]), Span("myocardial infarction", parent=7502, chars=[20,40], words=[4,5]))
ChemicalDisease(Span("echothiophate iodide", parent=10118, chars=[66,85], words=[11,12]), Span("muscle weakness", parent=10118, chars=[129,143], words=[19,20]))
ChemicalDisease(Span("succinylcholine", parent=7602, chars=[35,49], words=[6,6]), Span("apnoea", parent=7602, chars=[51,56], words=[7,7]))
ChemicalDisease(Span("dexamethasone", parent=9319, chars=[152,164], words=[23,23]), Span("ocular hypertension", parent=9319, 