# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll be writing an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). At its core, the application is a model that classifies _candidate_ CDR mentions as either true or false.

## Part IV: Training a Model with Data Programming

In this part of the tutorial, we will train a statistical model to differentiate between true and false `ChemicalDisease` mentions.

We will train this model using _data programming_, and we will **ignore** the training labels provided with the training data. This is a more realistic scenario; in the wild, hand-labeled training data is rare and expensive. Data programming enables us to train a model using only a modest amount of hand-labeled data for validation and testing. For more information on data programming, see the [NIPS 2016 paper](https://arxiv.org/abs/1605.07723).

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()



We repeat our definition of the `ChemicalDisease` `Candidate` subclass from Parts II and III.

In [2]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

## Loading `CandidateSet` objects

We reload the `CandidateSet` objects from the previous parts of the tutorial. Note that we will now process all three sets (training, development, and test) as we go, because each plays a distinct role in Parts IV and V.

In [3]:
from snorkel.models import CandidateSet

train = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Training Candidates').one()
dev = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates').one()
test = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Test Candidates').one()

candidate_sets = [train, dev, test]

## _TEMPORARY:_ Creating a Small Subsample

To keep the remaining steps fast, we create a subset containing the first 1,000 training candidates:

In [None]:
train_1K = CandidateSet(name='CDR Training Candidates 1K')
for candidate in train.candidates[:1000]:
    train_1K.append(candidate)
session.add(train_1K)
session.commit()

**OR** load the subset if it has already been created:

In [4]:
train_1K = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Training Candidates 1K').one()

## Automatically Creating Features
Recall that our goal is to distinguish between true and false mentions of chemical-disease relations. To train a model for this task, we first embed our `ChemicalDisease` candidates in a feature space.

In [5]:
from snorkel.annotations import FeatureManager

feature_manager = FeatureManager()

We can create a new feature set:

**_NOTE: This may take ~1 hour currently_**

In [None]:
%time F = feature_manager.create(session, train_1K, 'Train 1K Features')

**OR** if we've already created one, we can simply load as follows:

In [6]:
%time F = feature_manager.load(session, train_1K, 'Train 1K Features')

CPU times: user 23.5 s, sys: 59.6 ms, total: 23.5 s
Wall time: 23.5 s


Note that the returned matrix is a subclass of `scipy.sparse.csr_matrix` with a few additional helper methods, which we demonstrate below:

In [7]:
F

<1000x11524 sparse matrix of type '<type 'numpy.float64'>'
	with 48318 stored elements in Compressed Sparse Row format>

In [8]:
F.get_candidate(0)

AnnotationKey (Tutorial Part 2 User)

In [9]:
F.get_key(0)

AnnotationKey (TDL_LEMMA:PARENTS-OF-BETWEEN-MENTION-and-MENTION[None])

## Creating Labeling Functions
Labeling functions are a core tool of data programming. They are heuristic functions that aim to classify candidates correctly. Their outputs will be automatically combined and denoised to estimate the probabilities of training labels for the training data.

In [10]:
import random
def LF_test(candidate):
    return 1 if random.random() < 0.01 else 0
def LF_test_2(candidate):
    return 1 if random.random() < 0.1 else 0

We maintain a list of all LFs for convenience.

In [11]:
LFs = [LF_test, LF_test_2]
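
The random labeling functions above are placeholders. As a rough illustration of what a more meaningful labeling function could look like, the sketch below votes based on the words that appear between the two mentions. The `ToyCandidate` class, its `between_words` field, and the cue-word lists are hypothetical stand-ins invented for this example; they are not part of Snorkel's `Candidate` API. The sketch follows the same convention as above: `1` for true, `-1` for false, and `0` to abstain.

```python
from collections import namedtuple

# Hypothetical stand-in for a candidate; NOT Snorkel's Candidate class.
ToyCandidate = namedtuple('ToyCandidate', ['chemical', 'disease', 'between_words'])

# Illustrative cue words (assumptions, not derived from the CDR corpus).
CAUSAL_CUES = {'induced', 'induces', 'caused', 'causes'}
TREATMENT_CUES = {'treats', 'treated', 'treatment', 'therapy'}

def LF_causal_cue(candidate):
    """Vote true (1) if a causal cue word lies between the mentions, else abstain (0)."""
    words = {w.lower() for w in candidate.between_words}
    return 1 if words & CAUSAL_CUES else 0

def LF_treatment_cue(candidate):
    """Vote false (-1) if a treatment cue word lies between the mentions, else abstain (0)."""
    words = {w.lower() for w in candidate.between_words}
    return -1 if words & TREATMENT_CUES else 0

# Apply both toy labeling functions to one made-up candidate.
example = ToyCandidate('cocaine', 'myocardial infarction', ['induced'])
votes = [lf(example) for lf in (LF_causal_cue, LF_treatment_cue)]
```

Each function abstains unless its cue fires, so a candidate with no cue words between the mentions receives no vote from either function.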

## Applying Labeling Functions

First we construct a `LabelManager`.

In [12]:
from snorkel.annotations import LabelManager

label_manager = LabelManager()

Next we use the `LabelManager` to apply the labeling functions to the training `CandidateSet`.

In [None]:
%time L = label_manager.create(session, train_1K, 'LF Labels', f=LFs)
L

**OR** load the labels if we've already created them:

In [13]:
L = label_manager.load(session, train_1K, 'LF Labels')
L

<1000x3 sparse matrix of type '<type 'numpy.float64'>'
	with 309 stored elements in Compressed Sparse Row format>

We can also rerun one or more labeling functions with the command below. Do this to test changes to the labeling functions.

In [None]:
def LF_test_3(candidate):
    return -1 if random.random() < 0.2 else 0

L = label_manager.update(session, train_1K, 'LF Labels', f=[LF_test_3])
L

We can view statistics about the resulting label matrix:

In [14]:
L.lf_stats()

| | conflicts | coverage | j | overlaps |
|---|---|---|---|---|
| LF_test_2 | 0.021 | 0.097 | 0 | 0.023 |
| LF_test | 0.002 | 0.007 | 1 | 0.004 |
| LF_test_3 | 0.022 | 0.205 | 2 | 0.022 |
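
To make these columns concrete, here is a minimal sketch of how such statistics could be computed from a small dense label matrix. It assumes the following definitions, which are our reading of the column names rather than a copy of Snorkel's implementation: coverage is the fraction of candidates a labeling function labels (non-zero), overlaps is the fraction it labels together with at least one other labeling function, and conflicts is the fraction where another labeling function gives a different non-zero label. The `lf_stats` helper and the toy matrix are hypothetical.

```python
def lf_stats(label_matrix):
    """Return per-LF coverage, overlaps, and conflicts for a dense label matrix.

    label_matrix is a list of rows (one per candidate); each row holds one
    label per labeling function in {-1, 0, 1}, where 0 means "abstain".
    """
    n = float(len(label_matrix))
    m = len(label_matrix[0])
    stats = []
    for j in range(m):
        covered = overlapped = conflicted = 0
        for row in label_matrix:
            if row[j] == 0:
                continue  # LF j abstained on this candidate
            covered += 1
            # Non-abstaining votes from all other labeling functions.
            others = [row[k] for k in range(m) if k != j and row[k] != 0]
            if others:
                overlapped += 1
            if any(label != row[j] for label in others):
                conflicted += 1
        stats.append({'j': j,
                      'coverage': covered / n,
                      'overlaps': overlapped / n,
                      'conflicts': conflicted / n})
    return stats

# Toy label matrix: 4 candidates x 3 labeling functions (made-up values).
toy_L = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, -1],
    [0, 0, 0],
]
toy_stats = lf_stats(toy_L)
```

Under these definitions, conflicts can never exceed overlaps, and overlaps can never exceed coverage, which matches the pattern in the table above.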


## Fitting the Generative Model
We estimate the accuracies of the labeling functions without supervision. Specifically, we estimate the parameters of a `NaiveBayes` generative model.

In [None]:
from snorkel.learning import NaiveBayes

gen_model = NaiveBayes(bias_term=False)
gen_model.train(L, w0=np.ones(L.shape[1]), n_iter=1000)

We now apply the generative model to the training candidates.

In [None]:
train_marginals = gen_model.marginals(L)
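
For intuition about what these marginals represent, here is a simplified sketch under a naive Bayes assumption: labeling functions vote independently given the true label, each has a single accuracy, and the class prior is uniform. In that setting, the posterior log-odds of a candidate being true is the sum of each labeling function's vote times its log-odds accuracy. The accuracies below are made up for illustration, and the helper names are hypothetical; this is not Snorkel's implementation, which learns the weights from the label matrix.

```python
import math

def lf_log_odds(accuracy):
    """Convert an assumed LF accuracy in (0, 1) into a log-odds weight."""
    return math.log(accuracy / (1.0 - accuracy))

def marginal(labels, weights):
    """P(candidate is true) given LF votes in {-1, 0, 1} and log-odds weights."""
    score = sum(w * l for w, l in zip(weights, labels))
    return 1.0 / (1.0 + math.exp(-score))

# Made-up accuracies for three labeling functions (an assumption for this sketch).
toy_accuracies = [0.9, 0.6, 0.7]
toy_weights = [lf_log_odds(a) for a in toy_accuracies]

# Each row holds the votes of the three labeling functions on one candidate.
toy_rows = [
    [1, 0, 0],     # only the most accurate LF votes true
    [1, -1, 0],    # a strong "true" vote against a weak "false" vote
    [0, 0, 0],     # every LF abstains
    [-1, -1, -1],  # unanimous "false"
]
toy_marginals = [marginal(row, toy_weights) for row in toy_rows]
```

A candidate on which every labeling function abstains stays at the prior of 0.5, while agreement among accurate labeling functions pushes the marginal toward 0 or 1.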

## Training the Discriminative Model
We use the estimated probabilities to train a discriminative model that classifies each `Candidate` as a true or false mention.

In [None]:
from snorkel.learning import LogReg

disc_model = LogReg(bias_term=True)
disc_model.train(F, train_marginals, n_iter=1500, rate=1e-5)
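
The key difference from ordinary supervised training is that the targets are probabilities rather than hard labels. As a minimal sketch of that idea, the pure-Python logistic regression below minimizes the cross-entropy between its predictions and soft targets using batch gradient descent. The function name, the toy data, and the hyperparameters are assumptions chosen for illustration; this is not Snorkel's `LogReg` implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg_soft(X, marginals, n_iter=2000, rate=0.5):
    """Fit logistic regression to soft targets by batch gradient descent.

    X is a list of feature rows; marginals holds target probabilities in [0, 1].
    Returns (weights, bias).
    """
    n, d = float(len(X)), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, p in zip(X, marginals):
            # Gradient of the cross-entropy w.r.t. the logit is (prediction - target).
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - p
            for k in range(d):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - rate * g / n for wi, g in zip(w, grad_w)]
        b -= rate * grad_b / n
    return w, b

# Toy data: 4 candidates x 2 binary features, with made-up training marginals.
toy_X = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
toy_marginals = [0.9, 0.8, 0.2, 0.5]
w, b = train_logreg_soft(toy_X, toy_marginals)
toy_predictions = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in toy_X]
```

Because the update uses `prediction - target` with a fractional target, uncertain candidates (marginals near 0.5) pull the model less strongly toward either class than confident ones.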

## Evaluating on the Development `CandidateSet`

In [None]:
# NOTE: score, sorted_test_candidates, gold_candidate_set, pred, and test_marginals
# are not defined earlier in this notebook and must be provided before running this cell.
test_labels = np.asarray(
    [1 if candidate in gold_candidate_set else -1 for candidate in sorted_test_candidates]
)

score(sorted_test_candidates, test_labels, pred, gold_candidate_set,
      train_marginals=train_marginals, test_marginals=test_marginals)

After evaluating on the development `CandidateSet`, the labeling functions can be modified. Try changing the labeling functions to improve performance. You can view the true positives, false positives, true negatives, and false negatives using the `Viewer`.
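
As a small, self-contained illustration of how true and false positives and negatives translate into summary metrics, the sketch below thresholds predicted marginals at 0.5 and compares them with gold labels in `{-1, 1}`. The helper names, the threshold, and the toy values are assumptions for this example; they are independent of the `score` function used above.

```python
def confusion_counts(predicted, gold):
    """Count (tp, fp, tn, fn) for labels in {-1, 1}."""
    tp = fp = tn = fn = 0
    for p, g in zip(predicted, gold):
        if p == 1 and g == 1:
            tp += 1
        elif p == 1 and g == -1:
            fp += 1
        elif p == -1 and g == -1:
            tn += 1
        elif p == -1 and g == 1:
            fn += 1
    return tp, fp, tn, fn

def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 when a ratio is undefined."""
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up marginals and gold labels for five candidates.
toy_marginals = [0.9, 0.7, 0.4, 0.2, 0.6]
toy_gold = [1, -1, 1, -1, 1]
toy_pred = [1 if m > 0.5 else -1 for m in toy_marginals]

tp, fp, tn, fn = confusion_counts(toy_pred, toy_gold)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Inspecting the false positives and false negatives individually is usually the quickest way to decide which labeling function to revise next.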

## Saving the Discriminative Model's Parameters
We save the model's parameters for use in Part V.

Next, in Part V, we will test our model on the test `CandidateSet`.