# Disease Tagging Tutorial

In this example, we'll be writing an application to extract *mentions of* diseases from PubMed abstracts, using annotations from the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This five-part tutorial walks through the process of constructing a model to classify _candidate_ disease mentions as either true (i.e., the candidate really is a mention of a disease) or false.

## Part V: Evaluating the Model on the Test Set

In this final part of the tutorial, we will reload the model developed in Part IV and test it on the test `CandidateSet`.

A benchmark score for this task, using [TaggerOne](http://www.ncbi.nlm.nih.gov/pubmed/27283952), is **79.6 F1**; we'll see that we already come within 1-2 F1 points of this! (And if we utilize additional unlabeled data, which is not covered in this tutorial but is easy to implement, we can match or exceed it!)

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()



We repeat our definition of the `Disease` `Candidate` subclass from Parts II, III, and IV.

In [2]:
from snorkel.models import candidate_subclass

Disease = candidate_subclass('Disease', ['disease'])

## Loading Test `CandidateSet`

We reload the test `CandidateSet` object from the previous parts of the tutorial.

In [3]:
from snorkel.models import CandidateSet

test = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Test Candidates').one()

## Automatically Creating Features

We also generate features for the test set.

In [4]:
from snorkel.annotations import FeatureManager

feature_manager = FeatureManager()

Note that we use the feature set from training, because those are the only features for which we have learned parameters. Features that were not encountered during training (e.g., a token that does not appear in the training set) are ignored, because we have no information about them.

To do so with the `FeatureManager`, we call `update` with the new `CandidateSet`, the name of the training `AnnotationKeySet`, and the value `False` for the parameter `extend_key_set`, which indicates that the `AnnotationKeySet` should not be expanded with new `Feature` keys encountered during processing.
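The idea of featurizing against a fixed key set can be illustrated with a small, library-independent sketch. This is not Snorkel's implementation: the helper names `build_key_set` and `featurize` and the toy documents are assumptions made up for illustration, and whitespace tokens stand in for real `Feature` keys.

```python
from collections import Counter


def build_key_set(training_docs):
    """Collect the feature keys (here: tokens) seen during training.

    Hypothetical helper: returns a mapping from key to column index.
    """
    keys = sorted({tok for doc in training_docs for tok in doc.split()})
    return {key: idx for idx, key in enumerate(keys)}


def featurize(doc, key_index):
    """Map a document to a count vector over a FIXED key set.

    Tokens that are not in key_index are silently dropped, which mirrors
    the effect of not extending the key set at test time.
    """
    row = [0] * len(key_index)
    for tok, count in Counter(doc.split()).items():
        if tok in key_index:
            row[key_index[tok]] = count
    return row


# Toy data (made up for this sketch).
train_docs = ["chronic kidney disease", "kidney failure"]
key_index = build_key_set(train_docs)

# "acute" never appeared in training, so it contributes no column.
test_row = featurize("acute kidney disease", key_index)
print(sorted(key_index, key=key_index.get))
print(test_row)
```

The test row has exactly as many columns as there are training keys, so it lines up with the learned parameter vector regardless of what new tokens the test set contains.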

In [5]:
%time F_test = feature_manager.update(session, test, 'Train Features', False)

Loading sparse Feature matrix...
CPU times: user 3min 54s, sys: 7.15 s, total: 4min 1s
Wall time: 4min


**Or**, if we have already created the test feature matrix, we can simply load it as follows:

In [None]:
%time F_test = feature_manager.load(session, test, 'Train Features')

## Reloading the Discriminative Model

In [6]:
from snorkel.learning import LogReg

disc_model = LogReg()
%time disc_model.load(session, "Discriminative Params")

CPU times: user 967 ms, sys: 76.6 ms, total: 1.04 s
Wall time: 1.06 s


## Evaluating on the Test `CandidateSet`

First, we load the test set labels and gold candidates we made in Part III.

In [7]:
from snorkel.annotations import LabelManager

label_manager = LabelManager()
L_test = label_manager.load(session, test, "CDR Test Labels -- Gold")

In [8]:
gold_test_set = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Test Candidates -- Gold').one()

In [9]:
tp, fp, tn, fn = disc_model.score(F_test, L_test, gold_test_set)

Test set size:	6407
----------------------------------------
Pos. class accuracy: 0.948176583493
Neg. class accuracy: 0.625724637681
----------------------------------------
Precision:	0.769984413271
Recall:		0.948176583493
F1 Score:	0.849840255591
----------------------------------------
TP: 3458 | FP: 1033 | TN: 1727 | FN: 189
Recall-corrected Noise-aware Model
Pos. class accuracy: 0.78164556962
Neg. class accuracy: 0.625724637681
Corpus Precision 0.77
Corpus Recall    0.782
Corpus F1        0.776
----------------------------------------
TP: 3458 | FP: 1033 | TN: 1727 | FN: 966
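The reported metrics can be checked by hand from the confusion counts in the output above. The sketch below uses the standard definitions of precision, recall, and F1; `prf1` is a hypothetical helper written for this check, not part of Snorkel. The two blocks of output differ only in the false-negative count (189 vs. 966). One plausible reading, which the output itself does not state, is that the additional 777 false negatives are gold mentions that never appeared in the candidate set.

```python
def prf1(tp, fp, fn):
    """Return (precision, recall, F1) from raw confusion counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the first block of the score output (candidate level).
candidate_scores = prf1(tp=3458, fp=1033, fn=189)

# Counts copied from the "Recall-corrected" block (corpus level).
corpus_scores = prf1(tp=3458, fp=1033, fn=966)

print("Candidate-level P/R/F1: %.3f / %.3f / %.3f" % candidate_scores)
print("Corpus-level    P/R/F1: %.3f / %.3f / %.3f" % corpus_scores)
```

The corpus-level F1 of roughly 0.776 is the number to compare against the 79.6 F1 benchmark quoted at the start of this part.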



You've completed the tutorial!