# Disease Norm

In this example, we'll write an application to extract *mentions of* diseases from PubMed abstracts, using annotations from the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This tutorial, which has 5 parts, walks through the process of constructing a model to classify _candidate_ disease mentions as either true (i.e., the candidate really is a mention of a disease) or false.

## Plan of action:

Two types of labeling functions (LFs):
1. TYPE I: Leveraging sources of weak supervision (WS), e.g. distant supervision (DS)
2. TYPE II: Expressing heuristics (e.g. magnifying user effort)

TYPE I:
- Need to break up MESH into subtrees and have each one be an LF!
- Need to provide negative signal

TYPE II:
- Conduct a "simulated expert" experiment: go through, label examples, and write LFs. What is the effective multiplier over binary labeling?
    * E.g. "renal failure"; add {"renal" -> "kidney"} to synonym map
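
As a rough illustration of the synonym-map idea in the last bullet, the sketch below rewrites the tokens of a mention before a dictionary lookup. The names `SYNONYMS`, `TOY_DICT`, `normalize`, and `lookup` are hypothetical and are not part of Snorkel or of this tutorial's utilities.

```python
# Hypothetical synonym map: each entry rewrites one token to a canonical form.
SYNONYMS = {"renal": "kidney"}

# Toy stand-in for a term -> MESH ID dictionary (the ID is a placeholder).
TOY_DICT = {"kidney failure": "MESH_ID_PLACEHOLDER"}

def normalize(phrase, synonyms=SYNONYMS):
    """Lowercase the phrase and replace each token found in the synonym map."""
    return " ".join(synonyms.get(tok, tok) for tok in phrase.lower().split())

def lookup(phrase):
    """Return the dictionary entry for the normalized phrase, or None."""
    return TOY_DICT.get(normalize(phrase))

print(normalize("Renal failure"))  # kidney failure
print(lookup("renal failure"))     # MESH_ID_PLACEHOLDER
```

Each synonym added this way lets one dictionary entry cover several surface forms, which is the kind of "multiplier" the TYPE II experiment is meant to measure.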

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
from snorkel.models import candidate_subclass

Disease = candidate_subclass('Disease', ['disease'])

## Loading `CandidateSet` objects

We reload the training and development `CandidateSet` objects from the previous parts of the tutorial.

In [3]:
from snorkel.models import CandidateSet

train = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Training Candidates').one()
dev = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates').one()

# Writing some multinomial LFs

## TYPE I LF: Subsets of MESH dictionary

In [4]:
from cPickle import load

MESH_to_CID = load(open('MESH_to_CID.pkl', 'rb'))

In [5]:
from utils import load_mesh_raw

mesh_entries = load_mesh_raw('data/desc2017.xml')

Loaded 28472 entries


# Write the LFGs as key-value generators

In [6]:
from collections import defaultdict

mesh_tree = defaultdict(list)
for entry in mesh_entries:
    mid, tree_nums, terms = entry
    for tn in tree_nums:
        path = [tn[0]] + tn[1:].split(".")
        for term in terms:
            mesh_tree[term].append((mid, path))

In [7]:
mesh_tree.values()[0]

[('D007952', ['C', '04', '557', '337', '595']),
 ('D007952', ['C', '04', '557', '595', '500', '500']),
 ('D007952', ['C', '20', '683', '515', '845', '500'])]
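
To make the path format concrete, here is a small standalone sketch of the same split applied to a single tree number. The input string is reconstructed from the first path in the output above; `parse_tree_number` is a hypothetical helper name.

```python
def parse_tree_number(tn):
    """Split a MESH tree number into [category letter, level-1 code, ...]."""
    return [tn[0]] + tn[1:].split(".")

path = parse_tree_number("C04.557.337.595")
print(path)                # ['C', '04', '557', '337', '595']
print("_".join(path[:2]))  # C_04
```

The first two path elements joined with an underscore give subtree keys of the same form as the LF names that appear in the statistics table later on (for example `C_08`).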

In [8]:
def LFG_MESH_exact(c):
    p = c.disease.get_span().lower()
    if p in mesh_tree:
        seen = set()
        for mid, path in mesh_tree[p]:
            key   = "_".join(path[:2])
            if key not in seen:
                seen.add(key)
                value = MESH_to_CID[mid] if path[0] in ['C', 'F'] else -1
                yield key, value

In [9]:
from snorkel.annotations import LabelManager

label_manager = LabelManager()
L_gold_dev = label_manager.load(session, dev, "CDR Development Label Set")

In [10]:
L_gold_dev

<29853x1 sparse matrix of type '<type 'numpy.float64'>'
	with 29853 stored elements in Compressed Sparse Row format>

In [11]:
%time L_dev = label_manager.create(session, dev, 'LF Labels', f=LFG_MESH_exact)
L_dev


Loading sparse Label matrix...
CPU times: user 1min 37s, sys: 17.9 s, total: 1min 55s
Wall time: 1min 41s


<29853x103 sparse matrix of type '<type 'numpy.float64'>'
	with 10832 stored elements in Compressed Sparse Row format>

In [12]:
L_dev.lf_stats(labels=L_gold_dev)

| LF key | j | coverage | overlaps | conflicts | accuracy |
|---|---|---|---|---|---|
| G_07 | 0 | 0.003517 | 0.003015 | 0.000435 | 0.933333 |
| G_08 | 1 | 0.002244 | 0.000770 | 0.000033 | 0.985075 |
| C_23 | 2 | 0.018524 | 0.013734 | 0.002412 | 0.833635 |
| C_08 | 3 | 0.001306 | 0.000770 | 0.000000 | 0.871795 |
| D_03 | 4 | 0.028004 | 0.012394 | 0.000000 | 1.000000 |
| F_01 | 5 | 0.004757 | 0.001943 | 0.000134 | 0.169014 |
| B_01 | 6 | 0.013868 | 0.000000 | 0.000000 | 0.995169 |
| C_10 | 7 | 0.016347 | 0.012394 | 0.001072 | 0.838115 |
| E_03 | 8 | 0.000804 | 0.000033 | 0.000000 | 0.791667 |
| D_02 | 9 | 0.038690 | 0.015945 | 0.000000 | 1.000000 |
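
For readers unfamiliar with these columns, the sketch below computes them for a tiny dense label matrix. It assumes the usual reading of the names (coverage: fraction of candidates the LF labels; overlaps: fraction it labels together with at least one other LF; conflicts: fraction where another LF gives a different non-zero label; accuracy: agreement with gold among labeled candidates). It is not Snorkel's `lf_stats` implementation, and the toy matrix is made up.

```python
def lf_stats(L, gold):
    """Per-LF coverage, overlaps, conflicts and accuracy for a dense label matrix.

    L[i][j] is the label LF j gave candidate i (0 = abstain); gold[i] is the true label.
    """
    n = len(L)
    m = len(L[0])
    stats = []
    for j in range(m):
        covered = overlaps = conflicts = correct = 0
        for i in range(n):
            lab = L[i][j]
            if lab == 0:
                continue
            covered += 1
            # Non-abstaining labels from the other LFs on the same candidate
            others = [L[i][k] for k in range(m) if k != j and L[i][k] != 0]
            if others:
                overlaps += 1
            if any(o != lab for o in others):
                conflicts += 1
            if lab == gold[i]:
                correct += 1
        stats.append({
            "coverage": covered / float(n),
            "overlaps": overlaps / float(n),
            "conflicts": conflicts / float(n),
            "accuracy": correct / float(covered) if covered else None,
        })
    return stats

# Toy data: 4 candidates, 3 LFs (the third LF always abstains).
L = [[1, 1, 0],
     [1, -1, 0],
     [0, 0, 0],
     [1, 0, 0]]
gold = [1, 1, -1, -1]
for row in lf_stats(L, gold):
    print(row)
```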


In [13]:
from snorkel.learning import GenerativeModel

gen_model = GenerativeModel()
gen_model.train(L_train)

because the backend has already been chosen;
matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.



ImportError: No module named numbskull

# TODO:

1. _DONE: Add negative labels to candidates..._
2. _DONE: Get empirical LF accs up and running..._
3. Add TF-IDF matching LFs
4. **Conduct TYPE II "experiment"!**
5. Binarize LFs + run in binary gen model
6. Add in simple DDLIB + WS feats from new_features -> run disc. model

In [None]:
target_dicts, neg_sets = split_MESH(mesh_entries, target_split_level=1, neg_split_level=1)
print len(target_dicts)
print np.mean([len(set(td.values())) for td in target_dicts.values()])
print len(neg_sets)
print np.mean([len(ns) for ns in neg_sets.values()])

In [None]:
def LFGen_MESH_exact(target_dictionaries, neg_sets):
    LFs = []
    for name, td in target_dictionaries.iteritems():
        
        # Simple exact match LF
        # Bind td as a default argument so each LF keeps its own dictionary
        def lf(c, td=td):
            p = c.disease.get_span().lower()
            if p in td:
                return MESH_to_CID[td[p]]
            else:
                return 0
        lf.__name__ = 'LF_MESH_exact_%s' % name
        LFs.append(lf)
        
    for name, ns in neg_sets.iteritems():
        
        # Simple exact negative match LF
        # Bind ns as a default argument so each LF keeps its own negative set
        def lf(c, ns=ns):
            p = c.disease.get_span().lower()
            if p in ns:
                return -1
            else:
                return 0
        lf.__name__ = 'LF_MESH_exact_neg_%s' % name
        LFs.append(lf)
    return LFs
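
When LFs are generated inside a loop, as above, Python closures capture the loop variable itself rather than its value at definition time. The toy sketch below (hypothetical helper names, no Snorkel objects) shows the difference between late binding and binding through a default argument.

```python
def make_lfs_late_binding(names):
    """Each closure looks up `name` when called, so all of them see the final value."""
    lfs = []
    for name in names:
        def lf(c):
            return name
        lfs.append(lf)
    return lfs

def make_lfs_bound(names):
    """A default argument freezes the current value of `name` for each closure."""
    lfs = []
    for name in names:
        def lf(c, name=name):
            return name
        lfs.append(lf)
    return lfs

names = ["C_04", "C_10", "F_01"]
print([lf(None) for lf in make_lfs_late_binding(names)])  # ['F_01', 'F_01', 'F_01']
print([lf(None) for lf in make_lfs_bound(names)])         # ['C_04', 'C_10', 'F_01']
```

Without explicit binding, every generated LF would silently consult the last dictionary in the loop, which is easy to miss because the function names still differ.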

In [None]:
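def LF_accuracy(LF, candidates, gold):
    """Print how many candidates the LF labels and its accuracy on those, measured against gold."""
    correct = 0
    labeled = 0
    for i,c in enumerate(candidates):
        l = LF(c)
        if l != 0:
            labeled += 1
            if l == gold[i,0]:
                correct += 1
    print "Number labeled:\t", labeled
    # Avoid dividing by zero when the LF abstains on every candidate
    if labeled > 0:
        print "Accuracy:\t", correct / float(labeled)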

In [None]:
all_diseases = {}
for td in target_dicts.values():
    for term, mid in td.iteritems():
        all_diseases[term] = mid

In [None]:
from collections import defaultdict

def split_MESH(mesh_entries, target_prefixes=('C', 'F'), target_split_level=2, neg_split_level=1):
    """
    Split the MESH entries list into *target dictionaries* and *negative sets*, each keyed by a tree-path tag
    Each target dictionary maps terms -> MESH IDs
    Each negative set just contains terms that are negative matches
    """
    target_dictionaries = defaultdict(dict)
    neg_sets            = defaultdict(set)
    
    for entry in mesh_entries:
        mid, tree_nums, terms = entry
        for tn in tree_nums:
            prefix = tn[0]
            path   = tn[1:].split(".")
            
            # Targets
            if prefix in target_prefixes:
                tag = "_".join(path[:target_split_level])
                for term in terms:
                    target_dictionaries[tag][term] = mid
            
            # Negative sets
            else:
                tag = "_".join(path[:neg_split_level])
                for term in terms:
                    neg_sets[tag].add(term)
    return target_dictionaries, neg_sets

In [None]:
def generate_tfidf_MESH_LFs(target_dictionaries, neg_sets):
    LFs = []
    Ds  = []
    for name, td in target_dictionaries.iteritems():
        
        # Simple exact match LF
        # Bind td as a default argument so each LF keeps its own dictionary
        def lf(c, td=td):
            p = c.disease.get_span().lower()
            if p in td:
                return MESH_to_CID[td[p]]
            else:
                return 0
        lf.__name__ = 'LF_MESH_exact_%s' % name
        LFs.append(lf)
        
    for name, ns in neg_sets.iteritems():
        
        # Simple exact negative match LF
        # Bind ns as a default argument so each LF keeps its own negative set
        def lf(c, ns=ns):
            p = c.disease.get_span().lower()
            if p in ns:
                return -1
            else:
                return 0
        lf.__name__ = 'LF_MESH_exact_neg_%s' % name
        LFs.append(lf)
    return LFs

#### Evaluating the LFs

In [None]:
from snorkel.annotations import LabelManager

label_manager = LabelManager()
L_gold_dev = label_manager.load(session, dev, "CDR Development Label Set")

#### Exact dictionary match

In [None]:
def LF_exact_MESH_match(c):
    p = c.disease.get_span().lower()
    if p in diseases:
        return MESH_to_CID[diseases[p]]
    else:
        return 0

In [None]:
%time LF_accuracy(LF_exact_MESH_match, dev, L_gold_dev)

#### Bag match

In [None]:
import re

disease_bags = {}
for term, mid in diseases.iteritems():
    disease_bags[frozenset(re.findall(r'\w+', term))] = mid

In [None]:
def LF_bag_MESH_match(c):
    bag = frozenset(re.findall(r'\w+', c.disease.get_span().lower()))
    if bag in disease_bags:
        return MESH_to_CID[disease_bags[bag]]
    else:
        return 0

In [None]:
%time LF_accuracy(LF_bag_MESH_match, dev, L_gold_dev)
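
The bag match ignores word order and punctuation by comparing sets of word tokens. The standalone sketch below shows the effect on a toy dictionary; the term, the placeholder ID, and the helper names `to_bag` and `bag_match` are all hypothetical.

```python
import re

def to_bag(phrase):
    """Return the set of lowercase word tokens in the phrase, ignoring order and punctuation."""
    return frozenset(re.findall(r"\w+", phrase.lower()))

# Toy term -> ID dictionary; the ID is a placeholder, not a real MESH ID.
toy_terms = {"kidney failure, acute": "ID_1"}
toy_bags = {to_bag(term): mid for term, mid in toy_terms.items()}

def bag_match(span):
    """Return the ID whose term has exactly the same word set as the span, else None."""
    return toy_bags.get(to_bag(span))

print(bag_match("Acute kidney failure"))  # ID_1
print(bag_match("kidney failure"))        # None
```

Because the bag is a set, repeated words collapse and two different terms with the same words map to the same key, so later dictionary entries overwrite earlier ones.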

#### TF-IDF match

In [None]:
from entity_norm import CanonDictVectorizer

cd_vectorizer = CanonDictVectorizer(diseases, other_phrases=[])

In [None]:
disease_phrases = []
disease_cids    = []
for term, mid in diseases.iteritems():
    disease_phrases.append(term)
    disease_cids.append(mid)
    
D = cd_vectorizer.vectorize_phrases(disease_phrases)
D


In [None]:
def LF_tfidf_MESH_match(c, thresh):
    cx = cd_vectorizer.vectorize_phrases([c.disease.get_span()])
    m  = (cx * D.T).tocoo()
    if m.data.shape[0] > 0 and m.data.max() > thresh:
        return MESH_to_CID[disease_cids[m.col[m.data.argmax()]]]
    else:
        return 0

In [None]:
def LF_tfidf_MESH_match_0(c):
    cx = cd_vectorizer.vectorize_phrases([c.disease.get_span()])
    m  = (cx * D.T).tocoo()
    if m.data.shape[0] > 0:
        return MESH_to_CID[disease_cids[m.col[m.data.argmax()]]]
    else:
        return 0

In [None]:
%time LF_accuracy(LF_tfidf_MESH_match_0, dev, L_gold_dev)

In [None]:
def LF_tfidf_MESH_match_05(c):
    return LF_tfidf_MESH_match(c, 0.5)

In [None]:
%time LF_accuracy(LF_tfidf_MESH_match_05, dev, L_gold_dev)

In [None]:
def LF_tfidf_MESH_match_08(c):
    return LF_tfidf_MESH_match(c, 0.8)

In [None]:
%time LF_accuracy(LF_tfidf_MESH_match_08, dev, L_gold_dev)

In [None]:
def LF_tfidf_MESH_match_1(c):
    return LF_tfidf_MESH_match(c, 1.0)

In [None]:
%time LF_accuracy(LF_tfidf_MESH_match_1, dev, L_gold_dev)
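
The internals of `CanonDictVectorizer` are not shown in this notebook, so the following is only a generic sketch of thresholded TF-IDF cosine matching in plain Python. The smoothing formula, the helper names, and the toy term list are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(phrase):
    return re.findall(r"\w+", phrase.lower())

def fit_idf(phrases):
    """Smoothed inverse document frequency for each token in the phrase list."""
    n = len(phrases)
    df = Counter(tok for p in phrases for tok in set(tokenize(p)))
    return {tok: math.log((1.0 + n) / (1 + d)) + 1 for tok, d in df.items()}

def vectorize(phrase, idf):
    """L2-normalized TF-IDF vector as a dict; tokens without an IDF weight are dropped."""
    tf = Counter(tok for tok in tokenize(phrase) if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}

def best_match(span, terms, idf, thresh):
    """Return (term, score) for the most similar term if score > thresh, else None."""
    q = vectorize(span, idf)
    best = None
    for term in terms:
        t = vectorize(term, idf)
        # Cosine similarity, since both vectors have unit length
        score = sum(w * t.get(tok, 0.0) for tok, w in q.items())
        if best is None or score > best[1]:
            best = (term, score)
    return best if best is not None and best[1] > thresh else None

terms = ["acute kidney failure", "heart failure", "lung neoplasms"]
idf = fit_idf(terms)
print(best_match("kidney failure", terms, idf, 0.5))
print(best_match("headache", terms, idf, 0.5))  # None
```

Raising the threshold trades coverage for precision, which is what the `_0`, `_05`, `_08`, and `_1` variants above are probing.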

In [None]:
from snorkel.viewer import SentenceNgramViewer

sv = SentenceNgramViewer(, session)

## Automatically Creating Features
Recall that our goal is to distinguish between true and false mentions of diseases. To train a model for this task, we first embed our `Disease` candidates in a feature space.

In [None]:
from snorkel.annotations import FeatureManager

feature_manager = FeatureManager()

We can create a new feature set:

In [None]:
%time F_train = feature_manager.create(session, train, 'Train Features')

**OR** if we've already created one, we can simply load as follows:

In [None]:
%time F_train = feature_manager.load(session, train, 'Train Features')

Note that the returned matrix is a subclass of `scipy.sparse.csr_matrix` with extra helper methods for mapping rows back to candidates and columns back to feature keys, which we demonstrate below:

In [None]:
F_train

In [None]:
F_train.get_candidate(0)

In [None]:
F_train.get_key(0)

## Creating Labeling Functions
Labeling functions are a core tool of data programming. They are heuristic functions that aim to classify candidates correctly. Their outputs will be automatically combined and denoised to estimate the probabilities of training labels for the training data.
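
As a minimal sketch of the convention used throughout this notebook, an LF returns `1` for a true mention, `-1` for a false one, and `0` to abstain. The toy LFs below take plain strings instead of Snorkel candidates, and the majority vote is only a naive stand-in for the learned combination described above, not what Snorkel does.

```python
def LF_has_cancer(span):
    """Vote +1 (true mention) if the span contains the word 'cancer'."""
    return 1 if "cancer" in span.lower().split() else 0

def LF_has_digit(span):
    """Vote -1 (false mention) if the span contains a digit."""
    return -1 if any(ch.isdigit() for ch in span) else 0

def majority_vote(span, lfs):
    """Sum the votes: positive sum -> 1, negative -> -1, tie or all abstain -> 0."""
    total = sum(lf(span) for lf in lfs)
    return (total > 0) - (total < 0)

lfs = [LF_has_cancer, LF_has_digit]
print(majority_vote("breast cancer", lfs))  # 1
print(majority_vote("p53", lfs))            # -1
print(majority_vote("the liver", lfs))      # 0
```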

In [None]:
import re
from lf_terms import *
from snorkel.lf_helpers import get_left_tokens, get_right_tokens

We also load some publicly-available biomedical dictionaries, which we will leverage in some of our LFs below as a source of weak supervision:

In [None]:
from utils import *

umls_dict              = load_umls_dictionary()
chemicals              = load_chemdner_dictionary()
abbrv2text, text2abbrv = load_specialist_abbreviations()

#### Document-Level Labeling Functions
We start with some labeling functions that label candidates based on document-level features.

In [None]:
from snorkel.lf_helpers import get_doc_candidate_spans

def LF_undefined_abbreviation(c):
    '''Candidate is a known abbreviation, but no corresponding full name in document'''
    doc_spans = get_doc_candidate_spans(c)
    phrase = c[0].get_span().lower()
    mentions = set([s.get_span().lower() for s in doc_spans])
    if len(phrase) > 1 and phrase in abbrv2text and not set(abbrv2text[phrase].keys()).intersection(mentions):
        return -1
    return 0

#### Sentence-Level Labeling Functions
We also include some labeling functions that label candidates based on sentence-level features.

In [None]:
from snorkel.lf_helpers import get_sent_candidate_spans

def LF_contiguous_mentions(c):
    '''Contiguous candidates are likely wrong'''
    neighbor_spans = get_sent_candidate_spans(c)
    start, end = c[0].get_word_start(), c[0].get_word_end()
    for s in neighbor_spans:
        if s.get_word_end() + 1 == start or s.get_word_start() - 1 == end:
            return -1
    return 0
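
The adjacency test in `LF_contiguous_mentions` can be checked in isolation on plain index pairs. This sketch assumes that word start and end indices are inclusive, which is how the `+ 1` and `- 1` offsets above read; `is_contiguous` is a hypothetical helper.

```python
def is_contiguous(span, neighbors):
    """True if any neighbor ends right before `span` starts or starts right after it ends.

    Spans are (word_start, word_end) tuples with inclusive word indices.
    """
    start, end = span
    return any(n_end + 1 == start or n_start - 1 == end
               for n_start, n_end in neighbors)

print(is_contiguous((3, 4), [(1, 2)]))  # True
print(is_contiguous((3, 4), [(6, 7)]))  # False
```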

#### Mention-Level Labeling Functions
We now define a number of labeling functions that label candidates based on attributes related to the mention.

In [None]:
from snorkel.lf_helpers import get_left_tokens, get_right_tokens

def LF_tumors_growths(c):
    phrase = " ".join(c[0].get_attrib_tokens('lemmas'))
    return 1 if re.search(r"^(\w* ){0,2}(['] )*(tumor|tumour|polyp|pilomatricoma|cyst|lipoma)$", phrase) else 0

def LF_cancer(c):
    '''<TYPE> cancer'''
    phrase = " ".join(c[0].get_attrib_tokens('lemmas'))
    return 1 if re.search(r"\w* cancer",phrase) else 0

def LF_disease_syndrome(c):
    '''<TYPE> disease or <TYPE> syndrome'''
    phrase = " ".join(c[0].get_attrib_tokens('lemmas'))
    return 1 if re.search(r"\w* (disease|syndrome)+",phrase) else 0

def LF_indicators(c):
    '''Indicator words'''
    return 1 if " ".join(c[0].get_attrib_tokens()).lower() in indicators else 0

def LF_common_disease(c):
    '''Common disease'''
    return 1 if " ".join(c[0].get_attrib_tokens()).lower() in common_disease else 0

*For a few more examples of LFs of this style that we'll use, see [Disease_Tagging_Tutorial_LFs.py](Disease_Tagging_Tutorial_LFs.py).*

#### Dictionary Labeling Functions
We can use existing dictionaries for distant supervision.

In [None]:
def LF_SNOWMED_CT_sign_or_symptom(c):
    return 1 if c[0].get_span() in umls_dict["snomedct"]["sign_or_symptom"] else 0

def LF_SNOWMED_CT_disease_or_syndrome(c):
    return 1 if c[0].get_span() in umls_dict["snomedct"]["disease_or_syndrome"] else 0

def LF_MESH_disease_or_syndrome(c):
    return 1 if c[0].get_span() in umls_dict["mesh"]["disease_or_syndrome"] else 0

def LF_MESH_sign_or_symptom(c):
    return 1 if c[0].get_span() in umls_dict["mesh"]["sign_or_symptom"] else 0

#### Negative Labeling Functions
When writing labeling functions, it is important to provide negative supervision in addition to positive supervision.

In [None]:
def LF_organs(c):
    phrase = " ".join(c[0].get_attrib_tokens()).lower()
    return -1 if phrase in organs else 0      

def LF_chemical_name(c):
    phrase = " ".join(c[0].get_attrib_tokens())
    return -1 if phrase in chemicals and not phrase.isupper() else 0

def LF_bodysym(c):
    phrase = " ".join(c[0].get_attrib_tokens()).lower()
    return -1 if phrase in bodysym else 0  

def LF_protein_chemical_abbrv(c):
    '''Gene/protein/chemical name'''
    lemma = " ".join(c[0].get_attrib_tokens('lemmas'))
    return -1 if re.search("\d+",lemma) else 0

def LF_base_pair_seq(c): 
    lemma = " ".join(c[0].get_attrib_tokens('lemmas'))
    return -1 if re.search("^[GACT]{2,}$",lemma) else 0
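
It can help to try the regular expressions behind these negative LFs on a few strings before applying them to candidates. The sample strings below are made up, and the patterns are copied from `LF_base_pair_seq` and `LF_protein_chemical_abbrv`.

```python
import re

BASE_PAIR_RE = re.compile(r"^[GACT]{2,}$")  # whole string is a run of G/A/C/T
DIGIT_RE = re.compile(r"\d+")               # any digit anywhere in the string

def check(text):
    """Return (looks like a base-pair sequence, contains a digit)."""
    return bool(BASE_PAIR_RE.search(text)), bool(DIGIT_RE.search(text))

for text in ["GATTACA", "GAT TACA", "p53", "asthma"]:
    print("%s -> %s" % (text, check(text)))
```

Note that the base-pair pattern is case-sensitive and anchored, so it only fires when the entire joined string is upper-case nucleotide letters.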

*For a few more examples of LFs of this style that we'll use, see [Disease_Tagging_Tutorial_LFs.py](Disease_Tagging_Tutorial_LFs.py).*

We maintain a list of all LFs for convenience.

In [None]:
from Disease_Tagging_Tutorial_LFs import *

LFs_doc = [LF_undefined_abbreviation]

LFs_sent = [LF_contiguous_mentions]

LFs_mention = [LF_tumors_growths,
               LF_cancer,
               LF_disease_syndrome,
               LF_indicators,
               LF_common_disease,
               LF_common_disease_acronyms,
               LF_deficiency_of,
               LF_positive_indicator,
               LF_left_positive_argument,
               LF_right_negative_argument,
               LF_medical_afixes,
               LF_adj_diseases
              ]

LFs_dicts =  [LF_SNOWMED_CT_sign_or_symptom,
              LF_SNOWMED_CT_disease_or_syndrome,
              LF_MESH_disease_or_syndrome,
              LF_MESH_sign_or_symptom
            ]

LFs_false = [LF_chemical_name,
             LF_organs,
             LF_bodysym,
             LF_protein_chemical_abbrv,
             LF_base_pair_seq,
             LF_too_vague,
             LF_neg_surfix,
             LF_non_common_disease,
             LF_non_disease_acronyms,
             LF_pos_in,
             LF_gene_chromosome_link,
             LF_right_window_incomplete,
             LF_negative_indicator
            ]

## Applying Labeling Functions

First we construct a `LabelManager`.

In [None]:
from snorkel.annotations import LabelManager

label_manager = LabelManager()

Next we use the `LabelManager` to apply the labeling functions to the training `CandidateSet`. We'll start with some of our labeling functions:

In [None]:
LFs = LFs_mention + LFs_dicts + LFs_false
%time L_train = label_manager.create(session, train, 'LF Labels', f=LFs)
L_train

**OR** load if we've already created:

In [None]:
%time L_train = label_manager.load(session, train, 'LF Labels')
L_train

We can also add or rerun a single labeling function (or more!) with the below command. Note that we set the argument `expand_key_set` to `True` to indicate that the set of matrix columns should be allowed to expand:

In [None]:
LFs_2   = LFs_doc + LFs_sent
L_train = label_manager.update(session, train, 'LF Labels', True, f=LFs_2)
L_train

We can view statistics about the resulting label matrix:

In [None]:
L_train.lf_stats()

## Fitting the Generative Model
We estimate the accuracies of the labeling functions without supervision by fitting the parameters of a generative model (the `GenerativeModel` class below) to the label matrix.

In [None]:
from snorkel.learning import GenerativeModel

gen_model = GenerativeModel()
gen_model.train(L_train)

In [None]:
gen_model.save(session, 'Generative Params')

We now apply the generative model to the training candidates.

In [None]:
train_marginals = gen_model.marginals(L_train)

## Training the Discriminative Model
We use the estimated probabilities to train a discriminative model that classifies each `Candidate` as a true or false mention.

In [None]:
from snorkel.learning import LogReg

disc_model = LogReg()
disc_model.train(F_train, train_marginals, n_iter=5000, rate=1e-3)
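
To show what training on probabilities rather than hard labels can look like, here is a small plain-Python logistic regression fitted to soft targets by gradient descent. It is a sketch under our own assumptions (toy data, made-up hyperparameters) and is not Snorkel's `LogReg` implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Predicted probability that the candidate x is a true mention."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train_logreg(X, marginals, n_iter=2000, rate=0.5):
    """Batch gradient descent on cross-entropy against soft targets in [0, 1]."""
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x, p in zip(X, marginals):
            err = predict(w, x) - p  # gradient of the loss w.r.t. the logit
            for k, xk in enumerate(x):
                grad[k] += err * xk
        w = [wi - rate * g / len(X) for wi, g in zip(w, grad)]
    return w

# Toy data: first column is a bias term, second a single binary feature.
X = [[1, 1], [1, 1], [1, 0], [1, 0]]
marginals = [0.9, 0.8, 0.2, 0.1]
w = train_logreg(X, marginals)
print([round(predict(w, x), 2) for x in X])
```

With soft targets the model is pulled toward the average marginal of candidates that share the same features, so uncertain training labels contribute less sharply than confident ones.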

In [None]:
disc_model.w.shape

In [None]:
%time disc_model.save(session, "Discriminative Params")

## Evaluating on the Development `CandidateSet`

First, we create features for the development set.

Note that we reuse the feature set created for training, because those are the only features for which we have learned parameters. Features that were not encountered during training, e.g., a token that does not appear in the training set, are ignored, because we do not have any information about them.

To do so with the `FeatureManager`, we call `update` with the new `CandidateSet`, the name of the training `AnnotationKeySet`, and the value `False` for the parameter `expand_key_set` to indicate that the `AnnotationKeySet` should not be expanded with new `Feature` keys encountered during processing.

In [None]:
%time F_dev = feature_manager.update(session, dev, 'Train Features', False)
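
The idea of a fixed key set can be sketched without Snorkel: columns are assigned from the training features only, and any development feature without a column is dropped. The feature names and helper functions below are hypothetical.

```python
def build_key_index(train_rows):
    """Assign a column index to every feature key seen in the training rows."""
    index = {}
    for row in train_rows:
        for key in sorted(row):
            index.setdefault(key, len(index))
    return index

def featurize(rows, key_index):
    """Map each row's features to (column, value) pairs, dropping unknown keys."""
    return [sorted((key_index[k], v) for k, v in row.items() if k in key_index)
            for row in rows]

train_rows = [{"WORD_renal": 1, "WORD_failure": 1}, {"WORD_cancer": 1}]
dev_rows = [{"WORD_renal": 1, "WORD_migraine": 1}]

key_index = build_key_index(train_rows)
print(featurize(dev_rows, key_index))  # "WORD_migraine" has no column and is dropped
```

This keeps the development matrix the same width as the training matrix, so the learned weight vector lines up with its columns.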

**OR** if we've already created one, we can simply load as follows:

In [None]:
%time F_dev = feature_manager.load(session, dev, 'Train Features')

Next, we load the development set labels and gold candidates we made in Part III.

In [None]:
L_gold_dev = label_manager.load(session, dev, "CDR Development Labels -- Gold")

In [None]:
gold_dev_set = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates -- Gold').one()

Now we can evaluate the discriminative model on the development set.

In [None]:
tp, fp, tn, fn = disc_model.score(F_dev, L_gold_dev, gold_dev_set)
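
Assuming the four return values are collections of candidates, as the viewer usage later in this notebook suggests, precision, recall, and F1 follow directly from their sizes. The sketch below uses made-up predictions and hypothetical helper names rather than Snorkel objects.

```python
def confusion_sets(pred, gold):
    """Split candidate indices into tp, fp, tn, fn given labels in {-1, 1}."""
    tp, fp, tn, fn = set(), set(), set(), set()
    for i, (p, g) in enumerate(zip(pred, gold)):
        if p == 1:
            (tp if g == 1 else fp).add(i)
        else:
            (fn if g == 1 else tn).add(i)
    return tp, fp, tn, fn

def prf1(tp, fp, fn):
    """Precision, recall and F1 from the sizes of the tp, fp and fn sets."""
    n_tp, n_fp, n_fn = len(tp), len(fp), len(fn)
    precision = n_tp / float(n_tp + n_fp) if n_tp + n_fp else 0.0
    recall = n_tp / float(n_tp + n_fn) if n_tp + n_fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred = [1, 1, -1, -1, 1]
gold = [1, -1, -1, 1, 1]
tp_toy, fp_toy, tn_toy, fn_toy = confusion_sets(pred, gold)
print(prf1(tp_toy, fp_toy, fn_toy))
```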

## Viewing Examples
After evaluating on the development `CandidateSet`, the labeling functions can be modified. Try changing the labeling functions to improve performance. You can view the true positives, false positives, true negatives, and false negatives using the `Viewer`.

In [None]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
import os
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(tp, session, annotator_name="Tutorial Part IV User")
else:
    sv = None

In [None]:
sv

Next, in Part V, we will test our model on the test `CandidateSet`.