# Disease Norm

In this example, we'll write an application to extract *mentions of* diseases from PubMed abstracts, using annotations from the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This five-part tutorial walks through constructing a model that classifies each _candidate_ disease mention as either true (it really is a mention of a disease) or false.

## Plan of action:

Two types of labeling functions (LFs):
1. TYPE I: Leveraging sources of weak supervision (WS), e.g. distant supervision (DS)
2. TYPE II: Expressing heuristics (e.g. magnifying user effort)

TYPE I:
- Need to break up MESH into subtrees and have each one be an LF!
- Need to provide negative signal

TYPE II:
- Conduct a "simulated expert" experiment: go through the data, label examples, and write LFs. What is the effective multiplier over binary labeling?
    * E.g. "renal failure"; add {"renal" -> "kidney"} to synonym map
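
A minimal sketch of the synonym-map idea from the TYPE II plan, assuming whitespace tokenization and a plain dict. The names `SYNONYMS` and `apply_synonyms` are hypothetical and not part of Snorkel; the only entry is the one from the "renal failure" example above.

```python
# Hypothetical synonym map; the single entry comes from the "renal failure" example.
SYNONYMS = {"renal": "kidney"}

def apply_synonyms(phrase, synonyms=SYNONYMS):
    """Replace each token that has a synonym-map entry; leave other tokens unchanged."""
    return " ".join(synonyms.get(tok, tok) for tok in phrase.lower().split())

normalized = apply_synonyms("Renal failure")
```

Normalizing a mention this way before a dictionary lookup lets one expert-supplied rule cover every phrase that contains the mapped word.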

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

from snorkel.models import candidate_subclass

Disease = candidate_subclass('Disease', ['disease'])

In [None]:
from snorkel.models import CandidateSet

train = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Training Candidates').one()
print len(train)
dev = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates').one()
print len(dev)

In [None]:
from snorkel.annotations import LabelManager

label_manager = LabelManager()

L_gold_train = label_manager.load(session, train, "CDR Training Label Set")
print L_gold_train.shape
L_gold_dev = label_manager.load(session, dev, "CDR Development Label Set")
print L_gold_dev.shape

# Writing some multinomial LFs

## TYPE I LF: Subsets of MESH dictionary

In [None]:
from cPickle import load
from utils import load_mesh_raw

MESH_to_CID = load(open('MESH_to_CID.pkl', 'rb'))
mesh_entries = load_mesh_raw('data/desc2017.xml')

## MESH exact match

In [None]:
from collections import defaultdict

mesh_tree = defaultdict(list)
for entry in mesh_entries:
    mid, tree_nums, terms = entry
    for tn in tree_nums:
        path = [tn[0]] + tn[1:].split(".")
        for term in terms:
            mesh_tree[term].append((mid, path))
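
To make the tree-number handling concrete, here is a small self-contained sketch of the same indexing step on toy data. The entries, IDs, and terms are invented for illustration, and `tree_number_to_path` is a hypothetical helper name; only the splitting rule is taken from the cell above.

```python
from collections import defaultdict

def tree_number_to_path(tn):
    """Split a tree number into [category letter, level code, level code, ...]."""
    return [tn[0]] + tn[1:].split(".")

# Invented entries in the (mid, tree_nums, terms) shape used by the cell above
toy_entries = [
    ("TOY_1", ["C12.777.419"], ["toy kidney disease"]),
    ("TOY_2", ["A05.810.453"], ["toy kidney"]),
]

# Index every term by the (mid, path) pairs it can refer to
toy_tree = defaultdict(list)
for mid, tree_nums, terms in toy_entries:
    for tn in tree_nums:
        for term in terms:
            toy_tree[term].append((mid, tree_number_to_path(tn)))
```

The first path element is the top-level category letter, which is what the labeling functions later use to separate disease branches from everything else.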

### Augmenting MESH with UMLS

In [None]:
mid_to_paths = defaultdict(list)
for entry in mesh_entries:
    mid, tree_nums, terms = entry
    for tn in tree_nums:
        path = [tn[0]] + tn[1:].split(".")
        mid_to_paths[mid].append(path)

In [None]:
len(mesh_tree)

In [None]:
with open('cui2mesh.tsv', 'rb') as f:
    for line in f:
        term, cui, mid = line.rstrip('\n').split('\t')
        for path in mid_to_paths[mid]:
            x = (mid, path)
            t = term.lower()
            if x not in mesh_tree[t]:
                mesh_tree[t].append(x)

In [None]:
len(mesh_tree)

In [None]:
POS_DEPTH = 2
NEG_DEPTH = 2

def LFG_MESH_exact(c):
    p = c.disease.get_span().lower()
    if p in mesh_tree:
        seen = set()
        for mid, path in mesh_tree[p]:
            
            # Why are we all of a sudden missing entries?
            if mid in MESH_to_CID:
                value = MESH_to_CID[mid] if path[0] in ['C', 'F'] else -1
                key   = "_".join(path[:POS_DEPTH]) if value > 0 else "_".join(path[:NEG_DEPTH])
                if key not in seen:
                    seen.add(key)
                    yield key, value
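
The key construction can be tried in isolation. The sketch below reproduces the logic of `LFG_MESH_exact` on invented data, assuming class IDs are positive integers; `path_labels`, the toy IDs, and the toy mapping are hypothetical.

```python
POS_DEPTH = 2
NEG_DEPTH = 2

def path_labels(entries, mesh_to_cid, pos_depth=POS_DEPTH, neg_depth=NEG_DEPTH):
    """Yield (LF key, label) pairs for the (mid, path) entries of one matched term."""
    seen = set()
    for mid, path in entries:
        if mid not in mesh_to_cid:
            continue  # skip IDs that have no class ID
        # Branches C and F count as disease; anything else is a negative vote
        value = mesh_to_cid[mid] if path[0] in ("C", "F") else -1
        depth = pos_depth if value > 0 else neg_depth
        key = "_".join(path[:depth])
        if key not in seen:  # emit each subtree key at most once per candidate
            seen.add(key)
            yield key, value

toy_entries = [
    ("TOY_1", ["C", "12", "777"]),
    ("TOY_1", ["C", "12", "200"]),
    ("TOY_2", ["A", "05", "810"]),
]
toy_mesh_to_cid = {"TOY_1": 7, "TOY_2": 9}
labels = list(path_labels(toy_entries, toy_mesh_to_cid))
```

Raising the depth splits the dictionary into more, narrower labeling functions; lowering it merges subtrees into fewer, broader ones.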

In [None]:
%time L_train = label_manager.create(session, train, 'LF Training Labels -- E2/2 + C2/2 + NEGS 2', f=LFG_MESH_exact)
L_train

In [None]:
# LOAD if already computed
L_train = label_manager.load(session, train, 'LF Training Labels -- MESH Exact + Cosine')
L_train

In [None]:
L_train.lf_stats(labels=L_gold_train)

### Drop JJs

In [None]:
import re

def drop_jjs(c):
    toks  = []
    words = c.disease.get_attrib_tokens()
    for i, tag in enumerate(c.disease.get_attrib_tokens('pos_tags')):
        if re.match(r'JJ.*', tag) is None:
            toks.append(words[i])
    return " ".join(toks).lower()
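
The same filtering can be checked without a Snorkel candidate. This sketch takes parallel lists of words and part-of-speech tags instead of a candidate object; `drop_adjectives` is a hypothetical stand-in for `drop_jjs`, and the tags are typed in by hand.

```python
import re

def drop_adjectives(words, pos_tags):
    """Return the lowercased phrase with every JJ* (adjective) token removed."""
    kept = [w for w, tag in zip(words, pos_tags) if re.match(r"JJ.*", tag) is None]
    return " ".join(kept).lower()

phrase = drop_adjectives(["Acute", "renal", "failure"], ["JJ", "JJ", "NN"])
```

Note that this also drops adjectives that carry meaning (here "renal"), so the shortened phrase can match a broader dictionary term than the original mention.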

In [None]:
POS_DEPTH = 2
NEG_DEPTH = 2

def LFG_MESH_exact_drop_JJs(c):
    p = drop_jjs(c)
    if p in mesh_tree:
        seen = set()
        for mid, path in mesh_tree[p]:
            value = MESH_to_CID[mid] if path[0] in ['C', 'F'] else -1
            key   = "_".join(path[:POS_DEPTH]) if value > 0 else "_".join(path[:NEG_DEPTH])
            if key not in seen:
                seen.add(key)
                yield key, value

In [None]:
%time L_train = label_manager.update(session, train, 'LF Training Labels -- E2/2 + C2/2 + NEGS', True, LFG_MESH_exact_drop_JJs)
L_train

## MESH TF-IDF cosine match

In [None]:
%%time
from entity_norm import CanonDictVectorizer

# Compile all terms into one dictionary
all_diseases = {}
for term, entries in mesh_tree.iteritems():
    if len(entries) > 0:
        mid, path = entries[0]  # Hack: should take the most frequent?
        all_diseases[term] = mid

# Create a vectorizer based around this 
cd_vectorizer = CanonDictVectorizer(all_diseases, other_phrases=[])

# Vectorize the dictionary
disease_phrases = []
disease_cids    = []
for term, mid in all_diseases.iteritems():
    disease_phrases.append(term)
    disease_cids.append(mid)   
D  = cd_vectorizer.vectorize_phrases(disease_phrases)
Dt = D.T
Dt

In [None]:
Dt

In [None]:
mesh_tree_index = []
for dp in disease_phrases:
    mesh_tree_index.append(mesh_tree[dp])

In [None]:
POS_DEPTH = 2
NEG_DEPTH = 2

#THRESHs = [0.5, 0.75]
THRESHs = [0.75]

def LFG_MESH_cosine(c):
    mt = min(THRESHs)
    
    # Vectorize the phrase
    p  = c.disease.get_span().lower()
    cx = cd_vectorizer.vectorize_phrases([p])
    m  = cx * Dt
    
    # Keep track of the highest-score match so far _for each LF_
    highest_score = defaultdict(float)
    
    # Iterate over non-zero dictionary term matches > THRESH
    # Note: converting to COO and iterating over the data directly is roughly an order of magnitude faster
    m = m.tocoo()
    for i, s in enumerate(m.data):
        if s > mt:
            j = m.col[i]
            for entry in mesh_tree_index[j]:
                mid, path = entry
                value     = MESH_to_CID[mid] if path[0] in ['C', 'F'] else -1
                
                # We define each LF by a tree path code
                key = "_".join(path[:POS_DEPTH]) + "_c" if value > 0 else "_".join(path[:NEG_DEPTH]) + "_c"
                
                # Only yield this value if higher than highest current emitted
                # Note: This will just update the current value in the DB
                if s > highest_score[key]:
                    for t in THRESHs:
                        if s > t:
                            highest_score[key] = s
                            yield key + "_%s" % t, value
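
The internals of `CanonDictVectorizer` are not shown in this notebook, so the following is only a sketch of TF-IDF cosine matching in general, assuming word-level tokens and a smoothed IDF. All function names and the toy dictionary are hypothetical.

```python
import math
from collections import Counter

def build_idf(phrases):
    """Smoothed inverse document frequency for every word in the dictionary phrases."""
    n = len(phrases)
    df = Counter(w for p in phrases for w in set(p.lower().split()))
    return {w: math.log((1.0 + n) / (1.0 + c)) + 1.0 for w, c in df.items()}

def tfidf_vector(phrase, idf):
    """L2-normalized TF-IDF vector as a dict; words outside the dictionary are dropped."""
    tf = Counter(w for w in phrase.lower().split() if w in idf)
    vec = {w: c * idf[w] for w, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm > 0 else {}

def cosine(u, v):
    """Dot product of two normalized sparse vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())

def matches_above(phrase, dictionary, idf, thresh=0.75):
    """Return (term, score) pairs whose cosine similarity to `phrase` exceeds `thresh`."""
    q = tfidf_vector(phrase, idf)
    scored = ((t, cosine(q, tfidf_vector(t, idf))) for t in dictionary)
    return [(t, s) for t, s in scored if s > thresh]

toy_dictionary = ["renal failure", "heart failure", "liver cirrhosis"]
toy_idf = build_idf(toy_dictionary)
hits = matches_above("acute renal failure", toy_dictionary, toy_idf)
```

Because out-of-dictionary words are dropped before normalization, a mention with an extra modifier can still match a dictionary term closely, which is the behavior the threshold is meant to control.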

In [None]:
%time L_train = label_manager.update(session, train, 'LF Training Labels -- E2/2 + C2/2 + NEGS', True, LFG_MESH_cosine)
L_train

In [None]:
# LOAD if already computed
L_train = label_manager.load(session, train, 'LF Training Labels -- MESH Exact + Cosine')
L_train

# Putting in some negative LFs

In [None]:
import re
from lf_terms import *
from snorkel.lf_helpers import get_left_tokens, get_right_tokens
from utils import *
from Disease_Tagging_Tutorial_LFs import *
chemicals = load_chemdner_dictionary()

def LF_organs(c):
    phrase = " ".join(c[0].get_attrib_tokens()).lower()
    return -1 if phrase in organs else 0      

def LF_chemical_name(c):
    phrase = " ".join(c[0].get_attrib_tokens())
    return -1 if phrase in chemicals and not phrase.isupper() else 0

def LF_bodysym(c):
    phrase = " ".join(c[0].get_attrib_tokens()).lower()
    return -1 if phrase in bodysym else 0  

def LF_protein_chemical_abbrv(c):
    '''Gene/protein/chemical name'''
    lemma = " ".join(c[0].get_attrib_tokens('lemmas'))
    return -1 if re.search("\d+",lemma) else 0

def LF_base_pair_seq(c): 
    lemma = " ".join(c[0].get_attrib_tokens('lemmas'))
    return -1 if re.search("^[GACT]{2,}$",lemma) else 0

LFs_false = [LF_chemical_name,
             LF_organs,
             LF_bodysym,
             LF_protein_chemical_abbrv,
             LF_base_pair_seq,
             LF_too_vague,
             LF_neg_surfix,
             LF_non_common_disease,
             LF_non_disease_acronyms,
             LF_pos_in,
             LF_gene_chromosome_link,
             LF_right_window_incomplete,
             LF_negative_indicator
            ]

In [None]:
%time L_train = label_manager.update(session, train, 'LF Training Labels -- E2/2 + C2/2 + NEGS', True, LFs_false)
L_train

In [None]:
# LOAD if already computed
L_train = label_manager.load(session, train, 'LF Training Labels -- E3/3 + C3/3')
L_train

# Different cut levels:

Note: G = generative model scored on the training set, D = discriminative model scored on the test set

* Pos: 1, Neg: 1, Pos-cosine: 1, Neg-cosine: 1, Thresh-cosine: 0.75 = 56 F1 G / 63 F1 D
* TODO...
* Pos: 3, Neg: 3, Pos-cosine: 3, Neg-cosine: 3, Thresh-cosine: 0.75 = 61 F1 G / 68 F1 D
* Pos: 3, Neg: 3, Pos-cosine: 3, Neg-cosine: 3, Thresh-cosine: [0.5, 0.75] = 63 F1 G / 65 F1 D
* Pos: 4, Neg: 4, Pos-cosine: 4, Neg-cosine: 4, Thresh-cosine: 0.75 = 60 F1 G / 64 F1 D

### Adding in drop_JJs + NEG LFs:

* Pos: 2, Neg: 2, Pos-cosine: 2, Neg-cosine: 2, Thresh-cosine: 0.75 = 69 F1 G / 71 F1 D
* Pos: 3, Neg: 3, Pos-cosine: 3, Neg-cosine: 3, Thresh-cosine: 0.75 = 70 F1 G / 73 F1 D


#### Note: we're not yet dealing with acronyms!!!

In [None]:
L_train.lf_stats(labels=L_gold_train)

# Running gen. model

In [None]:
from scipy.sparse import lil_matrix

def binarize_LF_matrix(X):
    X_b = lil_matrix(X.shape)
    for i, j in zip(*X.nonzero()):
        X_b[i,j] = np.sign(X[i,j])
    return X_b.tocsr()

In [None]:
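def get_score(predicted, gold):
    """Print precision, recall, and F1 of `predicted` against `gold` (gold labels > 0 are positive)."""
    tp = 0  # true positives
    pp = 0  # predicted positives
    p  = 0  # gold positives
    for i in range(gold.shape[0]):
        if gold[i] > 0:
            p += 1
        
        if predicted[i] == 1:
            pp += 1
            if gold[i] > 0:
                tp += 1
    
    # Guard against division by zero when nothing is predicted (or labeled) positive
    prec   = tp / float(pp) if pp > 0 else 0.0
    recall = tp / float(p) if p > 0 else 0.0
    f1     = (2*prec*recall) / (prec+recall) if prec + recall > 0 else 0.0
    print "P :\t", prec
    print "R :\t", recall
    print "F1:\t", f1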
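
As a worked example of the three scores, here is a sketch of a variant that returns the values instead of printing them, so the arithmetic can be checked on a tiny hand-made case. `precision_recall_f1` is a hypothetical name, and unlike `get_score` it treats any prediction greater than zero as positive.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1), treating labels > 0 as positive."""
    tp = sum(1 for y, g in zip(predicted, gold) if y > 0 and g > 0)
    pp = sum(1 for y in predicted if y > 0)   # predicted positives
    p = sum(1 for g in gold if g > 0)         # gold positives
    prec = tp / float(pp) if pp else 0.0
    rec = tp / float(p) if p else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

toy_pred = [1, 1, -1, 1, -1]
toy_gold = [1, -1, 1, 1, -1]
prec, rec, f1 = precision_recall_f1(toy_pred, toy_gold)
```

In this toy case two of the three predicted positives are correct and two of the three gold positives are found, so precision, recall, and F1 all equal 2/3.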

In [None]:
L_train_b = binarize_LF_matrix(L_train)
L_train_b

In [None]:
from snorkel.learning import NaiveBayes

gen_model = NaiveBayes()
%time gen_model.train(L_train_b, n_iter=5000, rate=1e-1, verbose=True)

In [None]:
yp = gen_model.predict(L_train_b)
get_score(yp, L_gold_train)

In [None]:
rate    = [1e-1, 1e-2]
precs   = [0.80, 0.69]
recalls = [0.61, 0.65]

In [None]:
from snorkel.learning import odds_to_prob

# Compare empirical LF accuracies with the accuracies estimated by the generative model
L_train.lf_stats(labels=L_gold_train, est_accs=odds_to_prob(gen_model.w))

# Error analysis

In [None]:
len(train)

In [None]:
fps = []
fns = []
for i,c in enumerate(train):
    if L_gold_train[i] < 0 and yp[i] > 0:
        fps.append(c)
    elif L_gold_train[i] > 0 and yp[i] <= 0:
        fns.append(c)
print "FPs:", len(fps)
print "FNs:", len(fns)

from random import shuffle
shuffle(fps)
shuffle(fns)

In [None]:
# Index the gold MESH IDs
CID_to_MESH = {}
for mid, cid in MESH_to_CID.iteritems():
    CID_to_MESH[cid] = mid
    
mesh_label_by_candidate_id = {}
for i, c in enumerate(train):
    l = int(L_gold_train[i,0])
    mesh_label_by_candidate_id[c.id] = CID_to_MESH[l] if l > 0 else l

In [None]:
mesh_to_terms = defaultdict(set)
for term, entries in mesh_tree.iteritems():
    for entry in entries:
        mid, path = entry
        mesh_to_terms[mid].add(term)

In [None]:
from snorkel.viewer import SentenceNgramViewer

sv = SentenceNgramViewer(fns[:100], session)

In [None]:
sv

# Experiments to run:
* Clean up!
* Partial matches (first k words)
* TF-IDF at various thresholds
* Split dictionary more!
* Remove JJ?
* **NEED TO CORRECT FOR GOLD ANNOTATIONS NOT IN OUR CANDIDATE SET!!!**

In [None]:
c = sv.get_selected()
c

In [None]:
mid = mesh_label_by_candidate_id[sv.get_selected().id]
print mid
mesh_to_terms[mid]

In [None]:
list(LFG_MESH_exact(c))

In [None]:
list(LFG_MESH_cosine(c))

### Notes:
* Looking at FNs:

## Automatically Creating Features
Recall that our goal is to distinguish between true and false mentions of diseases. To train a model for this task, we first embed our `Disease` candidates in a feature space.

In [None]:
from snorkel.annotations import FeatureManager

feature_manager = FeatureManager()

We can create a new feature set:

In [None]:
%time F_train = feature_manager.create(session, train, 'Train Features')

**OR** if we've already created one, we can simply load as follows:

In [None]:
%time F_train = feature_manager.load(session, train, 'Train Features')

Note that the returned matrix is a special subclass of the `scipy.sparse.csr_matrix` class, with some special features which we demonstrate below:

In [None]:
F_train

In [None]:
%time F_dev = feature_manager.update(session, dev, 'Train Features', False)

In [None]:
%time F_dev = feature_manager.load(session, dev, 'Train Features')

In [None]:
F_train.get_candidate(0)

In [None]:
F_train.get_key(0)

In [None]:
from snorkel.learning import LogReg

train_marginals = gen_model.marginals(L_train_b)

disc_model = LogReg()
disc_model.train(F_train, train_marginals, n_iter=2000, rate=1e-3, mu=1e-4)

In [None]:
yp = disc_model.predict(F_train)
get_score(yp, L_gold_train)

In [None]:
yp = disc_model.predict(F_dev, b=0.4)
get_score(yp, L_gold_dev)

In [None]:
import matplotlib.pyplot as plt

plt.hist(disc_model.marginals(F_dev))

### Quickly checking against scikit-learn (with hard-thresholded marginals)

In [None]:
from sklearn.linear_model import LogisticRegression

covered = np.where(np.abs(train_marginals - 0.5) > 1e-3)[0]
tms     = train_marginals[covered]
X_cov   = F_train[covered]
tms     = np.array([1.0 if x > 0.5 else 0.0 for x in tms])

model = LogisticRegression(C=1e2, fit_intercept=True)

%time model.fit(X_cov, tms)
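
The filtering and thresholding step above can be illustrated on plain lists. This is a sketch with invented marginals; `hard_threshold` is a hypothetical helper, and `eps` plays the role of the `1e-3` cutoff used in the cell.

```python
def hard_threshold(marginals, eps=1e-3):
    """Keep rows whose marginal is not ~0.5 and convert them to 0/1 labels.

    Returns (covered row indexes, hard labels for those rows).
    """
    covered = [i for i, m in enumerate(marginals) if abs(m - 0.5) > eps]
    labels = [1.0 if marginals[i] > 0.5 else 0.0 for i in covered]
    return covered, labels

toy_marginals = [0.5, 0.9, 0.2, 0.5004]
covered_idx, hard_labels = hard_threshold(toy_marginals)
```

Rows with a marginal of about 0.5 carry no signal from any labeling function, so they are excluded rather than assigned an arbitrary class.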

In [None]:
yp_sk = model.predict(F_train)
get_score(yp_sk, L_gold_train)

In [None]:
yp_sk = model.predict(F_dev)
get_score(yp_sk, L_gold_dev)

In [None]:
plt.hist(train_marginals)

In [None]:
from snorkel.learning.gen_learning import odds_to_prob

plt.hist(odds_to_prob(gen_model.w))

In [None]:
L_train.lf_stats(labels=L_gold_train, est_accs=odds_to_prob(gen_model.w))

# ERROR ANALYSIS

In [None]:
CID_to_MESH = {}
for mid, cid in MESH_to_CID.iteritems():
    CID_to_MESH[cid] = mid

In [None]:
from random import shuffle
N_dev = L_gold_dev.shape[0]

fps = []
fns = []
for i in range(N_dev):
    if yp[i] > 0 and L_gold_dev[i] < 0:
        fps.append(i)
    elif yp[i] < 0 and L_gold_dev[i] > 0:
        fns.append(i)

shuffle(fps)
shuffle(fns)

print len(fps)
print len(fns)

In [None]:
fn_cands = [F_dev.get_candidate(i) for i in fns[:100]]
svn      = SentenceNgramViewer(fn_cands, session)
svn

In [None]:
exact_match = 0
for i in fns:
    c = F_dev.get_candidate(i)
    if c.disease.get_span().lower() in mesh_tree:
        exact_match += 1

In [None]:
exact_match

In [None]:
c = svn.get_selected()

mesh_tree[c.disease.get_span()]

In [None]:
c.disease.get_attrib_tokens('pos_tags')

In [None]:
mesh_tree['alcohol abuse']

In [None]:
from snorkel.models import Label

l = session.query(Label).filter(Label.candidate == c).one()
CID_to_MESH[l.value]

In [None]:
i = F_dev.get_row_index(c)
[(F_dev.get_key(k), disc_model.w[k]) for k in F_dev.getrow(i).nonzero()[1]]

In [None]:
F_dev.get_key(1)

* Why is Parkinson's disease not caught?

In [None]:
from snorkel.viewer import SentenceNgramViewer
fp_cands = [F_dev.get_candidate(i) for i in fps[:100]]
sv       = SentenceNgramViewer(fp_cands, session)
sv

In [None]:
NEG_PHRASES = [
    'stenosis',
    'further attention',
    'presence',
    'absence',
    'syndrome',
    'association',
    'strain',
    'progression'
]

NEG_END_WORDS = [
    'therapies',
    'muscles',
    'concentrations',
    'normal',
    'heart',
    'side',
    'sinus',
    'convulsants',
    'latencies',
    'findings',
    'doses',
    'remission'
]

def end_in_plural(c):
    pass

def body_part(c):
    pass

def not_exact_single_word(d):
    pass
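
One way the word lists above could become negative labeling functions is sketched below. It operates on plain span strings rather than candidates, uses shortened copies of the lists, and assumes that phrases are matched against the whole span while end words are matched against the last token; both matching rules and both function names are assumptions.

```python
# Shortened copies of the lists above, kept small for illustration
NEG_PHRASES_TOY = {"presence", "absence", "progression"}
NEG_END_WORDS_TOY = {"therapies", "doses", "findings"}

def label_negative_phrase(span):
    """Vote -1 when the whole span is a known non-disease phrase, else abstain (0)."""
    return -1 if span.lower() in NEG_PHRASES_TOY else 0

def label_negative_end_word(span):
    """Vote -1 when the span's last word is a known non-disease head word, else abstain (0)."""
    words = span.lower().split()
    return -1 if words and words[-1] in NEG_END_WORDS_TOY else 0
```

In the notebook these would be called with the candidate's span text, for example the value of `c.disease.get_span()`.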

# TODO:

1. _DONE: Add negative labels to candidates..._
2. _DONE: Get empirical LF accs up and running..._
3. _DONE: Binarize LFs + run in binary gen model_
4. _DONE: Add TF-IDF matching LFs_
5. _DONE: Add in simple DDLIB + WS feats from new-features -> run disc. model_
6. **Conduct TYPE II "experiment"!**

## Creating Labeling Functions
Labeling functions are a core tool of data programming. They are heuristic functions that aim to classify candidates correctly. Their outputs will be automatically combined and denoised to estimate the probabilities of training labels for the training data.
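
To show the shape of the data involved, the sketch below applies three toy labeling functions to plain strings and combines their votes by simple majority. Majority vote is only a stand-in baseline here, not the denoising procedure used later in this notebook, and every name in the block is hypothetical.

```python
def lf_contains_cancer(span):
    return 1 if "cancer" in span.lower() else 0

def lf_has_digit(span):
    return -1 if any(ch.isdigit() for ch in span) else 0

def lf_ends_in_disease(span):
    return 1 if span.lower().endswith("disease") else 0

TOY_LFS = [lf_contains_cancer, lf_has_digit, lf_ends_in_disease]

def label_matrix(spans, lfs):
    """One row per candidate, one column per LF; entries are 1, -1, or 0 (abstain)."""
    return [[lf(s) for lf in lfs] for s in spans]

def majority_vote(row):
    """Sign of the summed votes; 0 when the LFs abstain or cancel out."""
    total = sum(row)
    return 1 if total > 0 else (-1 if total < 0 else 0)

spans = ["breast cancer", "p53", "Parkinson's disease", "heart"]
L = label_matrix(spans, TOY_LFS)
votes = [majority_vote(r) for r in L]
```

The last span receives no votes at all, which is why coverage is reported alongside accuracy when inspecting a label matrix.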

In [None]:
import re
from lf_terms import *
from snorkel.lf_helpers import get_left_tokens, get_right_tokens
from utils import *


We also load some publicly-available biomedical dictionaries, which we will leverage in some of our LFs below as a source of weak supervision:

In [None]:
from utils import *

umls_dict              = load_umls_dictionary()
chemicals              = load_chemdner_dictionary()
abbrv2text, text2abbrv = load_specialist_abbreviations()

#### Document-Level Labeling Functions
We start with some labeling functions that label candidates based on document-level features.

In [None]:
from snorkel.lf_helpers import get_doc_candidate_spans

def LF_undefined_abbreviation(c):
    '''Candidate is a known abbreviation, but no corresponding full name in document'''
    doc_spans = get_doc_candidate_spans(c)
    phrase = c[0].get_span().lower()
    mentions = set([s.get_span().lower() for s in doc_spans])
    if len(phrase) > 1 and phrase in abbrv2text and not set(abbrv2text[phrase].keys()).intersection(mentions):
        return -1
    return 0

#### Sentence-Level Labeling Functions
We also include some labeling functions that label candidates based on sentence-level features.

In [None]:
from snorkel.lf_helpers import get_sent_candidate_spans

def LF_contiguous_mentions(c):
    '''Contiguous candidates are likely wrong'''
    neighbor_spans = get_sent_candidate_spans(c)
    start, end = c[0].get_word_start(), c[0].get_word_end()
    for s in neighbor_spans:
        if s.get_word_end() + 1 == start or s.get_word_start() - 1 == end:
            return -1
    return 0
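
The adjacency test in `LF_contiguous_mentions` can be checked on bare index pairs. This sketch assumes spans are `(word_start, word_end)` pairs with inclusive word indexes, which is an assumption about what `get_word_start()` and `get_word_end()` return; `is_contiguous` is a hypothetical helper.

```python
def is_contiguous(span, neighbors):
    """True if any neighbor ends right before `span` starts or starts right after it ends.

    Spans are (word_start, word_end) pairs with inclusive word indexes.
    """
    start, end = span
    return any(n_end + 1 == start or n_start - 1 == end for n_start, n_end in neighbors)
```

For example, a span covering words 3 to 4 is contiguous with a neighbor covering words 1 to 2 or words 5 to 6, but not with one covering words 7 to 8.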

#### Mention-Level Labeling Functions
We now define a number of labeling functions that label candidates based on attributes related to the mention.

In [None]:
from snorkel.lf_helpers import get_left_tokens, get_right_tokens

def LF_tumors_growths(c):
    phrase = " ".join(c[0].get_attrib_tokens('lemmas'))
    return 1 if re.search("^(\w* ){0,2}(['] )*(tumor|tumour|polyp|pilomatricoma|cyst|lipoma)$", phrase) else 0

def LF_cancer(c):
    '''<TYPE> cancer'''
    phrase = " ".join(c[0].get_attrib_tokens('lemmas'))
    return 1 if re.search("\w* cancer",phrase) else 0

def LF_disease_syndrome(c):
    '''<TYPE> disease or <TYPE> syndrome'''
    phrase = " ".join(c[0].get_attrib_tokens('lemmas'))
    return 1 if re.search("\w* (disease|syndrome)+",phrase) else 0

def LF_indicators(c):
    '''Indicator words'''
    return 1 if " ".join(c[0].get_attrib_tokens()).lower() in indicators else 0

def LF_common_disease(c):
    '''Common disease'''
    return 1 if " ".join(c[0].get_attrib_tokens()).lower() in common_disease else 0

*For a few more examples of LFs of this style that we'll use, see [Disease_Tagging_Tutorial_LFs.py](Disease_Tagging_Tutorial_LFs.py).*

#### Dictionary Labeling Functions
We can use existing dictionaries for distant supervision.

In [None]:
def LF_SNOWMED_CT_sign_or_symptom(c):
    return 1 if c[0].get_span() in umls_dict["snomedct"]["sign_or_symptom"] else 0

def LF_SNOWMED_CT_disease_or_syndrome(c):
    return 1 if c[0].get_span() in umls_dict["snomedct"]["disease_or_syndrome"] else 0

def LF_MESH_disease_or_syndrome(c):
    return 1 if c[0].get_span() in umls_dict["mesh"]["disease_or_syndrome"] else 0

def LF_MESH_sign_or_symptom(c):
    return 1 if c[0].get_span() in umls_dict["mesh"]["sign_or_symptom"] else 0

#### Negative Labeling Functions
When writing labeling functions, it is important to provide negative supervision in addition to positive supervision.

In [None]:
def LF_organs(c):
    phrase = " ".join(c[0].get_attrib_tokens()).lower()
    return -1 if phrase in organs else 0      

def LF_chemical_name(c):
    phrase = " ".join(c[0].get_attrib_tokens())
    return -1 if phrase in chemicals and not phrase.isupper() else 0

def LF_bodysym(c):
    phrase = " ".join(c[0].get_attrib_tokens()).lower()
    return -1 if phrase in bodysym else 0  

def LF_protein_chemical_abbrv(c):
    '''Gene/protein/chemical name'''
    lemma = " ".join(c[0].get_attrib_tokens('lemmas'))
    return -1 if re.search("\d+",lemma) else 0

def LF_base_pair_seq(c): 
    lemma = " ".join(c[0].get_attrib_tokens('lemmas'))
    return -1 if re.search("^[GACT]{2,}$",lemma) else 0

*For a few more examples of LFs of this style that we'll use, see [Disease_Tagging_Tutorial_LFs.py](Disease_Tagging_Tutorial_LFs.py).*

We maintain a list of all LFs for convenience.

In [None]:
from Disease_Tagging_Tutorial_LFs import *

LFs_doc = [LF_undefined_abbreviation]

LFs_sent = [LF_contiguous_mentions]

LFs_mention = [LF_tumors_growths,
               LF_cancer,
               LF_disease_syndrome,
               LF_indicators,
               LF_common_disease,
               LF_common_disease_acronyms,
               LF_deficiency_of,
               LF_positive_indicator,
               LF_left_positive_argument,
               LF_right_negative_argument,
               LF_medical_afixes,
               LF_adj_diseases
              ]

LFs_dicts =  [LF_SNOWMED_CT_sign_or_symptom,
              LF_SNOWMED_CT_disease_or_syndrome,
              LF_MESH_disease_or_syndrome,
              LF_MESH_sign_or_symptom
            ]

LFs_false = [LF_chemical_name,
             LF_organs,
             LF_bodysym,
             LF_protein_chemical_abbrv,
             LF_base_pair_seq,
             LF_too_vague,
             LF_neg_surfix,
             LF_non_common_disease,
             LF_non_disease_acronyms,
             LF_pos_in,
             LF_gene_chromosome_link,
             LF_right_window_incomplete,
             LF_negative_indicator
            ]

## Applying Labeling Functions

First we construct a `LabelManager`.

In [None]:
from snorkel.annotations import LabelManager

label_manager = LabelManager()

Next we use the `LabelManager` to apply the labeling functions to the training `CandidateSet`. We'll start with some of our labeling functions:

In [None]:
LFs = LFs_mention + LFs_dicts + LFs_false
%time L_train = label_manager.create(session, train, 'LF Labels', f=LFs)
L_train

**OR** load if we've already created:

In [None]:
%time L_train = label_manager.load(session, train, 'LF Labels')
L_train

We can also add or rerun a single labeling function (or more!) with the below command. Note that we set the argument `expand_key_set` to `True` to indicate that the set of matrix columns should be allowed to expand:

In [None]:
LFs_2   = LFs_doc + LFs_sent
L_train = label_manager.update(session, train, 'LF Labels', True, f=LFs_2)
L_train

We can view statistics about the resulting label matrix:

In [None]:
L_train.lf_stats()

## Fitting the Generative Model
We estimate the accuracies of the labeling functions without supervision. Specifically, we estimate the parameters of a generative model over the labeling function outputs, here Snorkel's `GenerativeModel`.

In [None]:
from snorkel.learning import GenerativeModel

gen_model = GenerativeModel()
gen_model.train(L_train)

In [None]:
gen_model.save(session, 'Generative Params')

We now apply the generative model to the training candidates.

In [None]:
train_marginals = gen_model.marginals(L_train)
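
How marginals follow from labeling function accuracies can be sketched for the simplest case. The block below assumes a naive Bayes-style model in which each weight is the log-odds of that labeling function's accuracy, which is what the `odds_to_prob(gen_model.w)` calls elsewhere in this notebook suggest; the actual `GenerativeModel` may be parameterized differently. All names and accuracies are invented.

```python
import math

def odds_to_prob_toy(log_odds):
    """Map a log-odds value to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def naive_bayes_marginal(row, weights):
    """P(y = 1 | LF votes) when LF j votes in {-1, 0, 1} with log-odds accuracy weights[j]."""
    return odds_to_prob_toy(sum(l * w for l, w in zip(row, weights)))

# Two hypothetical LFs with accuracies 0.9 and 0.6
toy_weights = [math.log(0.9 / 0.1), math.log(0.6 / 0.4)]
m_agree = naive_bayes_marginal([1, 1], toy_weights)      # both vote positive
m_conflict = naive_bayes_marginal([1, -1], toy_weights)  # the accurate LF wins
m_abstain = naive_bayes_marginal([0, 0], toy_weights)    # no evidence either way
```

Under these assumptions a conflict between an accurate and an inaccurate labeling function still yields a marginal well above 0.5, whereas a majority vote would call it a tie.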

## Training the Discriminative Model
We use the estimated probabilities to train a discriminative model that classifies each `Candidate` as a true or false mention.

In [None]:
from snorkel.learning import LogReg

disc_model = LogReg()
disc_model.train(F_train, train_marginals, n_iter=5000, rate=1e-3)

In [None]:
disc_model.w.shape

In [None]:
%time disc_model.save(session, "Discriminative Params")

## Evaluating on the Development `CandidateSet`

First, we create features for the development set.

Note that we reuse the training feature set, because those are the only features for which we have learned parameters. Features that were not encountered during training (e.g., a token that does not appear in the training set) are ignored, because we have no information about them.

To do so with the `FeatureManager`, we call `update` with the new `CandidateSet`, the name of the training `AnnotationKeySet`, and the value `False` for the parameter `expand_key_set`, which indicates that the `AnnotationKeySet` should not be expanded with new `Feature` keys encountered during processing.
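
The effect of freezing the key set can be sketched without Snorkel. In the block below, feature names are invented strings, and `build_key_index` and `featurize` are hypothetical helpers that mimic the idea of a fixed column index, not the `FeatureManager` implementation.

```python
def build_key_index(train_feature_lists):
    """Assign a column index to every feature name seen in training, in first-seen order."""
    index = {}
    for feats in train_feature_lists:
        for f in feats:
            index.setdefault(f, len(index))
    return index

def featurize(feature_list, key_index):
    """Return sorted column indexes for known features; unseen features are ignored."""
    return sorted({key_index[f] for f in feature_list if f in key_index})

train_feats = [["WORD_renal", "WORD_failure"], ["WORD_heart", "WORD_failure"]]
key_index = build_key_index(train_feats)

# "WORD_hepatic" never appeared in training, so it gets no column
dev_row = featurize(["WORD_failure", "WORD_hepatic"], key_index)
```

Keeping the column index fixed guarantees that the development matrix has exactly the columns the learned weight vector expects.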

In [None]:
%time F_dev = feature_manager.update(session, dev, 'Train Features', False)

**OR** if we've already created one, we can simply load as follows:

In [None]:
%time F_dev = feature_manager.load(session, dev, 'Train Features')

Next, we load the development set labels and gold candidates we made in Part III.

In [None]:
L_gold_dev = label_manager.load(session, dev, "CDR Development Labels -- Gold")

In [None]:
gold_dev_set = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates -- Gold').one()

Now we can evaluate the discriminative model on the development set.

In [None]:
tp, fp, tn, fn = disc_model.score(F_dev, L_gold_dev, gold_dev_set)

## Viewing Examples
After evaluating on the development `CandidateSet`, the labeling functions can be modified. Try changing the labeling functions to improve performance. You can view the true positives, false positives, true negatives, and false negatives using the `Viewer`.

In [None]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
import os
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(tp, session, annotator_name="Tutorial Part IV User")
else:
    sv = None

In [None]:
sv

Next, in Part V, we will test our model on the test `CandidateSet`.