# Categorical Variables in Snorkel

This is a short tutorial on using categorical variables (i.e. variables that can take more than two values) in Snorkel. We'll use a toy scenario with three sentences and two labeling functions (LFs) to demonstrate the mechanics. Please see the main tutorial for a more comprehensive introduction!

We'll **highlight in bold all parts focusing on the categorical aspect.**

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np

from snorkel import SnorkelSession
session = SnorkelSession()

## Step 1: Preprocessing the data

In [2]:
from snorkel.parser import TSVDocPreprocessor, CorpusParser

doc_preprocessor = TSVDocPreprocessor('data/categorical_example.tsv') 
corpus_parser = CorpusParser()
%time corpus_parser.apply(doc_preprocessor)

Clearing existing...
Running UDF...
CPU times: user 54.9 ms, sys: 15.8 ms, total: 70.7 ms
Wall time: 8.39 s


## Step 2: Defining candidates

We'll define candidate relations between person mentions **that now can take on one of three values:**
```python
['Married', 'Employs', False]
```
Note the importance of including a value for "not a relation of interest": here we've used `False`, but any other value would work.
Also note that `None` is a protected value that denotes a labeling function abstaining, so it cannot be used as a categorical value.
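To make the two rules above concrete, here is a minimal sketch in plain Python. It does not use Snorkel: `build_value_index` is a hypothetical helper, and the 1-based numbering (with `0` left free for abstaining) is an assumption based on the label matrix shown later in this tutorial, where `'Married'` appears as 1 and `'Employs'` as 2.

```python
def build_value_index(values):
    """Pair each categorical value with a 1-based index (0 is kept free for 'abstain').

    Hypothetical helper for illustration only; not part of Snorkel.
    """
    if any(v is None for v in values):
        raise ValueError("None is reserved for abstaining and cannot be a value")
    # A list of pairs (rather than a dict) avoids surprises from False == 0 in Python.
    return [(v, i) for i, v in enumerate(values, start=1)]

value_index = build_value_index(['Married', 'Employs', False])
print(value_index)  # [('Married', 1), ('Employs', 2), (False, 3)]
```

Passing a list that contains `None` raises a `ValueError` in this sketch, which mirrors the restriction described above.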

In [3]:
from snorkel.models import candidate_subclass
Spouse = candidate_subclass('Spouse', ['person1', 'person2'], values=['Married', 'Employs', False])

Now we extract candidates the same way as in the Intro Tutorial (slightly simplified here):

In [4]:
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import PersonMatcher
from snorkel.models import Sentence

# Define a Person-Person candidate extractor
ngrams = Ngrams(n_max=3)
person_matcher = PersonMatcher(longest_match_only=True)
cand_extractor = CandidateExtractor(
    Spouse, 
    [ngrams, ngrams],
    [person_matcher, person_matcher],
    symmetric_relations=False
)

# Apply to all (three) of the sentences for this simple example
sents = session.query(Sentence).all()

# Run the candidate extractor
%time cand_extractor.apply(sents, split=0)

Clearing existing...
Running UDF...

CPU times: user 22.9 ms, sys: 3.69 ms, total: 26.5 ms
Wall time: 31.3 ms


In [5]:
train_cands = session.query(Spouse).filter(Spouse.split == 0).all()
print("Number of candidates: {}".format(len(train_cands)))

Number of candidates: 3


In [6]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-else statement only prevents the viewer from opening during
# automated testing of this notebook. You can ignore it!
# (`os` was already imported in the first cell.)
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(train_cands, session)
else:
    sv = None


In [7]:
sv

## Step 3: Writing Labeling Functions

**The _categorical_ labeling functions (LFs) we now write can output the following values:**

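* Abstain: `None` or `0`
* Categorical values: the literal values in `Spouse.values` or their integer indices. The indices start at 1, since `0` is reserved for abstaining (in the label matrix below, `'Married'` is stored as 1 and `'Employs'` as 2).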

We'll write two simple LFs to illustrate.

*Tip: we can get the example highlighted in the viewer above via `sv.get_selected()`, and then use this to test as we write the LFs!*
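The output conventions listed above can be sketched as a small normalization function. This is an illustration in plain Python under stated assumptions, not a description of Snorkel's internals: `to_label_index` is a hypothetical name, and the explicit `bool` check is our own way of keeping the literal value `False` distinct from the integer `0`, which Python otherwise treats as equal.

```python
VALUES = ['Married', 'Employs', False]

def to_label_index(output, values=VALUES):
    """Convert an LF output to an integer label: 0 = abstain, 1..K = categorical value.

    Hypothetical helper for illustration only; not part of Snorkel.
    """
    if output is None:
        return 0
    # bool is a subclass of int, so exclude it here: the literal False is a
    # categorical value in this example, not the integer 0.
    if isinstance(output, int) and not isinstance(output, bool):
        if not 0 <= output <= len(values):
            raise ValueError("index out of range: %r" % (output,))
        return output
    # Otherwise, look up the literal value (matching on type as well as equality).
    for i, v in enumerate(values, start=1):
        if type(v) is type(output) and v == output:
            return i
    raise ValueError("unknown value: %r" % (output,))

print(to_label_index(None))       # 0
print(to_label_index('Married'))  # 1
print(to_label_index(2))          # 2
print(to_label_index(False))      # 3
```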

In [8]:
import re
from snorkel.lf_helpers import get_between_tokens

# Getting an example candidate from the Viewer
c = sv.get_selected()

# Traversing the context hierarchy...
print(c.get_contexts()[0].get_parent().text)

# Using a helper function
get_between_tokens(c)

John is married to Susan.


[u'is', u'married', u'to']

In [9]:
def LF_married(c):
    return 'Married' if 'married' in get_between_tokens(c) else None

WORKPLACE_RGX = r'employ|boss|company'
def LF_workplace(c):
    sent = c.get_contexts()[0].get_parent()
    matches = re.search(WORKPLACE_RGX, sent.text)
    return 'Employs' if matches else None

LFs = [
    LF_married,
    LF_workplace
]
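To check the logic of these two LFs without a Snorkel session, the sketch below re-implements them on plain Python inputs. The `_text` variants are hypothetical stand-ins that take a token list or a sentence string instead of a candidate, and every sentence other than "John is married to Susan." is made up for illustration.

```python
import re

WORKPLACE_RGX = r'employ|boss|company'

def lf_married_text(between_tokens):
    # Same rule as LF_married, applied to a plain list of tokens.
    return 'Married' if 'married' in between_tokens else None

def lf_workplace_text(sentence_text):
    # Same rule as LF_workplace, applied to a plain sentence string.
    return 'Employs' if re.search(WORKPLACE_RGX, sentence_text) else None

print(lf_married_text(['is', 'married', 'to']))        # Married
print(lf_workplace_text('John is married to Susan.'))  # None
print(lf_workplace_text('Ann is the boss of Bob.'))    # Employs (hypothetical sentence)
print(lf_workplace_text('Employs many people.'))       # None: the regex is case-sensitive
```

The last call shows one thing to watch for: `re.search` is case-sensitive by default, so a sentence-initial "Employs" does not match `employ` unless a flag such as `re.IGNORECASE` is passed.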

Now we apply the LFs to the candidates to produce our label matrix $L$:

In [10]:
from snorkel.annotations import LabelAnnotator

labeler = LabelAnnotator(lfs=LFs)
%time L_train = labeler.apply(split=0)
L_train

Clearing existing...
Running UDF...

CPU times: user 44.2 ms, sys: 5.93 ms, total: 50.2 ms
Wall time: 48.3 ms


<3x2 sparse matrix of type '<type 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [11]:
L_train.todense()

matrix([[ 1.,  0.],
        [ 0.,  2.],
        [ 1.,  2.]])
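The dense matrix above can be read back into LF votes. The sketch below assumes that the columns follow the order of the `LFs` list and that a nonzero entry `k` refers to the `k`-th value in `Spouse.values` (1-based), which is consistent with the output shown; it uses plain NumPy and a hypothetical `decode_row` helper rather than any Snorkel API.

```python
import numpy as np

VALUES = ['Married', 'Employs', False]
LF_NAMES = ['LF_married', 'LF_workplace']  # assumed column order

# The dense label matrix printed above.
L = np.array([[1., 0.],
              [0., 2.],
              [1., 2.]])

def decode_row(row):
    """Return {LF name: categorical value} for every LF that did not abstain."""
    votes = {}
    for name, entry in zip(LF_NAMES, row):
        idx = int(entry)
        if idx != 0:  # 0 means the LF abstained
            votes[name] = VALUES[idx - 1]
    return votes

decoded = [decode_row(row) for row in L]
for i, votes in enumerate(decoded):
    print(i, votes)
```

Under these assumptions, the third candidate receives conflicting votes (`'Married'` from one LF and `'Employs'` from the other), and the four non-abstaining entries match the "4 stored elements" reported for the sparse matrix.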

## Step 4: Training the Generative Model

In [12]:
# TODO

Next, we save the training marginals. Since the generative model is not trained in this example, we use dummy values:

In [13]:
# DUMMY VALUES: N x K marginals matrix
training_marginals = np.random.random((3, 3))
training_marginals /= training_marginals.sum(axis=1).reshape(-1,1)
training_marginals

array([[ 0.19694663,  0.24641   ,  0.55664336],
       [ 0.26928762,  0.25726439,  0.47344799],
       [ 0.4041731 ,  0.07146065,  0.52436625]])
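As a sanity check on the shape and meaning of such a matrix, the sketch below verifies that each row is a probability distribution and reads off the most likely value per candidate. It assumes that column `k` corresponds to `Spouse.values[k]`, which the source does not state explicitly, and since the numbers are random dummy values the resulting predictions carry no meaning.

```python
import numpy as np

VALUES = ['Married', 'Employs', False]  # assumed column order of the marginals

# The dummy N x K marginals printed above (N = 3 candidates, K = 3 values).
marginals = np.array([[0.19694663, 0.24641,    0.55664336],
                      [0.26928762, 0.25726439, 0.47344799],
                      [0.4041731,  0.07146065, 0.52436625]])

# Each row should sum to 1, up to the rounding of the printed values.
rows_sum_to_one = np.allclose(marginals.sum(axis=1), 1.0)

# Most likely value per candidate, under the assumed column order.
best = marginals.argmax(axis=1)
predicted = [VALUES[k] for k in best]

print(rows_sum_to_one)  # True
print(predicted)        # [False, False, False]
```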

In [15]:
from snorkel.annotations import save_marginals, load_marginals

save_marginals(session, L_train, training_marginals)

Saved 3 marginals


In [23]:
load_marginals(session, L_train)

array([[ 0.19694663,  0.24641   ,  0.55664336],
       [ 0.26928762,  0.25726439,  0.47344799],
       [ 0.4041731 ,  0.07146065,  0.52436625]])

## Step 5: Training the End Model