# Categorical Variables in Snorkel

This is a short tutorial on how to use categorical variables (i.e., variables that can take more than two values) in Snorkel. We'll use a toy scenario with three sentences and two LFs just to demonstrate the mechanics. Please see the main tutorial for a more comprehensive intro!

We'll **highlight in bold all parts focusing on the categorical aspect.**

### Notes on Current Categorical Support:
* The `Viewer` works in the categorical setting, _but labeling `Candidate`s in the `Viewer` does not._
* The `LogisticRegression` and `SparseLogisticRegression` end models have been extended to the categorical setting, but other end models in `contrib` may not have been.
    - _Note: It's simple to make this change, so feel free to post an issue with requests for other end models!_

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np

from snorkel import SnorkelSession
session = SnorkelSession()

## Step 1: Preprocessing the data

In [2]:
from snorkel.parser import TSVDocPreprocessor, CorpusParser

doc_preprocessor = TSVDocPreprocessor('data/categorical_example.tsv') 
corpus_parser = CorpusParser()
%time corpus_parser.apply(doc_preprocessor)

Clearing existing...
Running UDF...
CPU times: user 51.9 ms, sys: 15.6 ms, total: 67.5 ms
Wall time: 5.1 s


## Step 2: Defining candidates

We'll define candidate relations between person mentions **that now can take on one of three values:**
```python
['Married', 'Employs', False]
```
Note the importance of including a value for "not a relation of interest"; here we've used `False`, but any value would do.
Also note that `None` is a protected value -- denoting a labeling function abstaining -- so this cannot be used as a value.

In [3]:
from snorkel.models import candidate_subclass
Spouse = candidate_subclass('Spouse', ['person1', 'person2'], values=['Married', 'Employs', False])

Now we extract candidates the same way as in the Intro Tutorial (simplified slightly here):

In [4]:
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import PersonMatcher
from snorkel.models import Sentence

# Define a Person-Person candidate extractor
ngrams = Ngrams(n_max=3)
person_matcher = PersonMatcher(longest_match_only=True)
cand_extractor = CandidateExtractor(
    Spouse, 
    [ngrams, ngrams],
    [person_matcher, person_matcher],
    symmetric_relations=False
)

# Apply to all (three) of the sentences for this simple example
sents = session.query(Sentence).all()

# Run the candidate extractor
%time cand_extractor.apply(sents, split=0)

Clearing existing...
Running UDF...

CPU times: user 25 ms, sys: 5.86 ms, total: 30.8 ms
Wall time: 30.7 ms


In [5]:
train_cands = session.query(Spouse).filter(Spouse.split == 0).all()
print "Number of candidates:", len(train_cands)

Number of candidates: 3


In [6]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
import os
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(train_cands, session)
else:
    sv = None

<IPython.core.display.Javascript object>

In [7]:
sv

## Step 3: Writing Labeling Functions

**The _categorical_ labeling functions (LFs) we now write can output the following values:**

* Abstain: `None` OR 0
* Categorical values: The literal values in `Spouse.values` OR their integer indices.
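
To make the two output conventions above concrete, here is a minimal sketch of how an LF output could be normalized to an integer label. It assumes 1-based class indices with 0 reserved for abstain, which is what the label matrix later in this tutorial suggests; `to_label_index` and `VALUES` are hypothetical names for illustration, not part of the Snorkel API.

```python
# Hypothetical helper for illustration only; not part of Snorkel.
VALUES = ['Married', 'Employs', False]  # same order as Spouse.values


def to_label_index(output, values=VALUES):
    """Map an LF output to an integer: 0 = abstain, 1..k = class index (assumed)."""
    if output is None:
        return 0
    # bool is a subclass of int in Python, so exclude it explicitly:
    # here False is a class value, not the integer 0.
    if isinstance(output, int) and not isinstance(output, bool):
        if not 0 <= output <= len(values):
            raise ValueError("index out of range: %r" % (output,))
        return output
    for i, value in enumerate(values):
        same_kind = isinstance(value, bool) == isinstance(output, bool)
        if same_kind and value == output:
            return i + 1
    raise ValueError("unknown value: %r" % (output,))


indices = [to_label_index(v) for v in [None, 'Married', 'Employs', False]]
```

One design point worth noting: because `False == 0` in Python, a naive `values.index(output)` lookup would confuse the class value `False` with the abstain value `0`, which is why the sketch checks for `bool` explicitly.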

We'll write two simple LFs to illustrate.

*Tip: we can get the example highlighted in the viewer above via `sv.get_selected()`, and then use this to test as we write the LFs!*

In [8]:
import re
from snorkel.lf_helpers import get_between_tokens

# Getting an example candidate from the Viewer
c = sv.get_selected()

# Traversing the context hierarchy...
print c.get_contexts()[0].get_parent().text

# Using a helper function
get_between_tokens(c)

John is married to Susan.


[u'is', u'married', u'to']

In [9]:
def LF_married(c):
    return 'Married' if 'married' in get_between_tokens(c) else None

WORKPLACE_RGX = r'employ|boss|company'
def LF_workplace(c):
    sent = c.get_contexts()[0].get_parent()
    matches = re.search(WORKPLACE_RGX, sent.text)
    return 'Employs' if matches else None

LFs = [
    LF_married,
    LF_workplace
]
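
Since both LFs reduce to simple string checks, their decision rules can be tried out on plain Python values before running them on real candidates. The sketch below mirrors the rules above without any Snorkel objects; `married_label`, `workplace_label`, and the example sentences are hypothetical and are not taken from the tutorial's data file.

```python
import re

WORKPLACE_RGX = r'employ|boss|company'


def married_label(between_tokens):
    """Same rule as LF_married, applied to a plain list of tokens."""
    return 'Married' if 'married' in between_tokens else None


def workplace_label(sentence_text):
    """Same rule as LF_workplace, applied to a plain string."""
    return 'Employs' if re.search(WORKPLACE_RGX, sentence_text) else None


# Hypothetical inputs, chosen to show where the rules fire and where they abstain.
fires_on_tokens = married_label(['is', 'married', 'to'])       # exact token match
abstains_on_case = married_label(['was', 'Married', 'to'])     # 'Married' != 'married'
fires_on_substring = workplace_label('Ann employs Bob.')       # 'employ' is a substring
abstains_on_capital = workplace_label('The Company hired Bob.')  # regex is case-sensitive
```

Both rules are case-sensitive as written; passing `re.IGNORECASE` to `re.search`, or lowercasing the tokens, would be one way to loosen them.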

Now we apply the LFs to the candidates to produce our label matrix $L$:

In [10]:
from snorkel.annotations import LabelAnnotator

labeler = LabelAnnotator(lfs=LFs)
%time L_train = labeler.apply(split=0)
L_train

Clearing existing...
Running UDF...

CPU times: user 46.3 ms, sys: 5.78 ms, total: 52.1 ms
Wall time: 51.1 ms


<3x2 sparse matrix of type '<type 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [11]:
L_train.todense()

matrix([[ 1.,  0.],
        [ 0.,  2.],
        [ 1.,  2.]])
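
To read this matrix, it can help to decode the integers back into the categorical values. The following is a small sketch that assumes, based on the output above, that 0 means abstain and that a nonzero entry `k` refers to the `k`-th entry of `Spouse.values`; `decode_row` is a hypothetical helper.

```python
VALUES = ['Married', 'Employs', False]  # same order as Spouse.values

# Dense label matrix copied from the output above (rows: candidates, columns: LFs).
L_dense = [[1, 0],
           [0, 2],
           [1, 2]]


def decode_row(row, values=VALUES):
    """Turn integer labels back into values; 0 is treated as abstain (None)."""
    return [None if x == 0 else values[int(x) - 1] for x in row]


decoded = [decode_row(row) for row in L_dense]

# Number of non-abstain labels; compare with the "stored elements" count above.
n_labels = sum(1 for row in L_dense for x in row if x != 0)

# Candidates on which the LFs emit different non-abstain labels.
conflicts = [i for i, row in enumerate(L_dense)
             if len(set(x for x in row if x != 0)) > 1]
```

Under these assumptions, the third candidate is the only one where `LF_married` and `LF_workplace` disagree, which is the case the generative model has to resolve.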

## Step 4: Training the Generative Model

In [12]:
from snorkel.learning import GenerativeModel

gen_model = GenerativeModel()

# Note: We pass cardinality explicitly here to be safe.
# It can usually be inferred, but here no LF ever outputs the third value (index 3)
gen_model.train(L_train, cardinality=3)

In [13]:
train_marginals = gen_model.marginals(L_train)

# Each row of marginals should sum to 1 (compare the absolute difference)
assert np.all(np.abs(train_marginals.sum(axis=1) - np.ones(3)) < 1e-10)
train_marginals

array([[ 0.53463339,  0.2326833 ,  0.2326833 ],
       [ 0.23270825,  0.5345835 ,  0.23270825],
       [ 0.41067428,  0.41059193,  0.17873378]])
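
Each row of the marginals is a probability distribution over the three values. As a rough sketch of how to interpret them, the snippet below checks that the printed rows sum to 1 (up to print rounding) and picks the most probable value per candidate. It assumes that column `j` corresponds to `Spouse.values[j]`, which matches the LF votes above but is not stated explicitly in this tutorial.

```python
VALUES = ['Married', 'Employs', False]  # assumed column order of the marginals

# Marginals copied from the printed output above.
marginals = [
    [0.53463339, 0.2326833, 0.2326833],
    [0.23270825, 0.5345835, 0.23270825],
    [0.41067428, 0.41059193, 0.17873378],
]


def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=lambda i: row[i])


# The printed values are rounded, so use a looser tolerance than 1e-10 here.
rows_sum_to_one = all(abs(sum(row) - 1.0) < 1e-6 for row in marginals)

hard_labels = [VALUES[argmax(row)] for row in marginals]
```

Note how close the first two entries of the last row are: with one vote for each of the two relations, the model is nearly undecided, so a hard label for that candidate carries little information.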

Next, we can save the training marginals:

In [14]:
from snorkel.annotations import save_marginals, load_marginals

save_marginals(session, L_train, train_marginals)

Saved 3 marginals


And then reload (e.g. in another notebook):

In [15]:
load_marginals(session, L_train)

array([[ 0.53463339,  0.2326833 ,  0.2326833 ],
       [ 0.23270825,  0.5345835 ,  0.23270825],
       [ 0.41067428,  0.41059193,  0.17873378]])

## Step 5: Training the End Model

Now we create features and train the `LogisticRegression` model, which uses a softmax to handle the categorical setting:

In [16]:
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator()

%time F_train = featurizer.apply(split=0)
F_train

Clearing existing...
Running UDF...

CPU times: user 83.3 ms, sys: 7 ms, total: 90.3 ms
Wall time: 90.8 ms


<3x52 sparse matrix of type '<type 'numpy.float64'>'
	with 65 stored elements in Compressed Sparse Row format>

In [17]:
from snorkel.learning import LogisticRegression
disc_model = LogisticRegression()

disc_model.train(F_train.todense(), train_marginals, n_epochs=20, lr=0.001)

[LR] lr=0.001 l1=0.0 l2=0.0
[LR] Building model
[LR] Training model
[LR] #examples=3  #epochs=20  batch size=3
[LR] Epoch 0 (0.11s)	Avg. loss=0.725897	NNZ=156
[LR] Epoch 5 (0.11s)	Avg. loss=0.713601	NNZ=156
[LR] Epoch 10 (0.12s)	Avg. loss=0.703993	NNZ=156
[LR] Epoch 15 (0.12s)	Avg. loss=0.696839	NNZ=156
[LR] Epoch 19 (0.12s)	Avg. loss=0.692405	NNZ=156
[LR] Training done (0.12s)
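
For readers unfamiliar with the softmax mentioned above, here is a minimal pure-Python sketch of a softmax over class scores together with a cross-entropy loss against soft (probabilistic) targets such as the training marginals. This illustrates the general technique only; it is not Snorkel's implementation, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert a list of real-valued class scores into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(scores, target):
    """Cross-entropy between a soft target distribution and softmax(scores)."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


# With all scores equal, the prediction is uniform over the three classes.
uniform = softmax([0.0, 0.0, 0.0])

# Loss of a uniform prediction against a one-hot target is log(3).
loss_one_hot = soft_cross_entropy([0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

In the binary case the same construction reduces to the usual logistic (sigmoid) output, which is why moving to a softmax is the natural extension for more than two values.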


In [19]:
from snorkel.learning import SparseLogisticRegression
disc_model = SparseLogisticRegression()

disc_model.train(F_train, train_marginals, n_epochs=20, lr=0.001)

[SparseLR] lr=0.001 l1=0.0 l2=0.0
[SparseLR] Building model
[SparseLR] Training model
[SparseLR] #examples=3  #epochs=20  batch size=3


ValueError: Cannot feed value of shape (2, 3) for Tensor u'Placeholder_6:0', which has shape '(?,)'