## Part II: `Candidate` Extraction

In [1]:
%load_ext autoreload
%autoreload 2
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

from snorkel import SnorkelSession
session = SnorkelSession()

In [2]:
from snorkel.models import candidate_subclass
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

In [3]:
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import PersonMatcher

ngrams         = Ngrams(n_max=3)
person_matcher = PersonMatcher(longest_match_only=True)
cand_extractor = CandidateExtractor(Spouse, 
                                    [ngrams, ngrams], [person_matcher, person_matcher],
                                    symmetric_relations=False)
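
As a rough illustration of what this configuration asks for, the sketch below enumerates word n-grams of up to three tokens, keeps those tagged entirely as `PERSON`, drops matches nested inside a longer match, and pairs the survivors in left-to-right order. It is a simplified stand-in written in plain Python, not Snorkel's implementation; the helper names, the example sentence, and the reading of `longest_match_only` and `symmetric_relations=False` used here are assumptions.

```python
from itertools import combinations

def ngram_spans(n_tokens, n_max=3):
    """Yield (start, end) word spans, end exclusive, covering 1..n_max tokens."""
    for start in range(n_tokens):
        for end in range(start + 1, min(start + n_max, n_tokens) + 1):
            yield (start, end)

def longest_person_spans(tags, n_max=3):
    """Spans tagged entirely PERSON that are not nested inside a longer such span."""
    matched = [(s, e) for s, e in ngram_spans(len(tags), n_max)
               if all(tag == 'PERSON' for tag in tags[s:e])]
    return [(s, e) for s, e in matched
            if not any(s2 <= s and e <= e2 and (s2, e2) != (s, e)
                       for s2, e2 in matched)]

# Invented sentence and tags, for illustration only.
tokens = ['Alice', 'Smith', 'married', 'Bob', 'Jones', 'in', 'May']
tags = ['PERSON', 'PERSON', 'O', 'PERSON', 'PERSON', 'O', 'O']

spans = longest_person_spans(tags)
# Assumed reading of symmetric_relations=False: keep each pair once, in sentence order.
pairs = [(' '.join(tokens[a0:a1]), ' '.join(tokens[b0:b1]))
         for (a0, a1), (b0, b1) in combinations(spans, 2)]
print(pairs)  # [('Alice Smith', 'Bob Jones')]
```

Passing the same n-gram space and matcher twice, as the cell above does, corresponds to drawing both arguments of the relation from the same pool of person mentions.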

In [4]:
def number_of_people(sentence):
    active_sequence = False
    count = 0
    for tag in sentence.ner_tags:
        if tag == 'PERSON' and not active_sequence:
            active_sequence = True
            count += 1
        elif tag != 'PERSON' and active_sequence:
            active_sequence = False
    return count
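
To see what `number_of_people` counts, here is a small self-contained check. The `FakeSentence` tuple is a hypothetical stand-in for a Snorkel `Sentence` (only its `ner_tags` attribute is assumed), and the tag sequences are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a Snorkel Sentence; only `ner_tags` is assumed.
FakeSentence = namedtuple('FakeSentence', ['ner_tags'])

def number_of_people(sentence):
    """Count maximal runs of consecutive PERSON tags (same logic as the cell above)."""
    active_sequence = False
    count = 0
    for tag in sentence.ner_tags:
        if tag == 'PERSON' and not active_sequence:
            active_sequence = True
            count += 1
        elif tag != 'PERSON' and active_sequence:
            active_sequence = False
    return count

two_people = FakeSentence(['PERSON', 'PERSON', 'O', 'O', 'PERSON', 'O'])
no_people = FakeSentence(['O', 'LOCATION', 'O'])

print(number_of_people(two_people))  # 2
print(number_of_people(no_people))   # 0
```

Because the function counts runs of tags rather than distinct names, two people whose names are adjacent with no intervening token would be counted as one.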

In [15]:
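from snorkel.models import Document

# Load all documents in a stable (name-sorted) order so the random split below is reproducible
docs = session.query(Document).order_by(Document.name).all()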

In [16]:
import os
import csv

labeled_docs = set()
with open(os.environ['SNORKELHOME'] + '/tutorials/semparse/data/labeled_docs.csv') as csvin:
    reader = csv.reader(csvin)
    for row in reader:
        doc = row[0]
        labeled_docs.add(doc)
print "Labeled documents: {}".format(len(labeled_docs))
print "Unlabeled documents: {}".format(len(docs) - len(labeled_docs))

Labeled documents: 71
Unlabeled documents: 929


In [6]:
from snorkel.models import Document
import random

random.seed(0)

filtered_sents = 0
train_sents = set()
dev_sents   = set()
test_sents  = set()
unlabeled_sents = set()
splits = [0.3, 0.4, 0.3] # train, dev, test
for i, doc in enumerate(docs):
    for s in doc.sentences:
        if number_of_people(s) < 5:
            if doc.name in labeled_docs:
                r = random.random()
                if r < splits[0]:
                    train_sents.add(s)
                elif r < (splits[0] + splits[1]):
                    dev_sents.add(s)
                else:
                    test_sents.add(s)
            else:
                unlabeled_sents.add(s)
        else:
            filtered_sents += 1
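
The split logic in the cell above compares one random draw against cumulative thresholds. The sketch below isolates that idea in a hypothetical helper, `assign_split` (not part of Snorkel), and applies it to 10,000 draws so that the resulting proportions can be compared with the requested fractions.

```python
import random

def assign_split(r, splits):
    """Return the index of the first cumulative threshold that exceeds r."""
    cumulative = 0.0
    for index, fraction in enumerate(splits):
        cumulative += fraction
        if r < cumulative:
            return index
    return len(splits) - 1  # guard against floating-point round-off

random.seed(0)
splits = [0.3, 0.4, 0.3]  # train, dev, test
counts = [0, 0, 0]
for _ in range(10000):
    counts[assign_split(random.random(), splits)] += 1

print(counts)
```

Since each sentence is assigned independently, the realized split sizes only approximate the requested fractions, which is consistent with the sentence counts reported below.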

In [7]:
print "Train sentences: %d" % len(train_sents)
print "Dev sentences: %d" % len(dev_sents)
print "Test sentences: %d" % len(test_sents)
print "Unlabeled sentences: %d" % len(unlabeled_sents)
print "Filtered sentences: %d" % filtered_sents

Train sentences: 610
Dev sentences: 851
Test sentences: 646
Unlabeled sentences: 26975
Filtered sentences: 200


## Running the `CandidateExtractor`

We run the `CandidateExtractor` by calling `apply` with the contexts to extract from (here, a set of sentences) and the `split` index under which the extracted candidates are stored.

In [8]:
for i, sents in enumerate([train_sents, dev_sents, test_sents, unlabeled_sents]):
    %time cand_extractor.apply(sents, split=i)
    print "Number of candidates: %d" % session.query(Spouse).filter(Spouse.split == i).count()
    print

Clearing existing...
Running UDF...

CPU times: user 2.64 s, sys: 47.9 ms, total: 2.68 s
Wall time: 3.22 s
Number of candidates: 141

Clearing existing...
Running UDF...

CPU times: user 3.1 s, sys: 41.9 ms, total: 3.14 s
Wall time: 3.27 s
Number of candidates: 231

Clearing existing...
Running UDF...

CPU times: user 2.43 s, sys: 38 ms, total: 2.47 s
Wall time: 2.63 s
Number of candidates: 130

Clearing existing...
Running UDF...

CPU times: user 1min 11s, sys: 539 ms, total: 1min 12s
Wall time: 1min 13s
Number of candidates: 4780



Here we assigned each set of `Candidates` to a split through the `split` argument; recall that we're referring to train / dev / test as splits 0 / 1 / 2. Candidates from the unlabeled sentences go to split 3.

Note again that we could have specified a `parallelism` parameter to execute in parallel, if we had a non-SQLite database set up.

## Using the `Viewer` to inspect candidates

Next, we'll use the `Viewer` class (here, specifically, the `SentenceNgramViewer`) to inspect the data.

Note that our goal here is to **maximize the recall of true candidates** extracted, **not** to extract _only_ the correct candidates. Learning to distinguish true candidates from false candidates is covered in Tutorial 4.
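
To make the recall-versus-precision distinction concrete, the following sketch scores a set of extracted candidate pairs against a set of true pairs. All names and pairs here are invented for illustration; they are not taken from the tutorial's data.

```python
def recall(extracted, gold):
    """Fraction of gold pairs that appear among the extracted candidates."""
    return len(extracted & gold) / float(len(gold))

def precision(extracted, gold):
    """Fraction of extracted candidates that are gold pairs."""
    return len(extracted & gold) / float(len(extracted))

# Invented example: two true spouse pairs, five extracted candidates.
gold = {('Ann', 'Bob'), ('Carol', 'Dan')}
extracted = {('Ann', 'Bob'), ('Carol', 'Dan'), ('Ann', 'Dan'),
             ('Bob', 'Carol'), ('Ann', 'Carol')}

print("recall = %.2f" % recall(extracted, gold))        # recall = 1.00
print("precision = %.2f" % precision(extracted, gold))  # precision = 0.40
```

An extraction result like this one is acceptable at this stage: every true pair is available downstream, and the false candidates are left for the later learning step to reject.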

First, we instantiate the `Viewer` object, which groups the input `Candidate` objects by `Sentence`:

In [9]:
from snorkel.viewer import SentenceNgramViewer

train_cands = session.query(Spouse).filter(Spouse.split == 0).all()
sv = SentenceNgramViewer(train_cands[:300], session)

Next, we render the `Viewer`.

In [10]:
sv

In [11]:
if 'CI' not in os.environ:
    print unicode(sv.get_selected())

Spouse(Span("David Jordan", sentence=28892, chars=[0,11], words=[0,1]), Span("Grayson", sentence=28892, chars=[177,183], words=[32,32]))


Next, in Part III, we will annotate some candidates with labels so that we can evaluate performance.