# Snorkel Workshop: Extracting Spouse Relations from the News
## Part 1: Snorkel API Cheat Sheet

Complete API documentation is available at:

http://snorkel.readthedocs.io/en/latest/

The examples below cover the parts of the API you are most likely to need when using Snorkel for the first time.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# Connect to the database backend and initialize a Snorkel session
from lib.init import *

## 1: `Candidate` and `Span` Objects
----

`Candidate` objects represent potential mentions found in text and are a core abstraction used in Snorkel. Candidates consist of 1 or more `Span` objects. 

For real applications, we have to define a candidate relation type consisting of named `Span` objects. In this cheat sheet, we'll use the pre-defined `Spouse` candidate, initialized below:

In [2]:
import sys

try:
    Spouse = candidate_subclass('Spouse', ['person1', 'person2'])
except:
    # Re-running this cell after the type exists raises an error; report it and move on
    print >>sys.stderr, "Info: Candidate type already defined"

### Querying `Candidate` Sets
We assume that our Candidates have already been extracted and partitioned
into `train`, `dev`, and `test` sets.

See our preprocessing tutorial <a href="Workshop_Advanced_Preprocessing.ipynb">Workshop_Advanced_Preprocessing.ipynb</a> for more information on how candidates are generated. 

For now, we will just load our `train` set candidates.

In [3]:
# Split 0 holds the train candidates
cands = session.query(Candidate).filter(Candidate.split == 0).all()

### Accessing `Span` Arguments
The `Spouse` candidate defined above has 2 arguments: `person1` and `person2`. These are represented as `Span` objects. We can access them directly, either as member variables of a candidate or by position:

In [4]:
c = cands[134]

In [5]:
print c.person1
print c.person2

print c[0], c[1]

Span("Daniella Topol", sentence=34178, chars=[127,140], words=[23,24])
Span("Majok’s", sentence=34178, chars=[15,21], words=[2,3])
Span("Daniella Topol", sentence=34178, chars=[127,140], words=[23,24]) Span("Majok’s", sentence=34178, chars=[15,21], words=[2,3])


### `Span` Member Functions
`Span` objects provide several useful functions for accessing text attributes.

In [6]:
print "Span 0 words               \t", c[0].get_attrib_tokens("words")
print "Span 0 lemmas              \t", c[0].get_attrib_tokens("lemmas")
print "Span 0 part-of-speech tags \t", c[0].get_attrib_tokens("pos_tags")

Span 0 words               	[u'Daniella', u'Topol']
Span 0 lemmas              	[u'daniella', u'topol']
Span 0 part-of-speech tags 	[u'NNP', u'NNP']
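
Judging by the `Span` output above, the `words=[start,end]` range appears to be inclusive on both ends. The sketch below illustrates the equivalent slice on plain Python lists copied from this notebook's output, using the `person2` span (`words=[2,3]`). It does not use Snorkel objects, and `slice_inclusive` is a hypothetical helper, not part of the Snorkel API.

```python
# First five tokens of the example sentence and their POS tags,
# copied from the notebook output (not loaded from Snorkel).
words = [u"Beginning", u"with", u"Majok", u"\u2019s", u"smart"]
pos_tags = [u"VBG", u"IN", u"NNP", u"NNP", u"JJ"]


def slice_inclusive(tokens, word_start, word_end):
    """Return tokens[word_start..word_end], treating both ends as inclusive."""
    return tokens[word_start:word_end + 1]


# The person2 span reports words=[2,3] in the output above.
span_words = slice_inclusive(words, 2, 3)
span_pos = slice_inclusive(pos_tags, 2, 3)
span_text = u"".join(span_words)
```

The same slice applied to any other per-token list (lemmas, NER tags) should line up with what `get_attrib_tokens` returns for that attribute, assuming the inclusive reading is right.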


## 2: `Context`  Objects: `Sentence` and `Document`
----

All `Candidate` objects live within a `Context`. We'll focus on `Sentence` objects to start. We can access any candidate's *parent* with the following method:

In [7]:
sent = c.get_parent()
sent

Sentence(Document 6219eaee-178f-4f1a-87cb-5f4895955cd7,1,Beginning with Majok’s smart, jabbing play about one woman’s survival, Round House Theatre’s production, assuredly directed by Daniella Topol, gets all the elements right.)

### Text and Syntactic Attributes
Text undergoes standard information extraction preprocessing to extract important syntactic attributes like part-of-speech tags and dependency tree information. You can access all of these attributes for each token using these `Sentence` member variables:

In [8]:
print sent.words
print sent.lemmas
print sent.pos_tags
print sent.ner_tags
print sent.dep_parents
print sent.dep_labels

[u'Beginning', u'with', u'Majok', u'\u2019s', u'smart', u',', u'jabbing', u'play', u'about', u'one', u'woman', u'\u2019s', u'survival', u',', u'Round', u'House', u'Theatre', u'\u2019s', u'production', u',', u'assuredly', u'directed', u'by', u'Daniella', u'Topol', u',', u'gets', u'all', u'the', u'elements', u'right', u'.']
[u'begin', u'with', u'majok', u'\u2019s', u'smart', u',', u'jab', u'play', u'about', u'one', u'woman', u'\u2019s', u'survival', u',', u'round', u'house', u'theatre', u'\u2019s', u'production', u',', u'assuredly', u'direct', u'by', u'daniella', u'topol', u',', u'get', u'all', u'the', u'element', u'right', u'.']
[u'VBG', u'IN', u'NNP', u'NNP', u'JJ', u',', u'VBG', u'NN', u'IN', u'CD', u'NN', u'JJ', u'NN', u',', u'NNP', u'NNP', u'NNP', u'NNP', u'NN', u',', u'RB', u'VBN', u'IN', u'NNP', u'NNP', u',', u'VBZ', u'PDT', u'DT', u'NNS', u'RB', u'.']
[u'O', u'O', u'PERSON', u'PERSON', u'O', u'O', u'O', u'O', u'CARDINAL', u'CARDINAL', u'O', u'O', u'O', u'O', u'ORG', u'ORG', u'ORG
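
Because these attributes are parallel lists with one entry per token, they can be zipped together. As a sketch, the hypothetical helper `tagged_runs` below groups consecutive tokens that share an NER tag, using only the first five tokens from the output above. It is plain Python and not a Snorkel function.

```python
from itertools import groupby

# First five tokens and NER tags, copied from the notebook output.
words = [u"Beginning", u"with", u"Majok", u"\u2019s", u"smart"]
ner_tags = [u"O", u"O", u"PERSON", u"PERSON", u"O"]


def tagged_runs(words, tags, skip=u"O"):
    """Group consecutive tokens that share a tag.

    Returns (tag, start, end, tokens) tuples, with an inclusive end index,
    and drops runs whose tag equals `skip`.
    """
    runs = []
    index = 0
    for tag, group in groupby(zip(words, tags), key=lambda pair: pair[1]):
        tokens = [word for word, _ in group]
        if tag != skip:
            runs.append((tag, index, index + len(tokens) - 1, tokens))
        index += len(tokens)
    return runs


person_runs = tagged_runs(words, ner_tags)
```

Here the single `PERSON` run covers token indices 2 through 3, which matches the `words=[2,3]` range of the `person2` span shown earlier.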

### Character-level Offsets
These lists give the starting offset (in characters) of each token in the `Sentence` object. 
`char_offsets` are relative to the start of the sentence, while `abs_char_offsets` are relative to the start of the document. 

In [9]:
print sent.char_offsets
print sent.abs_char_offsets

[0, 10, 15, 20, 23, 28, 30, 38, 43, 49, 53, 58, 61, 69, 71, 77, 83, 90, 93, 103, 105, 115, 124, 127, 136, 141, 143, 148, 152, 156, 165, 170]
[195, 205, 210, 215, 218, 223, 225, 233, 238, 244, 248, 253, 256, 264, 266, 272, 278, 285, 288, 298, 300, 310, 319, 322, 331, 336, 338, 343, 347, 351, 360, 365]
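
The two lists differ by a constant: the position of the sentence's first character in the document. The sketch below checks this on the first six offsets from the output above and shows one way to map a character position back to a token index. `token_index_at` and `to_absolute` are hypothetical helpers written for this example, not Snorkel functions.

```python
from bisect import bisect_right

# Offsets of the first six tokens, copied from the notebook output.
char_offsets = [0, 10, 15, 20, 23, 28]
abs_char_offsets = [195, 205, 210, 215, 218, 223]

# Every token is shifted by the same amount: the sentence start in the document.
shifts = set(a - r for r, a in zip(char_offsets, abs_char_offsets))
constant_shift = len(shifts) == 1
sentence_start = abs_char_offsets[0] - char_offsets[0]


def token_index_at(char_pos, offsets):
    """Index of the last token whose start offset is <= char_pos.

    A position that falls on whitespace maps to the preceding token.
    """
    return bisect_right(offsets, char_pos) - 1


def to_absolute(char_pos, sentence_start):
    """Convert a sentence-relative character position to a document position."""
    return sentence_start + char_pos
```

For the `person2` span with `chars=[15,21]`, `token_index_at` maps the two endpoints to tokens 2 and 3, which agrees with that span's `words=[2,3]`.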


### Visualizing Candidates
We can also visualize a candidate for debugging purposes. 
This snippet displays the candidate `c` in its parent sentence.

In [10]:
from lib.viz import *
display_candidate(c)

## 3: Helper functions

These are Python helper functions that take a candidate and return tokens or text that are useful during labeling function (LF) development.

You can (and should!) write your own helper functions to help write LFs.

In [11]:
import re
from snorkel.lf_helpers import (
    get_left_tokens, get_right_tokens, get_between_tokens,
    get_text_between, get_tagged_text,
)

In [12]:
print "Candidate LEFT tokens:   \t", list(get_left_tokens(c,window=2))
print "Candidate RIGHT tokens:  \t", list(get_right_tokens(c,window=2))
print "Candidate BETWEEN tokens:\t", get_text_between(c)

Candidate LEFT tokens:   	[u'directed', u'by']
Candidate RIGHT tokens:  	[u'smart', u',']
Candidate BETWEEN tokens:	 smart, jabbing play about one woman’s survival, Round House Theatre’s production, assuredly directed by 
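
As an illustration of how these helpers can feed into an LF, here is a sketch of a keyword rule over the text between the two spans. It operates on a plain string so that it runs without a Snorkel session; in the notebook you would pass it the result of `get_text_between(c)`. The keyword set, the function name `lf_spouse_keyword_between`, and the label convention (1 for a positive vote, 0 to abstain) are assumptions made for this example.

```python
import re

# Hypothetical keyword list; adjust it after inspecting your own candidates.
SPOUSE_KEYWORDS = {"wife", "husband", "spouse", "married"}


def lf_spouse_keyword_between(between_text):
    """Vote 1 if a spouse keyword occurs between the two spans, else abstain (0)."""
    tokens = re.findall(r"[a-z]+", between_text.lower())
    return 1 if SPOUSE_KEYWORDS.intersection(tokens) else 0


# Text between the two spans of candidate c, copied from the output above.
between_c = (u" smart, jabbing play about one woman\u2019s survival, "
             u"Round House Theatre\u2019s production, assuredly directed by ")
label_c = lf_spouse_keyword_between(between_c)

# A made-up string that should trigger the rule.
label_positive = lf_spouse_keyword_between(u" and his wife, ")
```

The rule abstains on candidate `c`, which is reasonable: nothing between "Majok’s" and "Daniella Topol" suggests a marriage.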


### Advanced `Candidate` and `Sentence` Attributes

Snorkel creates a dependency tree representation for all `Sentence` objects. You can query
this using advanced lxml-based operators, but sometimes it's also useful just to visualize a
sentence to identify important verbs, noun phrases, and other syntactic structures. 

In [13]:
from lib.tree_structs_ipynb import *
tree = sentence_to_xmltree(c.get_parent())

In [14]:
tree.render_tree()
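
If you prefer to work with the raw lists instead of the rendered tree, the sketch below walks from a token up to the root of the sentence. It uses a made-up three-word sentence, not data from this notebook, and it assumes that `dep_parents` stores 1-based head positions with 0 marking the root. That convention is an assumption here, so check it against your own `sent.dep_parents` output before relying on it. `path_to_root` is a hypothetical helper.

```python
# Toy sentence (hypothetical data, not taken from the corpus).
words = ["Alice", "married", "Bob"]
dep_parents = [2, 0, 2]            # assumed: 1-based head index, 0 = root
dep_labels = ["nsubj", "ROOT", "dobj"]


def path_to_root(index, dep_parents):
    """Return the 0-based token indices from `index` up to the root token."""
    path = [index]
    # The length bound guards against malformed input that contains a cycle.
    while dep_parents[path[-1]] != 0 and len(path) <= len(dep_parents):
        path.append(dep_parents[path[-1]] - 1)
    return path


alice_path = [words[i] for i in path_to_root(0, dep_parents)]
bob_path = [words[i] for i in path_to_root(2, dep_parents)]
```

A helper like this can support an LF that checks whether both person spans attach to the same verb.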

## 4: Cheat Sheet
----

### Accessing `Candidate` tokens

Full helper function list at
http://snorkel.readthedocs.io/en/latest/etc.html#module-snorkel.lf_helpers

`snorkel.lf_helpers.get_left_tokens(...)`
`snorkel.lf_helpers.get_right_tokens(...)`
`snorkel.lf_helpers.get_between_tokens(...)`
`snorkel.lf_helpers.get_text_between(...)`
`snorkel.lf_helpers.get_tagged_text(...)`

### `Span` member functions

`c[0].get_attrib_tokens(...)`
`c[0].get_word_start(...)`
`c[0].get_word_end(...)`


### `Sentence` attributes

| Variable Name        | Description                                         |
|----------------------|-----------------------------------------------------|
| `s.words`            | Text tokens                                         |
| `s.lemmas`           | Lemmas, "a base word and its inflections"           |
| `s.pos_tags`         | Part-of-speech tags                                 |
| `s.ner_tags`         | Named entity tags                                   |
| `s.dep_parents`      | Dependency tree heads                               |
| `s.dep_labels`       | Dependency tree tags                                |
| `s.char_offsets`     | Character offsets relative to the sentence start    |
| `s.abs_char_offsets` | Character offsets relative to the document start    |


### Computing Labeling Function Metrics

`snorkel.lf_helpers.test_LF`
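
`test_LF` is listed here without its arguments, so the sketch below does not call it. Instead, it shows in plain Python the kind of numbers one typically wants when comparing an LF's votes with gold labels: coverage, a confusion count, and precision. The function names, the toy labels, and the label convention (1 positive, -1 negative, 0 abstain) are assumptions made for this example and are not taken from the Snorkel implementation.

```python
def lf_confusion(lf_labels, gold_labels):
    """Count TP/FP/TN/FN over candidates where the LF did not abstain."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for predicted, gold in zip(lf_labels, gold_labels):
        if predicted == 0:          # abstain: not counted
            continue
        if predicted == 1:
            counts["tp" if gold == 1 else "fp"] += 1
        else:
            counts["tn" if gold == -1 else "fn"] += 1
    return counts


def coverage(lf_labels):
    """Fraction of candidates on which the LF voted (did not abstain)."""
    return sum(1 for label in lf_labels if label != 0) / float(len(lf_labels))


def precision(counts):
    """TP / (TP + FP), or 0.0 when the LF cast no positive votes."""
    positives = counts["tp"] + counts["fp"]
    return counts["tp"] / float(positives) if positives else 0.0


# Toy votes and gold labels (hypothetical data).
lf_votes = [1, 1, 0, -1, -1, 0]
gold = [1, -1, 1, -1, 1, -1]

counts = lf_confusion(lf_votes, gold)
lf_coverage = coverage(lf_votes)
lf_precision = precision(counts)
```

On this toy data the LF votes on four of six candidates and is right on half of its positive votes.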