# Snorkel Workshop 
## Part 1: Snorkel API

Complete Snorkel API documentation is available on [Read the Docs](http://snorkel.readthedocs.io/en/master/).

The detailed examples below are a useful starting point if you are using Snorkel for the first time.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
os.environ['SNORKELDB'] = "postgresql://ubuntu:snorkel@localhost/spouse"
    
from snorkel import SnorkelSession
from snorkel.models import candidate_subclass
from snorkel.models import Candidate, Sentence, Span, Document

session = SnorkelSession()
from lib.util import *

# I. Candidates and Contexts
----
<img src="imgs/candidate.jpg" width="300px">
`Candidate` objects represent potential mentions found in text and are a core abstraction in Snorkel. A `Candidate` is defined over one or more `Context` objects, which are typically units of text such as a sequence of words in a sentence. Every Snorkel application requires a custom `Candidate` class definition.

## A. Example Definitions

<img src="imgs/spouse.jpg" width="300px">

In our tutorial, we define a `Spouse` relation as consisting of two `Span` objects (i.e., sequences of words or characters) representing the mentions of two people who are married. Defining a new `Candidate` class requires providing a name for the class (`Spouse`) and the names of its `Span` arguments (`person1` and `person2`). The syntax for defining this relation is below:

In [None]:
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

## B. Candidates in Context
<img src="imgs/sentence.jpg" width="700px;">

By default, Snorkel candidates are defined over `Span` objects within a `Sentence` context. A `Span` typically marks a mention of some conceptual category in the text, such as a person or a disease name. In the above example, our candidate represents the possible `Spouse` mention `(Barack Obama, Michelle Obama)`. As readers, we know this mention is true because of external knowledge and because the keyword `wedding` occurs later in the sentence.

## C. Context Hierarchy 
<img src="imgs/context-hierarchy.jpg" width="300px;">

All `Context` objects in Snorkel are hierarchical. The default objects provided by Snorkel are shown above.
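
To make the hierarchy concrete, here is a minimal sketch in plain Python. The classes below are hypothetical stand-ins written for this example, not Snorkel's own `Document`, `Sentence`, and `Span` classes; the only thing they borrow from Snorkel is the idea that a context exposes `get_parent()`.

```python
# Hypothetical stand-ins for illustration only; these are NOT the Snorkel classes.
class Context(object):
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def get_parent(self):
        # Return the enclosing context, or None at the top of the hierarchy
        return self.parent

    def __repr__(self):
        return "{}({!r})".format(type(self).__name__, self.name)


class Document(Context):
    pass


class Sentence(Context):
    pass


class Span(Context):
    pass


doc = Document("doc-1")
sent = Sentence("sentence-0", parent=doc)
span = Span("Barack Obama", parent=sent)


def ancestors(ctx):
    """Collect a context and all of its parents, innermost first."""
    chain = []
    while ctx is not None:
        chain.append(ctx)
        ctx = ctx.get_parent()
    return chain


print(ancestors(span))
```

Walking up from a span in this way is how a labeling function can reach sentence-level or document-level information.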

# II. Loading  `Candidate(s)` 

## A. Querying `Candidate` Objects from the Database
Once you've defined candidates as shown above, you need to do some preprocessing to load your documents, extract candidates, and then load everything into a database. This is a time-consuming process, so we've pre-generated a database snapshot for you. For specific information on how this is done, download and launch our preprocessing tutorial <a href="https://github.com/HazyResearch/snorkel/tree/master/tutorials/workshop">Workshop 5 Advanced Preprocessing</a>, available from our GitHub repository.

We assume that our Candidates have already been extracted and partitioned into `train`, `dev`, and `test` sets. For now, we will just load our `train` set candidates by requesting Candidate split 0 (Candidate split 1 is the `dev` set and Candidate split 2 is the `test` set).

This query returns a list of candidate objects.


In [None]:
cands = session.query(Candidate).filter(Candidate.split == 0).all()

## B. `Candidate` Member Functions and Variables

You will interact with candidates while writing labeling functions in Snorkel. The definitions of the `Spouse` and `Span` classes are outlined below:

```
class Spouse(Candidate)
 Attributes:
    person1 (Span): relation argument
    person2 (Span): relation argument

class Span(Context)
 Methods:
    get_attrib_tokens(a="words"): return all tokens of the provided type a
    get_parent(): return parent Context

```

For the following examples, we'll look at the first candidate in our `cands` list. We'll print its `Span` arguments and their token attributes, then walk up to its parent sentence and document.

In [None]:
c = cands[0]

# we can access Span(s) as named member variables
print(c.person1)
print(c.person2)

# the raw word tokens for the person1 Span
print(c.person1.get_attrib_tokens("words"))

# part of speech tags
print(c.person1.get_attrib_tokens("pos_tags"))

# named entity recognition tags
print(c.person1.get_attrib_tokens("ner_tags"))

# we can access context parents
sentence = c.get_parent()
document = sentence.get_parent()

print(sentence)
print(document)


# III. Writing Labeling Functions

Snorkel uses *labeling functions* (LFs) to generate noisy labels for training machine learning models. LFs encode heuristics that are correlated with whatever outcome (label) you are trying to predict. For example, in the `Spouse` example above, we want to predict whether a pair of `Person` names are related by marriage.
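
As a rough sketch of what an LF looks like, the functions below vote on a candidate given the tokens between its two spans. To keep the example runnable without a database, they take a plain list of tokens rather than a `Candidate`. The keyword sets and function names are hypothetical, and the return values assume the convention used in the Snorkel tutorials of this era: `1` for a positive label, `-1` for a negative label, and `0` to abstain.

```python
# Hypothetical keyword lists, chosen only for illustration
SPOUSE_WORDS = {"wife", "husband", "spouse"}
FAMILY_WORDS = {"daughter", "son", "brother", "sister", "mother", "father"}


def lf_spouse_keyword(between_tokens):
    """Vote positive if a spouse keyword appears between the spans, else abstain."""
    tokens = {t.lower() for t in between_tokens}
    return 1 if tokens & SPOUSE_WORDS else 0


def lf_family_keyword(between_tokens):
    """Vote negative if another family relation is mentioned, else abstain."""
    tokens = {t.lower() for t in between_tokens}
    return -1 if tokens & FAMILY_WORDS else 0


print(lf_spouse_keyword(["and", "his", "wife"]))
print(lf_family_keyword(["and", "his", "wife"]))
print(lf_family_keyword(["'s", "daughter"]))
```

Each LF is allowed to be noisy and to abstain often; the value comes from combining many such weak signals.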


## A. Labeling Function (LF) Helpers

When writing labeling functions, there are several operations you will use over and over again: fetching text between span arguments, examining word windows around spans, etc.

Snorkel provides several core helper functions for these operations. They are plain Python functions that you apply to candidates to return objects that are helpful during LF development.

You can (and should!) write your own helper functions to help write LFs.


In [None]:
import re
from snorkel.lf_helpers import (
    get_left_tokens, get_right_tokens, get_between_tokens,
    get_text_between, get_tagged_text,
)

print("Candidate LEFT tokens:   \t", list(get_left_tokens(c,window=2)))
print("Candidate RIGHT tokens:  \t", list(get_right_tokens(c,window=2)))
print("Candidate BETWEEN tokens:\t", get_text_between(c))
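
To build intuition for what these windows contain, here is a simplified approximation in plain Python that works on a token list and span indices instead of a `Candidate`. The function names, the hand-tokenized sentence, and the span indices are all assumptions made for this sketch; it is not the library's implementation.

```python
def left_window(words, span_start, window=3):
    """Up to `window` lowercased tokens immediately left of the span."""
    return [w.lower() for w in words[max(0, span_start - window):span_start]]


def right_window(words, span_end, window=3):
    """Up to `window` lowercased tokens immediately right of the span.

    `span_end` is the index of the span's last token (inclusive).
    """
    return [w.lower() for w in words[span_end + 1:span_end + 1 + window]]


def tokens_between(words, first_end, second_start):
    """Tokens strictly between the end of one span and the start of the next."""
    return words[first_end + 1:second_start]


# Hand-tokenized example sentence; a real parser may split tokens differently
words = ["Lottie", "was", "among", "Kate", "'s", "eleven", "bridesmaids",
         "when", "she", "married", "Jamie", "Hince", "in", "2011", "."]
kate = (3, 3)       # (start, end) token indices of the first span
jamie = (10, 11)    # (start, end) token indices of the second span

print(left_window(words, kate[0], window=2))
print(right_window(words, jamie[1], window=2))
print(tokens_between(words, kate[1], jamie[0]))
```

Here the left window is taken relative to the first span and the right window relative to the second span.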

# IV. Exercises

### 1. Write a labeling function helper that counts the number of tokens between a candidate's `Person` spans. 

In [None]:
def count_tokens_between(c):
    pass
    
check_exercise(1, count_tokens_between, cands)

#show_exercise1_answer()

### 2. Write a labeling function helper that looks for the word 'married' occurring between  `Person` spans. 

Often, the words occurring between our relation spans provide important clues for our classification task. For example, in the sentence:

    Lottie was among Kate's eleven bridesmaids when she married Jamie Hince in 2011.
    
The word 'married' provides a clue that `Kate` is married to `Jamie Hince`.

In [None]:
def married_between(c):
    # return a boolean
    pass

check_exercise(2, married_between, cands)

#show_exercise2_answer()

### 3. Write a labeling function helper that checks if `Person` spans have the same last name.

Sharing the same last name is another weak signal that a pair of people might be married:

    During the attack, he punched Mr Capaldi, who collapsed, before grabbing Mrs Capaldi around the throat.

In [None]:
def same_last_name(c):
    # return a boolean
    pass

check_exercise(3, same_last_name, cands)

#show_exercise3_answer()

# V. Cheat Sheet
----
Jupyter notebooks provide a built-in docstring display operator for functions. Just prepend `?` to any function name as shown below.

    ?get_left_tokens

For class member functions, don't forget to include the class name:

    ?Span.get_attrib_tokens
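
The `?` operator only works inside IPython and Jupyter. In a plain Python script, the standard library's `inspect` module gives similar information. The sketch below uses a hypothetical local function, `count_vowels`, so that it runs without Snorkel installed; the same calls should work on any imported function.

```python
import inspect


def count_vowels(text):
    """Return the number of vowels in text (hypothetical example function)."""
    return sum(ch in "aeiou" for ch in text.lower())


# Show the call signature and the docstring, similar to what ? displays
print(inspect.signature(count_vowels))
print(inspect.getdoc(count_vowels))
```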

<img src="https://media.readthedocs.com/corporate/img/header-logo.png" width="200px;">

Complete Snorkel API documentation is available on [Read the Docs](http://snorkel.readthedocs.io/en/master/).

###  `Candidate` Helper Functions

Helper functions operate on a `Candidate` class instance, `c`.
  
    get_left_tokens(c, window=3, attrib='words', n_max=1, case_sensitive=False)
    get_right_tokens(c, window=3, attrib='words', n_max=1, case_sensitive=False)
    get_between_tokens(c, attrib='words', n_max=1, case_sensitive=False)
    get_text_between(c)
    get_tagged_text(c)

A full list of helper functions is available at
http://snorkel.readthedocs.io/en/master/etc.html#module-snorkel.lf_helpers

### `Candidate` Member Functions

Given a `Candidate` class instance `c`, these are called on its `Span` arguments (e.g., `c.person1`):

    .get_attrib_tokens(a='words')
    .get_word_start()
    .get_word_end()


### `Sentence` Attributes

| Variable Name      | Description                              |
|--------------------|------------------------------------------|
| `words`            | Text Tokens                              |
| `lemmas`           | Lemma, "a base word and its inflections" |
| `pos_tags`         | Part-of-speech Tags                      |
| `ner_tags`         | Named Entity Tags                        |
| `dep_parents`      | Dependency Tree Heads                    |
| `dep_labels`       | Dependency Tree Tags                     |
| `char_offsets`     | Character Offsets                        |
| `abs_char_offsets` | Absolute (document) Character Offsets    |
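
These attributes are parallel per-token lists: position `i` in each list describes the same token. The sketch below illustrates this with a plain dictionary standing in for a `Sentence`. The sentence, tags, and the `span_tokens` helper are hand-written for this example and are not parser output or Snorkel API.

```python
# Hand-written stand-in for a Sentence; values are illustrative, not parser output
sentence = {
    "words":        ["Barack", "Obama", "married", "Michelle", "."],
    "pos_tags":     ["NNP", "NNP", "VBD", "NNP", "."],
    "ner_tags":     ["PERSON", "PERSON", "O", "PERSON", "O"],
    "char_offsets": [0, 7, 13, 21, 29],
}


def span_tokens(sentence, attrib, word_start, word_end):
    """Slice one attribute list over an inclusive token range."""
    return sentence[attrib][word_start:word_end + 1]


# Every attribute list has one entry per token
assert len({len(values) for values in sentence.values()}) == 1

print(span_tokens(sentence, "words", 0, 1))
print(span_tokens(sentence, "ner_tags", 0, 1))
print(span_tokens(sentence, "char_offsets", 0, 1))
```

Slicing an attribute over a span's word range is roughly what a call such as `get_attrib_tokens("ner_tags")` gives you for that span.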


### Computing Labeling Function Metrics

`snorkel.lf_helpers.test_LF`