# Intro. to Snorkel: Extracting Spouse Relations from the News

## Part I: Preprocessing

In this tutorial, we will walk through the process of using `Snorkel` to identify mentions of spouses in a corpus of news articles. The tutorial is broken up into 5 notebooks, each covering a step in the pipeline:
1. Preprocessing
2. Candidate Extraction
3. Annotating Evaluation Data
4. Featurization & Training
5. Evaluation

In this notebook, we preprocess several documents using `Snorkel` utilities, parsing them into a simple hierarchy of component parts of our input data, which we refer to as _contexts_. We also use [CoreNLP](http://stanfordnlp.github.io/CoreNLP/) to extract standard linguistic features from each context, which will be useful downstream.

All of this preprocessed input data is saved to a database. Connection strings can be specified by setting the `SNORKELDB` environment variable. If no database is specified, Snorkel creates a SQLite database at `./snorkel.db` by default, so no setup is needed here!
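
The fallback described above can be pictured with a small sketch. This is not Snorkel's own code: `resolve_conn_string` and `DEFAULT_CONN_STRING` are hypothetical names, and the default URL is only an assumption based on the `./snorkel.db` path mentioned in the text.

```python
import os

# Assumed default (hypothetical constant): a SQLite file in the working directory.
DEFAULT_CONN_STRING = "sqlite:///snorkel.db"


def resolve_conn_string(environ=None):
    """Return the SNORKELDB value if set and non-empty, else the SQLite default."""
    environ = os.environ if environ is None else environ
    value = environ.get("SNORKELDB", "")
    return value if value else DEFAULT_CONN_STRING


# With no variable set, we fall back to SQLite.
print(resolve_conn_string({}))
# With the variable set, its value is used unchanged.
print(resolve_conn_string({"SNORKELDB": "postgres:///snorkel-intro"}))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try without touching the real environment.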

### Initializing a `SnorkelSession`

First, we initialize a `SnorkelSession`, which manages a connection to a database automatically for us, and will enable us to save intermediate results.  If we don't specify any particular database (see commented-out code below), then it will automatically create a SQLite database in the background for us:

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

from snorkel import SnorkelSession
session = SnorkelSession()

# Here, we just set a global variable related to automatic testing; you can safely ignore this!
max_docs = 50 if 'CI' in os.environ else float('inf')

## Loading the Corpus

Next, we load and pre-process the corpus of documents.

### Configuring a `DocPreprocessor`

We'll start by creating a `TSVDocPreprocessor` to read in the documents, which are stored in a tab-separated value (TSV) format as pairs of document names and text.
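
To make the input format concrete, here is a minimal standalone sketch of reading such name/text pairs. It is only an illustration and not how `TSVDocPreprocessor` is implemented; `iter_tsv_docs` and the sample rows are made up for this example.

```python
def iter_tsv_docs(lines, max_docs=float("inf")):
    """Yield (name, text) pairs from tab-separated lines, up to max_docs pairs."""
    count = 0
    for line in lines:
        if count >= max_docs:
            break
        # Split on the first tab only, so tabs inside the text are preserved.
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) < 2:
            continue  # Skip malformed rows that lack a text column.
        yield parts[0], parts[1]
        count += 1


# Hypothetical sample rows in the "name<TAB>text" layout described above.
sample_lines = [
    "doc-1\tFirst article text.\n",
    "malformed row without a tab\n",
    "doc-2\tSecond article text.\n",
]

docs = list(iter_tsv_docs(sample_lines))
first_only = list(iter_tsv_docs(sample_lines, max_docs=1))
print(docs)
print(first_only)
```

The `max_docs` argument mirrors the role of the `max_docs` variable set earlier: it caps how many documents are read.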

In [2]:
# import codecs
# rows = []
# with codecs.open("data/articles.tsv","rU","utf-8") as fp:
#     for line in fp:
#         row = line.split("\t")
#         doc = row[1]
#         doc = doc.replace("\\n","\n")
#         doc = doc.replace('""','"')
#         sents = [s.strip() for s in doc.split("\n") if len(s.strip()) > 0]
#         rows.append([row[0]," ".join(sents)])

# with codecs.open("data/articles4.tsv","w","utf-8") as fp:
#     for r in rows:
#         fp.write("\t".join(r) + "\n")

In [3]:
from snorkel.parser import TSVDocPreprocessor

doc_preprocessor = TSVDocPreprocessor('data/articles.tsv', max_docs=max_docs)



### Running a `CorpusParser`

We'll use an NLP preprocessing tool to split our documents into sentences and tokens, and to annotate these sentences with part-of-speech tags, dependency parse structure, lemmatized word forms, named entities, etc.

Let's run it single-threaded first; **note that this may take around 10 minutes depending on your machine**:

In [4]:
from snorkel.contrib.parser import *
#parser = RuleBasedParser()
#parser = Spacy()


In [5]:
from snorkel.parser import CorpusParser

corpus_parser = CorpusParser()
%time corpus_parser.apply(doc_preprocessor)

Clearing existing...
Running UDF...
CPU times: user 1min 55s, sys: 2.9 s, total: 1min 58s
Wall time: 9min 14s


We can then use simple database queries (written in the syntax of [SQLAlchemy](http://www.sqlalchemy.org/), which Snorkel uses) to check how many documents and sentences were parsed:

In [6]:
from snorkel.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

('Documents:', 997)
('Sentences:', 28985)
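
For readers less familiar with SQLAlchemy, a `.count()` query like the ones above corresponds roughly to a `SELECT COUNT(*)` in SQL. The sketch below shows that idea with Python's built-in `sqlite3` module on a toy in-memory database. The table names, columns, and rows are invented for illustration and are not Snorkel's actual schema.

```python
import sqlite3

# Toy in-memory database with a hypothetical two-table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sentence (id INTEGER PRIMARY KEY, document_id INTEGER, text TEXT)"
)
conn.executemany(
    "INSERT INTO document (id, name) VALUES (?, ?)",
    [(1, "doc-a"), (2, "doc-b")],
)
conn.executemany(
    "INSERT INTO sentence (document_id, text) VALUES (?, ?)",
    [(1, "First sentence."), (1, "Second sentence."), (2, "Third sentence.")],
)

ALLOWED_TABLES = ("document", "sentence")


def count_rows(connection, table):
    """Count rows in one of the known tables."""
    # Table names cannot be bound as SQL parameters, so check against a whitelist.
    if table not in ALLOWED_TABLES:
        raise ValueError("unknown table: %s" % table)
    return connection.execute("SELECT COUNT(*) FROM " + table).fetchone()[0]


n_documents = count_rows(conn, "document")
n_sentences = count_rows(conn, "sentence")
print("Documents: %d" % n_documents)
print("Sentences: %d" % n_sentences)
```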


In [7]:
documents = session.query(Document).all()

for doc in documents:
    for sent in doc.sentences:
        print sent.words
        print sent.text
        print sent.ner_tags
    break


[u'The', u'Duke', u'of', u'Cambridge', u'has', u'thrown', u'his', u'support', u'behind', u'an', u'organisation', u"'s", u'fight', u'against', u'bullying', u'-', u'and', u'listed', u'an', u'enviable', u'support', u'network', u'.']
The Duke of Cambridge has thrown his support behind an organisation's fight against bullying - and listed an enviable support network.
[u'O', u'ORGANIZATION', u'O', u'LOCATION', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O', u'O']
[u'\\', u'n', u'\\', u'nWilliam', u'wrote', u'down', u'Catherine', u',', u'Harry', u',', u'father', u',', u'grandmother', u',', u'grandfather', u'and', u'an', u'extra', u'-', u'his', u'dog', u'Lupo', u'-', u'when', u'he', u'joined', u'a', u'Diana', u'Fund', u'trainee', u'session', u'for', u'anti-bullying', u'ambassadors', u'.']
\n \nWilliam wrote down Catherine, Harry, father, grandmother, grandfather and an extra - his dog Lupo - when he joined a Diana Fund trainee session
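
Since `words` and `ner_tags` are parallel lists, one convenient way to read output like the above is to pair each token with its tag and keep only the tagged ones. The sketch below uses the first six tokens and tags of the first sentence printed above, copied by hand into plain lists. `tagged_tokens` is a hypothetical helper, not part of Snorkel.

```python
def tagged_tokens(words, ner_tags, outside_tag="O"):
    """Return (word, tag) pairs for tokens whose tag is not the 'outside' tag."""
    if len(words) != len(ner_tags):
        raise ValueError("words and ner_tags must have the same length")
    return [(w, t) for w, t in zip(words, ner_tags) if t != outside_tag]


# First six tokens and tags from the first sentence shown in the output above.
words = ["The", "Duke", "of", "Cambridge", "has", "thrown"]
ner_tags = ["O", "ORGANIZATION", "O", "LOCATION", "O", "O"]

entities = tagged_tokens(words, ner_tags)
print(entities)
```

With a parsed sentence in hand, the same helper could be called as `tagged_tokens(sent.words, sent.ner_tags)`.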

### Running in parallel

Note that any time we execute a `UDFRunner` like `CorpusParser`, we can also execute it in parallel by running e.g.:
```python
corpus_parser.apply(doc_preprocessor, parallelism=20)
```
**Note, however, that this parallel execution will not work with SQLite, the database system used by default**; you will need to use e.g. Postgres for this!
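
One way to avoid requesting parallel execution against SQLite by accident is to derive the parallelism from the connection string. The following is a minimal sketch under stated assumptions: `safe_parallelism` is a hypothetical helper, the SQLite check is a simple prefix test on the connection string, and treating a value of 1 as single-process execution is an assumption about `apply` rather than documented behavior.

```python
import os


def safe_parallelism(requested, conn_string):
    """Return 1 for SQLite connection strings, else the requested value (at least 1)."""
    if conn_string.lower().startswith("sqlite"):
        return 1
    return max(1, requested)


# Assumed default when SNORKELDB is unset, following the introduction of this notebook.
conn_string = os.environ.get("SNORKELDB") or "sqlite:///snorkel.db"
parallelism = safe_parallelism(20, conn_string)
print(parallelism)

# Hypothetical use with the objects defined earlier in this notebook:
# corpus_parser.apply(doc_preprocessor, parallelism=parallelism)
```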

Next, in Part II, we will look at how to extract `Candidate` relations from our saved corpus.