# Tables in Snorkel: Extracting Attributes from Spec Sheets

In this tutorial, we will walk through the process of using Snorkel to extract relations from text, including text found in tables. If you have not already, consider walking through the Intro tutorial first to familiarize yourself with Snorkel in general. 

The source documents for this tutorial are specification sheets from various manufacturers of transistors. Prior to the beginning of this tutorial, these sheets were converted from PDF to HTML format using Adobe Acrobat. These conversions are not always (in fact, are almost never) perfectly accurate, but with Snorkel we aim to learn through this noise. In this tutorial specifically, we will be attempting to extract (part number, minimum storage temperature) pairs.

## Part I: Preprocessing

In this first notebook, we preprocess the input data, parsing it into Python objects that we store in a database.

### Initializing a `SnorkelSession`

First, we initialize a `SnorkelSession`, which will enable us to save intermediate results.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from snorkel import SnorkelSession
session = SnorkelSession()


### Parsing the `Corpus`

For a `DocParser` we use `HTMLParser`, which reads a file (or directory of files) looking for `<html>` tags and extracts the contents of each one as a new document.
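To make the "one `<html>` element per document" idea concrete, here is a minimal sketch using only the standard library. It is not Snorkel's `HTMLParser`; the helper `split_html_documents`, the regular expression, and the sample string (including the second part number) are assumptions made purely for illustration.

```python
import re

# Hypothetical helper, not part of Snorkel: matches each <html>...</html>
# block, case-insensitively and across line breaks.
HTML_BLOCK = re.compile(r"<html\b[^>]*>(.*?)</html>", re.IGNORECASE | re.DOTALL)


def split_html_documents(raw_text):
    """Return the inner contents of every <html> element found in raw_text."""
    return [match.strip() for match in HTML_BLOCK.findall(raw_text)]


# Made-up sample input holding two <html> blocks in one string.
sample = (
    "<html><body><p>Part 2N6427</p></body></html>\n"
    "<HTML lang='en'><body><p>Part BC547</p></body></HTML>"
)
documents = split_html_documents(sample)
```

Each entry of `documents` would then be handed to the context parser as a separate document. A regular expression is enough for this toy input, but real spec sheets are better served by a proper HTML parser.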

For a `ContextParser` we use `OmniParser`, which parses the contents of an HTML document, paying special attention to tags that denote table-like structure and recording row/column information as relevant.
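The row/column bookkeeping described above can be sketched with the standard-library HTML parser. This is an illustrative stand-in, not `OmniParser` itself: the class `TableCellCollector`, its attributes, and the sample table values are all hypothetical, and the sketch ignores nested tables and `rowspan`/`colspan`.

```python
try:
    from html.parser import HTMLParser as StdlibHTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser as StdlibHTMLParser  # Python 2


class TableCellCollector(StdlibHTMLParser):
    """Hypothetical table-aware parser: records (table, row, col, text) per cell."""

    def __init__(self):
        StdlibHTMLParser.__init__(self)
        self.table = -1       # index of the current table
        self.row = -1         # index of the current row within the table
        self.col = -1         # index of the current cell within the row
        self.in_cell = False  # True while inside a <td> or <th>
        self.cells = []       # [table, row, col, text] entries

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table += 1
            self.row = -1
        elif tag == "tr":
            self.row += 1
            self.col = -1
        elif tag in ("td", "th"):
            self.col += 1
            self.in_cell = True
            self.cells.append([self.table, self.row, self.col, ""])

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        # Text is attached to the most recently opened cell.
        if self.in_cell:
            self.cells[-1][3] += data


# Made-up sample table; the values are for illustration only.
sample_table = (
    "<table>"
    "<tr><th>Parameter</th><th>Value</th></tr>"
    "<tr><td>Collector-Emitter Voltage</td><td>40</td></tr>"
    "</table>"
)
collector = TableCellCollector()
collector.feed(sample_table)
collector.close()
cells = [tuple(cell) for cell in collector.cells]
```

Keeping the table, row, and column indices next to the text is what later lets a label be related to the value that shares its row.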

For the sake of speed, we parse only 10 documents in this tutorial.

In [2]:
from snorkel.parser import CorpusParser
from snorkel.parser import HTMLParser
from snorkel.parser import OmniParser

doc_parser = HTMLParser(path='data/hardware/hardware100_html/')
context_parser = OmniParser()
cp = CorpusParser(doc_parser, context_parser, max_docs=10) 

In [3]:
%time corpus = cp.parse_corpus(name='Hardware', session=session)


CPU times: user 28.1 s, sys: 1.93 s, total: 30 s
Wall time: 1min 12s


The parsed corpus forms an object hierarchy (document, table, cell, phrase), which we can traverse as follows:

In [13]:
doc = corpus.documents[0]
print doc
print doc.tables[0]
print doc.tables[0].cells[5]
print doc.tables[0].cells[5].phrases[0]

Document 2N6427
Table(Doc: 2N6427, Position: 0)
Cell(Doc: 2N6427, Table: 0, Row: 1, Col: 1)
Phrase(Doc: 2N6427, Table: 0, Row: 1, Col: 1, Position: 0, Text: Collector-Emitter Voltage)
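The same traversal can be written as a loop that visits every phrase in every table cell. The sketch below runs without Snorkel by using stand-in `namedtuple` types that mirror only the attribute names seen above (`tables`, `cells`, `phrases`); the `row`, `col`, `position`, and `text` field names are assumptions inferred from the printed output, and `iter_phrases` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in types for illustration only; these are NOT Snorkel's classes.
Phrase = namedtuple("Phrase", ["text"])
Cell = namedtuple("Cell", ["row", "col", "phrases"])
Table = namedtuple("Table", ["position", "cells"])
Document = namedtuple("Document", ["name", "tables"])


def iter_phrases(document):
    """Yield (table position, row, col, text) for each phrase in each table cell."""
    for table in document.tables:
        for cell in table.cells:
            for phrase in cell.phrases:
                yield (table.position, cell.row, cell.col, phrase.text)


# A toy document that reproduces the single phrase printed above.
toy_doc = Document(
    name="2N6427",
    tables=[
        Table(position=0, cells=[
            Cell(row=1, col=1,
                 phrases=[Phrase(text="Collector-Emitter Voltage")]),
        ]),
    ],
)
phrase_rows = list(iter_phrases(toy_doc))
```

If the real objects expose the same attribute names, the same generator could be pointed at `corpus.documents[0]`, but that has not been verified here.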


### Saving the `Corpus`
We persist the parsed corpus in Snorkel's database backend:

In [5]:
session.add(corpus)
session.commit()

### Reloading the `Corpus`
If the corpus has already been parsed, load it here:

In [6]:
from snorkel.models import Corpus

corpus = session.query(Corpus).filter(Corpus.name == 'Hardware').one()
print "%s contains %d Documents" % (corpus, len(corpus))

Corpus (Hardware) contains 10 Documents


### Split the `Corpus` into Train/Dev

Here we segment the parsed corpus into `Training` and `Development` sets.

In [7]:
from snorkel.utils import get_ORM_instance
from snorkel.queries import split_corpus

corpus = get_ORM_instance(Corpus, session, 'Hardware')
split_corpus(session, corpus, train=0.8, development=0.2, test=0, seed=0)

8 Documents added to corpus Hardware Training
2 Documents added to corpus Hardware Development
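For intuition, a seeded fractional split can be sketched in a few lines of plain Python. This is not the implementation of `split_corpus`; the function `split_names`, its rounding rule, and the placeholder document names are assumptions, so it will not necessarily assign the same documents to each set as Snorkel does.

```python
import random


def split_names(names, train=0.8, development=0.2, seed=0):
    """Shuffle names with a fixed seed, then cut into train/development/test lists."""
    shuffled = sorted(names)               # start from a canonical order
    random.Random(seed).shuffle(shuffled)  # same seed gives the same permutation
    n_train = int(round(train * len(shuffled)))
    n_dev = int(round(development * len(shuffled)))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],        # whatever remains is the test set
    )


# Ten placeholder document names, matching the corpus size used above.
doc_names = ["doc%02d" % i for i in range(10)]
train_docs, dev_docs, test_docs = split_names(doc_names)
```

Fixing the seed makes the split reproducible, so the same documents end up in the development set every time the notebook is rerun.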


We demonstrate below how these corpora can be retrieved.

In [8]:
from snorkel.utils import get_ORM_instance
from snorkel.models import Corpus

corpus = get_ORM_instance(Corpus, session, 'Hardware Training')
print "%s contains %d Documents" % (corpus, len(corpus))

corpus = get_ORM_instance(Corpus, session, 'Hardware Development')
print "%s contains %d Documents" % (corpus, len(corpus))

Corpus (Hardware Training) contains 8 Documents
Corpus (Hardware Development) contains 2 Documents


Next, in Part II, we will look at how to extract `Candidate` relations from our saved `Corpus` objects.