# Tables Scratchpad

This notebook is an in-house demonstration of candidate extraction and featurization for tables. It assumes an input file in XHTML, a strict form of HTML that is also well-formed XML, which allows both easy display (as HTML) and safe tree traversal (as XML).
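To illustrate why well-formed XHTML makes tree traversal safe, here is a minimal sketch that walks a table with the standard-library XML parser. The snippet and its contents are made up for illustration and are not the contents of `diseases.xhtml`; it also does not use Snorkel's own parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet, used only for illustration.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tr><th>Disease</th><th>Location</th></tr>
      <tr><td>Polio</td><td>New York</td></tr>
    </table>
  </body>
</html>"""

# XHTML elements live in this namespace, so tags must be qualified.
NS = "{http://www.w3.org/1999/xhtml}"

# Because the document is well-formed XML, a plain XML parser accepts it.
root = ET.fromstring(XHTML)

# Collect the text of every cell, row by row.
rows = [[cell.text for cell in tr] for tr in root.iter(NS + "tr")]
print(rows)
```

Ordinary HTML with unclosed tags would make `ET.fromstring` raise a parse error, which is the reason for requiring XHTML input here.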

In [11]:
%load_ext autoreload
%autoreload 2

### Candidate Extraction

First, import the 'HTMLParser' class, which reads the XHTML input file.

In [12]:
from snorkel.parser import HTMLParser
html_parser = HTMLParser(path='data/diseases/diseases.xhtml')

The 'TableParser' class divides the HTML document into cells. It adds a 'cell_id' attribute to each cell for later traversal and creates 'Cell' objects with attributes such as row number, column number, HTML tag, HTML attributes, and any tags or attributes on the cell's ancestors in the table.

In [13]:
from snorkel.parser import TableParser
table_parser = TableParser()
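The kind of per-cell bookkeeping described above can be sketched with the standard library. This is not Snorkel's implementation: the `Cell` namedtuple, its field names, and `extract_cells` are hypothetical stand-ins, and the sketch ignores XML namespaces and nested tables.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Hypothetical stand-in for the parser's Cell objects.
Cell = namedtuple(
    "Cell", "cell_id row_num col_num text html_tag html_attrs html_anc_tags")

def extract_cells(xhtml):
    """Return one Cell per <td>/<th>, tagging each element with a cell_id."""
    root = ET.fromstring(xhtml)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    cells = []
    for table in root.iter("table"):
        for row_num, tr in enumerate(table.iter("tr")):
            columns = [el for el in tr if el.tag in ("td", "th")]
            for col_num, el in enumerate(columns):
                attrs = dict(el.attrib)  # copy before adding cell_id
                # Walk up the tree to collect ancestor tags, nearest first.
                anc_tags = []
                node = parent.get(el)
                while node is not None:
                    anc_tags.append(node.tag)
                    node = parent.get(node)
                cell_id = len(cells)
                el.set("cell_id", str(cell_id))
                text = "".join(el.itertext()).strip()
                cells.append(Cell(cell_id, row_num, col_num, text,
                                  el.tag, attrs, anc_tags))
    return cells

# Made-up document for illustration.
DOC = """<html><body><table>
<tr><th type="phenotype">Disease</th><th>Location</th></tr>
<tr><td>Polio</td><td>New York</td></tr>
</table></body></html>"""

cells = extract_cells(DOC)
for cell in cells:
    print(cell)
```

Storing the ancestor tags on each cell up front means later feature code never has to touch the tree again.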

As usual, pass both parsers to a CorpusParser, which digests the input and produces the Corpus.

In [14]:
# from snorkel.parser import Corpus
# %time corpus = Corpus(html_parser, table_parser)

from snorkel.parser import CorpusParser
cp = CorpusParser(html_parser, table_parser)
%time corpus = cp.parse_corpus(name='Diseases Corpus')

CPU times: user 38 ms, sys: 11.5 ms, total: 49.5 ms
Wall time: 73.1 ms


In [15]:
doc = corpus.documents[0]
for phrase in doc.phrases:
    print(phrase)

Phrase('0', 0, 0, 0, u'Disease')
Phrase('0', 0, 0, 1, u'Location')
Phrase('0', 0, 0, 2, u'Year')
Phrase('0', 0, 0, 3, u'Polio')
Phrase('0', 0, 0, 4, u'New York')
Phrase('0', 0, 0, 5, u'1914')
Phrase('0', 0, 0, 6, u"I don't like Chicken Pox.")
Phrase('0', 0, 0, 7, u'The plague is also bad.')
Phrase('0', 0, 0, 8, u'Boston')
Phrase('0', 0, 0, 9, u'2001')
Phrase('0', 0, 0, 10, u'Scurvy')
Phrase('0', 0, 0, 11, u'Annapolis')
Phrase('0', 0, 0, 12, u'1901')
Phrase('0', 1, 0, 0, u'Problem')
Phrase('0', 1, 0, 1, u'Cause')
Phrase('0', 1, 0, 2, u'Cost')
Phrase('0', 1, 0, 3, u'Arthritis')
Phrase('0', 1, 0, 4, u'Pokemon Go')
Phrase('0', 1, 0, 5, u'Free')
Phrase('0', 1, 0, 6, u'Yellow Fever')
Phrase('0', 1, 0, 7, u'Unicorns')
Phrase('0', 1, 0, 8, u'$17.75')
Phrase('0', 1, 0, 9, u'Hypochondria')
Phrase('0', 1, 0, 10, u'Fear')
Phrase('0', 1, 0, 11, u'$100')


Load the good ol' disease dictionary for recognizing disease names.

In [16]:
from load_dictionaries import load_disease_dictionary

# Load the disease phrase dictionary
diseases = load_disease_dictionary()
print("Loaded %s disease phrases!" % len(diseases))

Loaded 507899 disease phrases!


Here we use a new CandidateSpace object, TableNgrams. It inherits from Ngrams and ensures that the Table context is broken up into cells before being passed to the usual routine for extracting n-grams.

In [17]:
from snorkel.candidates import TableNgrams
from snorkel.matchers import DictionaryMatch

# Define a candidate space
table_ngrams = TableNgrams(n_max=3)

# Define a matcher
disease_matcher = DictionaryMatch(d=diseases, longest_match_only=False)
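As a rough picture of what a candidate space and a dictionary matcher do together, the sketch below enumerates every n-gram of up to `n_max` tokens in one cell and keeps those found in a dictionary. The functions `ngrams` and `dictionary_match` are hypothetical, and case-insensitive matching and the containment rule for `longest_match_only` are assumptions, not a description of Snorkel's `DictionaryMatch`.

```python
def ngrams(tokens, n_max=3):
    """Yield (word_start, word_end, text) for each n-gram, end inclusive."""
    for start in range(len(tokens)):
        for n in range(1, n_max + 1):
            end = start + n
            if end > len(tokens):
                break
            yield start, end - 1, " ".join(tokens[start:end])

def dictionary_match(tokens, dictionary, n_max=3, longest_match_only=False):
    """Return the n-grams of `tokens` that appear in `dictionary`."""
    # Assumption: matching ignores case.
    lowered = {entry.lower() for entry in dictionary}
    matches = [g for g in ngrams(tokens, n_max) if g[2].lower() in lowered]
    if longest_match_only:
        # Assumption: drop any match contained in a different match.
        matches = [m for m in matches
                   if not any(o != m and o[0] <= m[0] and m[1] <= o[1]
                              for o in matches)]
    return matches

# Tiny made-up dictionary for illustration.
toy_dictionary = {"yellow fever", "fever"}
all_matches = dictionary_match(["Yellow", "Fever"], toy_dictionary)
longest = dictionary_match(["Yellow", "Fever"], toy_dictionary,
                           longest_match_only=True)
print(all_matches)
print(longest)
```

With `longest_match_only=False`, overlapping matches such as "Yellow Fever" and "Fever" are both kept, which is the behaviour requested in the cell above.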

We pass the CandidateSpace and Matcher to an EntityExtractor and call its extract method on the corpus tables. Extraction returns a number of candidate disease Spans.

In [18]:
# With new Candidates object:
# from snorkel.candidates import Candidates
# %time candidates = Candidates(table_ngrams, disease_matcher, corpus.get_contexts())

# With old Candidates object:
from snorkel.candidates import EntityExtractor
ce = EntityExtractor(table_ngrams, disease_matcher)
%time candidates = ce.extract(corpus.get_tables(), name='all')

for cand in candidates:
    print(cand)

CPU times: user 3.37 ms, sys: 1.13 ms, total: 4.5 ms
Wall time: 3.96 ms
Span("Disease", context=None, chars=[0,6], words=[0,0])
Span("Location", context=None, chars=[0,7], words=[0,0])
Span("Polio", context=None, chars=[0,4], words=[0,0])
Span("Chicken Pox", context=None, chars=[13,23], words=[4,5])
Span("plague", context=None, chars=[4,9], words=[1,1])
Span("Scurvy", context=None, chars=[0,5], words=[0,0])
Span("Problem", context=None, chars=[0,6], words=[0,0])
Span("Arthritis", context=None, chars=[0,8], words=[0,0])
Span("Yellow Fever", context=None, chars=[0,11], words=[0,1])
Span("Fever", context=None, chars=[7,11], words=[1,1])
Span("Hypochondria", context=None, chars=[0,11], words=[0,0])


In [19]:
c = candidates[0]
for ngram in c.row_ngrams('words'):
    print(ngram)

location
year
polio
new
new_york
york
1914
i
i_do
i_do_n't
do
do_n't
do_n't_like
n't
n't_like
n't_like_chicken
like
like_chicken
like_chicken_pox
chicken
chicken_pox
chicken_pox_.
pox
pox_.
.
the
the_plague
the_plague_is
plague
plague_is
plague_is_also
is
is_also
is_also_bad
also
also_bad
also_bad_.
bad
bad_.
.
boston
2001
scurvy
annapolis
1901
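The shape of that output (lowercased tokens joined by underscores, taken from cells other than the candidate's own) can be reproduced with a short sketch. The function below is a hypothetical stand-in operating on pre-tokenized cells; it is not the `row_ngrams` method used above, and its signature is an assumption.

```python
def row_ngrams(row_cells, skip_col, n_max=3):
    """Yield lowercased, underscore-joined n-grams from every cell in a row
    except the one at index `skip_col` (the candidate's own cell)."""
    for col, tokens in enumerate(row_cells):
        if col == skip_col:
            continue
        lowered = [token.lower() for token in tokens]
        for start in range(len(lowered)):
            last = min(start + n_max, len(lowered))
            for end in range(start + 1, last + 1):
                yield "_".join(lowered[start:end])

# Made-up row of tokenized cells; the candidate sits in column 0.
row = [["Polio"], ["New", "York"], ["1914"]]
neighbours = list(row_ngrams(row, skip_col=0))
print(neighbours)
```

Joining tokens with underscores keeps each n-gram usable as a single feature-name suffix later on.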


### Feature Generation

We can then generate features on our set of candidates, including *new and improved* table features!

In [20]:
from snorkel.features import TableNgramFeaturizer
featurizer = TableNgramFeaturizer()
featurizer.fit_transform(candidates)

<11x237 sparse matrix of type '<type 'numpy.float64'>'
	with 888 stored elements in LInked List format>
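To see where a shape like 11x237 comes from, the sketch below turns per-candidate lists of feature strings into a sparse, row-wise representation: each distinct string gets one column, and each candidate stores only its nonzero columns. `build_feature_matrix` is a hypothetical helper using plain dictionaries, not the featurizer's code, and the feature strings are illustrative.

```python
def build_feature_matrix(feature_lists):
    """Map feature strings to column indices and build sparse rows.

    Returns (index, rows): `index` maps feature name -> column number, and
    `rows` holds one {column: value} dict per candidate.
    """
    index = {}
    rows = []
    for features in feature_lists:
        row = {}
        for name in features:
            # Assign the next free column the first time a feature is seen.
            col = index.setdefault(name, len(index))
            row[col] = 1.0
        rows.append(row)
    return index, rows

# Illustrative feature strings for two made-up candidates.
feature_lists = [
    ["TABLE_ROW_NUM_[0]", "TABLE_HTML_TAG_th"],
    ["TABLE_ROW_NUM_[0]", "TABLE_HTML_TAG_td"],
]
index, rows = build_feature_matrix(feature_lists)
shape = (len(rows), len(index))
stored = sum(len(row) for row in rows)
print(shape, stored)
```

Here the number of rows equals the number of candidates and the number of columns equals the number of distinct feature strings seen across all of them.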

In [21]:
featurizer.get_features_by_candidate(candidates[1])[:]

[u'DDLIB_WORD_SEQ_[Location]',
 u'DDLIB_LEMMA_SEQ_[Location]',
 u'DDLIB_POS_SEQ_[NNP]',
 u'DDLIB_DEP_SEQ_[ROOT]',
 u'DDLIB_W_LEFT_1_[Location]',
 u'DDLIB_W_LEFT_POS_1_[NNP]',
 u'DDLIB_STARTS_WITH_CAPITAL',
 u'DDLIB_NUM_WORDS_1',
 u'TABLE_ROW_NUM_[0]',
 u'TABLE_COL_NUM_[0]',
 u'TABLE_HTML_TAG_th',
 u'TABLE_HTML_ATTR_style=height:12pt',
 u'TABLE_HTML_ATTR_type=phenotype',
 u'TABLE_HTML_ANC_TAG_tr',
 u'TABLE_HTML_ANC_TAG_tbody',
 u'TABLE_HTML_ANC_TAG_table',
 u'TABLE_HTML_ANC_TAG_body',
 u'TABLE_HTML_ANC_ATTR_style=height:13pt',
 u'TABLE_HTML_ANC_ATTR_center=left',
 u'TABLE_HTML_ANC_ATTR_size=2',
 u'TABLE_HTML_ANC_ATTR_font=blue',
 u'TABLE_ROW_WORDS_disease',
 u'TABLE_ROW_WORDS_year',
 u'TABLE_ROW_WORDS_polio',
 u'TABLE_ROW_WORDS_new',
 u'TABLE_ROW_WORDS_new_york',
 u'TABLE_ROW_WORDS_york',
 u'TABLE_ROW_WORDS_1914',
 u'TABLE_ROW_WORDS_i',
 u'TABLE_ROW_WORDS_i_do',
 u"TABLE_ROW_WORDS_i_do_n't",
 u'TABLE_ROW_WORDS_do',
 u"TABLE_ROW_WORDS_do_n't",
 u"TABLE_ROW_WORDS_do_n't_like",
 u"TABLE_RO

Ta-da! Next up: feeding these features into the learning machine.