# Candidate Extraction: Diseases

This notebook is an in-house demonstration of candidate extraction and featurization for tables. It assumes an input file in XHTML, a strict form of HTML that is also well-formed XML, which allows both easy display (as HTML) and safe tree traversal (as XML).
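
To illustrate why well-formed XHTML makes traversal safe, here is a minimal sketch that uses only the standard library, not Snorkel. The `XHTML` fragment is a made-up stand-in for the real input file, and it omits the `xmlns` declaration that real XHTML usually carries (with a namespace, the tag names would be prefixed).

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment; the contents of the real input file are not shown here.
XHTML = """<html><body><table>
<tr><th>Disease</th><th>Location</th><th>Year</th></tr>
<tr><td>Polio</td><td>New York</td><td>1914</td></tr>
</table></body></html>"""

# Parsing succeeds only because every tag is closed and properly nested.
root = ET.fromstring(XHTML)

# A plain document-order walk over the tree visits every table cell.
cell_texts = [el.text for el in root.iter() if el.tag in ("th", "td")]
print(cell_texts)
```

Ordinary HTML with unclosed tags would raise a `ParseError` here, which is the reason for requiring XHTML up front.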

In [None]:
%load_ext autoreload
%autoreload 2

### Candidate Extraction

First, import the 'HTMLParser' class to read HTML tables.

In [49]:
from snorkel.parser import HTMLParser
html_parser = HTMLParser(path='data/diseases/diseases.xhtml')

The "TableParser" class divides the HTML document into cells. It adds a 'cell_id' attribute to each cell for future traversal and creates "Cell" objects with attributes such as row number, column number, HTML tag, HTML attributes, and any tags/attributes on the cell's ancestors in the table.

In [50]:
from snorkel.parser import TableParser
table_parser = TableParser()
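
The kind of per-cell record described above can be sketched with the standard library alone. This is not the TableParser implementation; `extract_cells`, the dictionary keys, and the `XHTML` fragment are all hypothetical names chosen for illustration, and the sketch assumes tables are not nested.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; attribute values are invented for the example.
XHTML = """<html><body><table align="left">
<tr><th type="phenotype">Disease</th><th>Location</th></tr>
<tr><td>Polio</td><td>New York</td></tr>
</table></body></html>"""

def extract_cells(xhtml):
    """Return one dict per table cell with position, tag, and ancestor info."""
    root = ET.fromstring(xhtml)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    cells = []
    for table in root.iter("table"):
        for row_num, tr in enumerate(table.iter("tr")):
            row_cells = [c for c in tr if c.tag in ("td", "th")]
            for col_num, cell in enumerate(row_cells):
                # Walk upward from the cell to collect its ancestors.
                ancestors = []
                node = cell
                while node in parent:
                    node = parent[node]
                    ancestors.append(node)
                cells.append({
                    "cell_id": len(cells),
                    "row_num": row_num,
                    "col_num": col_num,
                    "text": (cell.text or "").strip(),
                    "html_tag": cell.tag,
                    "html_attrs": dict(cell.attrib),
                    "html_anc_tags": [a.tag for a in ancestors],
                    "html_anc_attrs": ["%s=%s" % kv
                                       for a in ancestors
                                       for kv in sorted(a.attrib.items())],
                })
    return cells

cells = extract_cells(XHTML)
print(cells[0])
```

Keeping the ancestor tags and attributes on each record means later steps can use table-level context (for example, an `align` attribute on the enclosing table) without re-walking the tree.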

As usual, pass both parsers to a CorpusParser, which digests the input and produces a Corpus object.

In [51]:
# from snorkel.parser import Corpus
# %time corpus = Corpus(html_parser, table_parser)

from snorkel.parser import CorpusParser
cp = CorpusParser(html_parser, table_parser)
%time corpus = cp.parse_corpus(name='Diseases Corpus')

CPU times: user 40.6 ms, sys: 3.53 ms, total: 44.1 ms
Wall time: 64.4 ms


In [52]:
doc = corpus.documents[0]
for phrase in doc.phrases: print phrase

Phrase('0', 0, 0, 0, u'Disease')
Phrase('0', 0, 1, 0, u'Location')
Phrase('0', 0, 2, 0, u'Year')
Phrase('0', 0, 3, 0, u'Polio')
Phrase('0', 0, 4, 0, u'New York')
Phrase('0', 0, 5, 0, u'1914')
Phrase('0', 0, 6, 0, u'Chicken Pox are bad.')
Phrase('0', 0, 6, 1, u'So is the plague.')
Phrase('0', 0, 7, 0, u'Boston')
Phrase('0', 0, 8, 0, u'2001')
Phrase('0', 0, 9, 0, u'Scurvy')
Phrase('0', 0, 10, 0, u'Annapolis')
Phrase('0', 0, 11, 0, u'1901')
Phrase('0', 0, 0, 0, u'Problem')
Phrase('0', 0, 1, 0, u'Cause')
Phrase('0', 0, 2, 0, u'Cost')
Phrase('0', 0, 3, 0, u'Arthritis')
Phrase('0', 0, 4, 0, u'Pokemon Go')
Phrase('0', 0, 5, 0, u'Free')
Phrase('0', 0, 6, 0, u'Yellow Fever')
Phrase('0', 0, 7, 0, u'Unicorns')
Phrase('0', 0, 8, 0, u'$17.75')
Phrase('0', 0, 9, 0, u'Hypochondria')
Phrase('0', 0, 10, 0, u'Fear')
Phrase('0', 0, 11, 0, u'$100')


Load the good ol' disease dictionary for recognizing disease names.

In [53]:
from load_dictionaries import load_disease_dictionary

# Load the disease phrase dictionary
diseases = load_disease_dictionary()
print "Loaded %s disease phrases!" % len(diseases)

Loaded 507899 disease phrases!


Here we use a new CandidateSpace object, TableNgrams. It inherits from Ngrams and ensures that the Table context object is broken up into cells before being passed to the usual routine for pulling out n-grams.

In [54]:
from snorkel.candidates import TableNgrams
from snorkel.matchers import DictionaryMatch

# Define a candidate space
ngrams = TableNgrams(n_max=3)

# Define a matcher
disease_matcher = DictionaryMatch(d=diseases, longest_match_only=False)
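
As a rough picture of what a candidate space with `n_max=3` plus a dictionary matcher does to the words of one cell, consider the sketch below. It is plain Python, not Snorkel's code: the functions `ngrams` and `dictionary_match` and the `toy_dictionary` are hypothetical, and case-insensitive matching is an assumption of this sketch.

```python
def ngrams(words, n_max=3):
    """Yield (start, end, text) for every span of 1..n_max consecutive words."""
    for start in range(len(words)):
        for end in range(start, min(start + n_max, len(words))):
            yield start, end, " ".join(words[start:end + 1])

def dictionary_match(words, dictionary, n_max=3):
    """Keep every n-gram found in the dictionary, including nested ones."""
    lowered = set(entry.lower() for entry in dictionary)
    return [(text, [start, end])
            for start, end, text in ngrams(words, n_max)
            if text.lower() in lowered]

# Toy dictionary standing in for the loaded disease phrases.
toy_dictionary = {"yellow fever", "fever", "polio"}
matches = dictionary_match(["Yellow", "Fever"], toy_dictionary)
print(matches)
```

Because nested matches are kept, both the two-word span and the shorter span inside it come back, which corresponds to running a matcher with `longest_match_only=False`.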

We pass the CandidateSpace and Matcher objects to a CandidateExtractor and run extraction over the tables in the corpus. A number of disease n-grams are returned.

In [55]:
# With new Candidates object:
# from snorkel.candidates import Candidates
# %time candidates = Candidates(table_ngrams, disease_matcher, corpus.get_contexts())

# With old Candidates object:
from snorkel.candidates import CandidateExtractor
ce = CandidateExtractor(ngrams, disease_matcher)
%time candidates = ce.extract(corpus.get_tables(), name='all')

for cand in candidates: print cand

CPU times: user 46 ms, sys: 1.3 ms, total: 47.3 ms
Wall time: 57.2 ms
Ngram("Disease", context=None, chars=[0,6], words=[0,0])
Ngram("Location", context=None, chars=[11,18], words=[0,0])
Ngram("Polio", context=None, chars=[31,35], words=[0,0])
Ngram("Chicken Pox", context=None, chars=[60,70], words=[0,1])
Ngram("plague", context=None, chars=[91,96], words=[3,3])
Ngram("Scurvy", context=None, chars=[120,125], words=[0,0])
Ngram("Problem", context=None, chars=[0,6], words=[0,0])
Ngram("Arthritis", context=None, chars=[28,36], words=[0,0])
Ngram("Yellow Fever", context=None, chars=[63,74], words=[0,1])
Ngram("Fever", context=None, chars=[70,74], words=[1,1])
Ngram("Hypochondria", context=None, chars=[101,112], words=[0,0])


### Feature Generation

We can then generate features on our set of candidates, including *new and improved* table features!

In [56]:
from snorkel.features import TableNgramFeaturizer
featurizer = TableNgramFeaturizer()
featurizer.fit_transform(candidates)

<11x115 sparse matrix of type '<type 'numpy.float64'>'
	with 242 stored elements in LInked List format>

In [57]:
featurizer.get_features_by_candidate(candidates[0])

[u'DDLIB_WORD_SEQ_[Disease]',
 u'DDLIB_LEMMA_SEQ_[disease]',
 u'DDLIB_POS_SEQ_[NN]',
 u'DDLIB_DEP_SEQ_[ROOT]',
 u'DDLIB_W_LEFT_1_[disease]',
 u'DDLIB_W_LEFT_POS_1_[NN]',
 'DDLIB_STARTS_WITH_CAPITAL',
 'DDLIB_LENGTH_1',
 'TABLE_ROW_NUM_0',
 'TABLE_COL_NUM_1',
 'TABLE_HTML_TAG_th',
 'TABLE_HTML_ATTR_style=height:66pt',
 'TABLE_HTML_ATTR_style=outline:solid',
 'TABLE_HTML_ATTR_type=phenotype',
 'TABLE_HTML_ANC_TAG_tr',
 'TABLE_HTML_ANC_TAG_tbody',
 'TABLE_HTML_ANC_TAG_table',
 'TABLE_HTML_ANC_TAG_body',
 'TABLE_HTML_ANC_ATTR_align=left',
 'TABLE_HTML_ANC_ATTR_style=width:158pt',
 'TABLE_HTML_ANC_ATTR_style=border-top-style:solid',
 'TABLE_HTML_ANC_ATTR_style=border-top-width:1pt',
 'TABLE_HTML_ANC_ATTR_size=5',
 'TABLE_HTML_ANC_ATTR_font=blue']
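
The `TABLE_*` names above suggest a simple scheme: one string per table property, with each distinct string assigned a column in the feature matrix. The sketch below imitates that idea in plain Python. It is not the TableNgramFeaturizer; `table_features`, `fit_transform`, and the shape of the cell dictionaries are assumptions made for illustration.

```python
def table_features(cell):
    """Turn one cell record into a list of feature-name strings."""
    feats = ["TABLE_ROW_NUM_%d" % cell["row_num"],
             "TABLE_COL_NUM_%d" % cell["col_num"],
             "TABLE_HTML_TAG_%s" % cell["html_tag"]]
    feats += ["TABLE_HTML_ATTR_%s" % a for a in cell["html_attrs"]]
    feats += ["TABLE_HTML_ANC_TAG_%s" % t for t in cell["html_anc_tags"]]
    feats += ["TABLE_HTML_ANC_ATTR_%s" % a for a in cell["html_anc_attrs"]]
    return feats

def fit_transform(cells):
    """Assign a column index to each distinct feature; return sparse binary rows."""
    index = {}
    rows = []
    for cell in cells:
        cols = set()
        for feat in table_features(cell):
            # First sighting of a feature gets the next free column.
            cols.add(index.setdefault(feat, len(index)))
        rows.append(sorted(cols))
    return index, rows

# Hypothetical cell records; real candidates carry more information.
example_cells = [
    {"row_num": 0, "col_num": 0, "html_tag": "th",
     "html_attrs": ["type=phenotype"],
     "html_anc_tags": ["tr", "table"], "html_anc_attrs": ["align=left"]},
    {"row_num": 1, "col_num": 0, "html_tag": "td",
     "html_attrs": [],
     "html_anc_tags": ["tr", "table"], "html_anc_attrs": ["align=left"]},
]
index, rows = fit_transform(example_cells)
print(len(index), rows)
```

Each row lists only the columns that are set, which is the same idea as the sparse matrix returned earlier: most candidates share a few features and leave the rest at zero.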

Ta-da! Next up: feeding these features into the learning machine.