# Tutorial, Part I: Candidate Extraction

In this example, we'll write an application to extract **person-age relationships** from homemade tables, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). At its core, we will construct a model to classify _person-age relation mentions_ as either true or false. To do this, we first need a set of such candidates; in this notebook, we'll use `Snorkel` utilities to extract person candidates.

## Loading the Corpus

First, we will load and pre-process the corpus, storing it for convenience in a `Corpus` object.

### Configuring a table parser

We'll start by instantiating an `HTMLTableParser` to read HTML tables.

In [1]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [38]:
from ddlite_parser import HTMLTableParser
html_parser = HTMLTableParser(path='data/diseases.xml')

In [39]:
from ddlite_parser import TableParser
table_parser = TableParser()

In [40]:
from ddlite_parser import Corpus
%time corpus = Corpus(html_parser, table_parser)

Parsing documents...
Parsing contexts...
CPU times: user 54.1 ms, sys: 5.07 ms, total: 59.2 ms
Wall time: 85.3 ms


In [41]:
from load_dictionaries import load_disease_dictionary

# Load the disease phrase dictionary
diseases = load_disease_dictionary()
print("Loaded %s disease phrases!" % len(diseases))

Loaded 507899 disease phrases!


In [42]:
from ddlite_candidates import CellNgrams
from ddlite_matchers import DictionaryMatch

# Define a candidate space
cell_ngrams = CellNgrams(n_max=3)

# Define a matcher
disease_matcher = DictionaryMatch(d=diseases, longest_match_only=False)

# Define extractor
# disease_extractor = EntityExtractor(ngrams, matcher)


Note that we set `longest_match_only=False`, which means that we _will_ also keep matches that are subsequences of a longer phrase matching our dictionary.
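
To make the effect of this flag concrete, here is a minimal standalone sketch. It does not use or reproduce the `DictionaryMatch` implementation; the function name `dictionary_matches`, the exact-string comparison, and the span representation are all assumptions made for illustration.

```python
def dictionary_matches(tokens, dictionary, n_max=3, longest_match_only=False):
    """Return (start, end, phrase) spans whose phrase is in `dictionary`.

    Hypothetical helper for illustration only; not the ddlite matcher.
    """
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + n_max, len(tokens)) + 1):
            phrase = " ".join(tokens[start:end])
            if phrase in dictionary:
                matches.append((start, end, phrase))
    if longest_match_only:
        # Drop every match that is contained in a strictly longer match.
        matches = [
            m for m in matches
            if not any(
                o[0] <= m[0] and m[1] <= o[1] and (o[1] - o[0]) > (m[1] - m[0])
                for o in matches
            )
        ]
    return matches


tokens = "chronic kidney disease".split()
dictionary = {"chronic kidney disease", "kidney disease", "disease"}

all_matches = dictionary_matches(tokens, dictionary)
longest_only = dictionary_matches(tokens, dictionary, longest_match_only=True)
print(all_matches)   # three overlapping spans
print(longest_only)  # only the full three-token phrase
```

With the flag off, all three overlapping phrases are kept as separate candidates; with it on, only the longest one survives.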

The n-grams operator (here `CellNgrams`) is applied over our `Sentence` objects and returns n-gram objects, which the matcher then filters. We therefore apply both operators over the sentences in the corpus, storing the results in a `Candidates` object for convenience:
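
The generate-then-filter pattern described above can be sketched in plain Python. This is only an illustration under assumptions: `ngrams` and `extract` are hypothetical names, cells are modeled as token lists, and the matcher is a simple predicate rather than a `DictionaryMatch` object.

```python
def ngrams(tokens, n_max=3):
    """Yield every contiguous token span of length 1..n_max as a string."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def extract(candidate_space, matcher, sentences):
    """Enumerate the candidate space of each sentence and keep what the matcher accepts."""
    return [c for s in sentences for c in candidate_space(s) if matcher(c)]


# Hypothetical table cells, already tokenized.
cells = [["Chicken", "Pox"], ["1901"], ["Location"]]
phrases = {"Chicken Pox", "Polio"}

found = extract(ngrams, lambda c: c in phrases, cells)
print(found)
```

Keeping the candidate space and the matcher as separate operators means either one can be swapped out independently, which is what happens when the year matcher is used further on.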

In [43]:
from ddlite_candidates import Candidates
%time c = Candidates(cell_ngrams, disease_matcher, corpus.get_sentences())
# Inspect the HTML attributes and table features of the first candidate
print(c.get_candidates()[0].html_attrs)
for feat in c.get_candidates()[0].get_table_feats():
    print(feat)

Extracting candidates...
CPU times: user 2.65 ms, sys: 640 µs, total: 3.29 ms
Wall time: 2.87 ms
[]
ROW_NUM_0
COL_NUM_1
HTML_TAG_th
HTML_ANC_TAG_tr
HTML_ANC_TAG_tbody
HTML_ANC_TAG_table
HTML_ANC_TAG_body
HTML_ANC_ATTR_align=left
HTML_ANC_ATTR_size=5
HTML_ANC_ATTR_font=blue
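
Features of this shape can be derived from a cell's position and its chain of ancestors in the markup. The sketch below uses only the standard library on a small hand-written table; the function `table_feats` and the sample HTML are assumptions for illustration and are not how `get_table_feats()` is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample table.
HTML = """<body><table align="left"><tbody>
<tr><th>Disease</th><th>Year</th></tr>
<tr><td>Polio</td><td>1914</td></tr>
</tbody></table></body>"""


def table_feats(root, cell):
    """Return position, tag, and ancestor features for one table cell."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    row = parent[cell]
    rows = list(root.iter("tr"))
    feats = [
        "ROW_NUM_%d" % rows.index(row),
        "COL_NUM_%d" % list(row).index(cell),
        "HTML_TAG_%s" % cell.tag,
    ]
    # Walk from the row up to the root, collecting ancestors in order.
    ancestors = []
    node = row
    while node is not None:
        ancestors.append(node)
        node = parent.get(node)
    feats += ["HTML_ANC_TAG_%s" % a.tag for a in ancestors]
    feats += [
        "HTML_ANC_ATTR_%s=%s" % kv
        for a in ancestors
        for kv in sorted(a.attrib.items())
    ]
    return feats


root = ET.fromstring(HTML)
header_cell = root.findall(".//th")[1]
feats = table_feats(root, header_cell)
for feat in feats:
    print(feat)
```

Encoding structure as flat string features like these lets a downstream classifier treat row, column, and markup context the same way it treats any other sparse feature.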


In [10]:
# Define another matcher
years = [str(x) for x in range(1800,2016)]
year_matcher = DictionaryMatch(d=years, longest_match_only=False)

%time c = Candidates(cell_ngrams, year_matcher, corpus.get_sentences())
c.get_candidates()

Extracting candidates...
CPU times: user 3.3 ms, sys: 1.12 ms, total: 4.42 ms
Wall time: 3.93 ms


[<CellNgram("1914", id=0-0-5:0-3, chars=[0,3], (row,col)=(1,5), tag=td),
 <CellNgram("1901", id=0-0-11:0-3, chars=[0,3], (row,col)=(3,5), tag=td),
 <CellNgram("2001", id=0-0-8:0-3, chars=[0,3], (row,col)=(2,5), tag=td)]

In [54]:
from ddlite_candidates import RelationExtractor
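# NOTE: disease_extractor and year_extractor are not defined in the cells shown above;
# they must already exist in the session (see the commented-out EntityExtractor line earlier).
disease_year_extractor = RelationExtractor([disease_extractor, year_extractor])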
%time c = Candidates(disease_year_extractor, corpus.get_sentences())
c.get_candidates()

Extracting candidates...
CPU times: user 14.4 ms, sys: 3.92 ms, total: 18.3 ms
Wall time: 15.7 ms


[Relation<Ngram("Disease", id=0-0-0:0-6),Ngram("2001", id=0-0-8:0-3)>,
 Relation<Ngram("Polio", id=0-0-3:0-4),Ngram("2001", id=0-0-8:0-3)>,
 Relation<Ngram("Polio", id=0-0-3:0-4),Ngram("1914", id=0-0-5:0-3)>,
 Relation<Ngram("Disease", id=0-0-0:0-6),Ngram("1914", id=0-0-5:0-3)>,
 Relation<Ngram("Scurvy", id=0-0-9:0-5),Ngram("2001", id=0-0-8:0-3)>,
 Relation<Ngram("Disease", id=0-0-0:0-6),Ngram("1901", id=0-0-11:0-3)>,
 Relation<Ngram("Polio", id=0-0-3:0-4),Ngram("1901", id=0-0-11:0-3)>,
 Relation<Ngram("Chicken Pox", id=0-0-6:0-10),Ngram("2001", id=0-0-8:0-3)>,
 Relation<Ngram("Chicken Pox", id=0-0-6:0-10),Ngram("1901", id=0-0-11:0-3)>,
 Relation<Ngram("Scurvy", id=0-0-9:0-5),Ngram("1901", id=0-0-11:0-3)>,
 Relation<Ngram("Chicken Pox", id=0-0-6:0-10),Ngram("1914", id=0-0-5:0-3)>,
 Relation<Ngram("Scurvy", id=0-0-9:0-5),Ngram("1914", id=0-0-5:0-3)>]
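
The twelve relations above are consistent with pairing every disease mention with every year mention. A minimal sketch of that pairing, assuming a plain cross product (the helper `relation_candidates` is hypothetical and is not the `RelationExtractor` implementation):

```python
from itertools import product

# Entity mentions as they appear in the output above.
diseases_found = ["Disease", "Polio", "Chicken Pox", "Scurvy"]
years_found = ["1914", "2001", "1901"]


def relation_candidates(*entity_lists):
    """Pair every mention from each list with every mention from the others."""
    return list(product(*entity_lists))


pairs = relation_candidates(diseases_found, years_found)
print(len(pairs))
for disease, year in pairs[:3]:
    print("%s / %s" % (disease, year))
```

Exhaustive pairing deliberately over-generates: the header cell "Disease" is paired with every year, for instance, and it is the classifier's job to label such candidates as false.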