# Entity Extraction from Tables

This notebook walks through the full extraction and learning process for _entities_ in tables, using a data set of transistor spec sheets to extract minimum storage temperatures.

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Extraction

In [3]:
load_pickle = True # with pickle ~15s; without pickle ~75s
save_pickle = False

corpus_loaded = False
import cPickle
if load_pickle:
    try:
        # Pickle files should be opened in binary mode
        with open("data/hardware/hardware_corpus.pkl","rb") as pkl:
            %time corpus = cPickle.load(pkl)
        corpus_loaded = True
        print "Corpus has been loaded."
    except Exception:
        print "Corpus could not be loaded."
        print "Corpus will be parsed instead..."
if not corpus_loaded:
    from snorkel.parser import CorpusParser
    from snorkel.parser import HTMLParser
    from snorkel.parser import TableParser

    html_parser = HTMLParser(path='data/hardware/hardware_html/')
    table_parser = TableParser()

    cp = CorpusParser(html_parser, table_parser, max_docs=200)
    %time corpus = cp.parse_corpus(name='Hardware Corpus')
    print "Corpus has been parsed."

    if save_pickle:
        # Write the pickle in binary mode
        with open("data/hardware/hardware_corpus.pkl","wb") as pkl:
            %time cPickle.dump(corpus, pkl)
            print "Corpus has been pickled."

CPU times: user 12.3 s, sys: 555 ms, total: 12.8 s
Wall time: 13 s
Corpus has been loaded.
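
The load-or-parse pattern in the cell above can be isolated into a small helper. The following is a minimal sketch and not part of snorkel: `load_or_build` and its `build` callback are hypothetical names, and the stand-in dictionary replaces the real parsed corpus.

```python
import os
import pickle
import tempfile


def load_or_build(path, build, save=True):
    """Return the object pickled at `path`, or build it and optionally cache it.

    `build` is a hypothetical zero-argument callable standing in for the
    slow step (in this notebook, parsing the corpus).
    """
    if os.path.exists(path):
        try:
            with open(path, "rb") as pkl:  # binary mode for pickle
                return pickle.load(pkl)
        except (pickle.UnpicklingError, EOFError):
            pass  # unreadable cache: fall through and rebuild
    obj = build()
    if save:
        with open(path, "wb") as pkl:
            pickle.dump(obj, pkl)
    return obj


# Usage with a stand-in for the parser (illustrative data only).
cache_path = os.path.join(tempfile.mkdtemp(), "corpus.pkl")
first = load_or_build(cache_path, lambda: {"docs": 200})   # built and cached
second = load_or_build(cache_path, lambda: {"docs": 0})    # served from cache
```

The second call never runs its `build` callback, because the cached file already exists.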


In [4]:
from snorkel.candidates import TableNgrams
from snorkel.matchers import RangeMatcher

# Define a candidate space: n-grams of up to two words from table cells
ngrams = TableNgrams(n_max=2)

# Define a matcher: keep numbers between -70 and -50
number_matcher = RangeMatcher(low=-70,high=-50)
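
To make the matching step concrete, here is a minimal sketch of a numeric range check. It is not snorkel's `RangeMatcher` implementation; `in_range` is a hypothetical helper, and the bounds are assumed to be inclusive, which is consistent with "-50" appearing among the extracted spans later in this notebook.

```python
def in_range(text, low, high):
    """Return True if `text` parses as a number within [low, high]."""
    try:
        value = float(text)
    except ValueError:
        return False  # not a number, so it cannot match
    return low <= value <= high


# Illustrative tokens, as they might appear in a spec-sheet table cell.
tokens = ["-55", "to", "150", "-50", "-71"]
matches = [t for t in tokens if in_range(t, -70, -50)]
```

Here `matches` keeps only the tokens that parse as numbers inside the range.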

In [5]:
from snorkel.candidates import EntityExtractor
ce = EntityExtractor(ngrams, number_matcher)
%time candidates = ce.extract(corpus.get_tables(), name='all')
for cand in candidates[:10]: 
    print cand
print "%s candidates extracted" % len(candidates)

CPU times: user 1.75 s, sys: 40 ms, total: 1.79 s
Wall time: 1.78 s
Span("-55", context=None, chars=[0,2], words=[0,0])
Span("-50", context=None, chars=[0,2], words=[0,0])
Span("-50", context=None, chars=[0,2], words=[0,0])
Span("-50", context=None, chars=[4,6], words=[2,2])
Span("-50", context=None, chars=[4,6], words=[2,2])
Span("-50", context=None, chars=[4,6], words=[2,2])
Span("-55", context=None, chars=[0,2], words=[0,0])
Span("-55", context=None, chars=[0,2], words=[0,0])
Span("-55", context=None, chars=[0,2], words=[0,0])
Span("-50", context=None, chars=[0,2], words=[0,0])
108 candidates extracted
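
The `chars` and `words` fields in the spans above are inclusive start and end offsets. The sketch below shows how such offsets could arise when enumerating n-grams of a cell's text. It assumes plain whitespace tokenization, which is likely simpler than what the real parser does, and `ngram_spans` is a hypothetical name.

```python
def ngram_spans(text, n_max=2):
    """Return (ngram, char_start, char_end, word_start, word_end) tuples.

    All offsets are inclusive. Tokenization is plain whitespace splitting,
    an assumption made for illustration.
    """
    # Record each word together with its character offsets.
    words = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok) - 1
        words.append((tok, start, end))
        pos = end + 1

    # Enumerate every contiguous run of 1..n_max words.
    spans = []
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            start = words[i][1]
            end = words[i + n - 1][2]
            spans.append((text[start:end + 1], start, end, i, i + n - 1))
    return spans


spans = ngram_spans("-55 to 150")
```

The first span is `("-55", 0, 2, 0, 0)`, the same offsets as `chars=[0,2], words=[0,0]` in the printed candidates.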


### Learning

First, load the gold labels and match them to the extracted candidates.

In [6]:
from utils import collect_hardware_entity_gold
filename='data/hardware/gold_all.csv'
(gold_candidates, gold_labels) = collect_hardware_entity_gold(filename, 'stg_temp_min', candidates)
print "%s out of %s candidates have gold labels" % (len(gold_candidates), len(candidates))
print "%s out of %s labeled candidates have positive label" % (gold_labels.count(1), len(gold_candidates))

# Split into a training set and a held-out gold set.
# Note: gold_candidates already holds every labeled candidate, so the append
# below adds a second copy of each match from the first half. idx indexes
# candidates, not gold_candidates, so the appended label may not belong to c.
training_candidates = []
n_half = len(candidates) // 2
for idx, c in enumerate(candidates[:n_half]):
    if c in gold_candidates:
        gold_candidates.append(c)
        gold_labels.append(gold_labels[idx])
    else:
        training_candidates.append(c)
training_candidates.extend(candidates[n_half:])
gold_labels = np.array(gold_labels)
# print "Training set size: %s" % len(training_candidates)
# print "Gold set size: %s" % len(gold_candidates)
print "Positive labels in gold set: %s" % sum(gold_labels==1)
print "Negative labels in gold set: %s" % sum(gold_labels==-1)

98 gold annotations
59 out of 108 candidates have gold labels
51 out of 59 labeled candidates have positive label
Positive labels in gold set: 81
Negative labels in gold set: 12
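
For comparison, here is one way to write a split in which each held-out candidate appears once and keeps its own label. This is a sketch under stated assumptions, not the notebook's method: `split_gold_and_training` is a hypothetical helper, candidates are assumed to be hashable, and the result will not reproduce the counts printed above, which come from the cell as written.

```python
def split_gold_and_training(candidates, gold_candidates, gold_labels):
    """Hold out labeled candidates from the first half; train on the rest."""
    # Map each labeled candidate to its own label so the two stay aligned.
    label_of = dict(zip(gold_candidates, gold_labels))
    n_half = len(candidates) // 2
    held_out, held_out_labels, training = [], [], []
    for idx, c in enumerate(candidates):
        if idx < n_half and c in label_of:
            held_out.append(c)
            held_out_labels.append(label_of[c])
        else:
            training.append(c)
    return held_out, held_out_labels, training


# Toy data: strings stand in for candidate objects.
cands = ["a", "b", "c", "d", "e", "f"]
held_out, held_out_labels, training = split_gold_and_training(
    cands, ["b", "c", "e"], [1, -1, 1])
```

With this toy input, "b" and "c" are held out with their labels, and "e" stays in the training set because it falls in the second half.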


Let's take a quick peek at the features:

In [7]:
from snorkel.features import TableNgramFeaturizer
featurizer = TableNgramFeaturizer()
featurizer.fit_transform(candidates)
for f in featurizer.get_features_by_candidate(candidates[0])[:10]:
    print f

Building feature index...
Extracting features...
0/4419
DDLIB_WORD_SEQ_[-55]
DDLIB_LEMMA_SEQ_[-55]
DDLIB_POS_SEQ_[CD]
DDLIB_DEP_SEQ_[ROOT]
DDLIB_W_LEFT_1_[_NUMBER]
DDLIB_W_LEFT_POS_1_[CD]
DDLIB_W_LEFT_2_[to _NUMBER]
DDLIB_W_LEFT_POS_2_[TO CD]
DDLIB_W_LEFT_3_[_NUMBER to _NUMBER]
DDLIB_W_LEFT_POS_3_[CD TO CD]


Define labeling functions:

In [8]:
def LF_to_range(m):
    return 1 if 'to' in m.post_window('words') else 0
def LF_tilde_range(m):
    return 1 if '~' in m.post_window('words') else 0
def LF_storage(m):
    return 1 if 'storage' in m.aligned_ngrams('words') else -1
def LF_tstg(m):
    return 1 if 'tstg' in m.aligned_ngrams('words') else -1
def LF_tj(m):
    return 1 if 'tj' in m.aligned_ngrams('words') else -1
def LF_temperature(m):
    return 1 if 'temperature' in m.aligned_ngrams('words') else -1
def LF_celsius(m):
    return 1 if 'c' in m.aligned_ngrams('words') else -1
def LF_max(m):
    return 1 if 'max' in m.aligned_ngrams('words') else 0
def LF_min(m):
    return 1 if 'min' in m.aligned_ngrams('words') else 0

In [9]:
LFs = [LF_to_range, LF_tilde_range, LF_storage, LF_tstg, LF_tj, LF_temperature,
      LF_celsius, LF_max, LF_min]
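
Each labeling function maps a candidate to 1 (positive), -1 (negative), or 0 (abstain). The sketch below shows how a list of such functions could be applied to build a label matrix. `FakeCandidate` and `apply_lfs` are hypothetical stand-ins written for illustration; they are not snorkel classes, and only two of the labeling functions are repeated so the block is self-contained.

```python
import numpy as np


class FakeCandidate(object):
    """Hypothetical stand-in exposing the two methods the LFs call."""

    def __init__(self, post_words, aligned_words):
        self._post = post_words
        self._aligned = aligned_words

    def post_window(self, attribute):
        return self._post

    def aligned_ngrams(self, attribute):
        return self._aligned


def LF_to_range(m):
    return 1 if 'to' in m.post_window('words') else 0


def LF_storage(m):
    return 1 if 'storage' in m.aligned_ngrams('words') else -1


def apply_lfs(lfs, candidates):
    """Return a (candidates x LFs) matrix with entries in {-1, 0, 1}."""
    return np.array([[lf(c) for lf in lfs] for c in candidates])


# Two illustrative candidates: one in a storage-temperature row, one not.
cands = [
    FakeCandidate(['to', '150'], ['storage', 'temperature']),
    FakeCandidate([], ['collector', 'current']),
]
L = apply_lfs([LF_to_range, LF_storage], cands)
```

Each row of `L` holds one candidate's votes, one column per labeling function.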

In [10]:
from snorkel.snorkel import TrainingSet
from snorkel.features import TableNgramFeaturizer

training_set = TrainingSet(training_candidates, LFs, featurizer=TableNgramFeaturizer())

Applying LFs...
Featurizing...
Building feature index...
Extracting features...
0/3491
LF Summary Statistics: 9 LFs applied to 74 candidates
------------------------------------------------------------
Coverage (candidates w/ > 0 labels):		100.00%
Overlap (candidates w/ > 1 labels):		100.00%
Conflict (candidates w/ conflicting labels):	62.16%


In [11]:
lf_stats = training_set.lf_stats()
lf_stats[:5]

| | conflicts | coverage | j | overlaps |
|---|---|---|---|---|
| LF_to_range | 0.135135 | 0.216216 | 0 | 0.216216 |
| LF_tilde_range | 0.067568 | 0.094595 | 1 | 0.094595 |
| LF_storage | 0.621622 | 1.0 | 2 | 1.0 |
| LF_tstg | 0.621622 | 1.0 | 3 | 1.0 |
| LF_tj | 0.621622 | 1.0 | 4 | 1.0 |


Now learn, baby, learn!

In [12]:
from snorkel.snorkel import Learner
from snorkel.learning import LogReg

learner = Learner(training_set, model=LogReg(bias_term=True))

In [13]:
# Split the gold set into a test set and a cross-validation (CV) set
n_half = len(gold_candidates) // 2
test_candidates = gold_candidates[:n_half]
test_labels     = gold_labels[:n_half]
cv_candidates   = gold_candidates[n_half:]
cv_labels       = gold_labels[n_half:]

In [14]:
from snorkel.learning_utils import GridSearch

gs       = GridSearch(learner, ['mu', 'lf_w0'], [[1e-5, 1e-7],[1.0,2.0]])
gs_stats = gs.fit(cv_candidates, cv_labels)

Testing mu = 1.00e-05, lf_w0 = 1.00e+00
Begin training for rate=0.01, mu=1e-05
	Learning epoch = 0	Gradient mag. = 0.039140
	Learning epoch = 250	Gradient mag. = 0.048421
	Learning epoch = 500	Gradient mag. = 0.058534
	Learning epoch = 750	Gradient mag. = 0.072298
Final gradient magnitude for rate=0.01, mu=1e-05: 0.091
Applying LFs...
Featurizing...
Testing mu = 1.00e-05, lf_w0 = 2.00e+00
Begin training for rate=0.01, mu=1e-05
	Learning epoch = 0	Gradient mag. = 0.063301
	Learning epoch = 250	Gradient mag. = 0.074607
	Learning epoch = 500	Gradient mag. = 0.084353
	Learning epoch = 750	Gradient mag. = 0.096808
Final gradient magnitude for rate=0.01, mu=1e-05: 0.110
Testing mu = 1.00e-07, lf_w0 = 1.00e+00
Begin training for rate=0.01, mu=1e-07
	Learning epoch = 0	Gradient mag. = 0.039140
	Learning epoch = 250	Gradient mag. = 0.048469
	Learning epoch = 500	Gradient mag. = 0.058646
	Learning epoch = 750	Gradient mag. = 0.072472
Final gradient magnitude for rate=0.01, mu=1e-07: 0.091
Testin

In [15]:
gs_stats

| | mu | lf_w0 | Prec. | Rec. | F1 |
|---|---|---|---|---|---|
| 0 | 1e-05 | 1.0 | 0.891304 | 0.97619 | 0.931818 |
| 1 | 1e-05 | 2.0 | 0.891304 | 0.97619 | 0.931818 |
| 2 | 1e-07 | 1.0 | 0.891304 | 0.97619 | 0.931818 |
| 3 | 1e-07 | 2.0 | 0.891304 | 0.97619 | 0.931818 |
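
The grid search tries every combination of the listed parameter values, in the same order as the log above. A minimal sketch of that loop follows. The `grid_search` helper, its `evaluate` callback, and the toy scoring function are all hypothetical and stand in for training the learner and scoring it on the CV candidates.

```python
import itertools


def grid_search(param_names, param_values, evaluate):
    """Evaluate every parameter combination; return (params, score) rows.

    `evaluate` is a hypothetical callable taking keyword parameters and
    returning a score, such as F1 on the validation candidates.
    """
    rows = []
    for combo in itertools.product(*param_values):
        params = dict(zip(param_names, combo))
        rows.append((params, evaluate(**params)))
    return rows


def toy_score(mu, lf_w0):
    """Made-up scoring function, for illustration only."""
    return 1.0 - mu - 0.1 * abs(lf_w0 - 1.0)


rows = grid_search(['mu', 'lf_w0'], [[1e-5, 1e-7], [1.0, 2.0]], toy_score)
best_params, best_score = max(rows, key=lambda row: row[1])
```

In the results above, all four settings give the same precision, recall, and F1, so this grid does not separate them on the CV set.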


In [16]:
learner.test(test_candidates, test_labels)

Applying LFs...
Featurizing...
Test set size:	46
----------------------------------------
Precision:	0.904761904762
Recall:		0.974358974359
F1 Score:	0.938271604938
----------------------------------------
TP: 38 | FP: 4 | TN: 3 | FN: 1
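
The reported precision, recall, and F1 follow from the confusion counts on the last line. A short worked check, using the standard definitions (`precision_recall_f1` is a helper name introduced here):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from confusion counts."""
    precision = tp / float(tp + fp)   # correct among predicted positives
    recall = tp / float(tp + fn)      # found among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the test output above: TP=38, FP=4, FN=1.
precision, recall, f1 = precision_recall_f1(tp=38, fp=4, fn=1)
```

This gives 38/42 for precision, 38/39 for recall, and 76/81 for F1, which agree with the printed values.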




In [17]:
learner.feature_stats(n_max=10)

| | j | w |
|---|---|---|
| TABLE_ROW_WORDS_temperature | 1855 | 0.088621 |
| TABLE_COL_WORDS_temperature | 1346 | 0.088621 |
| TABLE_COL_WORDS_tstg | 1116 | 0.077765 |
| TABLE_ROW_WORDS_tstg | 1841 | 0.077765 |
| TABLE_ROW_WORDS_storage_temperature | 1255 | 0.075992 |
| TABLE_COL_WORDS_storage_temperature | 3231 | 0.075992 |
| TABLE_COL_WORDS_junction_temperature | 2943 | 0.065262 |
| TABLE_ROW_WORDS_junction_temperature | 3200 | 0.065262 |
| TABLE_ROW_WORDS_junction | 1820 | 0.042801 |
| TABLE_COL_WORDS_junction | 144 | 0.042801 |
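
To give a feel for what these weights mean, the sketch below scores a candidate with a plain logistic model: the weights of its active features are summed and passed through a sigmoid. This assumes a standard logistic regression with binary features and, for simplicity, a zero bias. It is not a description of how the `Learner` combines features internally, and `score` is a hypothetical helper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(active_features, weights, bias=0.0):
    """Probability under a plain logistic model with binary features.

    Features missing from `weights` contribute nothing.
    """
    z = bias + sum(weights.get(f, 0.0) for f in active_features)
    return sigmoid(z)


# Two weights copied from the table above.
weights = {
    "TABLE_ROW_WORDS_temperature": 0.088621,
    "TABLE_ROW_WORDS_tstg": 0.077765,
}
p_both = score(["TABLE_ROW_WORDS_temperature", "TABLE_ROW_WORDS_tstg"], weights)
p_none = score([], weights)
```

Under these assumptions a candidate with no active features scores exactly 0.5, and each positively weighted feature nudges the probability upward.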


Tune in next time for relation extraction!