# Relation Extraction from Tables

This notebook demonstrates the full extraction and learning process for _relations_ in tables, using a data set of transistor spec sheets. It extracts (temperature label, min storage temperature) pairs. (Eventually, this tutorial will be updated to extract (part number, min storage temperature) pairs.)

In [None]:
%load_ext autoreload
%autoreload 2

### Extraction

Use this option to unpickle a previously parsed corpus (i.e., the corpus from the Entity Extraction notebook).

In [2]:
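import cPickle

try:
    # Pickles are binary data, so open the file in binary mode
    with open("data/hardware/hardware_corpus.pkl", "rb") as pkl:
        %time corpus = cPickle.load(pkl)
    print "Corpus has been loaded."
except Exception as err:
    # Report why loading failed instead of silently swallowing the error
    print "Corpus could not be loaded: %s" % err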

CPU times: user 36.2 s, sys: 1.79 s, total: 38 s
Wall time: 38.6 s
Corpus has been loaded.


Use this option to parse the corpus again, for example to use a different value for `max_docs`.

In [3]:
from snorkel.parser import CorpusParser
from snorkel.parser import HTMLParser
from snorkel.parser import TableParser

html_parser = HTMLParser(path='data/hardware/hardware_html/')
table_parser = TableParser()

cp = CorpusParser(html_parser, table_parser, max_docs=15)
%time corpus = cp.parse_corpus(name='Hardware Corpus')

print "Corpus has been parsed."

CPU times: user 8.29 s, sys: 439 ms, total: 8.73 s
Wall time: 14 s
Corpus has been parsed.


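We now set up an `EntityExtractor` for each component of the relation: part numbers and temperatures. Only the temperature extractor is run in the cell below; the part-number matcher is still a work in progress.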
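An extractor pairs a candidate space (here, n-grams of table text) with a matcher that decides which candidates to keep. As a rough illustration of that pairing for the temperature component, the sketch below enumerates n-grams of up to three tokens and keeps those that parse as a number within a range. It uses no Snorkel code: the helpers `ngrams` and `in_range` and the sample cell are hypothetical, and treating both bounds as inclusive is an assumption rather than documented `RangeMatcher` behavior.

```python
def ngrams(tokens, n_max=3):
    """Yield (start, end, text) for every contiguous run of 1..n_max tokens."""
    for start in range(len(tokens)):
        for end in range(start, min(start + n_max, len(tokens))):
            yield start, end, " ".join(tokens[start:end + 1])


def in_range(text, low, high):
    """Return True if `text` parses as a number with low <= value <= high."""
    try:
        value = float(text)
    except ValueError:
        return False
    return low <= value <= high


# Hypothetical table cell from a spec sheet: "-55 to 150"
cell = ["-55", "to", "150"]

# Keep only the n-grams accepted by the range check
matches = [(s, e, t) for s, e, t in ngrams(cell) if in_range(t, -70, -50)]
print(matches)  # [(0, 0, '-55')]
```

Multi-word n-grams such as `"-55 to"` fail the numeric parse and are dropped, so only the single token `-55` survives.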

In [6]:
import re

from snorkel.candidates import TableNgrams, EntityExtractor
from snorkel.matchers import RegexMatchEach, RangeMatcher

# Candidate space: n-grams of up to three words
table_ngrams = TableNgrams(n_max=3)

# Extractor 1: Part numbers (work in progress; not yet wired into an EntityExtractor)
eeca_matcher = RegexMatchEach(rgx=r'(\b[b]{1}[abcdefklnpqruyz]{1}[\swxyz]?[0-9]{3,4}[\s]?[A-Z]{0,2}[0-9]?([-][A-Z0-9]{1,3})?\b)', attrib='words', ignore_case=True)
# TODO: combine the part-number patterns below into a single Union matcher

# Part-number patterns, one per naming scheme
eeca = re.compile(ur'(\b[b]{1}[abcdefklnpqruyz]{1}[\swxyz]?[0-9]{3,4}[\s]?[A-Z]{0,2}[0-9]?([-][A-Z0-9]{1,3})?\b)', re.IGNORECASE)
eeca_base = re.compile(ur'\b[b]{1}[abcdefklnpqruyz]{1}[\swxyz]?[0-9]{3,4}', re.IGNORECASE)
eeca_suffix = re.compile(ur'(?:[b]{1}[abcdefklnpqruyz]{1}[0-9]{3,4})([A-Z]{0,2}[0-9]?([-][A-Z0-9]{1,3})?)', re.IGNORECASE)
eeca_common = re.compile(ur'(?:^|\s)(A|B|C|-16|-25|-40)(?:\s|$)')
jedec = re.compile(ur'([123]N\d{3,4}[A-Z]?\b)', re.IGNORECASE)
jis = re.compile(ur'(2S[abcdefghjkmqrstvz]{1}[\d]{2,4})', re.IGNORECASE)
others = re.compile(ur'((NSVBC|SMBT|MJ|MJE|MPS|MRF|RCA|TIP|ZTX|ZT|TIS|TIPL|DTC|MMBT|PZT){1}[\d]{2,4}[A-Z]{0,3}([-][A-Z0-9]{0,3})?\b)', re.IGNORECASE)

# Extractor 2: Temperatures (min storage temperature)
range_matcher = RangeMatcher(low=-70,high=-50)
temp_extractor = EntityExtractor(table_ngrams, range_matcher)

%time candidates = temp_extractor.extract(corpus.get_tables(), name='all')
for cand in candidates[:10]: 
    print cand
print "%s candidates extracted" % len(candidates)

CPU times: user 2.56 s, sys: 48.1 ms, total: 2.61 s
Wall time: 2.6 s
Ngram("-55", context=None, chars=[297,299], words=[0,0])
Ngram("-50", context=None, chars=[414,416], words=[0,0])
Ngram("-50", context=None, chars=[494,496], words=[0,0])
Ngram("-50", context=None, chars=[683,685], words=[2,2])
Ngram("-50", context=None, chars=[854,856], words=[2,2])
Ngram("-50", context=None, chars=[991,993], words=[2,2])
Ngram("-55", context=None, chars=[294,296], words=[0,0])
Ngram("-55", context=None, chars=[292,294], words=[0,0])
Ngram("-55", context=None, chars=[355,357], words=[0,0])
Ngram("-50", context=None, chars=[410,412], words=[0,0])
14 candidates extracted
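
The part-number patterns in the cell above are compiled but never applied to the corpus. As a quick check of what they accept, the sketch below runs four of them against a handful of sample strings using only the `re` module. The helper `classify_part` and the sample part numbers are assumptions made for illustration. The patterns are copied from the cell, with the Python 2 `ur` prefix replaced by `r` so the snippet also runs on Python 3.

```python
import re

# (scheme name, compiled pattern) pairs, checked in order
PATTERNS = [
    ("eeca", re.compile(r'(\b[b]{1}[abcdefklnpqruyz]{1}[\swxyz]?[0-9]{3,4}[\s]?[A-Z]{0,2}[0-9]?([-][A-Z0-9]{1,3})?\b)', re.IGNORECASE)),
    ("jedec", re.compile(r'([123]N\d{3,4}[A-Z]?\b)', re.IGNORECASE)),
    ("jis", re.compile(r'(2S[abcdefghjkmqrstvz]{1}[\d]{2,4})', re.IGNORECASE)),
    ("others", re.compile(r'((NSVBC|SMBT|MJ|MJE|MPS|MRF|RCA|TIP|ZTX|ZT|TIS|TIPL|DTC|MMBT|PZT){1}[\d]{2,4}[A-Z]{0,3}([-][A-Z0-9]{0,3})?\b)', re.IGNORECASE)),
]


def classify_part(text):
    """Return the name of the first scheme whose pattern is found in `text`, else None."""
    for name, pattern in PATTERNS:
        if pattern.search(text):
            return name
    return None


# Sample strings chosen for illustration; the last one is a temperature, not a part number
for sample in ["BC547B", "2N3904", "2SC1815", "TIP31C", "-55"]:
    print("%-8s -> %s" % (sample, classify_part(sample)))
```

Checking the schemes in a fixed order keeps the result deterministic if a string happens to satisfy more than one pattern.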
