# Tables in Snorkel: Extracting Attributes from Spec Sheets

## Part II: `Candidate` Extraction

In [1]:
%load_ext autoreload
%autoreload 2

from snorkel import SnorkelSession
session = SnorkelSession()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Loading the Training `Corpus`

First, we will load the Training `Corpus` that we preprocessed in Part I:

In [2]:
from snorkel.models import Corpus
from snorkel.utils import get_ORM_instance

corpus = get_ORM_instance(Corpus, session, 'Hardware Training')
print "%s contains %d Documents" % (corpus, len(corpus))

Corpus (Hardware Training) contains 8 Documents


## Defining a `Candidate` Schema

We define each candidate as a binary relation between a part number and a temperature value.

In [3]:
from snorkel.models import candidate_subclass

Part_Temp = candidate_subclass('Part_Temp', ['part','temp'])

## Defining `ContextSpaces`

Because of the particular needs of this dataset, we use custom context space objects for both parts and temperatures.

In the `OmniNgramsTemp` object, any dash-like character (minus sign, em dash, en dash, etc.) is converted to a plain hyphen, and spaces between a dash and the number that follows it are removed. This makes the numbers easier to interpret downstream. We expect every temperature to span at most two tokens, so we set `n_max=2`.

In [17]:
from hardware_utils import OmniNgramsTemp

temp_ngrams = OmniNgramsTemp(n_max=2)
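
The dash normalization described above can be sketched with two regular-expression substitutions. This is only an illustration under stated assumptions: `normalize_dashes` and `DASH_LIKE` are hypothetical names, the list of dash-like code points is a guess at what "dash-like" covers, and the real `OmniNgramsTemp` in `hardware_utils` may work differently.

```python
import re

# Hypothetical helper, not the actual OmniNgramsTemp code.
# Assumed dash-like characters: hyphen, non-breaking hyphen, figure dash,
# en dash, em dash, and the minus sign.
DASH_LIKE = u"[\u2010\u2011\u2012\u2013\u2014\u2212]"


def normalize_dashes(text):
    # Map every dash-like character to an ASCII hyphen.
    text = re.sub(DASH_LIKE, u"-", text)
    # Drop whitespace between a hyphen and the digit that follows it.
    return re.sub(r"-\s+(?=\d)", u"-", text)
```

With this sketch, a cell containing a minus sign, a space, and `55` normalizes to `-55`, which a simple temperature regex can then match.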

The `OmniNgramsPart` object is more complex. It "expands" part numbers in three ways:

1. Part number ranges are split into individual parts
    * e.g., BC548-BC550 -> BC548, BC549, BC550
2. Part number suffix groups are enumerated
    * e.g., BC548A/B/C -> BC548A, BC548B, BC548C
3. Base part numbers are expanded to include more specific variants
    * e.g., (if BC548A/B appears elsewhere in the document), BC548 -> BC548, BC548A, BC548B

The third expansion mode requires prior knowledge about the other part numbers that occur in the document. This information can be obtained from a preprocessing pass of the data sheets, or with a gold dictionary, which we use here.

These expansion operations allow for part numbers that don't occur verbatim anywhere in the document to still be discovered and matched using the same `Matcher` objects as simple part number mentions. Wherever this expansion occurs, an `ImplicitSpan` object is created instead of a regular `Span` object.
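
The three expansion modes can be sketched as small standalone functions. This is a simplified illustration, not the `OmniNgramsPart` implementation: the function names are hypothetical, and the sketch assumes part numbers made of an uppercase letter prefix followed by digits, with single-letter suffixes.

```python
import re


def expand_range(text):
    """Mode 1: BC548-BC550 -> BC548, BC549, BC550 (prefixes must agree)."""
    m = re.match(r"^([A-Z]+)(\d+)-([A-Z]+)(\d+)$", text)
    if not m or m.group(1) != m.group(3):
        return [text]
    start, end = int(m.group(2)), int(m.group(4))
    if end < start:
        return [text]
    width = len(m.group(2))  # keep any leading zeros
    return [m.group(1) + str(n).zfill(width) for n in range(start, end + 1)]


def expand_suffixes(text):
    """Mode 2: BC548A/B/C -> BC548A, BC548B, BC548C."""
    m = re.match(r"^([A-Z]+\d+)([A-Z](?:/[A-Z])+)$", text)
    if not m:
        return [text]
    return [m.group(1) + suffix for suffix in m.group(2).split("/")]


def expand_base(text, known_parts):
    """Mode 3: add known parts that extend the base with a letter suffix."""
    variants = sorted(p for p in known_parts
                      if p.startswith(text) and p[len(text):].isalpha())
    return [text] + variants
```

In this sketch `known_parts` plays the role of the per-document part set: `expand_base` can only produce variants that are already known for the document, which is why the third mode needs a preprocessing pass or a gold dictionary.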

In [7]:
import os
from hardware_utils import get_gold_dict
from collections import defaultdict

gold_file = os.environ['SNORKELHOME'] + '/tutorials/tables/data/hardware/hardware_gold.csv'
gold_parts = get_gold_dict(gold_file, doc_on=True, part_on=True, val_on=False)
parts_by_doc = defaultdict(set)
for part in gold_parts:
    parts_by_doc[part[0]].add(part[1])

We see, for example, that the document 'VISHS23888-1' has fourteen part numbers mentioned in it.

In [18]:
parts_by_doc.items()[3]

('VISHS23888-1',
 {'BC546',
  'BC546A',
  'BC546B',
  'BC547',
  'BC547A',
  'BC547B',
  'BC547C',
  'BC548',
  'BC548A',
  'BC548B',
  'BC548C',
  'BC549',
  'BC549B',
  'BC549C'})

In [20]:
from hardware_utils import OmniNgramsPart

part_ngrams = OmniNgramsPart(parts_by_doc=parts_by_doc, n_max=3)

## Defining `Matchers`

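We also define matchers, which filter the `Span` objects produced by the context spaces so that only plausible part numbers and temperatures remain.

The `Matcher` for parts accepts any span that matches one of several regexes encoding standard transistor naming conventions (EECA, JEDEC, JIS, and a set of other common prefixes).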

In [21]:
from snorkel.matchers import RegexMatchSpan, Union

eeca_matcher = RegexMatchSpan(rgx=r'([b]{1}[abcdefklnpqruyz]{1}[\swxyz]?[0-9]{3,5}[\s]?[A-Z\/]{0,5}[0-9]?[A-Z]?([-][A-Z0-9]{1,7})?([-][A-Z0-9]{1,2})?)')
jedec_matcher = RegexMatchSpan(rgx=r'([123]N\d{3,4}[A-Z]{0,5}[0-9]?[A-Z]?)')
jis_matcher = RegexMatchSpan(rgx=r'(2S[abcdefghjkmqrstvz]{1}[\d]{2,4})')
others_matcher = RegexMatchSpan(rgx=r'((NSVBC|SMBT|MJ|MJE|MPS|MRF|RCA|TIP|ZTX|ZT|TIS|TIPL|DTC|MMBT|PZT){1}[\d]{2,4}[A-Z]{0,3}([-][A-Z0-9]{0,6})?([-][A-Z0-9]{0,1})?)')
parts_matcher = Union(eeca_matcher, jedec_matcher, jis_matcher, others_matcher)
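
To get a feel for what these patterns accept, they can be tried directly with Python's `re` module. The sketch below copies three of the patterns from the cell above. It assumes that `RegexMatchSpan` matches case-insensitively and requires the whole span to match, which is why the lowercase character classes still accept uppercase part numbers. `full_match` and `matches_any` are hypothetical helpers standing in for a single matcher and for `Union`.

```python
import re

EECA = r'([b]{1}[abcdefklnpqruyz]{1}[\swxyz]?[0-9]{3,5}[\s]?[A-Z\/]{0,5}[0-9]?[A-Z]?([-][A-Z0-9]{1,7})?([-][A-Z0-9]{1,2})?)'
JEDEC = r'([123]N\d{3,4}[A-Z]{0,5}[0-9]?[A-Z]?)'
JIS = r'(2S[abcdefghjkmqrstvz]{1}[\d]{2,4})'


def full_match(pattern, text):
    # Assumption: case-insensitive and anchored at both ends of the span.
    return re.match(pattern + r'$', text, flags=re.I) is not None


def matches_any(patterns, text):
    # Same idea as Union: accept the span if any member pattern accepts it.
    return any(full_match(p, text) for p in patterns)


for span in ["BC548A", "2N3906", "2SC1815", "2N3906 transistor", "LM317"]:
    print("%s %s" % (span, matches_any([EECA, JEDEC, JIS], span)))
```

Under these assumptions the first three spans are accepted and the last two are rejected: `2N3906 transistor` fails because of the trailing text, and `LM317` follows none of the three conventions.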

The `Matcher` for temperatures looks for temperatures in the set {-50, -55, -60, -65, -70, -75}, which we know from experience covers nearly all transistors in our corpus.

In [22]:
from snorkel.matchers import RegexMatchSpan

temp_matcher = RegexMatchSpan(rgx=r'-[5-7][05]', longest_match_only=False)
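
Enumerating the multiples of five makes the set accepted by this pattern explicit. The check below uses only the standard `re` module and assumes the matcher requires the whole span to match. `is_temp` is a hypothetical helper.

```python
import re

TEMP_RGX = r'-[5-7][05]'


def is_temp(text):
    # Assumption: the whole span must match, so anchor the pattern at the end.
    return re.match(TEMP_RGX + r'$', text) is not None


# Try every multiple of five from -100 to 0 and keep the accepted ones.
accepted = [t for t in range(-100, 1, 5) if is_temp(str(t))]
print(accepted)  # [-75, -70, -65, -60, -55, -50]
```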

## Running the `CandidateExtractor`

We combine these `ContextSpaces` and `Matchers` to form a `CandidateExtractor`, which we can now apply to our `Corpus`.

In [23]:
from snorkel.candidates import CandidateExtractor

ce = CandidateExtractor(Part_Temp, [part_ngrams, temp_ngrams], [parts_matcher, temp_matcher])
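
Conceptually, a binary candidate extractor pairs every span accepted by the first matcher with every span accepted by the second. The sketch below illustrates only that cross-product idea. It is not Snorkel's implementation, `make_candidates` is a hypothetical name, and the scope within which spans are paired (sentence, table, or whole document) is left out.

```python
from itertools import product


def make_candidates(part_spans, temp_spans):
    # Every (part, temp) combination becomes one candidate.
    return list(product(part_spans, temp_spans))


pairs = make_candidates(["BC548", "BC549"], ["-55", "-65", "-50"])
print(len(pairs))  # 6
```

Because the count grows multiplicatively, even a small corpus can yield thousands of candidates once part numbers have been expanded.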

In [24]:
%time train = ce.extract(corpus.documents, 'Hardware Training Candidates', session)
print "%s contains %d Candidates" % (train, len(train))


CPU times: user 11.5 s, sys: 145 ms, total: 11.6 s
Wall time: 11.8 s
Candidate Set (Hardware Training Candidates) contains 6571 Candidates


In [26]:
print train[0]

Part_Temp(ImplicitSpan("2N3906", parent=653, words=[0,0], position=[0]), ImplicitSpan("-50", parent=11554, words=[0,0], position=[0]))


### Saving the extracted candidates

In [27]:
session.add(train)
session.commit()

### Reloading the candidates

In [28]:
from snorkel.models import CandidateSet
from snorkel.utils import get_ORM_instance

train = get_ORM_instance(CandidateSet, session, 'Hardware Training Candidates')
print "%s contains %d Candidates" % (train, len(train))

Candidate Set (Hardware Training Candidates) contains 6571 Candidates


### Repeating for the Development `Corpus`

In [29]:
for corpus_name in ['Hardware Development']:
    corpus = get_ORM_instance(Corpus, session, corpus_name)
    print "Extracting Candidates from %s" % corpus
    %time candidates = ce.extract(corpus.documents, corpus_name + ' Candidates', session)
    session.add(candidates)
    print "%s contains %d Candidates" % (candidates, len(candidates))
session.commit()

Extracting Candidates from Corpus (Hardware Development)

CPU times: user 1.23 s, sys: 24.8 ms, total: 1.26 s
Wall time: 1.29 s
Candidate Set (Hardware Development Candidates) contains 57 Candidates


Next, in Part III, we will load `Labels` for each of our `Candidates` so that we can evaluate performance.