# Tables in Snorkel: Extracting Attributes from Spec Sheets

## Part V: Training a Model with Data Programming

In [1]:
%load_ext autoreload
%autoreload 2

from snorkel import SnorkelSession
session = SnorkelSession()



In [2]:
from snorkel.models import candidate_subclass

Part_Temp = candidate_subclass('Part_Temp', ['part','temp'])

### Loading the `CandidateSet`, `Feature` matrix, and `Label` matrix

We now reload the `CandidateSet`, `Feature` matrix, and `Label` matrix created in the previous notebooks.

In [3]:
from snorkel.models import CandidateSet

train = session.query(CandidateSet).filter(
    CandidateSet.name == 'Hardware Training Candidates').one()

In [5]:
from snorkel.annotations import FeatureManager

feature_manager = FeatureManager()
%time F_train = feature_manager.load(session, train, 'Training Features')

CPU times: user 12.5 s, sys: 457 ms, total: 13 s
Wall time: 13 s


In [6]:
from snorkel.annotations import LabelManager

label_manager = LabelManager()
%time L_train = label_manager.load(session, train, 'LF Labels')

CPU times: user 339 ms, sys: 49.5 ms, total: 388 ms
Wall time: 404 ms


## Train Generative Model

We train the generative model on the `Label` matrix of the training `CandidateSet`.

In [7]:
from snorkel.learning import NaiveBayes

gen_model = NaiveBayes()
gen_model.train(L_train, n_iter=5000, rate=1e-3)

Training marginals (!= 0.5):	6571
Features:			10
Begin training for rate=0.001, mu=1e-06
	Learning epoch = 0	Gradient mag. = 0.354352
	Learning epoch = 250	Gradient mag. = 0.375214
	Learning epoch = 500	Gradient mag. = 0.379046
	Learning epoch = 750	Gradient mag. = 0.384424
	Learning epoch = 1000	Gradient mag. = 0.391265
	Learning epoch = 1250	Gradient mag. = 0.399481
	Learning epoch = 1500	Gradient mag. = 0.408980
	Learning epoch = 1750	Gradient mag. = 0.419665
	Learning epoch = 2000	Gradient mag. = 0.431442
	Learning epoch = 2250	Gradient mag. = 0.444216
	Learning epoch = 2500	Gradient mag. = 0.457898
	Learning epoch = 2750	Gradient mag. = 0.472403
	Learning epoch = 3000	Gradient mag. = 0.487653
	Learning epoch = 3250	Gradient mag. = 0.503577
	Learning epoch = 3500	Gradient mag. = 0.520112
	Learning epoch = 3750	Gradient mag. = 0.537200
	Learning epoch = 4000	Gradient mag. = 0.554792
	Learning epoch = 4250	Gradient mag. = 0.572844
	Learning epoch = 4500	Gradient mag. = 0.591319
	Lear

In [8]:
gen_model.save(session, 'Generative Params')

In [9]:
train_marginals = gen_model.marginals(L_train)
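
To make the idea of marginals concrete, here is a minimal sketch of how a Naive Bayes-style generative model can turn a label matrix into marginal probabilities. It assumes each labeling function votes in {-1, 0, +1} (0 meaning abstain) and has a single learned log-odds weight; the names `naive_bayes_marginals`, `L_toy`, and `w_toy` are hypothetical, and this is not necessarily how `snorkel.learning.NaiveBayes` is implemented.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def naive_bayes_marginals(L, w):
    """Hypothetical helper: combine LF votes into P(y = +1) per candidate.

    L: (n_candidates, n_lfs) array with entries in {-1, 0, +1}.
    w: (n_lfs,) array of per-LF log-odds weights (assumed already learned).
    """
    return sigmoid(L.dot(w))

# Toy label matrix: 4 candidates, 3 labeling functions (illustrative values).
L_toy = np.array([
    [ 1,  1,  0],   # two positive votes
    [-1,  0, -1],   # two negative votes
    [ 0,  0,  0],   # every LF abstains
    [ 1, -1,  0],   # equally weighted LFs disagree
])
w_toy = np.array([1.0, 1.0, 0.5])

marginals = naive_bayes_marginals(L_toy, w_toy)
```

Under these assumptions, a candidate on which every labeling function abstains, or on which equally weighted functions cancel out, receives a marginal of exactly 0.5. The `Training marginals (!= 0.5)` line in the training logs suggests that such uninformative candidates are not counted toward training.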

## Train Discriminative Model

We now train a discriminative model using the `Feature` matrix generated earlier and the marginal probabilities produced by the generative model.

In [10]:
from snorkel.learning import LogReg

disc_model = LogReg()
disc_model.train(F_train, train_marginals, n_iter=2000, rate=1e-3)

Training marginals (!= 0.5):	6571
Features:			3820
Using gradient descent...
	Learning epoch = 0	Step size = 0.001
	Loss = 4554.670123	Gradient magnitude = 12692.862502
	Learning epoch = 100	Step size = 0.000904792147114
	Loss = 1899.505850	Gradient magnitude = 988.946242
	Learning epoch = 200	Step size = 0.000818648829479
	Loss = 1869.787360	Gradient magnitude = 5214.426285
	Learning epoch = 300	Step size = 0.000740707032156
	Loss = 1645.137887	Gradient magnitude = 905.699390
	Learning epoch = 400	Step size = 0.000670185906007
	Loss = 1518.267465	Gradient magnitude = 835.951013
	Learning epoch = 500	Step size = 0.000606378944861
	Loss = 1087.352665	Gradient magnitude = 1532.727777
	Learning epoch = 600	Step size = 0.000548646907485
	Loss = 1172.138936	Gradient magnitude = 983.269116
	Learning epoch = 700	Step size = 0.000496411413431
	Loss = 902.762534	Gradient magnitude = 1262.552477
	Learning epoch = 800	Step size = 0.00044914914861
	Loss = 912.473345	Gradient magnitude = 768.188099
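
The step sizes in the log above are consistent with a multiplicative decay of 0.999 per epoch applied to the initial `rate`. The sketch below checks that reading against three logged values; the decay factor is inferred from the log rather than taken from the Snorkel source, and `decayed_step` is a hypothetical helper.

```python
def decayed_step(rate, epoch, decay=0.999):
    """Hypothetical schedule: shrink the step size by `decay` every epoch."""
    return rate * decay ** epoch

# Step sizes copied from the training log above (epoch -> logged step size).
logged = {
    100: 0.000904792147114,
    200: 0.000818648829479,
    800: 0.00044914914861,
}

# Step sizes reproduced under the assumed schedule with rate=1e-3.
reproduced = {epoch: decayed_step(1e-3, epoch) for epoch in logged}

# Largest relative gap between the logged and reproduced step sizes.
max_rel_gap = max(
    abs(reproduced[e] - logged[e]) / logged[e] for e in logged
)
```

If `max_rel_gap` is negligible, the assumed schedule explains the log; a decaying step size of this kind is a common way to damp the loss oscillations visible between epochs.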

In [11]:
disc_model.w.shape

(3820,)
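
The discriminative model is trained on probabilistic labels rather than hard ones. One common way to do this is to minimize the expected logistic loss under the marginals. The sketch below shows that formulation on toy data; it is an assumption for illustration and may differ from the exact objective (for example, regularization terms) used by `snorkel.learning.LogReg`. The names `noise_aware_log_loss`, `X_toy`, and `p_toy` are hypothetical.

```python
import numpy as np

def noise_aware_log_loss(X, w, marginals):
    """Expected logistic loss when labels are probabilities, not hard values.

    X: (n, d) feature matrix, w: (d,) weights, marginals: (n,) P(y = +1).
    Each candidate contributes p * loss(y=+1) + (1 - p) * loss(y=-1).
    """
    z = X.dot(w)
    # np.logaddexp(0, t) computes log(1 + exp(t)) in a numerically stable way.
    loss_if_pos = np.logaddexp(0, -z)
    loss_if_neg = np.logaddexp(0, z)
    return np.sum(marginals * loss_if_pos + (1 - marginals) * loss_if_neg)

# Toy data: 3 candidates, 2 features (illustrative values only).
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
p_toy = np.array([0.9, 0.1, 0.5])

# With all-zero weights every prediction is 0.5, so the loss is n * log(2).
loss_at_zero = noise_aware_log_loss(X_toy, np.zeros(2), p_toy)

# Weights that agree with the confident marginals give a lower loss.
loss_fitted = noise_aware_log_loss(X_toy, np.array([2.0, -2.0]), p_toy)
```

With hard labels (marginals of exactly 0 or 1) this reduces to the usual logistic loss, which is why the same training loop can consume either kind of label.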

In [12]:
%time disc_model.save(session, "Discriminative Params")

CPU times: user 710 ms, sys: 18.5 ms, total: 728 ms
Wall time: 736 ms


## Assess Performance on Development Set

In [13]:
from snorkel.models import CandidateSet
dev = session.query(CandidateSet).filter(
    CandidateSet.name == 'Hardware Development Candidates').one()

In [15]:
from snorkel.annotations import FeatureManager

feature_manager = FeatureManager()
%time F_dev = feature_manager.load(session, dev, 'Training Features')

CPU times: user 5.81 s, sys: 250 ms, total: 6.06 s
Wall time: 6.11 s


In [16]:
L_dev = label_manager.load(session, dev, "Hardware Development Labels -- Gold")

In [17]:
gold_dev_set = session.query(CandidateSet).filter(
    CandidateSet.name == 'Hardware Development Candidates -- Gold').one()

In [19]:
tp, fp, tn, fn = disc_model.score(F_dev, L_dev, gold_dev_set)

Calibration plot:
Recall-corrected Noise-aware Model
Pos. class accuracy: 1.0
Neg. class accuracy: nan
Corpus Precision 1.0
Corpus Recall    1.0
Corpus F1        1.0
----------------------------------------
TP: 57 | FP: 0 | TN: 0 | FN: 0
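
The scores above follow directly from the four counts. The sketch below recomputes them using the standard definitions of precision, recall, and F1, and assumes that "Neg. class accuracy" means TN / (TN + FP); `binary_metrics` is a hypothetical helper, not part of Snorkel.

```python
def binary_metrics(tp, fp, tn, fn):
    """Recompute summary scores from raw counts (hypothetical helper)."""
    nan = float('nan')

    def ratio(num, den):
        # An empty denominator means the score is undefined, reported as nan.
        return float(num) / den if den else nan

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also the positive-class accuracy
    f1 = ratio(2 * precision * recall, precision + recall)
    neg_accuracy = ratio(tn, tn + fp)    # assumed meaning of "Neg. class accuracy"
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'neg_accuracy': neg_accuracy,
    }

# Counts taken from the score output above.
metrics = binary_metrics(tp=57, fp=0, tn=0, fn=0)
```

Because TN and FP are both zero, the negative-class accuracy is undefined, which matches the `nan` in the output. It also means this development set contains no gold-negative candidates that were scored, so the perfect precision here says little about how the model handles negatives.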

We can now perform error analysis on any `Candidates` that were misclassified. In a text-only environment, we could use the `Viewer` for this task (see, for example, the Intro tutorial). Because we do not yet have a viewer compatible with HTML tables, we instead use helper functions that print the relevant details.

In [23]:
from hardware_utils import part_error_analysis

if fp:
    part_error_analysis(list(fp)[0])

The results above are reported at the `Candidate`, or _mention_, level. For many applications (including this one), what we really care about is performance at the _entity_ level. For example, correctly classifying all five (BC548, -55) `Candidates` from a document should count as only one true positive entity, not five. The function below performs this correction.
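
The correction amounts to collapsing repeated mentions into unique entities before counting. Here is a minimal sketch of that idea, assuming each mention can be reduced to a hypothetical `(document, part, temp)` tuple; the document name and the second part are made up for illustration, and this is not the implementation of `entity_level_f1`.

```python
def to_entities(mentions):
    """Collapse mention-level tuples into a set of unique entities."""
    return set(mentions)

# Five mentions of the same (part, temp) pair in one document, plus one other.
# 'doc1' and 'BC547' are hypothetical values used only for this example.
predicted_mentions = [('doc1', 'BC548', '-55')] * 5 + [('doc1', 'BC547', '-65')]
gold_entities = {('doc1', 'BC548', '-55'), ('doc1', 'BC547', '-65')}

predicted_entities = to_entities(predicted_mentions)

# Entity-level counts come from set operations rather than per-mention tallies.
entity_tp = predicted_entities & gold_entities
entity_fp = predicted_entities - gold_entities
entity_fn = gold_entities - predicted_entities
```

Six correct mentions collapse to two true positive entities here, which is the same kind of reduction that turns many mention-level true positives into a much smaller entity-level count.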

In [24]:
from snorkel.models import Corpus
from hardware_utils import entity_level_f1
import os

gold_file = os.path.join(os.environ['SNORKELHOME'],
                         'tutorials/tables/data/hardware/hardware_gold.csv')
corpus = session.query(Corpus).filter(Corpus.name == 'Hardware Development').one()
(TP, FP, FN) = entity_level_f1(tp, fp, tn, fn, gold_file, corpus, 'stg_temp_min')

Scoring on Entity-Level Gold Data
Corpus Precision 1.0
Corpus Recall    1.0
Corpus F1        1.0
----------------------------------------
TP: 4 | FP: 0 | FN: 0



In [27]:
if FP:
    print(FP[0])

Using what we've learned, we can then iterate on our labeling functions (LFs) or learning parameters before assessing final system performance on a Test set of `Candidates`.

The End.