Snorkel's main aim is to make it easier to create training data, because labelling large amounts of data by hand is slow and labour-intensive. Snorkel uses labelling functions to assign data points to categories. These labelling functions encode noisy, programmatic rules and heuristics that assign labels to unlabelled training data.

In [2]:
# Import all the necessary packages
from utils import extract_plain_text, load_data, ABSTAIN, BENIGN, MALICIOUS, ADULT
from url_labeling_functions import lf_educational_content, lf_sexual_innuendos, lf_malicious_keywords
from plain_text_labeling_functions import lf_plain_text_adult_content
import snorkel
import json
import pandas as pd
import numpy as np



As an example, we load the training and validation data for spam detection here.

In [3]:
file_train = '../data/Hackathon_data/train/D1_train.jsonl'
file_val = '../data/Hackathon_data/validation/D1_validation.jsonl'
val_truth = '../data/Hackathon_data/validation/D1_validation-truth.jsonl'


In [4]:
#loading train and validation data
df_train = load_data(file_train)
df_val = load_data(file_val)


19783it [00:45, 433.42it/s]
2937it [00:05, 544.62it/s]


By default, Snorkel uses -1 for the ABSTAIN category and 0 or 1 for binary labels. Because we have three labels, we use 0, 1 and 2.

We replace the string labels with numerical values.

In [5]:
df_val_t = load_data(val_truth)
df_val_t = df_val_t.replace({'Benign': 0, 'Malicious': 1, 'Adult': 2})

# Keep only the ground-truth rows whose uid is present in the validation data
list_of_values = df_val['uid'].to_list()
df_val_t = df_val_t[df_val_t['uid'].isin(list_of_values)]
val_true = np.array(df_val_t['label'])

2937it [00:00, 109863.56it/s]
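The filtering step above keeps the ground-truth rows whose uid appears in the validation data, but it does not by itself guarantee that the labels come out in the same order as the rows of the validation set. A minimal sketch of aligning labels by uid, in plain Python with hypothetical rows (the helper name `align_truth` and the toy uids are assumptions made for illustration, not part of the notebook):

```python
# Mapping from string labels to the integer classes used in this notebook
LABEL_TO_INT = {"Benign": 0, "Malicious": 1, "Adult": 2}


def align_truth(data_uids, truth_rows):
    """Return integer labels ordered like data_uids.

    Raises KeyError if a uid in the data has no ground-truth row.
    """
    label_by_uid = {row["uid"]: LABEL_TO_INT[row["label"]] for row in truth_rows}
    return [label_by_uid[uid] for uid in data_uids]


# Hypothetical toy data: the truth file has an extra row and a different order
toy_data_uids = ["b", "a"]
toy_truth_rows = [
    {"uid": "a", "label": "Benign"},
    {"uid": "b", "label": "Adult"},
    {"uid": "c", "label": "Malicious"},
]

print(align_truth(toy_data_uids, toy_truth_rows))
```

Looking labels up by uid makes the order of the result depend only on the data, so each label lines up with the matching row of the label matrix.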


Labeling functions in Snorkel are created with the @labeling_function decorator. In this case, the labelling functions use keywords that are present in the URLs and indicate whether the content is adult, malicious or benign. The functions can search different parts of the data, such as the URL or the HTML content.

The lf_plain_text_adult_content function, for instance, searches the HTML content instead of the URL.
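To make the idea of a keyword-based labelling function concrete, here is a minimal sketch in plain Python. The keyword list, the `url` field name and the function name are assumptions made for illustration; the actual functions imported in this notebook live in url_labeling_functions and may work differently.

```python
# Integer classes as used in this notebook; -1 means "no vote"
ABSTAIN, BENIGN, MALICIOUS, ADULT = -1, 0, 1, 2

# Hypothetical keyword list, for illustration only
MALICIOUS_KEYWORDS = ("phishing", "malware", "keylogger")


def lf_malicious_keywords_sketch(x):
    """Vote MALICIOUS if the URL contains a suspicious keyword, else abstain."""
    url = x["url"].lower()  # assumes each data point exposes a "url" field
    if any(keyword in url for keyword in MALICIOUS_KEYWORDS):
        return MALICIOUS
    return ABSTAIN


# With Snorkel installed, the same function would be registered by decorating it:
#   from snorkel.labeling import labeling_function
#   @labeling_function()
#   def lf_malicious_keywords_sketch(x): ...

print(lf_malicious_keywords_sketch({"url": "http://example.com/free-malware"}))
print(lf_malicious_keywords_sketch({"url": "http://example.com/news"}))
```

A labelling function only votes when its heuristic fires and abstains otherwise, which is why individual functions typically cover a small part of the dataset.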

To apply one or more labelling functions that we’ve written to a collection of data points, we use an LFApplier. Because our data points are represented with a Pandas DataFrame, we use the PandasLFApplier. Correspondingly, a single data point x that’s passed into our LFs will be a Pandas Series object.

In [6]:
from snorkel.labeling import PandasLFApplier

lfs = [lf_educational_content, lf_sexual_innuendos, lf_malicious_keywords, lf_plain_text_adult_content]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_val = applier.apply(df=df_val)


100%|██████████| 19783/19783 [00:00<00:00, 31444.85it/s]
100%|██████████| 2937/2937 [00:00<00:00, 22892.56it/s]


Here we have summary statistics for our labelling functions on the training set. Polarity is the set of labels a function emits, not counting abstentions (for example, [1] for a function that only ever votes Malicious). Coverage is the fraction of the dataset that the function labels. Overlaps is the fraction of the dataset where the function and at least one other function both assign a label. Conflicts is the fraction of the dataset where the function and at least one other function assign different labels.

In [7]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()


                             j  Polarity  Coverage  Overlaps  Conflicts
lf_educational_content       0  [0]       0.006723  0.002275  0.002275
lf_sexual_innuendos          1  [2]       0.012536  0.012031  0.000051
lf_malicious_keywords        2  [1]       0.001415  0.000152  0.000152
lf_plain_text_adult_content  3  [2]       0.303493  0.014406  0.002426
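The summary columns can be reproduced by hand on a small label matrix. The sketch below is plain Python on a made-up matrix and is not Snorkel's implementation; the helper name `lf_stats` and the toy values are assumptions for illustration.

```python
ABSTAIN = -1

# Toy label matrix: one row per data point, one column per labelling function
L_toy = [
    [0, -1, -1],
    [0, 0, -1],
    [-1, 2, 1],
    [-1, -1, -1],
]


def lf_stats(L, j):
    """Compute polarity, coverage, overlaps and conflicts for column j of L."""
    n = len(L)
    # Rows where function j actually votes
    labeled = [row for row in L if row[j] != ABSTAIN]

    def other_votes(row):
        # Non-abstain votes of all the other functions on this row
        return [v for k, v in enumerate(row) if k != j and v != ABSTAIN]

    overlaps = sum(1 for row in labeled if other_votes(row))
    conflicts = sum(
        1 for row in labeled if any(v != row[j] for v in other_votes(row))
    )
    return {
        "polarity": sorted({row[j] for row in labeled}),
        "coverage": len(labeled) / n,
        "overlaps": overlaps / n,
        "conflicts": conflicts / n,
    }


for j in range(3):
    print(j, lf_stats(L_toy, j))
```

Every conflict is also an overlap, so the Conflicts value of a function can never exceed its Overlaps value, which matches the table above.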


The LabelModel is able to learn weights for the labeling functions using only the label matrix as input. We also specify the cardinality, or number of classes.

In [8]:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=3, verbose=True)
# Note: the label model is fitted on the validation label matrix (L_val) here, not on L_train
label_model.fit(L_train=L_val, n_epochs=500, log_freq=100, seed=123)


INFO:root:Computing O...
INFO:root:Estimating \mu...
INFO:root:[0 epochs]: TRAIN:[loss=0.005]
INFO:root:[100 epochs]: TRAIN:[loss=0.000]
INFO:root:[200 epochs]: TRAIN:[loss=0.000]
INFO:root:[300 epochs]: TRAIN:[loss=0.000]
INFO:root:[400 epochs]: TRAIN:[loss=0.000]
100%|██████████| 500/500 [00:00<00:00, 831.06epoch/s]
INFO:root:Finished Training


The majority vote model counts, for each data point, how many labelling functions vote for each label. The final label is the one with the most votes: if 3 functions predict label 'a' and 2 predict label 'b', the final label is 'a'.

In [9]:
from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter(cardinality=3, verbose=True)
preds_train = majority_model.predict(L=L_train)
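The majority vote rule for a single data point can be sketched in a few lines of plain Python. This is an illustration, not Snorkel's code; in particular, returning ABSTAIN on a tie is a choice made for this sketch, whereas the scoring calls in this notebook pass tie_break_policy="random".

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(votes):
    """Return the most frequent non-abstain label, or ABSTAIN if none or tied."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN  # every function abstained
    ranked = counts.most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return ABSTAIN  # tie between the two most frequent labels
    return top_label


print(majority_vote([0, 0, 0, 1, 1]))  # three votes for 0, two votes for 1
print(majority_vote([-1, -1]))         # nobody votes
```

Abstentions are ignored when counting, so a single vote is enough to decide a data point that all other functions abstain on.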


Here we can see the accuracy scores we get. For some tasks the label model performs better, for others the majority label voter does; it is up to you to find out which works best for your task.

In [10]:
label_model_acc = label_model.score(L=L_val, Y=val_true, tie_break_policy="random")[
    "accuracy"
]
print(f"Label Model Accuracy: {label_model_acc * 100:.1f}%")

majority_acc = majority_model.score(L=L_val, Y=val_true, tie_break_policy="random")[
    "accuracy"
]
print(f"Majority Vote Accuracy: {majority_acc * 100:.1f}%")

Label Model Accuracy: 30.3%
Majority Vote Accuracy: 30.3%
