The main aim of Snorkel is to make it possible to create more training data, because labelling large amounts of data by hand is slow and laborious. Snorkel uses labelling functions to assign data points to categories. These labelling functions are noisy, programmatic rules and heuristics that assign labels to unlabelled training data.

In [2]:
# Import all the necessary packages
from utils import extract_plain_text, load_data, ABSTAIN, BENIGN, MALICIOUS, ADULT, url_tokenizetion
from url_labeling_functions import (
    lf_educational_content, lf_sexual_innuendos, lf_malicious_keywords, lf_adult_keywords,
    lf_common_benign_domains, lf_explicit_usernames, lf_adult_product_references,
    lf_common_adult_content_keywords, lf_euphemisms_for_adult, lf_adult_url_structure,
    lf_adult_industry_domains, lf_age_restriction, lf_explicit_adult_keywords,
)
from plain_text_labeling_functions import lf_plain_text_adult_content
from load_hosts import lf_outgoing_host_is_malicious, lf_host_is_malicious, lf_outgoing_host_is_adult, lf_host_is_adult
from keyword_labeling_functions import (
    lf_main_content_has_adult_content, lf_plain_text_has_adult_content, lf_html_has_adult_content,
    lf_url_has_adult_content, lf_url_param_has_adult_content, lf_url_fragment_has_adult_content,
    lf_image_text_has_adult_content, lf_video_text_has_adult_content,
)

from sklearn_utils import SkLearnClassifier

import snorkel
import json
import pandas as pd
import numpy as np



As an example, we load the training data and its ground-truth labels here.

In [3]:
fields = ['url_tokenized', 'url_params', 'url_fragment', 'outgoing_links', 'plain_text']

df_train = load_data('../data/Hackathon_data/train/D1_train.jsonl', fields)
df_train_truth = load_data('../data/Hackathon_data/train/D1_train-truth.jsonl', fields)


19783it [00:54, 362.63it/s]
19783it [00:00, 241963.08it/s]


In [4]:
df_train.iloc[0]

uid                            28597064-476d-4988-a701-00d279db856a
url                                  http://pornsites.cc/ru/fetish/
html              <!DOCTYPE html>\r\n<html lang="ru"><head><meta...
plain_text        SitePorn\nru\n  • English\n  • Deutschland\n  ...
url_tokenized                                pornsites cc ru fetish
url_params                                                         
url_fragment                                                       
outgoing_links    [/ru/hidden-cam/, /ru/crossdresser/, /c/?g=rN2...
Name: 0, dtype: object

In [7]:
SkLearnClassifier('plain_text').predict(df_train.iloc[0].to_dict())

NameError: name 'process' is not defined

In [7]:
SkLearnClassifier('url_tokenized').train(df_train, df_train_truth)

In [3]:
SkLearnClassifier('plain_text').train(df_train, df_train_truth)

In [6]:
#loading train and validation data
fields = ['url_tokenized', 'url_params', 'url_fragment', 'outgoing_links']
df_train = load_data(file_train, fields)
df_train_truth = load_data(file_train_truth, fields)
df_val = load_data(file_val, fields)


6757it [00:09, 711.77it/s]


By convention, Snorkel uses -1 for the ABSTAIN category and non-negative integers for the labels (0 and 1 in the binary case). As we have 3 labels, we use 0, 1 and 2.

We replace the string labels with these numerical values.

In [5]:
df_val_t = load_data(val_truth, [])
df_val_t = df_val_t.replace({'Benign': 0, 'Malicious': 1, 'Adult': 2})

# keep only the labelled items whose uid appears in the validation data
list_of_values = df_val['uid'].to_list()
df_val_t = df_val_t[df_val_t['uid'].isin(list_of_values)]
val_true = np.array(df_val_t['label'])

2937it [00:00, 283997.39it/s]


Labelling functions in Snorkel are created with the @labeling_function decorator. In this case, the labelling functions look for keywords in the URLs that indicate that the content is adult, malicious or benign. Labelling functions can search different parts of the data, such as the URL or the HTML content.

For example, a function can search the HTML content instead of the URL.
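
A keyword rule of this kind might look like the sketch below. The keyword lists and the function names are hypothetical and are not the ones defined in `url_labeling_functions`; the label values are assumed to match the 0/1/2 encoding used in this notebook, with -1 for ABSTAIN. The functions are written as plain Python so that the snippet runs on its own; in the notebook each one would additionally be wrapped with the `@labeling_function()` decorator from `snorkel.labeling`.

```python
# Label encoding used in this notebook (assumed to match the constants in utils)
ABSTAIN, BENIGN, MALICIOUS, ADULT = -1, 0, 1, 2

# Hypothetical keyword lists, for illustration only
ADULT_URL_KEYWORDS = {"porn", "xxx"}
MALICIOUS_HTML_KEYWORDS = {"verify your account", "password expired"}


def lf_url_adult_keywords_sketch(x):
    """Vote ADULT if a token of the URL contains an adult keyword, else abstain."""
    tokens = x["url_tokenized"].lower().split()
    hit = any(keyword in token for token in tokens for keyword in ADULT_URL_KEYWORDS)
    return ADULT if hit else ABSTAIN


def lf_html_malicious_keywords_sketch(x):
    """Vote MALICIOUS if the HTML contains a suspicious phrase, else abstain."""
    html = x["html"].lower()
    hit = any(phrase in html for phrase in MALICIOUS_HTML_KEYWORDS)
    return MALICIOUS if hit else ABSTAIN


# x can be a dict or a pandas Series; both support x["field"]
row = {"url_tokenized": "pornsites cc ru fetish", "html": "<html>...</html>"}
print(lf_url_adult_keywords_sketch(row))       # 2 (ADULT)
print(lf_html_malicious_keywords_sketch(row))  # -1 (ABSTAIN)
```

Each function votes for a single class and abstains otherwise, which keeps the rules easy to audit individually.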

To apply one or more labelling functions that we’ve written to a collection of data points, we use an LFApplier. Because our data points are represented with a Pandas DataFrame, we use the PandasLFApplier. Correspondingly, a single data point x that’s passed into our LFs will be a Pandas Series object.

In [6]:
from snorkel.labeling import PandasLFApplier

lfs = [
    lf_educational_content, lf_sexual_innuendos, lf_malicious_keywords, lf_url_has_adult_content,
    lf_adult_keywords, lf_common_benign_domains, lf_explicit_usernames,
    lf_adult_product_references, lf_common_adult_content_keywords, lf_euphemisms_for_adult,
    lf_adult_url_structure, lf_adult_industry_domains, lf_age_restriction, lf_explicit_adult_keywords,
    lf_outgoing_host_is_malicious, lf_host_is_malicious, lf_outgoing_host_is_adult, lf_host_is_adult,
    # Currently disabled:
    # lf_plain_text_adult_content, lf_main_content_has_adult_content, lf_plain_text_has_adult_content,
    # lf_html_has_adult_content, lf_url_param_has_adult_content, lf_url_fragment_has_adult_content,
    # lf_image_text_has_adult_content, lf_video_text_has_adult_content,
]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_val = applier.apply(df=df_val)


100%|██████████| 19783/19783 [00:35<00:00, 551.53it/s]
100%|██████████| 2937/2937 [00:03<00:00, 776.67it/s]
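
To make the shape of the resulting label matrix concrete, here is a minimal sketch of what applying labelling functions amounts to. `PandasLFApplier.apply` returns a NumPy array with one row per data point and one column per labelling function; the sketch below builds the same layout with nested lists. The two labelling functions, the helper `apply_lfs` and the example rows are hypothetical.

```python
ABSTAIN, BENIGN, MALICIOUS, ADULT = -1, 0, 1, 2


# Hypothetical labelling functions, for illustration only
def lf_edu_sketch(x):
    return BENIGN if "edu" in x["url_tokenized"].split() else ABSTAIN


def lf_xxx_sketch(x):
    return ADULT if "xxx" in x["url_tokenized"].split() else ABSTAIN


def apply_lfs(lfs, rows):
    """Return a label matrix: one row per data point, one column per LF."""
    return [[lf(x) for lf in lfs] for x in rows]


rows = [
    {"url_tokenized": "university edu courses"},
    {"url_tokenized": "xxx videos com"},
    {"url_tokenized": "news example com"},
]
L = apply_lfs([lf_edu_sketch, lf_xxx_sketch], rows)
print(L)  # [[0, -1], [-1, 2], [-1, -1]]
```

The last row contains only -1: no labelling function fired, so that data point carries no signal for the models that follow.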


Here we have statistics about how our labelling functions label the data. Polarity is the set of labels that a function returns, not counting ABSTAIN (for example, [1] for a function that only ever votes Malicious). Coverage is the fraction of the dataset that the function labels. Overlaps is the fraction of the dataset where the function and at least one other function both assign a label. Conflicts is the fraction of the dataset where the function and at least one other function assign different labels.

In [7]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_val, lfs=lfs).lf_summary()


                                   j  Polarity  Coverage  Overlaps  Conflicts
lf_educational_content             0       [0]  0.009874  0.000681   0.000681
lf_sexual_innuendos                1       [2]  0.002383  0.001362        0.0
lf_malicious_keywords              2       [1]  0.001702       0.0        0.0
lf_url_has_adult_content           3       [2]  0.034729   0.01396   0.002724
lf_adult_keywords                  4       [2]  0.010895  0.010555   0.001021
lf_common_benign_domains           5        []       0.0       0.0        0.0
lf_explicit_usernames              6       [2]  0.000681  0.000681        0.0
lf_adult_product_references        7        []       0.0       0.0        0.0
lf_common_adult_content_keywords   8       [2]  0.006129  0.006129    0.00034
lf_euphemisms_for_adult            9       [2]   0.00034       0.0        0.0
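
The Coverage, Overlaps and Conflicts columns can be reproduced by hand on a small label matrix. The sketch below is not Snorkel's implementation; it follows the usual definitions (fractions of the dataset, with -1 meaning ABSTAIN), and the matrix `L` and the helper `lf_stats` are made up for illustration.

```python
ABSTAIN = -1


def lf_stats(L, j):
    """Return (coverage, overlaps, conflicts) for the labelling function in column j."""
    n = len(L)
    covered = overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue  # this LF did not label the data point
        covered += 1
        # Non-abstain votes cast by the other labelling functions on this point
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / n, overlaps / n, conflicts / n


# Hypothetical label matrix: 4 data points, 3 labelling functions
L = [
    [2, 2, -1],
    [2, 0, -1],
    [-1, -1, -1],
    [2, -1, -1],
]
for j in range(3):
    print(j, lf_stats(L, j))
# 0 (0.75, 0.5, 0.25)
# 1 (0.5, 0.5, 0.25)
# 2 (0.0, 0.0, 0.0)
```

By construction, Conflicts can never exceed Overlaps, and Overlaps can never exceed Coverage, which is a quick sanity check when reading the table.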


The LabelModel is able to learn weights for the labeling functions using only the label matrix as input. We also specify the cardinality, or number of classes.

In [8]:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=3, verbose=True)
# Note: the model is fitted on the validation label matrix (L_val) here, not on L_train
label_model.fit(L_train=L_val, n_epochs=500, log_freq=100, seed=123)


INFO:root:Computing O...
INFO:root:Estimating \mu...
  0%|          | 0/500 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=0.002]
  0%|          | 1/500 [00:00<02:14,  3.71epoch/s]INFO:root:[100 epochs]: TRAIN:[loss=0.001]
 25%|██▌       | 125/500 [00:00<00:00, 430.97epoch/s]INFO:root:[200 epochs]: TRAIN:[loss=0.001]
 51%|█████     | 253/500 [00:00<00:00, 710.92epoch/s]INFO:root:[300 epochs]: TRAIN:[loss=0.001]
 74%|███████▍  | 369/500 [00:00<00:00, 853.42epoch/s]INFO:root:[400 epochs]: TRAIN:[loss=0.001]
100%|██████████| 500/500 [00:00<00:00, 703.27epoch/s]
INFO:root:Finished Training


The majority vote model counts, for each data point, how many labelling functions vote for each label. The final label is the one with the most votes: if 3 functions predict label 'a' and 2 predict label 'b', the final label is 'a'.

In [9]:
from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter(cardinality=3, verbose=True)
preds_train = majority_model.predict(L=L_train)
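
The voting rule can be sketched in a few lines of plain Python. This is an illustration of the idea rather than Snorkel's code: abstentions (-1) are not counted as votes, and the hypothetical helper `majority_vote` returns ABSTAIN on a tie, which is only one possible policy (the scoring cell in this notebook breaks ties at random instead).

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(row):
    """Return the most frequent non-abstain label in one row of the label matrix."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN  # every labelling function abstained
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the two most frequent labels
    return ranked[0][0]


print(majority_vote([2, 2, 2, 1, 1]))  # 2: three votes against two
print(majority_vote([2, 1, -1]))       # -1: tie
print(majority_vote([-1, -1, -1]))     # -1: no votes at all
```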


Here we can see the accuracy scores we get. For some tasks the label model is better, for others the majority label voter is better; it is up to you to find out which.

In [10]:
label_model_acc = label_model.score(L=L_val, Y=val_true, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:'} {label_model_acc * 100:.1f}%")

majority_acc = majority_model.score(L=L_val, Y=val_true, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Majority Vote Accuracy:'} {majority_acc * 100:.1f}%")

Label Model Accuracy: 35.6%
Majority Vote Accuracy: 35.2%
