The main aim of Snorkel is to allow to create more training data, because labelling large amounts of data manually is very hard and requires a lot of effort. Snorkel is using labelling functions to classify the data in different categories. Those labelling functions represent noisy, programmatic rules and heuristics that assign labels to unlabeled training data.

In [171]:
!pip install snorkel
!pip install jsonlines



In [172]:
#importing all the necessary packages
import snorkel
import json
import pandas as pd
import numpy as np
from snorkel.labeling import labeling_function

As an example, we load train and validation data for spam detection here

In [173]:
file_train = '/content/drive/MyDrive/data_colab/D1_train.jsonl'
file_val = '/content/drive/MyDrive/data_colab/D1_validation.jsonl'
val_truth = '/content/drive/MyDrive/data_colab/D1_validation-truth.jsonl'

def load_data(file_path):
  data = []
  with open(file_path, 'r') as file:
      for line in file:
          json_data = json.loads(line)
          data.append(json_data)
  df = pd.DataFrame(data)
  return df


In [174]:
#loading train and validation data
df_train = load_data(file_train)
df_val = load_data(file_val)


Usually, Snorkel returns -1 value for ABSTAIN category and 0 or 1 for labels, but, as we have 3 labels, we use 0, 1 and 2.

In [175]:
ABSTAIN = -1
BENIGN = 0
MALICIOUS = 1
ADULT = 2

We replace the labels by numerical values

In [176]:
df_val_t = load_data(val_truth)
df_val_t = df_val_t.replace({'Benign': 0, 'Malicious': 1, 'Adult': 2})

# we check if all the labeled items are in data
list_of_values = df_val['uid'].to_list()
df_val_t = df_val_t[df_val_t['uid'].isin(list_of_values)]
df_val_t
val_true = np.array(df_val_t['label'])

Labeling functions in Snorkel are created with the @labeling_function decorator. In this case, the labelling function uses the keywords that are present in the urls and indicate that the content is adult, malicious or benign. The functions can search in different parts of the data (such as URL and html content).

In [177]:
@labeling_function()
def lf_educational_content(x):
    edu_keywords = ['academic', 'research', 'conference', 'student','school','education', 'university']
    url = x['url']
    if any(keyword in url for keyword in edu_keywords):
        return BENIGN
    return ABSTAIN

In [178]:
@labeling_function()
def lf_sexual_innuendos(x):
    innuendos = ['booty', 'babe', 'milf', 'daddy','chick']
    url = x['url']
    if any(innuendo in url for innuendo in innuendos):
        return ADULT
    return ABSTAIN

In [179]:
@labeling_function()
def lf_malicious_keywords(x):
    malicious_keywords = ['hack', 'phish', 'malware', 'spyware']
    url = x['url']
    if any(keyword in url for keyword in malicious_keywords):
        return MALICIOUS
    return ABSTAIN


This function searches in html content instead of URL

In [180]:
@labeling_function()
def lf_adult_content(x):
    adult_keywords = ['sex', 'porn', 'hot','erotic']
    html = x['html']
    if any(keyword in html for keyword in adult_keywords):
        return ADULT
    return ABSTAIN

To apply one or more labelling functions that we’ve written to a collection of data points, we use an LFApplier. Because our data points are represented with a Pandas DataFrame, we use the PandasLFApplier. Correspondingly, a single data point x that’s passed into our LFs will be a Pandas Series object.

In [181]:
from snorkel.labeling import PandasLFApplier

lfs = [lf_educational_content, lf_sexual_innuendos,lf_malicious_keywords,lf_adult_content]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_val = applier.apply(df=df_val)


100%|██████████| 19783/19783 [00:07<00:00, 2657.97it/s]
100%|██████████| 2936/2936 [00:01<00:00, 1945.79it/s]


Here we have statistics about the labelling of our labelling functions. Polarity means what labels and how many labels does a function return (for example, Malicious and Abstain). Coverage indicates how much dataset is labeled by this function. Overlaps is when several functions give the same label to the data. Conflicts is when different functions labeled the same data differently.

In [182]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_educational_content,0,[0],0.006723,0.003336,0.003336
lf_sexual_innuendos,1,[2],0.012536,0.012384,5.1e-05
lf_malicious_keywords,2,[1],0.001415,0.000202,0.000202
lf_adult_content,3,[2],0.449527,0.015872,0.003538


The LabelModel is able to learn weights for the labeling functions using only the label matrix as input. We also specify the cardinality, or number of classes.

In [183]:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=3, verbose=True)
label_model.fit(L_train=L_val, n_epochs=500, log_freq=100, seed=123)


100%|██████████| 500/500 [00:01<00:00, 293.99epoch/s]


The majority vote model uses the information about how many functions predict that this data will have this label and how many predict that this data will have a different label. The final label is determined by the majority: if 3 functions predict that it will be a label 'a' and 2 that if will be a label 'b', the final label will be 'a'

In [184]:
from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter(cardinality=3, verbose=True)
preds_train = majority_model.predict(L=L_train)


Here we can see, what accuracy scores we get. For some tasks label model is better, for some majority label voter is better: for you to find it out.

In [185]:
label_model_acc = label_model.score(L=L_val, Y=val_true, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:'} {label_model_acc * 100:.1f}%")

majority_acc = majority_model.score(L=L_val, Y=val_true, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Majority Vote Accuracy:'} {majority_acc * 100:.1f}%")

Label Model Accuracy: 25.4%
Majority Vote Accuracy: 25.8%
