# ✂️ _Spam_ — Data Slicing Tutorial

In real-world applications, _some model outcomes are often more important than others_, e.g. vulnerable cyclist detections in an autonomous driving task, or, in our running **spam** application, potentially malicious link redirects to external websites.

Traditional machine learning systems optimize for overall quality, which may be too coarse-grained: models that achieve high overall performance might still produce unacceptable failure rates on critical _slices_ of the data, i.e. subsets such as the cyclist detections or malicious links mentioned above.

In this tutorial, we introduce _Slicing Functions (SFs)_ as a programming interface to:
1. **Monitor** application-critical data slices
2. **Address model performance** on slices

First, we'll set up our notebook for reproducibility and proper logging.

In [1]:
import logging
import os
import pandas as pd
from snorkel.analysis.utils import set_seed

# For reproducibility
os.environ["PYTHONHASHSEED"] = "0"
set_seed(111)

# Make sure we're running from the spam/ directory
if os.path.basename(os.getcwd()) == "snorkel-tutorials":
    os.chdir("spam")

# To visualize logs
logger = logging.getLogger()
logger.setLevel(logging.WARNING)

# Show full columns for viewing data
pd.set_option("display.max_colwidth", -1)

_Note:_ this tutorial differs from the labeling tutorial in that we use ground truth labels in the train split for demo purposes.
In practice, data slicing is agnostic to the _training labels_ used as inputs — you can use Snorkel-generated labels as inputs to this pipeline!

In [2]:
from utils import load_spam_dataset

df_train, df_valid, df_test = load_spam_dataset(
    load_train_labels=True, include_dev=False
)

## 1. Train a discriminative model

To start, we'll initialize a discriminative model using our [`SnorkelClassifier`](https://snorkel.readthedocs.io/en/redux/source/snorkel.classification.html).
We'll assume that you are familiar with Snorkel's data/model/training abstractions. If not, we'd recommend you check out our [MTL Tutorial](https://github.com/snorkel-team/snorkel-tutorials/blob/master/mtl/multitask_tutorial.ipynb).

### Featurize Data

As a first step, we'll featurize the data—as you saw in the introductory Spam tutorial, we'll extract simple bag of words features and store them as numpy arrays.

In [3]:
import torch
from sklearn.feature_extraction.text import CountVectorizer
from snorkel.classification.data import DictDataset, DictDataLoader

vectorizer = CountVectorizer(ngram_range=(1, 1))


def df_to_torch_example(vectorizer, df, fit_train=False):
    words = df["text"].tolist()

    # Fit the vocabulary on the train split only; reuse it for other splits
    if fit_train:
        feats = vectorizer.fit_transform(words)
    else:
        feats = vectorizer.transform(words)
    # Convert the sparse count matrix to a dense numpy array
    X = feats.toarray()
    Y = df["label"].values
    return X, Y

In [4]:
X_train, Y_train = df_to_torch_example(vectorizer, df_train, fit_train=True)
X_valid, Y_valid = df_to_torch_example(vectorizer, df_valid, fit_train=False)
X_test, Y_test = df_to_torch_example(vectorizer, df_test, fit_train=False)
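
To make the featurization step concrete, here is a minimal pure-Python sketch of what a unigram bag-of-words transform does: learn a vocabulary from the training text only, then count occurrences per document. It is an illustration, not scikit-learn's implementation. The helper names `tokenize`, `fit_vocabulary`, and `transform` are hypothetical, and the tokenizer only approximates `CountVectorizer`'s default behavior of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

# Approximates CountVectorizer's default tokenization: lowercase,
# then keep runs of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def fit_vocabulary(train_texts):
    """Map each token seen in the training texts to a column index."""
    tokens = sorted({tok for text in train_texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(texts, vocab):
    """Return one count vector per text; tokens outside the vocabulary are dropped."""
    rows = []
    for text in texts:
        counts = Counter(tok for tok in tokenize(text) if tok in vocab)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


train_texts = ["check out my channel", "love this song"]
vocab = fit_vocabulary(train_texts)
X_toy_train = transform(train_texts, vocab)
# "video" never appeared in training, so it contributes no feature.
X_toy_valid = transform(["love this video, check check"], vocab)
```

Fitting on the training split only is why the cells above pass `fit_train=True` once and `fit_train=False` for the validation and test splits.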

### Create DataLoaders

Next, we'll use the extracted features to initialize a `DictDataLoader`. As a quick recap, this is a Snorkel-specific class that inherits from PyTorch's `DataLoader` and supports multiple data fields in the `X_dict` and multiple label sets in the `Y_dict`.

In this task, we'd like to store the `bow_features` in our `X_dict`, and we have one set of labels (for now) corresponding to the `spam_task`.

In [5]:
BATCH_SIZE = 32


def create_dict_dataloader(X, Y, split, **kwargs):
    ds = DictDataset(
        name="spam_dataset",
        split=split,
        X_dict={"bow_features": torch.FloatTensor(X)},
        Y_dict={"spam_task": torch.LongTensor(Y)},
    )
    return DictDataLoader(ds, **kwargs)


dl_train = create_dict_dataloader(
    X_train, Y_train, split="train", batch_size=BATCH_SIZE, shuffle=True
)
dl_valid = create_dict_dataloader(
    X_valid, Y_valid, split="valid", batch_size=BATCH_SIZE, shuffle=False
)
dl_test = create_dict_dataloader(
    X_test, Y_test, split="test", batch_size=BATCH_SIZE, shuffle=False
)

We can inspect our datasets to confirm that they have the appropriate fields.

In [6]:
dl_valid.dataset

DictDataset(name=spam_dataset, X_keys=['bow_features'], Y_keys=['spam_task'])

### Define `SnorkelClassifier`

We'll define a simple Multi-Layer Perceptron (MLP) architecture to learn from the `bow_features`.

_Note: the following might feel like extra steps to define what is a very simple architecture, but this will lend us additional flexibility later in the pipeline!_

To start, we define a `module_pool` with all the [PyTorch](https://pytorch.org) modules that we'll want to include in our network.

In [7]:
import torch.nn as nn

bow_dim = X_train.shape[1]
module_pool = nn.ModuleDict(
    {
        "mlp": nn.Sequential(nn.Linear(bow_dim, bow_dim), nn.ReLU()),
        "prediction_head": nn.Linear(bow_dim, 2),
    }
)

Then, we specify the desired `task_flow` through each module.

In [8]:
from snorkel.classification.task import Operation

task_flow = [
    Operation(name="input_op", module_name="mlp", inputs=[("_input_", "bow_features")]),
    Operation(name="head_op", module_name="prediction_head", inputs=[("input_op", 0)]),
]
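
One way to picture a `task_flow` is as a small program: each operation names a module, and each input is a `(source, key)` pair that points either at the raw input fields or at an earlier operation's outputs. The sketch below is a plain-Python illustration of that idea, not Snorkel's actual execution code. `run_flow`, the toy modules, and the tuple layout used for operations are all hypothetical stand-ins.

```python
def run_flow(flow, modules, x_dict):
    """Execute operations in order, storing each op's outputs by name."""
    outputs = {"_input_": x_dict}
    for op_name, module_name, inputs in flow:
        # Resolve each (source, key) pair against what has been computed so far.
        args = [outputs[source][key] for source, key in inputs]
        result = modules[module_name](*args)
        # Store results as a list so later ops can index them by position.
        outputs[op_name] = [result]
    return outputs


# Toy stand-ins for the "mlp" and "prediction_head" modules.
toy_modules = {
    "mlp": lambda feats: [max(0.0, 2.0 * v) for v in feats],
    "prediction_head": lambda hidden: [sum(hidden), -sum(hidden)],
}

toy_flow = [
    ("input_op", "mlp", [("_input_", "bow_features")]),
    ("head_op", "prediction_head", [("input_op", 0)]),
]

toy_outputs = run_flow(toy_flow, toy_modules, {"bow_features": [1.0, -3.0, 0.5]})
logits = toy_outputs["head_op"][0]
```

Keeping every intermediate output addressable by operation name is what makes it possible to attach additional heads to the same shared representation later on.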

With these pieces, we're ready to define a [`Task`](https://snorkel.readthedocs.io/en/redux/source/snorkel.classification.html#module-snorkel.classification.task) in Snorkel for spam classification.

In [9]:
from functools import partial
from snorkel.classification.task import Task, ce_loss, softmax
from snorkel.classification.scorer import Scorer

spam_task = Task(
    name="spam_task",
    module_pool=module_pool,
    task_flow=task_flow,
    loss_func=partial(ce_loss, "head_op"),
    output_func=partial(softmax, "head_op"),
    scorer=Scorer(metrics=["accuracy", "f1"]),
)

We'll initialize a [`SnorkelClassifier`](https://snorkel.readthedocs.io/en/redux/source/snorkel.classification.html) with the `spam_task` we've created, initialize a corresponding [`Trainer`](https://snorkel.readthedocs.io/en/redux/source/snorkel.classification.training.html#module-snorkel.classification.training.trainer), and `fit` to our dataloaders!

In [10]:
from snorkel.classification.snorkel_classifier import SnorkelClassifier
from snorkel.classification.training import Trainer

model = SnorkelClassifier([spam_task])
trainer = Trainer(n_epochs=5, lr=1e-4, progress_bar=True)
# trainer.fit(model, [dl_train, dl_valid])

How well does our model do?

In [11]:
model.score([dl_train, dl_valid], as_dataframe=True)

Unnamed: 0,label,dataset,split,metric,score
0,spam_task,spam_dataset,train,accuracy,0.471627
1,spam_task,spam_dataset,train,f1,0.093074
2,spam_task,spam_dataset,valid,accuracy,0.55
3,spam_task,spam_dataset,valid,f1,0.129032


## 2. Perform error analysis

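In terms of overall metrics (`f1`, `accuracy`), a trained model can appear to perform well on this task. (The scores shown above sit near chance, which suggests they were recorded with the `trainer.fit` call commented out, as it is in the cell above; uncomment it to train the model before scoring.)

However, more often than not, we're interested in performance on application-critical subsets, or _slices_.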

Let's perform an [`error_analysis`](https://snorkel.readthedocs.io/en/redux/source/snorkel.analysis.html#module-snorkel.analysis.error_analysis) to see where our model makes mistakes.
We'll collect the predictions from the model and visualize examples in specific error buckets.

In [12]:
from snorkel.analysis.error_analysis import get_label_buckets
from snorkel.analysis.utils import probs_to_preds

outputs = model.predict(dl_valid, return_preds=True)
error_buckets = get_label_buckets(
    outputs["golds"]["spam_task"], outputs["preds"]["spam_task"]
)
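
Conceptually, the buckets group example indices by their `(gold, predicted)` label pair, which is why we can index them with a tuple such as `(1, 0)`. A minimal sketch of that grouping in plain Python follows. `bucket_by_labels` is a hypothetical helper for illustration, not the Snorkel function itself.

```python
from collections import defaultdict


def bucket_by_labels(golds, preds):
    """Group example indices by their (gold, pred) label pair."""
    buckets = defaultdict(list)
    for idx, (gold, pred) in enumerate(zip(golds, preds)):
        buckets[(gold, pred)].append(idx)
    return dict(buckets)


toy_golds = [1, 0, 1, 1, 0]
toy_preds = [1, 0, 0, 1, 1]
toy_buckets = bucket_by_labels(toy_golds, toy_preds)

# With spam as label 1, bucket (1, 0) holds the false negatives.
false_negatives = toy_buckets[(1, 0)]
```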

For application purposes, we might care especially about false negatives (true label was `1`, but model predicted `0`) — for the spam task, external links might point to malware, and we don't want to expose our users to these risks!

In [13]:
df_valid[["text", "label"]].iloc[error_buckets[(1, 0)]].head()

Unnamed: 0,text,label
70,"You guys should check out this EXTRAORDINARY website called ZONEPA.COM . You can make money online and start working from home today as I am! I am making over $3,000+ per month at ZONEPA.COM ! Visit Zonepa.com and check it out! How does the mother approve the axiomatic insurance? The fear appoints the roll. When does the space prepare the historical shame?",1
156,Check out these Irish guys cover of Avicii&#39;s Wake Me Up! Just search... &quot;wake me up Fiddle Me Silly&quot; Worth a listen for the gorgeous fiddle player!,1
73,"if you want to win money at hopme click here <a href=""https://www.paidverts.com/ref/sihaam01"">https://www.paidverts.com/ref/sihaam01</a> it&#39;s work 100/100﻿",1
130,"Hey Youtubers and All Music lover&#39;s, Guess most of you all skip these comments, but for you who is still reading this, thanks ! I dont have any money for advertisiments, no chance of getting heard, nothing. All that&#39;s left is spam, sorry. Im 17, Rapper/Singer from Estonia. Please listen my new cover on my account. You wont regret it. Give me just a chance, please. Take half a second of your life and thumb this comment up. It will maybe change my life, for real. Thank you Wafence",1
242,**CHECK OUT MY NEW MIXTAPE**** **CHECK OUT MY NEW MIXTAPE**** **CHECK OUT MY NEW MIXTAPE*** ***CHECK OUT MY NEW MIXTAPE******CHECK OUT MY NEW MIXTAPE**** **CHECK OUT MY NEW MIXTAPE**** **CHECK OUT MY NEW MIXTAPE*** ***CHECK OUT MY NEW MIXTAPE******CHECK OUT MY NEW MIXTAPE**** **CHECK OUT MY NEW MIXTAPE**** **CHECK OUT MY NEW MIXTAPE*** ***CHECK OUT MY NEW MIXTAPE******CHECK OUT MY NEW MIXTAPE**** **CHECK OUT MY NEW MIXTAPE**** **CHECK OUT MY NEW MIXTAPE*** ***CHECK OUT MY NEW MIXTAPE****,1


We notice that we're misclassifying some comments that contain shortened URLs (e.g. `bit.ly/...`). These links could redirect users to potentially dangerous websites, and we don't want our users to click them!

## 3. Monitor data slices

We leverage *slicing functions* (SFs) — an abstraction that shares syntax with *labeling functions*, which you should already be familiar with! (If not, please see the [intro tutorial](https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb).) A key difference: whereas labeling functions output labels, slicing functions output binary _masks_ indicating whether an example is in the slice or not.

In the following cells, we define a slicing function that identifies these shortened links in the spam dataset.
To do so, we write a regex that checks for the commonly-used `.ly` extension.

You'll notice that the slicing function is noisily defined — SFs are often heuristics to quickly measure performance over important subsets of the data.

In [14]:
import re
from snorkel.slicing.sf import slicing_function


@slicing_function()
def short_link(x):
    """Comments that contain a shortened link with a `.ly` extension."""
    return bool(re.search(r"\w+\.ly", x.text))


sfs = [short_link]
slice_names = [sf.name for sf in sfs]

For our $n$ examples and $k$ slices in each split, we apply the SF to our data to create an $n \times k$ matrix. (So far, $k=1$).

In [15]:
from snorkel.slicing.apply import PandasSFApplier

applier = PandasSFApplier(sfs)
S_train = applier.apply(df_train)
S_valid = applier.apply(df_valid)
S_test = applier.apply(df_test)

100%|██████████| 1586/1586 [00:00<00:00, 40608.04it/s]
100%|██████████| 120/120 [00:00<00:00, 30470.79it/s]
100%|██████████| 250/250 [00:00<00:00, 37098.04it/s]
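
The slice matrix itself is simple to picture: row $i$, column $j$ is 1 when slicing function $j$ fires on example $i$. Below is a minimal stand-alone sketch of that construction. `apply_sfs` and the toy comments are hypothetical, and plain functions stand in for Snorkel's decorated slicing functions.

```python
import re
from types import SimpleNamespace


def toy_short_link(x):
    return bool(re.search(r"\w+\.ly", x.text))


def toy_short_comment(x):
    return len(x.text.split()) < 5


def apply_sfs(sfs, examples):
    """Return an n x k list of 0/1 membership flags (n examples, k slices)."""
    return [[int(sf(x)) for sf in sfs] for x in examples]


toy_examples = [
    SimpleNamespace(text="Earn money online today at bit.ly/abc123"),
    SimpleNamespace(text="cool video!"),
    SimpleNamespace(text="This song reminds me of summer every single year"),
]
S_toy = apply_sfs([toy_short_link, toy_short_comment], toy_examples)
```

An example can belong to several slices at once or to none, so rows of the matrix are not one-hot.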


### Visualize slices with `PandasSlicer`

With a utility function from `snorkel.slicing.monitor`, we can visualize examples belonging to this slice in a `pandas.DataFrame`.

In [16]:
from snorkel.slicing.monitor import PandasSlicer

pd_slicer = PandasSlicer(df_valid)
short_link_df = pd_slicer.slice(short_link)
short_link_df[["text", "label"]]

100%|██████████| 120/120 [00:00<00:00, 31151.60it/s]


Unnamed: 0,text,label
280,Being paid to respond to fast paid surveys from home has enabled me to give up working and make more than 4500 bucks monthly. To read more go to this web site bit.ly\1bSefQe,1
192,Meet The Richest Online Marketer NOW CLICK : bit.ly/make-money-without-adroid,1
301,"coby this USL and past :<br /><a href=""http://adf.ly"">http://adf.ly</a> /1HmVtX<br />delete space after y﻿",1
350,adf.ly / KlD3Y,1
18,Earn money for being online with 0 efforts! bit.ly\14gKvDo,1


Now, we add labels for this particular slice to an existing dataloader.
Specifically, `add_slice_labels` will add two sets of labels for each slice:
* `spam_task_slice:{slice_name}_ind`: an indicator label, which corresponds to the outputs of the slicing functions. These indicate whether each example is in the slice (`label=1`) or not (`label=0`).
* `spam_task_slice:{slice_name}_pred`: a _masked_ set of the original task labels (in this case, labeled `spam_task`) for each slice. Examples that are masked (with `label=-1`) will not contribute to loss or scoring.
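
As a rough illustration of the two label sets, the sketch below derives indicator labels and masked prediction labels from one slice column, then shows that masked entries drop out of scoring. The helper names `make_slice_labels` and `masked_accuracy` are hypothetical; this is not the implementation of `add_slice_labels`.

```python
MASK = -1  # label value that excludes an example from loss and scoring


def make_slice_labels(task_labels, slice_mask):
    """Build (ind_labels, pred_labels) for one slice."""
    ind_labels = [int(in_slice) for in_slice in slice_mask]
    pred_labels = [
        label if in_slice else MASK
        for label, in_slice in zip(task_labels, slice_mask)
    ]
    return ind_labels, pred_labels


def masked_accuracy(labels, preds):
    """Accuracy over unmasked examples only; None if every example is masked."""
    pairs = [(y, p) for y, p in zip(labels, preds) if y != MASK]
    if not pairs:
        return None
    return sum(y == p for y, p in pairs) / len(pairs)


toy_labels = [1, 0, 1, 1]
toy_mask = [1, 0, 0, 1]  # examples 0 and 3 are in the slice
ind_labels, pred_labels = make_slice_labels(toy_labels, toy_mask)

toy_model_preds = [0, 0, 1, 1]
slice_accuracy = masked_accuracy(pred_labels, toy_model_preds)
```

Here the toy predictions are right on three of four examples overall, but only one of the two in-slice examples, which is exactly the gap that slice-level labels are meant to expose.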

In [17]:
from snorkel.slicing.utils import add_slice_labels

slice_names = [sf.name for sf in sfs]
add_slice_labels(dl_train, spam_task, S_train, slice_names)
add_slice_labels(dl_valid, spam_task, S_valid, slice_names)
add_slice_labels(dl_test, spam_task, S_test, slice_names)

In [18]:
dl_valid.dataset

DictDataset(name=spam_dataset, X_keys=['bow_features'], Y_keys=['spam_task', 'spam_task_slice:short_link_ind', 'spam_task_slice:short_link_pred', 'spam_task_slice:base_ind', 'spam_task_slice:base_pred'])

With our updated dataloader, we want to evaluate our model on the defined slice. In the `SnorkelClassifier`, we can call `score` with an additional argument, `remap_labels`, to specify that the slice's prediction labels, `spam_task_slice:short_link_pred`, should be mapped to the `spam_task` head for evaluation.

In [19]:
model.score(
    dataloaders=[dl_valid, dl_test],
    remap_labels={"spam_task_slice:short_link_pred": "spam_task"},
    as_dataframe=True,
)



Unnamed: 0,label,dataset,split,metric,score
0,spam_task,spam_dataset,valid,accuracy,0.55
1,spam_task,spam_dataset,valid,f1,0.129032
2,spam_task_slice:short_link_pred,spam_dataset,valid,accuracy,0.0
3,spam_task_slice:short_link_pred,spam_dataset,valid,f1,0.0
4,spam_task,spam_dataset,test,accuracy,0.524
5,spam_task,spam_dataset,test,f1,0.016529
6,spam_task_slice:short_link_pred,spam_dataset,test,accuracy,0.0
7,spam_task_slice:short_link_pred,spam_dataset,test,f1,0.0


### Monitor slices with `SliceScorer`

If you're using a model other than `SnorkelClassifier`, you can still evaluate on slices using the more general `SliceScorer` class.

We define a `LogisticRegression` model from sklearn and show how we might visualize these slice-specific scores.

In [20]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(C=0.001, solver="liblinear")
sklearn_model.fit(X=X_train, y=Y_train)
sklearn_model.score(X_test, Y_test)

0.928

In [21]:
from snorkel.analysis.utils import preds_to_probs
from snorkel.slicing.monitor import SliceScorer


preds_test = sklearn_model.predict(X_test)

scorer = Scorer(metrics=["accuracy", "f1"])
scorer = SliceScorer(scorer, slice_names)
scorer.score(
    S_matrix=S_test,
    golds=Y_test,
    preds=preds_test,
    probs=preds_to_probs(preds_test, 2),
    as_dataframe=True,
)

Unnamed: 0,accuracy,f1
overall,0.928,0.925
short_link,0.333333,0.5
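
The idea behind slice-level scoring is model-agnostic: for each column of the slice matrix, keep only the member examples and compute the usual metrics on them. A minimal sketch follows, with hypothetical helpers `accuracy`, `f1_score_binary`, and `score_slices`. It treats label 1 as the positive class and is not the `SliceScorer` implementation.

```python
def accuracy(golds, preds):
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)


def f1_score_binary(golds, preds, positive=1):
    tp = sum(g == positive and p == positive for g, p in zip(golds, preds))
    fp = sum(g != positive and p == positive for g, p in zip(golds, preds))
    fn = sum(g == positive and p != positive for g, p in zip(golds, preds))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def score_slices(S, golds, preds, slice_names):
    """Compute (accuracy, f1) overall and for each slice column of S."""
    scores = {"overall": (accuracy(golds, preds), f1_score_binary(golds, preds))}
    for col, name in enumerate(slice_names):
        idx = [i for i, row in enumerate(S) if row[col]]
        if idx:  # skip empty slices rather than divide by zero
            g = [golds[i] for i in idx]
            p = [preds[i] for i in idx]
            scores[name] = (accuracy(g, p), f1_score_binary(g, p))
    return scores


toy_S = [[1], [0], [1], [0], [1], [0]]
toy_golds = [1, 0, 1, 0, 1, 1]
toy_preds = [1, 0, 0, 0, 0, 1]
toy_scores = score_slices(toy_S, toy_golds, toy_preds, ["short_link"])
```

Because the only inputs are gold labels, predictions, and the slice matrix, the same approach works for any model that can produce predictions.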


## 4. Address slice performance

In classification tasks, we might attempt to increase slice performance with techniques like _oversampling_ (e.g. with PyTorch's [`WeightedRandomSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.WeightedRandomSampler)).
This would shift the training distribution to over-represent certain minority populations.
Intuitively, we'd like to show more `short_link` examples to the model so that the representation is better suited to handle these examples!

A technique like upsampling might work with a small number of slices, but with a large number of slices, it could quickly become intractable to tune upsampling weights per slice.
In the following section, we show how we might handle numerous slices with a modeling approach using `SnorkelClassifier`.
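
For intuition, oversampling a single slice amounts to giving its examples a larger sampling weight. The sketch below uses Python's standard `random.choices` as a stand-in for a weighted sampler. The helper name `slice_sampling_weights` and the `boost` factor of 5 are arbitrary illustrative choices, and each additional slice would add another such weight to tune.

```python
import random


def slice_sampling_weights(slice_mask, boost):
    """Weight in-slice examples `boost` times more than the rest."""
    return [float(boost) if in_slice else 1.0 for in_slice in slice_mask]


toy_mask = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # 2 of 10 examples are in the slice
weights = slice_sampling_weights(toy_mask, boost=5)

# Expected share of in-slice draws: 10 / (8 + 10), up from 2 / 10.
expected_share = sum(w for w, m in zip(weights, toy_mask) if m) / sum(weights)

# Draw indices with replacement, proportionally to the weights.
rng = random.Random(0)
sampled_idx = rng.choices(range(len(toy_mask)), weights=weights, k=1000)
observed_share = sum(toy_mask[i] for i in sampled_idx) / len(sampled_idx)
```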

### Write additional slicing functions (SFs)

We'll take inspiration from the labeling tutorial to write a few additional `SlicingFunctions`.

In [22]:
from snorkel.slicing.sf import SlicingFunction, slicing_function, nlp_slicing_function
from snorkel.preprocess import preprocessor


def keyword_lookup(x, keywords):
    return any(word in x.text.lower() for word in keywords)


def make_keyword_sf(keywords):
    return SlicingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords),
    )


# Spam comments ask users to subscribe to their channels.
keyword_subscribe = make_keyword_sf(keywords=["subscribe"])

# Spam comments make requests rather than commenting.
keyword_please = make_keyword_sf(keywords=["please", "plz"])


@nlp_slicing_function()
def has_person_nlp(x):
    """Ham comments mention specific people and are short."""
    return len(x.doc) < 20 and any(ent.label_ == "PERSON" for ent in x.doc.ents)


@slicing_function()
def regex_check_out(x):
    """Comments that contain a phrase like 'check out' or 'check it out'."""
    return bool(re.search(r"check.*out", x.text, flags=re.I))


@slicing_function()
def short_comment(x):
    """Ham comments are often short, such as 'cool video!'"""
    return len(x.text.split()) < 5


from textblob import TextBlob


@preprocessor(memoize=True)
def textblob_sentiment(x):
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x


@slicing_function(pre=[textblob_sentiment])
def textblob_polarity(x):
    """Comments with very positive sentiment (polarity above 0.9)."""
    return x.polarity > 0.9


extra_sfs = [
    keyword_subscribe,
    keyword_please,
    regex_check_out,
    short_comment,
    has_person_nlp,
    textblob_polarity,
]

sfs = [short_link] + extra_sfs

In [23]:
applier = PandasSFApplier(sfs)
S_train = applier.apply(df_train)
S_valid = applier.apply(df_valid)
S_test = applier.apply(df_test)

100%|██████████| 1586/1586 [00:14<00:00, 107.07it/s]
100%|██████████| 120/120 [00:01<00:00, 104.08it/s]
100%|██████████| 250/250 [00:02<00:00, 100.32it/s]


In [24]:
slice_names = [sf.name for sf in sfs]
add_slice_labels(dl_train, spam_task, S_train, slice_names)
add_slice_labels(dl_valid, spam_task, S_valid, slice_names)
add_slice_labels(dl_test, spam_task, S_test, slice_names)

As we saw above, we can visualize the examples in a slice, this time for `textblob_polarity`.

In [25]:
pd_slicer = PandasSlicer(df_valid)
polarity_df = pd_slicer.slice(textblob_polarity)
polarity_df[["text", "label"]].head()

100%|██████████| 120/120 [00:00<00:00, 22599.64it/s]


Unnamed: 0,text,label
16,Love this song !!!!!!,0
309,One of the best song of all the time﻿,0
164,She is perfect,0
310,Best world cup offical song﻿,0
352,I remember this :D,0


### Representation learning with slices

To cope with scale, we will attempt to learn and combine many slice-specific representations with an attention mechanism (for more, please see our technical report — coming soon!).
Using the helper `convert_to_slice_tasks`, we get a list of slice tasks whose `task_flow` is constructed to do just that!

In [26]:
from snorkel.slicing.utils import convert_to_slice_tasks

slice_tasks = convert_to_slice_tasks(spam_task, slice_names)
slice_tasks

[Task(name=spam_task_slice:short_link_ind),
 Task(name=spam_task_slice:keyword_subscribe_ind),
 Task(name=spam_task_slice:keyword_please_ind),
 Task(name=spam_task_slice:regex_check_out_ind),
 Task(name=spam_task_slice:short_comment_ind),
 Task(name=spam_task_slice:has_person_nlp_ind),
 Task(name=spam_task_slice:textblob_polarity_ind),
 Task(name=spam_task_slice:base_ind),
 Task(name=spam_task_slice:short_link_pred),
 Task(name=spam_task_slice:keyword_subscribe_pred),
 Task(name=spam_task_slice:keyword_please_pred),
 Task(name=spam_task_slice:regex_check_out_pred),
 Task(name=spam_task_slice:short_comment_pred),
 Task(name=spam_task_slice:has_person_nlp_pred),
 Task(name=spam_task_slice:textblob_polarity_pred),
 Task(name=spam_task_slice:base_pred),
 Task(name=spam_task)]
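
To give a feel for what "combining representations with attention" means, the sketch below takes one representation vector per slice, turns per-slice relevance scores into weights with a softmax, and returns the weighted sum. This is a generic attention computation written under our own assumptions. The tutorial does not specify how Snorkel derives the scores or shapes the representations, and `softmax`, `attend`, and all toy values here are hypothetical.

```python
import math


def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # subtract max for stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def attend(slice_reprs, scores):
    """Weighted sum of per-slice representation vectors."""
    weights = softmax(scores)
    dim = len(slice_reprs[0])
    return [
        sum(w * rep[d] for w, rep in zip(weights, slice_reprs))
        for d in range(dim)
    ]


# Three hypothetical slice representations for one example, with scores
# expressing how relevant each slice is to that example.
toy_reprs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_scores = [2.0, 2.0, -50.0]
combined = attend(toy_reprs, toy_scores)
```

In this toy case the third slice receives a near-zero weight, so the combined vector is essentially the average of the first two representations.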

In [27]:
slice_model = SnorkelClassifier(slice_tasks)

We train this model, and note that we can monitor slice-specific performance during training.
This is a powerful way to track especially critical subsets of the data.

_Note: This model includes more parameters (corresponding to additional slices) — we only train for 1 epoch here for demonstration purposes._

In [28]:
trainer = Trainer(n_epochs=1, lr=1e-4, progress_bar=True)
trainer.fit(slice_model, [dl_train, dl_valid])

Epoch 0:: 100%|██████████| 50/50 [01:15<00:00,  1.60s/it, model/all/train/loss=0.531, model/all/train/lr=0.0001, spam_task/spam_dataset/valid/accuracy=0.883, spam_task/spam_dataset/valid/f1=0.86, spam_task_slice:short_link_ind/spam_dataset/valid/f1=0, spam_task_slice:short_link_pred/spam_dataset/valid/accuracy=0, spam_task_slice:short_link_pred/spam_dataset/valid/f1=0, spam_task_slice:base_ind/spam_dataset/valid/f1=1, spam_task_slice:base_pred/spam_dataset/valid/accuracy=0.883, spam_task_slice:base_pred/spam_dataset/valid/f1=0.857, spam_task_slice:keyword_subscribe_ind/spam_dataset/valid/f1=0, spam_task_slice:keyword_subscribe_pred/spam_dataset/valid/accuracy=1, spam_task_slice:keyword_subscribe_pred/spam_dataset/valid/f1=1, spam_task_slice:keyword_please_ind/spam_dataset/valid/f1=0, spam_task_slice:keyword_please_pred/spam_dataset/valid/accuracy=1, spam_task_slice:keyword_please_pred/spam_dataset/valid/f1=1, spam_task_slice:regex_check_ou

At inference time, the primary task head (`spam_task`) will be making all final predictions.
We'd like to evaluate all the slice heads on the original task head.
To do this, we use our `remap_labels` API, as we did earlier.
Note that this time, we map each `ind` head to `None` — it doesn't make sense to evaluate these labels on the base task head.

In [29]:
Y_dict = dl_valid.dataset.Y_dict
eval_mapping = {label: "spam_task" for label in Y_dict.keys() if "pred" in label}
eval_mapping.update({label: None for label in Y_dict.keys() if "ind" in label})

_Note: in this toy dataset, we might not see significant gains because slices are defined for demo purposes.
The dataset's slices contain only a few examples, so slice-level scores are not reliable evaluation metrics. For a demonstration of data slicing deployed in state-of-the-art models, please see our [SuperGLUE](https://github.com/HazyResearch/snorkel-superglue) tutorials._

In [30]:
slice_model.score([dl_valid], remap_labels=eval_mapping, as_dataframe=True)



Unnamed: 0,label,dataset,split,metric,score
0,spam_task,spam_dataset,valid,accuracy,0.883333
1,spam_task,spam_dataset,valid,f1,0.86
2,spam_task_slice:short_link_pred,spam_dataset,valid,accuracy,0.4
3,spam_task_slice:short_link_pred,spam_dataset,valid,f1,0.571429
4,spam_task_slice:base_pred,spam_dataset,valid,accuracy,0.883333
5,spam_task_slice:base_pred,spam_dataset,valid,f1,0.86
6,spam_task_slice:keyword_subscribe_pred,spam_dataset,valid,accuracy,0.9
7,spam_task_slice:keyword_subscribe_pred,spam_dataset,valid,f1,0.947368
8,spam_task_slice:keyword_please_pred,spam_dataset,valid,accuracy,0.888889
9,spam_task_slice:keyword_please_pred,spam_dataset,valid,f1,0.941176


You've just defined slicing functions to monitor specific slices and improved slice-specific performance!
Our technical report, coming soon, will cover the technical details of our modeling approach.