# Snorkel Intro Tutorial: Data Slicing
Traditional machine learning systems optimize for overall quality, which may be too coarse-grained. Models that achieve high overall performance might produce unacceptable failure rates on critical slices of the data.

*Note:* this tutorial differs from the labeling tutorial in that we use ground truth labels in the train split for demo purposes. Slicing functions (SFs) are intended to be used after the training set has already been labeled by labeling functions (LFs) or by hand in the training data pipeline.

In [1]:
from utils import load_spam_dataset

df_train, df_test = load_spam_dataset(load_train_labels=True)

## 1. Writing slicing functions

We monitor slices by writing **slicing functions (SFs)**, which output binary masks indicating whether a data point is in the slice or not. Each slice represents some noisily-defined subset of the data (corresponding to an SF) that we'd like to programmatically monitor.

In [2]:
import re

from snorkel.slicing import slicing_function

@slicing_function()
def short_comment(x):
  # HAM comments are often short, such as 'cool video!'
  return len(x.text.split()) < 5

sfs = [short_comment]
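
To make the idea of a binary mask concrete, here is a minimal plain-Python sketch that does not depend on Snorkel. The `SimpleNamespace` objects and the sample comments are hypothetical stand-ins for dataframe rows, and `short_comment_plain` is an illustrative analogue of the SF above, not part of the library.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for data points; each only needs a `.text` attribute.
comments = [
    SimpleNamespace(text="cool video!"),
    SimpleNamespace(text="please subscribe to my channel and check out my new video"),
    SimpleNamespace(text="Nice"),
]


def short_comment_plain(x):
    """Plain-Python analogue of the short_comment SF: True if under 5 tokens."""
    return len(x.text.split()) < 5


# One entry per data point: 1 if the point is in the slice, 0 otherwise.
mask = [int(short_comment_plain(x)) for x in comments]
print(mask)
```

Each additional SF would contribute one more mask of the same length, so a set of SFs can be thought of as a matrix with one row per data point and one column per slice.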

### Visualize slices

In [3]:
from snorkel.slicing import slice_dataframe

short_comment_df = slice_dataframe(df_test, short_comment)

100%|██████████| 250/250 [00:00<00:00, 65307.42it/s]


In [4]:
short_comment_df[["text", "label"]].head()

| | text | label |
|---|---|---|
| 194 | super music | 0 |
| 2 | I like shakira.. | 0 |
| 110 | subscribe to my feed | 1 |
| 263 | Awesome | 0 |
| 77 | Nice | 0 |


## 2. Monitor slice performance with `Scorer.score_slices`

### Train a simple classifier

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from utils import df_to_features

vectorizer = CountVectorizer(ngram_range=(1, 1))
X_train, Y_train = df_to_features(vectorizer, df_train, "train")
X_test, Y_test = df_to_features(vectorizer, df_test, "test")

In [6]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(C=0.001, solver="liblinear")
sklearn_model.fit(X=X_train, y=Y_train)

LogisticRegression(C=0.001, solver='liblinear')

In [8]:
from snorkel.utils import preds_to_probs

preds_test = sklearn_model.predict(X_test)
probs_test = preds_to_probs(preds_test, 2)

In [9]:
from sklearn.metrics import f1_score

print(f"Test set F1: {100 * f1_score(Y_test, preds_test):.1f}%")

Test set F1: 92.5%


### Store slice metadata in `S`

In [10]:
from snorkel.slicing import PandasSFApplier

applier = PandasSFApplier(sfs)
S_test = applier.apply(df_test)

100%|██████████| 250/250 [00:00<00:00, 89852.27it/s]


In [11]:
from snorkel.analysis import Scorer

scorer = Scorer(metrics=["f1"])

In [12]:
scorer.score_slices(
  S=S_test,
  golds=Y_test,
  preds=preds_test,
  probs=probs_test,
  as_dataframe=True
)

| | f1 |
|---|---|
| overall | 0.925 |
| short_comment | 0.666667 |


Despite high overall performance, the `short_comment` slice performs poorly here!
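
Conceptually, scoring a slice amounts to restricting the gold labels and predictions to the rows selected by that slice's mask and computing the metric on the subset. The following is a rough plain-Python sketch of that idea, not Snorkel's implementation; the helper names `f1` and `score_by_slice` and the toy labels are hypothetical.

```python
def f1(golds, preds, positive=1):
    """Binary F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(golds, preds) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(golds, preds) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(golds, preds) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def score_by_slice(masks, golds, preds):
    """Compute F1 overall and on the subset of rows selected by each mask."""
    scores = {"overall": f1(golds, preds)}
    for name, mask in masks.items():
        idx = [i for i, in_slice in enumerate(mask) if in_slice]
        scores[name] = f1([golds[i] for i in idx], [preds[i] for i in idx])
    return scores


# Toy data (hypothetical): errors are concentrated in the slice.
golds = [1, 1, 0, 0, 1, 1, 0, 1]
preds = [1, 1, 0, 0, 1, 0, 1, 1]
masks = {"short_comment": [0, 0, 0, 0, 0, 1, 1, 1]}

scores = score_by_slice(masks, golds, preds)
print(scores)
```

In this toy example the two mistakes both fall inside the slice, so the slice score is noticeably lower than the overall score, which mirrors the pattern in the table above.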

### Writing additional slicing functions (SFs)

Slices are dynamic: as monitoring needs grow or change with new data distributions or application needs, an ML pipeline might require dozens, or even hundreds, of slices.


In [13]:
from snorkel.slicing import SlicingFunction, slicing_function
from snorkel.preprocess import preprocessor


# Keyword-based SFs
def keyword_lookup(x, keywords):
  return any(word in x.text.lower() for word in keywords)

def make_keyword_sf(keywords):
  return SlicingFunction(
    name=f"keyword_{keywords[0]}",
    f=keyword_lookup,
    resources=dict(keywords=keywords)
  )

keyword_please = make_keyword_sf(keywords=["please", "plz"])

In [14]:
# Regex-based SFs
@slicing_function()
def regex_check_out(x):
  return bool(re.search(r"check.*out", x.text, flags=re.I))


In [15]:
@slicing_function()
def short_link(x):
  return bool(re.search(r"\w+\.ly", x.text))
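
Before wrapping a pattern in an SF, it can help to sanity-check it on a few strings. The sketch below reuses the two regular expressions from the SFs above with only the standard library; the sample strings are hypothetical.

```python
import re

# Same patterns as in regex_check_out and short_link above.
CHECK_OUT = re.compile(r"check.*out", flags=re.I)
SHORT_LINK = re.compile(r"\w+\.ly")

# Hypothetical sample comments.
samples = [
    "Check out my channel",
    "please CHECK this video OUT now",
    "see bit.ly/abc123",
    "I really love this song",
]

check_out_hits = [bool(CHECK_OUT.search(s)) for s in samples]
short_link_hits = [bool(SHORT_LINK.search(s)) for s in samples]

for s, c, l in zip(samples, check_out_hits, short_link_hits):
    print(f"{s!r}: check_out={c}, short_link={l}")
```

Note that the escaped dot in `\w+\.ly` requires a literal period, so a word such as "really" does not match on its own.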


In [16]:
from textblob import TextBlob

@preprocessor(memoize=True)
def textblob_sentiment(x):
  scores = TextBlob(x.text)
  x.polarity = scores.sentiment.polarity
  return x

@slicing_function(pre=[textblob_sentiment])
def textblob_polarity(x):
  return x.polarity > 0.9

In [17]:
polarity_df = slice_dataframe(df_test, textblob_polarity)

100%|██████████| 250/250 [00:00<00:00, 2965.84it/s]


In [18]:
polarity_df[["text", "label"]].head()

| | text | label |
|---|---|---|
| 263 | Awesome | 0 |
| 240 | Shakira is the best dancer | 0 |
| 261 | OMG LISTEN TO THIS ITS SOO GOOD!! :D | 0 |
| 14 | Shakira is very beautiful | 0 |
| 114 | awesome | 0 |


In [19]:
extra_sfs = [
  keyword_please,
  regex_check_out,
  short_link,
  textblob_polarity
]

sfs = [short_comment] + extra_sfs
slice_names = [sf.name for sf in sfs]

In [20]:
applier = PandasSFApplier(sfs)
S_test = applier.apply(df_test)

100%|██████████| 250/250 [00:00<00:00, 25784.44it/s]


In [21]:
scorer.score_slices(
    S=S_test, 
    golds=Y_test, 
    preds=preds_test, 
    probs=probs_test, 
    as_dataframe=True
)

| | f1 |
|---|---|
| overall | 0.925 |
| short_comment | 0.666667 |
| keyword_please | 1.0 |
| regex_check_out | 1.0 |
| short_link | 0.5 |
| textblob_polarity | 0.727273 |


## 3. Improve slice performance
This section demonstrates a modeling approach that we call **Slice-based Learning**, which improves performance by adding extra slice-specific representational capacity to whichever model we're using. Intuitively, we'd like the model to learn representations that are better suited to handle data points in *specific slices*. The approach is to model each slice as a separate "expert task" in the style of multi-task learning.

In other approaches, one might attempt to increase slice performance with techniques like *oversampling* (e.g., with PyTorch's `WeightedRandomSampler`), effectively shifting the training distribution towards certain populations.

This might work with a small number of slices, but with hundreds or thousands of production slices at scale, it could quickly become intractable to tune upsampling weights per slice.
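
To illustrate why per-slice oversampling gets hard to manage, here is a small sketch using only the standard library. The slice masks, the per-slice factors in `slice_weights`, and the rule of taking the largest factor among a point's slices are all hypothetical choices made for illustration; the resulting per-example weights are the kind of list one would hand to a weighted sampler.

```python
import random

# Hypothetical slice membership for six training examples (1 = in slice).
slice_masks = {
    "short_comment": [1, 0, 0, 1, 0, 0],
    "short_link": [0, 0, 1, 0, 0, 0],
}

# Hypothetical upsampling factors: one hand-tuned knob per slice.
slice_weights = {"short_comment": 3.0, "short_link": 5.0}


def example_weights(slice_masks, slice_weights, n_examples):
    """Give each example the largest factor among the slices it belongs to."""
    weights = [1.0] * n_examples
    for name, mask in slice_masks.items():
        for i, in_slice in enumerate(mask):
            if in_slice:
                weights[i] = max(weights[i], slice_weights[name])
    return weights


weights = example_weights(slice_masks, slice_weights, n_examples=6)
print(weights)

# Draw a weighted sample with replacement to see the shifted distribution.
rng = random.Random(0)
sampled = rng.choices(range(6), weights=weights, k=1000)
print({i: sampled.count(i) for i in range(6)})
```

With two slices there are two factors to tune; with hundreds of overlapping slices, both the number of knobs and the rule for combining them become design problems of their own.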

### Constructing a `SliceAwareClassifier`

To cope with scale, we will attempt to learn and combine many slice-specific representations with an attention mechanism.

First, initialize a `SliceAwareClassifier` with the following arguments:
- `base_architecture`: Defines a simple Multi-Layer Perceptron (MLP) in PyTorch to serve as the primary representation architecture. Note that the `SliceAwareClassifier` is **agnostic to the base architecture**: it could equally use a transformer model for text, or a ResNet for images.
- `head_dim`: Identifies the final output feature dimension of the `base_architecture`.
- `slice_names`: Specifies the slices that we plan to train on with this classifier.
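
The idea of combining slice-specific representations with attention can be sketched in a few lines of plain Python. This is a generic illustration under simplifying assumptions, not the computation inside `SliceAwareClassifier`: the function names, the toy vectors, and the use of a single relevance score per slice are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(slice_reprs, slice_scores):
    """Attention-weighted sum of per-slice representation vectors."""
    attn = softmax(slice_scores)
    dim = len(slice_reprs[0])
    combined = [
        sum(a * r[d] for a, r in zip(attn, slice_reprs)) for d in range(dim)
    ]
    return combined, attn


# Toy per-slice representations for one data point (hypothetical values).
slice_reprs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical relevance score per slice, e.g. how likely the point is in it.
slice_scores = [2.0, 0.0, 0.0]

combined, attn = combine(slice_reprs, slice_scores)
print(attn)
print(combined)
```

The slice with the highest score dominates the combined vector, while the others still contribute a little, which is the behavior we want from a soft combination of "experts".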

In [25]:
from snorkel.slicing import SliceAwareClassifier
from utils import get_pytorch_mlp

# Define model architecture
bow_dim = X_train.shape[1]
hidden_dim = bow_dim
mlp = get_pytorch_mlp(hidden_dim=hidden_dim, num_layers=2)

# Initialize slice model
slice_model = SliceAwareClassifier(
  base_architecture=mlp,
  head_dim=hidden_dim,
  slice_names=[sf.name for sf in sfs],
  scorer=scorer
)

In [26]:
applier = PandasSFApplier(sfs)
S_train = applier.apply(df_train)
S_test = applier.apply(df_test)

100%|██████████| 1586/1586 [00:00<00:00, 3854.60it/s]
100%|██████████| 250/250 [00:00<00:00, 27118.11it/s]


To train using slice information, we'd like to initialize a **slice-aware dataloader**. We can use `slice_model.make_slice_dataloader` to add slice labels to an existing dataloader.

Under the hood, this method leverages slice metadata to add slice labels to the appropriate fields such that it will be compatible with our model, a `SliceAwareClassifier`.
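
As a rough mental model of what "adding slice labels" could look like, the sketch below builds a label dictionary in plain Python. It is not Snorkel's code: the `_ind` naming, the `-1` marker for points outside a slice, and the list-of-rows layout for `S` are assumptions made for illustration (only the `spam_task_slice:{slice_name}_pred` naming appears in this tutorial).

```python
ABSTAIN = -1  # assumed marker for "not in this slice, ignore for this head"


def add_slice_labels(labels, S, slice_names, base_task="spam_task"):
    """Build a dict of label lists: the base task plus ind/pred labels per slice."""
    Y = {base_task: list(labels)}
    for j, name in enumerate(slice_names):
        # Indicator labels: is the data point in the slice?
        ind = [int(row[j]) for row in S]
        # Predictor labels: the task label inside the slice, ABSTAIN outside it.
        pred = [y if in_slice else ABSTAIN for y, in_slice in zip(labels, ind)]
        Y[f"{base_task}_slice:{name}_ind"] = ind
        Y[f"{base_task}_slice:{name}_pred"] = pred
    return Y


# Toy labels and slice matrix (hypothetical): rows are points, columns are slices.
labels = [1, 0, 1, 0]
S = [[1, 0], [0, 0], [0, 1], [1, 1]]

Y = add_slice_labels(labels, S, ["short_comment", "short_link"])
for key, values in Y.items():
    print(key, values)
```

Under these assumptions, each slice gets two label sets: one for learning slice membership and one for learning the task on in-slice points only.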

In [28]:
from utils import create_dict_dataloader

BATCH_SIZE = 64

train_dl = create_dict_dataloader(X_train, Y_train, "train")
train_dl_slice = slice_model.make_slice_dataloader(
  train_dl.dataset, S_train, shuffle=True, batch_size=BATCH_SIZE
)

test_dl = create_dict_dataloader(X_test, Y_test, "test")
test_dl_slice = slice_model.make_slice_dataloader(
  test_dl.dataset, S_test, shuffle=False, batch_size=BATCH_SIZE
)

### Representation learning with slices
Using Snorkel's `Trainer`, we fit our classifier with the training set dataloader.

In [29]:
from snorkel.classification import Trainer

# For demonstration purposes, we set n_epochs=2
trainer = Trainer(n_epochs=2, lr=1e-4, progress_bar=True)
trainer.fit(slice_model, [train_dl_slice])

Epoch 0:: 100%|██████████| 25/25 [00:06<00:00,  3.98it/s, model/all/train/loss=0.504, model/all/train/lr=0.0001]
Epoch 1:: 100%|██████████| 25/25 [00:06<00:00,  4.05it/s, model/all/train/loss=0.269, model/all/train/lr=0.0001]


At inference time, the primary task head (`spam_task`) will make all final predictions. We'd like to evaluate all the slice heads on the original task head: `score_slices` remaps all slice-related labels, denoted `spam_task_slice:{slice_name}_pred`, to be evaluated on the `spam_task`.
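
The remapping described above can be pictured as a simple dictionary from label names to the task head that should score them. The sketch below is a hypothetical illustration in plain Python, not the library's code: the helper `build_eval_mapping`, the `_ind` label name, and the convention of mapping a label to `None` to skip it are assumptions.

```python
def build_eval_mapping(label_names, base_task="spam_task"):
    """Map slice prediction labels to the base task; mark other slice labels as skipped."""
    mapping = {}
    for name in label_names:
        if "_slice:" not in name:
            continue  # the base task's own labels need no remapping
        if name.endswith("_pred"):
            mapping[name] = base_task  # score these labels with the base task head
        elif name.endswith("_ind"):
            mapping[name] = None  # assumed convention: do not evaluate
    return mapping


# Hypothetical label names for a single slice.
label_names = [
    "spam_task",
    "spam_task_slice:short_comment_ind",
    "spam_task_slice:short_comment_pred",
]

mapping = build_eval_mapping(label_names)
print(mapping)
```

Because the slice prediction labels only cover in-slice points, scoring them with the base task head yields a per-slice metric for the model's final predictions.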

## Extra Reading

- [Multi-task learning](https://github.com/snorkel-team/snorkel-tutorials/blob/master/multitask/multitask_tutorial.ipynb)
- [Slice-based Learning Code Example](https://github.com/snorkel-team/snorkel/blob/master/snorkel/slicing/utils.py)
- [Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices](https://arxiv.org/abs/1909.06349)