# Introductory Snorkel Tutorial: Spam Detection

* Nice introductory text
* Purpose of this tutorial...
* Steps:
    1. Load data
    2. Write labeling functions (LFs)
    3. Combine with Label Model
    4. Predict with Classifier

### Task: Spam Detection

* Here's what we're trying to do
* Here's where the data came from (cite properly)
* Show sample T and F in markdown

### Data Splits in Snorkel

* 4 splits: train, dev, valid, test
* train is large and unlabeled
* valid/test are labeled, and you don't look at them while writing LFs
* best to come up with LFs while looking at data. Options:
    * look at train for ideas (it has no labels, but that's no problem)
    * label a small subset of train (e.g., 200 examples) and call it "dev"
    * in a pinch, use the valid set as dev (note, though, that valid will then no longer be a good representative of test)
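
The second option above (carving a small dev set out of train) can be sketched roughly as follows. This is a minimal illustration in plain Python, not what `load_spam_dataset` does internally; `split_off_dev`, the seed, and the toy rows are hypothetical names and values chosen for the example.

```python
import random

def split_off_dev(train_rows, dev_size=200, seed=123):
    """Move `dev_size` randomly chosen rows out of train into a dev set for hand-labeling."""
    rng = random.Random(seed)
    dev_idx = set(rng.sample(range(len(train_rows)), dev_size))
    dev = [row for i, row in enumerate(train_rows) if i in dev_idx]
    train = [row for i, row in enumerate(train_rows) if i not in dev_idx]
    return train, dev

# Toy stand-in for the unlabeled training comments (hypothetical data)
rows = [{"CONTENT": f"comment {i}"} for i in range(1000)]
train_rows, dev_rows = split_off_dev(rows, dev_size=200)
print(len(train_rows), len(dev_rows))
```

Fixing the seed keeps the dev set stable across runs, so the labels you add by hand stay attached to the same examples.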

## 1. Load data

* Start by loading data
* utility pulls from internet, re-splits, and shuffles
* for this application, train is videos 1-4, valid/test are video 5

In [11]:
from utils import load_spam_dataset

df_train, df_dev, df_valid, df_test = load_spam_dataset()

* Describe fields

In [12]:
df_train.sample(5, random_state=1)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,LABEL,VIDEO_ID
394,z12shxbbulncyx43n23psbzohvumz1nib04,Will Smith,2014-12-22T04:56:24.257000,Check out this playlist on YouTube:pl﻿,1,3
266,z12jzbkwfyuzdr1gg04cgzjqdpfgtnm5l04,Chris4chan,2015-05-26T02:55:34.984000,This video is kinda close to 1 million views <br />﻿,2,4
358,z12wjzc4eprnvja4304cgbbizuved35wxcs,Dakota Taylor,2015-05-29T02:13:07.810000,Cool﻿,2,4
434,z13lfpkzyzvoynqvi234g5ix3taoefr21,Roham 11,2015-05-22T20:42:29.523000,Strong messages in every song I&#39;ve heard.﻿,2,3
249,z12mhbmbhlv5jbi1n231chnqtrmkjrenc,MrSlowGhost,2014-10-21T15:30:56,It should be illegal to be this goodlooking as this babe is...﻿,2,2


## 2. Write Labeling Functions (LFs)

* What's an LF
    * Why are they awesome
* Can be many types:
    * keyword
    * pattern-match
    * heuristic
    * third-party models
    * distant supervision
    * crowdworkers (non-expert)

* Look at 10 examples; got any ideas?

In [13]:
import pandas as pd

# Don't truncate text fields in the display
pd.set_option('display.max_colwidth', 0)

# Display just the text and label
df_dev[["CONTENT", "LABEL"]].sample(10, random_state=123)

Unnamed: 0,CONTENT,LABEL
159,"You guys should check out this EXTRAORDINARY website called ZONEPA.COM . You can make money online and start working from home today as I am! I am making over $3,000+ per month at ZONEPA.COM ! Visit Zonepa.com and check it out! Why does the answer rehabilitate the blushing limit? The push depreciateds the steel. How does the beautiful selection edit the range?",1
133,"I'm sorry Katy Perry, I was being weird. I still love you &lt;3﻿",2
234,plz subscribe to my channel i need subs and if you do i will sub back i need help﻿,1
193,Is that tiger called 'Katty Purry'?﻿,2
218,"Check out this video on YouTube: <a rel=""nofollow"" class=""ot-hashtag"" href=""https://plus.google.com/s/%23Eminem"">#Eminem</a> <a rel=""nofollow"" class=""ot-hashtag"" href=""https://plus.google.com/s/%23Lovethewayyoulie"">#Lovethewayyoulie</a> <a rel=""nofollow"" class=""ot-hashtag"" href=""https://plus.google.com/s/%23RapGod"">#RapGod</a> <a rel=""nofollow"" class=""ot-hashtag"" href=""https://plus.google.com/s/%23King"">#King</a> ﻿",1
6,nice ..very nice﻿,2
390,ayyy can u guys please check out my rap video im 16 n im juss tryna get some love please chrck it out an thank u,1
171,Dance :)﻿,2
175,e.e....everyone could check out my channel.. dundundunnn,1
155,I think this is now a place to promote channels in the comment section lol.﻿,2


The simplest way to create labeling functions in Snorkel is with the `@labeling_function()` decorator, which wraps a function for evaluating on a single `DataPoint` (in this case, a row of the dataframe).

Looking at samples of our data, we see multiple messages where spammers are trying to get viewers to look at "my channel" or "my video," so we write a simple LF that labels an example as spam if its text contains the string "my". Note that this is a substring match, so words such as "myself" also trigger it.

In [14]:
from snorkel.labeling.lf import labeling_function

# For clarity, we'll define constants to represent the class labels for spam, ham, and abstaining.
ABSTAIN = 0
SPAM = 1
HAM = 2

# We initialize an empty list that we'll add our LFs to as we create them
lfs = []

@labeling_function()
def keywords_my(x):
    return SPAM if 'my' in x.CONTENT.lower() else ABSTAIN

lfs.append(keywords_my)

To apply one or more LFs that we've written to a collection of `DataPoints`, we use an `LFApplier`.

Because our `DataPoints` are represented with a Pandas dataframe in this tutorial, we use the `PandasLFApplier` class.

In [15]:
from snorkel.labeling.apply import PandasLFApplier

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)

100%|██████████| 1586/1586 [00:00<00:00, 32982.62it/s]


The output of the `apply()` method is a sparse label matrix which we generally refer to as `L`.

In [16]:
L_train

<1586x1 sparse matrix of type '<class 'numpy.int64'>'
	with 315 stored elements in Compressed Sparse Row format>

We can calculate the coverage of this LF (i.e., the fraction of the dataset that it labels) as follows:

In [17]:
coverage = L_train.nnz / L_train.shape[0]
print(f"Coverage: {coverage}")

Coverage: 0.19861286254728877


To get an estimate of its accuracy, we can label the development set with it and compare that to the few gold labels we do have.

In [18]:
import numpy as np

L_dev = applier.apply(df_dev)

# We don't want to penalize the LF for examples where it abstained, so we
# compare predictions with gold labels only where the prediction is not ABSTAIN.
L_dev_array = np.asarray(L_dev.todense()).squeeze()
Y_dev_array = df_dev["LABEL"].values
labeled = L_dev_array != ABSTAIN
accuracy = (L_dev_array[labeled] == Y_dev_array[labeled]).mean()
print(f"Accuracy: {accuracy}")

100%|██████████| 200/200 [00:00<00:00, 26998.19it/s]

Accuracy: 0.9090909090909091





Alternatively, you can use the helper function `lf_summary` to report the following summary statistics:
* Polarity: The set of labels this LF outputs
* Coverage: The fraction of the dataset the LF labels
* Overlaps: The fraction of the dataset where this LF and at least one other LF both assign a label
* Conflicts: The fraction of the dataset where this LF and at least one other LF both assign a label and disagree
* Correct: The number of `DataPoints` this LF labels correctly (if gold labels are provided)
* Incorrect: The number of `DataPoints` this LF labels incorrectly (if gold labels are provided)
* Emp. Acc.: The empirical accuracy of this LF (if gold labels are provided)

In [19]:
from snorkel.labeling.analysis import lf_summary

lf_names = [lf.name for lf in lfs]
lf_summary(L=L_dev, Y=Y_dev_array, lf_names=lf_names)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
keywords_my,0,[1],0.22,0.0,0.0,40,4,0.909091
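
To make the definitions of coverage, overlaps, and conflicts concrete, here is a minimal pure-Python sketch that computes them from a small dense label matrix. It is not Snorkel's implementation; `lf_stats` and the toy matrix are hypothetical, and the code assumes the same convention as above, where `0` means the LF abstained.

```python
ABSTAIN = 0  # same convention as the constants defined earlier

def lf_stats(L):
    """Per-LF coverage, overlaps, and conflicts for a dense label matrix.

    L is a list of rows; L[i][j] is the label LF j assigned to data point i.
    """
    n, m = len(L), len(L[0])
    stats = []
    for j in range(m):
        covered = overlaps = conflicts = 0
        for row in L:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Labels assigned to this data point by the other LFs
            others = [row[k] for k in range(m) if k != j and row[k] != ABSTAIN]
            overlaps += bool(others)
            conflicts += any(label != row[j] for label in others)
        stats.append(
            {"coverage": covered / n, "overlaps": overlaps / n, "conflicts": conflicts / n}
        )
    return stats

# Toy label matrix: 4 data points, 2 LFs (hypothetical values)
L_toy = [
    [1, 0],
    [1, 1],
    [1, 2],
    [0, 2],
]
for j, s in enumerate(lf_stats(L_toy)):
    print(j, s)
```

With a single LF, as in the summary above, overlaps and conflicts are necessarily zero; they become informative once more LFs are added.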


This LF is fairly accurate, but it only labels a fraction of the dataset.
If we want to do well on our test set, we'll need more LFs.

In the following subsections, we'll show just a few of the many types of LFs that you could write to generate a training dataset for this problem.

### i. Keyword LFs

* Keywords
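
As a rough sketch of what keyword LFs could look like, the functions below return a label when any of a list of keywords appears in the comment. They are written as plain functions so the snippet runs on its own; in the notebook each would be wrapped with the `@labeling_function()` decorator as shown earlier. The keyword lists and the helper `keyword_lookup` are illustrative assumptions, not choices taken from the tutorial.

```python
from types import SimpleNamespace

ABSTAIN, SPAM, HAM = 0, 1, 2  # same label convention as above

def keyword_lookup(x, keywords, label):
    """Return `label` if any keyword occurs in the comment text, else abstain."""
    text = x.CONTENT.lower()
    return label if any(kw in text for kw in keywords) else ABSTAIN

def keywords_subscribe(x):
    # Hypothetical spam keywords
    return keyword_lookup(x, ["subscribe", "my channel"], SPAM)

def keywords_song(x):
    # Hypothetical ham keyword: comments that talk about the song itself
    return keyword_lookup(x, ["song"], HAM)

# Stand-ins for dataframe rows (any object with a CONTENT attribute works)
spam_like = SimpleNamespace(CONTENT="plz subscribe to my channel")
ham_like = SimpleNamespace(CONTENT="Strong messages in every song")
print(keywords_subscribe(spam_like), keywords_song(ham_like), keywords_song(spam_like))
```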

### ii. Pattern-matching LFs

* Regexes
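
A pattern-matching LF might be sketched with the standard `re` module as below. Both patterns are assumptions made for illustration (the "check ... out" phrasing and link-like text both show up in the spam samples above), and the functions are left undecorated so the snippet is self-contained.

```python
import re
from types import SimpleNamespace

ABSTAIN, SPAM, HAM = 0, 1, 2  # same label convention as above

# "check out", "check it out", "check this out", ... (hypothetical pattern)
CHECK_OUT = re.compile(r"check.{0,20}out", flags=re.IGNORECASE)
# Text that looks like a link (hypothetical pattern)
HAS_LINK = re.compile(r"https?://|\bwww\.|\.com\b", flags=re.IGNORECASE)

def regex_check_out(x):
    return SPAM if CHECK_OUT.search(x.CONTENT) else ABSTAIN

def regex_has_link(x):
    return SPAM if HAS_LINK.search(x.CONTENT) else ABSTAIN

promo = SimpleNamespace(CONTENT="Visit Zonepa.com and check it out!")
plain = SimpleNamespace(CONTENT="nice ..very nice")
print(regex_check_out(promo), regex_has_link(promo), regex_check_out(plain))
```

Bounding the gap with `.{0,20}` keeps the pattern from matching a "check" and an "out" that are far apart in a long comment.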

### iii. Heuristic LFs

* Length, early comma, etc.
* spaCy (preprocessor)
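
One possible length heuristic is sketched below: very short comments such as "Cool" or "Dance :)" in the samples above tend to be ham. The threshold of five words is an arbitrary assumption, and the spaCy-based preprocessing mentioned in the outline is not covered by this snippet.

```python
from types import SimpleNamespace

ABSTAIN, SPAM, HAM = 0, 1, 2  # same label convention as above

def short_comment(x, max_words=5):
    """Label very short comments as ham; abstain on everything else."""
    return HAM if len(x.CONTENT.split()) <= max_words else ABSTAIN

short = SimpleNamespace(CONTENT="nice ..very nice")
long_ = SimpleNamespace(
    CONTENT="plz subscribe to my channel i need subs and if you do i will sub back"
)
print(short_comment(short), short_comment(long_))
```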

### iv. Third-party Model LFs

* Sentiment classifier (preprocessor)
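
The general shape of a model-backed LF can be sketched as follows: an existing model scores the text, and the LF turns that score into a label or abstains. Everything here is an assumption for illustration. `toy_polarity` is a tiny hypothetical word-list scorer standing in for a real third-party sentiment model, and the rule "clearly positive comments are ham" and its threshold are guesses, not results from the tutorial.

```python
from types import SimpleNamespace

ABSTAIN, SPAM, HAM = 0, 1, 2  # same label convention as above

def make_sentiment_lf(polarity_fn, threshold=0.5):
    """Build an LF from any function mapping text to a polarity score in [-1, 1]."""
    def sentiment_positive(x):
        return HAM if polarity_fn(x.CONTENT) > threshold else ABSTAIN
    return sentiment_positive

# Hypothetical stand-in for a third-party sentiment model
POSITIVE = {"love", "nice", "cool", "great"}
NEGATIVE = {"hate", "bad", "awful"}

def toy_polarity(text):
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

sentiment_lf = make_sentiment_lf(toy_polarity)
fan = SimpleNamespace(CONTENT="I'm sorry Katy Perry, I still love you")
promo = SimpleNamespace(CONTENT="everyone could check out my channel")
print(sentiment_lf(fan), sentiment_lf(promo))
```

Passing the scorer in as an argument keeps the LF independent of any particular model, so the stand-in can be swapped for a real classifier without changing the labeling logic.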

### v. Write your own LFs

* Make a stub
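
A stub to start from could look like the following. The name `my_lf` is a placeholder; in the notebook it would be decorated with `@labeling_function()` and appended to `lfs` like the earlier example.

```python
from types import SimpleNamespace

ABSTAIN, SPAM, HAM = 0, 1, 2  # same label convention as above

def my_lf(x):
    """Template LF: abstains on everything until you add your own rule."""
    # TODO: inspect x.CONTENT (or other fields) and return SPAM or HAM
    # when your rule applies; keep returning ABSTAIN otherwise.
    return ABSTAIN

print(my_lf(SimpleNamespace(CONTENT="any comment")))
```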

## 3. Combine with Label Model

* Pretty much copy prose from Spouse tutorial

* Run LabelModel, get probabilities
    * Note: no labels are required or used
* Look at probabilities (histogram)
* What if we used this directly as a classifier? (score)
    * Why we expect classifier we train to generalize better
    * Look - we're randomly guessing on XX% of the data

* Can also compare to majority vote (MV)
    * Does worse
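
For reference, a majority-vote baseline over LF outputs can be sketched in a few lines. This is a plain-Python illustration, not Snorkel's implementation; it assumes `0` means abstain, and it abstains on ties and on rows where no LF voted instead of guessing randomly. The toy matrix is hypothetical.

```python
from collections import Counter

ABSTAIN = 0  # same convention as above

def majority_vote(row):
    """Most common non-abstain label in a row; ABSTAIN on ties or when every LF abstains."""
    votes = Counter(label for label in row if label != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

# Toy label matrix: 4 data points, 3 LFs (hypothetical values)
L_toy = [
    [1, 1, 2],
    [0, 0, 0],
    [1, 2, 0],
    [2, 0, 2],
]
preds = [majority_vote(row) for row in L_toy]
no_signal = sum(p == ABSTAIN for p in preds) / len(preds)
print(preds, no_signal)
```

The `no_signal` fraction is the share of data points for which majority vote has nothing to go on, which is where a trained classifier is expected to help most.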

## 4. Predict with Classifier

* Now train classifier
    * Can use any third-party classifier (plug into your existing pipelines!)
    * Some libraries natively support probabilistic labels (Snorkel, TensorFlow); for others, round to hard labels.
* Use bag-of-ngrams as features
* [Train TF logreg w/ soft labels]
* Score; see, we do better!
* Also demonstrate sklearn logreg with hard labels (end model agnostic)
* Compare with training on dev directly (see, we did better)
    * And we could do even better with more raw unlabeled data
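
Two of the steps above, building bag-of-ngrams features and rounding probabilistic labels to hard labels, can be sketched in plain Python as follows. The function names are hypothetical, whitespace tokenization is a simplification, and the assumption that the probability columns are ordered (SPAM, HAM) is mine; a real pipeline would more likely use a library vectorizer and classifier.

```python
from collections import Counter

def ngram_counts(text, n_max=2):
    """Bag-of-ngrams: counts of all word 1-grams up to n_max-grams."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def round_to_hard_labels(probs, classes=(1, 2)):
    """Turn rows of class probabilities into hard labels by taking the argmax.

    Assumes column k of each row holds the probability of classes[k].
    """
    return [classes[max(range(len(p)), key=p.__getitem__)] for p in probs]

features = ngram_counts("check out my channel")
hard = round_to_hard_labels([[0.9, 0.1], [0.3, 0.7]])
print(sorted(features), hard)
```

Rounding discards the label model's confidence, which is why classifiers that accept probabilistic labels directly are preferable when available.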