# YouTube Spam Classification Task

### For this task, you will work with comments from 5 different YouTube videos and classify each comment as either spam (1) or legitimate (0) by writing labeling functions.

Spam can be defined as irrelevant or unsolicited messages sent over the Internet.

First, import the necessary libraries:

In [None]:
from data.preparer import load_youtube_dataset
from snorkel.labeling import labeling_function
from snorkel.labeling import LabelModel
from snorkel.labeling import LFAnalysis
from snorkel.labeling import PandasLFApplier
from analyzer import train_model
import re
import pandas as pd
pd.set_option('display.max_colwidth', -1)

## The Data

The data is available [via Kaggle](https://www.kaggle.com/goneee/youtube-spam-classifiedcomments). You may download it there, or, if you have the password, unzip the data below.

You must replace `PASSWORD` with the password to unzip the data.

In [None]:
!unzip -P PASSWORD data/data.zip

In [None]:
DELIMITER = "#"
df_train, df_dev, df_valid, df_test = load_youtube_dataset(delimiter=DELIMITER)
print("{} training examples".format(len(df_train)))
print("{} development examples".format(len(df_dev)))
print("{} validation examples".format(len(df_valid)))
print("{} test examples".format(len(df_test)))

Define variable names for the labels in this task:

In [None]:
# Define labels
ABSTAIN = -1
NOT_SPAM = 0
SPAM = 1

In [None]:
print("Some labeled examples: ")
display(df_dev[df_dev.label==NOT_SPAM].sample(5))
display(df_dev[df_dev.label==SPAM].sample(5))


## Writing Labeling Functions
Time to write some labeling functions! Below is an example. Be sure to add each function you write to the list `lfs`.

In [None]:
lfs = []

In [None]:
@labeling_function()
def my_first_labeling_function(x):
    return SPAM if "my" in x.text.lower() else ABSTAIN

lfs.append(my_first_labeling_function)
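A few more heuristics may help with brainstorming. The sketch below is only illustrative: the function names (`contains_link`, `asks_to_subscribe`, `short_comment`), the URL pattern, the 5-word threshold, and the sample comments are all assumptions, not part of the task. The functions are written as plain Python so they can be tried without the dataset; each one reads `x.text` just like the example above.

```python
import re
from types import SimpleNamespace

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

# Matches common link prefixes; an illustrative pattern, not an exhaustive URL matcher.
URL_PATTERN = re.compile(r"https?://|www\.", flags=re.IGNORECASE)

def contains_link(x):
    """Vote SPAM when the comment contains something that looks like a URL."""
    return SPAM if URL_PATTERN.search(x.text) else ABSTAIN

def asks_to_subscribe(x):
    """Vote SPAM when the comment mentions subscribing."""
    return SPAM if "subscribe" in x.text.lower() else ABSTAIN

def short_comment(x):
    """Vote NOT_SPAM for very short comments (fewer than 5 words)."""
    return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN

# Made-up comments for a quick sanity check (not taken from the dataset).
samples = [
    SimpleNamespace(text="Check out my channel at http://example.com and subscribe please"),
    SimpleNamespace(text="love this song"),
    SimpleNamespace(text="I remember hearing this on the radio every single morning"),
]
heuristics = [contains_link, asks_to_subscribe, short_comment]

# One row of votes per comment, one column per heuristic.
votes = [[h(x) for h in heuristics] for x in samples]
for x, row in zip(samples, votes):
    print(row, x.text)
```

To use any of these in the notebook, place `@labeling_function()` above the definition and append the function to `lfs`, as done for `my_first_labeling_function`.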

## Applying Functions
To obtain training labels, apply the labeling functions to the data, then train a label model that combines their noisy outputs into a single label per comment.

In [None]:
# Apply the LFs to the unlabeled training data, and the development data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

In [None]:
# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df_train["label"] = label_model.predict(L=L_train, tie_break_policy="abstain")
display(df_train.sample(5))
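To build intuition for what "combining" labeling function outputs means, the sketch below uses a plain majority vote. This is a deliberately simpler stand-in and not how `LabelModel` works internally: the label model estimates how reliable each labeling function is instead of counting every vote equally. The helper `majority_vote` and the matrix `L_example` are hypothetical names introduced for illustration.

```python
from collections import Counter

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def majority_vote(row):
    """Combine one comment's votes; abstain when there are no votes or a tie."""
    counts = Counter(v for v in row if v != ABSTAIN)
    if not counts:
        return ABSTAIN  # every labeling function abstained
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the two most frequent labels
    return ranked[0][0]

# Hypothetical label matrix: one row per comment, one column per labeling function.
L_example = [
    [SPAM, SPAM, ABSTAIN],
    [ABSTAIN, ABSTAIN, NOT_SPAM],
    [SPAM, NOT_SPAM, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
combined = [majority_vote(row) for row in L_example]
print(combined)
```

Rows where the votes tie or where nothing fires come out as `ABSTAIN`, which is the same value the notebook later filters on to find unlabeled comments.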

## View Unlabeled Examples
You can use these to brainstorm new labeling functions. You may try filtering or sorting them in other ways.

In [None]:
# You can filter for unlabeled data
df_unlabeled = df_train[df_train.label == ABSTAIN]
display(df_unlabeled.sample(5))

## Analyze Results
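Summarize how each labeling function performs. On the training set, the summary is computed against the label model's estimated labels (stored in `df_train.label` above); on the development set, it is computed against the ground-truth labels.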

In [None]:
# df_train.label holds the label model's predictions, not ground truth
Y_train = df_train.label.values
train_analysis = LFAnalysis(L=L_train, lfs=lfs).lf_summary(Y=Y_train)
display("Training set results:", train_analysis)

In [None]:
Y_dev = df_dev.label.values
dev_analysis = LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)
display("Dev set results:", dev_analysis)
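Two of the quantities worth watching in these summaries are coverage (how often a labeling function votes at all) and empirical accuracy (how often its votes match the reference labels). The sketch below computes both by hand for a single labeling function, assuming the usual definitions; the helper names and the toy vote and label lists are hypothetical and are not taken from the dataset or from Snorkel's implementation.

```python
ABSTAIN = -1

def coverage(votes):
    """Fraction of comments on which the labeling function does not abstain."""
    return sum(v != ABSTAIN for v in votes) / len(votes)

def empirical_accuracy(votes, gold):
    """Fraction of non-abstaining votes that agree with the reference labels."""
    labeled = [(v, y) for v, y in zip(votes, gold) if v != ABSTAIN]
    if not labeled:
        return None  # accuracy is undefined when the function never votes
    return sum(v == y for v, y in labeled) / len(labeled)

# Toy data: votes from one labeling function and the matching reference labels.
lf_votes = [1, -1, 1, 1, -1]
gold = [1, 0, 0, 1, 1]

cov = coverage(lf_votes)
acc = empirical_accuracy(lf_votes, gold)
print("coverage:", cov)
print("empirical accuracy:", acc)
```

A labeling function with high accuracy but very low coverage labels few comments, while one with high coverage but low accuracy adds noise, so both numbers are worth checking before adding a function to `lfs`.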

## Train Model
Train a simple bag-of-words model on these labels and report its test accuracy.

In [None]:
train_model(label_model, df_train, df_valid, df_test, L_train)