# YouTube Spam Classification Task

### For this task, you will work with comments from 5 different YouTube videos, and classify comments as either spam (1) or legitimate comments (0) by writing labeling functions.

Spam can be defined as irrelevant or unsolicited messages sent over the Internet.

First, import necessary libraries:

In [None]:
from data.preparer import load_youtube_dataset
from datetime import datetime
from snorkel.labeling import filter_unlabeled_dataframe
from snorkel.labeling import labeling_function
from snorkel.labeling import LabelModel
from snorkel.labeling import LFAnalysis
from snorkel.labeling import PandasLFApplier
from analyzer import train_model
import re
import pandas as pd
# None disables column-width truncation; -1 is deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)
stat_history = pd.DataFrame()

## The Data

The data was obtained [from Kaggle](https://www.kaggle.com/goneee/youtube-spam-classifiedcomments).

Load the data:

In [None]:
DELIMITER = "#"
df_train, df_dev, df_valid, df_test = load_youtube_dataset(delimiter=DELIMITER)
print("{} training examples".format(len(df_train)))
print("{} development examples".format(len(df_dev)))
print("{} validation examples".format(len(df_valid)))
print("{} test examples".format(len(df_test)))

Define variable names for the labels in this task:

In [None]:
#define labels
ABSTAIN = -1
NOT_SPAM = 0
SPAM = 1

Let's look at some spam (positive) and legitimate (negative) examples:

In [None]:
print("Some labeled examples: ")
display(df_dev[df_dev.label==NOT_SPAM].sample(5))
display(df_dev[df_dev.label==SPAM].sample(5))

## Writing Labeling Functions
Time to write some labeling functions! 

Your task is to __create 10 labeling functions__ that take the text of a comment as input and output one of three labels: SPAM, NOT_SPAM, or ABSTAIN. Try to write them as quickly and accurately as possible.

You may consult the internet at any time.
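
As a starting point, here is a minimal sketch of what such heuristics might look like. It runs without Snorkel: each function takes an object with a `text` attribute, which is what the empty `lf0` to `lf9` templates receive. The function names, the keyword list, and the five-word threshold are assumptions made up for illustration, not rules derived from this dataset.

```python
import re
from types import SimpleNamespace

# Label constants, matching the values defined earlier in this notebook.
ABSTAIN = -1
NOT_SPAM = 0
SPAM = 1

# Hypothetical keyword list, chosen for illustration only.
SPAM_KEYWORDS = ("subscribe", "check out", "my channel")


def keyword_spam(x):
    """Vote SPAM if the comment contains a promotional phrase, else abstain."""
    text = x.text.lower()
    return SPAM if any(k in text for k in SPAM_KEYWORDS) else ABSTAIN


def contains_url(x):
    """Vote SPAM if the comment contains something that looks like a link."""
    match = re.search(r"https?://|www\.", x.text, flags=re.IGNORECASE)
    return SPAM if match else ABSTAIN


def short_comment(x):
    """Vote NOT_SPAM for comments under five words (an assumed heuristic)."""
    return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN


example = SimpleNamespace(text="Please check out my channel at http://example.com")
print(keyword_spam(example), contains_url(example), short_comment(example))
```

To use one of these in the notebook, move its body into one of the `lf0` to `lf9` cells so that it keeps the `@labeling_function()` decorator.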

In [None]:
@labeling_function()
def lf0(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf1(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf2(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf3(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf4(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf5(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf6(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf7(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf8(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf9(x):
    return ABSTAIN

In [None]:
lfs = [lf0, lf1, lf2, lf3, lf4, lf5, lf6, lf7, lf8, lf9]

## (Optional) Test your function

In [None]:
from types import SimpleNamespace

def test_func(lf, example_text):
    x = SimpleNamespace(text=example_text)
    return lf(x)

In [None]:
test_func(lf0, "your text here")

## Applying Functions
To obtain training labels, we apply the labeling functions to the data and then train a label model that combines their noisy outputs into a single label per example.
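
To build intuition for what "combining" means, the sketch below uses a plain majority vote over each row of labeling function outputs. This is a deliberately simpler baseline, not what `LabelModel` does: the label model additionally estimates how reliable each labeling function is. The `majority_vote` helper and the toy matrix `L_toy` are hypothetical names used only for this illustration.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(row):
    """Return the most common non-abstain label in one row of LF outputs.

    Returns ABSTAIN when every LF abstained or when the top labels are tied.
    """
    votes = Counter(label for label in row if label != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Toy label matrix: one row per comment, one column per labeling function.
L_toy = [
    [1, 1, -1],    # two SPAM votes
    [0, -1, -1],   # one NOT_SPAM vote
    [1, 0, -1],    # conflicting votes, tie
    [-1, -1, -1],  # no LF fired
]
print([majority_vote(row) for row in L_toy])  # [1, 0, -1, -1]
```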

In [None]:
# Apply the LFs to the unlabeled training data, and the development data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

In [None]:
# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df_train["label"] = label_model.predict_proba(L=L_train)

# record intermediate results
# Don't worry about this code block, we just store some metrics to keep track of your progress.
Y_dev = df_dev.label.values
stats = label_model.score(L=L_dev, Y=Y_dev, metrics=["f1", "precision", "recall"])
probs_train = df_train["label"]
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
            X=df_train, y=probs_train, L=L_train)
stats["training_label_coverage"] = len(probs_train_filtered)/len(probs_train)
stats["training_label_size"] = len(probs_train_filtered)
stats["time"] = datetime.now()
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
stat_history = pd.concat([stat_history, pd.DataFrame([stats])], ignore_index=True)


# let's see some examples of aggregated (probabilistic) labels!
display(df_train.sample(5))

## View Unlabeled Examples
You can use these to brainstorm new labeling functions. You may try filtering or sorting them in other ways.

If you get a `ValueError: a must be greater than 0 unless no samples are taken`, this means all your training examples are labeled by at least one LF.

In [None]:
# You can filter for unlabeled data
df_unlabeled = df_train[~df_train.index.isin(df_train_filtered.index)]
display(df_unlabeled.sample(5))

## Analyze Results
Evaluate the accuracy of the estimated training labels and development set labels (based on ground truth).
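
The summary tables include, among other columns, each labeling function's coverage and empirical accuracy. The sketch below shows how those two quantities can be computed by hand from a label matrix and reference labels. It illustrates the definitions and is not Snorkel's implementation; `lf_stats`, `L_toy`, and `Y_toy` are hypothetical names.

```python
ABSTAIN = -1


def lf_stats(L, Y):
    """Compute coverage and empirical accuracy for each column of L.

    L is a list of rows (one per example, one entry per LF);
    Y holds the reference label for each example.
    """
    n_examples = len(L)
    n_lfs = len(L[0])
    stats = []
    for j in range(n_lfs):
        # Keep only the examples on which LF j did not abstain.
        votes = [(row[j], y) for row, y in zip(L, Y) if row[j] != ABSTAIN]
        coverage = len(votes) / n_examples
        correct = sum(1 for vote, y in votes if vote == y)
        # Accuracy is undefined for an LF that never votes.
        accuracy = correct / len(votes) if votes else None
        stats.append({"lf": j, "coverage": coverage, "emp_acc": accuracy})
    return stats


L_toy = [[1, -1], [1, 0], [0, -1], [-1, -1]]
Y_toy = [1, 0, 0, 1]
for row in lf_stats(L_toy, Y_toy):
    print(row)
```

A labeling function with high coverage but low empirical accuracy is usually a good candidate for tightening; one with high accuracy but very low coverage may be worth generalizing.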

In [None]:
Y_train = df_train.label.values
train_analysis = LFAnalysis(L=L_train, lfs=lfs).lf_summary(Y=Y_train)
display("Training set results:", train_analysis)

In [None]:
Y_dev = df_dev.label.values
dev_analysis = LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)
display("Dev set results:", dev_analysis)

## Save the Model
When you have finished, save the label model and the statistics history.

In [None]:
label_model.save("snorkel_youtube_lfmodel.pkl")
stat_history.to_csv("snorkel_youtube_statistics_history.csv")

## Train Model
Train a simple bag of words model on these labels, and report test accuracy.

In [None]:
train_model(label_model, df_train, df_valid, df_test, L_train)
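
For readers unfamiliar with the term, a bag-of-words representation turns each comment into a vector of token counts and ignores word order. The sketch below shows the idea in plain Python. It is an assumption about the general technique only: the internals of the `train_model` helper are not shown in this notebook, and `build_vocabulary` and `bag_of_words` are hypothetical names.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocab):
    """Return a count vector for text; tokens outside vocab are ignored."""
    counts = Counter(text.lower().split())
    return [counts.get(tok, 0) for tok in vocab]


vocab = build_vocabulary(["great song", "subscribe subscribe now"])
print(vocab)  # {'great': 0, 'now': 1, 'song': 2, 'subscribe': 3}
print(bag_of_words("Subscribe now subscribe please", vocab))  # [0, 1, 0, 2]
```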