# YouTube Spam Classification Task with Babble

### For this task, you will work with comments from five YouTube videos and classify each comment as either spam or legitimate by writing labeling explanations with Babble Labble.

Spam can be defined as irrelevant or unsolicited messages sent over the Internet. 


## The Data

The data is obtained [from Kaggle](https://www.kaggle.com/goneee/youtube-spam-classifiedcomments).

Load the data:

In [None]:
from data.preparer import load_youtube_dataset

DELIMITER = "#"
df_train, df_dev, df_valid, df_test = load_youtube_dataset(delimiter=DELIMITER)
print("{} training examples".format(len(df_train)))
print("{} development examples".format(len(df_dev)))
print("{} validation examples".format(len(df_valid)))
print("{} test examples".format(len(df_test)))

In [None]:
# Define labels (0 is reserved for abstaining)
ABSTAIN = 0
NOT_SPAM = 1
SPAM = 2

Transform the data into a format compatible with Babble Labble:

In [None]:
from babble.Candidate import Candidate # this is a helper class to transform our data into a format Babble can parse

dfs = [df_train, df_dev, df_test]

for df in dfs:
    df["id"] = range(len(df))

Cs = [df.apply(lambda x: Candidate(x), axis=1) for df in dfs]

# Babble Labble uses 1 and 2 for labels, while our data uses 0 and 1;
# add 1 to convert.
Ys = [df.label.values + 1 for df in dfs]
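
To make the conversion concrete, here is a minimal sketch in plain Python. The `raw_labels` list is made-up illustration data, and the sketch assumes the raw data encodes legitimate comments as 0 and spam as 1, as the label constants above suggest.

```python
# Label constants as defined earlier in this notebook.
ABSTAIN = 0
NOT_SPAM = 1
SPAM = 2

# Hypothetical raw labels (assumed: 0 = legitimate, 1 = spam).
raw_labels = [0, 1, 1, 0]

# Shift by one so that 0 stays free to mean "abstain".
converted = [y + 1 for y in raw_labels]

print(converted)  # [1, 2, 2, 1]
```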

In [None]:
from babble import BabbleStream
from babble import Explanation


aliases = {}
babbler = BabbleStream(Cs, Ys, balanced=True, shuffled=True, seed=456, aliases=aliases)

In [None]:
def prettyprint(candidate):
    # just a helper function to print the candidate nicely
    print(candidate.text)

Let's see an example candidate!

In [None]:
candidate = babbler.next()
prettyprint(candidate)


Your task is to __create 10 labeling functions__ by writing natural language descriptions of labeling rules. Try to write them as quickly and accurately as possible.

You may consult the internet at any time.

## Create Explanations

Creating explanations generally happens in five steps:
1. View candidates
2. Write explanations
3. Get feedback
4. Update explanations 
5. Apply label aggregator

Steps 3-5 are optional; explanations may be submitted without any feedback on their quality. However, in our experience, checking how well explanations are parsed and what accuracy and coverage they achieve on a dev set (if one is available) quickly leads to simple improvements that yield significantly more useful labeling functions. Once you have collected a few labeling functions, you can use the label aggregator to identify candidates that are being mislabeled and write additional explanations targeting those failure modes.
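
As a rough guide to what accuracy and coverage mean for a single labeling function, the sketch below computes both from a list of votes and a list of gold labels. It is plain Python written for illustration, not the code Babble Labble or MeTaL uses; `coverage_and_accuracy` and the sample data are hypothetical, and it assumes 0 means "abstain" as in the label constants above.

```python
ABSTAIN = 0

def coverage_and_accuracy(votes, gold):
    """Return (coverage, accuracy) for one labeling function.

    coverage: fraction of candidates on which the function did not abstain.
    accuracy: fraction of non-abstaining votes that match the gold label
              (None if the function abstained everywhere).
    """
    labeled = [(v, g) for v, g in zip(votes, gold) if v != ABSTAIN]
    coverage = len(labeled) / len(votes) if votes else 0.0
    if not labeled:
        return coverage, None
    correct = sum(1 for v, g in labeled if v == g)
    return coverage, correct / len(labeled)

# Hypothetical votes on five dev candidates (1 = not spam, 2 = spam).
votes = [2, 0, 2, 0, 1]
gold = [2, 1, 1, 2, 1]
cov, acc = coverage_and_accuracy(votes, gold)
print(cov, acc)
```

Here the function labels three of five candidates and gets two of those three right, so a rule like this one covers a fair share of the data but would benefit from a tighter condition.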

### Collection

Use `babbler` to show candidates:

In [None]:
candidate = babbler.next()
prettyprint(candidate)

If you are not sure whether a comment is spam, make your best guess or skip the example.
For a candidate you decide to label, write an explanation of why you chose that label.

You can consult the internet or refer to the babble tutorial notebook.
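
To give a sense of what an explanation turns into, the sketch below hand-writes a labeling function for a rule such as "label spam because the comment contains the word subscribe". This is illustrative only: `lf_contains_subscribe` is a hypothetical name, it takes a raw string rather than a Babble `Candidate`, and Babble Labble generates its own functions from your natural language condition rather than from code like this.

```python
ABSTAIN = 0
NOT_SPAM = 1
SPAM = 2

def lf_contains_subscribe(text):
    """Vote SPAM if the comment mentions 'subscribe'; otherwise abstain."""
    if "subscribe" in text.lower():
        return SPAM
    return ABSTAIN

print(lf_contains_subscribe("Please SUBSCRIBE to my channel!"))  # 2
print(lf_contains_subscribe("Great song, love it"))              # 0
```

Note that the function abstains rather than voting `NOT_SPAM` when the rule does not fire: the absence of one spam signal says little about whether a comment is legitimate.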

In [None]:
e0 = Explanation(
    # feel free to change the name to something that describes your rule better.
    name = "e0", 
    label = ABSTAIN, 
    condition = "", 
    # Remember that this argument (candidate) is optional.
    # You can use it to make sure the explanation applies to the candidate you pass as an argument.
    candidate = candidate.mention_id 
)

In [None]:
e1 = Explanation(
    name = "e1", 
    label = ABSTAIN, 
    condition = "", 
    candidate = candidate.mention_id 
)

In [None]:
e2 = Explanation(
    name = "e2", 
    label = ABSTAIN, 
    condition = "", 
    candidate = candidate.mention_id 
)

In [None]:
e3 = Explanation(
    name = "e3", 
    label = ABSTAIN, 
    condition = "", 
    candidate = candidate.mention_id 
)

In [None]:
e4 = Explanation(
    name = "e4", 
    label = ABSTAIN, 
    condition = "", 
    candidate = candidate.mention_id 
)

In [None]:
e5 = Explanation(
    name = "e5", 
    label = ABSTAIN, 
    condition = "", 
    candidate = candidate.mention_id 
)

In [None]:
e6 = Explanation(
    name = "e6", 
    label = ABSTAIN, 
    condition = "", 
    candidate = candidate.mention_id 
)

In [None]:
e7 = Explanation(
    name = "e7", 
    label = ABSTAIN, 
    condition = "", 
    candidate = candidate.mention_id 
)

In [None]:
e8 = Explanation(
    name = "e8", 
    label = ABSTAIN, 
    condition = "", 
    candidate = candidate.mention_id 
)

In [None]:
e9 = Explanation(
    name = "e9", 
    label = ABSTAIN, 
    condition = "", 
    candidate = candidate.mention_id 
)

In [None]:
explanations = [e0, e1, e2, e3, e4, e5, e6, e7, e8, e9]

Babble will parse your explanations into functions, then filter out functions that are duplicates, incorrectly label their given candidate, or assign the same label to all examples.
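
The three filters described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not Babble Labble's implementation: `filter_lfs`, `anchors`, and the sample functions are hypothetical, and two functions are treated as duplicates when they produce identical outputs on the example set.

```python
ABSTAIN = 0
SPAM = 2

def filter_lfs(lfs, examples, anchors):
    """Keep labeling functions that pass three simple checks.

    lfs: list of (name, function, intended_label) tuples.
    examples: list of texts used to compare function behavior.
    anchors: dict mapping name -> the text the explanation was written for.
    """
    kept, seen_signatures = [], set()
    for name, fn, label in lfs:
        # 1. The function must assign the intended label to its own candidate.
        if fn(anchors[name]) != label:
            continue
        signature = tuple(fn(x) for x in examples)
        # 2. The function must not give the same output on every example.
        if len(set(signature)) <= 1:
            continue
        # 3. The function must not behave identically to one already kept.
        if signature in seen_signatures:
            continue
        seen_signatures.add(signature)
        kept.append(name)
    return kept

examples = ["subscribe to me", "nice video", "check out my channel"]
lfs = [
    ("sub", lambda t: SPAM if "subscribe" in t else ABSTAIN, SPAM),
    ("sub_copy", lambda t: SPAM if "subscribe" in t else ABSTAIN, SPAM),
    ("always", lambda t: SPAM, SPAM),
    ("wrong", lambda t: SPAM if "channel" in t else ABSTAIN, SPAM),
]
anchors = {
    "sub": "subscribe to me",
    "sub_copy": "subscribe to me",
    "always": "nice video",
    "wrong": "nice video",
}
print(filter_lfs(lfs, examples, anchors))  # ['sub']
```

In this toy run, `sub_copy` is dropped as a duplicate, `always` is dropped for labeling every example the same way, and `wrong` is dropped because it does not label the candidate it was written for.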

In [None]:
parses, filtered = babbler.apply(explanations)

### Analysis
See how your explanations were parsed and filtered:

In [None]:
babbler.analyze(parses)

In [None]:
babbler.filtered_analysis(filtered)

In [None]:
babbler.commit()

### Evaluation
Get feedback on the performance of your explanations:

In [None]:
from metal.analysis import lf_summary

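# Label matrices for the splits in the order passed to BabbleStream: train (0), dev (1), test (2)
Ls = [babbler.get_label_matrix(split) for split in [0, 1, 2]]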
lf_names = [lf.__name__ for lf in babbler.get_lfs()]
lf_summary(Ls[1], Ys[1], lf_names=lf_names)

In [None]:
from metal import LabelModel
from metal.tuners import RandomSearchTuner

search_space = {
    'n_epochs': [50, 100, 500],
    'lr': {'range': [0.01, 0.001], 'scale': 'log'},
    'show_plots': False,
}

tuner = RandomSearchTuner(LabelModel, seed=123)

label_aggregator = tuner.search(
    search_space, 
    train_args=[Ls[0]], 
    X_dev=Ls[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')
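
For intuition about what a label aggregator does, the sketch below combines labeling-function votes with a simple majority vote. This is a deliberately simpler baseline, not the `LabelModel` trained above, which instead learns how much to trust each function. The `majority_vote` helper and the sample matrix are hypothetical, and 0 is assumed to mean "abstain".

```python
from collections import Counter

ABSTAIN = 0

def majority_vote(label_matrix):
    """Combine labeling-function votes row by row.

    label_matrix: list of rows, one per candidate; each row holds one vote
    per labeling function. Returns ABSTAIN when no function votes or when
    the top two labels are tied.
    """
    predictions = []
    for row in label_matrix:
        counts = Counter(v for v in row if v != ABSTAIN)
        if not counts:
            predictions.append(ABSTAIN)
            continue
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            predictions.append(ABSTAIN)  # tie between the top two labels
        else:
            predictions.append(ranked[0][0])
    return predictions

L = [
    [2, 2, 0],  # two spam votes
    [1, 2, 1],  # two not-spam votes, one spam vote
    [0, 0, 0],  # every function abstains
    [1, 2, 0],  # tie
]
print(majority_vote(L))  # [2, 1, 0, 0]
```

Candidates where the aggregated label disagrees with the dev label, or where the vote is tied, are good places to look when writing additional explanations.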

If you'd like to save the explanations you've generated, you can use the `ExplanationIO` object to write them to a file or read them back.

In [None]:
from babble.utils import ExplanationIO

FILE = "babbler_youtube_explanations.tsv"
exp_io = ExplanationIO()
exp_io.write(explanations, FILE)
explanations = exp_io.read(FILE)