# Babble Tutorial

In this notebook, we'll walk through how to create your own explanations that can be fed into Babble Labble.

Creating explanations generally happens in five steps:
1. View candidates
2. Write explanations
3. Get feedback
4. Update explanations 
5. Apply label aggregator

Steps 3-5 are optional; explanations may be submitted without any feedback on their quality. However, in our experience, observing how well explanations are parsed and measuring their accuracy and coverage on a dev set (if available) can quickly lead to simple improvements that yield significantly more useful labeling functions. Once a few labeling functions have been collected, you can use the label aggregator to identify candidates that are being mislabeled and write additional explanations targeting those failure modes.

We'll walk through each of the steps individually with examples; at the end of the notebook is an area for you to iterate with your own explanations.

## Step 0: Setup

Once again, we need to first load the data (candidates and labels) from the pickle.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pickle

DATA_FILE = 'babble/tutorial_data.pkl'
with open(DATA_FILE, 'rb') as f:
    Cs, Ys = pickle.load(f)

In [None]:
print(vars(Cs[0][0]))

Our data is now divided into three splits (80/10/10), which we'll refer to as the training, dev (development), and test splits. In these tutorials, we will do the bulk of our analysis on the dev split to protect the integrity of the held-out test set.

The variables `Cs` and `Ys` are each lists of length 3, corresponding to the three splits; each C is a list of candidates, and each Y is a numpy array of gold labels. Our labels are categorical (1=True, 2=False).
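
As a quick sanity check on this layout, the sketch below counts candidates and gold labels per split. It runs on small made-up stand-ins for `Cs` and `Ys` (the toy values and the helper name `summarize_splits` are assumptions for illustration, not part of Babble Labble); the same function should work on the real lists, since it only iterates over them.

```python
from collections import Counter

# Toy stand-ins for the tutorial's Cs and Ys (hypothetical data, not the real pickle).
Cs = [["c%d" % i for i in range(8)], ["d0"], ["t0"]]
Ys = [[1, 2, 2, 2, 1, 2, 2, 2], [2], [1]]

SPLIT_NAMES = ["train", "dev", "test"]

def summarize_splits(Cs, Ys):
    """Return (split name, number of candidates, label counts) for each split."""
    summary = []
    for name, C, Y in zip(SPLIT_NAMES, Cs, Ys):
        # Every candidate should have exactly one gold label.
        assert len(C) == len(Y), "candidates and labels are misaligned"
        counts = Counter(int(y) for y in Y)
        summary.append((name, len(C), dict(counts)))
    return summary

for name, n, counts in summarize_splits(Cs, Ys):
    print(name, n, counts)
```

A strongly skewed label count in the training split is one reason a stream of candidates might be balanced before you start writing explanations.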

## Step 1: View Candidates

We've combined most of the steps required for writing explanations into a single class for convenience: the `BabbleStream`. This will allow you to view candidates, submit explanations, analyze the resulting parses, save explanations that you're satisfied with, and generate label matrices from the parses you've saved so far. (The `Babbler` class seen in Tutorial 1 is simply a subclass of `BabbleStream` that submits explanations as a batch and commits them immediately, for non-iterative workflows.)

In [None]:
#!source activate babble
#!source babble/add_to_path.sh
from babble import BabbleStream

In [None]:
babbler = BabbleStream(Cs, Ys, balanced=True, shuffled=True, seed=321)

Now that the `BabbleStream` has been initialized, we can run the cell below repeatedly to iterate through candidates for labeling. Some candidates will prove very difficult to give explanations for; **feel free to skip these**! The number of unlabeled candidates is often orders of magnitude larger than the number of explanations we need, so we can afford to skip the tricky ones.

Since many explanations end up referring to distances between words, each candidate will be displayed in two ways: as a list of tokens, and as a single string. In both cases, curly brackets have been placed around the entities; these are shown for your convenience only and are not actually a part of the raw text.
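
To make the two views concrete, here is a minimal sketch of how entity markers could be added to a token list and then joined into a string. The helper `mark_entities` and the span format (start and end token indices, end exclusive) are hypothetical; they are not the implementation behind `display_candidate`.

```python
def mark_entities(tokens, x_span, y_span):
    """Return a copy of tokens with curly brackets around two entity spans.

    x_span and y_span are (start, end) token indices, with end exclusive.
    """
    marked = []
    for i, tok in enumerate(tokens):
        if i == x_span[0] or i == y_span[0]:
            tok = "{" + tok          # opening bracket on the first token of an entity
        if i == x_span[1] - 1 or i == y_span[1] - 1:
            tok = tok + "}"          # closing bracket on the last token of an entity
        marked.append(tok)
    return marked

tokens = ["Ann", "Lee", "married", "Bob", "in", "May"]
as_tokens = mark_entities(tokens, (0, 2), (3, 4))
as_string = " ".join(as_tokens)

print(as_tokens)
print(as_string)
```

The token view makes it easy to count words between the entities, while the string view is easier to read as a sentence.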

In [None]:
from babble.utils import display_candidate

candidate2 = babbler.next()
display_candidate(candidate2)
candidate2.word_ends

In [None]:
from babble.utils import display_candidate

candidate = babbler.next()
display_candidate(candidate)
candidate.text

## Step 2: Write Explanations

Now, looking at candidates one by one, we can create `Explanation` objects. Each `Explanation` requires 3 things (with an optional 4th):
- A label: An integer (for this task, 1 if X and Y were/are/will soon be married, and 2 otherwise).
- A condition: See below for details.
- A candidate: This will be used by the filter bank inside to check for semantic consistency.
- A name: (Optional) Adding names can be helpful for bookkeeping if you have many explanations floating around. 

The condition should satisfy the following properties:
1. **Complete Sentences**: Form a complete sentence when preceded by "I labeled it \[label\] because..." (i.e., instead of simply the phrase "his wife", it should be a statement like "'his wife' is in the sentence").
2. **X and Y**: Refer to the person who occurs first in the sentence as **X** and the second person as **Y**. (These can be overwritten with custom strings, but for now we'll stick with X and Y).
3. **Valid Primitives**: Utilize primitives supported by the grammar. These include:  
true, false, strings, ints, floats, tuples, lists, sets, and, or, not, any, all, none, =, !=, <, ≤, >, ≥, lowercase, uppercase, capitalized, all caps, starts with, ends with, substring, basic NER tags (person, location, date, number, organization), count, contains, intersection, map, filter, distances in words or characters, relative positions (left/right/between/within).

The rule-based parser is naive, not comprehensive, and can certainly be improved to support more primitives. These are just some of the ones we found to be the most commonly used and easily supported. When tempted to refer to real-world concepts (e.g., the "last name" of X), see if you can capture something similar using the supported primitives (e.g., "the last word of X").
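
To give a feel for what a condition built from these primitives amounts to, the sketch below hand-writes two small functions: one for the condition "the word 'fiance' is in the sentence" and one for the "last word of X" workaround. These are illustrations only, not output of the Babble Labble parser; the function names, the abstain value of 0, and the case-insensitive matching are all assumptions.

```python
ABSTAIN = 0  # assumed abstain value for this sketch

def lf_fiance_in_sentence(sentence_tokens):
    """Return label 1 if "fiance" appears among the tokens, otherwise abstain."""
    lowered = [tok.lower() for tok in sentence_tokens]  # case-insensitive by assumption
    return 1 if "fiance" in lowered else ABSTAIN

def last_word(entity_text):
    """Approximate a "last name" using only a positional primitive."""
    return entity_text.split()[-1]

print(lf_fiance_in_sentence(["Her", "fiance", "Bob", "smiled"]))
print(lf_fiance_in_sentence(["Her", "friend", "Bob", "smiled"]))
print(last_word("Mary Ann Smith"))
```

Note that the function abstains rather than voting for the other label when the condition is false; an explanation only speaks for the candidates it matches.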

In [None]:
from babble import Explanation
explanation = Explanation(
    name='LF_fiance_between',
    label=1,
    condition='The word "fiance" is in the sentence',
    candidate=candidate,
)

When we call `babbler.apply()`, our explanation is parsed into (potentially multiple) parses, which are then passed through the filter bank, removing any that fail. It returns two lists: the parses that passed and the ones that were filtered out.

In [None]:
parses, filtered = babbler.apply(explanation)

You can view a pseudocode translation of your parse using the `view_parse()` method.

In [None]:
babbler.view_parse(parses[0])
print(parses[0].semantics)

At this point, if you're confident in the value of your explanation, you can go ahead and add it to the set of parses to keep by calling `babbler.commit()`. But if you'd like to investigate its quality first, continue on to Step 3.

## Step 3: Get Feedback

If you have a labeled dev set, you can evaluate your resulting parse's performance on that set to get an estimate of what its accuracy and coverage are. You may be surprised at how good/bad/broad/narrow your explanations actually are.

**NOTE:** There is a risk to doing this evaluation, however. The dev set is generally small; be careful not to overfit to it with your explanations! This is especially important if you use the same dev set for explanation validation and hyperparameter tuning.

In [None]:
babbler.analyze(parses)

In this case, we see that our explanation yielded a labeling function that has rather low accuracy (~22%), and low coverage (~1%).
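
For reference, the two numbers can be computed by hand as follows: coverage is the fraction of candidates a labeling function votes on, and accuracy is the fraction of those votes that match the gold label. This is a minimal sketch on made-up votes; the helper name and the abstain value of 0 are assumptions, and `babbler.analyze()` may report additional statistics.

```python
ABSTAIN = 0  # assumed abstain value for this sketch

def coverage_and_accuracy(votes, gold):
    """Return (coverage, accuracy) for one labeling function on a labeled set.

    Accuracy is computed only over candidates where the function did not abstain,
    and is None if it abstained everywhere.
    """
    labeled = [(v, g) for v, g in zip(votes, gold) if v != ABSTAIN]
    coverage = len(labeled) / len(votes)
    if not labeled:
        return coverage, None
    accuracy = sum(v == g for v, g in labeled) / len(labeled)
    return coverage, accuracy

# Made-up votes from one labeling function and the matching gold labels.
votes = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
gold = [1, 2, 2, 2, 1, 2, 2, 2, 2, 1]

coverage, accuracy = coverage_and_accuracy(votes, gold)
print(coverage, accuracy)
```

Because accuracy is measured only on the covered candidates, a function with very low coverage can have a noisy accuracy estimate on a small dev set.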

You can view examples of candidates your parse labeled correctly or incorrectly for ideas. Once the viewer is instantiated, you can simply rerun the cell with `viewer.view()` to move on to the next candidate.

In [None]:
from babble.utils import CandidateViewer

correct, incorrect = babbler.error_buckets(parses[0])
viewer = CandidateViewer(incorrect)

In [None]:
viewer.view()

If you want to see what parses were filtered and why, there's a helper method for that as well. Because of the simplicity of the parser, even some seemingly simple explanations can be parsed incorrectly or fail to yield any valid parses at all. But be warned: in general, we find that time spent analyzing the parser's performance is rarely as productive as time spent simply producing more labeling functions, possibly varying the way you phrase your explanations or the types of signals you refer to.

In [None]:
babbler.filtered_analysis(filtered)

## Step 4: Update Explanations

If an explanation we propose has lower accuracy than we'd like, we can try tightening it up (reducing the number of false positives) by making it more specific. If it has lower coverage than we'd like, one simple way to boost it is to replace keywords with aliases.

As was mentioned in Tutorial 1, aliases are sets of words that can be referred to with a single term. To add aliases to the babbler, we call `babbler.add_aliases` with a dictionary containing key-value pairs corresponding to the name of the alias and the set it refers to.
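
Conceptually, an alias turns a single-keyword check into a check against a whole set of words, which is why it tends to raise coverage. The sketch below illustrates this for a condition like "a spouse word is between X and Y"; the helper `alias_word_between` and the use of token positions for X and Y are hypothetical and are not how the parser represents the condition internally.

```python
# An alias dictionary like the one used for this task.
aliases = {"spouse": ["husband", "wife", "spouse", "bride", "groom", "fiance"]}

def alias_word_between(tokens, x_idx, y_idx, alias, aliases):
    """True if any word in the alias set occurs strictly between two token positions."""
    lo, hi = sorted((x_idx, y_idx))
    between = tokens[lo + 1:hi]
    return any(tok.lower() in aliases[alias] for tok in between)

tokens = ["Ann", "and", "her", "husband", "Bob", "arrived"]

# A single keyword misses this sentence...
keyword_only = {"spouse": ["fiance"]}
print(alias_word_between(tokens, 0, 4, "spouse", keyword_only))

# ...while the broader alias set matches it.
print(alias_word_between(tokens, 0, 4, "spouse", aliases))
```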

In [None]:
babbler.add_aliases({'spouse': ['husband', 'wife', 'spouse', 'bride', 'groom', 'fiance']})

In [None]:
explanation = Explanation(
    name='LF_spouse_between',
    label=1,
    condition='A spouse word is between X and Y',
    candidate=candidate,
)
parses, filtered = babbler.apply(explanation)
babbler.analyze(parses)

We can see that broadening our explanation in this way improved our parse both in coverage and accuracy! We'll go ahead and commit this parse.

In [None]:
babbler.commit()

In an ideal world, our parses would all have both high coverage and high accuracy. In practice, however, there is usually a tradeoff. When in doubt, we give a slight edge to accuracy over coverage, since the discriminative model can help with generalization, but it is unlikely to be much more precise than the model that generated its labels.

## Step 5: Apply Label Aggregator

At any point, we can extract our growing label matrices to view the summary statistics of all the parses we've committed so far.
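
Two statistics that are useful to read off a label matrix are how often a labeling function votes alongside another one (overlap) and how often another function disagrees with it (conflict). The sketch below computes both for one column of a small made-up matrix stored as a list of rows; the helper name, the dense list-of-lists layout, and the abstain value of 0 are assumptions, and the exact columns reported by `lf_summary` may differ.

```python
ABSTAIN = 0  # assumed abstain value for this sketch

def overlap_and_conflict(L, j):
    """Return (overlap, conflict) fractions for labeling function j.

    L is a list of rows, one per candidate, with one vote per labeling function.
    """
    overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue
        # Votes cast by the other labeling functions on this candidate.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return overlaps / len(L), conflicts / len(L)

# Four candidates, three labeling functions (made-up votes).
L = [
    [1, 1, 0],
    [1, 2, 0],
    [0, 2, 2],
    [1, 0, 0],
]
print(overlap_and_conflict(L, 0))
```

Some conflict is expected and even useful: disagreements are part of the signal a label aggregator uses to estimate which functions are more reliable.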

In [None]:
from metal.analysis import lf_summary

Ls = [babbler.get_label_matrix(split) for split in [0,1,2]]
lf_names = [lf.__name__ for lf in babbler.get_lfs()]
lf_summary(Ls[1], Ys[1], lf_names=lf_names)

Once we've committed parses (i.e., labeling functions) to our babbler, we can use them to train the label aggregator to see how we're doing overall.

In [None]:
from metal import LabelModel
from metal.tuners import RandomSearchTuner

search_space = {
    'n_epochs': [50, 100, 500],
    'lr': {'range': [0.01, 0.001], 'scale': 'log'},
    'show_plots': False,
}

tuner = RandomSearchTuner(LabelModel, seed=123)

label_aggregator = tuner.search(
    search_space, 
    train_args=[Ls[0]], 
    X_dev=Ls[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')

It may be somewhat surprising to see how quickly quality improves with the first few labeling functions you submit. But remember: each labeling function you provide results in tens or hundreds of labels, so your effective training set size can actually be growing quite quickly. But as with traditional labels, there will come a point when adding more labeling functions will yield diminishing returns, so it's good to check in on the overall quality of your label aggregator every once in a while.
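
For intuition about what any aggregator does with a label matrix, it can help to compare against a simple majority vote. The sketch below is that baseline only, not the `LabelModel` used above, which learns weights for the labeling functions instead of counting votes equally. The helper name, the list-of-rows layout, and the abstain value of 0 are assumptions.

```python
from collections import Counter

ABSTAIN = 0  # assumed abstain value for this sketch

def majority_vote(L):
    """Return one aggregated label per row of L, abstaining on ties and empty rows."""
    preds = []
    for row in L:
        counts = Counter(v for v in row if v != ABSTAIN)
        if not counts:
            preds.append(ABSTAIN)      # no labeling function voted
            continue
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            preds.append(ABSTAIN)      # top two labels are tied
        else:
            preds.append(ranked[0][0])
    return preds

# Four candidates, three labeling functions (made-up votes).
L = [
    [1, 1, 2],
    [0, 2, 0],
    [1, 2, 0],
    [0, 0, 0],
]
print(majority_vote(L))
```

Rows where the baseline abstains are candidates that no explanation covers or that your explanations disagree on, which makes them natural places to look when writing the next explanation.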

This process of iteratively tweaking 








# Your Turn!

# YouTube Spam Classification Task

For this task, you will work with comments from 5 different YouTube videos, and classify comments as either spam or legitimate comments by writing labeling explanations.


## The Data

The data is available [via Kaggle](https://www.kaggle.com/goneee/youtube-spam-classifiedcomments). You may download it there, or, if you have the password, unzip the data below.

You must replace `PASSWORD` with the password to unzip the data.

In [None]:
!unzip -P PASSWORD data/data.zip
!ls

In [None]:
from data.preparer import load_youtube_dataset

DELIMITER = "#"
df_train, df_dev, df_valid, df_test = load_youtube_dataset(delimiter=DELIMITER)
print("{} training examples".format(len(df_train)))
print("{} development examples".format(len(df_dev)))
print("{} validation examples".format(len(df_valid)))
print("{} test examples".format(len(df_test)))

In [None]:
# Define labels
ABSTAIN = 0
NOT_SPAM = 1
SPAM = 2

Transform the data into a format compatible with Babble Labble:

In [None]:
from babble.Candidate import Candidate # this is a helper class to transform our data into a format Babble can parse

dfs = [df_train, df_dev, df_test]

for df in dfs:
    df["id"] = range(len(df))

Cs = [df.apply(lambda x: Candidate(x), axis=1) for df in dfs]

# Babble Labble uses 1 and 2 for labels, while our data uses 0 and 1,
# so add 1 to convert.
Ys = [df.label.values + 1 for df in dfs]

In [None]:
from babble import BabbleStream

aliases = {}
babbler = BabbleStream(Cs, Ys, balanced=True, shuffled=True, seed=456, aliases=aliases)

### Collection

Use `babbler` to show candidates

In [None]:
candidate = babbler.next()
print(candidate)

In [None]:
from babble import Explanation
explanation = Explanation(
    name='check_out', # name of this rule, for your reference
    label=SPAM, # label to assign
    condition='The word "my" is in the text', # natural language description of why you label the candidate this way
    candidate=candidate # optional argument, the candidate should be an example labeled by this rule
)


Babble will parse your explanations into functions, then filter out functions that are duplicates, incorrectly label their given candidate, or assign the same label to all examples.
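
The three checks described above can be sketched as a small filter over candidate functions. This is an illustration under stated assumptions, not Babble Labble's filter bank: the helper `filter_parses` is hypothetical, comments are plain strings rather than `Candidate` objects, and two functions count as duplicates here when they produce identical outputs on a sample.

```python
ABSTAIN = 0
SPAM = 2

def filter_parses(parses, label, candidate, sample):
    """Split (name, function) pairs into kept names and (name, reason) removals."""
    kept, removed = [], []
    seen_signatures = set()
    for name, fn in parses:
        signature = tuple(fn(text) for text in sample)  # outputs on the sample
        if fn(candidate) != label:
            removed.append((name, "inconsistent with its own candidate"))
        elif len(set(signature)) == 1:
            removed.append((name, "same output on every example"))
        elif signature in seen_signatures:
            removed.append((name, "duplicate of an earlier parse"))
        else:
            seen_signatures.add(signature)
            kept.append(name)
    return kept, removed

sample = ["check out my channel", "great song", "my favorite part", "subscribe to me"]
candidate = "check out my channel"
parses = [
    ("contains_my", lambda t: SPAM if "my" in t.split() else ABSTAIN),
    ("contains_my_again", lambda t: SPAM if " my " in " " + t + " " else ABSTAIN),
    ("always_spam", lambda t: SPAM),
    ("contains_song", lambda t: SPAM if "song" in t.split() else ABSTAIN),
]

kept, removed = filter_parses(parses, SPAM, candidate, sample)
print(kept)
print(removed)
```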

In [None]:
parses, filtered = babbler.apply(explanation)

### Analysis

In [None]:
babbler.analyze(parses)

In [None]:
babbler.filtered_analysis(filtered)

In [None]:
babbler.commit()

### Evaluation

In [None]:
from metal.analysis import lf_summary

Ls = [babbler.get_label_matrix(split) for split in [0,1,2]]
lf_names = [lf.__name__ for lf in babbler.get_lfs()]
lf_summary(Ls[1], Ys[1], lf_names=lf_names)

In [None]:
from metal import LabelModel
from metal.tuners import RandomSearchTuner

search_space = {
    'n_epochs': [50, 100, 500],
    'lr': {'range': [0.01, 0.001], 'scale': 'log'},
    'show_plots': False,
}

tuner = RandomSearchTuner(LabelModel, seed=123)

label_aggregator = tuner.search(
    search_space, 
    train_args=[Ls[0]], 
    X_dev=Ls[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')

If you'd like to save the explanations you've generated, you can use the `ExplanationIO` object to write to or read them from file.

In [None]:
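from babble.utils import ExplanationIO

# Collect the Explanation objects you want to keep; here, just the one defined above.
explanations = [explanation]

FILE = "my_explanations.tsv"
exp_io = ExplanationIO()
exp_io.write(explanations, FILE)
explanations = exp_io.read(FILE)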