# Prompt-Based NLP

In Homework 4, we’ll use PET to train a classifier on Jigsaw’s Toxic Language dataset.
Conveniently, the PET authors have already provided code for you to use at https://github.com/timoschick/pet. Your task will be to (1) write your own custom verbalizer and patterns
and (2) train your model by modifying one of their example scripts. The PET repository has good
documentation on how to set up their model, train it, and use the code.

Like in Homework 3, in this assignment we will use a much smaller but nearly-as-performant
version of BERT, https://huggingface.co/microsoft/MiniLM-L12-H384-uncased,
to train our models. While PET can work on any LLM, MiniLM will make the homework much
faster to finish.

In [1]:
import pandas as pd
from transformers import BertTokenizerFast, BertForSequenceClassification #EarlyStoppingCallback
from datasets import load_dataset, load_metric, Dataset
from transformers import Trainer, TrainingArguments
import torch
import wandb
import os
from pathlib import Path
torch.cuda.empty_cache()

In [2]:
train_df = pd.read_csv('data/hw4_train.csv')
test_df = pd.read_csv('data/hw4_test.csv')
train_df.sample(5)

Unnamed: 0,id,comment_text,toxic
156603,d10f0baab32ca062,this page is a complete mess \n\nno one seems ...,0
99860,16584c23f714d37f,Suck cock you snivelling cunt,1
80413,d72132805bb8629b,"""\n Your submission at AfC Benson Dillon Billi...",0
153083,97260f6f0a501e7c,More vandalism by this IP address \n\nSee here.,0
15124,27f0dbfcd47c759a,is D's imprisonment really a turning point ?,0


In [3]:
# create unlabeled data and dev data

train_unlabled_df = train_df.drop('toxic', axis=1)
train_unlabled_df.to_csv('data/hw4_train_unlabled.csv', index=False)

dev_split_pcent = .2
dev_split_size = round(train_df.shape[0] * dev_split_pcent)
dev_df = train_df.sample(dev_split_size).copy()
dev_df.to_csv('data/hw4_dev.csv', index=False)

In [4]:
model_name = 'microsoft/MiniLM-L12-H384-uncased'
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

## Part 1

Write a simple piece of code that takes a single word as input,
tokenizes it with the BERT tokenizer in huggingface, and returns the word’s corresponding tokens
(or token IDs) in the BERT vocabulary. You’ll want to use this piece of code in the next task to
check that your verbalizer is using only single-token words.

This section only needs to be run once. The tokenized data is currently saved in `data/tokenized_train_dataset.pt` and `data/tokenized_test_dataset.pt` and can be loaded and reused after the kernel has been shut down.

In [5]:
train_df.sample()

Unnamed: 0,id,comment_text,toxic
4007,0ab669a21bd76681,Why is there more information on the Transform...,0


In [6]:
#debug = Dataset.from_pandas(train_df.sample(5, random_state=630))
debug = train_df.sample(5, random_state=630)
example = debug['comment_text'].values[0]
example_list = example.split()
oneword_tokens = []
for word in example_list:
    token = tokenizer.tokenize(word)
    if len(token) == 1:
        oneword_tokens.append(token[0])

print(len(oneword_tokens))
print(' '.join(oneword_tokens))
print('-------------------')
print(len(example_list))
print(example)

36
was a superhero what it appeared to be it was a superhero between and reason why it is a superhero because due to actions in stopping by the law in criminal why it is a superhero
-------------------
44
RoboCop was a superhero 

What it appeared to be it was a superhero between cyberpunk and action, reason why it is a superhero because due to RoboCop's actions in stopping crime, by upholding the law in criminal charge. That's why it is a superhero


In [7]:
# No padding necessary? 

def process_data_for_tokenizer(string):
    """
    Accepts a string, splits it on whitespace, and tokenizes each word with the
    predefined tokenizer (such as BertTokenizerFast).
    Words that map to exactly one token are kept; words that split into
    several tokens are dropped.
    Returns the kept single-token words joined into a new string.
    """
    string_list = string.split()
    oneword_tokens = []
    start_len = len(string_list)
    for word in string_list:
        token = tokenizer.tokenize(word)
        if len(token) == 1:
            oneword_tokens.append(token[0])

    oneword_string = ' '.join(oneword_tokens)
    # compare word counts (not character counts) to detect dropped words
    end_len = len(oneword_tokens)
    if start_len != end_len:
        print("removed multi-word tokens")
    else:
        print("no multi-word tokens present")

    return oneword_string

In [8]:
oneword_example = process_data_for_tokenizer(example)
oneword_example

removed multi-word tokens


'was a superhero what it appeared to be it was a superhero between and reason why it is a superhero because due to actions in stopping by the law in criminal why it is a superhero'

In [9]:
debug['oneword_comments'] = debug.comment_text.apply(process_data_for_tokenizer)
debug

removed multi-word tokens
removed multi-word tokens
removed multi-word tokens
removed multi-word tokens
removed multi-word tokens


Unnamed: 0,id,comment_text,toxic,oneword_comments
57910,9b113508b15890f7,RoboCop was a superhero \n\nWhat it appeared t...,0,was a superhero what it appeared to be it was ...
7231,1342886f4e88377c,Christopher Walken \nDoesn't this guy look lik...,0,christopher this guy look like christopher
72996,c34aa1b92d69847c,Hmmm... hat we have is perhaps a merge or perh...,0,hat we have is perhaps a merge or perhaps just...
48556,81de738eee0c2837,hey shut up ok just because i dont know much e...,1,hey shut up ok just because i know much englis...
120561,84f1c010e30b3cd7,Phuck?\n\nPhuck Phred Phelps.,0,


### Check Verbalizers

In [10]:
# check verbalizers
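
One way to run this check is sketched below. It is a minimal illustration, not the assignment's reference solution: `find_multi_token_words`, `toy_tokenize`, and `TOY_VOCAB` are hypothetical names, and the toy tokenizer only stands in for a real one so the sketch runs without downloading a vocabulary.

```python
from typing import Callable, Dict, Iterable, List


def find_multi_token_words(
    words: Iterable[str], tokenize: Callable[[str], List[str]]
) -> Dict[str, List[str]]:
    """Return the candidate words that `tokenize` does not map to exactly one token."""
    bad = {}
    for word in words:
        tokens = tokenize(word)
        if len(tokens) != 1:
            bad[word] = tokens
    return bad


# Toy stand-in tokenizer (an assumption for illustration only):
# words in the tiny vocabulary stay whole, everything else is split in two pieces.
TOY_VOCAB = {"toxic", "rude", "polite", "civil"}


def toy_tokenize(word: str) -> List[str]:
    word = word.lower()
    if word in TOY_VOCAB:
        return [word]
    mid = len(word) // 2
    return [word[:mid], "##" + word[mid:]]


candidates = ["toxic", "polite", "disrespectful"]
print(find_multi_token_words(candidates, toy_tokenize))
```

In the notebook, passing `tokenizer.tokenize` in place of `toy_tokenize` should run the same check against the real BERT vocabulary; an empty dictionary means every candidate verbalizer word is a single token.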

## Part 2

Write 10 different prompts that can be used to classify toxic speech.
Prompts should be relatively different (not just adding/changing one word). For each, come up
with at least 2 verbalizations of each class (toxic/non-toxic). You can share verbalizations across
prompts if needed. We really want to see some creativity across your prompts (this will also help
the model learn more too).

### Initialize MyTaskDataProcessor and MyTaskPVP

Both are necessary for defining a custom task in PET.

They are defined in two new files inside `pet/custom`: `toxic_task_processor.py` and `toxic_task_pvp.py`.
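
To make the idea of patterns and verbalizers concrete, here is a small library-independent sketch. It does not use the PET API; in the actual homework the equivalent logic belongs in the custom `MyTaskPVP` class. The three patterns, the verbalizer words, and the helper names are all hypothetical examples, and each word would still need to be checked against the tokenizer to confirm it is a single token.

```python
from typing import Callable, Dict, List

MASK = "[MASK]"  # BERT-style mask token; the model fills this slot

# Hypothetical verbalizer: label -> candidate words that stand for it.
VERBALIZER: Dict[str, List[str]] = {
    "0": ["polite", "harmless"],
    "1": ["toxic", "rude"],
}

# Hypothetical patterns: each turns a comment into a cloze question.
PATTERNS: List[Callable[[str], str]] = [
    lambda text: f"{text} This comment is {MASK}.",
    lambda text: f"Moderator note: the following post is {MASK}. {text}",
    lambda text: f'"{text}" How would you describe this tone? {MASK}',
]


def apply_pattern(pattern_id: int, text: str) -> str:
    """Render one comment with the chosen pattern; exactly one mask slot must appear."""
    rendered = PATTERNS[pattern_id](text)
    assert rendered.count(MASK) == 1, "each pattern needs exactly one mask"
    return rendered


def verbalize(label: str) -> List[str]:
    """Return the words the masked language model may predict for this label."""
    return VERBALIZER[label]


print(apply_pattern(0, "thanks for fixing the citation"))
print(verbalize("1"))
```

Keeping patterns as separate functions makes it easy to see whether the ten prompts really differ in structure rather than in a single word.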

## Part 3
For comparison with PET, train a regular classifier using Trainer and
the MiniLM parameters on all the training data (very similar to what you did in Homework 3!). You
should train your model for at least two epochs, but you’re not required to do any hyperparameter
tuning (you just need a score). Predict the toxicity of the provided test data and calculate the F1.

In [11]:
MiniLM_tokenizer = BertTokenizerFast.from_pretrained(model_name)
MiniLMmodel = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/MiniLM-L12-H384-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
max_input_length = 512

def preprocess_function(examples):
    # tokenize a batch of comments, padding/truncating each one to max_input_length
    inputs = list(examples["comment_text"])
    model_inputs = MiniLM_tokenizer(inputs, padding='max_length', max_length=max_input_length, truncation=True)

    # carry the gold labels through so Trainer can compute the loss
    model_inputs["labels"] = examples["labels"]
    return model_inputs

In [12]:
train_dataset = Dataset.from_pandas(train_df.rename(columns={'toxic':'labels'}))
# dev_dataset = Dataset.from_pandas(dev_df.rename(columns={'toxic':'labels'}))
test_dataset = Dataset.from_pandas(test_df.rename(columns={'toxic':'labels'}))

In [13]:
train_dataset

Dataset({
    features: ['id', 'comment_text', 'labels'],
    num_rows: 159571
})

In [14]:
test_dataset

Dataset({
    features: ['id', 'comment_text', 'labels'],
    num_rows: 63978
})

In [15]:
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
#tokenized_dev_dataset = dev_dataset.map(lambda x: tokenizer(x['comment_text'],padding = 'max_length', max_length =512, truncation=True))
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True)


In [16]:
tokenized_test_dataset

Dataset({
    features: ['id', 'comment_text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 63978
})

In [17]:
tokenized_train_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask','labels'])
# tokenized_dev_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask','labels'])
tokenized_test_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

In [18]:
from sklearn.metrics import f1_score
import numpy as np
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    preds = preds.reshape(len(preds),)
    f1 = f1_score(labels, preds, average='macro')
    return {
        'f1': f1
    }
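
Because the dataset is heavily skewed toward non-toxic comments, it helps to see why macro averaging matters. The sketch below recomputes macro F1 by hand on toy labels (the data and helper names are illustrative assumptions, not homework results): a classifier that always predicts "non-toxic" gets about 0.89 F1 on class 0, 0 on class 1, and therefore about 0.44 macro F1.

```python
from typing import Sequence


def f1_for_class(labels: Sequence[int], preds: Sequence[int], cls: int) -> float:
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    return sum(f1_for_class(labels, preds, c) for c in classes) / len(classes)


# Toy imbalanced data: 8 non-toxic, 2 toxic; the classifier predicts "non-toxic" everywhere.
labels = [0] * 8 + [1] * 2
preds = [0] * 10
print(f1_for_class(labels, preds, 0))
print(f1_for_class(labels, preds, 1))
print(macro_f1(labels, preds))
```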

In [19]:
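# The evaluation and best-checkpoint options below are commented out because
# TrainingArguments raised these errors in this environment:
# __init__() got an unexpected keyword argument 'evaluation_strategy'
# __init__() got an unexpected keyword argument 'load_best_model_at_end'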

training_args = TrainingArguments(
    output_dir = 'MiniLM',
    num_train_epochs = 2,
#    evaluation_strategy = 'steps',
#    eval_steps = 500,
    learning_rate=1e-4,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 16,
#    seed = 0,
#    load_best_model_at_end = True,
    gradient_accumulation_steps = 16,
)

In [21]:
trainer = Trainer(
    MiniLMmodel,
    training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    compute_metrics=compute_metrics,
#    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)





In [None]:
trainer.train()


In [None]:
trainer.evaluate()

In [None]:
trainer.save_model('MiniLMmodel')

## Part 4
Using your patterns and verbalizers, train separate PET models on 10,
50, 100, and 500 instances of data. Your data should be randomly sampled from the training data
but be sure to have examples of each class. You are free to choose which instances you use and
what distribution of toxic/non-toxic labels are in your training data (provided you have at least one
example of each). For each model, predict the scores for the provided test data and calculate the
Macro F1.

### Sample data for each PET instance

In [None]:
train_df.groupby('toxic').size()

In [None]:
instances = [10, 50, 100, 500]
base_dir = Path('data/instances')
if base_dir.exists():
    for instance in instances:
        instance_dir = base_dir / f'instance{instance}'
        instance_dir.mkdir(exist_ok=True)  # to_csv does not create missing directories
        # note: groupby('toxic').sample(n) draws n rows per class, not n rows in total
        if not (instance_dir / 'hw4_train.csv').exists():
            instance_train = train_df.groupby('toxic').sample(instance)
            instance_train = instance_train[['toxic', 'comment_text', 'id']]
            instance_train.to_csv(instance_dir / 'hw4_train.csv', index=False, header=False)
        if not (instance_dir / 'hw4_train_unlabeled.csv').exists():
            instance_train_unlabeled = train_unlabled_df.sample(instance)
            instance_train_unlabeled = instance_train_unlabeled[['comment_text', 'id']]
            instance_train_unlabeled.to_csv(instance_dir / 'hw4_train_unlabeled.csv', index=False, header=False)
        if not (instance_dir / 'hw4_dev.csv').exists():
            instance_dev = train_df.groupby('toxic').sample(instance)
            instance_dev = instance_dev[['toxic', 'comment_text', 'id']]
            instance_dev.to_csv(instance_dir / 'hw4_dev.csv', index=False, header=False)
        if not (instance_dir / 'hw4_test.csv').exists():
            instance_test = test_df.groupby('toxic').sample(instance)
            instance_test = instance_test[['toxic', 'comment_text', 'id']]
            instance_test.to_csv(instance_dir / 'hw4_test.csv', index=False, header=False)
else:
    print("create instance data directory")

## Part 5
Let’s compare our PET-based models and our regular all-data MiniLM
model. Plot the score for each PET model and your full-data MiniLM model using Seaborn. If
you are feeling curious, feel free to train models on different sizes/distributions of data and include
those too. Write your guess on how many instances you think you need to train a PET model that
will reach the performance of a MiniLM model trained on all the data.
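
A possible shape for the comparison plot is sketched here. The helper names and column names are assumptions made for illustration, and no scores are filled in: `pet_scores` is expected to be a dictionary mapping each training-set size to the Macro F1 you measured, and `full_data_f1` is the score of the full-data MiniLM model.

```python
from typing import Dict, List


def build_score_records(pet_scores: Dict[int, float], full_data_f1: float) -> List[dict]:
    """Turn {n_instances: macro_f1} plus the full-data score into plot-ready rows."""
    records = [
        {"model": f"PET ({n})", "macro_f1": score}
        for n, score in sorted(pet_scores.items())
    ]
    records.append({"model": "MiniLM (all data)", "macro_f1": full_data_f1})
    return records


def plot_scores(records: List[dict]):
    # Imported here so the record-building helper works without plotting libraries.
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    ax = sns.barplot(data=pd.DataFrame(records), x="model", y="macro_f1")
    ax.set_ylim(0, 1)
    ax.set_ylabel("Macro F1 on test set")
    plt.tight_layout()
    return ax


# Usage, once the scores exist (no values are filled in here on purpose):
# records = build_score_records(pet_scores, full_data_f1)
# plot_scores(records)
```

Sorting the PET models by training-set size keeps the bars in a natural order, so the gap to the full-data model is easy to read off when estimating how many instances PET would need.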