# Prompt-Based NLP

In Homework 4, we’ll use PET to train a classifier on Jigsaw’s Toxic Language dataset.
Conveniently, the PET authors have already provided code for you to use at https://github.com/timoschick/pet.
Your task will be to (1) write your own custom verbalizer and patterns and (2) train your model by
modifying one of their example scripts. The PET repository has good documentation on how to set up
their model, train it, and use the code.
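
To make the terms concrete: a pattern rewrites an input as a cloze sentence with one masked slot, and a verbalizer maps each label to a word the language model can predict in that slot. The sketch below is plain Python for illustration only; it is not the PET repository's API. The pattern wording, the words "polite" and "rude", and the function names are all hypothetical, and whether those words are single tokens in the MiniLM vocabulary is an assumption you would need to check.

```python
# Illustrative only: plain Python, NOT the PET library's classes or functions.
MASK = "[MASK]"

# Hypothetical verbalizer: one word per label (0 = not toxic, 1 = toxic).
VERBALIZER = {0: "polite", 1: "rude"}


def pattern(comment: str) -> str:
    """Wrap a comment in a cloze sentence with a single masked slot."""
    return f'"{comment}" This comment is {MASK}.'


def label_from_mask_scores(word_scores: dict) -> int:
    """Return the label whose verbalized word scores highest at the mask."""
    return max(
        VERBALIZER,
        key=lambda label: word_scores.get(VERBALIZER[label], float("-inf")),
    )


print(pattern("Thanks for fixing the citation!"))
# Made-up scores standing in for a masked language model's output at the mask.
print(label_from_mask_scores({"polite": 2.3, "rude": -0.7}))
```

In PET itself, the scores at the mask come from the masked language model, and the pattern and verbalizer are supplied through the task-specific classes described in the repository's documentation.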

Like in Homework 3, in this assignment we will use a much smaller but nearly-as-performant
version of BERT, https://huggingface.co/microsoft/MiniLM-L12-H384-uncased,
to train our models. While PET can work on any LLM, MiniLM will make the homework much
faster to finish.

In [1]:
import pandas as pd
from transformers import BertTokenizerFast, BertForSequenceClassification #EarlyStoppingCallback
from datasets import Dataset
from transformers import Trainer, TrainingArguments
import torch
#import wandb
import os
from pathlib import Path
torch.cuda.empty_cache()

In [2]:
train_df = pd.read_csv('data/hw4_train.csv')
test_df = pd.read_csv('data/hw4_test.csv')
train_df.sample(5)

| | id | comment_text | toxic |
|---|---|---|---|
| 100262 | 18bc9c930d585fe3 | "\n\n Gita's Samkhya is NOT DIFFERENT \n\n""On... | 1 |
| 114061 | 622f2af984c7bf08 | Remaining pages \n\nWhen do you plan to have p... | 0 |
| 137054 | dd4c2206298426cc | The resultant redirect, AM-2 should redirect t... | 0 |
| 67519 | b4aed34834b1eff7 | Nizami \nJames, I add a lot of academical sour... | 0 |
| 62525 | a74c5249bcf40306 | The same applies to your latest revert at Nati... | 0 |


In [4]:
model_name = 'microsoft/MiniLM-L12-H384-uncased'
# Load the tokenizer that ships with the MiniLM checkpoint so it matches the model
tokenizer = BertTokenizerFast.from_pretrained(model_name)

## Part 3

**Note that Parts 1, 2, 4, and 5 were completed in a separate notebook.**

For comparison with PET, train a regular classifier using Trainer and
the MiniLM parameters on all the training data (very similar to what you did in Homework 3!). You
should train your model for at least two epochs, but you’re not required to do any hyperparameter
tuning (you just need a score). Predict the toxicity of the provided test data and calculate the F1.

In [10]:
MiniLM_tokenizer = BertTokenizerFast.from_pretrained(model_name)
MiniLMmodel = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/MiniLM-L12-H384-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
max_input_length = 512

def preprocess_function(examples):
    # Tokenize a batch of comments, padding or truncating each one to max_input_length tokens
    model_inputs = tokenizer(examples["comment_text"], padding='max_length', max_length=max_input_length, truncation=True)
    # Carry the gold labels along so the Trainer can compute the loss
    model_inputs["labels"] = examples["labels"]
    return model_inputs

In [12]:
train_dataset = Dataset.from_pandas(train_df.rename(columns={'toxic':'labels'}))
# dev_dataset = Dataset.from_pandas(dev_df.rename(columns={'toxic':'labels'}))
test_dataset = Dataset.from_pandas(test_df.rename(columns={'toxic':'labels'}))

In [13]:
train_dataset

Dataset({
    features: ['id', 'comment_text', 'labels'],
    num_rows: 159571
})

In [14]:
test_dataset

Dataset({
    features: ['id', 'comment_text', 'labels'],
    num_rows: 63978
})

In [15]:
tokenized_train_dataset = train_dataset.map(lambda x: tokenizer(x['comment_text'],padding = 'max_length', max_length =512, truncation=True))
#tokenized_dev_dataset = dev_dataset.map(lambda x: tokenizer(x['comment_text'],padding = 'max_length', max_length =512, truncation=True))
tokenized_test_dataset = test_dataset.map(lambda x: tokenizer(x['comment_text'],padding = 'max_length', max_length =512, truncation=True))



In [16]:
tokenized_test_dataset

Dataset({
    features: ['id', 'comment_text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 63978
})
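
The three new columns come from tokenizing with `padding='max_length'` and `truncation=True`. As a rough illustration of what they contain, the sketch below imitates that behavior by hand on made-up token IDs. It is a simplified stand-in, not the tokenizer itself: `toy_encode` is a hypothetical helper, the body token IDs are arbitrary placeholders, and only the special-token IDs (101, 102, 0) are taken from the BERT uncased vocabulary.

```python
# Special-token IDs in the BERT uncased vocabulary: [CLS], [SEP], [PAD].
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def toy_encode(token_ids, max_length):
    """Hand-written imitation of single-sentence encoding with
    padding='max_length' and truncation=True (illustration only)."""
    # Truncate so that [CLS] + tokens + [SEP] fits in max_length.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(input_ids)
    n_pad = max_length - n_real
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,        # padded to a fixed length
        "token_type_ids": [0] * max_length,               # one segment, so all zeros
        "attention_mask": [1] * n_real + [0] * n_pad,     # 1 = real token, 0 = padding
    }


short = toy_encode([11, 12], max_length=6)                    # shorter than the limit: padded
long = toy_encode([11, 12, 13, 14, 15, 16, 17], max_length=6)  # longer: truncated
print(short)
print(long)
```

With `max_length=512`, every comment becomes a 512-token row regardless of its real length, which is simple but means short comments are mostly padding.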

In [17]:
tokenized_train_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask','labels'])
# tokenized_dev_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask','labels'])
tokenized_test_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

In [18]:
from sklearn.metrics import f1_score
import numpy as np
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)  # index of the largest logit = predicted class (0 or 1)
    f1 = f1_score(labels, preds, average='macro')
    return {
        'f1': f1
    }
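
Because `average='macro'` matters when the classes are imbalanced, it may help to see what it computes. The sketch below works out macro F1 by hand in plain Python on a small made-up example: F1 is computed for each class separately and the per-class scores are averaged with equal weight. The helper names and the toy labels are hypothetical and are not part of the notebook's pipeline.

```python
def f1_for_class(labels, preds, cls):
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    return sum(f1_for_class(labels, preds, c) for c in classes) / len(classes)


# Made-up toy data: four non-toxic and two toxic examples.
labels = [0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 1, 1, 0]

print(f1_for_class(labels, preds, 0))  # 0.75
print(f1_for_class(labels, preds, 1))  # 0.5
print(macro_f1(labels, preds))         # 0.625
```

Since each class counts equally, a model that ignores the rarer toxic class is penalized even if its overall accuracy is high.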

In [19]:
training_args = TrainingArguments(
    output_dir = 'MiniLM',
    num_train_epochs = 2,
    learning_rate=1e-4,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 16,
    gradient_accumulation_steps = 16,
)

In [20]:
trainer = Trainer(
    MiniLMmodel,
    training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    compute_metrics=compute_metrics,
)

In [21]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, comment_text. If id, comment_text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 159571
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 16
  Total optimization steps = 1246


| Step | Training Loss |
|---|---|
| 500 | 0.1444 |
| 1000 | 0.0814 |


Saving model checkpoint to MiniLM/checkpoint-500
Configuration saved in MiniLM/checkpoint-500/config.json
Model weights saved in MiniLM/checkpoint-500/pytorch_model.bin
Saving model checkpoint to MiniLM/checkpoint-1000
Configuration saved in MiniLM/checkpoint-1000/config.json
Model weights saved in MiniLM/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1246, training_loss=0.1058224912248492, metrics={'train_runtime': 17143.8737, 'train_samples_per_second': 18.616, 'train_steps_per_second': 0.073, 'total_flos': 2.101728739680461e+16, 'train_loss': 0.1058224912248492, 'epoch': 2.0})
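
The step count and throughput in the output above can be reproduced with a little arithmetic. The sketch below takes its numbers from the training log; the device count of 2 is inferred from the reported total batch size rather than stated anywhere in the notebook, and the floor division is an assumption about how this Trainer version counts optimizer updates per epoch.

```python
import math

# Values copied from the training log above.
num_examples = 159571
per_device_batch = 8
grad_accum = 16
epochs = 2
total_batch = 256          # "Total train batch size" in the log
train_runtime = 17143.8737  # seconds

# Inferred, not stated in the notebook: 256 / (8 * 16) devices.
n_devices = total_batch // (per_device_batch * grad_accum)

# Batches the dataloader yields per epoch across all devices.
batches_per_epoch = math.ceil(num_examples / (per_device_batch * n_devices))

# Assumed counting rule: one optimizer update per `grad_accum` batches,
# with any leftover batches not counted as a separate step.
updates_per_epoch = batches_per_epoch // grad_accum
total_steps = updates_per_epoch * epochs

samples_per_second = epochs * num_examples / train_runtime

print(n_devices, batches_per_epoch, updates_per_epoch, total_steps)
print(round(samples_per_second, 3))
```

Under these assumptions the result matches the 1246 optimization steps in the log, which also explains why only two checkpoints (steps 500 and 1000) were saved.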

In [22]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, comment_text. If id, comment_text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 63978
  Batch size = 16


{'eval_loss': 0.19436031579971313,
 'eval_f1': 0.8122410868096552,
 'eval_runtime': 1171.5032,
 'eval_samples_per_second': 54.612,
 'eval_steps_per_second': 1.707,
 'epoch': 2.0}

In [23]:
trainer.save_model('MiniLMmodel')

Saving model checkpoint to MiniLMmodel
Configuration saved in MiniLMmodel/config.json
Model weights saved in MiniLMmodel/pytorch_model.bin
