# Developing a toxicity detector using large language models (LLMs)

Solution to the [assignment](https://docs.google.com/document/d/1w307HmtXqqreDj5VMiUtdjD0YG2pvzBbiPqwMf6vJ9E/edit) for Generative AI 1.

The fine-tuned model checkpoints produced here have also been shared on Google Drive.

---
### Dataset Selection & Exploratory Data Analysis

Here we will load the [Toxic Comments Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data) mentioned in the assignment. We have manually loaded the files from this dataset into Google Drive and will access them directly from there.

In [None]:
import pandas as pd
from google.colab import drive # Used for mounting Google Drive

In [None]:
# Mount Google Drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
# Read the training data provided in the dataset
df_train_file = pd.read_csv("/content/gdrive/MyDrive/T5/train.csv")

In [None]:
# Print the first few rows of this dataset
df_train_file.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


The first few rows show that this dataset contains several classes beyond just "toxic". Let's look at some random rows labeled as "toxic" to see whether an individual comment can be assigned to more than one of these classes.

In [None]:
# Print labels for a random sample labeled as toxic
# Note that some samples receive multiple labels
cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
df_train_file[df_train_file["toxic"] == 1][cols].sample(10)

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
35382,1,0,0,0,0,0
33450,1,0,0,0,0,0
119365,1,0,1,0,1,0
137123,1,0,1,0,0,0
81181,1,0,0,0,1,0
61852,1,0,0,0,0,0
146832,1,0,1,0,0,0
88928,1,0,1,0,0,0
105425,1,0,0,0,1,0
33839,1,0,1,0,1,0


So, it looks like this is a multi-label classification problem. That is, multiple nonexclusive labels (or none at all) may be assigned to each instance. When building our classifier, we will need to use a model capable of producing multiple labels for each input.
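To make the multi-label structure concrete, the following standalone sketch counts how often each combination of labels occurs. The `rows` list is a small hypothetical stand-in that copies a few of the rows sampled above, not the full dataset, and `label_combination` is an illustrative helper name.

```python
from collections import Counter

# Label columns, in the same order as in the dataset
CATEGORIES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical toy rows copied from the sample shown above (not the full dataset)
rows = [
    {"toxic": 1, "severe_toxic": 0, "obscene": 0, "threat": 0, "insult": 0, "identity_hate": 0},
    {"toxic": 1, "severe_toxic": 0, "obscene": 1, "threat": 0, "insult": 1, "identity_hate": 0},
    {"toxic": 1, "severe_toxic": 0, "obscene": 1, "threat": 0, "insult": 0, "identity_hate": 0},
    {"toxic": 1, "severe_toxic": 0, "obscene": 1, "threat": 0, "insult": 1, "identity_hate": 0},
]


def label_combination(row):
    """Return the tuple of category names that are set to 1 for this row."""
    return tuple(c for c in CATEGORIES if row[c] == 1)


# Count how many rows carry each distinct set of labels
combination_counts = Counter(label_combination(r) for r in rows)
for combo, count in combination_counts.most_common():
    print(count, combo)
```

A row that carries no labels at all would map to the empty tuple, which corresponds to the "none at all" case mentioned above.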


Let's also take a look at the test data provided with this dataset:

In [None]:
# Read the test data provided in the dataset
df_test_file = pd.read_csv("/content/gdrive/MyDrive/T5/test.csv")

In [None]:
# View the first few rows of the test data
df_test_file.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


It looks like this test data does not contain labels. This makes sense, since this is likely the test data used for scoring the Kaggle competition associated with this dataset. Since we will want to evaluate the performance of our model out-of-sample, we will need to construct our own test set that contains labels. To do this, we will hold out a random 20% of the training set to test on, and use the remaining 80% to train our model.

In [None]:
# Use train_test_split to create our labeled training and test sets
from sklearn.model_selection import train_test_split

In [None]:
# Create training and test sets from the labeled data that we have available
df_train, df_test = train_test_split(df_train_file, test_size=0.2)

In [None]:
# Save our new training and test sets to Google Drive
df_train.to_csv("/content/gdrive/MyDrive/T5/assignment_train.csv", index=False)
df_test.to_csv("/content/gdrive/MyDrive/T5/assignment_test.csv", index=False)
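The 80/20 hold-out above is random, so rerunning it produces a different split each time. As a minimal sketch of a reproducible version, the snippet below shuffles row indices with a fixed seed using only the standard library. The function name `split_indices` and the seed value are assumptions made for illustration. The test-set size is rounded up here, which reproduces the 127656 and 31915 row counts that appear in the training and evaluation logs later in this notebook.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and split them into (train, test)."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle independent of global state
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# 159571 = 127656 + 31915, the total implied by the log messages
train_idx, test_idx = split_indices(159571, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

In the notebook itself, passing a fixed `random_state` to `train_test_split` would serve the same purpose.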

---
### Baseline Model Testing

While there are a variety of approaches that could be used to solve this problem, we will use the [T5 model](https://huggingface.co/docs/transformers/model_doc/t5) that was discussed in class. T5 is a text-to-text model that has been fine-tuned on a variety of tasks, including sentiment analysis, summarization, and question answering. When providing text input to the model, the specific task to be performed can be indicated in a prefix in the text input. The full list of tasks the T5 model has already been fine-tuned on and examples of the prefixes used for those tasks are in Appendix D of [the T5 paper](https://arxiv.org/pdf/1910.10683.pdf).

We will load the T5 model checkpoint from Hugging Face and test its ability to perform sentiment analysis (as an example) out of the box. Although T5 is capable of performing a variety of tasks without additional fine-tuning, it has not been fine-tuned to perform the specific task of toxic comment classification that we want to perform. We will verify that the model produces meaningless output when prompted to classify toxic content.

To get started, we will need to install the Hugging Face Transformers library and one of its dependencies.

In [None]:
!pip install transformers
!pip install sentencepiece



In [None]:
# Imports that we will use for implementing the T5 model
import torch
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq
)

In [None]:
# Suppress warnings. None are important in this case, and they clutter output
import warnings
warnings.filterwarnings("ignore")

We will use the T5-small model instance, which has ~60M parameters. This will allow us to fine-tune a model faster and within the limits of free Colab instances. As we will see, even the small model produces a fairly capable classifier.

In [None]:
# Load the pre-trained T5 small model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Send the model to the GPU for faster training and inference
model = model.to("cuda")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


To get a feel for how the T5 model works, we will test it on a sentiment classification example. Sentiment classification is one of the tasks that the checkpoint available from Hugging Face has already been fine-tuned for. As discussed in the T5 paper, the model has been fine-tuned so that the NLP task that we want to perform can be indicated by a prefix applied to the model's input. For example, to perform sentiment classification, the prefix `sst2 sentence:` can be used:

In [None]:
# Look at a sample that should have positive sentiment
input_ids = tokenizer(
    "sst2 sentence: The movie was interesting.",
    return_tensors="pt"
).input_ids
output = model.generate(input_ids=input_ids.to("cuda"))
print(tokenizer.decode(output[0]))

<pad> positive</s>


In [None]:
# Look at a variation that should now have negative sentiment
input_ids = tokenizer(
    "sst2 sentence: The movie was too long to be interesting.",
    return_tensors="pt"
).input_ids
output = model.generate(input_ids=input_ids.to("cuda"))
print(tokenizer.decode(output[0]))

<pad> negative</s>


As we can see, the model generates text strings that indicate the appropriate class as its output.

We would like to use this model for toxic content classification, although the model has not yet been fine-tuned for this task. Just to see what happens, we will pass an input to the model with a reasonable prefix for toxic content classification. Since the model is unfamiliar with this task, we get a meaningless output as expected:

In [None]:
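# Pick a sample input to test for toxic comment classification.
# We index the original file here: after the random split, the row
# labeled 4 is not guaranteed to be present in df_train.
sample = df_train_file["comment_text"][4]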
print(sample)

You, sir, are my hero. Any chance you remember what page that's on?


In [None]:
# Pass input to the model. The specific input we pass in this example is
# "is this a toxic comment: You, sir, are my hero. Any chance you remember what page that's on?"
# Note that the output string that is produced is meaningless
input_ids = tokenizer(
    "is this a toxic comment: "+sample,
    return_tensors="pt"
).input_ids
output = model.generate(input_ids=input_ids.to("cuda"))
print(tokenizer.decode(output[0]))

<pad> <extra_id_0> comment?</s>


---
### Fine-Tuning on Training Data

Now we will fine-tune the T5 model to perform toxic comment classification. The main design decision we need to make is how we want to format the labels that our model will predict as a text sequence that T5 can output. Since this is a multi-label classification problem, we will produce a text string containing the names of all predicted labels. For example, if the comment is flagged as toxic, obscene, and an insult, our model will output the string `"toxic obscene insult"`. If it is only flagged as toxic, we will only output `"toxic"`. If the comment is not flagged with any of the labels, we will output the text string `"not_toxic"`.

In addition to designing our labels, we will also select a prefix for this task to keep a consistent style with the pre-trained T5 model. For our task, we will use the prefix `toxic comment classification:`.
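The label format and prefix described above can be sketched as a plain function, independent of the tokenizer and model. The helper name `build_example` and the toy inputs are assumptions made for illustration; the category order, the prefix, and the `"not_toxic"` fallback follow the text.

```python
# Category names, in the order they will appear in the target string
CATEGORIES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
PREFIX = "toxic comment classification: "


def build_example(comment_text, flags):
    """
    Build the (input string, target string) pair for one comment.
    flags maps each category name to 0 or 1.
    """
    active = [c for c in CATEGORIES if flags.get(c, 0) == 1]
    # Comments with no flagged category get the single label "not_toxic"
    target = " ".join(active) if active else "not_toxic"
    return PREFIX + comment_text, target


# Hypothetical flag values, used only to show the three cases from the text
print(build_example("example comment", {"toxic": 1, "obscene": 1, "insult": 1}))
print(build_example("example comment", {"toxic": 1}))
print(build_example("example comment", {}))
```

Keeping the category order fixed means that a given set of labels always maps to the same target string, so the model only has to learn one spelling for each combination.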

To fine-tune our model we will first define a PyTorch Dataset class that will be used to fetch and format comments and labels from our training set. This dataset will be used to generate prefixed inputs to the model and multi-label outputs as described above. We will then create a DataLoader that can be used to fetch and iterate over batches of training samples. We will then write a simple training loop that will iterate over batches in a single epoch to produce our fine-tuned model.

In [None]:
class ToxicityDataset(torch.utils.data.Dataset):
    """
    This class is used to manage datasets for toxic content
    classification. It is a standard PyTorch Dataset.
    The items returned by this dataset are dictionaries with
    two keys:
    - input_ids: The token ids of the tokens in the input string
    - labels: The token ids of the tokens in the label string
    """

    def __init__(self, csv_file, tokenizer):
        """
        When initializing the dataset, load the provided CSV
        file as a dataframe and save the provided tokenizer.
        """
        self.df = pd.read_csv(csv_file)
        self.tokenizer = tokenizer

    def __len__(self):
        """
        This method will return the number of samples in the dataset
        """
        return self.df.shape[0]

    def __getitem__(self, idx):
        """
        This method will return a single item from the dataset. The index
        of the desired item is provided as input.
        """

        # List all of the possible toxic labels
        categories = [
            "toxic",
            "severe_toxic",
            "obscene",
            "threat",
            "insult",
            "identity_hate"
        ]

        # Select the sample with the provided index from the dataframe
        row = self.df.iloc[idx]

        # We will build our model inputs and outputs here
        # Inputs will be a tensor containing token indices of the input string
        # Outputs will be a tensor containing token indices of the label string
        labels = []
        toxic = False
        # Loop over categories
        for cat in categories:
            # If the sample is flagged as the current toxic category
            # append the label to a list of labels
            if row[cat] == 1:
                toxic = True
                labels.append(cat)
        # If the sample is not flagged as any toxic category, append
        # the value "not_toxic" to the list of labels
        if toxic is False:
            labels.append("not_toxic")

        # Tokenize the input string
        input_ids = self.tokenizer.encode(
            "toxic comment classification: " + row["comment_text"],
            return_tensors="pt"
        )
        # Tokenize the label output string
        labels = self.tokenizer.encode(
            " ".join(labels),
            return_tensors="pt"
        )

        # Return a dictionary containing the inputs and labels
        return {"input_ids": input_ids[0], "labels": labels[0]}


In [None]:
# Create a dataset object from our training set
dataset = ToxicityDataset("/content/gdrive/MyDrive/T5/assignment_train.csv", tokenizer)

In [None]:
# Create a DataLoader for iterating over batches of training samples

# Create a DataCollator to use with our DataLoader
# This will allow us to create batches of more than one training sample, and
# will appropriately mask and pad inputs in a batch so they have the same lengths
collator = DataCollatorForSeq2Seq(tokenizer)

# Create the DataLoader for our training set
# Here we will use a batch size of 1 because using anything larger
# can lead to out-of-memory issues on a Colab T4 instance
training_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=collator
)

In [None]:
# Create the optimizer that we will use for fine-tuning
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

In [None]:
"""
Here we implement the training loop for fine-tuning our model.
We will fine-tune over a single epoch of our training set using
batches composed of a single training sample
"""

# Get the number of batches, which we will print in status messages
n_batches = len(dataset)

# Loop over a single epoch to fine-tune the model
current_batch = 0
total_loss = 0
for batch in training_loader:

    # Move the current batch to the GPU
    batch = batch.to("cuda")
    # Evaluate the loss on the current batch and perform
    # a single step of parameter optimization.
    output = model(**batch)
    optimizer.zero_grad(set_to_none=True)
    output.loss.backward()
    optimizer.step()

    # Keep a running total of the per-batch loss.
    # We will use this for reporting the mean training loss.
    total_loss += output.loss.item()
    current_batch += 1
    # Every 100 batches, print a status update.
    if current_batch%100 == 0:
        status = "Mean loss: {}. Batch {} of {}".format(
            total_loss/100,
            current_batch,
            n_batches
        )
        print(status)
        total_loss = 0

    # Every 5000 batches, save a model checkpoint in Google Drive
    # Colab can randomly disconnect during long training jobs and
    # this will ensure that we don't lose all of our work.
    if current_batch%5000 == 0:
        model.save_pretrained("/content/gdrive/MyDrive/T5/t5_checkpoint")
        print("Checkpointing model...")

Mean loss: 0.09523613936835318. Batch 100 of 127656
Mean loss: 0.05655968463797763. Batch 200 of 127656
Mean loss: 0.1365324474645604. Batch 300 of 127656
Mean loss: 0.07409224301867652. Batch 400 of 127656
Mean loss: 0.08099154121460743. Batch 500 of 127656
Mean loss: 0.09570899739544984. Batch 600 of 127656
Mean loss: 0.10185960668939514. Batch 700 of 127656
Mean loss: 0.06738146965359192. Batch 800 of 127656
Mean loss: 0.10000457228350569. Batch 900 of 127656
Mean loss: 0.08010242569664115. Batch 1000 of 127656
Mean loss: 0.10016160817074705. Batch 1100 of 127656
Mean loss: 0.05749054300909848. Batch 1200 of 127656
Mean loss: 0.10140042294187879. Batch 1300 of 127656
Mean loss: 0.13222045421360235. Batch 1400 of 127656
Mean loss: 0.0369298497307318. Batch 1500 of 127656
Mean loss: 0.04605511341291276. Batch 1600 of 127656
Mean loss: 0.04686834342141083. Batch 1700 of 127656
Mean loss: 0.09498333913870738. Batch 1800 of 127656
Mean loss: 0.07761606088166445. Batch 1900 of 127656
Mean

Now that we have fine-tuned our model, we will verify that it now produces a reasonable prediction on the toxic comment classification example that we tested with the pre-trained model. Note that we have modified the prefix applied to the input to match the prefix we used when training.

In [None]:
model = model.to("cuda")

input_ids = tokenizer(
    "toxic comment classification: You, sir, are my hero. Any chance you remember what page that's on?",
    return_tensors="pt"
).input_ids
output = model.generate(input_ids=input_ids.to("cuda"))
print(tokenizer.decode(output[0]))

<pad> not_toxic</s>


---

### Evaluation on the Test Set

Now we will evaluate accuracy metrics on the test set. Since this is a multi-label classification problem, we will compute precision, recall and F1-scores on each class individually.

Recall that our model outputs text strings containing all relevant predicted labels. To enable easy evaluation of metrics, we will generate predictions and labels on our test set, then convert the outputs into a dataframe that contains one row for each test sample and columns for each of the predicted and labeled classes.
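The conversion from label strings to indicator columns can be sketched on its own, without the model. This is a minimal illustration under stated assumptions: `to_indicator_row` is a hypothetical helper name, and it deliberately ignores any generated token that is not a known label, which is a design choice of this sketch rather than a description of the function defined below.

```python
# All label names that can appear in a target string
ALL_LABELS = [
    "not_toxic", "toxic", "severe_toxic", "obscene",
    "threat", "insult", "identity_hate",
]


def to_indicator_row(predicted_text, label_text):
    """Turn a predicted string and a ground-truth string into 0/1 indicator columns."""
    row = {f"pred_{name}": 0 for name in ALL_LABELS}
    row.update({f"label_{name}": 0 for name in ALL_LABELS})
    for token in predicted_text.split():
        # Skip anything the model generated that is not a known label
        if token in ALL_LABELS:
            row[f"pred_{token}"] = 1
    for token in label_text.split():
        if token in ALL_LABELS:
            row[f"label_{token}"] = 1
    return row


# Hypothetical strings: the model finds "toxic" but misses "insult"
example_row = to_indicator_row("toxic", "toxic insult")
print(example_row)
```

A list of such rows can be passed directly to `pd.DataFrame` to obtain the table described above.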

In [None]:
# Load the model from the saved checkpoint if needed
# model = T5ForConditionalGeneration.from_pretrained("/content/gdrive/MyDrive/T5/t5_checkpoint/").to("cuda")

In [None]:
# Load the test data as an instance of our Dataset class
test_dataset = ToxicityDataset("/content/gdrive/MyDrive/T5/assignment_test.csv", tokenizer)

In [None]:
def generate_predictions(testing_dataset, tokenizer, model):
    """
    This function will be used to generate predictions on the
    test dataset.
    Inputs:
    - testing_dataset: The ToxicityDataset instance containing samples
        to evaluate.
    - tokenizer: The T5 tokenizer.
    - model: The fine-tuned model that we want to evaluate.
    This function returns a DataFrame containing predictions in a
    format that enables easy computation of accuracy metrics.
    """

    # Create a DataLoader to iterate over our dataset
    collator = DataCollatorForSeq2Seq(tokenizer)
    testing_loader = torch.utils.data.DataLoader(
        testing_dataset,
        batch_size=1,
        shuffle=False,
        collate_fn=collator
    )

    # These are the column names that will be added to the output
    # DataFrame. This contains columns for each predicted and labeled
    # toxicity class.
    column_names = [
        "pred_not_toxic",
        "pred_toxic",
        "pred_severe_toxic",
        "pred_obscene",
        "pred_threat",
        "pred_insult",
        "pred_identity_hate",
        "label_not_toxic",
        "label_toxic",
        "label_severe_toxic",
        "label_obscene",
        "label_threat",
        "label_insult",
        "label_identity_hate"
    ]

    # Loop over all batches and generate labels
    n_batches = len(testing_dataset)
    results = []
    current_batch = 0
    for batch in testing_loader:
        current_batch += 1

        # Send the current batch to the GPU
        batch = batch.to("cuda")
        # Generate the output string for the given input
        output = model.generate(input_ids=batch["input_ids"])
        # Extract the predictions from the generated output.
        # skip_special_tokens=True drops the <pad> and </s> tokens, and does
        # not discard a real token if generation stops before emitting </s>
        predictions = tokenizer.decode(output[0], skip_special_tokens=True).strip().split(" ")
        # Decode the ground-truth labels associated with this sample
        labels = tokenizer.decode(batch["labels"][0], skip_special_tokens=True).strip().split(" ")

        # Build a dictionary containing the comment text and indicators
        # for each predicted and labeled toxicity class. These will be
        # the rows of the dataframe that we output.
        row = {c:0 for c in column_names}
        row["comment_text"] = tokenizer.decode(batch["input_ids"][0])
        for p in predictions:
            row[f"pred_{p}"] = 1
        for l in labels:
            row[f"label_{l}"] = 1
        results.append(row)

        # Every 100 batches provide a status update
        if current_batch%100 == 0:
            status = "Batch {} of {}".format(
                current_batch,
                n_batches
            )
            print(status)

    # Return a dataframe containing all predictions and labels
    return pd.DataFrame(results)

In [None]:
# Run predictions on the test dataset
df_pred = generate_predictions(test_dataset, tokenizer, model)

Batch 100 of 31915
Batch 200 of 31915
Batch 300 of 31915
Batch 400 of 31915
Batch 500 of 31915
Batch 600 of 31915
Batch 700 of 31915
Batch 800 of 31915
Batch 900 of 31915
Batch 1000 of 31915
Batch 1100 of 31915
Batch 1200 of 31915
Batch 1300 of 31915
Batch 1400 of 31915
Batch 1500 of 31915
Batch 1600 of 31915
Batch 1700 of 31915
Batch 1800 of 31915
Batch 1900 of 31915
Batch 2000 of 31915
Batch 2100 of 31915
Batch 2200 of 31915
Batch 2300 of 31915
Batch 2400 of 31915
Batch 2500 of 31915
Batch 2600 of 31915
Batch 2700 of 31915
Batch 2800 of 31915
Batch 2900 of 31915
Batch 3000 of 31915
Batch 3100 of 31915
Batch 3200 of 31915
Batch 3300 of 31915
Batch 3400 of 31915
Batch 3500 of 31915
Batch 3600 of 31915
Batch 3700 of 31915
Batch 3800 of 31915
Batch 3900 of 31915
Batch 4000 of 31915
Batch 4100 of 31915
Batch 4200 of 31915
Batch 4300 of 31915
Batch 4400 of 31915
Batch 4500 of 31915
Batch 4600 of 31915
Batch 4700 of 31915
Batch 4800 of 31915
Batch 4900 of 31915
Batch 5000 of 31915
Batch 510

In [None]:
# Save our predictions to Google Drive
df_pred.to_csv("/content/gdrive/MyDrive/T5/assignment_test_predictions.csv", index=False)

In [None]:
"""
Here we will compute accuracy metrics on each toxicity class.
Specifically we will compute:
- Precision: Among all positive predictions, the fraction that were
    correctly predicted.
- Recall: Among all positives in the ground truth, the fraction that were
    correctly predicted.
- F1-score: Harmonic mean of precision and recall.
- Support: The number of positive instances in the ground-truth.
- Rate: The fraction of positives in the ground truth.
"""

from sklearn.metrics import precision_recall_fscore_support

# Categories that we will report metrics for
categories = [
    "toxic",
    "severe_toxic",
    "obscene",
    "threat",
    "insult",
    "identity_hate"
]

# Loop over categories and print metrics
for c in categories:
    metrics = precision_recall_fscore_support(
        df_pred[f"label_{c}"],
        df_pred[f"pred_{c}"]
    )
    print(c)
    print(f"Precision: {metrics[0][1].round(3)}")
    print(f"Recall: {metrics[1][1].round(3)}")
    print(f"F1-score: {metrics[2][1].round(3)}")
    print(f"Support: {metrics[3][1]}")
    label_category = f"label_{c}"
    print(f"Rate: {df_pred[label_category].mean().round(3)}")
    print()

toxic
Precision: 0.837
Recall: 0.802
F1-score: 0.819
Support: 3079
Rate: 0.096

severe_toxic
Precision: 0.938
Recall: 0.044
F1-score: 0.084
Support: 343
Rate: 0.011

obscene
Precision: 0.959
Recall: 0.479
F1-score: 0.638
Support: 1701
Rate: 0.053

threat
Precision: 0.692
Recall: 0.189
F1-score: 0.298
Support: 95
Rate: 0.003

insult
Precision: 0.875
Recall: 0.432
F1-score: 0.579
Support: 1557
Rate: 0.049

identity_hate
Precision: 0.671
Recall: 0.221
F1-score: 0.332
Support: 249
Rate: 0.008



We achieve a reasonably good F1-score of 82% on the "toxic" class, with well-balanced precision and recall. For other classes, the F1-scores appear to depend on the rate of positive labels associated with the class. This is expected, since rare classes will generally be the most difficult to predict accurately. The categories "severe_toxic" and "threat" occur very infrequently and have the lowest F1-scores (only 1.1% and 0.3% of samples in the test set have these labels, respectively). For the remaining categories, F1-scores increase in line with how frequently each category occurs. For every class other than "toxic", recall is much lower than precision, so the low F1-scores come mainly from missed positives.
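As a quick sanity check on the numbers above, the F1-score can be recomputed from the reported precision and recall using the harmonic-mean definition. This is only a sketch: the inputs are the three-decimal values printed above, so the recomputed F1 can differ from the reported one in the last digit. The helper name `f1_score` is chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the output above
reported = {
    "toxic": (0.837, 0.802, 0.819),
    "severe_toxic": (0.938, 0.044, 0.084),
    "obscene": (0.959, 0.479, 0.638),
    "threat": (0.692, 0.189, 0.298),
    "insult": (0.875, 0.432, 0.579),
    "identity_hate": (0.671, 0.221, 0.332),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1_recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed {f1_recomputed:.3f}, reported {f1_reported}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a class such as "severe_toxic" ends up with a very low F1-score despite its high precision.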

---
### Cross-Dataset Evaluation

Now we will check [the different dataset](https://huggingface.co/datasets/OxAISH-AL-LLM/wiki_toxic) referenced in the assignment to measure our model's generalization capabilities.

In [None]:
# We uploaded this dataset to Google Drive. Load from there.
df_wiki_toxic = pd.read_csv("/content/gdrive/MyDrive/T5/wiki_toxic_test.csv")

In [None]:
# Inspect the first few rows
df_wiki_toxic.head()

Unnamed: 0,id,comment_text,label
0,0001ea8717f6de06,Thank you for understanding. I think very high...,0
1,000247e83dcc1211,:Dear god this site is horrible.,0
2,0002f87b16116a7f,"""::: Somebody will invariably try to add Relig...",0
3,0003e1cccfd5a40a,""" \n\n It says it right there that it IS a typ...",0
4,00059ace3e3e9a53,""" \n\n == Before adding a new product to the l...",0


We see that the schema of this new dataset is a bit different from the toxic comment dataset that we worked with previously. Specifically, there is now only a single ground truth label indicating whether or not the comment is toxic. We will convert this table into a format compatible with our previous dataset so that we can use our existing code to produce predictions. Specifically, we will treat this "label" column as the "toxic" column in our previous dataset, then add dummy columns for all of the other toxic comment classes.

In [None]:
# Rename the "label" column
df_wiki_toxic = df_wiki_toxic.rename(columns={"label": "toxic"})
# Add dummy columns for all other toxic classes
df_wiki_toxic["severe_toxic"] = 0
df_wiki_toxic["obscene"] = 0
df_wiki_toxic["threat"] = 0
df_wiki_toxic["insult"] = 0
df_wiki_toxic["identity_hate"] = 0

In [None]:
# We will just run inference on a sample, since the original dataset is quite large
# Save this sample to Google Drive so that we can load it as a ToxicityDataset
df_wiki_toxic.sample(10000).to_csv("/content/gdrive/MyDrive/T5/wiki_toxic_test_new_schema.csv", index=False)

In [None]:
# Load the dataset for testing
test_dataset = ToxicityDataset("/content/gdrive/MyDrive/T5/wiki_toxic_test_new_schema.csv", tokenizer)

In [None]:
# Generate predictions
df_pred_wiki_toxic = generate_predictions(test_dataset, tokenizer, model)

Batch 100 of 10000
Batch 200 of 10000
Batch 300 of 10000
Batch 400 of 10000
Batch 500 of 10000
Batch 600 of 10000
Batch 700 of 10000
Batch 800 of 10000
Batch 900 of 10000
Batch 1000 of 10000
Batch 1100 of 10000
Batch 1200 of 10000
Batch 1300 of 10000
Batch 1400 of 10000
Batch 1500 of 10000
Batch 1600 of 10000
Batch 1700 of 10000
Batch 1800 of 10000
Batch 1900 of 10000
Batch 2000 of 10000
Batch 2100 of 10000
Batch 2200 of 10000
Batch 2300 of 10000
Batch 2400 of 10000
Batch 2500 of 10000
Batch 2600 of 10000
Batch 2700 of 10000
Batch 2800 of 10000
Batch 2900 of 10000
Batch 3000 of 10000
Batch 3100 of 10000
Batch 3200 of 10000
Batch 3300 of 10000
Batch 3400 of 10000
Batch 3500 of 10000
Batch 3600 of 10000
Batch 3700 of 10000
Batch 3800 of 10000
Batch 3900 of 10000
Batch 4000 of 10000
Batch 4100 of 10000
Batch 4200 of 10000
Batch 4300 of 10000
Batch 4400 of 10000
Batch 4500 of 10000
Batch 4600 of 10000
Batch 4700 of 10000
Batch 4800 of 10000
Batch 4900 of 10000
Batch 5000 of 10000
Batch 510

In [None]:
# Save predictions to Google Drive
df_pred_wiki_toxic.to_csv("/content/gdrive/MyDrive/T5/wiki_toxic_test_predictions.csv", index=False)

Now we will evaluate the performance of our classifier on this dataset. We will only consider the "toxic" category since this is the only label contained in this dataset.


In [None]:
# Categories that we will report metrics for
categories = ["toxic"]

# Loop over categories and print metrics
for c in categories:
    metrics = precision_recall_fscore_support(
        df_pred_wiki_toxic[f"label_{c}"],
        df_pred_wiki_toxic[f"pred_{c}"]
    )
    print(c)
    print(f"Precision: {metrics[0][1].round(3)}")
    print(f"Recall: {metrics[1][1].round(3)}")
    print(f"F1-score: {metrics[2][1].round(3)}")
    print(f"Support: {metrics[3][1]}")
    label_category = f"label_{c}"
    print(f"Rate: {df_pred_wiki_toxic[label_category].mean().round(3)}")
    print()

toxic
Precision: 0.569
Recall: 0.827
F1-score: 0.674
Support: 973
Rate: 0.097



The overall F1-score is lower on this dataset (0.674, compared to 0.819 on our own test set), which is reasonable. It is possible that toxicity was evaluated according to somewhat different conventions in this dataset compared to the dataset that we trained on. Specifically, here we see that recall is still high (0.827) but precision is now lower (0.569, compared to 0.837). This suggests that this dataset might label fewer comments as toxic than the dataset that we trained on. That is, when our model predicts that a comment is toxic based on its training, that comment is less likely to be labeled as toxic in this dataset than in our previous test set. However, our model is still capable of accurately predicting comments that were labeled as toxic in this dataset.
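To see what the precision and recall figures above mean in terms of individual comments, the approximate confusion counts can be backed out from the reported metrics and support. This is a rough sketch: the inputs are rounded to three decimals, so the resulting counts are estimates rather than the exact values from the predictions file. The helper name `approximate_counts` is chosen for illustration.

```python
def approximate_counts(precision, recall, support):
    """
    Estimate confusion counts for the positive class from rounded metrics.
    recall = TP / support and precision = TP / (TP + FP).
    """
    true_positives = recall * support
    predicted_positives = true_positives / precision
    return {
        "true_positives": round(true_positives),
        "false_positives": round(predicted_positives - true_positives),
        "false_negatives": round(support - true_positives),
    }


# Values reported above for the "toxic" class on the sampled wiki_toxic data
counts = approximate_counts(precision=0.569, recall=0.827, support=973)
print(counts)
```

The estimate makes the precision gap concrete: the number of comments flagged by the model but not labeled toxic in this dataset is of the same order as the number of correctly flagged comments.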

---
### Training a Model from Scratch

Now we will repeat the process of training and evaluating a model, but we will do it this time with no pre-training. To re-create the process of training a model from scratch, we will load the T5-small model then reinitialize its weights to random values. This will allow us to train a model with the same architecture as T5, but will undo the effects of any pre-training already performed.

In [None]:
# Reload the T5-small model
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [None]:
def reset_parameters(model):
    """
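    This function will recursively move through a model's
    layers and reset the weights on any that can be reset.
    Note: a layer that has no reset_parameters method and no
    child layers is left unchanged and keeps its loaded weights.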
    """
    for layer in model.children():
        if hasattr(layer, 'reset_parameters'):
            layer.reset_parameters()
        else:
            reset_parameters(layer)

In [None]:
# Reset the parameters of the entire T5 model
reset_parameters(model)

In [None]:
# Move the model to the GPU
model = model.to("cuda")

Next let's verify that the model now produces random output on a sentiment classification input that the pretrained model could handle sensibly.

In [None]:
# Evaluate a previous sentiment classification example
input_ids = tokenizer(
    "sst2 sentence: The movie was too long to be interesting.",
    return_tensors="pt"
).input_ids
output = model.generate(input_ids=input_ids.to("cuda"))
print(tokenizer.decode(output[0]))

<pad>happiness transfér transfér transfér transfér transfér transfér transfér transfér transfér transfér transfér transfér transfér transfér transfér transfér transfér transfér


In [None]:
# Load a dataset for our training data
dataset = ToxicityDataset("/content/gdrive/MyDrive/T5/assignment_train.csv", tokenizer)

# Create a DataLoader for the training data
collator = DataCollatorForSeq2Seq(tokenizer)
training_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=collator
)

In [None]:
# Create the optimizer that we will use for training the model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

In [None]:
"""
Here we implement the training loop for training our model.
This is essentially equivalent to the training loop we used
when fine-tuning the model. The only difference is that the model's
initial weights are random, rather than the pre-trained T5 weights.
"""

# Get the number of batches, which we will print in status messages
n_batches = len(dataset)

# Loop over batches
current_batch = 0
total_loss = 0
for batch in training_loader:

    # Move the current batch to the GPU
    batch = batch.to("cuda")
    # Evaluate the loss on the current batch and perform
    # a single step of parameter optimization.
    output = model(**batch)
    optimizer.zero_grad(set_to_none=True)
    output.loss.backward()
    optimizer.step()

    # Give a status update every 100 iterations
    total_loss += output.loss.item()
    current_batch += 1
    if current_batch%100 == 0:
        status = "Mean loss: {}. Batch {} of {}".format(
            total_loss/100,
            current_batch,
            n_batches
        )
        print(status)
        total_loss = 0

    # Save a checkpoint every 5000 iterations
    if current_batch%5000 == 0:
        model.save_pretrained("/content/gdrive/MyDrive/T5/t5_no_pretrain_checkpoint")
        print("Checkpointing model...")

Mean loss: 10.02941198348999. Batch 100 of 127656
Mean loss: 9.795738954544067. Batch 200 of 127656
Mean loss: 9.56429648399353. Batch 300 of 127656
Mean loss: 9.289384927749634. Batch 400 of 127656
Mean loss: 9.010039625167847. Batch 500 of 127656
Mean loss: 8.74532172203064. Batch 600 of 127656
Mean loss: 8.468506603240966. Batch 700 of 127656
Mean loss: 8.13690592288971. Batch 800 of 127656
Mean loss: 7.853899726867676. Batch 900 of 127656
Mean loss: 7.536981410980225. Batch 1000 of 127656
Mean loss: 7.196212720870972. Batch 1100 of 127656
Mean loss: 6.862560253143311. Batch 1200 of 127656
Mean loss: 6.544826788902283. Batch 1300 of 127656
Mean loss: 6.167839441299439. Batch 1400 of 127656
Mean loss: 5.625668239593506. Batch 1500 of 127656
Mean loss: 5.227845764160156. Batch 1600 of 127656
Mean loss: 4.759560990333557. Batch 1700 of 127656
Mean loss: 4.555448789596557. Batch 1800 of 127656
Mean loss: 3.9650766611099244. Batch 1900 of 127656
Mean loss: 3.7731185245513914. Batch 2000 

Next, we will evaluate this newly trained model on the first test set on which we evaluated our fine-tuned model.

In [None]:
# Load the test dataset
test_dataset = ToxicityDataset("/content/gdrive/MyDrive/T5/assignment_test.csv", tokenizer)

In [None]:
# Generate predictions with the new model
model = model.to("cuda")
df_pred = generate_predictions(test_dataset, tokenizer, model)

Batch 100 of 31915
Batch 200 of 31915
Batch 300 of 31915
Batch 400 of 31915
Batch 500 of 31915
Batch 600 of 31915
Batch 700 of 31915
Batch 800 of 31915
Batch 900 of 31915
Batch 1000 of 31915
Batch 1100 of 31915
Batch 1200 of 31915
Batch 1300 of 31915
Batch 1400 of 31915
Batch 1500 of 31915
Batch 1600 of 31915
Batch 1700 of 31915
Batch 1800 of 31915
Batch 1900 of 31915
Batch 2000 of 31915
Batch 2100 of 31915
Batch 2200 of 31915
Batch 2300 of 31915
Batch 2400 of 31915
Batch 2500 of 31915
Batch 2600 of 31915
Batch 2700 of 31915
Batch 2800 of 31915
Batch 2900 of 31915
Batch 3000 of 31915
Batch 3100 of 31915
Batch 3200 of 31915
Batch 3300 of 31915
Batch 3400 of 31915
Batch 3500 of 31915
Batch 3600 of 31915
Batch 3700 of 31915
Batch 3800 of 31915
Batch 3900 of 31915
Batch 4000 of 31915
Batch 4100 of 31915
Batch 4200 of 31915
Batch 4300 of 31915
Batch 4400 of 31915
Batch 4500 of 31915
Batch 4600 of 31915
Batch 4700 of 31915
Batch 4800 of 31915
Batch 4900 of 31915
Batch 5000 of 31915
Batch 510

In [None]:
# Save predictions to Google Drive
df_pred.to_csv("/content/gdrive/MyDrive/T5/assignment_test_predictions_no_pretrain.csv", index=False)

In [None]:
# Compute accuracy metrics on each toxicity class

# Categories that we will report metrics for
categories = [
    "toxic",
    "severe_toxic",
    "obscene",
    "threat",
    "insult",
    "identity_hate"
]

# Loop over categories and print metrics
for c in categories:
    label_category = f"label_{c}"
    # precision_recall_fscore_support takes the true labels first, then the predictions.
    # Each returned array has one entry per class; index 1 is the positive class.
    metrics = precision_recall_fscore_support(
        df_pred[label_category],
        df_pred[f"pred_{c}"]
    )
    print(c)
    print(f"Precision: {metrics[0][1].round(3)}")
    print(f"Recall: {metrics[1][1].round(3)}")
    print(f"F1-score: {metrics[2][1].round(3)}")
    print(f"Support: {metrics[3][1]}")
    print(f"Rate: {df_pred[label_category].mean().round(3)}")
    print()

toxic
Precision: 0.693
Recall: 0.807
F1-score: 0.746
Support: 3079
Rate: 0.096

severe_toxic
Precision: 0.0
Recall: 0.0
F1-score: 0.0
Support: 343
Rate: 0.011

obscene
Precision: 0.805
Recall: 0.664
F1-score: 0.727
Support: 1701
Rate: 0.053

threat
Precision: 0.0
Recall: 0.0
F1-score: 0.0
Support: 95
Rate: 0.003

insult
Precision: 0.688
Recall: 0.62
F1-score: 0.652
Support: 1557
Rate: 0.049

identity_hate
Precision: 0.0
Recall: 0.0
F1-score: 0.0
Support: 249
Rate: 0.008



The F1-scores for this model are surprisingly good given the lack of pretraining. The table below compares them with those of the fine-tuned pretrained model:

| Class | F1 with pretraining | F1 with no pretraining |
| :--- | :--- | :--- |
| toxic | 0.819 | 0.746 |
| severe_toxic | 0.084 | 0.0 |
| obscene | 0.638 | 0.727 |
| threat | 0.298 | 0.0 |
| insult | 0.579 | 0.652 |
| identity_hate | 0.332 | 0.0 |

The most common class, "toxic", has a lower F1-score without pretraining, but F1-scores are actually higher on the next two most common classes, "obscene" and "insult". The three remaining classes have an F1-score of zero because the model never predicts them. This is likely an area where pretraining helped: the first model learned to at least attempt predictions for these classes after exposure to only a few instances in the training set.
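To see why a class that is never predicted ends up with an F1-score of exactly zero, here is a minimal sketch that computes the positive-class metrics by hand. The function name `positive_class_metrics` and the toy labels are invented for illustration and are not part of the notebook. Undefined ratios (0/0) are reported as 0.0, which mirrors scikit-learn's default behaviour of returning 0 with a warning.

```python
def positive_class_metrics(labels, preds):
    """Return (precision, recall, F1) for the positive class (label 1)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives

    # Report 0.0 whenever a ratio would be 0/0 (an assumption chosen to
    # match scikit-learn's default handling of undefined metrics).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented purely for illustration
toy_labels = [1, 0, 1, 0, 0, 0]
never_positive = [0, 0, 0, 0, 0, 0]      # model never predicts the class
sometimes_positive = [1, 1, 0, 0, 0, 0]  # one hit, one false alarm, one miss

print(positive_class_metrics(toy_labels, never_positive))
print(positive_class_metrics(toy_labels, sometimes_positive))
```

With no positive predictions there are no true positives, so recall is zero and precision is undefined. This matches the all-zero rows reported for "severe_toxic", "threat" and "identity_hate".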

Training from scratch was also far less efficient: the training loss decreased much more slowly than when fine-tuning the pretrained model. For example, when fine-tuning the pretrained model, the mean training loss fell below 0.1 after 100 iterations and below 0.06 after 200 iterations. When training from scratch, a mean loss below 0.1 was not observed until after 19,200 iterations, and a mean loss below 0.06 was not observed until after 45,500 iterations.
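Threshold crossings like these can be read off the training log programmatically. The sketch below assumes the status lines have the `Mean loss: ... Batch N of M` format printed by the training loop. The helper name `first_batch_below` is hypothetical, and the sample log contains only the first three status lines from the run above, so the thresholds in the example are chosen to suit that excerpt.

```python
import re

# Matches status lines such as "Mean loss: 9.79. Batch 200 of 127656"
STATUS_RE = re.compile(
    r"Mean loss: (\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\. Batch (\d+) of (\d+)"
)


def first_batch_below(log_text, threshold):
    """Return the first logged batch whose mean loss is below threshold, or None."""
    for line in log_text.splitlines():
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip truncated or unrelated lines
        loss = float(match.group(1))
        batch = int(match.group(2))
        if loss < threshold:
            return batch
    return None


# First three status lines of the from-scratch training run
sample_log = """\
Mean loss: 10.02941198348999. Batch 100 of 127656
Mean loss: 9.795738954544067. Batch 200 of 127656
Mean loss: 9.56429648399353. Batch 300 of 127656
"""

print(first_batch_below(sample_log, 9.6))  # first batch with mean loss < 9.6
print(first_batch_below(sample_log, 0.1))  # not reached in this excerpt
```

Because each status line reports a mean over the preceding 100 batches, the result is only accurate to the nearest 100 iterations.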