# Pretrained One-vs-Rest Classifier

This notebook implements a one-vs-rest classifier: it fine-tunes a separate BERT model for each label to tell whether a sentence contains problematic metaphors. Each model starts from our BERT model, which was pretrained on a large corpus of data.
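
As a rough illustration of the one-vs-rest idea, the sketch below combines several independent binary classifiers into one multi-label prediction. The `predict_labels` helper and the keyword-based `toy_classifiers` are hypothetical stand-ins for the fine-tuned BERT models and are not part of this notebook's pipeline.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a fine-tuned per-label model: maps a sentence to True/False.
BinaryClassifier = Callable[[str], bool]


def predict_labels(text: str, classifiers: Dict[str, BinaryClassifier]) -> List[str]:
    """Run every per-label binary classifier and collect the labels that fire."""
    return [label for label, classifier in classifiers.items() if classifier(text)]


# Toy keyword rules, used only to make the example runnable.
toy_classifiers = {
    "agency": lambda text: "fighting" in text,
    "hyperbole": lambda text: "catastrophe" in text,
}

print(predict_labels("How the AI industry profits from catastrophe", toy_classifiers))
# prints ['hyperbole']
```

Because every label has its own model, a sentence can receive zero, one, or several labels.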

<div hidden>
TODO: extend data3/data.json with better data in the same format that actually makes sense.
</div>

## Imports and Setup

In [1]:
%pip install transformers -Uqq
%pip install scikit-learn -Uqq
%pip install datasets -Uqq
%pip install torch -Uqq
%pip install numpy -Uqq
%pip install evaluate -Uqq

Note: you may need to restart the kernel to use updated packages.


In [2]:
import evaluate
import numpy as np
import torch
from datasets import Dataset, load_dataset
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)
import os
from itertools import product

## Loading Dataset

In [3]:
dataset = load_dataset("json", data_files="data/data.json", field="data")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'agency', 'humanComparison', 'hyperbole', 'historyComparison', 'unjustClaims', 'deepSounding', 'sceptics', 'deEmphasize', 'performanceNumber', 'inscrutable'],
        num_rows: 714
    })
})

In [4]:
dataset["train"][0:3]

{'text': ['A new vision of artificial intelligence for the people',
  'The gig workers fighting back against the algorithms',
  'How the AI industry profits from catastrophe'],
 'agency': [False, True, False],
 'humanComparison': [False, True, False],
 'hyperbole': [False, True, True],
 'historyComparison': [False, False, False],
 'unjustClaims': [False, False, False],
 'deepSounding': [False, False, False],
 'sceptics': [False, False, False],
 'deEmphasize': [False, False, False],
 'performanceNumber': [False, False, False],
 'inscrutable': [False, False, False]}
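
Since most labels are `False` in the rows above, it can help to check how often each label is positive before training. The sketch below assumes the columnar dict-of-lists layout shown above; `positive_rate` is a hypothetical helper, and `sample` copies only three of the label columns from the printed rows.

```python
from typing import Dict, List


def positive_rate(columns: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of rows marked True for each label column."""
    return {name: sum(values) / len(values) for name, values in columns.items()}


# Three of the label columns from the rows printed above.
sample = {
    "agency": [False, True, False],
    "humanComparison": [False, True, False],
    "hyperbole": [False, True, True],
}

rates = positive_rate(sample)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

On the full dataset the same function could be applied to every feature except `text`; a low positive rate for a label is a hint that accuracy alone will be a misleading metric for it.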

In [5]:
labels = [label for label in dataset["train"].features.keys() if label not in ["text"]]

num_epochs = {
    "agency": 10,
    "humanComparison": 2,
    "hyperbole": 2,
    "historyComparison": 2,
    "unjustClaims": 5,
    "deepSounding": 2,
    "sceptics": 2,
    "deEmphasize": 7,
    "performanceNumber": 2,
    "inscrutable": 2,
}

labels

['agency',
 'humanComparison',
 'hyperbole',
 'historyComparison',
 'unjustClaims',
 'deepSounding',
 'sceptics',
 'deEmphasize',
 'performanceNumber',
 'inscrutable']

## Preprocess Data, Create Train/Test Split

In [6]:
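processed_dataset = {}
for label in labels:
    # Keep only "text" and the current label, so each label gets its own binary dataset.
    projected_dataset = (
        dataset["train"]
        .map(remove_columns=[other for other in labels if other != label])
        .rename_column(label, "labels")
        .class_encode_column("labels")  # stratify_by_column needs a ClassLabel column
    )
    # Stratify so that train and test keep the same share of positive examples.
    processed_dataset[label] = projected_dataset.train_test_split(
        test_size=0.2, stratify_by_column="labels"
    )
    # print(f"{label}:\n\t{processed_dataset[label]['test'][0:3]}\n")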

processed_dataset

{'agency': DatasetDict({
     train: Dataset({
         features: ['text', 'labels'],
         num_rows: 571
     })
     test: Dataset({
         features: ['text', 'labels'],
         num_rows: 143
     })
 }),
 'humanComparison': DatasetDict({
     train: Dataset({
         features: ['text', 'labels'],
         num_rows: 571
     })
     test: Dataset({
         features: ['text', 'labels'],
         num_rows: 143
     })
 }),
 'hyperbole': DatasetDict({
     train: Dataset({
         features: ['text', 'labels'],
         num_rows: 571
     })
     test: Dataset({
         features: ['text', 'labels'],
         num_rows: 143
     })
 }),
 'historyComparison': DatasetDict({
     train: Dataset({
         features: ['text', 'labels'],
         num_rows: 571
     })
     test: Dataset({
         features: ['text', 'labels'],
         num_rows: 143
     })
 }),
 'unjustClaims': DatasetDict({
     train: Dataset({
         features: ['text', 'labels'],
         num_rows: 571
     })
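
The split sizes printed above follow from the 714 rows and `test_size=0.2`. The sketch below reproduces the arithmetic under the assumption that the test split is rounded up, which is consistent with the 571/143 split shown; `split_sizes` is a hypothetical helper, not a `datasets` function.

```python
import math
from typing import Tuple


def split_sizes(num_rows: int, test_size: float) -> Tuple[int, int]:
    """Return (train_rows, test_rows), rounding the test split up."""
    test_rows = math.ceil(num_rows * test_size)
    return num_rows - test_rows, test_rows


print(split_sizes(714, 0.2))
# prints (571, 143)
```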
  

In [7]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def preprocess_data(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
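
To make the effect of `padding="max_length"` and `truncation=True` concrete, the sketch below imitates fixed-length padding on a list of token ids. It is a simplification with made-up ids: `pad_or_truncate` is a hypothetical helper, and a real tokenizer also handles special tokens when truncating.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut long inputs, right-pad short ones, and mask out the padding positions."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * padding,
        "attention_mask": [1] * len(kept) + [0] * padding,
    }


# Made-up token ids, for illustration only.
encoded = pad_or_truncate([101, 1109, 2061, 102], max_length=8)
print(encoded["input_ids"])       # prints [101, 1109, 2061, 102, 0, 0, 0, 0]
print(encoded["attention_mask"])  # prints [1, 1, 1, 1, 0, 0, 0, 0]
```

Padding every short headline to the model's maximum length spends most of the compute on padding tokens. Dynamic per-batch padding, for example with `DataCollatorWithPadding` from `transformers`, is a common alternative that may be worth trying here.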

In [8]:
tokenized_dataset = {
    k: ds.map(
        preprocess_data,
        remove_columns="text",
        batched=True,
    )
    for k, ds in processed_dataset.items()
}

### Verify dataset

In [9]:
example = tokenized_dataset["agency"]["train"][0]
print(example.keys())

dict_keys(['labels', 'input_ids', 'token_type_ids', 'attention_mask'])


In [10]:
tokenizer.decode(example["input_ids"])

'[CLS] The White House just unveiled a new AI Bill of Rights [SEP] [PAD] [PAD] [PAD] ... [PAD]'

In [11]:
example["labels"]

0

## Load Pre-Trained Model

In [12]:
# The model is instantiated per run inside the training loop, so this cell is kept
# only for reference.
# model = AutoModelForSequenceClassification.from_pretrained(
#     "bert-base-cased",
#     num_labels=2,
# )

### Verify data-model interaction

In [13]:
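# Forward pass on a single example (needs the `model` from the commented-out cell above).
# The model expects batched tensors, so the example is wrapped in a batch of size 1.
# sample = tokenized_dataset[labels[0]]["train"][0]
# outputs = model(
#     input_ids=torch.tensor([sample["input_ids"]]),
#     attention_mask=torch.tensor([sample["attention_mask"]]),
#     labels=torch.tensor([sample["labels"]]),
# )
# outputs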

## Define Metrics

In [14]:
metrics = {
    "accuracy": evaluate.load("accuracy"),
    "precision": evaluate.load("precision"),
    "recall": evaluate.load("recall"),
    "f1": evaluate.load("f1"),
}

In [15]:
def compute_metrics(eval_pred):
    """Compute every metric in `metrics` from the Trainer's evaluation output."""
    logits, references = eval_pred
    # The predicted class is the one with the largest logit.
    predictions = np.argmax(logits, axis=-1)
    values = {}

    for name, metric in metrics.items():
        # Each metric returns a single-entry dict, e.g. {"f1": 0.48}.
        result = metric.compute(predictions=predictions, references=references)
        values[name] = next(iter(result.values()))

    return values
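
For reference, the four metrics can be written out by hand for the binary case. The sketch below is a dependency-free illustration on made-up logits, not a replacement for the `evaluate` metrics; `argmax` and `binary_metrics` are hypothetical helpers.

```python
from typing import Dict, List


def argmax(row: List[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def binary_metrics(logits: List[List[float]], references: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with class 1 as the positive class."""
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    tn = sum(p == 0 and r == 0 for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(references),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up logits for four sentences; the predicted classes are 0, 1, 0, 1.
toy_logits = [[2.0, -1.0], [0.1, 0.9], [0.3, 0.2], [-0.5, 1.5]]
toy_references = [0, 1, 1, 0]
toy_scores = binary_metrics(toy_logits, toy_references)
print(toy_scores)
```

With one true positive, one false positive, one false negative, and one true negative, all four scores come out to 0.5.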

In [16]:
class HighPrecisionTrainer(Trainer):
    """A trainer that computes a weighted squared-error loss between the logits and the
    one-hot labels. The error on the positive-class logit is weighted 300 times more than
    the error on the negative-class logit, which is intended to lead to a higher precision.
    """

    # **kwargs absorbs extra arguments that newer transformers versions may pass.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # one-hot targets for every example in the batch, on the same device as the logits
        targets = torch.nn.functional.one_hot(labels, num_classes=2).to(logits.dtype)
        # per-class weights: [negative, positive]
        weights = torch.tensor([[1.0, 300.0]], device=logits.device)
        loss = torch.sum(weights * (logits - targets) ** 2)
        return (loss, outputs) if return_outputs else loss
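
To see what the `[1, 300]` weighting does numerically, the sketch below evaluates the same kind of weighted squared error without PyTorch. It assumes the loss is meant to be summed over every example in the batch; `weighted_squared_error` is a hypothetical helper and the logits are made up.

```python
from typing import List, Sequence


def weighted_squared_error(
    logits: List[List[float]],
    labels: List[int],
    weights: Sequence[float] = (1.0, 300.0),
) -> float:
    """Sum of weights[c] * (logit[c] - one_hot[c]) ** 2 over all examples and classes."""
    total = 0.0
    for row, label in zip(logits, labels):
        targets = [1.0 if cls == label else 0.0 for cls in range(len(row))]
        total += sum(w * (x - t) ** 2 for w, x, t in zip(weights, row, targets))
    return total


# One positive example whose positive-class logit is 0.3 short of the target:
# 1 * 0.2**2 + 300 * (0.7 - 1)**2, which is roughly 27.04.
toy_loss = weighted_squared_error([[0.2, 0.7]], [1])
print(round(toy_loss, 2))
```

Almost all of the loss comes from the second logit, so training is driven mainly by how close the positive-class output is to its target.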

## Train the Model

In [17]:
# TODO: increase if we have more data
# Metric used to pick the best checkpoint at the end of training.
metric_name = "f1"
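
The training cell that follows sweeps over two base models, two weight-decay values, and five batch sizes, and gives larger batch sizes more epochs. The sketch below only enumerates those configurations for the `agency` label, without training anything; `base_epochs` mirrors `num_epochs["agency"]`, and the `runs` list is a hypothetical helper structure.

```python
from itertools import product

model_names = ["xt0r3/aihype_article_bert_fine_tune", "bert-base-cased"]
weight_decays = [0.0, 0.01]
base_epochs = 10  # num_epochs["agency"] in this notebook

runs = [
    {
        "model": model_name,
        "weight_decay": weight_decay,
        "batch_size": 2 ** i,
        "epochs": base_epochs + 2 * i,
    }
    for model_name, weight_decay in product(model_names, weight_decays)
    for i in range(5)
]

print(len(runs))  # prints 20
print(sorted({(run["batch_size"], run["epochs"]) for run in runs}))
# prints [(1, 10), (2, 12), (4, 14), (8, 16), (16, 18)]
```

Enumerating the grid first makes the cost visible: 20 fine-tuning runs for a single label.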

In [19]:
# Grid search over base model, weight decay, and batch size (1, 2, 4, 8, 16).
for model_name, weight_decay in product(['xt0r3/aihype_article_bert_fine_tune', 'bert-base-cased'], [0.0, 0.01]):
    for i in range(0, 5):
        batch_size = 2 ** i
        print(f'batch size: {batch_size}\nmodel: {model_name}\nweight decay: {weight_decay}')

        for label in ['agency']:  # labels:
            print(f"training model for {label}")

            model = AutoModelForSequenceClassification.from_pretrained(
                model_name,
                num_labels=2,
            )

            training_args = TrainingArguments(
                f"{model_name.replace('/', '_')}_{label}-vs-rest",
                evaluation_strategy="epoch",
                save_strategy="epoch",
                learning_rate=2e-5,
                per_device_train_batch_size=batch_size,
                per_device_eval_batch_size=batch_size,
                num_train_epochs=num_epochs[label] + i * 2,
                weight_decay=weight_decay,
                report_to="none",
                load_best_model_at_end=True,
                metric_for_best_model=metric_name,
                save_total_limit=1,
                # push_to_hub=True,  # TODO: enable once model seems good
            )

            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=tokenized_dataset[label]["train"],
                eval_dataset=tokenized_dataset[label]["test"],
                compute_metrics=compute_metrics,
            )

            trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | No log | 0.483678 | 0.762238 | 0.5 | 0.117647 | 0.190476 |
| 2 | No log | 0.730268 | 0.727273 | 0.439024 | 0.529412 | 0.48 |
| 3 | No log | 1.143035 | 0.769231 | 0.518519 | 0.411765 | 0.459016 |
| 4 | 0.360300 | 1.535273 | 0.741259 | 0.428571 | 0.264706 | 0.327273 |
| 5 | 0.360300 | 1.611903 | 0.769231 | 0.518519 | 0.411765 | 0.459016 |
| 6 | 0.360300 | 1.716453 | 0.769231 | 0.521739 | 0.352941 | 0.421053 |
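
With `load_best_model_at_end=True` and `metric_for_best_model="f1"`, the checkpoint with the highest evaluation F1 is the one kept at the end of a run. The sketch below applies that selection rule by hand to the F1 column of the log above; the `f1_by_epoch` dict is just a copy of those values.

```python
# F1 per epoch, copied from the training log above.
f1_by_epoch = {1: 0.190476, 2: 0.48, 3: 0.459016, 4: 0.327273, 5: 0.459016, 6: 0.421053}

# Pick the epoch with the highest F1.
best_epoch = max(f1_by_epoch, key=f1_by_epoch.get)
print(best_epoch, f1_by_epoch[best_epoch])  # prints 2 0.48
```

In this run the best F1 is reached at epoch 2, while the validation loss keeps rising in every later epoch, which suggests the model starts to overfit early.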


## Upload the Model

In [None]:
# agency-vs-rest/checkpoint-263: 0.75 precision, 0.85 recall

In [None]:
# trainer.push_to_hub()