# One-vs-Rest Classifier

This notebook implements a one-vs-rest classifier that fine-tunes several BERT models to detect whether a sentence contains problematic metaphors.

<div hidden>
TODO: extend data3/data.json with better data, in the same format, that actually makes sense.
</div>
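
A one-vs-rest setup turns a multi-class problem into one binary problem per class, each handled by its own model. The following is a minimal sketch of how the binary label lists could be built, assuming each example carries exactly one class. The helper `one_vs_rest_labels` and every class name except `agency` are hypothetical and only serve as illustration.

```python
def one_vs_rest_labels(labels, classes):
    """Return one binary label list per class: 1 where the example has that class, else 0."""
    return {cls: [1 if label == cls else 0 for label in labels] for cls in classes}


# Hypothetical label set; the notebook's real categories are not shown here.
classes = ["agency", "other", "none"]
labels = ["agency", "none", "other", "agency"]

binary_tasks = one_vs_rest_labels(labels, classes)
print(binary_tasks["agency"])  # [1, 0, 0, 1]
```

Each entry of `binary_tasks` would then be the target for one fine-tuned model.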

## Imports and Setup

In [None]:
!pip install transformers -Uqq
!pip install scikit-learn -Uqq
!pip install datasets -Uqq
!pip install torch -Uqq
!pip install numpy -Uqq
!pip install evaluate -Uqq

In [21]:
import evaluate
import numpy as np
import torch
from datasets import Dataset, load_dataset
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from transformers import (
    AutoModelForNextSentencePrediction,
    AutoTokenizer,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)

In [22]:
MODEL_NAME = "one-vs-rest-bert"

## Loading Dataset

In [23]:
dataset = load_dataset("json", data_files="data/data_unlab_clean.json", field="data")
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 8284
    })
})
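
The `field="data"` argument tells the JSON loader to read the records stored under the top-level `data` key. The sketch below shows a file layout that would yield a single `text` feature of token lists, using only the standard library. The file name `sample.json` and the sample tokens are assumptions, not the contents of the real data file.

```python
import json
import os
import tempfile

# Assumed layout: a top-level "data" key holding records with a tokenized "text" field.
sample = {
    "data": [
        {"text": ["A", "new", "AI", "centre", "."]},
        {"text": ["Labor", "wins", "."]},
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    with open(path, encoding="utf-8") as f:
        records = json.load(f)["data"]  # the list that field="data" points at

print(len(records), list(records[0].keys()))
```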

In [24]:
dataset["train"][0:3]

{'text': [['By', 'Australian', 'Associated', 'Press', 'Published', ':', '06:13', ',', '23', 'April', '2019',
   '|', 'Updated', ':', '06:13', ',', '23', 'April', '2019', 'A', 'new', 'artificial', 'intelligence',
   '(', 'AI', ')', 'centre', 'of', 'excellence', 'will', 'be', 'tasked', 'with', 'preventing', 'another',
   '``', 'robodebt', "''", 'disaster', 'if', 'Labor', 'wins', 'the', 'federal', 'election', '.',
   'The', 'AI', 'centre', 'will', 'be', 'based', 'in', 'Melbourne', 'with', 'a', '$', '3', 'million',
   'commitment', 'from', 'Labor', 'and', '$', '1', 'million', 'from', 'the', 'Victorian', 'government', '.',
   'Labor', "'s", 'digital', 'technology', 'spokesman', 'Ed', 'Husic', 'said', 'the', 'centre', 'will',
   'devise', 'a', 'strategy', 'for',
In [25]:
num_epochs = 30

## Preprocess Data, Create Train/Test Split

In [26]:
processed_dataset = dataset["train"].train_test_split(test_size=0.2)
processed_dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 6627
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1657
    })
})
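
The split sizes above can be checked by hand. The sketch below assumes the test split is rounded up and the training split takes the remainder, which is consistent with the counts shown (1657 and 6627 out of 8284 rows).

```python
import math

num_rows = 8284
test_size = 0.2

# Assumption: the test split is rounded up and the train split takes the remainder.
num_test = math.ceil(test_size * num_rows)
num_train = num_rows - num_test

print(num_train, num_test)  # 6627 1657
```

Note that `train_test_split` shuffles by default, so the rows in each split differ between runs unless a seed is fixed.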

In [65]:
joined_dataset = processed_dataset.map(lambda entry: {"text": " ".join(entry["text"])})
joined_dataset["test"][0]

In [29]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [68]:
def preprocess_data(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
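
With `padding="max_length"` and `truncation=True`, every example ends up with the same number of token ids: long texts are cut, short ones are padded, and the attention mask marks which positions hold real tokens. The sketch below illustrates the idea in plain Python. The helper `pad_or_truncate`, the toy ids and `pad_id=0` are assumptions, and the sketch ignores special tokens (the real tokenizer keeps the closing `[SEP]` when it truncates).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, then right-pad with pad_id.

    Returns the fixed-length ids and the matching attention mask.
    """
    kept = token_ids[:max_length]
    num_padding = max_length - len(kept)
    input_ids = kept + [pad_id] * num_padding
    attention_mask = [1] * len(kept) + [0] * num_padding
    return input_ids, attention_mask


# Toy ids, not real vocabulary entries.
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_ids, short_mask)  # [7, 8, 9, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Since the articles in this dataset are long, many of them are likely to be truncated, so only the beginning of each text reaches the model.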

In [69]:
tokenized_dataset = joined_dataset.map(
    preprocess_data,
    remove_columns="text",
    batched=True,
)

### Verify dataset

In [74]:
example = tokenized_dataset["train"][3]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


In [75]:
tokenizer.decode(example["input_ids"])

"[CLS] By Associated Press Published : 18 : 40, 19 August 2019 | Updated : 19 : 00, 19 August 2019 PITTSBURGH ( AP ) - When celebrity chef Lidia Bastianich decided to open a restaurant in Pittsburgh's Strip District in 2001, she arrived in a neighborhood filled with warehouses and factories. This narrow stretch of streets in the shadow of the city's downtown office towers had long been home to food purveyors like Wholey's Fish Market and the Pennsylvania Macaroni Company, known to locals simply as Penn Mac. But a high - end restaurant helmed by a James Beard award - winning chef? That wasn't something anyone expected. Nearly two decades later, as Bastianich's eponymous Pittsburgh restaurant is set to close in September, the neighborhood around it has changed dramatically. Along what is now called Robotics Row, tech startups vie for office space in new buildings while Argo AI tests autonomous cars. In the process, Pittsburgh's restaurant scene has become almost as unrecognizable. The ci

## Load Pre-Trained Model

In [77]:
# Load pre-trained BERT with a next-sentence-prediction head.
model = AutoModelForNextSentencePrediction.from_pretrained(
    "bert-base-cased",
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Verify data-model interaction

In [78]:
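# Forward pass on one example to check that the model accepts the tokenized inputs.
# The dataset has no "labels" column, so only logits are returned.
sample = tokenized_dataset["train"][0]
with torch.no_grad():
    outputs = model(
        input_ids=torch.tensor([sample["input_ids"]]),
        token_type_ids=torch.tensor([sample["token_type_ids"]]),
        attention_mask=torch.tensor([sample["attention_mask"]]),
    )
outputs.logits.shape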

## Define Metrics

In [80]:
metrics = {
    "accuracy": evaluate.load("accuracy"),
    "precision": evaluate.load("precision"),
    "recall": evaluate.load("recall"),
    "f1": evaluate.load("f1"),
}

In [81]:
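def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Each metric.compute() call returns a dict such as {"accuracy": 0.9};
    # merge them so the Trainer receives a flat dict of floats.
    results = {}
    for metric in metrics.values():
        results.update(metric.compute(predictions=predictions, references=labels))
    return results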
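
To make the metric definitions concrete, here is a plain-Python sketch of what they compute for a binary task: take the argmax of each logit row, then count true positives, false positives and false negatives for the positive class. The helper names and the toy logits are illustrative assumptions; the notebook itself relies on the `evaluate` metrics.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def binary_metrics(logits, labels):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    correct = sum(1 for p, y in pairs if p == y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy logits for four sentences, two scores (class 0, class 1) each.
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
labels = [1, 0, 0, 1]
scores = binary_metrics(logits, labels)
print(scores)
```

Here the predictions are `[1, 0, 1, 0]`, giving one true positive, one false positive and one false negative, so all four scores come out at 0.5.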

## Train the Model

In [85]:
batch_size = 8  # TODO: increase if we have more data
num_epochs = 30
# metric_name = "f1"

In [86]:
training_args = TrainingArguments(
    "bert_ai",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    report_to="all",
    # load_best_model_at_end=True,
    # metric_for_best_model=metric_name,
    # push_to_hub=True,  # TODO: enable once model seems good
)
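
These settings determine how long training runs. The following worked calculation assumes a single device and no gradient accumulation, and uses the 6627-example training split from above. The result matches the total number of optimization steps that the `Trainer` reports when training starts.

```python
import math

num_train_examples = 6627
batch_size = 8
num_epochs = 30

# The last batch of each epoch is only partially filled, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 829 24870
```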

In [87]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

In [88]:
trainer.train()

***** Running training *****
  Num examples = 6627
  Num Epochs = 30
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 24870
  Number of trainable parameters = 108311810


ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,token_type_ids,attention_mask.
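
The `ValueError` is raised because the tokenized dataset contains only `input_ids`, `token_type_ids` and `attention_mask`. Without a `labels` column the model returns logits but cannot compute a loss. One possible fix is to attach labels before tokenization; since the tokenization step removes only the `text` column, a `labels` column would be carried through. The sketch below shows a batched mapping function of the kind `Dataset.map(..., batched=True)` accepts, written against a plain dict so it runs on its own. The helper `add_labels`, the lookup table and the example sentences are all hypothetical, because the data used here is unlabelled.

```python
def add_labels(batch, label_lookup):
    """Batched map function: attach an integer label to every text in the batch."""
    batch["labels"] = [label_lookup[text] for text in batch["text"]]
    return batch


# Hypothetical annotations: 1 = contains a problematic metaphor, 0 = does not.
label_lookup = {
    "The algorithm decided to reject the loan .": 1,
    "The centre will be based in Melbourne .": 0,
}

batch = {"text": list(label_lookup)}
labelled = add_labels(batch, label_lookup)
print(labelled["labels"])  # [1, 0]
```

In a one-vs-rest setting, each model would get its own binary `labels` column built this way.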

## Upload the Model

In [None]:
# agency-vs-rest/checkpoint-263: 0.75 precision, 0.85 recall

In [None]:
# trainer.push_to_hub()