# One-vs-Rest Classifier

This notebook implements a one-vs-rest classifier that fine-tunes several BERT models to detect whether a sentence contains problematic metaphors.
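As a rough illustration of the one-vs-rest idea, each class gets its own binary target list, and one binary classifier would then be trained per list. The sketch below assumes hypothetical class names (only `agency` appears later in this notebook) and a hypothetical helper `one_vs_rest_targets`; it is not the notebook's actual preprocessing.

```python
def one_vs_rest_targets(labels, classes):
    """Map multi-class labels to one binary target list per class."""
    return {
        cls: [1 if label == cls else 0 for label in labels]
        for cls in classes
    }


# Hypothetical labels; "agency" is the only class name taken from this notebook.
labels = ["agency", "other", "agency", "none"]
targets = one_vs_rest_targets(labels, classes=["agency", "other"])
print(targets["agency"])
print(targets["other"])
```

Each entry of `targets` could then serve as the label column for one fine-tuned model.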

<div hidden>
TODO: extend data3/data.json with better data in the same format that actually makes sense.
</div>

## Imports and Setup

In [1]:
!pip install transformers -U
!pip install scikit-learn -U
!pip install datasets -U
!pip install torch -U
!pip install numpy -U
!pip install evaluate -U

In [1]:
import gc

import evaluate
import numpy as np
import torch
from datasets import Dataset, load_dataset
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from transformers import (
    BertForNextSentencePrediction,
    BertTokenizer,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)



In [2]:
MODEL_NAME = "aihype_bert_fine_tune"

## Loading Dataset

In [3]:
dataset = load_dataset("json", data_files="data/pairs_unlabelled.json", field="data")
dataset

Using custom data configuration default-cfbde239128bba84
Found cached dataset json (/home/xt0r3/.cache/huggingface/datasets/json/default-cfbde239128bba84/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)



DatasetDict({
    train: Dataset({
        features: ['sen1', 'sen2', 'ans'],
        num_rows: 199
    })
})
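The `field="data"` argument suggests that the JSON file wraps its records in a top-level `"data"` key, with one object per sentence pair. A minimal sketch of such a file, using only the standard library, is shown below. The file name, the sentences, and the integer values of `ans` are assumptions made for illustration; only the column names `sen1`, `sen2`, and `ans` come from the output above.

```python
import json
import os
import tempfile

# Hypothetical records; only the three column names are taken from the notebook.
records = [
    {"sen1": "Inflation is rising .", "sen2": "Savers are losing out .", "ans": 0},
    {"sen1": "Savers are losing out .", "sen2": "Rates may change soon .", "ans": 1},
]

# Wrap the records in a top-level "data" key, matching field="data".
path = os.path.join(tempfile.mkdtemp(), "pairs_example.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"data": records}, f)

# Read the file back and inspect the columns of the first record.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)["data"]
columns = sorted(loaded[0].keys())
print(columns, len(loaded))
```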

In [4]:
dataset["train"][0:3]

{'sen1': ['Not one savings account is currently able to keep anywhere near the pace of rising costs .',
  'People can be forgiven for losing interest and whether there is a point of tucking money away for interest paying in some circumstances 5 per cent below the rate of inflation - and this gap could grow bigger in the coming months .',
  'However , firstly for those without a savings pot whatsoever , it is important to have a rainy day fund to fall back on .'],
 'sen2': ['By Ed Magnus For Thisismoney.co.uk Published : 07:50 , 13 January 2022 | Updated : 19:04 , 13 January 2022 8 View comments Surging inflation means the outlook for savers hunting returns is bleaker than bleak .',
  'Not one savings account is currently able to keep anywhere near the pace of rising costs .',
  'People can be forgiven for losing interest and whether there is a point of tucking money away for interest paying in some circumstances 5 per cent below the rate of inflation - and this gap could grow bigger in

In [5]:
num_epochs = 30

## Preprocess Data, Create Train/Test Split

In [6]:
dataset = dataset.class_encode_column('ans')
processed_dataset = dataset["train"].train_test_split(test_size=0.2, stratify_by_column='ans')
processed_dataset

Loading cached processed dataset at /home/xt0r3/.cache/huggingface/datasets/json/default-cfbde239128bba84/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-bfc0944b23a1635e.arrow
Loading cached processed dataset at /home/xt0r3/.cache/huggingface/datasets/json/default-cfbde239128bba84/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-bd8978a7c9633bcd.arrow


DatasetDict({
    train: Dataset({
        features: ['sen1', 'sen2', 'ans'],
        num_rows: 159
    })
    test: Dataset({
        features: ['sen1', 'sen2', 'ans'],
        num_rows: 40
    })
})
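Stratifying by `ans` keeps the class ratio roughly the same in the train and test parts. The following is a simplified pure-Python sketch of that idea, not the algorithm that `datasets` uses internally; the helper `stratified_split` and the toy label list are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test part.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: 80 examples of class 0 and 20 examples of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
test_positives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)
```

With a plain random split, a small minority class could end up missing from the test part entirely, which makes precision and recall meaningless.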

In [7]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [8]:
def preprocess_data(examples):
    return tokenizer(examples["sen1"], examples['sen2'], padding='max_length', truncation=True)

In [9]:
tokenized_dataset = processed_dataset.map(
    preprocess_data,
    remove_columns=("sen1", "sen2"),
    batched=True,
).rename_column('ans', 'next_sentence_label')

tokenized_dataset


DatasetDict({
    train: Dataset({
        features: ['next_sentence_label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 159
    })
    test: Dataset({
        features: ['next_sentence_label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 40
    })
})
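To make the three tokenizer outputs more concrete, the sketch below builds them by hand for a sentence pair. It assumes pre-split toy tokens instead of BERT's WordPiece vocabulary, and the helper `encode_pair` is hypothetical; it only mirrors the `[CLS] A [SEP] B [SEP] [PAD]...` layout that the BERT tokenizer produces for pairs with `padding='max_length'`.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Lay out a sentence pair the way BERT expects, padded to max_length."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("pair is longer than max_length; truncation is not sketched here")

    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Real tokens get attention, padding does not.
    attention_mask = [1] * len(tokens)

    n_pad = max_length - len(tokens)
    tokens += ["[PAD]"] * n_pad
    token_type_ids += [0] * n_pad
    attention_mask += [0] * n_pad
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = encode_pair(
    ["rates", "rise"], ["savers", "lose"], max_length=10
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```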

### Verify dataset

In [10]:
example = tokenized_dataset['train'][3]
example.keys()

dict_keys(['next_sentence_label', 'input_ids', 'token_type_ids', 'attention_mask'])

In [11]:
tokenizer.decode(example["input_ids"])

"[CLS] However, it may not appeal to those looking to stash away large amounts as the 2 per cent deal is capped at £3, 000. [SEP] It is also offering savers a deal paying 2 per cent - they don't have to pay for coaching to benefit either. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

## Load Pre-Trained Model

In [12]:
# Load bert-base-cased together with its pretrained next-sentence-prediction head
model = BertForNextSentencePrediction.from_pretrained(
    "bert-base-cased",
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Verify data-model interaction

In [13]:
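# Forward pass on one training example to check that the model accepts the tokenized inputs
sample = tokenized_dataset["train"][0]
with torch.no_grad():
    outputs = model(
        input_ids=torch.tensor([sample["input_ids"]]),
        token_type_ids=torch.tensor([sample["token_type_ids"]]),
        attention_mask=torch.tensor([sample["attention_mask"]]),
        labels=torch.tensor([sample["next_sentence_label"]]),
    )
outputs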

## Define Metrics

In [14]:
metrics = {
    "accuracy": evaluate.load("accuracy"),
    "precision": evaluate.load("precision"),
    "recall": evaluate.load("recall"),
    "f1": evaluate.load("f1"),
}

In [15]:
def compute_metrics(eval_pred):
    """Compute every metric in `metrics` from the logits and labels of an evaluation run."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Each metric.compute() call returns a dict such as {"accuracy": 1.0}
    return {
        name: metric.compute(predictions=predictions, references=labels)
        for name, metric in metrics.items()
    }


def compute_best(eval_pred):
    """Compute and print only the F1 score."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # evaluate metrics take the gold labels as `references`, not `labels`
    val = metrics["f1"].compute(predictions=predictions, references=labels)
    print(val)
    return val
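To show what these metrics measure, the following pure-Python sketch takes the argmax of each logit row and computes accuracy, precision, recall, and F1 by hand. The helpers `argmax_rows` and `binary_scores` and the toy logits are hypothetical, and ill-defined ratios are set to 0.0 here by assumption. The first call illustrates a degenerate case: when an evaluation split contains no positive labels and the model predicts none, accuracy is 1.0 while precision, recall, and F1 all come out as 0.0.

```python
def argmax_rows(logits):
    """Index of the largest logit in every row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def binary_scores(predictions, references):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    correct = sum(1 for p, r in pairs if p == r)

    # Ill-defined ratios (zero denominators) are set to 0.0 in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Degenerate case: every reference is 0 and the model also predicts 0 everywhere.
all_negative = binary_scores(
    argmax_rows([[2.0, -1.0], [1.5, 0.3], [0.9, 0.1]]), [0, 0, 0]
)

# Mixed case: one true positive, one false positive, one false negative.
mixed = binary_scores(
    argmax_rows([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]), [1, 0, 0, 1]
)
print(all_negative)
print(mixed)
```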

## Train the Model

In [16]:
batch_size = 1  # TODO: increase if we have more data
num_epochs = 30

In [17]:
training_args = TrainingArguments(
    MODEL_NAME,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    report_to="all",
    label_names=['next_sentence_label'],
    load_best_model_at_end=True,
    push_to_hub=True,
)

In [19]:
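# Cap each subset at the split size: select() raises an error for out-of-range indices
small_train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(min(3000, tokenized_dataset["train"].num_rows)))
small_eval_dataset = tokenized_dataset["test"].shuffle(seed=42).select(range(min(1000, tokenized_dataset["test"].num_rows)))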

In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,  # tokenized_dataset["train"],
    eval_dataset=small_eval_dataset,  # tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

In [21]:
trainer.train()

***** Running training *****
  Num examples = 50
  Num Epochs = 15
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 750
  Number of trainable parameters = 108311810
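The step count in this log can be checked with simple arithmetic. The sketch below assumes a single device, no gradient accumulation, and that a final partial batch still counts as one step; the helper `total_optimization_steps` is hypothetical and not part of `transformers`.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a full run: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Values reported in the training log above.
logged_run = total_optimization_steps(num_examples=50, batch_size=1, num_epochs=15)

# Values a run would use with the full 159-row training split and num_epochs = 30.
full_run = total_optimization_steps(num_examples=159, batch_size=1, num_epochs=30)
print(logged_run, full_run)
```

With 50 steps per epoch and `save_strategy="epoch"`, checkpoints would be expected at steps 50, 100, and so on.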


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | No log | 3.3e-05 | {'accuracy': 1.0} | {'precision': 0.0} | {'recall': 0.0} | {'f1': 0.0} |
| 2 | No log | 2.5e-05 | {'accuracy': 1.0} | {'precision': 0.0} | {'recall': 0.0} | {'f1': 0.0} |
| 3 | No log | 2.4e-05 | {'accuracy': 1.0} | {'precision': 0.0} | {'recall': 0.0} | {'f1': 0.0} |


***** Running Evaluation *****
  Num examples = 40
  Batch size = 1
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
Saving model checkpoint to bert_ai_fine_tune/checkpoint-50
Configuration saved in bert_ai_fine_tune/checkpoint-50/config.json
Model weights saved in bert_ai_fine_tune/checkpoint-50/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 40
  Batch size = 1
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
Saving model checkpoint to bert_ai_fine_tune/checkpoint-100
Configuration saved in bert_ai_fine_tune/checkpoint-100/config.json
Model weights saved in bert_ai_fine_tune/checkpoint-100/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 40
  Batch size = 1
  _warn_prf(average, modifier, 

## Upload the Model

In [None]:
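# Free the memory: drop the references first so the garbage collector can reclaim them
model = None
trainer = None
training_args = None
gc.collect()

# Then release the cached GPU memory that is no longer in use
if torch.cuda.is_available():
    torch.cuda.empty_cache()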

In [None]:
# agency-vs-rest/checkpoint-263: 0.75 precision, 0.85 recall
#
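For comparison with the F1 values logged during training, the precision and recall noted above can be combined into an F1 score, which is their harmonic mean. This is only a worked calculation from the two reported numbers, not a result produced by this notebook.

```python
# Precision and recall as noted for agency-vs-rest/checkpoint-263.
precision = 0.75
recall = 0.85

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```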

In [None]:
# trainer.push_to_hub()