This notebook fine-tunes Google's BigBird model for the Coleridge Initiative "Show US the Data" competition.  

It was difficult to get even a single epoch to run within Kaggle's 9-hour timeout limit.  

To fit within that limit, I separated out the data preparation, here: https://www.kaggle.com/danieldorosz/show-us-the-data-bigbird-dataprep  
and the inference, here: https://www.kaggle.com/danieldorosz/show-us-the-data-bigbird-inference  

Part of the logic is farmed out to a coleridge-helpers utility script.  

The main intuition behind this effort was that I wanted to include as much context as possible in my training examples,
and to keep related context together. The training data already provides a demarcation of context in the form of
sections, so I created contextual 'snippets' as my training examples. Each snippet contains one or more whole sections,
packed so that it gets as close as possible to BigBird's maximum of 4096 tokens without breaking up any section.
If a single section is longer than the training example limit, I break it up at the last period prior to the limit.  
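
The packing described above can be sketched roughly as follows. This is a minimal illustration, not the code from the data-preparation notebook: it assumes tokens can be approximated by whitespace-separated words (the real pipeline would count BigBird tokenizer tokens), and the function names `split_long_section` and `build_snippets` are hypothetical.

```python
from typing import List


def split_long_section(words: List[str], max_tokens: int) -> List[List[str]]:
    """Split an over-long section into pieces of at most max_tokens words,
    cutting after the last word that ends with a period inside the limit."""
    pieces = []
    while len(words) > max_tokens:
        window = words[:max_tokens]
        # Position just past the last period in the window;
        # fall back to a hard cut at the limit if there is none.
        cut = max(
            (i + 1 for i, w in enumerate(window) if w.endswith(".")),
            default=max_tokens,
        )
        pieces.append(words[:cut])
        words = words[cut:]
    if words:
        pieces.append(words)
    return pieces


def build_snippets(sections: List[str], max_tokens: int = 4096) -> List[str]:
    """Greedily pack consecutive sections into snippets of at most
    max_tokens words (ASSUMPTION: one word stands in for one token)."""
    snippets: List[str] = []
    current: List[str] = []
    for section in sections:
        for piece in split_long_section(section.split(), max_tokens):
            # Start a new snippet when the next piece would not fit.
            if current and len(current) + len(piece) > max_tokens:
                snippets.append(" ".join(current))
                current = []
            current.extend(piece)
    if current:
        snippets.append(" ".join(current))
    return snippets
```

With a toy limit of 6 "tokens", `build_snippets(["a b c.", "d e.", "f g h i j. k l m n."], max_tokens=6)` keeps the first two sections together and splits the third at its first period.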

The code is very much a rough-and-ready first draft, so please don't judge me ;-) There is a lot left to improve that 
I didn't have time for. This mainly serves as a baseline to assess the score I could expect from this kind of approach.  

I ran it a couple of times, using the last checkpoint from the first (timed-out) run as the starting point for the next.
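
To chain runs like this, the newest checkpoint has to be located first. Below is a minimal sketch of one way to do that, assuming the Trainer's usual `checkpoint-<step>` directory naming inside the output directory; the helper name `latest_checkpoint` is hypothetical and not part of this notebook.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step,
    or None if the output directory holds no checkpoints."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare on the numeric step so that checkpoint-2000 beats checkpoint-500.
    return max(candidates, key=lambda item: item[0])[1]
```

The returned directory could then be uploaded as a Kaggle dataset and used as the starting point for the next run.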

# Imports & Preamble

In [None]:
!pip install -qU --no-warn-conflicts transformers --no-index --find-links=file:///kaggle/input/coleridge-packages
!pip install -qU --no-warn-conflicts tokenizers --no-index --find-links=file:///kaggle/input/coleridge-packages
!pip install -qU --no-warn-conflicts datasets --no-index --find-links=file:///kaggle/input/coleridge-packages
!pip install -qU --no-warn-conflicts fsspec --no-index --find-links=file:///kaggle/input/coleridge-packages
!pip install -qU --no-warn-conflicts seqeval --no-index --find-links=file:///kaggle/input/coleridge-packages
    
# need to set wandb off otherwise we get errors using this kernel offline
!wandb off

In [None]:
import numpy as np 
from transformers import (
    BigBirdForTokenClassification,
    BigBirdTokenizerFast,
    BigBirdConfig,
    TrainingArguments, 
    Trainer,
    DataCollatorForTokenClassification,
)
from datasets import (
    Dataset,
    load_metric,
)

# Load Training Data

In [None]:
# tokenized_dataset = Dataset.from_json("../input/coleridgetaggedsnippets/tokenized_dataset_reduced.json")
# tokenized_dataset = tokenized_dataset.shuffle(seed=42)

tokenized_dataset = Dataset.from_json("../input/show-us-the-data-bigbird-dataprep/tokenized_dataset.json")

# Instantiate Pretrained BigBird Model & Tokenizer

In [None]:
# BigBird roberta-base
model_class, tokenizer_class, pretrained_weights = (BigBirdForTokenClassification, BigBirdTokenizerFast, '../input/huggingfacebigbirdrobertabase')

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

data_collator = DataCollatorForTokenClassification(tokenizer)

# BIO tagging scheme: Outside, Beginning, Inside
label_list = ["O", "B", "I"]
label2id = {label: idx for idx, label in enumerate(label_list)}
id2label = {idx: label for label, idx in label2id.items()}

def get_pretrained_model(checkpoint=pretrained_weights):
    # block-sparse attention and gradient checkpointing reduce memory use on long sequences
    config = BigBirdConfig(attention_type="block_sparse", gradient_checkpointing=True, num_labels=len(label_list), id2label=id2label, label2id=label2id)
    return model_class.from_pretrained(checkpoint, config=config)

# Fine Tuning

## Metrics

The metrics only reflect training accuracy, as we are using the training set for evaluation.
This could be improved by using k-fold cross-validation.
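
A minimal sketch of how k-fold splits could be generated is shown here. It is not used in this notebook, the helper name `kfold_indices` is hypothetical, and it only produces index lists; the matching rows would still need to be selected from the dataset (for example with `Dataset.select`).

```python
import random
from typing import List, Tuple


def kfold_indices(
    n_examples: int, n_folds: int = 5, seed: int = 42
) -> List[Tuple[List[int], List[int]]]:
    """Return one (train_indices, validation_indices) pair per fold."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, val_idx))
    return splits
```

Note that splitting by snippet can still leak information, since snippets from one publication may land on both sides of a split; grouping by publication would be stricter. Each fold would also need its own training run, which is hard to fit in the time limit.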

In [None]:
# copy seqeval_script.py to the working directory because the input directory is non-writable
!cp ../input/coleridge-packages/seqeval_script.py ./

In [None]:
# TODO: This could be improved to use the F0.5 score (used for leaderboard evaluation) instead of F1

metric = load_metric("seqeval_script.py")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
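
The F0.5 score mentioned in the TODO above can be derived from precision and recall with the standard F-beta formula. The sketch below is an assumption about how it could be added and is not part of the original run; `f_beta` is a hypothetical helper name.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score: beta < 1 weights precision more heavily than recall."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator <= 0.0:
        # Both precision and recall are zero, so the score is defined as zero.
        return 0.0
    return (1.0 + beta_sq) * precision * recall / denominator
```

Inside `compute_metrics` this could be returned as an extra entry, e.g. `"f0.5": f_beta(results["overall_precision"], results["overall_recall"])`. Computed from seqeval's overall precision and recall, it is only a proxy for the leaderboard score.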

## Run Training

In [None]:
batch_size = 4

args = TrainingArguments(
    "model_checkpoints",
    evaluation_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    save_total_limit=1,
    save_strategy="steps",
    save_steps=1000,
)

trainer = Trainer(
    get_pretrained_model("../input/coleridgemodelcheckpoint"),
    args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
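# save under the working directory (rather than the filesystem root) so the model is kept with the notebook output
trainer.save_model("./saved_model")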