# Simple PyTorch NER Training on GPU

#### A simple way to train an NER model that classifies each token into a discourse type. This uses the Hugging Face `Trainer` and the default `AutoModelForTokenClassification`, but you can also plug in a custom `nn.Module`. Longformer has already been shown to do well, so let's try its close relative: BigBird. It is another model with a novel attention mechanism that can handle long sequence lengths. Some benchmarks for long-range models show BigBird doing better than Longformer (see image below), but we'll have to see what works better for this competition. This uses the base version of BigBird; you'll probably get better performance with the large version, but it will take longer to train.

#### This is only 1 fold and reports a macro-F1 CV score. It would be wise to do k-fold validation with your favorite value of k to get a better sense of how your model is doing.

## Coming soon:
### ✅ Weights and biases reporting ([jump to section](#Weights-and-Biases-Report))
### ✅ Custom evaluate function that does competition-specific f1 scoring ([jump to section](#Custom-Trainer))
### 🔳 Model with custom output layers
### ✅ Model that has been pretrained further on in-domain data
    - (See Version 9 for results) Modest improvement of CV from 0.535 -> 0.543
    - More experiments will be needed to verify value
    - Pretrained BigBird base model available here: https://www.kaggle.com/nbroad/bigbird-base-idpt-feedback-prize
### 🔳 Comparison of BIO vs IO labeling scheme
### 🔳 Comparison of tokenizing after splitting at whitespace vs tokenizing while keeping whitespace
### 🔳 Any suggestions?


### 🚨 ~~It appears that there might be an issue with the bigbird tokenizer ([warning similar to here](https://github.com/huggingface/transformers/pull/11075#issuecomment-833404825)). Will update this when confirming it is/isn't working properly.~~ False alarm 😅


### 👉 Version 5 (CV F1: 0.535) has the outputs of the first full training without pretraining
### 👉 Version 9 (CV F1: 0.543) has the outputs of the first full training with pretraining
### 👉 Version 16 fixes an error when calculating the CV score. It also uses the corrected training file
### 👉 Version 17 (CV F1: 0.609) has cleaner train/val splits. Now it doesn't have overlapping file ids. Rookie mistake 😬

<img src="https://user-images.githubusercontent.com/1694368/102184635-baab7400-3eea-11eb-8113-b3fb6d4b8bbc.png" alt="benchmark comparison" width="800"/>

### Code based off of: 

- [run_ner.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/token-classification/run_ner.py)
- [Rob Mulla's implementation of f1 score (@robikscube)](https://www.kaggle.com/robikscube/student-writing-competition-twitch)
- [zzy's infer notebook](https://www.kaggle.com/zzy990106/pytorch-ner-infer?scriptVersionId=82677278&cellId=13)

In [None]:
# necessary for metrics
!pip install seqeval -qq

In [None]:
import os
from pathlib import Path
from functools import partial
from dataclasses import dataclass, field
from typing import Optional

import pandas as pd
import numpy as np
import torch
import datasets
from datasets import ClassLabel, load_dataset, load_metric, Dataset

import transformers
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    HfArgumentParser,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
    set_seed,
)

# Arguments hidden in next cell

In [None]:
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
            "with private models)."
        },
    )
    reinit_layers: Optional[int] = field(
        default=None,
        metadata={
            "help": "Number of layers from the output to reinitialize in encoder."
        }
    )
    use_custom_output: bool = field(
        default=False,
        metadata={"help": "Set to true to use the custom model defined below"}
    )


@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    task_name: Optional[str] = field(default="ner", metadata={"help": "The name of the task (ner, pos...)."})
    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    train_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a csv or JSON file)."}
    )
    validation_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate on (a csv or JSON file)."},
    )
    test_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input test data file to predict on (a csv or JSON file)."},
    )
    val_split_percentage: Optional[float] = field(
        default=None,
        metadata={"help": "Fraction of the data to hold out for validation when doing train_test_split. If None, assumes "
                  "there are already separate train and validation files"},
    )
    text_column_name: Optional[str] = field(
        default=None, metadata={"help": "The column name of text to input in the file (a csv or JSON file)."}
    )
    label_column_name: Optional[str] = field(
        default=None, metadata={"help": "The column name of label to input in the file (a csv or JSON file)."}
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_seq_length: int = field(
        default=None,
        metadata={
            "help": "The maximum total input sequence length after tokenization. If set, sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    pad_to_max_length: bool = field(
        default=False,
        metadata={
            "help": "Whether to pad all samples to model maximum sentence length. "
            "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
            "efficient on GPU but very bad for TPU."
        },
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of training examples to this "
            "value if set."
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
            "value if set."
        },
    )
    max_predict_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this "
            "value if set."
        },
    )
    label_all_tokens: bool = field(
        default=False,
        metadata={
            "help": "Whether to put the label for one word on all tokens generated by that word or just on the "
            "first one (in which case the other tokens will have a padding index)."
        },
    )
    return_entity_level_metrics: bool = field(
        default=False,
        metadata={"help": "Whether to return all the entity levels during evaluation or just the overall ones."},
    )
    use_bio: Optional[bool] = field(default=True, metadata={"help": "If True, use BIO tagging. Otherwise use IO."})
    data_file: str = field(default=None, metadata={"help": "Path to single data file that will be used to make train and validation files."})
    ground_truth_file: Optional[str] = field(default="../input/feedback-prize-2021/train.csv", metadata={"help": "Path to the ground-truth csv file (id, discourse_type, predictionstring) used to compute the CV score."})
    def __post_init__(self):
        if self.dataset_name is None and self.train_file is None and self.validation_file is None:
            raise ValueError("Need either a dataset name or a training/validation file.")
        else:
            if self.train_file is not None:
                extension = self.train_file.split(".")[-1]
                assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
            if self.validation_file is not None:
                extension = self.validation_file.split(".")[-1]
                assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file."
        self.task_name = self.task_name.lower()

In [None]:
model_args = ModelArguments(
    model_name_or_path="../input/bigbird-base-feedback-prize-idpt",
    reinit_layers=0,
    use_custom_output=True,
)
data_args = DataTrainingArguments(
    train_file="train.json",
    validation_file="validation.json",
    val_split_percentage=0.1,
    text_column_name="words",
    label_column_name="bio",
    max_seq_length=1024,
    pad_to_max_length=True,
    label_all_tokens=True,
    return_entity_level_metrics=False, # I wouldn't recommend this-it produces a ton of metrics
    preprocessing_num_workers=4,
    use_bio=True,
    data_file="../input/feedbackprize-bio-ner-train-data/corrected/split_at_whitespace.json",
    ground_truth_file="../input/feedback-prize-corrected-train-csv/corrected_train.csv",
    max_train_samples=None, # Useful for quick debugging runs. None means run all
    max_eval_samples=None, # Useful for quick debugging runs. None means run all
)
training_args = TrainingArguments(
    output_dir="feedback-prize-bigbird-base",
    do_train=True,
    do_eval=True,
    do_predict=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    weight_decay=0.0,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=600,
    save_strategy="steps",
    save_steps=600,
    save_total_limit=5,
    load_best_model_at_end=True,
    metric_for_best_model="eval_Overall-CV-F1",
    seed=18,
    fp16=True,
    report_to="wandb",
    run_name="bb-base-ner-corrected-weighted-custom-output-idpt-no-dropout-no-wd",
)
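
As a rough sanity check on the settings above, the sketch below works out the effective batch size and the approximate number of optimizer and warmup steps. The training-set size (`num_train_examples`) and the single-GPU setting (`num_gpus`) are hypothetical placeholders rather than values from this notebook, and the step counting only approximates how `Trainer` schedules updates.

```python
import math

# Values copied from the TrainingArguments above
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 3
warmup_ratio = 0.1

# Hypothetical placeholders (not taken from the notebook)
num_gpus = 1
num_train_examples = 14_000


def schedule_summary(n_examples, batch_size, accum_steps, n_gpus, epochs, warmup):
    """Approximate the effective batch size and the update/warmup step counts."""
    effective = batch_size * accum_steps * n_gpus
    batches_per_epoch = math.ceil(n_examples / (batch_size * n_gpus))
    # One optimizer update happens every `accum_steps` batches
    updates_per_epoch = max(batches_per_epoch // accum_steps, 1)
    total = updates_per_epoch * epochs
    return effective, total, math.ceil(total * warmup)


effective_batch, total_updates, warmup_updates = schedule_summary(
    num_train_examples,
    per_device_train_batch_size,
    gradient_accumulation_steps,
    num_gpus,
    num_train_epochs,
    warmup_ratio,
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer updates:    {total_updates}")
print(f"warmup updates:       {warmup_updates}")
```

With `eval_steps=600`, a run of this hypothetical size would evaluate and checkpoint only a handful of times, so it may be worth adjusting `eval_steps` to the actual dataset size.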

## I use wandb to track progress

#### Feel free to disable it by setting `report_to="none"` in `TrainingArguments`

In [None]:
if "wandb" in training_args.report_to:
    !pip install -U wandb -qq
    import wandb

    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    wandb_key = user_secrets.get_secret("wandb")

    os.environ["WANDB_PROJECT"] = "feedback-prize"
    wandb.login(key=wandb_key)

# Train/validation data split

(Version 13+) Used to be 80-20 but it looks like everyone else is doing 90-10  
Data is in BIO format.  
Here is a notebook showing how to make the data:  https://www.kaggle.com/nbroad/feedback-prize-bio-format-for-ner  
Here is a dataset to use in your own notebooks: https://www.kaggle.com/nbroad/feedbackprize-bio-ner-train-data

In [None]:
from transformers.utils.logging import set_verbosity, ERROR

set_verbosity(ERROR)
set_seed(training_args.seed)

if data_args.data_file is not None:
    full_dataset = datasets.load_dataset("json", data_files=data_args.data_file, split="train")

if data_args.val_split_percentage is not None:
    # set_seed should work just fine, but to be safe I'll do it again
    np.random.seed(training_args.seed)

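    # Deduplicate and sort before shuffling. Shuffling first and then calling
    # set() discards the shuffled order, so the split would not be reproducible.
    file_ids = sorted(set(full_dataset["id"]))
    np.random.shuffle(file_ids)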

    num_files = len(file_ids)
    threshold = int(num_files*data_args.val_split_percentage)
    train_file_ids = set(file_ids[threshold:])

    train_dataset = full_dataset.filter(lambda x: x["id"] in train_file_ids)
    val_dataset = full_dataset.filter(lambda x: x["id"] not in train_file_ids)
    
    train_dataset.to_json("train.json")
    val_dataset.to_json("validation.json")
    del train_dataset, val_dataset

ground_truth_dataset = load_dataset("csv", data_files=data_args.ground_truth_file, split="train")
if "corrected" in data_args.ground_truth_file:
    keep_cols = ["id", "discourse_type", "new_predictionstring"]
    ground_truth_dataset = ground_truth_dataset.remove_columns([x for x in ground_truth_dataset.column_names if x not in keep_cols])
    ground_truth_dataset = ground_truth_dataset.rename_column("new_predictionstring", "predictionstring")
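
The cell above holds out a single validation fold. The sketch below shows one way the same id-based idea could be extended to k folds, so that no essay id ever appears in both the train and validation side of a fold. It is an illustration only: `grouped_k_fold` and the example ids are hypothetical, and in the notebook the ids would come from `full_dataset["id"]`.

```python
import random


def grouped_k_fold(ids, k, seed):
    """Yield (train_ids, val_ids) pairs; each id is in exactly one validation fold."""
    unique_ids = sorted(set(ids))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique_ids)
    folds = [unique_ids[i::k] for i in range(k)]
    for fold in folds:
        val_ids = set(fold)
        train_ids = set(unique_ids) - val_ids
        yield train_ids, val_ids


# Hypothetical essay ids, duplicated on purpose to mimic repeated rows per essay
example_ids = [f"essay_{n:03d}" for n in range(10)] * 2

splits = list(grouped_k_fold(example_ids, k=5, seed=18))
for fold_idx, (train_ids, val_ids) in enumerate(splits):
    print(fold_idx, len(train_ids), len(val_ids))
```

Each fold could then be filtered out of the full dataset the same way the single split is, and the per-fold CV scores averaged.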

# Load raw data

In [None]:
data_files = {}
if data_args.train_file is not None:
    data_files["train"] = data_args.train_file
if data_args.validation_file is not None:
    data_files["validation"] = data_args.validation_file
if data_args.test_file is not None:
    data_files["test"] = data_args.test_file
extension = data_args.train_file.split(".")[-1]
raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)

# Labels

In [None]:
if training_args.do_train:
    column_names = raw_datasets["train"].column_names
    features = raw_datasets["train"].features
else:
    column_names = raw_datasets["validation"].column_names
    features = raw_datasets["validation"].features

if data_args.text_column_name is not None:
    text_column_name = data_args.text_column_name
elif "tokens" in column_names:
    text_column_name = "tokens"
else:
    text_column_name = column_names[0]

if data_args.label_column_name is not None:
    label_column_name = data_args.label_column_name
elif f"{data_args.task_name}_tags" in column_names:
    label_column_name = f"{data_args.task_name}_tags"
else:
    label_column_name = column_names[1]

# In the event the labels are not a `Sequence[ClassLabel]`, we will need to go through the dataset to get the
# unique labels.
def get_label_list(labels):
    unique_labels = set()
    for label in labels:
        unique_labels = unique_labels | set(label)
    label_list = list(unique_labels)
    label_list.sort()
    return label_list

if isinstance(features[label_column_name].feature, ClassLabel):
    label_list = features[label_column_name].feature.names
    # No need to convert the labels since they are already ints.
    label2id = {i: i for i in range(len(label_list))}
else:
    label_list = get_label_list(raw_datasets["train"][label_column_name])
    label2id = {l: i for i, l in enumerate(label_list)}
num_labels = len(label_list)

if not data_args.use_bio:
    # Map that sends B-Xxx label to its I-Xxx counterpart
    b_to_i_label = []
    for idx, label in enumerate(label_list):
        if label.startswith("B-") and label.replace("B-", "I-") in label_list:
            b_to_i_label.append(label_list.index(label.replace("B-", "I-")))
        else:
            b_to_i_label.append(idx)

# Custom Bigbird Model

In [None]:
from transformers import BigBirdPreTrainedModel, BigBirdModel
from transformers.modeling_outputs import TokenClassifierOutput

class BigBirdForTokenClassification(BigBirdPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.transformer = BigBirdModel(config)
        self.preclassifier = torch.nn.Linear(config.hidden_size, config.hidden_size//2)
        self.classifier = torch.nn.Linear(config.hidden_size//2, config.num_labels)

        # Initialize weights and apply final processing
#         self.post_init()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.transformer(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]
        sequence_output = self.preclassifier(sequence_output)
        sequence_output = torch.nn.functional.gelu(sequence_output)        
        logits = self.classifier(sequence_output)

        loss = None
        if labels is not None:
            loss_fct = torch.nn.CrossEntropyLoss()
            # Only keep active parts of the loss
            if attention_mask is not None:
                active_loss = attention_mask.view(-1) == 1
                active_logits = logits.view(-1, self.num_labels)
                active_labels = torch.where(
                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)
                )
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

# Config, Tokenizer, Model

In [None]:
config = AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    num_labels=num_labels,
    label2id=label2id,
    id2label={i: l for l, i in label2id.items()},
    finetuning_task=data_args.task_name,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)

tokenizer_name_or_path = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=True,
    add_prefix_space=True,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)

if model_args.use_custom_output:
    model = BigBirdForTokenClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )
else:
    model = AutoModelForTokenClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )

def reinit_weights(encoder, reinit_layers, config):
    print(f"Reinitializing last {reinit_layers} layers in the encoder.")
    for layer in encoder.layer[-reinit_layers:]:
        for module in layer.modules():
            if isinstance(module, torch.nn.Linear):
                module.weight.data.normal_(mean=0.0, std=config.initializer_range)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, torch.nn.Embedding):
                module.weight.data.normal_(mean=0.0, std=config.initializer_range)
                if module.padding_idx is not None:
                    module.weight.data[module.padding_idx].zero_()
            elif isinstance(module, torch.nn.LayerNorm):
                module.bias.data.zero_()
                module.weight.data.fill_(1.0)

if model_args.reinit_layers:
    reinit_weights(model.transformer.encoder, model_args.reinit_layers, model.config)

# Tokenizing

### Old method that tokenizes after splitting at whitespace

In [None]:
# Tokenize all texts and align the labels with them.
def tokenize_and_align_labels(examples, train=True):
    tokenized_inputs = tokenizer(
        examples[data_args.text_column_name],
        padding=padding if train else "longest",
        truncation=True,
        max_length=data_args.max_seq_length if train else tokenizer.model_max_length,
        # We use this argument because the texts in our dataset are lists of words (with a label for each word).
        is_split_into_words=True,
    )
    labels = []
    all_word_ids = []
    for i, label in enumerate(examples[data_args.label_column_name]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label2id[label[word_idx]])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                if data_args.label_all_tokens:
                    # The default script converts all b-labels to i-labels
                    if data_args.use_bio:
                        label_ids.append(label2id[label[word_idx]])
                    else:
                        label_ids.append(b_to_i_label[label2id[label[word_idx]]])
                else:
                    label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)
        if not train:
            all_word_ids.append(word_ids)
    tokenized_inputs["labels"] = labels
    
    if not train:
        tokenized_inputs["word_ids"] = all_word_ids
    
    return tokenized_inputs
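
The alignment logic is easier to follow on a tiny example. The sketch below reproduces the inner loop for the `use_bio=True` case, using a hand-written `word_ids` list in place of real tokenizer output. The labels and word ids are made up for illustration.

```python
IGNORE_INDEX = -100  # ignored by the cross-entropy loss


def align_labels(word_ids, word_labels, label2id, label_all_tokens):
    """Expand word-level labels to token-level label ids."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] / [SEP]
            label_ids.append(IGNORE_INDEX)
        elif word_idx != previous_word_idx or label_all_tokens:
            # First piece of a word, or every piece when label_all_tokens is set
            label_ids.append(label2id[word_labels[word_idx]])
        else:
            label_ids.append(IGNORE_INDEX)
        previous_word_idx = word_idx
    return label_ids


label2id = {"B-Claim": 0, "I-Claim": 1, "O": 2}
word_labels = ["B-Claim", "I-Claim", "O"]
# Hypothetical tokenization: [CLS], word 0 (2 pieces), word 1, word 2 (2 pieces), [SEP]
word_ids = [None, 0, 0, 1, 2, 2, None]

all_tokens = align_labels(word_ids, word_labels, label2id, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label2id, label_all_tokens=False)
print(all_tokens)
print(first_only)
```

Note that with `label_all_tokens=True` and BIO tags, the second piece of a `B-` word also receives the `B-` label, which is what the function above does when `use_bio` is set.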

In [None]:
# Preprocessing the dataset
# Padding strategy
padding = "max_length" if data_args.pad_to_max_length else False


if training_args.do_train:
    if "train" not in raw_datasets:
        raise ValueError("--do_train requires a train dataset")
    train_dataset = raw_datasets["train"]
    if data_args.max_train_samples is not None:
        train_dataset = train_dataset.select(range(data_args.max_train_samples))

    with training_args.main_process_first(desc="train dataset map pre-processing"):
        train_dataset = train_dataset.map(
            partial(tokenize_and_align_labels, train=True),
            batched=True,
            num_proc=data_args.preprocessing_num_workers,
            load_from_cache_file=not data_args.overwrite_cache,
            desc="Running tokenizer on train dataset",
        )

if training_args.do_eval:
    if "validation" not in raw_datasets:
        raise ValueError("--do_eval requires a validation dataset")
    eval_dataset = raw_datasets["validation"]
    if data_args.max_eval_samples is not None:
        eval_dataset = eval_dataset.select(range(data_args.max_eval_samples))

    with training_args.main_process_first(desc="validation dataset map pre-processing"):
        eval_dataset = eval_dataset.map(
            partial(tokenize_and_align_labels, train=False),
            batched=True,
            num_proc=data_args.preprocessing_num_workers,
            load_from_cache_file=not data_args.overwrite_cache,
            desc="Running tokenizer on validation dataset",
        )

# Collator and metrics

In [None]:
# Data collator
pad_to_multiple_of = 8 if training_args.fp16 else None

# These models have minimum padding sizes
if "bigbird" in model_args.model_name_or_path:
    pad_to_multiple_of = 1024
if "longformer" in model_args.model_name_or_path:
    pad_to_multiple_of = 512
    
data_collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=pad_to_multiple_of)

# Metrics
metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    if data_args.return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }
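
For intuition about `pad_to_multiple_of`, the sketch below computes the length a batch would be padded to for a few settings. `padded_length` is a hypothetical helper written for this illustration, not a function from `transformers`, and the batch lengths are made up.

```python
import math


def padded_length(longest_in_batch, pad_to_multiple_of=None):
    """Round the longest sequence length in a batch up to the next multiple."""
    if pad_to_multiple_of is None:
        return longest_in_batch
    return math.ceil(longest_in_batch / pad_to_multiple_of) * pad_to_multiple_of


example_lengths = (700, 1024, 1100)
table = {
    multiple: [padded_length(n, multiple) for n in example_lengths]
    for multiple in (None, 8, 512, 1024)
}
for multiple, lengths in table.items():
    print(multiple, lengths)
```

A large multiple means a sequence only slightly over the boundary pays for a full extra block, which matters for memory when the evaluation batch size is 16.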

# CV Calculation Functions

In [None]:
# Rob Mulla @robikscube
# https://www.kaggle.com/robikscube/student-writing-competition-twitch
def calc_overlap(pred, ground_truth):
    """
    Calculates the overlap between prediction and
    ground truth and overlap percentages used for determining
    true positives.
    """
    set_pred = set(pred.split(' '))
    set_gt = set(ground_truth.split(' '))
    # Length of each and intersection
    len_gt = len(set_gt)
    len_pred = len(set_pred)
    inter = len(set_gt.intersection(set_pred))
    overlap_1 = inter / len_gt
    overlap_2 = inter/ len_pred
    return [overlap_1, overlap_2]


def score_feedback_comp(pred_df, gt_df):
    """
    A function that scores for the kaggle
        Student Writing Competition
        
    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = gt_df[['id','discourse_type','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df = pred_df[['id','discourse_type','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    # Step 1. all ground truths and predictions for a given discourse_type are compared.
    joined = pred_df.merge(gt_df,
                           left_on=['id','discourse_type'],
                           right_on=['id','discourse_type'],
                           how='outer',
                           suffixes=('_pred','_gt')
                          )
    joined['predictionstring_gt'] = joined['predictionstring_gt'].fillna(' ')
    joined['predictionstring_pred'] = joined['predictionstring_pred'].fillna(' ')

    joined['overlaps'] = [calc_overlap(pred, gt) for pred, gt in joined[['predictionstring_pred', 'predictionstring_gt']].values]

    # 2. If the overlap between the ground truth and prediction is >= 0.5, 
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    joined['overlap1'] = joined['overlaps'].apply(lambda x: x[0])
    joined['overlap2'] = joined['overlaps'].apply(lambda x: x[1])


    joined['potential_TP'] = (joined['overlap1'] >= 0.5) & (joined['overlap2'] >= 0.5)
    joined['max_overlap'] = joined[['overlap1','overlap2']].max(axis=1)
    tp_pred_ids = joined.query('potential_TP') \
        .sort_values('max_overlap', ascending=False) \
        .groupby(['id','predictionstring_gt']).first()['pred_id'].values

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = [p for p in joined['pred_id'].unique() if p not in tp_pred_ids]

    matched_gt_ids = joined.query('potential_TP')['gt_id'].unique()
    unmatched_gt_ids = [c for c in joined['gt_id'].unique() if c not in matched_gt_ids]

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
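    # Calculate F1, precision and recall, guarding against empty denominators.
    # Always return a dict so callers can iterate over the scores.
    denominator = (TP + 0.5*(FP+FN))
    my_f1_score = TP / denominator if denominator > 0 else 0.0
    return {
        "F1": round(my_f1_score, 4),
        "Precision": TP / (TP + FP) if (TP + FP) > 0 else 0.0,
        "Recall": TP / (TP + FN) if (TP + FN) > 0 else 0.0,
    }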
        

id2label={i: l for l, i in label2id.items()}
# https://www.kaggle.com/zzy990106/pytorch-ner-infer?scriptVersionId=82677278&cellId=13
def get_label_predictions(dataset, preds):

    ids = dataset["id"]
    word_ids = dataset["word_ids"]
    words = dataset["words"]
    
    all_preds = []

    # Use a separate loop variable so the outer `words` list is not shadowed
    for id_, sample_preds, sample_word_ids, sample_words in zip(ids, preds, word_ids, words):
        label_preds = [""]*len(sample_words)

        for pred, w_id in zip(sample_preds, sample_word_ids):
            if w_id is None:
                continue
            if label_preds[w_id] == "":
                label_preds[w_id] = id2label[pred]

        j = 0
        while j < len(label_preds):
            label = label_preds[j].lstrip("BI-")
            if label == 'O' or label == '':
                j += 1
            else:
                end = j + 1
                while end < len(label_preds) and label_preds[end].lstrip("BI-") == label:
                    end += 1

                if end - j > 8:
                    all_preds.append((id_, label, ' '.join(map(str, list(range(j, end))))))

                j = end
                
    return all_preds
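
The matching rule used for scoring can be checked by hand on a small case. The sketch below re-implements the overlap test on made-up word indices. It is a simplified illustration of the rule described in the comments above, not the full scoring function, and it uses `split()` rather than `split(' ')`.

```python
def overlaps(pred, ground_truth):
    """Return (share of ground truth covered, share of prediction covered)."""
    pred_set = set(pred.split())
    gt_set = set(ground_truth.split())
    inter = len(pred_set & gt_set)
    return inter / len(gt_set), inter / len(pred_set)


def is_match(pred, ground_truth, threshold=0.5):
    """A prediction is a potential true positive only if both overlaps pass."""
    overlap_gt, overlap_pred = overlaps(pred, ground_truth)
    return overlap_gt >= threshold and overlap_pred >= threshold


def f1_from_counts(tp, fp, fn):
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator > 0 else 0.0


# Hypothetical predictionstrings (space-separated word indices)
gt = " ".join(str(i) for i in range(10, 20))        # words 10..19
good = " ".join(str(i) for i in range(12, 22))      # shares 8 of 10 words
too_long = " ".join(str(i) for i in range(10, 40))  # covers gt but is 3x longer

print(overlaps(good, gt), is_match(good, gt))
print(overlaps(too_long, gt), is_match(too_long, gt))

# One true positive (good), one false positive (too_long), no false negatives
f1 = f1_from_counts(tp=1, fp=1, fn=0)
print(round(f1, 4))
```

The second prediction covers the whole ground truth but fails because less than half of its own words are correct, which is why over-long spans hurt the score.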

# Custom Trainer

In [None]:
class FeedbackPrizeTrainer(Trainer):
    
    def __init__(self, *args, **kwargs):
        # The Trainer will remove the important columns needed for cv from the eval_dataset,
        # so we'll just store it like this
        if "cv_dataset" in kwargs:
            self.cv_dataset = kwargs.pop("cv_dataset")
        super().__init__(*args, **kwargs)
        
        
    def evaluation_loop(
        self, 
        dataloader,
        description,
        prediction_loss_only = None,
        ignore_keys = None,
        metric_key_prefix = "eval",
    ):
        
        eval_output = super().evaluation_loop(
            dataloader,
            description,
            prediction_loss_only=prediction_loss_only,
            ignore_keys=ignore_keys,
            metric_key_prefix=metric_key_prefix,
        )
        
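        # This same loop gets called during predict, and we can't do CV when predicting.
        # CV is also skipped when no cv_dataset was passed to the constructor.
        is_in_eval = metric_key_prefix == "eval" and getattr(self, "cv_dataset", None) is not None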
        
        # Custom CV F1 calculation
        if is_in_eval:
            
            eval_id_preds = eval_output.predictions.argmax(-1)
            
            eval_label_preds = get_label_predictions(self.cv_dataset, eval_id_preds)
            
            eval_pred_df = pd.DataFrame(eval_label_preds, columns=["id", "discourse_type", "predictionstring"])
            ground_truth_df = ground_truth_dataset.to_pandas()
            
            eval_gt_df = ground_truth_df[ground_truth_df["id"].isin(self.cv_dataset["id"])].reset_index(drop=True).copy()
            
            classes = ['Lead', 'Position', 'Evidence', 'Claim', 'Concluding Statement', 'Counterclaim', 'Rebuttal']
            f1_scores = []
            for class_ in classes:
                gt_df = eval_gt_df.loc[eval_gt_df['discourse_type'] == class_].copy()
                pred_df = eval_pred_df.loc[eval_pred_df['discourse_type'] == class_].copy()
                eval_scores = score_feedback_comp(pred_df, gt_df)
                for score_name, score in eval_scores.items():
                    eval_output.metrics[f"{metric_key_prefix}_{class_}-CV-{score_name}"] = score
                f1_scores.append(eval_scores["F1"])
                
            eval_output.metrics[f"{metric_key_prefix}_Overall-CV-F1"] = np.mean(f1_scores)
        
        return eval_output

# Initialize our Trainer
trainer = FeedbackPrizeTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    cv_dataset=eval_dataset, 
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
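
The `Overall-CV-F1` value that `FeedbackPrizeTrainer` adds to the metrics is a macro average: one F1 per discourse class, then the plain mean over classes. Here is a minimal sketch of that aggregation. It assumes the usual F1 formula `TP / (TP + 0.5 * (FP + FN))` for each class, and the counts are invented purely for illustration; `macro_f1` is a hypothetical helper, not part of the notebook.

```python
from statistics import mean


def macro_f1(per_class_counts, prefix="eval"):
    """per_class_counts maps class name -> (TP, FP, FN)."""
    metrics = {}
    f1_scores = []
    for name, (tp, fp, fn) in per_class_counts.items():
        denominator = tp + 0.5 * (fp + fn)
        # A class with no predictions and no ground truth scores 0
        f1 = round(tp / denominator, 4) if denominator else 0.0
        metrics[f"{prefix}_{name}-CV-F1"] = f1
        f1_scores.append(f1)
    # Every class counts equally, no matter how many spans it has
    metrics[f"{prefix}_Overall-CV-F1"] = mean(f1_scores)
    return metrics


# Invented counts, for illustration only
toy_metrics = macro_f1({"Lead": (8, 2, 2), "Claim": (3, 1, 5)})
print(toy_metrics)
```

Because every class has the same weight, a rare class such as Rebuttal moves the overall score as much as a frequent one such as Claim.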

# Training

### Scores for each evaluation are shown in a table. Scroll all the way to the right of the table to see the CV F1 score.

In [None]:
%env TOKENIZERS_PARALLELISM=true

# Training
if training_args.do_train:
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.save_model()  # Saves the tokenizer too for easy upload

    max_train_samples = (
        data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
    )
    metrics["train_samples"] = min(max_train_samples, len(train_dataset))

    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
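
If you'd rather not scroll through the table, the same numbers can be pulled out programmatically. The sketch below assumes the evaluation metrics end up in `trainer.state.log_history` as a list of dicts, and that the key is `eval_Overall-CV-F1` as built in the custom trainer. `best_eval` is a hypothetical helper and the toy history contains invented values, not real results.

```python
def best_eval(log_history, key="eval_Overall-CV-F1"):
    """Return the log entry with the highest value for `key`, or None if absent."""
    eval_logs = [entry for entry in log_history if key in entry]
    return max(eval_logs, key=lambda entry: entry[key]) if eval_logs else None


# Invented entries shaped like `trainer.state.log_history` (not real results)
toy_history = [
    {"loss": 1.21, "step": 500},
    {"eval_loss": 0.74, "eval_Overall-CV-F1": 0.48, "step": 500},
    {"loss": 0.69, "step": 1000},
    {"eval_loss": 0.66, "eval_Overall-CV-F1": 0.53, "step": 1000},
]
best = best_eval(toy_history)
print(best["step"], best["eval_Overall-CV-F1"])
```

After training, something like `best_eval(trainer.state.log_history)` should give the evaluation step with the best CV F1.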

# Final evaluation

In [None]:
# Evaluation
if training_args.do_eval:

    metrics = trainer.evaluate()

    max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
    metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

# Predicting on test set

In [None]:
test_data = {
    "id": [],
    data_args.text_column_name: []
}

def tokenize(examples):
    tokenized_inputs = tokenizer(
        examples[data_args.text_column_name],
        truncation=True,
        max_length=4096,
        # We use this argument because the texts in our dataset are lists of words (with a label for each word).
        is_split_into_words=True,
    )
    num_samples = len(tokenized_inputs["input_ids"])
    tokenized_inputs["word_ids"] = [tokenized_inputs.word_ids(batch_index=i) for i in range(num_samples)]
    
    return tokenized_inputs

# If you actually want to submit, it would probably be best in another notebook.
# Take your model, stick it in the Trainer, and run it through this to get your submission file.
# A lot of the code above is for training and validation, so only copy what you need.
if training_args.do_predict:
    for file in Path("../input/feedback-prize-2021/test").glob("*.txt"):
        file_id = file.stem

        with open(file) as fp:
            text = fp.read()

        test_data[data_args.text_column_name].append(text.split())
        test_data["id"].append(file_id)

    raw_test_dataset = Dataset.from_dict(test_data)

    test_dataset = raw_test_dataset.map(
        tokenize,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=not data_args.overwrite_cache,
        desc="Running tokenizer on test dataset",
    )

    test_predictions = trainer.predict(test_dataset)
    
    test_id_preds = test_predictions.predictions.argmax(-1)
    
    test_label_preds = get_label_predictions(test_dataset, test_id_preds)
    test_pred_df = pd.DataFrame(test_label_preds, columns=["id", "class", "predictionstring"])
    
    test_pred_df.to_csv("submission.csv", index=False)
    
    display(test_pred_df)
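
The `word_ids` column stored by `tokenize` is what lets token-level predictions be collapsed back to one label per whitespace word. Here is a small standalone sketch of that alignment, using the same first-subword rule as `get_label_predictions`. The helper name, the 3-label mapping and the toy inputs are hypothetical.

```python
def first_subtoken_labels(token_preds, word_ids, n_words, id2label):
    """Give each word the label predicted for its first subword token."""
    word_labels = [""] * n_words
    for pred, w_id in zip(token_preds, word_ids):
        if w_id is None:  # special tokens have no word id
            continue
        if word_labels[w_id] == "":  # later subwords of the same word are ignored
            word_labels[w_id] = id2label[pred]
    return word_labels


# Hypothetical toy example: 3 words, where words 0 and 2 are split into 2 subwords each
toy_id2label = {0: "O", 1: "B-Claim", 2: "I-Claim"}
toy_word_ids = [None, 0, 0, 1, 2, 2, None]
toy_token_preds = [0, 1, 2, 2, 0, 2, 0]
toy_word_labels = first_subtoken_labels(toy_token_preds, toy_word_ids, 3, toy_id2label)
print(toy_word_labels)  # -> ['B-Claim', 'I-Claim', 'O']
```

Words cut off by truncation never get a token, so they keep the empty label and are treated like `O` when spans are built.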

# Weights and Biases Report

<iframe src="https://wandb.ai/nbroad/feedback-prize/reports/-Feedback-Prize-Bigbird-base-NER-fine-tuning--VmlldzoxMzc3ODQz" style="border:none;height:1024px;width:100%"></iframe>

### I'm borrowing Darek's beautiful displacy code :)

kaggle.com/thedrcat/feedback-prize-huggingface-baseline-training

In [None]:
# kaggle.com/thedrcat/feedback-prize-huggingface-baseline-training
import spacy
from spacy import displacy
from pylab import cm, matplotlib

colors = {
            'Lead': '#8000ff',
            'Position': '#2b7ff6',
            'Evidence': '#2adddd',
            'Claim': '#80ffb4',
            'Concluding Statement': '#d4dd80',
            'Counterclaim': '#ff8042',
            'Rebuttal': '#ff0000',
            'Other': '#007f00',
         }

def visualize(df, text):
    ents = []
    example = df['id'].loc[0]

    for i, row in df.iterrows():
        ents.append({
                        'start': int(row['discourse_start']), 
                         'end': int(row['discourse_end']), 
                         'label': row['discourse_type']
                    })

    doc2 = {
        "text": text,
        "ents": ents,
        "title": example
    }

    options = {"ents": train.discourse_type.unique().tolist() + ['Other'], "colors": colors}
    displacy.render(doc2, style="ent", options=options, manual=True, jupyter=True)

In [None]:
# kaggle.com/thedrcat/feedback-prize-huggingface-baseline-training
def get_class(c):
    return id2label[c].replace("B-", "").replace("I-", "")

def pred2span(pred, example, viz=False):
    example_id = example['id']
    n_tokens = len(example['input_ids'])
    classes = []
    all_span = []
    for i, c in enumerate(pred.tolist()):
        if i == n_tokens-1:
            break
        if i == 0:
            cur_span = example['offset_mapping'][i]
            classes.append(get_class(c))
        elif i > 0 and (c == pred[i-1] or (c-7) == pred[i-1]):
            # Same label id, or previous id + 7 (assumes each I- id is its B- id plus 7): extend the span
            cur_span[1] = example['offset_mapping'][i][1]
        else:
            all_span.append(cur_span)
            cur_span = example['offset_mapping'][i]
            classes.append(get_class(c))
    all_span.append(cur_span)
    
    with open(f"../input/feedback-prize-2021/test/{example_id}.txt") as fp:
        text = fp.read()
    
    # map token ids to word (whitespace) token ids
    predstrings = []
    for span in all_span:
        span_start = span[0]
        span_end = span[1]
        before = text[:span_start]
        token_start = len(before.split())
        if len(before) == 0: token_start = 0
        elif before[-1] != ' ': token_start -= 1
        num_tkns = len(text[span_start:span_end+1].split())
        tkns = [str(x) for x in range(token_start, token_start+num_tkns)]
        predstring = ' '.join(tkns)
        predstrings.append(predstring)
                    
    rows = []
    for c, span, predstring in zip(classes, all_span, predstrings):
        e = {
            'id': example_id,
            'discourse_type': c,
            'predictionstring': predstring,
            'discourse_start': span[0],
            'discourse_end': span[1],
            'discourse': text[span[0]:span[1]+1]
        }
        rows.append(e)


    df = pd.DataFrame(rows)
    df['length'] = df['discourse'].apply(lambda t: len(t.split()))
    
    # short spans are likely to be false positives, we can choose a min number of tokens based on validation
    df = df[df.length > min_tokens].reset_index(drop=True)
    if viz: visualize(df, text)

    return df
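
The trickiest part of `pred2span` is turning a character span into whitespace-word indices for the `predictionstring`. The sketch below isolates that step. It is a simplified variant rather than the code above: `char_span_to_predictionstring` is a hypothetical helper, `end` is exclusive here (the code above slices with `span_end+1`), and any whitespace counts as a separator, not just `' '`.

```python
def char_span_to_predictionstring(text, start, end):
    """Map the character span text[start:end] to space-joined word indices."""
    before = text[:start]
    word_start = len(before.split())
    # If the span starts mid-word, that word was already counted in `before`
    if before and not before[-1].isspace():
        word_start -= 1
    n_words = len(text[start:end].split())
    return " ".join(str(i) for i in range(word_start, word_start + n_words))


toy_text = "Phones are useful. They help students learn."
span_start = toy_text.index("They")
predictionstring = char_span_to_predictionstring(toy_text, span_start, len(toy_text))
print(predictionstring)  # -> 3 4 5 6
```

Starting the span one character later, in the middle of "They", still gives `3 4 5 6`, because the partially covered word is counted as part of the span.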