Introduction
------------

This notebook is a small preparation stage for [Training RoBERTa in 10 minutes][1]. 
By splitting the work into two parts I aim to save the TPU quota, separate concerns, and keep each part easy to read and follow.

Many thanks to [Chris Deotte][2] for his [amazing work][3]. 
This notebook is a copy-and-edit of that work with just a handful of changes.

All the outputs of this notebook are also published to [this dataset][4].

Imports
-------

[1]: https://www.kaggle.com/nickuzmenkov/feedback-prize-training-roberta-in-10-minutes/
[2]: https://www.kaggle.com/cdeotte
[3]: https://www.kaggle.com/cdeotte/tensorflow-longformer-ner-cv-0-633
[4]: https://www.kaggle.com/nickuzmenkov/feedback-prize-roberta-tokens-1024

In [None]:
from tqdm.notebook import tqdm
import transformers
import pandas as pd
import numpy as np
import typing
import os

Configuration
-------------

In [None]:
BASE_MODEL = "roberta-base"
SEQ_LEN = 1024
TEXT_PATH = "../input/feedback-prize-2021/train"
CSV_PATH = "../input/feedback-prize-2021/train.csv"
LABEL_MAP = {
    "Lead": 0,
    "Position": 1,
    "Evidence": 2,
    "Claim": 3,
    "Concluding Statement": 4,
    "Counterclaim": 5,
    "Rebuttal": 6,
}
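
The tokenizer code below one-hot encodes every token over `2 * len(LABEL_MAP) + 1 = 15` columns: for class `k`, column `2 * k` marks the first token of a span, column `2 * k + 1` marks the tokens that follow, and the last column marks tokens outside any span. A minimal sketch of that layout follows. `column_name` is a hypothetical helper that is not part of this notebook, and the `B-`/`I-`/`O` names are an assumption borrowed from common NER conventions:

```python
LABEL_MAP = {
    "Lead": 0,
    "Position": 1,
    "Evidence": 2,
    "Claim": 3,
    "Concluding Statement": 4,
    "Counterclaim": 5,
    "Rebuttal": 6,
}


def column_name(column: int) -> str:
    """Hypothetical helper: describe one column of the label array."""
    n_columns = 2 * len(LABEL_MAP) + 1
    if not 0 <= column < n_columns:
        raise ValueError(f"column must be in [0, {n_columns}), got {column}")
    if column == n_columns - 1:
        return "O"  # last column: token is outside every span
    inverse = {index: name for name, index in LABEL_MAP.items()}
    prefix = "B" if column % 2 == 0 else "I"  # even: span start, odd: continuation
    return f"{prefix}-{inverse[column // 2]}"


print([column_name(c) for c in range(2 * len(LABEL_MAP) + 1)])
```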

Tokenizer code
--------------

Treat it as a black box unless you want to explore it in depth.

In [None]:
class Tokenizer:
    InputIds = typing.TypeVar("InputIds", bound=typing.List[int])
    AttentionMask = typing.TypeVar("AttentionMask", bound=typing.List[int])
    Labels = typing.TypeVar("Labels", bound=typing.List[int])
    OffsetMapping = typing.TypeVar(
        "OffsetMapping",
        bound=typing.List[typing.Tuple[int, int]],
    )
    TokenizerOutput = typing.NamedTuple(
        "TokenizerOutput",
        [
            ("input_ids", InputIds),
            ("attention_mask", AttentionMask),
            ("offset_mapping", OffsetMapping),
        ],
    )
    Output = typing.NamedTuple(
        "Output",
        [
            ("input_ids", InputIds),
            ("attention_mask", AttentionMask),
            ("labels", Labels),
        ],
    )

    def __init__(
        self, df: pd.DataFrame, base_model: transformers.AutoTokenizer
    ) -> None:
        """
        Initialize tokenizer instance

        :param df: DataFrame with labels for all texts
        :param base_model: pre-loaded tokenizer
        """
        self._df = df
        self._base_model = base_model

    def _init_output(self, n: int) -> Output:
        """
        Return output-like arrays of zeros

        :param n: number of unique text ids
        """
        return self.Output(
            input_ids=np.zeros((n, SEQ_LEN), dtype="int32"),
            attention_mask=np.zeros((n, SEQ_LEN), dtype="int32"),
            labels=np.zeros((n, SEQ_LEN, 2 * len(LABEL_MAP) + 1), dtype="int32"),
        )

    def _get_tokenizer_output(self, text: str) -> TokenizerOutput:
        """
        Return input token ids, attention mask and offset mapping of given text

        :param text: essay text to tokenize
        """
        encoding = self._base_model.encode_plus(
            text,
            max_length=SEQ_LEN,
            padding="max_length",
            truncation=True,
            return_offsets_mapping=True,
        )
        return self.TokenizerOutput(
            input_ids=encoding["input_ids"],
            attention_mask=encoding["attention_mask"],
            offset_mapping=encoding["offset_mapping"],
        )

    @staticmethod
    def _get_labels(df: pd.DataFrame, offset_mapping: OffsetMapping) -> Labels:
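        """
        Return token-level labels built from the character spans of each discourse element

        For class k, column 2 * k marks the first token of a span, column 2 * k + 1
        marks the remaining tokens, and the last column marks tokens outside any span.

        :param df: slice of DataFrame containing single text id
        :param offset_mapping: offset mapping returned by tokenizer
        """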
        labels = np.zeros((SEQ_LEN, 2 * len(LABEL_MAP) + 1), dtype="int32")
        offset_index = 0

        for _, (discourse_start, discourse_end, discourse_type) in df.iterrows():
            if offset_index > len(offset_mapping) - 1:
                break

            k = LABEL_MAP[discourse_type]

            token_start = offset_mapping[offset_index][0]
            token_end = offset_mapping[offset_index][1]

            first = True

            while discourse_end > token_start:
                if (token_start >= discourse_start) and (token_end <= discourse_end):
                    if first:
                        labels[offset_index, 2 * k] = 1
                        first = False
                    else:
                        labels[offset_index, 2 * k + 1] = 1

                offset_index += 1

                if offset_index > len(offset_mapping) - 1:
                    break

                token_start = offset_mapping[offset_index][0]
                token_end = offset_mapping[offset_index][1]

        labels[:, -1] = 1 - np.max(labels, axis=-1)
        return labels

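    def tokenize(self, verbose: int = 0) -> Output:
        """
        Tokenize every text in the DataFrame and build its token-level labels

        :param verbose: show a progress bar if greater than 0
        """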
        n = self._df.index.nunique()
        ids = enumerate(self._df.index.unique())

        if verbose > 0:
            ids = tqdm(ids, total=n, desc="Tokenizing")

        output = self._init_output(n=n)

        for i, id_ in ids:
            with open(os.path.join(TEXT_PATH, id_ + ".txt")) as file:
                text = file.read().strip().lower()
                tokenizer_output = self._get_tokenizer_output(text)

            output.input_ids[i] = tokenizer_output.input_ids
            output.attention_mask[i] = tokenizer_output.attention_mask
            output.labels[i] = self._get_labels(
                self._df[self._df.index == id_],
                tokenizer_output.offset_mapping,
            )

        return output
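
For intuition about what `padding="max_length"` and `truncation=True` do in `_get_tokenizer_output`, here is a rough pure-Python sketch. `pad_or_truncate` is a hypothetical helper, not a `transformers` function, and it only approximates the real behaviour: it assumes the `roberta-base` special token ids (`<s>` = 0, `<pad>` = 1, `</s>` = 2) and assumes `max_length` is at least 2.

```python
from typing import List, Tuple

# Special token ids of the roberta-base vocabulary
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def pad_or_truncate(
    content_ids: List[int], max_length: int
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: mimic fixed-length encoding of one text."""
    # Reserve two slots for the <s> and </s> special tokens
    kept = content_ids[: max_length - 2]
    input_ids = [BOS_ID] + kept + [EOS_ID]
    # Real tokens (including special ones) get mask 1, padding gets mask 0
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([10, 11, 12], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every row therefore has the same length, which is what lets `tokenize` write each text into a preallocated `(n, SEQ_LEN)` array.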

Load data and RoBERTa tokenizer
-------------------------------

In [None]:
df = pd.read_csv(
    CSV_PATH,
    usecols=["id", "discourse_start", "discourse_end", "discourse_type"],
    dtype={
        "id": "object",
        "discourse_start": "int32",
        "discourse_end": "int32",
        "discourse_type": "category",
    },
    index_col="id",
)
base_model = transformers.AutoTokenizer.from_pretrained(BASE_MODEL)

The tricky part
---------------

In the training data, each **word** is assigned one of 8 classes (lead, position, evidence, claim, counterclaim, rebuttal, concluding statement, or none of those). The tokenizer, however, produces **tokens**, which are only loosely connected to words. A token **can be** a whole word, a part of a word, a special token (e.g. text start, id `0`, and text end, id `2`), or a punctuation mark:


In [None]:
quote = "It's understood that Hollywood sells Californication..."
encoding = base_model.encode_plus(quote, return_offsets_mapping=True)

pd.DataFrame(
    {"token": [quote[x[0] : x[1]] for x in encoding["offset_mapping"]]},
    index=pd.Index(encoding["input_ids"], name="token_id"),
)

Our labels, however, are word-wise rather than token-wise, so we have to map the former onto the latter.
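
The mapping idea can be sketched on a toy example. This is a simplified variant, not the exact loop used in `_get_labels`: the offsets are hand-written stand-ins for tokenizer output, the spans and the `align_labels` helper are hypothetical, and zero-width offsets (special and padding tokens) are simply skipped.

```python
from typing import List, Tuple


def align_labels(
    offsets: List[Tuple[int, int]], spans: List[Tuple[int, int, str]]
) -> List[str]:
    """Hypothetical helper: label each token by the character span containing it."""
    labels = ["O"] * len(offsets)
    for span_start, span_end, name in spans:
        first = True
        for i, (token_start, token_end) in enumerate(offsets):
            if token_start == token_end:
                continue  # zero-width offsets: special or padding tokens
            if token_start >= span_start and token_end <= span_end:
                labels[i] = ("B-" if first else "I-") + name
                first = False
    return labels


text = "Cats rule. Dogs drool."
# Hand-written offsets: <s>, Cats, rule, ., Dogs, dro, ol, ., </s>
offsets = [(0, 0), (0, 4), (5, 9), (9, 10), (11, 15), (16, 19), (19, 21), (21, 22), (0, 0)]
# Hypothetical character-level annotations: (start, end, class)
spans = [(0, 10, "Claim"), (11, 22, "Evidence")]

token_labels = align_labels(offsets, spans)
for (start, end), label in zip(offsets, token_labels):
    print(repr(text[start:end]), label)
```

Note how the word "drool" is split into two tokens and both inherit the label of the span that contains them.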

Run tokenizing
--------------

In [None]:
tokenizer = Tokenizer(df=df, base_model=base_model)
output = tokenizer.tokenize(verbose=1)

Save results
------------

In [None]:
np.save("input_ids.npy", output.input_ids)
np.save("attention_mask.npy", output.attention_mask)
np.save("labels.npy", output.labels)
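
A minimal sketch of reading the arrays back, for example in a downstream notebook. `load_tokens` is a hypothetical helper, and the snippet first writes small dummy arrays to a temporary directory so that it runs on its own; with the real files you would point it at the directory that holds them. The shape checks assume the layout produced above, with `SEQ_LEN = 1024` and 7 classes.

```python
import os
import tempfile

import numpy as np

SEQ_LEN = 1024
N_COLUMNS = 2 * 7 + 1  # 2 * len(LABEL_MAP) + 1 with the 7 classes above


def load_tokens(directory: str):
    """Hypothetical helper: load the three arrays and check that they line up."""
    input_ids = np.load(os.path.join(directory, "input_ids.npy"))
    attention_mask = np.load(os.path.join(directory, "attention_mask.npy"))
    labels = np.load(os.path.join(directory, "labels.npy"))

    assert input_ids.shape == attention_mask.shape
    assert labels.shape == input_ids.shape + (N_COLUMNS,)
    # Each token is expected to carry exactly one label (one-hot rows)
    assert (labels.sum(axis=-1) == 1).all()
    return input_ids, attention_mask, labels


with tempfile.TemporaryDirectory() as directory:
    # Dummy stand-ins for the real outputs: 3 texts, every token marked "outside"
    dummy_labels = np.zeros((3, SEQ_LEN, N_COLUMNS), dtype="int32")
    dummy_labels[:, :, -1] = 1
    dummy_ids = np.zeros((3, SEQ_LEN), dtype="int32")
    np.save(os.path.join(directory, "input_ids.npy"), dummy_ids)
    np.save(os.path.join(directory, "attention_mask.npy"), dummy_ids)
    np.save(os.path.join(directory, "labels.npy"), dummy_labels)

    input_ids, attention_mask, labels = load_tokens(directory)

print(input_ids.shape, attention_mask.shape, labels.shape)
```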

Conclusion
----------

Thanks for reading. If you like this work, please visit the next part: [Training RoBERTa in 10 minutes][1].
All the outputs of this notebook are also published to [this dataset][2].

I am no NLP expert, so feel free to correct me if you spot a mistake.

[1]: https://www.kaggle.com/nickuzmenkov/feedback-prize-training-roberta-in-10-minutes/
[2]: https://www.kaggle.com/nickuzmenkov/feedback-prize-roberta-tokens-1024