Introduction
------------

This notebook aims to make training as fast as possible while leaving you room for your own improvements.

It is based on this [amazing work][1] by [Chris Deotte][2]. I kept only the training loop and moved the tokenizing part to a [separate notebook][3]. I also had to change the original backbone from Longformer to RoBERTa because of TensorFlow TPU-related errors that I was unable to fix (if you know how to fix them, please let me know in the comments below).

As the title says, it takes only 10 minutes to train the RoBERTa model (single fold, 5 epochs), which is 30 times faster than the original work.

[1]: https://www.kaggle.com/cdeotte/tensorflow-longformer-ner-cv-0-633
[2]: https://www.kaggle.com/cdeotte
[3]: https://www.kaggle.com/nickuzmenkov/feedback-prize-making-roberta-tokens

Imports
-------

In [None]:
from sklearn.model_selection import KFold
import plotly.graph_objects as go
import tensorflow as tf
import transformers
import numpy as np
import logging
import typing

Configuration
-------------

In [None]:
def hardware_config() -> tuple:
    """Return the strategy, the TPU resolver (None without a TPU), and the batch size."""
    try:
        # Raises if no TPU is attached, which sends us to the fallback below.
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
        # Global batch size: 16 samples per replica.
        batch_size = 16 * strategy.num_replicas_in_sync
    except Exception:
        tpu = None
        # Default (no-op) strategy for a single GPU or the CPU.
        strategy = tf.distribute.get_strategy()
        batch_size = 4

    return strategy, tpu, batch_size


BASE_MODEL = "roberta-base"
SEQ_LEN = 1024
SEED = 42

N_FOLDS = 5
USED_FOLDS = [0]
LRS = [0.25e-4, 0.25e-4, 0.25e-4, 0.25e-4, 0.25e-5]
EPOCHS = 5
VERBOSE = 2
LABEL_MAP = {
    "Lead": 0,
    "Position": 1,
    "Evidence": 2,
    "Claim": 3,
    "Concluding Statement": 4,
    "Counterclaim": 5,
    "Rebuttal": 6,
}
STRATEGY, TPU, BATCH_SIZE = hardware_config()
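
`LRS` holds one learning rate per epoch, and the training loop looks the value up by epoch index, so `EPOCHS` must not exceed `len(LRS)`. Here is a minimal sketch of a more forgiving lookup. The helper name `lr_for_epoch` and the choice to reuse the last rate for extra epochs are my assumptions, not part of the original notebook.

```python
# Same per-epoch learning rates as LRS in the configuration cell.
LRS = [0.25e-4, 0.25e-4, 0.25e-4, 0.25e-4, 0.25e-5]


def lr_for_epoch(epoch: int, lr: float = 0.0) -> float:
    """Return the learning rate for a zero-based epoch index.

    Hypothetical helper: epochs past the end of LRS reuse the last rate
    instead of raising IndexError. The `lr` argument (the current rate)
    is accepted but ignored.
    """
    return LRS[min(epoch, len(LRS) - 1)]


for epoch in range(7):
    print(epoch, lr_for_epoch(epoch))
```

A function like this could be passed to `tf.keras.callbacks.LearningRateScheduler` in place of the lambda used in the training cell.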

In [None]:
print("TensorFlow", tf.__version__)

if TPU is not None:
    print("Using TPU v3-8")
else:
    print("Using GPU or CPU")

print("Batch size:", BATCH_SIZE)

Utilities
---------

Treat these helpers as a black box unless you want to dig into the details.

In [None]:
def get_dataset(
    input_ids: np.ndarray,
    attention_mask: np.ndarray,
    labels: typing.Optional[np.ndarray] = None,
    ordered: bool = False,
    repeated: bool = False,
) -> tf.data.Dataset:
    """Return batched and prefetched dataset"""
    if labels is not None:
        dataset = tf.data.Dataset.from_tensor_slices(
            ({"input_ids": input_ids, "attention_mask": attention_mask}, labels)
        )
    else:
        dataset = tf.data.Dataset.from_tensor_slices(
            {"input_ids": input_ids, "attention_mask": attention_mask}
        )

    # Shuffle before repeating so that every sample is seen once per pass.
    if not ordered:
        dataset = dataset.shuffle(1024)
    if repeated:
        dataset = dataset.repeat()
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)

    return dataset


def get_model() -> tf.keras.Model:
    """Return compiled instance of Keras model"""
    backbone = transformers.TFRobertaForTokenClassification.from_pretrained(BASE_MODEL)

    input_ids = tf.keras.layers.Input(
        shape=(SEQ_LEN,),
        dtype=tf.int32,
        name="input_ids",
    )
    attention_mask = tf.keras.layers.Input(
        shape=(SEQ_LEN,),
        dtype=tf.int32,
        name="attention_mask",
    )
    x = backbone(
        {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        },
    )
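    # x[0] is the first backbone output. For TFRobertaForTokenClassification this is
    # the per-token logits of its own classification head, not the hidden states.
    x = tf.keras.layers.Dense(256, activation="relu")(x[0])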
    outputs = tf.keras.layers.Dense(
        2 * len(LABEL_MAP) + 1, activation="softmax", dtype="float32"
    )(x)

    model = tf.keras.Model(
        inputs=[input_ids, attention_mask],
        outputs=outputs,
    )
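    model.compile(
        # The default Adam learning rate is overridden every epoch
        # by the LearningRateScheduler callback used during training.
        optimizer=tf.keras.optimizers.Adam(),
        loss=tf.keras.losses.CategoricalCrossentropy(),
        metrics=[tf.keras.metrics.CategoricalAccuracy()],
    )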
    return model

Load data
---------

Here we load tokens prepared in [this notebook][1].

We have 7 statement types: lead, position, evidence, claim, counterclaim, rebuttal, and concluding statement. This results in 15 possible classes: the first word of each statement type (7 classes), any non-first word of each statement type (7 classes), and none of the above (1 class).
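
To make the class count concrete, here is a small sketch that builds one readable name per class index. The index layout (first-word classes, then non-first-word classes, then "none") and the `B-`/`I-`/`O` names are assumptions for illustration only; check the tokens notebook for the layout actually stored in `labels.npy`.

```python
# Same mapping as LABEL_MAP in the configuration cell.
LABEL_MAP = {
    "Lead": 0,
    "Position": 1,
    "Evidence": 2,
    "Claim": 3,
    "Concluding Statement": 4,
    "Counterclaim": 5,
    "Rebuttal": 6,
}


def build_class_names(label_map: dict) -> list:
    """Return one name per class index under the ASSUMED layout.

    Assumed layout (hypothetical, not taken from the tokens notebook):
    indices 0..n-1   -> first word of a statement ("B-" prefix),
    indices n..2n-1  -> non-first word of a statement ("I-" prefix),
    index   2n       -> none of the above ("O").
    """
    n = len(label_map)
    names = [""] * (2 * n + 1)
    for label, k in label_map.items():
        names[k] = f"B-{label}"
        names[n + k] = f"I-{label}"
    names[2 * n] = "O"
    return names


CLASS_NAMES = build_class_names(LABEL_MAP)
print(len(CLASS_NAMES))
```

The length of `CLASS_NAMES` matches the `2 * len(LABEL_MAP) + 1` units of the model's output layer.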

[1]: https://www.kaggle.com/nickuzmenkov/feedback-prize-making-roberta-tokens

In [None]:
input_ids = np.load("../input/feedback-prize-roberta-tokens-1024/input_ids.npy")
attention_mask = np.load("../input/feedback-prize-roberta-tokens-1024/attention_mask.npy")
labels = np.load("../input/feedback-prize-roberta-tokens-1024/labels.npy")

print("Input ids shape:", input_ids.shape)
print("Attention mask shape:", attention_mask.shape)
print("Labels shape:", labels.shape)

Run training
------------

In [None]:
kfold = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
# Look up the learning rate for each epoch in LRS (requires EPOCHS <= len(LRS)).
callbacks = [tf.keras.callbacks.LearningRateScheduler(lambda epoch: LRS[epoch], verbose=1)]

history = [None] * N_FOLDS
scores = [None] * N_FOLDS

for i, (train_index, val_index) in enumerate(kfold.split(input_ids)):
    if i not in USED_FOLDS:
        continue

    if TPU is not None:
        tf.tpu.experimental.initialize_tpu_system(TPU)

    with STRATEGY.scope():
        model = get_model()

    train_dataset = get_dataset(
        input_ids=input_ids[train_index],
        attention_mask=attention_mask[train_index],
        labels=labels[train_index],
        repeated=True,
    )
    val_dataset = get_dataset(
        input_ids=input_ids[val_index],
        attention_mask=attention_mask[val_index],
        labels=labels[val_index],
        ordered=True,
    )

    # The training dataset repeats forever, so this step count defines an epoch.
    steps_per_epoch = len(train_index) // BATCH_SIZE

    history[i] = model.fit(
        train_dataset,
        validation_data=val_dataset,
        steps_per_epoch=steps_per_epoch,
        callbacks=callbacks,
        epochs=EPOCHS,
        verbose=VERBOSE,
    ).history
    scores[i] = max(history[i]["val_categorical_accuracy"])

    # Saves the weights from the final epoch, which may not be the best-scoring one.
    model.save_weights(f"model_{i}.h5")
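
Because the training dataset repeats, one "epoch" is simply `steps_per_epoch` full batches. The sketch below works through that arithmetic. The sample count and the replica count are hypothetical numbers chosen for illustration, and `epoch_steps` is not a function from the notebook.

```python
def epoch_steps(n_samples: int, global_batch_size: int) -> tuple:
    """Return (full batches per epoch, samples left over after those batches)."""
    steps = n_samples // global_batch_size
    leftover = n_samples - steps * global_batch_size
    return steps, leftover


# Hypothetical numbers: 10,000 training samples on 8 replicas, 16 samples each.
N_TRAIN = 10_000
GLOBAL_BATCH = 16 * 8

steps, leftover = epoch_steps(N_TRAIN, GLOBAL_BATCH)
print(f"{steps} steps per epoch, {leftover} samples left over")
```

The leftover samples are not lost: since the dataset repeats, they are served at the start of the next epoch.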

Results
-------

In [None]:
for i in range(N_FOLDS):
    if history[i] is None:
        continue

    go.Figure(
        data=(
            go.Scatter(
                y=history[i]["categorical_accuracy"],
                name="train",
            ),
            go.Scatter(
                y=history[i]["val_categorical_accuracy"],
                name="validation",
            ),
        ),
        layout=dict(
            height=400,
            width=600,
            margin=dict(t=75, b=25),
            title_text=f"Fold {i} (Best score {scores[i]:.4f})",
            xaxis=dict(title_text="Epoch", dtick=1),
            yaxis=dict(title_text="Categorical accuracy"),
        ),
    ).show()

Conclusion
----------

I am no NLP expert, so feel free to correct me in the comments if you spot a mistake.

Thanks for reading.