<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/35887/logos/header.png?t=2022-05-09-22-33-02">

<h1><center>[2/3] AI4Code TensorFlow TPU with CodeBert - Training</center></h1>

This is the second part of my **AI4Code TensorFlow TPU with CodeBert** series:

* [1/3] [Data Preparation][1] (~5 hours)
* **[2/3] TPU Training ← (you're here)**
* [3/3] [GPU Inference][2] (~2 hours)

This is essentially a translation of **[Khoi Nguyen's][3]** work [[1][4], [2][5]] from PyTorch to TensorFlow, with minor changes and updates for TPU support. The **[original][4]** PyTorch version takes up to 40 hours per epoch on a Kaggle GPU, whereas this version takes only 50 minutes per epoch on a Kaggle TPU, so it's lightning fast ⚡.

The trained model weights are available in the **[AI4Code CodeBert Weights][6]** dataset.

### About Solution

- Input data: markdown + code context (512 tokens) + features
    - Markdown (up to 64 tokens)
    - Code context (all code cells, capped at 20 cells of up to 23 tokens each)
    - Features: markdown cells to total cells ratio (appended to backbone outputs)
- Model and hyperparameters
    - CodeBert Base model
    - L1 loss (MAE)
    - AdamW optimizer
    - Learning rate schedule with warmup and linear decay
    - Total 5 epochs
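
The warmup and linear decay schedule listed above can be sketched in plain Python. This is an illustrative stand-in, not the notebook's TensorFlow class; the function name `warmup_linear_decay` and the example step counts are assumptions chosen for the demonstration.

```python
def warmup_linear_decay(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear ramp up during warmup, then linear decay to zero."""
    if step <= warmup_steps:
        # Ramp from 0 to base_lr across the warmup phase.
        scale = step / warmup_steps
    else:
        # Decay from base_lr at warmup_steps to 0 at total_steps.
        scale = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * scale


# Hypothetical run: 1,000 steps in total, 5% of them warmup.
for step in (0, 25, 50, 525, 1000):
    print(step, warmup_linear_decay(step, base_lr=5e-4, warmup_steps=50, total_steps=1000))
```

The rate peaks at `base_lr` exactly when warmup ends and reaches zero on the final step, so the last updates of training are the smallest.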

### Input Data

- **[AI4Code-CodeBert-Tokens][7]**: output from **[Data Preparation][1]** step
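
The training code later in this notebook reads the number of samples in each TFRecord shard from the integer after the last hyphen in its file name. A minimal sketch of that convention follows; the bucket and file names are invented for illustration, and the real names produced by the data preparation step may look different.

```python
import os
from typing import List


def count_samples(filenames: List[str]) -> int:
    # "dir/00-2048.tfrec" -> "00-2048" -> "2048" -> 2048
    return sum(int(os.path.basename(name).split(".")[0].split("-")[-1]) for name in filenames)


# Made-up shard paths, used only to illustrate the naming scheme.
shards = [
    "gs://example-bucket/tfrec/0/00-2048.tfrec",
    "gs://example-bucket/tfrec/0/01-1500.tfrec",
]
print(count_samples(shards))
```

Encoding the count in the name avoids a full pass over the records just to learn how many steps make up an epoch.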

### Warning

This notebook reads Kaggle environment variables. If you run it on Google Colab, set the `VERBOSE` hyperparameter explicitly to either 1 or 2.
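
One way to make this setting portable is to fall back to a default when the Kaggle variable is absent. A minimal sketch, in which the helper name `resolve_verbose` is hypothetical:

```python
import os
from typing import Mapping


def resolve_verbose(env: Mapping[str, str]) -> int:
    """Return 1 (progress bar) for interactive Kaggle sessions, otherwise 2 (one line per epoch)."""
    # .get() returns None when the variable is missing, e.g. on Colab, instead of raising KeyError.
    return 1 if env.get("KAGGLE_KERNEL_RUN_TYPE") == "Interactive" else 2


VERBOSE = resolve_verbose(os.environ)
print("VERBOSE =", VERBOSE)
```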

[1]: https://www.kaggle.com/nickuzmenkov/ai4code-tf-tpu-codebert-data-preparation
[2]: https://www.kaggle.com/nickuzmenkov/ai4code-tf-tpu-codebert-inference
[3]: https://www.kaggle.com/suicaokhoailang
[4]: https://github.com/suicao/ai4code-baseline/tree/main/code
[5]: https://www.kaggle.com/code/suicaokhoailang/stronger-baseline-with-code-cells
[6]: https://www.kaggle.com/datasets/nickuzmenkov/ai4code-codebert-weights
[7]: https://www.kaggle.com/datasets/nickuzmenkov/ai4code-codebert-tokens

# Setup

In [None]:
import os
from typing import List

import numpy as np
import tensorflow as tf
import transformers
from kaggle_datasets import KaggleDatasets
from sklearn.model_selection import KFold

In [None]:
RANDOM_STATE = 42
N_SPLITS = 5
TOTAL_MAX_LEN = 512
BASE_MODEL = "microsoft/codebert-base"
GCS_PATH = KaggleDatasets().get_gcs_path("ai4code-codebert-tokens")
EPOCHS = 5
LR = 5e-4
WARMUP_RATE = 0.05
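# Progress bar in interactive sessions, one line per epoch otherwise (including outside Kaggle).
VERBOSE = 1 if os.environ.get("KAGGLE_KERNEL_RUN_TYPE") == "Interactive" else 2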

try:
    TPU = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(TPU)
    tf.tpu.experimental.initialize_tpu_system(TPU)
    STRATEGY = tf.distribute.experimental.TPUStrategy(TPU)
    BATCH_SIZE = 64 * STRATEGY.num_replicas_in_sync
except Exception:
    TPU = None
    STRATEGY = tf.distribute.get_strategy()
    BATCH_SIZE = 4

print("TensorFlow", tf.__version__)

if TPU is not None:
    print("Using TPU v3-8")
else:
    print("Using GPU/CPU")

print("Batch size:", BATCH_SIZE)

In [None]:
def count_samples(filenames: List[str]) -> int:
    # The sample count is the integer after the last hyphen in each file's base name.
    return sum(int(os.path.basename(x).split(".")[0].split("-")[-1]) for x in filenames)


def read_tfrecord(example):
    features = {
        "input_ids": tf.io.FixedLenFeature([TOTAL_MAX_LEN], tf.int64),
        "attention_mask": tf.io.FixedLenFeature([TOTAL_MAX_LEN], tf.int64),
        "feature": tf.io.FixedLenFeature([], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.float32),
    }
    example = tf.io.parse_single_example(example, features)
    return (
        {
            "input_ids": tf.cast(example["input_ids"], tf.int32),
            "attention_mask": tf.cast(example["attention_mask"], tf.int32),
            "feature": example["feature"],
        },
        example["label"],
    )


def get_dataset(
    filenames: List[str],
    ordered: bool = False,
    repeated: bool = True,
    cached: bool = False,
) -> tf.data.Dataset:
    auto = tf.data.experimental.AUTOTUNE
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=auto)
    if not ordered:
        ignore_order = tf.data.Options()
        ignore_order.experimental_deterministic = False
        dataset = dataset.with_options(ignore_order)
    dataset = dataset.map(read_tfrecord, num_parallel_calls=auto)
    if not ordered:
        dataset = dataset.shuffle(2048, seed=RANDOM_STATE)
    if repeated:
        dataset = dataset.repeat()
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    if cached:
        dataset = dataset.cache()
    dataset = dataset.prefetch(auto)
    return STRATEGY.experimental_distribute_dataset(dataset)


def get_model() -> tf.keras.Model:
    backbone = transformers.TFAutoModel.from_pretrained(BASE_MODEL)
    input_ids = tf.keras.layers.Input(
        shape=(TOTAL_MAX_LEN,),
        dtype=tf.int32,
        name="input_ids",
    )
    attention_mask = tf.keras.layers.Input(
        shape=(TOTAL_MAX_LEN,),
        dtype=tf.int32,
        name="attention_mask",
    )
    feature = tf.keras.layers.Input(
        shape=(1,),
        dtype=tf.float32,
        name="feature",
    )
    x = backbone({"input_ids": input_ids, "attention_mask": attention_mask})[0]
    x = tf.concat([x[:, 0, :], feature], axis=1)
    outputs = tf.keras.layers.Dense(1, activation="linear", dtype="float32")(x)
    return tf.keras.Model(
        inputs=[input_ids, attention_mask, feature],
        outputs=outputs,
    )


class WarmupLinearDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(
        self,
        base_learning_rate: float,
        warmup_steps: int,
        total_steps: int,
    ) -> None:
        self._base_learning_rate = base_learning_rate
        self._warmup_steps = warmup_steps
        self._total_steps = total_steps

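    def __call__(self, step: tf.Tensor) -> tf.Tensor:
        # Cast so the integer optimizer step divides cleanly and the result stays float32.
        step = tf.cast(step, tf.float32)
        warmup_steps = tf.cast(self._warmup_steps, tf.float32)
        total_steps = tf.cast(self._total_steps, tf.float32)
        return self._base_learning_rate * tf.cond(
            tf.math.less_equal(step, warmup_steps),
            # Linear ramp from 0 to 1 over the warmup steps.
            lambda: step / warmup_steps,
            # Linear decay from 1 at the end of warmup to 0 at total_steps.
            lambda: (step - total_steps) / (warmup_steps - total_steps),
        )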

# Training
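
The training loop below treats each of the `N_SPLITS` TFRecord folders as one fold and derives the schedule lengths from the sample count. A plain-Python sketch of that bookkeeping follows; the 8 TPU replicas and the sample count are assumptions made for the example, not values taken from the dataset.

```python
N_SPLITS = 5
EPOCHS = 5
WARMUP_RATE = 0.05
BATCH_SIZE = 64 * 8  # 64 per replica, assuming 8 TPU replicas

# Unshuffled K-fold over 5 folder indices with 5 splits: fold i validates on folder i.
folds = [
    ([x for x in range(N_SPLITS) if x != i], [i])
    for i in range(N_SPLITS)
]
train_index, val_index = folds[0]
print(train_index, val_index)

# Hypothetical number of samples in the four training folders.
n_train_samples = 1_000_000

steps_per_epoch = n_train_samples // BATCH_SIZE  # partial batches are dropped
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_RATE * total_steps)
print(steps_per_epoch, total_steps, warmup_steps)
```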

In [None]:
for i, (train_index, val_index) in enumerate(KFold(n_splits=N_SPLITS).split(range(N_SPLITS))):
    if TPU is not None:
        tf.tpu.experimental.initialize_tpu_system(TPU)

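    # Flatten the per-folder file lists; this also works when folders hold different numbers of files.
    train_filenames = [
        filename
        for x in train_index
        for filename in tf.io.gfile.glob(os.path.join(GCS_PATH, "tfrec", str(x), "*.tfrec"))
    ]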
    steps_per_epoch = count_samples(train_filenames) // BATCH_SIZE
    train_dataset = get_dataset(train_filenames)

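    # Flatten the per-folder file lists; this also works when folders hold different numbers of files.
    val_filenames = [
        filename
        for x in val_index
        for filename in tf.io.gfile.glob(os.path.join(GCS_PATH, "tfrec", str(x), "*.tfrec"))
    ]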
    validation_steps = count_samples(val_filenames) // BATCH_SIZE
    val_dataset = get_dataset(val_filenames, ordered=True, repeated=False, cached=True)

    with STRATEGY.scope():
        model = get_model()

        total_steps = steps_per_epoch * EPOCHS
        warmup_steps = int(WARMUP_RATE * total_steps)

        optimizer = transformers.AdamWeightDecay(
            learning_rate=WarmupLinearDecay(
                base_learning_rate=LR,
                warmup_steps=warmup_steps,
                total_steps=total_steps,
            ),
            weight_decay_rate=0.01,
            # TF variables are named "LayerNorm/gamma" and "LayerNorm/beta", so match on the layer name.
            exclude_from_weight_decay=[
                "bias",
                "LayerNorm",
                "layer_norm",
            ],
        )
        model.compile(loss="mae", optimizer=optimizer)

    model.fit(
        train_dataset,
        steps_per_epoch=steps_per_epoch,
        validation_data=val_dataset,
        validation_steps=validation_steps,
        epochs=EPOCHS,
        verbose=VERBOSE,
    )

    model.save_weights(f"model_{i}.h5")
    # Train the first fold only; remove this `break` to train all folds.
    break

# Next Steps

The model weights are in **[this dataset][1]**. To continue exploring the series:

* [1/3] [Data Preparation][2] (~3 hours)
* <span style="color:lightgray">[2/3] TPU Training ← (you're here)</span>
* [3/3] [GPU Inference][3] (~2 hours)


[1]: https://www.kaggle.com/datasets/nickuzmenkov/ai4code-codebert-weights
[2]: https://www.kaggle.com/nickuzmenkov/ai4code-tf-tpu-codebert-data-preparation
[3]: https://www.kaggle.com/nickuzmenkov/ai4code-tf-tpu-codebert-inference