As a novice in Natural Language Processing based DL tasks, I see this Kaggle competition as a great opportunity to break the ice and get comfortable with this subdomain of Deep Learning.

That said, I have some prior experience using HuggingFace Transformers in the TensorFlow/Keras ecosystem. I have also written a W&B report, [How to Fine-Tune HuggingFace Tranformer with W&B?](https://wandb.ai/ayush-thakur/huggingface/reports/How-to-Fine-Tune-HuggingFace-Tranformer-with-W-B---Vmlldzo0MzQ2MDc), that might come in handy for some readers.

This kernel is about training a HuggingFace transformer using TensorFlow/Keras. I have tried to make it as intuitive as possible for regular Keras users who are trying out NLP or HuggingFace for the first time. The kernel can be divided into a few blocks:

* **Imports and Setups** - here we will import the relevant libraries and set up Weights and Biases for tracking experiments.

* **Hyperparameters** - configuration dictionary for all hyperparameters.

* **Prepare Dataset** - here we will create a K-fold split of the training data based on this easy-to-understand [kernel](https://www.kaggle.com/abhishek/step-1-create-folds) by [Abhishek Thakur](https://www.kaggle.com/abhishek). This is followed by building an input pipeline using the `tf.data` API.

* **Build Model** - here we will build our model definition. Note that you can use any Transformer of your choice.

* **Train with W&B** - here we will do K-fold training (K = 5) and use Weights and Biases for experiment tracking.

* **Evaluate** - here we will evaluate the model for local CV score.

# 🧰 Imports and Setups

In [None]:
# TensorFlow related
import tensorflow as tf
print(tf.__version__)
from tensorflow.keras.layers import *
from tensorflow.keras.models import * 

import os
import gc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import StratifiedKFold

# HuggingFace related
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertModel

from kaggle_secrets import UserSecretsClient

In [None]:
# Get the latest version of W&B
!pip install -q wandb

# Weights and Biases related imports
import wandb
from wandb.keras import WandbCallback

# W&B login - please visit wandb.ai/authorize to get your auth key
wandb.login()

> 📌 Learn more about why and how to use Weights and Biases in this kernel: [Experiment Tracking with Weights and Biases](https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases)

# 📀 Hyperparameters

In [None]:
CONFIG = dict(
    # Dataset related 
    num_splits = 5, 
    
    # Model related
    model_name = 'DistilBERT',
    max_token_length = 256,
    
    # Training related
    batch_size = 64,
    epochs = 100,
    init_lr = 1e-4,
    earlys_patience = 10,
    reduce_lr_plateau = 5,
    
    # Misc
    seed = 42,
    wandb_kernel = True,
    competition = 'commonlit'
)

save_dir = 'trained/'
os.makedirs(save_dir, exist_ok=True)

# 🔨 Prepare Dataset

## 1. Create Folds

This is based on the [Step 1: Create Folds](https://www.kaggle.com/abhishek/step-1-create-folds) kernel by [Abhishek Thakur](https://www.kaggle.com/abhishek). Even though the dataset is small, this competition is not that easy to crack.

In [None]:
# Ref: https://www.kaggle.com/abhishek/step-1-create-folds
def create_folds(data, num_splits):
    # we create a new column called kfold and fill it with -1
    data["kfold"] = -1
    
    # the next step is to randomize the rows of the data
    data = data.sample(frac=1).reset_index(drop=True)

    # calculate number of bins by Sturges' rule
    # I take the floor of the value, you can also
    # just round it
    num_bins = int(np.floor(1 + np.log2(len(data))))
    
    # bin targets
    data.loc[:, "bins"] = pd.cut(
        data["target"], bins=num_bins, labels=False
    )
    
    # initiate the kfold class from model_selection module
    kf = StratifiedKFold(n_splits=num_splits)
    
    # fill the new kfold column
    # note that, instead of targets, we use bins!
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, 'kfold'] = f
    
    # drop the bins column
    data = data.drop("bins", axis=1)

    # return dataframe with folds
    return data

In [None]:
# read training data
df = pd.read_csv("../input/commonlitreadabilityprize/train.csv")

# create folds
df = create_folds(df, num_splits=CONFIG['num_splits'])
df.head()
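
To make the binning step in `create_folds` a bit more concrete, here is a minimal standard-library sketch of Sturges' rule followed by equal-width binning. The helper names (`sturges_bins`, `equal_width_bin`) and the target values are made up for illustration, and the edge handling is simpler than what `pd.cut` does, so treat it as an approximation of the idea rather than a drop-in replacement.

```python
import math

def sturges_bins(n_samples):
    """Number of bins suggested by Sturges' rule: floor(1 + log2(n))."""
    return int(math.floor(1 + math.log2(n_samples)))

def equal_width_bin(values, num_bins):
    """Assign each value to one of `num_bins` equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

# Hypothetical readability targets, for illustration only.
targets = [-3.2, -2.5, -1.9, -1.1, -0.7, -0.2, 0.3, 0.9, 1.4, 1.7]
num_bins = sturges_bins(len(targets))
bins = equal_width_bin(targets, num_bins)
print(num_bins, bins)
```

The bin indices, not the raw targets, are what `StratifiedKFold` receives as `y`, so each fold ends up with a similar mix of easy and hard excerpts.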

## 2. Create Dataloader

**Two words on Tokenization**: Tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table. This conversion depends on the vocabulary (the set of all unique words and tokens used), which in turn depends on the dataset, the task, and the resulting pre-trained model. **The HuggingFace tokenizer automatically downloads the vocab used during pretraining or fine-tuning of a given model, so we need not create our own vocab from the dataset for fine-tuning.**
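
As a rough illustration of the look-up idea, here is a toy tokenizer in plain Python. The vocabulary and the helper names (`toy_vocab`, `toy_tokenize`, `toy_encode`) are invented for this sketch; the real DistilBERT tokenizer uses a much larger vocabulary and subword splitting, so this only mirrors the general mechanism.

```python
# Toy vocabulary, invented for illustration; a pretrained tokenizer ships its own.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "cat": 5, "sat": 6, "on": 7, "mat": 8}

def toy_tokenize(text):
    """Lowercase and split on whitespace (real tokenizers apply subword rules)."""
    return text.lower().split()

def toy_encode(text, vocab):
    """Map tokens to ids, wrapping the sequence in [CLS] ... [SEP]."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Anything missing from the vocabulary falls back to the [UNK] id.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

ids = toy_encode("The cat sat on the sofa", toy_vocab)
print(ids)
```

Note how "sofa" is not in the toy vocabulary and therefore maps to the unknown-token id. This is why the tokenizer has to match the vocabulary the model was pretrained with.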

In [None]:
# Use the tokenizer of your choice
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
# Save the tokenizer so that you can download the files and move it to a Kaggle dataset.
tokenizer.save_pretrained(save_dir)

> 📌 The name of the tokenizer will depend on the choice of your Transformer model.

### For explanation purposes

A few of the cells below are just for explanation purposes. Since we are training using K-Folds, the input pipeline will be created inside the for loop. The cells below will give you an insight into how I am using `tf.data` and HuggingFace's Tokenizer to build the input pipeline.

1. Here one fold (fold 0 in the cell below) is held out as validation data. We will train on the remaining K-1 folds.

In [None]:
def get_train_val_split(fold_num):
    # Get training split
    train_df = df.loc[df['kfold']!=fold_num]
    train_df = train_df[['excerpt', 'target']]
    
    # Get validation split
    val_df = df.loc[df['kfold']==fold_num]
    val_df = val_df[['excerpt', 'target']]
    
    # Extract texts and labels.
    train_text, train_label = list(train_df.excerpt.values), list(train_df.target.values)
    val_text, val_label = list(val_df.excerpt.values), list(val_df.target.values)

    return train_text, train_label, val_text, val_label

train_text, train_label, val_text, val_label = get_train_val_split(0)

2. Pass the raw text to the tokenizer. The output of this process is a dictionary `*_encodings` with `input_ids` and `attention_mask` as keys. For some tokenizers you will have `token_type_ids` as another key. Please modify the training pipeline and model accordingly. It will be as easy as cooking noodles.
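
To show what the two keys mean, here is a small pure-Python sketch of truncation, padding, and the matching attention mask. `pad_and_mask` is a hypothetical helper written for this explanation and the ids are arbitrary. It simply cuts the tail when truncating, whereas a real tokenizer also takes care of the special tokens.

```python
def pad_and_mask(sequences, max_length, pad_id=0):
    """Truncate or right-pad each id sequence to `max_length` and build masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                      # truncation
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)    # padding with pad_id
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encodings = pad_and_mask([[2, 4, 5, 3], [2, 4, 5, 6, 7, 4, 3]], max_length=6)
print(encodings["input_ids"])
print(encodings["attention_mask"])
```

Every row comes out with the same length, which is what lets us stack the encodings into tensors in the next step.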


In [None]:
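# padding='max_length' pads every excerpt to exactly max_token_length ids, matching the model's fixed input shape.
train_encodings = tokenizer(train_text, truncation=True, padding='max_length', max_length=CONFIG['max_token_length'])
val_encodings = tokenizer(val_text, truncation=True, padding='max_length', max_length=CONFIG['max_token_length'])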

3. We will use `tf.data` to wrap the encodings and labels. `tf.data` can create a highly efficient input pipeline. In case your tokenizer's output also has a `token_type_ids` key, consider modifying `parse_data` accordingly.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

# Note that some tokenizers also return 'token_type_ids'. Modify this function accordingly.
@tf.function
def parse_data(from_tokenizer, target):
    input_ids = from_tokenizer['input_ids']
    attention_mask = from_tokenizer['attention_mask']
    
    target = tf.cast(target, tf.float32)
    
    return {'input_ids': input_ids,
            'attention_mask': attention_mask}, target

# Utility function to build dataloaders
def get_dataloaders(train_encodings, train_label, val_encodings, val_label):
    trainloader = tf.data.Dataset.from_tensor_slices((dict(train_encodings), list(train_label)))
    validloader = tf.data.Dataset.from_tensor_slices((dict(val_encodings), list(val_label)))

    trainloader = (
        trainloader
        .shuffle(1024)
        .map(parse_data, num_parallel_calls=AUTOTUNE)
        .batch(CONFIG['batch_size'])
        .prefetch(AUTOTUNE)
    )

    validloader = (
        validloader
        .map(parse_data, num_parallel_calls=AUTOTUNE)
        .batch(CONFIG['batch_size'])
        .prefetch(AUTOTUNE)
    )
    
    return trainloader, validloader

trainloader, validloader = get_dataloaders(train_encodings, train_label, val_encodings, val_label)

In [None]:
# Visualize a batch of data
next(iter(trainloader))
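
If the `.shuffle(1024)` and `.batch(...)` steps in `get_dataloaders` feel opaque, the following pure-Python sketch mimics the general idea of a buffered shuffle followed by batching. It is only a rough analogue based on my understanding of how `tf.data` behaves, and the helper names (`buffered_shuffle`, `make_batches`) are hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)       # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]             # emit a random buffered element
        buffer[idx] = item            # and put the new item in its slot
    rng.shuffle(buffer)               # drain whatever is left at the end
    yield from buffer

def make_batches(items, batch_size):
    """Group consecutive items into lists of at most `batch_size`."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                   # final, possibly smaller, batch

examples = list(range(10))
batches = list(make_batches(buffered_shuffle(examples, buffer_size=4, seed=42), batch_size=3))
print(batches)
```

With a buffer smaller than the dataset the order is only partially randomized, since the first emitted element can only come from the first `buffer_size` items. A buffer at least as large as the dataset gives a full shuffle.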

# 🐤 Build Model

In [None]:
# You can use a Transformer model of your choice.
transformer_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

In [None]:
def CommonLitModel():
    # Input layers
    input_ids = Input(shape=(CONFIG['max_token_length'],), dtype=tf.int32, name="input_ids")
    attention_mask = Input(shape=(CONFIG['max_token_length'],), dtype=tf.int32, name="attention_mask")
    
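    # Transformer backbone to extract features.
    # Load a fresh pretrained copy on every call; reusing one global instance
    # would carry weights fine-tuned on one fold into the next fold.
    backbone = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
    sequence_output = backbone(input_ids=input_ids, attention_mask=attention_mask)[0]
    # Keep the hidden state at position 0 (the [CLS] token) as the excerpt embedding.
    clf_output = sequence_output[:, 0, :]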
    
    # Dropout to regularize 
    clf_output = Dropout(0.1)(clf_output)
    
    # Output layer with linear activation as we are doing regression. 
    out = Dense(1, activation='linear')(clf_output)
    
    # Build model 
    model = Model(inputs=[input_ids, attention_mask], outputs=out)
    
    return model

# Sanity check model
tf.keras.backend.clear_session()
model = CommonLitModel()
model.summary()
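
The slice `sequence_output[:, 0, :]` followed by `Dense(1)` can be pictured with plain nested lists. The sketch below uses tiny made-up numbers and hypothetical helpers (`cls_pool`, `linear_head`); it only illustrates the shapes involved and leaves out dropout.

```python
def cls_pool(sequence_output):
    """[batch][seq_len][hidden] -> [batch][hidden], keeping the first token only."""
    return [tokens[0] for tokens in sequence_output]

def linear_head(features, weights, bias):
    """One linear unit per example: dot(weights, features) + bias."""
    return [sum(w * x for w, x in zip(weights, vec)) + bias for vec in features]

# Two sequences, three tokens each, hidden size 2 (values are arbitrary).
sequence_output = [
    [[1.0, 2.0], [0.5, 0.5], [0.0, 0.0]],
    [[-1.0, 0.0], [0.3, 0.7], [0.2, 0.1]],
]
pooled = cls_pool(sequence_output)
preds = linear_head(pooled, weights=[0.5, -1.0], bias=0.25)
print(pooled, preds)
```

Each excerpt is reduced to a single vector and then to a single number, which is the predicted readability score.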

# 🚄 Train with W&B

Since this is K-fold training, we will use a for loop that iterates `CONFIG['num_splits']` times. In each iteration the following steps are repeated:

1. Get the training and validation dataset.
2. Pass the dataset to the tokenizer.
3. Prepare training and validation dataloader.
4. Initialize model.
5. Train the model.
6. Evaluate on validation dataset.
7. Save model.

We will use W&B to log all the metrics. We will use a `group` argument when initializing a W&B run (`wandb.init`) that will enable us to group all the runs in the W&B dashboard. 

### [Check out the W&B dashboard $\rightarrow$](https://wandb.ai/ayush-thakur/commonlit?workspace=user-ayush-thakur)
![img](https://i.imgur.com/WYo0b4q.gif)


In [None]:
# Early stopping 
earlystopper = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=CONFIG['earlys_patience'], verbose=0, mode='min',
    restore_best_weights=True
)

# Reduce LR on Plateau
reducelrplateau = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.2, patience=CONFIG['reduce_lr_plateau']
)
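
To get a feel for how the two callbacks interact, here is a simplified pure-Python replay of plateau-based learning rate decay and early stopping over a list of validation losses. `simulate_callbacks` is a hypothetical helper, the losses are invented, and the sketch ignores options such as `min_delta`, `cooldown` and `min_lr`, so it approximates rather than reproduces the Keras callbacks.

```python
def simulate_callbacks(val_losses, init_lr, lr_patience, lr_factor, stop_patience):
    """Return (epoch, learning rate after that epoch) until early stopping fires."""
    lr = init_lr
    best_for_lr = best_for_stop = float("inf")
    lr_wait = stop_wait = 0
    history = []
    for epoch, loss in enumerate(val_losses):
        # Early stopping: count epochs since the best validation loss.
        if loss < best_for_stop:
            best_for_stop, stop_wait = loss, 0
        else:
            stop_wait += 1
        # Plateau schedule: shrink the LR after `lr_patience` stale epochs.
        if loss < best_for_lr:
            best_for_lr, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= lr_factor
                lr_wait = 0
        history.append((epoch, lr))
        if stop_wait >= stop_patience:
            break
    return history

# Invented losses: three improving epochs, then a plateau.
val_losses = [0.9, 0.7, 0.6] + [0.65] * 12
history = simulate_callbacks(val_losses, init_lr=1e-5,
                             lr_patience=2, lr_factor=0.2, stop_patience=5)
for epoch, lr in history:
    print(epoch, lr)
```

The example uses small patience values to keep the trace short. The notebook itself uses a patience of 5 for the learning rate reduction and 10 for early stopping, so the learning rate gets a chance to drop before training is stopped.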

In [None]:
for fold in range(CONFIG['num_splits']):
    # 1. Get the training and validation dataset.
    train_text, train_label, val_text, val_label = get_train_val_split(fold)
    
    # 2. Pass the dataset to the tokenizer.
    # padding='max_length' keeps every sequence at exactly max_token_length ids.
    train_encodings = tokenizer(train_text, truncation=True, padding='max_length', max_length=CONFIG['max_token_length'])
    val_encodings = tokenizer(val_text, truncation=True, padding='max_length', max_length=CONFIG['max_token_length'])
    
    # 3. Prepare training and validation dataloader.
    trainloader, validloader = get_dataloaders(train_encodings, train_label, val_encodings, val_label)
    
    # 4. Initialize model
    tf.keras.backend.clear_session()
    model = CommonLitModel()
    
    # Compile
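    # `learning_rate` replaces the deprecated `lr` argument. Note that the value is
    # hard-coded here; CONFIG['init_lr'] (1e-4) is not used by this optimizer.
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)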
    model.compile(optimizer, loss='mean_squared_error', metrics=[tf.keras.metrics.RootMeanSquaredError()])    
    
    # Initialize Weights and Biases run
    run = wandb.init(project='commonlit', 
                     config=CONFIG,
                     group='DistilBERT-K-Fold',
                     job_type='train_kfold')
    
    # 5. Train the model
    _ = model.fit(trainloader, 
              epochs=CONFIG['epochs'], 
              validation_data=validloader,
              callbacks=[WandbCallback(),
                         reducelrplateau,
                         earlystopper])
    
    # 6. Evaluate on validation dataset.
    loss, rmse = model.evaluate(validloader)
    wandb.log({'valid_rmse': rmse})
    
    # 7. Save model
    model.save(f'{save_dir}/distil-bert_{fold}')
    
    # Close W&B run
    run.finish()
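
Each fold logs its own `valid_rmse`. One simple way to turn those into a single local CV number is to average them, as in the sketch below. The targets and predictions are toy values chosen for easy arithmetic, not outputs of this notebook, and `rmse` is a small helper written for this example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy (target, prediction) pairs for two hypothetical folds.
toy_folds = [
    ([0.0, 1.0, -2.0, 0.5], [0.0, 0.0, -1.0, 0.5]),
    ([1.0, -1.0], [1.0, 1.0]),
]
fold_scores = [rmse(y_true, y_pred) for y_true, y_pred in toy_folds]
cv_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, cv_score)
```

An alternative is to pool all out-of-fold predictions and compute one RMSE over them. The two numbers are usually close but not identical.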