## Bidrectional Encoder Representation of Transformers (BERT)
* Architecture
  * Transformers (Encoder)
    * Attention mechanism
      * Multi headed self attention to associate different parts of sequences together
      * Calculate a similarity score between the current token and the tokens in the sequence and use that as weights to get a new understanding of the sequence
* Training
   * Masked Language Modelling (MLM)
   * Next Sentence Prediction (NSP)
 
## Robustly Optimized BERT Pre-training Approach (RoBERTa)
* Architecture
  * Transformers (Encoder)
    * Same as BERT
* Training
  * Longer training time 
  * Larger training data, batch size, vocab size
  * Removal of NSP
  * Dynamic masking for MLM by duplicating data and using different masking methods
 
## Decoding-Enhanced BERT with Disentangled Attention (DeBERTa)
* Architecture
  * Transformers (Encoder)
    * Disentagled Attention
      * Attention broken down into dot product of content vectors of token *i* and *j*, content vector *i* to relative position vector *j* and relative position vector *j* to content vector *i*
      * Compared to BERT/RoBERTa, dot product of sum(content, absolute position) of *i* and sum(content, absolute position) of *j*
      * This way, the relationship/similarity of word at *i* and *j* can be fully explored by explicitly comparing the content of the words and the content and position of the respective words. 
      * Summing everything into a single vector before comparing may hide certain important patterns in the content/position of the words
    * Enhanced Mask Decoding
      * Incoporate absolute positions into the model by adding it in after all the Transformer layers
      * Add a bit more information about the exact position of word (E.g. front or back of sentence)
* Training
  * Scale invariant Fine Tuning (SiFT)
     * Normalise embedding then add pertubations to the vector before fine tuning
     * Normalising the vector will help to reduce variation in vector due to different words/models
     * Help model to generalise better by teaching the model to recognise similar inputs, thus becoming less sensitive to small changes, less overfitting as a result

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow_addons as tfa  
from transformers import AutoTokenizer, TFAutoModel, AutoConfig
import tensorflow as tf
# from tensorflow.keras import mixed_precision

import matplotlib.pyplot as plt
from random import sample
import math
import re
import gc
import os
from sklearn.model_selection import KFold
import warnings
# warnings.filterwarnings("ignore")
# tf.get_logger().setLevel('ERROR')

## Parameters ##

In [2]:
TRAIN_DATASET = '../input/feedback-prize-english-language-learning/train.csv'
TEST_DATASET = '../input/feedback-prize-english-language-learning/test.csv'
Y_VARIABLES = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions'] 
DEBERTA_MODEL = '../input/deberta/microsoft-deberta-v3-base/microsoft-deberta-v3-base/' #'../input/roberta-base/' 

In [3]:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

1 Physical GPUs, 1 Logical GPUs


2022-11-20 06:34:03.679688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-20 06:34:03.770666: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-20 06:34:03.771442: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-20 06:34:03.782787: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compil

In [4]:
# Equivalent to the two lines above
# mixed_precision.set_global_policy('mixed_float16')

In [5]:
class HParams(object):
    def __init__(self):
        self.num_fc_units = 32
        self.batch_size = 4
        self.epochs = 20
        self.seq_len = 512
        self.kfolds = 5
        self.regr_units = 6
        self.l2_alpha = 0.0001

HPARAMS = HParams()

In [6]:
seed=42
np.random.seed(seed)
tf.random.set_seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)

## Utility Functions ##

In [7]:
def train_valid_split(data, percent=0.8):
    ranges = range(data.shape[0])
    prop = math.ceil(percent*data.shape[0])
    train_idx = sample(ranges, prop)
    valid_idx = [i for i in ranges if i not in train_idx]
    return train_idx, valid_idx

def MCRMSE(y_true, y_pred):
    mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=1)
    mcrmse = tf.reduce_mean(tf.sqrt(mse), axis=-1, keepdims=True)
    return mcrmse

def clean_text(text):
    new_text = re.sub(r"[\n\r\t]", ' ', text)
    return new_text

## Data ##

In [8]:
df = pd.read_csv(TRAIN_DATASET)
test_df = pd.read_csv(TEST_DATASET)

X_train = df['full_text'].apply(lambda x: clean_text(x))
Y_train = df[Y_VARIABLES]

X_test = test_df['full_text'].apply(lambda x: clean_text(x))

## Models ##

In [9]:
deberta_tokenizer = AutoTokenizer.from_pretrained(DEBERTA_MODEL, use_fast=False)
deberta_config = AutoConfig.from_pretrained(DEBERTA_MODEL)
deberta_config.attention_probs_dropout_prob = 0.0
deberta_config.hidden_dropout_prob = 0.0
deberta_config.output_hidden_states = True

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
def batch_tokenize(data):
    input_ids = []
    attention_masks = []
    for text in data:
        data_token = deberta_tokenizer(text, add_special_tokens=True,  max_length=HPARAMS.seq_len, return_attention_mask=True, 
                                   return_tensors="np",truncation=True,  padding='max_length')
        input_ids.append(data_token['input_ids'][0])
        attention_masks.append(data_token['attention_mask'][0])
    result = {
        'input_ids':np.array(input_ids, dtype='int32'),
        'attention_masks':np.array(attention_masks, dtype='int32')
    }
    return result 

In [11]:
class WeightedAverage(tf.keras.layers.Layer):
    
    def __init__(self):
        super(WeightedAverage, self).__init__()
        
    def build(self, input_shape):
        self.W = self.add_weight(name='w',
                    shape=(1, input_shape[-1]),
                    initializer='uniform',
                    dtype=tf.float32,
                    trainable=True)
        
    def call(self, inputs):

        # inputs is a list of tensor of shape [(n_batch, n_feat), ..., (n_batch, n_feat)]
        # expand last dim of each input passed [(n_batch, n_feat, 1), ..., (n_batch, n_feat, 1)]
        weights = tf.nn.softmax(self.W, axis=-1) # (1,1,n_inputs)
        # weights sum up to one on last dim

        return tf.reduce_sum(weights*inputs, axis=-1) # (n_batch, n_feat) 

In [12]:
# embedding_model = TFAutoModel.from_pretrained(DEBERTA_MODEL, config=deberta_config)

In [13]:
# for encoder_block in embedding_model.deberta.encoder.layer[-:]:
#     for layer in encoder_block.submodules:
#         print(layer)

In [14]:
def create_model():
    input_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="input_ids")
    attention_masks = tf.keras.Input(shape=(None,), dtype=tf.int32, name="attention_masks")
    
    embedding_model = TFAutoModel.from_pretrained(DEBERTA_MODEL, config=deberta_config)
#     for encoder_block in embedding_model.deberta.encoder.layer[-5:]:
#         for layer in encoder_block.submodules:
#             layer.trainable = True
    embedding_model.trainable = True

    embedding_output = embedding_model(input_ids=input_ids, attention_mask=attention_masks)
    x = tf.stack(embedding_output.hidden_states[-4:], axis = -1)
    x = tf.reduce_mean(x, axis = 1)
    x = WeightedAverage()(x)
    x = tf.keras.layers.LayerNormalization(axis=1)(x)

    #Output layer without activation function because regression task
    x = tf.keras.layers.Dense(HPARAMS.num_fc_units)(x)
    x = tf.keras.layers.Dense(HPARAMS.regr_units, 
                              activation="sigmoid", 
                              kernel_initializer=tf.keras.initializers.GlorotUniform())(x)
    output = tf.keras.layers.Rescaling(scale=4.0, offset=1.0)(x)
    model = tf.keras.models.Model(inputs=[input_ids, attention_masks], outputs=output)

    return model

## Training ##

Model parameters

In [15]:
#Compile model with an approximation of layer-wise learning rate decay
earlyStopping =  tf.keras.callbacks.EarlyStopping(monitor='val_MCRMSE', 
                                                  min_delta=1e-4, 
                                                  patience=3, 
                                                  verbose=1,
                                                  restore_best_weights=True)

lr_schedule_1 = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-5, 
    decay_steps=3910, 
    decay_rate=0.6)
lr_schedule_2 = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, 
    decay_steps=3910, 
    decay_rate=0.6)
optimizers = [tf.keras.optimizers.Adam(learning_rate=lr_schedule_1),
              tf.keras.optimizers.Adam(learning_rate=lr_schedule_2)]

In [16]:
# x_train_token = batch_tokenize(X_train.loc[:500].to_list())
# y_train = Y_train.loc[:500,:]
# print(len(x_train_token), len(y_train))
# tf.data.Dataset.from_tensor_slices((x_train_token, y_train))

In [17]:
# model.summary()

Train Model

In [18]:
def pack_tf_dataset(inputs, outputs):
    input1 = tf.data.Dataset.from_tensor_slices((inputs['input_ids']))
    input2 = tf.data.Dataset.from_tensor_slices((inputs['attention_masks']))
    label = tf.data.Dataset.from_tensor_slices((outputs))
    combined_dataset = tf.data.Dataset.zip(({'input_ids':input1, 'attention_masks':input2},label))
    return combined_dataset

In [19]:
BUFFER_SIZE = 10000
kf = KFold(n_splits=HPARAMS.kfolds)
idx = 0
for train_idx, valid_idx in kf.split(X_train):
    print(f'========================= FOLD {idx} =========================')
    x_train, x_valid = X_train.loc[train_idx], X_train.loc[valid_idx]
    y_train, y_valid = Y_train.loc[train_idx,:], Y_train.loc[valid_idx, :]
    
    x_train_token = batch_tokenize(x_train.to_list())
    x_valid_token = batch_tokenize(x_valid.to_list())
    
    # Wrap data in Dataset objects.
    train_data = pack_tf_dataset(x_train_token, y_train)
    train_data = train_data.cache().shuffle(BUFFER_SIZE).prefetch(tf.data.AUTOTUNE)
    train_data = train_data.batch(HPARAMS.batch_size)
    valid_data = pack_tf_dataset(x_valid_token, y_valid)
    valid_data = valid_data.batch(HPARAMS.batch_size)

    tf.keras.backend.clear_session()
    
    model = create_model()
    
    # Optimizer
    optimizers_and_layers = [(optimizers[0], model.layers[:4]),
                          (optimizers[1], model.layers[4:]),]
    optimizer = tfa.optimizers.MultiOptimizer(optimizers_and_layers)
    model.compile(optimizer=optimizer,
            loss='huber_loss', # combination of l1 and l2
            metrics=[MCRMSE],
            )

    # Callback
    modelCheckpoint =  tf.keras.callbacks.ModelCheckpoint(filepath=f'best_weights_{idx}',
                                                      save_weights_only=True,
                                                      monitor='val_MCRMSE',
                                                      mode='auto',
                                                      save_best_only=True)



    history = model.fit(train_data, # [x_train_token[0], x_train_token[1]],
                      validation_data=valid_data, #([x_valid_token[0], x_valid_token[1]], y_valid), 
                      epochs=HPARAMS.epochs,
                      shuffle=True,
                      callbacks=[earlyStopping, modelCheckpoint])
    
    idx += 1
    del x_train, x_valid, y_train, y_valid
    del model, history
    tf.keras.backend.clear_session()
    gc.collect()



All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at ../input/deberta/microsoft-deberta-v3-base/microsoft-deberta-v3-base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2Model for predictions without further training.


Epoch 1/20


2022-11-20 06:35:15.628236: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2022-11-20 06:35:28.462466: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8005


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Restoring model weights from the end of the best epoch.
Epoch 00006: early stopping


All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at ../input/deberta/microsoft-deberta-v3-base/microsoft-deberta-v3-base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2Model for predictions without further training.


Epoch 1/20


2022-11-20 07:14:08.456129: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 7.04G (7557873664 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Restoring model weights from the end of the best epoch.
Epoch 00005: early stopping


All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at ../input/deberta/microsoft-deberta-v3-base/microsoft-deberta-v3-base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2Model for predictions without further training.


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Restoring model weights from the end of the best epoch.
Epoch 00007: early stopping


All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at ../input/deberta/microsoft-deberta-v3-base/microsoft-deberta-v3-base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2Model for predictions without further training.


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Restoring model weights from the end of the best epoch.
Epoch 00008: early stopping


All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at ../input/deberta/microsoft-deberta-v3-base/microsoft-deberta-v3-base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2Model for predictions without further training.


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping


## Model Evaluation ##

In [20]:
# fig, ax = plt.subplots(3, figsize=(10,7))
# for idx in range(3):
#     val_mcrmse = history_arr[idx].history['val_MCRMSE']
#     train_mcrmse = history_arr[idx].history['MCRMSE']
#     ax[idx].plot(range(len(val_mcrmse)), val_mcrmse, label='Val MCRMSE')
#     ax[idx].plot(range(len(train_mcrmse)), train_mcrmse, label='Train MCRMSE')
# ax.legend()

## Test ##

In [21]:
x_test_token = batch_tokenize(X_test.to_list())
y_test_pred_arr = []
for idx in range(HPARAMS.kfolds):
    model = create_model()
    model.load_weights(f'best_weights_{idx}')
    result = model.predict([x_test_token['input_ids'], x_test_token['attention_masks']])
    y_test_pred_arr.append(result)
    tf.keras.backend.clear_session()
    gc.collect()
y_test_pred = np.mean(y_test_pred_arr, axis = 0)
# y_test_pred = y_test_pred/3

All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at ../input/deberta/microsoft-deberta-v3-base/microsoft-deberta-v3-base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2Model for predictions without further training.
All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at ../input/deberta/microsoft-deberta-v3-base/microsoft-deberta-v3-base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2Model for predictions without further training.
All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at ../input/deberta/microsoft-deberta-v3-base/microsoft-deberta-v3-base/.
I

In [22]:
sub_df = test_df.copy(deep=True)
sub_df[Y_VARIABLES] = y_test_pred
del sub_df['full_text']
sub_df.to_csv('submission.csv',index=False)
sub_df

Unnamed: 0,text_id,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0000C359D63E,2.836082,2.779548,3.114091,3.018112,2.696621,2.701116
1,000BAD50D026,2.626396,2.477513,2.687915,2.357104,2.151485,2.69022
2,00367BB2546B,3.703728,3.446792,3.635209,3.62221,3.362167,3.409767
