## About this notebook

This notebook starts from the XLM-Roberta large model that was finetuned with masked language modelling on the Jigsaw test dataset ([link](https://www.kaggle.com/riblidezso/finetune-xlm-roberta-on-jigsaw-test-data-with-mlm)).

It also implements a few improvements over a previous starter notebook that I shared:

* 1, It trains on translated data
* 2, It uses different learning rates for the head layer and the transformer
* 3, After training, it restores the model weights from the checkpoint that had the highest validation score

Suggestions/improvements are appreciated!

---

### References:

- The shared XLM-Roberta large model, finetuned on the Jigsaw multilingual test data with masked language modelling Notebook [link]() / Dataset [link](https://www.kaggle.com/riblidezso/jigsaw-mlm-finetuned-xlm-r-large)
- My previous starter notebook [link](https://www.kaggle.com/riblidezso/tpu-custom-tensoflow2-training-loop)
- This notebook uses the translated versions of the training dataset too, big thanks to Michael Kazachok! [link](https://www.kaggle.com/miklgr500/jigsaw-train-multilingual-coments-google-api)
- This notebook uses different learning rates for the transformer and the head. I got the idea from the writeup of the winning team of the Google QUEST Q&A Labeling competition [link](https://www.kaggle.com/c/google-quest-challenge/discussion/129840), and I have seen it described as useful elsewhere too.
- This notebook relies heavily on the great [notebook](https://www.kaggle.com/xhlulu//jigsaw-tpu-xlm-roberta) by Xhulu: [@xhulu](https://www.kaggle.com/xhulu/)
- The TensorFlow distributed training tutorial: [Link](https://www.tensorflow.org/tutorials/distribute/custom_training)

In [1]:
MAX_LEN = 192 
DROPOUT = 0.5 # use aggressive dropout
BATCH_SIZE = 16 # per TPU core
TOTAL_STEPS_STAGE1 = 2000
VALIDATE_EVERY_STAGE1 = 200
TOTAL_STEPS_STAGE2 = 200
VALIDATE_EVERY_STAGE2 = 10

### Different learning rate for transformer and head ###
LR_TRANSFORMER = 5e-6
LR_HEAD = 1e-3

PRETRAINED_TOKENIZER = 'jplu/tf-xlm-roberta-large'
PRETRAINED_MODEL = '/kaggle/input/jigsaw-mlm-finetuned-xlm-r-large'
D = '/kaggle/input/jigsaw-multilingual-toxic-comment-classification/'
D_TRANS = '/kaggle/input/jigsaw-train-multilingual-coments-google-api/'


import os
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import tensorflow as tf
print(tf.__version__)
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import transformers
from transformers import TFRobertaModel, AutoTokenizer
import logging
# no extensive logging: NOTSET on the root logger would let every message through,
# so raise the threshold to errors only
logging.getLogger().setLevel(logging.ERROR)

AUTO = tf.data.experimental.AUTOTUNE

2.1.0


## Connect to TPU

In [2]:
def connect_to_TPU():
    """Detect hardware, return appropriate distribution strategy"""
    try:
        # TPU detection. No parameters necessary if TPU_NAME environment variable is
        # set: this is always the case on Kaggle.
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        print('Running on TPU ', tpu.master())
    except ValueError:
        tpu = None

    if tpu:
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
    else:
        # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
        strategy = tf.distribute.get_strategy()

    global_batch_size = BATCH_SIZE * strategy.num_replicas_in_sync

    return tpu, strategy, global_batch_size


tpu, strategy, global_batch_size = connect_to_TPU()
print("REPLICAS: ", strategy.num_replicas_in_sync)

Running on TPU  grpc://10.0.0.2:8470
REPLICAS:  8
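
As a quick sanity check on the numbers above, the global batch size and the number of examples processed in the first training stage follow directly from the configuration. This is only a small arithmetic sketch; `NUM_REPLICAS` is a name introduced here for illustration and simply copies the replica count printed by this run.

```python
BATCH_SIZE = 16            # per TPU core, as set at the top of the notebook
NUM_REPLICAS = 8           # illustrative constant: the replica count printed above
TOTAL_STEPS_STAGE1 = 2000  # as set at the top of the notebook

# Each training step consumes one global batch, split evenly across replicas.
global_batch_size = BATCH_SIZE * NUM_REPLICAS
examples_seen_stage1 = global_batch_size * TOTAL_STEPS_STAGE1

print(global_batch_size)     # 128
print(examples_seen_stage1)  # 256000
```

On a machine without a TPU the default strategy has a single replica, so the global batch size would fall back to 16.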


## Load text data into memory

- Training data is English + all translations. The reason to use English too is that people use English in their foreign-language comments all the time.
- Not using the full dataset: negatives are downsampled so that the classes are balanced 50%-50%

In [3]:
def load_jigsaw_trans(langs=['tr','it','es','ru','fr','pt'], 
                      columns=['comment_text', 'toxic']):
    train_6langs=[]
    for i in range(len(langs)):
        fn = D_TRANS+'jigsaw-toxic-comment-train-google-%s-cleaned.csv'%langs[i]
        train_6langs.append(downsample(pd.read_csv(fn)[columns]))

    return train_6langs

def downsample(df):
    """Subsample the train dataframe to 50%-50%"""
    ds_df= pd.concat([
        df.query('toxic==1'),
        df.query('toxic==0').sample(sum(df.toxic))
    ])
    
    return ds_df
    

train_df = pd.concat(load_jigsaw_trans()) 
val_df = pd.read_csv(D+'validation.csv')
test_df = pd.read_csv(D+'test.csv')
sub_df = pd.read_csv(D+'sample_submission.csv')

## Tokenize it with the model's own tokenizer

- Note: it takes some time (approx. 5 minutes)
- Note: we need to reshape the targets
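
The reshape is needed because the model's final `Dense(1)` layer outputs one column per example, so the labels should have shape `(N, 1)` rather than `(N,)`. A minimal NumPy sketch with made-up labels (the `toxic` array here is hypothetical, not the real data):

```python
import numpy as np

toxic = np.array([0, 1, 1, 0])  # hypothetical labels, shape (4,)
y = toxic.reshape(-1, 1)        # one column, as many rows as needed

print(toxic.shape)  # (4,)
print(y.shape)      # (4, 1)
```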

In [4]:
%%time

def regular_encode(texts, tokenizer, maxlen=512):
    enc_di = tokenizer.batch_encode_plus(
        texts, 
        return_attention_masks=False, 
        return_token_type_ids=False,
        pad_to_max_length=True,
        max_length=maxlen
    )
    
    return np.array(enc_di['input_ids'])
    

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_TOKENIZER)
X_train = regular_encode(train_df.comment_text.values, tokenizer, maxlen=MAX_LEN)
X_val = regular_encode(val_df.comment_text.values, tokenizer, maxlen=MAX_LEN)
X_test = regular_encode(test_df.content.values, tokenizer, maxlen=MAX_LEN)

y_train = train_df.toxic.values.reshape(-1,1)
y_val = val_df.toxic.values.reshape(-1,1)

CPU times: user 5min 24s, sys: 1.78 s, total: 5min 26s
Wall time: 5min 27s


## Create distributed tensorflow datasets

- Note: the validation dataset does not contain labels; we keep track of them ourselves

In [5]:
def create_dist_dataset(X, y=None, training=False):
    dataset = tf.data.Dataset.from_tensor_slices(X)

    ### Add y if present ###
    if y is not None:
        dataset_y = tf.data.Dataset.from_tensor_slices(y)
        dataset = tf.data.Dataset.zip((dataset, dataset_y))
        
    ### Repeat if training ###
    if training:
        dataset = dataset.shuffle(len(X)).repeat()

    dataset = dataset.batch(global_batch_size).prefetch(AUTO)

    ### make it distributed  ###
    dist_dataset = strategy.experimental_distribute_dataset(dataset)

    return dist_dataset
    
    
train_dist_dataset = create_dist_dataset(X_train, y_train, True)
val_dist_dataset   = create_dist_dataset(X_val)
test_dist_dataset  = create_dist_dataset(X_test)

## Build model from pretrained transformer


Let's use a different learning rate for the head and the transformer, like the winning team of the Google QUEST Q&A Labeling competition [link](https://www.kaggle.com/c/google-quest-challenge/discussion/129840).

The reasoning is the following: the transformer has been trained for a very long time and already has a very good multilingual representation, which we only want to change a little, while the head needs to be trained from scratch.

We define 2 separate optimizers for the transformer and the head layer. This is a simple way to use different learning rates for the two parts. The Caffe-style "lr_multiplier" option would be more elegant, but that is not available in Keras.

We add the name 'custom' to the head layer, so that we can find its weights later and use a different learning rate for this layer.

- Note: Downloading the model takes some time!
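
To make the two-learning-rate idea concrete, here is a framework-free sketch. It is not the notebook's implementation: it uses plain SGD in NumPy instead of two Adam optimizers, and the parameter names and gradient values are hypothetical. What it shares with the notebook is the selection rule: variables whose name contains `'custom'` belong to the head and get the larger learning rate.

```python
import numpy as np

LR_TRANSFORMER = 5e-6
LR_HEAD = 1e-3

# Hypothetical parameters; the names only mimic the 'custom' naming convention.
params = {
    "tf_roberta_model/encoder/kernel": np.ones(3),
    "custom_head/kernel": np.ones(3),
}
# Hypothetical gradients, identical for both groups to make the effect visible.
grads = {name: np.full(3, 2.0) for name in params}


def sgd_step(params, grads):
    """One plain SGD update with a learning rate chosen per variable group."""
    updated = {}
    for name, value in params.items():
        lr = LR_HEAD if "custom" in name else LR_TRANSFORMER
        updated[name] = value - lr * grads[name]
    return updated


new_params = sgd_step(params, grads)
for name, value in new_params.items():
    print(name, value)
```

With the same gradient, the head weights move 200 times further than the transformer weights (`1e-3 / 5e-6`), which is the intended effect: large updates for the freshly initialized head, small ones for the pretrained encoder.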

In [6]:
%%time

def create_model_and_optimizer():
    with strategy.scope():
        transformer_layer = TFRobertaModel.from_pretrained(PRETRAINED_MODEL)                
        model = build_model(transformer_layer)
        optimizer_transformer = Adam(learning_rate=LR_TRANSFORMER)
        optimizer_head = Adam(learning_rate=LR_HEAD)
    return model, optimizer_transformer, optimizer_head


def build_model(transformer):
    inp = Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_word_ids")
    # Huggingface transformers have multiple outputs, embeddings are the first one
    # let's slice out the first position, the paper says it's not worse than pooling
    x = transformer(inp)[0][:, 0, :]  
    x = Dropout(DROPOUT)(x)
    ### note, adding the name to later identify these weights for different LR
    out = Dense(1, activation='sigmoid', name='custom_head')(x)
    model = Model(inputs=[inp], outputs=[out])
    
    return model


model, optimizer_transformer, optimizer_head = create_model_and_optimizer()
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_word_ids (InputLayer)  [(None, 192)]             0         
_________________________________________________________________
tf_roberta_model (TFRobertaM ((None, 192, 1024), (None 559890432 
_________________________________________________________________
tf_op_layer_strided_slice (T [(None, 1024)]            0         
_________________________________________________________________
dropout_74 (Dropout)         (None, 1024)              0         
_________________________________________________________________
custom_head (Dense)          (None, 1)                 1025      
Total params: 559,891,457
Trainable params: 559,891,457
Non-trainable params: 0
_________________________________________________________________
CPU times: user 31.6 s, sys: 26.6 s, total: 58.2 s
Wall time: 59.4 s


### Define stuff for the custom training loop

We will need:
- 1, Losses, and optionally a training AUC metric: these need to be defined in the scope of the distribution strategy.
- 2, A full training loop
- 3, A distributed train step called in the training loop, which uses a single-replica train step
- 4, A prediction loop with a distributed prediction step


At the end of training we restore the parameters which had the best validation score.


For the different learning rates we need to apply gradients in two steps; check the `train_step` function for details.

- Note: we are using exact AUC for the validation data, and approximate AUC for the training data
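
The difference between the two kinds of AUC can be sketched in NumPy. The exact value is the fraction of (positive, negative) pairs that are ranked correctly, which is what `roc_auc_score` computes. The Keras `AUC` metric instead evaluates the ROC curve at a fixed set of thresholds (200 by default) and integrates it. The `thresholded_auc` function below is a simplified stand-in for that idea, not Keras's actual code, and the toy labels and scores are made up.

```python
import numpy as np


def exact_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def thresholded_auc(y_true, y_score, num_thresholds=200):
    """Trapezoidal area under a ROC curve sampled at evenly spaced thresholds."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    tpr = np.array([((y_score >= t) & (y_true == 1)).sum() / n_pos for t in thresholds])
    fpr = np.array([((y_score >= t) & (y_true == 0)).sum() / n_neg for t in thresholds])
    # Thresholds increase, so fpr decreases: integrate tpr over the fpr steps.
    return float(np.sum((fpr[:-1] - fpr[1:]) * (tpr[:-1] + tpr[1:]) / 2))


y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc_exact = exact_auc(y_true, y_score)
auc_approx = thresholded_auc(y_true, y_score)
print(auc_exact)  # 0.75
print(auc_approx)
```

On this toy example the scores are far apart, so both values agree. The approximation can drift from the exact value when many predictions fall between two neighbouring thresholds, which is why the exact version is used for model selection on the validation set.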

In [7]:
def define_losses_and_metrics():
    with strategy.scope():
        loss_object = tf.keras.losses.BinaryCrossentropy(
            reduction=tf.keras.losses.Reduction.NONE, from_logits=False)

        def compute_loss(labels, predictions):
            per_example_loss = loss_object(labels, predictions)
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size = global_batch_size)
            return loss

        train_accuracy_metric = tf.keras.metrics.AUC(name='training_AUC')

    return compute_loss, train_accuracy_metric


def train(train_dist_dataset, val_dist_dataset=None, y_val=None,
          total_steps=2000, validate_every=200):
    best_weights, history = None, []
    step = 0
    ### Training loop ###
    for tensor in train_dist_dataset:
        distributed_train_step(tensor) 
        step+=1

        if (step % validate_every == 0):   
            ### Print train metrics ###  
            train_metric = train_accuracy_metric.result().numpy()
            print("Step %d, train AUC: %.5f" % (step, train_metric))   
            
            ### Test loop with exact AUC ###
            if val_dist_dataset:
                val_metric = roc_auc_score(y_val, predict(val_dist_dataset))
                print("Step %d,   val AUC: %.5f" %  (step,val_metric))   
                
                # save weights if it is the best yet
                history.append(val_metric)
                if history[-1] == max(history):
                    best_weights = model.get_weights()

            ### Reset (train) metrics ###
            train_accuracy_metric.reset_states()
            
        if step  == total_steps:
            break
    
    ### Restore best weights (skipped if no validation was run) ###
    if best_weights is not None:
        model.set_weights(best_weights)



@tf.function
def distributed_train_step(data):
    strategy.experimental_run_v2(train_step, args=(data,))

def train_step(inputs):
    features, labels = inputs
    
    ### get transformer and head separate vars
    # get rid of pooler head with None gradients
    transformer_trainable_variables = [ v for v in model.trainable_variables 
                                       if (('pooler' not in v.name)  and 
                                           ('custom' not in v.name))]
    head_trainable_variables = [ v for v in model.trainable_variables 
                                if 'custom'  in v.name]

    # calculate the 2 gradients ( note persistent, and del)
    with tf.GradientTape(persistent=True) as tape:
        predictions = model(features, training=True)
        loss = compute_loss(labels, predictions)
    gradients_transformer = tape.gradient(loss, transformer_trainable_variables)
    gradients_head = tape.gradient(loss, head_trainable_variables)
    del tape
        
    ### make the 2 gradients steps
    optimizer_transformer.apply_gradients(zip(gradients_transformer, 
                                              transformer_trainable_variables))
    optimizer_head.apply_gradients(zip(gradients_head, 
                                       head_trainable_variables))

    train_accuracy_metric.update_state(labels, predictions)



def predict(dataset):  
    predictions = []
    for tensor in dataset:
        predictions.append(distributed_prediction_step(tensor))
    ### stack replicas and batches
    predictions = np.vstack(list(map(np.vstack,predictions)))
    return predictions

@tf.function
def distributed_prediction_step(data):
    predictions = strategy.experimental_run_v2(prediction_step, args=(data,))
    return strategy.experimental_local_results(predictions)

def prediction_step(inputs):
    features = inputs  # note datasets used in prediction do not have labels
    predictions = model(features, training=False)
    return predictions


compute_loss, train_accuracy_metric = define_losses_and_metrics()
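
The nested `np.vstack` in `predict` can be illustrated without a TPU. Each prediction step returns one array per replica, so the collected results form a list of tuples. The inner `vstack` joins the replicas of one step and the outer one joins the steps. The arrays below are made-up stand-ins for two steps on two replicas, assuming replicas return their shards in order.

```python
import numpy as np

# Hypothetical results: one tuple per step, one (n, 1) array per replica.
step_results = [
    (np.array([[0.1], [0.2]]), np.array([[0.3], [0.4]])),
    (np.array([[0.5], [0.6]]), np.array([[0.7]])),  # the last batch may be smaller
]

# Same expression as in predict(): stack replicas first, then steps.
stacked = np.vstack(list(map(np.vstack, step_results)))

print(stacked.shape)  # (7, 1)
print(stacked[:, 0])
```

The result keeps one row per example in dataset order, which is what allows the predictions to be compared with `y_val` or written to the submission file with `[:, 0]`.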

## Finally train it on the translated comments


- Note it takes some time
- Don't mind the warning: "Converting sparse IndexedSlices to a dense Tensor"

In [8]:
%%time
train(train_dist_dataset, val_dist_dataset, y_val,
      TOTAL_STEPS_STAGE1, VALIDATE_EVERY_STAGE1)



Step 200, train AUC: 0.87843
Step 200,   val AUC: 0.94492
Step 400, train AUC: 0.96208
Step 400,   val AUC: 0.94369
Step 600, train AUC: 0.96788
Step 600,   val AUC: 0.94477
Step 800, train AUC: 0.97115
Step 800,   val AUC: 0.94488
Step 1000, train AUC: 0.97312
Step 1000,   val AUC: 0.94648
Step 1200, train AUC: 0.97232
Step 1200,   val AUC: 0.94415
Step 1400, train AUC: 0.97294
Step 1400,   val AUC: 0.94720
Step 1600, train AUC: 0.97507
Step 1600,   val AUC: 0.94563
Step 1800, train AUC: 0.97659
Step 1800,   val AUC: 0.94600
Step 2000, train AUC: 0.97696
Step 2000,   val AUC: 0.94604
CPU times: user 3min 29s, sys: 52.2 s, total: 4min 21s
Wall time: 20min
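
Replaying the checkpoint selection rule from `train` on the validation scores printed above shows which weights are restored at the end of this stage. The dictionary below just copies the printed values; `best_step` is a name introduced here for illustration, since the notebook stores the weights themselves rather than the step number.

```python
# Validation AUC per step, copied from the output above.
val_auc_by_step = {
    200: 0.94492, 400: 0.94369, 600: 0.94477, 800: 0.94488, 1000: 0.94648,
    1200: 0.94415, 1400: 0.94720, 1600: 0.94563, 1800: 0.94600, 2000: 0.94604,
}

best_step, history = None, []
for step, val_metric in val_auc_by_step.items():
    history.append(val_metric)
    # Same rule as in train(): keep the checkpoint if it is the best so far.
    if history[-1] == max(history):
        best_step = step

print(best_step)     # 1400
print(max(history))  # 0.9472
```

So the second stage starts from the step 1400 weights, not from the weights at the final step 2000.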


## Finetune it on the validation data

In [9]:
%%time

# decrease LR for second stage in the head
optimizer_head.learning_rate.assign(1e-4)

# split validation data into train test
X_train, X_val, y_train, y_val = train_test_split(X_val, y_val, test_size = 0.1)

# make the datasets (the validation dataset carries no labels, as predict() expects)
train_dist_dataset = create_dist_dataset(X_train, y_train, training=True)
val_dist_dataset = create_dist_dataset(X_val)

# train again
train(train_dist_dataset, val_dist_dataset, y_val,
      total_steps = TOTAL_STEPS_STAGE2, 
      validate_every = VALIDATE_EVERY_STAGE2)  # validate more often in this short stage

Step 10, train AUC: 0.92982
Step 10,   val AUC: 0.94024
Step 20, train AUC: 0.92825
Step 20,   val AUC: 0.94277
Step 30, train AUC: 0.93556
Step 30,   val AUC: 0.94405
Step 40, train AUC: 0.94400
Step 40,   val AUC: 0.94574
Step 50, train AUC: 0.94940
Step 50,   val AUC: 0.94669
Step 60, train AUC: 0.94342
Step 60,   val AUC: 0.94765
Step 70, train AUC: 0.95696
Step 70,   val AUC: 0.94746
Step 80, train AUC: 0.97071
Step 80,   val AUC: 0.94820
Step 90, train AUC: 0.94957
Step 90,   val AUC: 0.94722
Step 100, train AUC: 0.94638
Step 100,   val AUC: 0.94718
Step 110, train AUC: 0.96697
Step 110,   val AUC: 0.94787
Step 120, train AUC: 0.96910
Step 120,   val AUC: 0.94755
Step 130, train AUC: 0.96364
Step 130,   val AUC: 0.94863
Step 140, train AUC: 0.97313
Step 140,   val AUC: 0.95042
Step 150, train AUC: 0.96709
Step 150,   val AUC: 0.95093
Step 160, train AUC: 0.96929
Step 160,   val AUC: 0.95133
Step 170, train AUC: 0.96161
Step 170,   val AUC: 0.95132
Step 180, train AUC: 0.97587
Ste

## Make predictions and submission

In [10]:
%%time
sub_df['toxic'] = predict(test_dist_dataset)[:,0]
sub_df.to_csv('submission.csv', index=False)

CPU times: user 24.5 s, sys: 5.41 s, total: 29.9 s
Wall time: 1min 45s
