<br><br>

<center><font size="7">🎯 Getting started with TPUs</font></center>

<br>

<center><font size="5">💻 Stack: Google TPUs and TensorFlow Keras</font></center>
 
<br>

<center><font size="5">📐 Model: Bidirectional LSTM</font></center>
   
<br>

<center>
    <font size="3">
    It took me a while to understand how to use TPUs on Kaggle and how to write efficient code that uses all TPU cores, and I ran into some difficulties along the way. Below, I will try to summarize what happens under the hood when compiling a TPU model and how to train a model with TPUs efficiently. I will do my best to provide good advice and practical implementation tips that will (hopefully) save you some time and give you a more in-depth understanding of the subject.
    </font>
</center>

<br>

<center>
    <font size="3">
    The final model will be a simple LSTM model that performs quite poorly on the leaderboard, at about 0.63 AUC-ROC. Nonetheless, the model trains very fast on TPUs; an epoch takes about 30 seconds. The goal here is to focus on the TPU part rather than on modelling. I will let you have fun implementing more sophisticated models. Also, by choice, this notebook does not use any pre-trained transformer models such as BERT, which keeps the focus on the TPU part.
    </font>
</center>    

<br>

<center>
 <font size="3">
    I'm quite new to Kaggle and this is my first Kaggle tutorial. I hope it can clarify things for many of you. Also, I would <font color="#42c2f5">LOVE</font> to hear your comments, and any feedback is <font color="#42c2f5">VERY APPRECIATED</font>. This will motivate me to create more content and put all my effort into it.
    </font>
</center>

<br>
<center>
 <font size="3">
    🎉 Happy coding and enjoy the competition!
 </font>
</center>

In [None]:
"""
IMPORT
"""

import math, re, os
import tensorflow as tf
from tqdm import tqdm
import numpy as np
from matplotlib import pyplot as plt
from kaggle_datasets import KaggleDatasets
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
import io
import json


import pandas as pd
from tensorflow.keras.models import Model
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense, Embedding, SpatialDropout1D, add, concatenate
from tensorflow.keras.layers import LSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D
from tensorflow.keras.preprocessing import text, sequence
from gensim.models import KeyedVectors

from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential

import transformers
from tokenizers import BertWordPieceTokenizer

# <a id="1">Useful resources and Acknowledgement</a>

- [How to Use Kaggle TPUs](https://www.kaggle.com/docs/tpu)
- [Tensorflow core: distributed training with TensorFlow](https://www.tensorflow.org/guide/distributed_training)
- [Tensorflow core: Use a TPU](https://www.tensorflow.org/guide/tpu)
- [Simple LSTM](https://www.kaggle.com/thousandvoices/simple-lstm)
- [Getting started with 100+ flowers on TPU](https://www.kaggle.com/mgornergoogle/getting-started-with-100-flowers-on-tpu)
- [Jigsaw Multilingual Getting Started](https://www.kaggle.com/kivlichangoogle/jigsaw-multilingual-getting-started)
- [Jigsaw TPU: DistilBERT with Huggingface and Keras](https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras) 


# <a id="2">Activate TPU and check if it works</a>

First things first. In the settings panel (bottom right), select `TPU v3-8` and accept the conditions. Then execute the next cell; you should see an output message like `Running on TPU: grpc://10.0.0.2:8470`.
 
The code:
   1. Initializes the TPU
   2. Instantiates a distribution strategy, which makes it possible to run the model in parallel on multiple TPU replicas
   3. Returns the TPU cluster resolver together with the distribution strategy

In [None]:
# Adapted from https://www.kaggle.com/mgornergoogle/getting-started-with-100-flowers-on-tpu

PATH_TPU_WORKER = ''

def check_tpu():
    """
    Detect TPU hardware and return the cluster resolver and the appropriate distribution strategy
    """
    
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver() 
        print('Running on TPU: {}'.format(tpu.master()))
    except ValueError:
        tpu = None

    if tpu:
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
    else:
        tpu_strategy = tf.distribute.get_strategy() # default distribution strategy in Tensorflow. Works on CPU and single GPU.

    print("Num. replicas: {}".format(tpu_strategy.num_replicas_in_sync))
    
    return tpu, tpu_strategy
    
tpu, tpu_strategy = check_tpu()
PATH_TPU_WORKER = tpu.master() if tpu else ''  # tpu is None when no TPU is found
NUM_REPLICAS = tpu_strategy.num_replicas_in_sync

### Data pre-processing

Here, we simply pre-process the text data and convert it into vectors of token ids. At first, I tried a simple Keras Tokenizer, but it performed very poorly because the validation dataset is multilingual. The current pre-processing uses a pre-trained multilingual tokenizer.

After this cell we are left with the cleaned NumPy arrays `x_train`, `x_test`, `x_valid` and `y_train`, `y_valid`.
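
To make the truncation and padding step concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate` and the padding id `0` are assumptions for illustration only; in the notebook this work is done by the tokenizer's `enable_truncation` and `enable_padding`, which also take care of special tokens (the sketch ignores them).

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a list of exactly max_len ids (hypothetical helper)."""
    ids = list(ids)[:max_len]                      # cut sequences that are too long
    return ids + [pad_id] * (max_len - len(ids))   # right-pad sequences that are too short


# Made-up token ids, just to show both cases
print(pad_or_truncate([101, 7592, 102], max_len=5))            # [101, 7592, 102, 0, 0]
print(pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5))   # [101, 1, 2, 3, 4]
```

Because every row ends up with the same length, the encoded texts can be stacked into a single 2-D array of shape `(num_texts, MAX_LEN)`.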

In [None]:
%%time

"""
PATH
"""

PATH_CHALLENGE = '/kaggle/input/jigsaw-multilingual-toxic-comment-classification/'

PATH_TRAIN_FILENAME = PATH_CHALLENGE + "jigsaw-toxic-comment-train.csv"
PATH_TEST_FILENAME = PATH_CHALLENGE + "test.csv"
PATH_VALID_FILENAME = PATH_CHALLENGE + "validation.csv"


"""
LOAD
"""

train_df = pd.read_csv(PATH_TRAIN_FILENAME)
test_df = pd.read_csv(PATH_TEST_FILENAME)
valid_df = pd.read_csv(PATH_VALID_FILENAME)

"""
PREPROCESSING
"""

MAX_LEN = 256

# Adapted https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')

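save_path = '/kaggle/working/distilbert_base_uncased/'
os.makedirs(save_path, exist_ok=True)
tokenizer.save_pretrained(save_path)

# Load the saved vocabulary into the fast tokenizer; build the path from
# save_path so the cell does not depend on the current working directory
fast_tokenizer = BertWordPieceTokenizer(os.path.join(save_path, 'vocab.txt'), lowercase=True)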


def encode(texts, tokenizer, chunk_size=256, maxlen=MAX_LEN):
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []
    
    for i in range(0, len(texts), chunk_size):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])
    
    return np.array(all_ids)

x_train = encode(train_df.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_valid = encode(valid_df.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_test = encode(test_df.content.astype(str), fast_tokenizer, maxlen=MAX_LEN)

y_train = train_df.toxic.values
y_valid = valid_df.toxic.values

# <a id="3">Load CSV dataset with tf.data.Dataset</a>

Because TPUs are very fast, many models ported to TPU end up with a data bottleneck. We say that the TPU is sitting idle, i.e. it spends most of each training epoch waiting for data.

Because of that, it's crucial to have an efficient data-loading pipeline that streams data into the model. TPU models work best with the tf.data API, a TensorFlow API that helps to build flexible and efficient input pipelines.

In the code below, the three datasets `train_dataset`, `valid_dataset` and `test_dataset` are `tf.data.Dataset` objects.

Each pipeline uses the following transformations:
   1. `from_tensor_slices` builds a `tf.data.Dataset` (a sequence of elements, each made of one or more tensors) from plain arrays
   2. `batch` groups the elements into batches of `TOTAL_BATCH_SIZE`
   3. `cache` keeps the data in memory after the first pass
   4. `prefetch` loads the data that is about to be consumed ahead of time

The training dataset is additionally repeated and shuffled.

If you want to know more about data pipeline optimization: [Better performance with the tf.data API](https://www.tensorflow.org/guide/data_performance).
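
To see why overlapping data loading with computation matters, here is a back-of-the-envelope sketch. It relies on a deliberately simplified model (my assumption, not a measurement): without prefetching, loading a batch and training on it happen one after the other, while with prefetching the next batch is loaded during the current training step. The timings are hypothetical.

```python
def idle_fraction(load_ms, compute_ms, prefetch):
    """Fraction of each step the accelerator spends waiting for data.

    Simplified model (an assumption): without prefetching, loading and
    computing are sequential; with prefetching they fully overlap.
    """
    step_ms = max(load_ms, compute_ms) if prefetch else load_ms + compute_ms
    return 1.0 - compute_ms / step_ms


# Hypothetical timings: 30 ms to prepare a batch, 40 ms to train on it
print(idle_fraction(30, 40, prefetch=False))  # the accelerator waits during part of every step
print(idle_fraction(30, 40, prefetch=True))   # loading is hidden behind the computation
```

In this model, prefetching removes the idle time entirely as long as loading a batch is faster than training on it; once loading is slower, the accelerator is idle again no matter what.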


Important: in the general case, to optimize throughput, it's preferable to load the data directly from Google Cloud Storage. It's also preferable to store the data as [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) files.

For the sake of completeness, I tried loading the preprocessed data directly from Google Cloud Storage but didn't notice any improvement in training time.

In [None]:
"""
SETTINGS
"""

AUTO = tf.data.experimental.AUTOTUNE

BATCH_SIZE = 16 
TOTAL_BATCH_SIZE = BATCH_SIZE * tpu_strategy.num_replicas_in_sync
print("Batch size: {}".format(BATCH_SIZE))
print("Total batch size: {}".format(TOTAL_BATCH_SIZE))


"""
DATA Loading
"""


train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .cache()       # cache before repeat(): an infinite dataset can never be fully cached
    .repeat()
    .shuffle(2048)
    .batch(TOTAL_BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    #.repeat()
    .batch(TOTAL_BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(TOTAL_BATCH_SIZE)
    #.repeat()
    .cache()
    .prefetch(AUTO)
)

# <a id="4">Batch size and model compilation with 'strategy.scope()'</a>

When dealing with distributed computations, it's important to distinguish between the TOTAL_BATCH_SIZE and the BATCH_SIZE.

<center>
    <strong>
    BATCH_SIZE = TOTAL_BATCH_SIZE / NUM_REPLICAS
    </strong>
</center>

<br>

In other words, BATCH_SIZE is the number of examples a single TPU replica (core) receives at each step. The model therefore needs to be compiled with this in mind.
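
The relation above is plain integer arithmetic. The following sketch spells it out; the helper name `per_replica_batch_size` is hypothetical, and raising an error on a non-divisible size is a choice made for this illustration.

```python
def per_replica_batch_size(total_batch_size, num_replicas):
    """Number of examples each replica receives from one global batch."""
    if total_batch_size % num_replicas != 0:
        raise ValueError("total batch size must be divisible by the number of replicas")
    return total_batch_size // num_replicas


num_replicas = 8          # a TPU v3-8 exposes 8 cores
total_batch_size = 128    # the value used in this notebook (16 * 8)
print(per_replica_batch_size(total_batch_size, num_replicas))  # 16
```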

To better visualize and understand what happens under the hood, let's go through a simple example.

We define a very simple Keras model and visualize it.

In [None]:
def simple_model(max_len=MAX_LEN):
    words = Input(shape=(max_len,), batch_size=TOTAL_BATCH_SIZE, dtype=tf.int32, name="words")
    x = Dense(10, activation='relu')(words)
    out = Dense(1, activation='sigmoid')(x)
    
    model = Model(inputs=words, outputs=out)
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model.summary()

simple_model()

Then we do the same thing, but this time we compile the model for TPUs. To do that, we just have to build and compile the model inside the `tpu_strategy` scope.

In [None]:
with tpu_strategy.scope():
    simple_model()

As the summary shows, this time the first dimension of each layer, i.e. the batch size, is 8 times smaller than in the previous version. That's because every TPU replica receives just a fraction of the TOTAL_BATCH_SIZE.

Note also that in this scenario we statically defined the input shape, but this is not strictly necessary. The current version of Keras also works with dynamic input shapes.

<font color="#f54242">IMPORTANT</font> In most cases, it's preferable to use a dynamic batch size, as we will do in the next part. The reason is that, if the total batch size is not divisible by the number of TPU replicas, the model will at some point during training receive a batch smaller than the statically defined BATCH_SIZE and throw an error.
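
A statically defined batch size breaks as soon as a batch comes out smaller than expected. One common way this can happen (my assumption about a typical cause, not something measured in this notebook) is the final batch of a dataset whose size is not a multiple of the total batch size. The sketch below uses a made-up dataset size to show it.

```python
import math


def batch_sizes(num_examples, total_batch_size):
    """Sizes of the batches produced by one pass over the data."""
    full, remainder = divmod(num_examples, total_batch_size)
    return [total_batch_size] * full + ([remainder] if remainder else [])


sizes = batch_sizes(num_examples=1000, total_batch_size=128)  # hypothetical dataset size
print(len(sizes), sizes[-1])   # 8 104 -> the last batch holds only 104 examples
print(math.ceil(1000 / 128))   # 8 -> steps needed to see every example once
```

With a dynamic batch size the model accepts the shorter final batch, so no special handling is required.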

# <a id="5"> Final model and training</a>

   1. Define a simple-yet-powerful [LSTM model](https://www.kaggle.com/thousandvoices/simple-lstm)
   2. Compile the model with TPUs capabilities
   3. Train the model
   4. Print AUC-ROC on the validation set
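
As a reminder of what the AUC-ROC in step 4 measures: it is the probability that a randomly chosen positive (toxic) example receives a higher score than a randomly chosen negative one. The pure-Python sketch below computes it by brute force on made-up toy data; the notebook itself relies on scikit-learn, and `pairwise_auc` is a hypothetical helper used only for illustration.

```python
def pairwise_auc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, made up for illustration
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 corresponds to random ranking, which puts the 0.63 of this notebook into perspective.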


Notice how incredibly fast we can train a model with more than **32 million** parameters. By clicking on the "TPU" bar at the top right, you should be able to see the TPU statistics. You will notice that the idle time is quite low, around 6%, which indicates that the TPUs receive the data fast enough.
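
To get a feeling for where the parameters come from, here is a rough count for the architecture used in this section. It is a sketch that assumes the standard Keras parameter shapes (one bias vector per LSTM, Dense layers with bias); the helper names and the vocabulary size in the example call are hypothetical. The embedding layer dominates, so the total depends mostly on the tokenizer's vocabulary size.

```python
def lstm_params(input_dim, units):
    """One LSTM layer: 4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One Dense layer: a weight matrix plus a bias vector."""
    return input_dim * units + units


def count_params(vocab_size, emb_dim=512, lstm_units=256):
    embedding = vocab_size * emb_dim
    # Each Bidirectional wrapper holds two LSTMs; the second wrapper reads
    # the 2 * lstm_units outputs of the first one.
    bilstm_1 = 2 * lstm_params(emb_dim, lstm_units)
    bilstm_2 = 2 * lstm_params(2 * lstm_units, lstm_units)
    pooled = 2 * (2 * lstm_units)                      # max and average pooling, concatenated
    dense = 2 * dense_params(pooled, 4 * lstm_units)   # the two residual Dense layers
    output = dense_params(pooled, 1)                   # final sigmoid unit
    return embedding + bilstm_1 + bilstm_2 + dense + output


print(count_params(vocab_size=30_000))  # hypothetical vocabulary size
```

The exact figure for the real model is the one printed by `model.summary()`.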

<font color="#f54242">IMPORTANT</font> When compiling the model, it's important to pass the loss as `tf.keras.losses.BinaryCrossentropy()` rather than as the string `"binary_crossentropy"`. That's because the latter is not yet fully supported by TPUs. You will notice how much faster training is with the first option.

In [None]:
"""
DEFINE MODEL
"""


def lstm_model(vocab_size, max_len=MAX_LEN):
    
    words = Input(shape=(max_len,), dtype=tf.int32, name="words")
    x = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=512, input_length=max_len)(words)
    x = tf.keras.layers.SpatialDropout1D(0.3)(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(x)

    hidden = tf.keras.layers.concatenate([
        tf.keras.layers.GlobalMaxPooling1D()(x),
        tf.keras.layers.GlobalAveragePooling1D()(x),
    ])
    hidden = tf.keras.layers.add([hidden, Dense(4 * 256, activation='relu')(hidden)])
    hidden = tf.keras.layers.add([hidden, Dense(4 * 256, activation='relu')(hidden)])
    result = Dense(1, activation='sigmoid')(hidden)
    
    model = tf.keras.Model(inputs=words, outputs=result)

    model.compile(loss=tf.keras.losses.BinaryCrossentropy(), # much faster on TPU than the string "binary_crossentropy"
                  optimizer=tf.keras.optimizers.Adam(1e-4),
                  metrics=['accuracy'])
    return model

"""
BUILD
"""

with tpu_strategy.scope():
    vocab_size = tokenizer.vocab_size # vocabulary size of the DistilBERT tokenizer
    model = lstm_model(vocab_size)
model.summary()

In [None]:
%%time

"""
TRAIN
"""

EPOCHS = 5

N_TRAIN_STEPS = 219
N_VALID_STEPS = 63
train_history = model.fit(
    train_dataset,
    steps_per_epoch=N_TRAIN_STEPS,
    validation_data=valid_dataset,
    validation_steps=N_VALID_STEPS,
    epochs=EPOCHS
)


def auc_roc(dataset, ground_truth):
    """Compute the AUC-ROC of the model predictions against the ground truth."""
    from sklearn.metrics import auc, roc_curve

    # Keep only the first len(ground_truth) predictions (see the next section)
    y_pred_keras = model.predict(dataset, verbose=1).ravel()[:len(ground_truth)]
    fpr_keras, tpr_keras, thresholds_keras = roc_curve(ground_truth, y_pred_keras)
    return auc(fpr_keras, tpr_keras)

print("AUC-ROC validation set: {:.4f}".format(auc_roc(valid_dataset, y_valid)))

# <a id="6"> Predictions and submission</a>

I believe this part deserves a section of its own because there is not much documentation about it. Because of the distribution strategy, the output size of `model.predict(input)` **does not** always match the input size. The output size is always a multiple of `NUM_REPLICAS` (8 in our case). Therefore, before submitting the results, we need to keep only the first `input_size` results.
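
The arithmetic behind this can be sketched in a few lines. The sketch assumes that the output is padded up to the smallest multiple of the number of replicas and that the extra values sit at the end; the helper names and the row count are made up for illustration.

```python
import math


def padded_size(input_size, num_replicas):
    """Smallest multiple of num_replicas that can hold input_size predictions."""
    return math.ceil(input_size / num_replicas) * num_replicas


def trim_predictions(predictions, input_size):
    """Drop any extra predictions beyond the number of input rows."""
    return predictions[:input_size]


# 13 rows is a made-up size that is not a multiple of 8
fake_predictions = [0.5] * padded_size(input_size=13, num_replicas=8)
print(len(fake_predictions))                        # 16
print(len(trim_predictions(fake_predictions, 13)))  # 13
```

Slicing is harmless when the sizes already match, so it can be applied unconditionally.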

In [None]:
predictions = model.predict(test_dataset, verbose=1).ravel()

input_size = test_df.shape[0]
output_size = predictions.shape[0]

if input_size != output_size:
    print("Input size differs from output size. Input size: {}, Output size: {}".format(input_size,output_size))

if output_size % NUM_REPLICAS == 0:
    print("Number of predictions is divisible by {}".format(NUM_REPLICAS))
    
    
submission = pd.DataFrame.from_dict({
    'id': test_df.id,
    'toxic': predictions[:input_size]
})

print("Save submission to csv.")
submission.to_csv('submission.csv', index=False)


# <a id="7"> Key takeaways 🦐</a>

- TPUs work best with large files. In that case, you should load the data directly from Google Cloud Storage.
- Develop an efficient data pipeline and monitor the idle time.
- For the loss, prefer a tf.keras object such as `tf.keras.losses.BinaryCrossentropy()` over a string shortcut like `"binary_crossentropy"`.
- During model construction, use dynamic batch sizes.
- The output size of `model.predict(..)` may differ from the input size because of the distribution strategy.