This is an addition to my previous work:
* [[Tweet NLP Benchmark 1] CNN RNN with GloVe](https://www.kaggle.com/vicioussong/tweet-nlp-benchmark-1-cnn-rnn-with-glove)

This work is a benchmark comparing ELMo and the BERT family. I focused on the following indicators:
* Performance (F1 / Precision / Recall)
* Training / Inference speed
* Model size
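
As a reminder of how the three performance indicators relate, here is a minimal sketch that derives precision, recall and F1 from confusion-matrix counts. The helper name `precision_recall_f1` and the counts are hypothetical and only serve as an illustration; the notebook itself relies on `sklearn.metrics`.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))
```

Because F1 is a harmonic mean, a model with high precision but low recall (or the reverse) can end up with the same F1 as a more balanced one.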

My work is inspired by and based on the following notebooks:
* [NLP (Disaster Tweets) with Glove and LSTM](https://www.kaggle.com/mariapushkareva/nlp-disaster-tweets-with-glove-and-lstm)
* [Disaster NLP: Keras BERT using TFHub](https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub)

## 0 - Benchmark

| Model           | F1   | Precision | Recall | Training Time | Inference Time |
|-----------------|------|-----------|--------|---------------|----------------|
| DistilBert Base | 0.78 | 0.83      | 0.74   | 198.1 s       | 3.9 ms         |
| AlBert Base     | 0.79 | 0.78      | 0.81   | 371.0 s       | 7.9 ms         |
| AlBert Large    | 0.78 | 0.84      | 0.73   | 1022.9 s      | 21.6 ms        |
| Bert Base       | 0.80 | 0.79      | 0.82   | 368.2 s       | 8.3 ms         |
| Bert Large      | 0.78 | 0.74      | 0.83   | 1060.9 s      | 21.7 ms        |
| RoBERTa Base    | 0.78 | 0.73      | 0.85   | 390.3 s       | 8.3 ms         |

* Is there a winner in terms of F1 on this task? <span style="color:red">Not really: every model scores between 0.78 and 0.80.</span>
* In similar classification tasks, which model is most suitable, especially in a production environment? <span style="color:red">DistilBert could be a good choice. It gives acceptable results with the shortest training and inference times in the table.</span>
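
To make the speed versus quality trade-off concrete, the sketch below recomputes relative speedups from the table above, taking Bert Base as the baseline. The numbers are copied from the table; the dictionary layout and the helper `relative_to` are my own assumptions for illustration.

```python
# Values copied from the benchmark table (F1, training time in s, inference time in ms).
BENCHMARK = {
    "DistilBert Base": {"f1": 0.78, "train_s": 198.1, "infer_ms": 3.9},
    "AlBert Base": {"f1": 0.79, "train_s": 371.0, "infer_ms": 7.9},
    "AlBert Large": {"f1": 0.78, "train_s": 1022.9, "infer_ms": 21.6},
    "Bert Base": {"f1": 0.80, "train_s": 368.2, "infer_ms": 8.3},
    "Bert Large": {"f1": 0.78, "train_s": 1060.9, "infer_ms": 21.7},
    "RoBERTa Base": {"f1": 0.78, "train_s": 390.3, "infer_ms": 8.3},
}


def relative_to(baseline_name, table=BENCHMARK):
    """Express every row as a speedup and F1 gap relative to one baseline row."""
    base = table[baseline_name]
    return {
        name: {
            "inference_speedup": base["infer_ms"] / row["infer_ms"],
            "training_speedup": base["train_s"] / row["train_s"],
            "f1_gap": row["f1"] - base["f1"],
        }
        for name, row in table.items()
    }


for name, stats in relative_to("Bert Base").items():
    print(f"{name:16s} inference x{stats['inference_speedup']:.2f}  "
          f"training x{stats['training_speedup']:.2f}  F1 {stats['f1_gap']:+.2f}")
```

On these numbers, DistilBert Base is roughly 2.1 times faster than Bert Base at inference for an F1 drop of 0.02.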

## 1 - Import

In [1]:
import os
import time

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import transformers
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import ModelCheckpoint

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

## 2 - Load & Preprocess

In [2]:
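def bert_encode(texts, tokenizer, max_len=512):
    """Tokenize texts into fixed-length arrays of token ids (truncated and padded to max_len)."""
    all_tokens = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        # Reserve two positions for the start and end special tokens.
        text = text[:max_len-2]
        # Use the tokenizer's own special tokens ("[CLS]"/"[SEP]" for BERT,
        # "<s>"/"</s>" for RoBERTa) instead of hard-coding the BERT ones.
        input_sequence = [tokenizer.cls_token] + text + [tokenizer.sep_token]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        # Pad with the tokenizer's pad id (0 for BERT, 1 for RoBERTa).
        tokens += [tokenizer.pad_token_id] * pad_len
        
        all_tokens.append(tokens)
    
    return np.array(all_tokens)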

def build_model(transformer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
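    # The first output is the sequence of hidden states, shape (batch, max_len, hidden).
    sequence_output = transformer(input_word_ids)[0]
    # Use the hidden state of the first token as the sentence representation.
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)
    
    model = Model(inputs=input_word_ids, outputs=out)
    # `learning_rate` replaces the deprecated `lr` argument.
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])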
    
    return model
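
To show the layout that `bert_encode` produces, here is a small pure-Python sketch of the truncate, wrap and pad steps on already tokenized ids. The helper `encode_ids` and the token ids 7, 8, 9 are hypothetical; the special ids 101, 102 and 0 are those of `bert-base-uncased`. The sketch also returns an attention mask, which the model in this notebook does not use.

```python
def encode_ids(token_ids, max_len, cls_id, sep_id, pad_id):
    """Truncate, add start/end special ids, then pad to max_len.

    Returns (input_ids, attention_mask), both lists of length max_len.
    """
    body = token_ids[: max_len - 2]          # keep room for the two special ids
    ids = [cls_id] + body + [sep_id]
    pad_len = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * pad_len  # 1 = real token, 0 = padding
    return ids + [pad_id] * pad_len, attention_mask


# Short input: padded on the right.
ids, mask = encode_ids([7, 8, 9], max_len=8, cls_id=101, sep_id=102, pad_id=0)
print(ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# Long input: truncated to max_len - 2 tokens before wrapping.
ids, mask = encode_ids(list(range(10, 20)), max_len=8, cls_id=101, sep_id=102, pad_id=0)
print(ids)   # [101, 10, 11, 12, 13, 14, 15, 102]
```

Hugging Face TF models accept an `attention_mask` argument, so a mask like this could be passed as a second input so that padding positions are ignored; whether that changes the scores on this dataset is not something I measured.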

In [3]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test  = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

In [4]:
%%time
# Distil Bert Base
# model_to_use = 'distilbert-base-uncased'

# model_to_use = 'albert-base-v1'
# model_to_use = 'albert-large-v1'
# model_to_use = 'albert-xlarge-v1'
# model_to_use = 'albert-xxlarge-v1'

model_to_use = 'bert-base-uncased'
# model_to_use = 'bert-large-uncased'

# model_to_use = 'roberta-base'
# model_to_use = 'roberta-large'

if model_to_use.split('-')[0] == 'distilbert':
    transformer_layer = transformers.TFDistilBertModel.from_pretrained(model_to_use)
    tokenizer = transformers.DistilBertTokenizer.from_pretrained(model_to_use)
    
if model_to_use.split('-')[0] == 'albert':
    transformer_layer = transformers.TFAlbertModel.from_pretrained(model_to_use)
    tokenizer = transformers.AlbertTokenizer.from_pretrained(model_to_use)
    
if model_to_use.split('-')[0] == 'bert':
    transformer_layer = transformers.TFBertModel.from_pretrained(model_to_use)
    tokenizer = transformers.BertTokenizer.from_pretrained(model_to_use)
    
if model_to_use.split('-')[0] == 'roberta':
    transformer_layer = transformers.TFRobertaModel.from_pretrained(model_to_use)
    tokenizer = transformers.RobertaTokenizer.from_pretrained(model_to_use)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


CPU times: user 16.3 s, sys: 3.6 s, total: 19.9 s
Wall time: 36.7 s
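
The chain of `if` statements in the cell above could also be written as a lookup table, which fails loudly on an unsupported checkpoint name. This is only a sketch: the class names are the ones used above, while `MODEL_FAMILIES` and `resolve_classes` are hypothetical names introduced here.

```python
# Family prefix -> (model class name, tokenizer class name) in `transformers`.
MODEL_FAMILIES = {
    "distilbert": ("TFDistilBertModel", "DistilBertTokenizer"),
    "albert": ("TFAlbertModel", "AlbertTokenizer"),
    "bert": ("TFBertModel", "BertTokenizer"),
    "roberta": ("TFRobertaModel", "RobertaTokenizer"),
}


def resolve_classes(model_name):
    """Return the (model, tokenizer) class names for a checkpoint such as 'bert-base-uncased'."""
    family = model_name.split("-")[0]
    if family not in MODEL_FAMILIES:
        raise ValueError(f"Unsupported model family: {family!r}")
    return MODEL_FAMILIES[family]


model_cls_name, tokenizer_cls_name = resolve_classes("bert-base-uncased")
print(model_cls_name, tokenizer_cls_name)
```

With `transformers` imported, the classes can then be fetched with `getattr(transformers, model_cls_name).from_pretrained(model_to_use)`, and likewise for the tokenizer.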


In [5]:
max_seq_len = 160

train_input = bert_encode(train.text.values, tokenizer, max_len=max_seq_len)
test_input  = bert_encode(test.text.values, tokenizer, max_len=max_seq_len)
train_label = train.target.values

# Data split
X_train, X_test, y_train, y_test = train_test_split(train_input, 
                                                    train_label, 
                                                    test_size=0.25,
                                                    random_state=42, 
                                                    shuffle=True)
# X_train.shape, X_test.shape = ((5709, 160), (1904, 160))
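
The shapes in the comment above follow from the split arithmetic. As far as I know, `train_test_split` rounds the test share up when `test_size` is a float and gives the remainder to the training set; the sketch below reproduces that rule with a hypothetical helper `split_sizes`.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 5709 + 1904 = 7613 rows in total, taken from the shapes above.
print(split_sizes(5709 + 1904, 0.25))  # (5709, 1904)
```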

## 3 - Modeling

In [6]:
def metrics(y_true, y_pred):
    print("\nF1-score: ", round(f1_score(y_true, y_pred), 2))
    print("Precision: ", round(precision_score(y_true, y_pred), 2))
    print("Recall: ", round(recall_score(y_true, y_pred), 2))

In [7]:
model = build_model(transformer_layer, max_len=max_seq_len)
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_word_ids (InputLayer)  [(None, 160)]             0         
_________________________________________________________________
tf_bert_model (TFBertModel)  ((None, 160, 768), (None, 109482240 
_________________________________________________________________
tf_op_layer_strided_slice (T [(None, 768)]             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 769       
Total params: 109,483,009
Trainable params: 109,483,009
Non-trainable params: 0
_________________________________________________________________


In [8]:
# Training
start_time = time.time()
train_history = model.fit(X_train, y_train, epochs = 3, batch_size = 8)
end_time = time.time()
print("\n=>Training time :", round(end_time - start_time, 1), 's')

Epoch 1/3
Epoch 2/3
Epoch 3/3

=>Training time : 415.9 s


In [9]:
# Validation
start_time = time.time()
test_pred = model.predict(X_test, verbose=1).round().astype(int)
end_time = time.time()

print('\n=>Average Inference Time :', round((end_time - start_time) / len(test_pred) * 1000, 1), 'ms')
metrics(y_test, test_pred)


=>Average Inference Time : 8.0 ms

F1-score:  0.79
Precision:  0.84
Recall:  0.74
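
Note that the inference time above is the total `predict` time divided by the number of samples, so it is an average under batched prediction rather than the latency of a single request. A minimal sketch of that measurement, with a hypothetical helper `average_latency_ms` and a dummy predictor standing in for the model:

```python
import time


def average_latency_ms(predict_fn, inputs):
    """Run predict_fn once on all inputs and return (outputs, average ms per sample)."""
    start = time.perf_counter()
    outputs = predict_fn(inputs)
    elapsed = time.perf_counter() - start
    return outputs, elapsed / len(inputs) * 1000


def dummy_predict(batch):
    # Stand-in for model.predict: labels a sample 1 if it has more than 3 items.
    return [int(len(sample) > 3) for sample in batch]


outputs, ms = average_latency_ms(dummy_predict, [[1, 2], [1, 2, 3, 4], [5]])
print(outputs, f"{ms:.4f} ms per sample")
```

The first call to a model may also include one-off warm-up cost, so timing a second call can give a fairer number.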


## 4 - Submission

In [10]:
# Flatten the (n_samples, 1) prediction array into a 1-D column.
submission['target'] = model.predict(test_input, verbose=1).round().astype(int).ravel()
submission.to_csv('submission.csv', index=False)

