## BERT Disaster Tweet Classification
This notebook illustrates how easily modern APIs (in this case TensorFlow) let you apply SOTA models to downstream NLP tasks.

### Less common installs for BERT pre-trained model

In [None]:
!pip install -q -U tensorflow-text
!pip install -q -U tf-models-official

### Preamble

In [None]:
import os

import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create the AdamW optimizer

import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

print('TensorFlow:', tf.__version__)

strategy = tf.distribute.MirroredStrategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

### Hyperparameters and global variables

In [None]:
batch_size = 16
epochs = 6
init_lr = 3e-5

## Import Tweet Dataset
To use all the information in the tweet dataset, any keyword and location a tweet has are prepended to its text, each followed by a semicolon.
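As a minimal sketch of that concatenation on plain Python strings: the helper name `build_input` and the sample tweet are hypothetical, and missing fields are assumed to be `None` here rather than the NaN values pandas uses.

```python
def build_input(text, keyword=None, location=None):
    """Prefix the tweet text with its keyword and location, when present."""
    # Each available field contributes "<field>; " in keyword-then-location order.
    parts = [f"{field}; " for field in (keyword, location) if field]
    return "".join(parts) + text


# Hypothetical sample values, for illustration only.
print(build_input("Forest fire near La Ronge", keyword="wildfire", location="Canada"))
print(build_input("Forest fire near La Ronge"))
```

A tweet with neither field is passed through unchanged, which matches filling the missing columns with empty strings.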

We split the data into 80% train and 20% validation. Typically one would partition the training set into k folds for more reliable results, but with these larger SOTA models that is often no longer computationally feasible.
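For reference, a k-fold partition could look roughly like the sketch below. It is not used in this notebook. The function name `k_fold_indices` is hypothetical and the folds are contiguous, unshuffled index ranges. Each fold would need its own fine-tuning run, which is what makes the approach expensive for large models.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


# Toy example: 10 samples split into 5 folds.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5)):
    print(fold, val_idx)
```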

We then apply various BERT models and fine-tune them.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = pd.read_csv("../input/nlp-getting-started/train.csv")
y = train_ds['target']

x_len = len(train_ds['text'])

train_ds["keyword"] = train_ds["keyword"] + "; "
train_ds["location"] = train_ds["location"] + "; "

train_ds["keyword"] = train_ds["keyword"].fillna("")
train_ds["location"] = train_ds["location"].fillna("")

train_ds['text'] = train_ds["keyword"] + train_ds["location"] + train_ds['text']

train_len = int(0.8 * x_len)
val_len = int(0.2 * x_len)


full_df = tf.data.Dataset.from_tensor_slices((train_ds['text'],y)).batch(batch_size).cache().prefetch(buffer_size=AUTOTUNE)
train_df = full_df.take(int(train_len/batch_size)).shuffle(buffer_size=int(train_len/batch_size))
val_df = full_df.skip(int(train_len/batch_size)).shuffle(buffer_size=int(val_len/batch_size))
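Because the dataset is batched before `take` and `skip` are applied, the split above is made in whole batches rather than in rows, and `shuffle` reorders batches rather than individual tweets. The arithmetic can be sketched as follows; the row count of 1000 is a hypothetical value, not the size of the actual dataset, and `split_batches` is an illustrative helper name.

```python
import math


def split_batches(n_rows, batch_size, train_fraction=0.8):
    """Return (train_batches, val_batches) for a batch-level take/skip split."""
    # Batching without dropping the remainder yields ceil(n / batch_size) batches.
    total_batches = math.ceil(n_rows / batch_size)
    # Mirrors int(train_len / batch_size) with train_len = int(0.8 * n_rows).
    train_batches = int(int(train_fraction * n_rows) / batch_size)
    # skip() keeps everything after the training batches.
    val_batches = total_batches - train_batches
    return train_batches, val_batches


train_batches, val_batches = split_batches(n_rows=1000, batch_size=16)
print(train_batches, val_batches)  # 50 13
```

The number of training batches is also the number of optimizer steps in one epoch.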

## BERT Model Selection
Luckily, the people at TensorFlow make selecting a SOTA pre-trained model as easy as changing a key in a dictionary.

In [None]:
bert_model_name = 'talking-heads_base' 

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_uncased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/3',
    'bert_en_wwm_uncased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_wwm_uncased_L-24_H-1024_A-16/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_en_cased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-24_H-1024_A-16/3',
    'bert_en_wwm_cased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_wwm_cased_L-24_H-1024_A-16/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'albert_en_large':
        'https://tfhub.dev/tensorflow/albert_en_large/2',
    'albert_en_xlarge':
        'https://tfhub.dev/tensorflow/albert_en_xlarge/2',
    'albert_en_xxlarge':
        'https://tfhub.dev/tensorflow/albert_en_xxlarge/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
    'talking-heads_large':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_large/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_wwm_cased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'bert_en_cased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'bert_en_wwm_uncased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'albert_en_large':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'albert_en_xlarge':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'albert_en_xxlarge':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_large':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocessing model auto-selected: {tfhub_handle_preprocess}')
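A mistyped model name in the cell above fails with a bare `KeyError`. One possible guard is sketched below using a two-entry excerpt of the dictionaries; the helper `select_handles` and the names `ENCODERS` and `PREPROCESSORS` are hypothetical and not part of the notebook.

```python
# Two-entry excerpt of the handle dictionaries, for illustration only.
ENCODERS = {
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
}
PREPROCESSORS = {
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
}


def select_handles(name, encoders, preprocessors):
    """Return (encoder_handle, preprocess_handle), listing valid names on failure."""
    if name not in encoders or name not in preprocessors:
        known = ", ".join(sorted(set(encoders) & set(preprocessors)))
        raise KeyError(f"Unknown model {name!r}; choose one of: {known}")
    return encoders[name], preprocessors[name]


encoder_handle, preprocess_handle = select_handles(
    'talking-heads_base', ENCODERS, PREPROCESSORS)
print(encoder_handle)
print(preprocess_handle)
```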

### Let's see what a pre-processed sequence looks like
The preprocessing model splits the text into word or sub-word units and maps each one to an integer token ID, which the model then associates with an embedding vector.

In [None]:
with strategy.scope():
    bert_preprocess_model = hub.load(tfhub_handle_preprocess)

    test_inp = [train_ds['text'].iloc[0]]
    print(test_inp)
    print(bert_preprocess_model.tokenize(test_inp))
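To make the sub-word idea concrete, here is a toy greedy longest-match tokenizer in the style of WordPiece, which the BERT family uses. The five-entry vocabulary and the function name `wordpiece_tokenize` are made up for illustration. The real preprocessing model has a far larger vocabulary and also handles lower-casing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary mapping pieces to integer IDs.
toy_vocab = {"[UNK]": 0, "earth": 1, "##quake": 2, "flood": 3, "##s": 4}
tokens = wordpiece_tokenize("earthquake", toy_vocab)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)  # ['earth', '##quake'] [1, 2]
```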

## Build and run BERT

In [None]:
def build_bert():
    """Build a classifier: raw text -> preprocessing -> BERT encoder -> single logit."""
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
    return tf.keras.Model(text_input, net)
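The `classifier` layer has `activation=None`, so the model outputs a raw logit rather than a probability, and the loss is expected to apply the sigmoid itself. As a rough sketch of what a binary cross-entropy on logits computes for one example, in plain Python, using a standard numerically stable formulation (the function names are illustrative, not TensorFlow's):

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def bce_from_logit(label, logit):
    """Binary cross-entropy for one example, computed directly from the logit.

    Uses max(x, 0) - x * y + log(1 + exp(-|x|)), which avoids overflow
    for logits of large magnitude.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A logit of 0 means "50/50", so the loss is log(2) for either label.
print(sigmoid(0.0), bce_from_logit(1, 0.0))
# A confident, correct prediction gives a small loss.
print(sigmoid(4.0), bce_from_logit(1, 4.0))
```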

In [None]:
with strategy.scope():
    bert = build_bert()

    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    metrics = [tf.metrics.BinaryAccuracy()]

    # The schedule counts optimizer steps (batches), not individual examples.
    steps_per_epoch = int(train_len / batch_size)
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(0.2*num_train_steps)

    optimizer = optimization.create_optimizer(init_lr=init_lr,
                                              num_train_steps=num_train_steps,
                                              num_warmup_steps=num_warmup_steps,
                                              optimizer_type='adamw')

    bert.compile(optimizer=optimizer,
                            loss=loss,
                            metrics=metrics)

bert.summary()

history = bert.fit(x=train_df,
                  validation_data=val_df,
                  epochs=epochs)
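The optimizer above is configured with a warmup phase followed by a decaying learning rate, both measured in optimizer steps (one step per batch). The sketch below approximates such a schedule as a linear warmup followed by a linear decay to zero. This shape is an assumption made for illustration, and the exact curve produced by `optimization.create_optimizer` may differ. The function name and the step counts are hypothetical.

```python
def warmup_linear_decay(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at a given optimizer step (illustrative approximation)."""
    if num_warmup_steps > 0 and step < num_warmup_steps:
        # Linear ramp from 0 up towards init_lr.
        return init_lr * step / num_warmup_steps
    # Linear decay that reaches 0 at num_train_steps.
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / num_train_steps


# Hypothetical run: 100 batches per epoch for 6 epochs, 20% warmup.
total = 100 * 6
warmup = int(0.2 * total)
for step in (0, warmup // 2, warmup, total // 2, total):
    print(step, warmup_linear_decay(step, 3e-5, total, warmup))
```

If the total step count is much larger than the number of batches actually processed, training ends while the rate is still ramping up and the peak rate is never reached.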

### Dev Performances
 * **bert_multi_cased_L-12_H-768_A-12**
   * val_loss: 0.4034 - val_binary_accuracy: 0.8239
   * 16 batch, 1e-5 LR, 3/4 epochs
 * **bert_en_uncased_L-24_H-1024_A-16**
   * val_loss: 0.4089 - val_binary_accuracy: 0.8311
   * 16 batch, 1e-5 LR, 3/4 epochs
 * **bert_en_uncased_L-12_H-768_A-12**
   * val_loss: 0.4176 - val_binary_accuracy: 0.8297
   * 32 batch, 3e-5 LR, 4/4 epochs
 * **bert_en_cased_L-24_H-1024_A-16**
   * val_loss: 0.3840 - val_binary_accuracy: 0.8350
   * 16 batch, 1e-5 LR, 3/4 epochs
 * **electra_base**
   * val_loss: 0.4147 - val_binary_accuracy: 0.8232
   * 16 batch, 3e-5 LR, 2/4 epochs
 * **albert_en_xxlarge**
   * val_loss: 0.4116 - val_binary_accuracy: 0.8393
   * 8 batch, 3e-5 LR, 3/4 epochs
 * **albert_en_large**
   * val_loss: 0.4517 - val_binary_accuracy: 0.8258
   * 16 batch, 2e-5 LR, 3/4 epochs
 * **talking-heads_large**
   * val_loss: 0.4345 - val_binary_accuracy: 0.8420
   * 8 batch, 2e-5 LR, 2/4 epochs

 * **talking-heads_base**
   * val_loss: 0.3835 - val_binary_accuracy: 0.8369
   * 32 batch, 2e-5 LR, 4/4 epochs
   * val_loss: 0.3723 - val_binary_accuracy: 0.8382
   * 32 batch, 3e-5 LR, 4/4 epochs
   * val_loss: 0.3844 - val_binary_accuracy: 0.8343
   * 32 batch, 2e-5 LR, 6/6 epochs

### Train model on entire training set

In [None]:
init_lr = 3e-5
epochs = 3

with strategy.scope():
    bert = build_bert()

    # One optimizer step per batch of the full dataset (the last batch may be partial).
    steps_per_epoch = int(np.ceil(x_len / batch_size))
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(0.2*num_train_steps)


    optimizer = optimization.create_optimizer(init_lr=init_lr,
                                              num_train_steps=num_train_steps,
                                              num_warmup_steps=num_warmup_steps,
                                              optimizer_type='adamw')

    bert.compile(optimizer=optimizer,
                            loss=loss,
                            metrics=metrics)

    bert.summary()

history = bert.fit(x=full_df,epochs=epochs)

### Test set prediction

In [None]:
test_ds = pd.read_csv("../input/nlp-getting-started/test.csv")

test_ds["keyword"] = test_ds["keyword"] + "; "
test_ds["location"] = test_ds["location"] + "; "

test_ds["keyword"] = test_ds["keyword"].fillna("")
test_ds["location"] = test_ds["location"].fillna("")

test_ds['text'] = test_ds["keyword"] + test_ds["location"] + test_ds['text']

test_pred = bert.predict(test_ds['text'])

test_id = test_ds['id']

with open("./bert_talking_heads_predictions.csv","w+") as f:
    f.write("id,target\n")
    
    for i,val in enumerate(test_pred):
        f.write("%d,%d\n" % (test_id[i],round(tf.sigmoid(val).numpy()[0])))
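The last line turns each logit into a 0/1 label by applying a sigmoid and rounding. A plain-Python sketch of that step is below. The helper name `logit_to_label` is hypothetical, and a probability of exactly 0.5 is assumed to map to class 0.

```python
import math


def logit_to_label(logit, threshold=0.5):
    """Map a raw model logit to a 0/1 class label."""
    # Numerically stable sigmoid.
    if logit >= 0:
        probability = 1.0 / (1.0 + math.exp(-logit))
    else:
        z = math.exp(logit)
        probability = z / (1.0 + z)
    return int(probability > threshold)


# Hypothetical logits, for illustration only.
print([logit_to_label(x) for x in (-3.2, -0.1, 0.0, 0.4, 5.0)])  # [0, 0, 0, 1, 1]
```

With the default threshold of 0.5 this is equivalent to checking whether the logit is positive.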