# Fine-tuning a BERT model to classify *natural disaster* tweets as literal or figurative

The language of natural disasters is commonly used figuratively to describe other situations. The sentence "The scientific community unleashed a tsunami of articles criticising the paper" illustrates this linguistic phenomenon. In this notebook, you will see how to fine-tune a pre-trained BERT model to distinguish tweets meant figuratively from tweets that describe real natural disasters, which is an instance of transfer learning. Pre-trained models are a promising area of application because they bring the value of state-of-the-art models built and optimised by the likes of Google (BERT is Google's creation) into the hands of every machine learning practitioner.

# Loading the data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train_data = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
train_data.head(10)

In [None]:
test_data = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
print(test_data.size)
test_data.head(10)

# Exploring the data

The first thing we notice from the partially displayed dataframes above is that the `keyword` and `location` columns contain many `NaN` values, which we will need to deal with later.

## A peek into tweet lengths

The histogram below shows how many tweets there are of each length, measured in characters.

In [None]:
plt.rcParams["figure.figsize"] = [15, 5.50]
plt.rcParams["figure.autolayout"] = True
# Length of each tweet in characters
tweet_lengths = train_data['text'].str.len()
plt.hist(tweet_lengths, 50)
plt.grid(True)

## The keyword column
Let us first check whether the `keyword` column is useful for the current task.

In [None]:
number_of_unique_keywords=train_data['keyword'].unique().size
print(f'Number of unique keywords: {number_of_unique_keywords}')

The number of unique values in `keyword` is 222 (`unique()` counts `NaN` as a value). Having so few distinct values compared with the total number of samples (more than 7000) means each keyword covers many tweets, which could be promising for the classification task. But let's keep exploring. Below is a plot of each keyword's count.

In [None]:
plt.rcParams["figure.figsize"] = [15, 30]
plt.rcParams["figure.autolayout"] = True
train_data['keyword'].value_counts().plot(kind='barh')
plt.grid(True)
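
Counts alone do not show whether a keyword helps separate the two classes. A rough check, sketched here on a few made-up rows rather than the real data, is the share of disaster-labelled tweets per keyword: values near 0 or 1 suggest an informative keyword, values near 0.5 an ambiguous one. The rows and the helper name `disaster_rate_by_keyword` are hypothetical; on the real dataframe the same figure should be obtainable with `train_data.groupby('keyword')['target'].mean()`.

```python
from collections import defaultdict

def disaster_rate_by_keyword(rows):
    """Return {keyword: fraction of rows whose target is 1}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for keyword, target in rows:
        totals[keyword] += 1
        positives[keyword] += target
    return {k: positives[k] / totals[k] for k in totals}

# Made-up (keyword, target) pairs, for illustration only
toy_rows = [
    ("wildfire", 1), ("wildfire", 1), ("wildfire", 0), ("wildfire", 1),
    ("ablaze", 0), ("ablaze", 1),
]
rates = disaster_rate_by_keyword(toy_rows)
print(rates)
```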


Below is a table showing all the keywords in this column.

In [None]:
utd = train_data['keyword'].unique()
# Convert missing values to the string 'NaN' so that every entry is a string
uutd = np.array(['NaN' if pd.isna(e) else e for e in utd])
# Pad with 'None' up to 15*15 = 225 entries so the list fills a square table
extension = 15*15 - uutd.size
xutd = np.hstack((uutd, np.full((extension,), 'None')))
xutd = np.reshape(xutd, (-1, 15))
pd.DataFrame(xutd)

While there are ways to make use of the `keyword` column, in this notebook we will drop it because using it would require rethinking our model's architecture. A second reason is that a model trained on tweet text alone is more general: it can sort tweets about natural disasters without any extra metadata. We will also drop the `location` column.

# Preparing the training and the test sets

In [None]:
drop_columns = ['location','keyword']
target = train_data.pop('target')
train_data=train_data.drop(columns=drop_columns)
train_data

In [None]:
test_data=test_data.drop(columns=drop_columns)
test_data

You may have noticed from the keyword table above that some keywords have `%20` between words. This is the URL encoding of a space character, and it is better to convert it to a normal space so that our model treats it as one. This is crucial because spaces are a prominent feature of English script, and of many other scripts for that matter. For a transformer-based model such as BERT, whitespace is what the tokenizer uses to find word boundaries, so a `%20` left in place would be tokenized as extra symbols instead of a word break. So, as a first step, we need to check whether the text column contains any of these encoded spaces.

In [None]:
tweet_substring=map(lambda x:'%20' in x,train_data['text'])
num_spaces = sum(list(tweet_substring))
print(f'{num_spaces} weird spaces were found')

In [None]:
tweet_substring=map(lambda x:'%20' in x,test_data['text'])
num_spaces = sum(list(tweet_substring))
print(f'{num_spaces} weird spaces were found')

No weird spaces were found in either the training set or the test set. There are numerous other aspects of the data set we could investigate. For example, the tweets contain many short URLs whose cryptic text carries no information, although their presence in a tweet could be informative. It is a good idea to replace them all with a single placeholder string, which signals their presence without the uninformative variation. For the time being, though, let us limit ourselves to the preprocessing done above and delve right into building our BERT model.
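
For reference, both clean-up ideas discussed above (decoding `%20` and collapsing URLs into one placeholder) can be sketched in a few lines. This is a minimal illustration and not part of the pipeline used in this notebook: the helper name `clean_tweet`, the placeholder `URL`, the deliberately simple URL pattern and the sample tweet are all assumptions.

```python
import re

# Deliberately simple pattern: "http://" or "https://" followed by non-space characters
URL_PATTERN = re.compile(r"https?://\S+")

def clean_tweet(text, url_token="URL"):
    """Replace every URL with one placeholder, then decode '%20' to a space."""
    text = URL_PATTERN.sub(url_token, text)
    return text.replace("%20", " ")

# Made-up tweet, for illustration only
sample = "Forest%20fire near the lake http://t.co/abc123 stay safe"
cleaned = clean_tweet(sample)
print(cleaned)
```

URLs are replaced before `%20` is decoded so that an encoded space inside a URL cannot split it in two. If desired, the function could be applied to a whole column with `train_data['text'].map(clean_tweet)`.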

# Before using BERT



BERT uses a preprocessor as the first stage to convert text into the numeric inputs (token ids and masks) that the encoder model understands. The package `tensorflow-text` is required for the preprocessor to work. We will also install `tf-models-official` to make use of the AdamW optimiser, which applies weight decay separately from the Adam update and is commonly preferred over plain Adam for fine-tuning transformers. If you are running this notebook without a GPU, the first install line below will produce several errors; please ignore them along with the warnings, as they should not affect the code execution below.

In [None]:
#Required for the preprocessor to work
!pip install -q -U "tensorflow-text==2.8.*"

In [None]:
#Required for AdamW
!pip install -q tf-models-official==2.7.0

In [None]:
import shutil

import tensorflow as tf
import tensorflow_hub as hub 
import tensorflow_text as text
from official.nlp import optimization

tf.get_logger().setLevel('ERROR')

# Preparing a TensorFlow dataset

Prepare the training set `disaster_ds` with a batch size of 32, and then check that the first two batches look OK for good measure. TensorFlow datasets encapsulate many functionalities essential to data sets, such as batching, prefetching to speed up execution, sample shuffling to mitigate the effect of sample ordering, splitting into training and validation sets, and many others. We will only use batching in this notebook to keep things simple.

In [None]:
batch_size = 32
disaster_ds = tf.data.Dataset.from_tensor_slices((train_data['text'], target)).batch(batch_size)


In [None]:
# Check that the training set contains the desired data (each element is one batch)
for row in disaster_ds.take(2):
    print(row)
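
The dataset above uses every labelled tweet for training. A held-out validation split, one of the features this notebook leaves out, could be prepared by shuffling the sample indices once and cutting them in two. The sketch below only produces the indices; the helper name `train_val_indices`, the 20% fraction and the seed are assumptions chosen for illustration.

```python
import numpy as np

def train_val_indices(num_samples, val_fraction=0.2, seed=42):
    """Shuffle the sample indices once and split them into train and validation parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_samples)
    num_val = int(num_samples * val_fraction)
    return indices[num_val:], indices[:num_val]

# Ten samples only, to keep the printed result readable
train_idx, val_idx = train_val_indices(num_samples=10)
print(train_idx, val_idx)
```

The two index arrays could then select rows with `train_data['text'].iloc[train_idx]` and `target.iloc[train_idx]` before calling `from_tensor_slices`, and the validation dataset could be passed to `fit` through its `validation_data` argument.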

# The BERT with a thousand faces

BERT comes in many flavours, versions and sizes, and each model must be matched with a compatible preprocessing module. The one we chose for this task is the English uncased model (all letters are lower-cased) with 4 transformer layers (L), a hidden size of 512 per token (H) and 8 attention heads (A). Its preprocessor pads or truncates every input to 128 tokens. Each model and its corresponding preprocessor can be retrieved from TensorFlow Hub using the URL associated with it. `KerasLayer` uses the URL as a handle to wrap the preprocessor and the encoder as Keras layers.

In [None]:
encoder_url ='https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1'
preprocessor_url='https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
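
The `L`, `H` and `A` values are encoded in the handle itself, so they can be read back programmatically when comparing several model sizes. The sketch below is only a convenience for this naming pattern; the helper name `parse_small_bert_handle` is hypothetical and other model families may name their handles differently.

```python
import re

def parse_small_bert_handle(url):
    """Extract the L (layers), H (hidden size) and A (attention heads) values from a handle."""
    match = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", url)
    if match is None:
        raise ValueError(f"No L/H/A pattern found in {url!r}")
    layers, hidden, heads = (int(g) for g in match.groups())
    return {"layers": layers, "hidden_size": hidden, "attention_heads": heads}

encoder_url = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1'
config = parse_small_bert_handle(encoder_url)
print(config)
# Each attention head works on an equal slice of the hidden vector
print(config["hidden_size"] // config["attention_heads"])
```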

In [None]:
preprocessor = hub.KerasLayer(preprocessor_url)

Test the preprocessor model on a text string to get a feel of what the preprocessor does.

In [None]:
test_txt = ['His haircut is an absolute disaster']
text_out = preprocessor(test_txt)
text_out
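
The preprocessor returns a dictionary with three integer tensors, `input_word_ids`, `input_mask` and `input_type_ids`, each 128 entries long per input. The NumPy sketch below mimics that layout for a single sentence so the role of each tensor is easier to see. It does not run the real tokenizer: the word ids are made up, the helper name `pack_inputs` is hypothetical, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) are stated from memory of the uncased BERT vocabulary.

```python
import numpy as np

SEQ_LENGTH = 128                       # sequence length produced by the preprocessor
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # assumed special-token ids (uncased vocabulary)

def pack_inputs(token_ids, seq_length=SEQ_LENGTH):
    """Mimic the layout of the preprocessor output for one single-segment input."""
    ids = [CLS_ID] + list(token_ids)[: seq_length - 2] + [SEP_ID]
    padding = seq_length - len(ids)
    return {
        "input_word_ids": np.array(ids + [PAD_ID] * padding),    # token ids, then padding
        "input_mask": np.array([1] * len(ids) + [0] * padding),  # 1 for real tokens, 0 for padding
        "input_type_ids": np.zeros(seq_length, dtype=int),       # all 0 for a single segment
    }

# Made-up ids standing in for a six-token sentence
packed = pack_inputs([2010, 2606, 12690, 2003, 2019, 7619])
print(packed["input_word_ids"][:10])
print(packed["input_mask"][:10])
```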

# Building the model

- The first layer of the model is the `Input` layer, whose purpose is to produce a Keras tensor object based on the `shape` and `dtype`, as well as other parameters. The tensor contains enough information to build the model and connect each layer automatically to the next.
- The preprocessing layer follows the `Input` layer and is the first layer a string goes through. This layer remains frozen, as its purpose is to convert a string into the token ids and masks that the encoder expects.
- The pre-trained encoder comes next, and here we need to set `trainable` to `True` since the encoder weights will be optimised based on the `disaster_ds` data set.
- The encoder, by design, exposes the outputs of every transformer layer to allow more flexibility in reusing BERT as part of other architectures. For token-level tasks such as language modelling, the full sequence of output vectors from the last transformer layer is used. For classification tasks (like the one at hand) only one vector per input is normally used: the pooled output, which is derived from the first of the 128 output vectors (the `[CLS]` position).
- The final layer consists of a single neuron with linear activation (the default), so by itself it is equivalent to linear regression on the pooled output. Its output is simply a logit: $\sum_{i=1}^{512} w_i x_i + b$
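
The computation of the final layer can be written out directly: a weighted sum of the 512 pooled values plus a bias, followed (outside the model) by a sigmoid when a probability is wanted. The sketch below uses a random vector as a stand-in for the pooled output and all-zero weights, chosen only so the result is easy to verify by hand; none of these values come from the trained model.

```python
import numpy as np

def classifier_logit(x, w, b):
    """Single-neuron linear layer: weighted sum of the pooled vector plus a bias."""
    return float(np.dot(w, x) + b)

def sigmoid(z):
    """Map a logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=512)   # stand-in for the 512-dimensional pooled output
w = np.zeros(512)          # all-zero weights, so the logit equals the bias
b = 0.0

logit = classifier_logit(x, w, b)
probability = sigmoid(logit)
print(logit, probability)
```

With zero weights and zero bias the logit is 0 and the probability is 0.5, which is the "undecided" point of the classifier.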

In [None]:
def build_bert_classifier():
  tweet = tf.keras.layers.Input(shape=(), dtype=tf.string, name='tweets')
  preprocessing_layer = hub.KerasLayer(preprocessor_url, name='preprocessing')
  encoder_inputs = preprocessing_layer(tweet)
  # Make sure the encoder is trainable so that its weights are fine-tuned
  encoder = hub.KerasLayer(encoder_url, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(tweet, net)

Test that the model works properly.

In [None]:
#Test if the model works on sample text
bert_tweet_classifier = build_bert_classifier()
no_training_result = bert_tweet_classifier(tf.constant(["No one expected a fire of this scale"]))
print(tf.sigmoid(no_training_result))

Plot the model to make sure that everything is in place.

In [None]:
tf.keras.utils.plot_model(bert_tweet_classifier,show_shapes=True,rankdir='LR')

For the loss function we use binary cross-entropy, and set `from_logits` to `True` because the output layer has no sigmoid activation and therefore produces logits rather than probabilities.

In [None]:
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
# The model outputs logits, so the decision boundary is 0 rather than the default 0.5
metrics = tf.metrics.BinaryAccuracy(threshold=0.0)
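
To see what computing the loss from logits means, the sketch below compares a numerically stable form of binary cross-entropy, which works on the logit directly, with the textbook form applied to a probability. It is a plain-Python illustration for a single sample; that Keras uses exactly this formulation internally is an assumption, and the function names are hypothetical.

```python
import math

def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bce_from_probability(p, label):
    """Textbook form, applied to a probability strictly between 0 and 1."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

logit, label = 2.0, 1
p = 1.0 / (1.0 + math.exp(-logit))   # sigmoid of the logit
print(bce_from_logit(logit, label), bce_from_probability(p, label))

# The logit form stays finite even where the sigmoid saturates to exactly 1.0
print(bce_from_logit(1000.0, 0))
```

The two forms agree for moderate logits, while only the logit form remains usable for extreme ones, where the textbook form would need the logarithm of zero.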

In [None]:
#Training run settings
epochs = 7
#Get the cardinality of the dataset (Simply put, cardinality is the total number of batches in a dataset)
steps_per_epoch = tf.data.experimental.cardinality(disaster_ds).numpy()
num_train_steps = steps_per_epoch * epochs
#Do a warm-up for number of steps = 10% of the total number of steps
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
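
The warm-up settings above shape the learning rate over the course of training. The sketch below shows a linear warm-up followed by a linear decay to zero, which is my understanding of the default schedule built by `create_optimizer`; the exact behaviour of the library may differ between versions. The step counts are hypothetical (100 batches per epoch) rather than taken from the real dataset, and the function name is an assumption.

```python
def warmup_linear_decay(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at a given step: linear warm-up, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / num_train_steps

# Hypothetical run: 100 batches per epoch, 7 epochs, 10% warm-up
steps_per_epoch, epochs, init_lr = 100, 7, 3e-5
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_train_steps)

for step in (0, 35, 70, 350, 700):
    lr = warmup_linear_decay(step, init_lr, num_train_steps, num_warmup_steps)
    print(step, lr)
```

Starting from a small learning rate protects the pre-trained weights from large, noisy updates while the freshly initialised classifier layer is still far from a useful solution.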

In [None]:
#Compile the model
bert_tweet_classifier.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

In [None]:
#Get the model summary
bert_tweet_classifier.summary()

In [None]:
#Train the model
print(f'Training model with {encoder_url}')
history = bert_tweet_classifier.fit(x=disaster_ds,
                               epochs=epochs)

In [None]:
# Predict on the test data by passing the tweet text directly to our model
preds = bert_tweet_classifier(test_data['text'])

preds

In [None]:
#It is always a good idea to plot a histogram to check if anything is off
plt.rcParams["figure.figsize"] = [15, 5.50]
plt.rcParams["figure.autolayout"] = True
plt.hist(preds.numpy())

In [None]:
# Convert the logits to probabilities in the range (0, 1) with the sigmoid
sigpreds= tf.sigmoid(preds)
sigpreds

In [None]:
#Plot a histogram to check if anything is off

plt.hist(sigpreds.numpy())

In [None]:
# Recast to 0 and 1 by thresholding at 0.5.
# (Flooring 2*p would yield 2 whenever the sigmoid saturates to exactly 1.0.)
sigpreds = (np.asarray(sigpreds) >= 0.5).astype(int)
sigpreds
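
Because the sigmoid is monotonic and equals 0.5 at zero, labelling by probability and labelling by logit give the same result. The short check below, on made-up logits, illustrates this; it is a sketch and uses none of the notebook's predictions.

```python
import numpy as np

def sigmoid(z):
    """Map logits to probabilities between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-3.2, -0.1, 0.0, 0.4, 5.0])   # made-up logits

labels_from_probabilities = (sigmoid(logits) >= 0.5).astype(int)
labels_from_logits = (logits >= 0.0).astype(int)
print(labels_from_probabilities, labels_from_logits)
```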

In [None]:
#Plot a histogram to check if anything is off
plt.hist(sigpreds)

In [None]:
sigpreds=sigpreds.reshape(-1) #Flatten array
sigpreds

In [None]:
#Plot a histogram to check if anything is off
plt.hist(sigpreds)

In [None]:
#Save predictions
dfsub = pd.DataFrame({'id' : test_data['id'].to_list(),'target': sigpreds})
dfsub.to_csv('submission.csv', index=False)
dfsub = pd.read_csv('submission.csv')
dfsub
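
Before uploading, it can be worth checking that the file has the expected shape: an `id` and a `target` column, the ids in the same order as the test set, and only 0 or 1 as targets. The sketch below performs these checks on CSV text using the standard library; the helper name `validate_submission` and the sample strings are hypothetical.

```python
import csv
import io

def validate_submission(csv_text, expected_ids):
    """Check the header, the id order and the 0/1 targets of a submission in CSV form."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows or set(rows[0].keys()) != {"id", "target"}:
        return False
    ids_match = [int(r["id"]) for r in rows] == list(expected_ids)
    targets_valid = all(r["target"] in {"0", "1"} for r in rows)
    return ids_match and targets_valid

# Made-up submissions: the second one contains an invalid target value
good = "id,target\n0,1\n2,0\n3,1\n"
bad = "id,target\n0,1\n2,2\n3,1\n"
print(validate_submission(good, [0, 2, 3]), validate_submission(bad, [0, 2, 3]))
```

For the real file, the text could be read with `open('submission.csv').read()` and the expected ids taken from `test_data['id']`.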
