## In this competition, a list of tweets must be classified as talking about a disaster or not

### The dataset is available publicly at https://www.kaggle.com/c/nlp-getting-started/data
#### The model implementing BERT obtained an F1 score of 0.834 on the test dataset.
#### An alternative model with whole-sentence tokenization (https://tfhub.dev/google/Wiki-words-250-with-normalization/2) obtained a significantly lower F1 score (0.789) and was discarded.

##### References:

###### - https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/2: BERT model with instructions to instantiate the tokenizer and the BERT Keras layer.
###### - https://github.com/tensorflow/models/blob/master/official/nlp/bert/tokenization.py: The tokenization module which, as explained in the previous link, needs to be imported.
###### - https://arxiv.org/abs/1810.04805: Original BERT paper, which explains the uses of BERT.

##### Regarding the three inputs to the BERT layer:
##### - Input_Word_Ids: The IDs of the tweet's tokens in BERT's pretrained vocabulary. The token [CLS] is added to the start of the sentence and the token [SEP] is appended at the end (in other applications, the model is given pairs of sequences, which are separated by [SEP]).
##### - Segment_Ids: BERT can work with pairs of phrases. In that case, 0s are assigned to the tokens of the first phrase and 1s to the tokens of the second phrase. When handling individual phrases, the same number is assigned to all tokens.
##### - Input_Mask: Sources disagree about this input. Some authors claim that it is a positional embedding to be learnt by the model (the meaning of certain words changes depending on their position in the phrase), whereas others claim that it is an attention mask with 1s for sentence tokens and 0s for padding tokens (given that all sentences are padded to the maximum sequence length).
###### Both candidates for the Input Mask were tried and the latter obtained better results. It seems that position is handled implicitly by the model and what BERT really needs to know is where the padding starts. According to https://github.com/google-research/bert/blob/master/run_classifier.py, "The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to."
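
The three inputs described above can be illustrated with a small, self-contained sketch. The vocabulary, its IDs and the helper name `build_bert_inputs` are made up for illustration; they are not BERT's real vocabulary or part of any library.

```python
# Toy vocabulary: these IDs are invented for illustration and are NOT BERT's real IDs.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
             "forest": 3, "fire": 4, "near": 5, "town": 6}

def build_bert_inputs(tokens, vocab, max_seq_length):
    """Return (input_word_ids, input_mask, segment_ids) for a single sentence."""
    # Truncate so that [CLS] and [SEP] still fit, then add the special tokens.
    tokens = ["[CLS]"] + tokens[:max_seq_length - 2] + ["[SEP]"]
    padding_length = max_seq_length - len(tokens)
    # Token IDs, padded with the [PAD] ID up to the fixed length.
    input_word_ids = [vocab[t] for t in tokens] + [vocab["[PAD]"]] * padding_length
    # Attention mask: 1 for real tokens, 0 for padding.
    input_mask = [1] * len(tokens) + [0] * padding_length
    # Single sentence: every position belongs to segment 0.
    segment_ids = [0] * max_seq_length
    return input_word_ids, input_mask, segment_ids

word_ids, mask, segments = build_bert_inputs(
    ["forest", "fire", "near", "town"], toy_vocab, max_seq_length=8)
print(word_ids)  # [1, 3, 4, 5, 6, 2, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The mask marks exactly the six real tokens ([CLS], four words, [SEP]); the two trailing positions are padding.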

In [1]:
# The libraries and the Tokenization class are imported:

!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
import tokenization
import numpy as np
import pandas as pd
import os
from tensorflow import keras
import tensorflow as tf
import tensorflow_hub as hub

In [2]:
# Both train and test files are read into Pandas DataFrames. Only the text column is kept, given that the tweet LOCATION
# does not seem relevant and BERT does not need a KEYWORD for each tweet either:

train_data = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_data = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

Id_test = test_data["id"] # This feature is written out to the output file.
Y = train_data["target"]

train_data = train_data["text"]
test_data = test_data["text"]

### It is really important to randomise the tweets' order before splitting them into training and validation sets, given that they might be stored in a non-random order (alphabetical, tweets with a location first, or any other systematic ordering).

In [3]:
cuttingNumber = 6613
random_list = np.random.permutation(train_data.shape[0])

train_X = train_data[random_list[:cuttingNumber]]
val_X = train_data[random_list[cuttingNumber:]]
test_X = test_data

train_Y = Y[random_list[:cuttingNumber]]
val_Y = Y[random_list[cuttingNumber:]]

print (train_X.shape[0], "tweets have been allocated to the training dataset and", val_X.shape[0],"tweets have been assigned to the validation dataset")

6613 tweets have been allocated to the training dataset and 1000 tweets have been assigned to the validation dataset
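
The permutation above is drawn without a fixed seed, so each run produces a different split and validation scores are not directly comparable between runs. A minimal sketch of a seeded split follows, written with the standard library only; the helper name `shuffled_split` and the seed value 42 are assumptions made for illustration.

```python
import random

def shuffled_split(n_samples, n_validation, seed):
    """Return (train_positions, validation_positions) from a seeded shuffle."""
    positions = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(positions)
    cut = n_samples - n_validation
    return positions[:cut], positions[cut:]

# 7613 tweets in total, 1000 of them held out for validation, as in the cell above.
train_idx, val_idx = shuffled_split(n_samples=7613, n_validation=1000, seed=42)
print(len(train_idx), len(val_idx))  # 6613 1000
```

The resulting positions could then be used with `train_data.iloc[train_idx]` and `Y.iloc[train_idx]`, which select by position regardless of the Series index.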


In [4]:
# First the BERT layer is imported and the tokenizer is instantiated according to "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/2"
max_seq_length = 128
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/2",
                            trainable=True)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
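
After basic splitting on whitespace and punctuation, the tokenizer instantiated above breaks each word into subword pieces with a greedy longest-match-first search over the vocabulary. The sketch below illustrates that idea only; the function name `wordpiece` and the three-entry vocabulary are assumptions for illustration, and the real `FullTokenizer` handles further details (lowercasing, punctuation, a maximum word length).

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink it until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"un", "##aff", "##able"}  # toy vocabulary, not BERT's real one
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```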

In [5]:
# The tweets are converted into BERT's input.
# See the third cell of the notebook for the explanation of BERT's three inputs.
# The "bert_encode" function at https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub has been adapted:

def fromTweetsToBertInput(tweets, tokenizer, max_seq_length):
    
    input_word_ids = []
    input_masks = []
    segment_ids = []
    
    for sentence in tweets:
        
        # The tokenizer splits each tweet into BERT tokens (words and subwords).
        # If the tweet is too long, it is truncated.
        # The tokens [CLS] and [SEP] are added at the start and at the end, as explained before.
        
        sentence = ["[CLS]"] + tokenizer.tokenize(sentence)[:max_seq_length-2] +["[SEP]"]
        
        # All the inputs must be of the same size. Therefore, short sentences need to be padded. Also,
        # the tokens are converted into IDs (integer indices) according to BERT's pretrained vocabulary:
        
        paddingLength = max_seq_length - len(sentence)
        
        input_word_id = tokenizer.convert_tokens_to_ids(sentence) + [0]*paddingLength
        
        # According to Google Research: "The mask has 1 for real tokens and 0 for padding tokens. Only real
        #tokens are attended to.":
        
        input_mask = [1] * len(sentence) + [0] * paddingLength
        
        # Given that our model will handle single phrases, the same number (0s) is assigned to all tokens:
        
        segment_id = [0] * max_seq_length
        
        input_word_ids.append(input_word_id)
        input_masks.append(input_mask)
        segment_ids.append(segment_id)
    
    input_word_ids = np.asarray(input_word_ids)
    input_masks    = np.asarray(input_masks)
    segment_ids    = np.asarray(segment_ids)
    return input_word_ids, input_masks, segment_ids

In [6]:
trainBertInput = fromTweetsToBertInput(train_X,tokenizer,max_seq_length)
valBertInput = fromTweetsToBertInput(val_X,tokenizer,max_seq_length)
testBertInput = fromTweetsToBertInput(test_X,tokenizer,max_seq_length)

# The training and validation sets are only joined together for the final Kaggle score on the test set, once the
# architecture of the model has been chosen using the validation score of models trained only on the training set:

trainPlusValBertInput = fromTweetsToBertInput(train_data,tokenizer,max_seq_length) 

In [7]:
# The Bert Keras Model is defined (see the explanation of the three inputs in the third cell of this notebook):

input_word_ids = keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")

# Bert Layer has previously been imported:

pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

# Here we can choose the output we want to use from the BERT layer. The "pooled output" contains information about the
# whole sentence (the outputs of all tokens are summarised together). The "sequence output" contains the individual
# token vectors, which can be processed with LSTMs, GRUs and other types of RNNs. For the purpose of this notebook,
# the BERT model already has more than 300 million parameters (see the model summary) and the Kaggle kernel with GPU
# is already working near its full capacity, so the "pooled output" was chosen.

# NOTE that if we increase the batch_size beyond 8, the notebook will crash because of the size of the model!

# A single neuron with a sigmoid activation function is enough to get decent scores without noticeably increasing
# the number of parameters. If we train the model for more than one epoch, it will overfit the training set:

disasterOrNot = keras.layers.Dense(units = 1, activation = "sigmoid" ) (pooled_output)

model = keras.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=disasterOrNot)
model.summary()
model.compile(loss="binary_crossentropy", optimizer=keras.optimizers.Adam(learning_rate=1e-5), metrics=["binary_accuracy"])
model.fit(trainPlusValBertInput, Y, epochs = 1, batch_size = 8)

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 128)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

<tensorflow.python.keras.callbacks.History at 0x7ef9143704d0>
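
The classification head defined above is a single sigmoid neuron on top of the pooled output. As a rough sketch of what that neuron computes, the example below applies a weighted sum and a sigmoid to a short vector. The three-value vector, the weights and the bias are invented for illustration; in the model the pooled output has 1024 values and the weights are learnt during training.

```python
import math

def sigmoid_head(pooled_vector, weights, bias):
    """Probability produced by one sigmoid neuron: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, pooled_vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, only to show the computation.
pooled = [0.2, -0.4, 0.1]
weights = [1.0, 0.5, -2.0]
bias = 0.3

probability = sigmoid_head(pooled, weights, bias)
label = int(probability > 0.5)  # same 0.5 threshold as used for the predictions
print(round(probability, 3), label)
```

Because the weighted sum here is slightly positive, the probability lands just above 0.5 and the tweet would be labelled as a disaster.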

In [8]:
# The F1 score is used in the Kaggle competition. It takes into account the precision and the recall:

def f1Score (Y_true, Y_pred):
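    # Convert to NumPy arrays so that pandas index alignment cannot affect the comparisons:
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    TP = np.sum((Y_true == 1) & (Y_pred == 1))  # true positives
    FP = np.sum((Y_true == 0) & (Y_pred == 1))  # false positives
    FN = np.sum((Y_true == 1) & (Y_pred == 0))  # false negatives
    PC = TP/(TP+FP)  # precision
    RC = TP/(TP+FN)  # recall
    F1 = (2*PC*RC)/(PC+RC)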
    return F1
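
To make the metric concrete, here is a small worked example in plain Python on six made-up labels. The helper name `f1_from_labels` and the label lists are illustrative only and are not part of the notebook's pipeline.

```python
def f1_from_labels(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels given as lists of 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = f1_from_labels(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Like the function above, this sketch divides by zero when there are no predicted positives or no actual positives, so a guard would be needed for such edge cases.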

In [9]:
# This cell prints the F1 score on the validation set but it has been commented out given that for the final
#submission, the model has been trained on both training and validation dataset.

"""
valPredictions = np.squeeze(model.predict(valBertInput), axis=1)
valPredictions = (valPredictions>0.5).astype(int)
print (f1Score(val_Y, valPredictions))
"""

In [10]:
predictions = np.squeeze(model.predict(testBertInput), axis=1)
predictions = (predictions>0.5).astype(int)
output = pd.DataFrame({'id': Id_test, 'target': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
