# Introduction

In this notebook I try my first application of the BERT concept for NLP. As I understood I first have to clean up the data and then get it into a form that can be processed by BERT, so that the input sequences can be processed by a BERT layer that I have loaded from the tensorflow hub. I try to learn from the notebook https://www.kaggle.com/hassanamin/bert-nlp-real-or-not

In this notebook we will prepare submission based on an ensemble of the predictions of the following four models:

1. BERT pretrained word embeddings fed to a Dense Layer
2. pretrained smaller dimensional word embeddings by Universal Sentence Encoder fed to SVM
3. pretrained higher dimensional word embeddings by  Universal Sentence Encoder fed to a MultiLayerPerceptron
4. pretrained higher dimensional word embeddings by  Universal Sentence Encoder fed to SVM

# 2. Load Libraries and Data

In [3]:
import random
random.seed(42)

import numpy as np
np.random.seed(42)

import tensorflow as tf
tf.random.set_seed(42)
import cv2
import os

import numpy as np 
import pandas as pd

import tensorflow as tf 
import tensorflow_hub as hub 

import keras
from tensorflow.keras.layers import Dense, Input,LeakyReLU, Dropout, Softmax
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model

import bert

import matplotlib.pyplot as plt 

import re

In [4]:
# load the competition data
train_df = pd.read_csv("./train.csv")
test_df = pd.read_csv("./test.csv")

train_df = train_df.astype({"id" : int, "target" : int, "text" : str})
test_df = test_df.astype({"id" : int, "text" : str})
train_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


# 3. Data Cleaning

To clean the freetext in Text column, we will be removing the contents:
- Emojis
- symbols & pictographs
- hashtags
- line breaks, leading, trailing, and extra spaces

Also extracted hashtags, usernames and weblinks from text feature.

In [5]:
# helpful function for cleaning the text with regular experessions

def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

def clean_text(text):
    text = re.sub(r'https?://\S+', '', text) # remove https? links
    text = re.sub(r'#', '', text) # remove hashtags by keeping the hashtag text
    text = re.sub(r'@\w+', '', text) # remove @usernames
    text = re.sub(r'\n',' ', text) # remove line breaks
    text = re.sub('\s+', ' ', text).strip() # remove leading, trailing, and extra spaces
    return text

# helpful function for extract hashtags, usernames and weblinks from tweets
def find_hashtags(tweet):
    return " ".join([match.group(0)[1:] for match in re.finditer(r"#\w+", tweet)]) or 'no'

def find_usernames(tweet):
    return " ".join([match.group(0)[1:] for match in re.finditer(r"@\w+", tweet)]) or 'no'

def find_links(tweet):
    return " ".join([match.group(0)[:] for match in re.finditer(r"https?://\S+", tweet)]) or 'no'

# function for pereprocessing the whole text
def preprocess_text(df):
    df['clean_text'] = df['text'].apply(lambda x: clean_text(x)) # cleaning the text
    df['hashtags'] = df['text'].apply(lambda x: find_hashtags(x)) # extracting the hashtags
    df['usernames'] = df['text'].apply(lambda x: find_usernames(x)) # extracting the @username(s)
    df['links'] = df['text'].apply(lambda x: find_links(x)) # extracting http(s)-links
    return df 
    
# preprocessing the 'text'-column in df and extending with additional columns 
# 'clean_text', 'hashtags', 'usernames' and 'links'
train_df = preprocess_text(train_df)
test_df = preprocess_text(test_df)

train_df.fillna(' ')
test_df.fillna(' ')
train_df['text_final'] = train_df['clean_text']+' '+ train_df['keyword']
test_df['text_final'] = test_df['clean_text']+' '+ test_df['keyword']

train_df['lowered_text'] = train_df['text_final'].str.lower()
test_df['lowered_text'] = test_df['text_final'].str.lower()

# 4. Build and Train Models
## 4.1. BERT Model

In the field of computer vision, researchers have repeatedly shown the value of **transfer learning** — pre-training a neural network model on a known task, for instance ImageNet, and then performing fine-tuning i.e. using the trained neural network as the basis of a new purpose-specific model. In recent years, researchers have been showing that a similar technique can be useful in many natural language tasks. We will show how a pre-trained neural network produces word embeddings which are then used as features in NLP models.

Now, we are trying a different model called BERT. BERT, which stands for **Bidirectional Encoder Representations from Transformers** and one of it's applications is text classification. BERT is a text representation technique like Word Embeddings. Like word embeddings, BERT is also a text representation technique which is a fusion of variety of state-of-the-art deep learning algorithms, such as bidirectional encoder LSTM and Transformers. BERT was developed by researchers **at Google AI Language in 2018** and has been proven to be state-of-the-art for a variety of natural language processing tasks such text classification, text summarization, text generation, etc. 
One of the mechanisms of the model is an **Transformer Encoder** that reads the text input. The input to Transformer Encoder is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.
BERT can be used for a wide variety of language tasks, while only adding a small layer to the core model.
Classification task is done by adding a "classification layer" on top of the Transformer output for the token.

We are going to download the model using a url, where we can find all the prebuilt and pretrained models developed in TensorFlow. We will use the official tokenization script created by the Google team that is upload on github.  

As a part of text cleaning we will be removing links and non-ASCII characters, emoji, punctuations and also convert abbreviations such as ppl, omg, fyi, etc.

Following is the logic of the code in the next few cells-
1. Load BERT model from the Tensorflow Hub(tfhub)
2. Load tokenizer from the bert layer
3. Encode the text into tokens, masks, and segment flags
4. Modify the output layer of the pre-trained BERT model as follows and train-<br>
  **input-text  ===> Encoding for bert ==> BERT  ===> Classifier(FeedForward-Network with 'softmax'-output-layer)** <br>Below is the pictorial representation of the architecture.
  <img src="images/bert.png" width="900" height="200">

In [15]:
# encoding text for bert in an bert-compatible format like: [CLS]..text..[SEP][PAD][PAD] etc.
#  cls_token='[CLS]', sep_token='[SEP]', pad_token='[PAD]'=[0], mask_token='[MASK]',
# pass the text, the tokenizer from BERT an a max_len of the sequences 

def bert_encode(texts, tokenizer, max_len):  # length of encoded sequences 
    # prepare empty np-arrays for the token, mask and segments
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:  # for every text-sequence
        text = tokenizer.tokenize(text)# transform text-sequence into token-sequence
          
        text = text[:max_len-2]# cut the token-sequence at the end
        
        input_sequence = ["[CLS]"] + text + ["[SEP]"] # insert [CLS]-token at the beginning of sequence and a [SEP]-token at the end
        
        pad_len = max_len - len(input_sequence) # determine the length of the [PAD]-sequences to add on short input-sequences
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence) # transforms token to token-id
        tokens += [0] * pad_len # concatenate the missing space as [0]-PAD-token
       
        pad_masks = [1] * len(input_sequence) + [0] * pad_len # pad_mask of the form 11111...00000 with 111 for input, 000 for rest
        segment_ids = [0] * max_len # segment_id of the form 00000...000
        
        all_tokens.append(tokens) # concatenate the token-sequences
        all_masks.append(pad_masks) # concatenate the padding-masks
        all_segments.append(segment_ids) # concatenate the segment-ids
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments) # return all


# define a model by pass a bert-layer and a finite sequence-lenght as parameters
# to the function
def build_model(bert_layer, max_len): # etc. max_len=512, bert encoder works with sequences of finite lenght
    
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :] # for sentence classification, we’re only  interested 
    #in BERT’s output for the [CLS] token, 
    
    hidden1 = Dense(128, activation='relu')(clf_output) #128
    out = Dense(1, activation='sigmoid')(hidden1) 
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(lr=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [16]:
BertTokenizer = bert.bert_tokenization.FullTokenizer

# load a pretrained, trainable bert-layer as Keras.layer from the tensorflow-Hub
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/1",trainable=True)

vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

# encoding the input- and test-features for BERT
train_input = bert_encode(train_df.lowered_text.values.astype(str), tokenizer, max_len=512) # final input-data
test_input = bert_encode(test_df.lowered_text.values.astype(str), tokenizer, max_len=512)# final test-data
train_labels = train_df.target.values # final target-data

model = build_model(bert_layer, max_len=512)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 512)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 512)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 512)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 108310273   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

In [17]:
history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=5, #5
    batch_size=16
)

predictions1 = model.predict(test_input)
print(predictions1[0:30])

Train on 6090 samples, validate on 1523 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
[[0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.6092735 ]
 [0.99568903]
 [0.07349035]
 [0.31909436]
 [0.01632243]
 [0.00715059]
 [0.01467803]
 [0.11484089]
 [0.03340209]
 [0.9929389 ]
 [0.0432989 ]
 [0.64377373]
 [0.01543269]
 [0.14688635]
 [0.08371693]
 [0.9987788 ]]


## 4.2 SVC with embeddings from Universal-Sentence-Encoder

While embedding a sentence, along with words the context of the whole sentence needs to be captured in that vector. This is where the “Universal Sentence Encoder” comes into the picture.
If you recall the GloVe word embeddings vectors which turns a word to 50-dimensional vector, the Universal Sentence Encoder is much more powerful, and it is able to embed not only words but phrases and sentences. 
The **Universal Sentence Encoder (USE)** developed by researchers at **Google AI** encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The pre-trained Universal Sentence Encoder is publicly available in **Tensorflow-hub**. It comes with two variations i.e. one trained with Transformer encoder and other trained with Deep Averaging Network (DAN). 

Following is the logic of the code in the next few cells-
1. Load USE model from the Tensorflow Hub(tfhub)
2. Create the embeddings for Train and Test
4. Feed the embeddings to Support Vector Classifier(SVC) model and train-<br>
  **input-text  ===> Embeddings from USE ==> SupportVectorClassifier** <br>Below is the pictorial representation of the architecture.
  <img src="images/svm.png" width="900" height="200">

In [None]:
# try another model by using the google universal sentence encoder from https://tfhub.dev/google/universal-sentence-encoder/1 
embedding = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

embedded_xtrain = embedding(train_df['clean_text']).numpy()
embedded_xtest = embedding(test_df['clean_text']).numpy()
target = train_df["target"].to_numpy()

# prepare a support vector maschine with radial basis funtion kernels
from sklearn import svm
model2 = svm.SVR(kernel='rbf',gamma='auto')
model2.fit(emb edded_xtrain,target)

predictions2 = model2.predict(embedded_xtest)
predictions2 = np.mat(predictions2)
predictions2 = predictions2.T

## 4.3 MultiLayerPerceptron with large Large dimensional embeddings from Universal-Sentence-Encoder

Following is the logic of the 3rd model-
1. Load USE model from the Tensorflow Hub(tfhub)
2. Create the higher dimensional embeddings for Train and Test
4. Feed the embeddings to a MultiLayerPercepton and train-<br>
  <br>Below is the pictorial representation of the architecture.
  <img src="images/mlp.png" width="900" height="200">

In [None]:
# try another model by using the google universal sentence encoder https://tfhub.dev/google/universal-sentence-encoder-lite/2
sequence_lenght = 512
USElite2_embedding = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5") #"https://tfhub.dev/google/universal-sentence-encoder-lite/2")

USElite2_embedded_xtrain = USElite2_embedding(train_df['clean_text']).numpy()
USElite2_embedded_xtest = USElite2_embedding(test_df['clean_text']).numpy()
USElite2_target = train_df["target"].to_numpy() # no embedding
USElite2_embedded_xtest.shape

USE_for_m4_xtrain = USElite2_embedded_xtrain
USE_for_m4_xtest = USElite2_embedded_xtest
USE_for_m4_target = USElite2_target

from sklearn.model_selection import train_test_split
USElite2_x_train, USElite2_x_test, USElite2_y_train, USElite2_y_test = train_test_split(
    USElite2_embedded_xtrain,
    USElite2_target,
    test_size=0.1,
    random_state=0,
    shuffle=True
)

In [22]:
def make_my_model():
    input = keras.layers.Input(shape=(sequence_lenght,1), dtype='float32')
    
    #Conv1D-layer expected shape (Batchsize,Width,Channels)
   
    next_layer = keras.layers.Conv1D(265,kernel_size = 10, activation = "relu",padding="valid",strides = 1)(input)
    next_layer = keras.layers.MaxPooling1D(pool_size=2)(next_layer)
    
    next_layer = keras.layers.Conv1D(64,kernel_size = 5, padding="valid", strides = 1)(next_layer)
    next_layer = keras.layers.LeakyReLU(alpha=0.1)(next_layer)
    next_layer = keras.layers.MaxPooling1D(pool_size=3, strides=1)(next_layer)
    
    next_layer = keras.layers.Flatten()(next_layer)
    
    next_layer = keras.layers.Dense(64)(next_layer)
    next_layer = keras.layers.LeakyReLU(alpha=0.1)(next_layer)
    
    #next_layer = keras.layers.Dropout(0.2)(next_layer)
    
    #next_layer = keras.layers.LeakyReLU(alpha=0.1)(next_layer)
    
    output = keras.layers.Dense(1, activation="sigmoid")(next_layer)
      
    return keras.Model(inputs=input, outputs=output)

In [23]:
# Reshaping the inputs. The conv1d-Layer needs (batchsize x lenght x dim=1)
# shape[0]=batchsize=6090, shape[1]=length=512, dim=1
USElite2_x_train = np.reshape(USElite2_x_train, (USElite2_x_train.shape[0], USElite2_x_train.shape[1],1))
USElite2_y_train = np.reshape(USElite2_y_train, (USElite2_y_train.shape[0],1))
# shape[0]=batchsize=1523, shape[1]=length=512, dim=1
USElite2_x_test = np.reshape(USElite2_x_test, (USElite2_x_test.shape[0], USElite2_x_test.shape[1],1))
USElite2_y_test = np.reshape(USElite2_y_test, (USElite2_y_test.shape[0],1))

model3 = make_my_model()
model3.compile("adam", loss = "binary_crossentropy", metrics = ["acc"])
model3.summary()

model3.fit(
    USElite2_x_train,
    USElite2_y_train,
    batch_size = 128,
    epochs = 15,
    validation_data = (USElite2_x_test,USElite2_y_test)
)

USElite2_embedded_xtest = np.reshape(USElite2_embedded_xtest, (USElite2_embedded_xtest.shape[0],USElite2_embedded_xtest.shape[1],1))
predictions3 = model3.predict(USElite2_embedded_xtest)
predictions3
print(predictions3[0:30])

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 512, 1)            0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 503, 265)          2915      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 251, 265)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 247, 64)           84864     
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 247, 64)           0         
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 245, 64)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 15680)             0   

## 4.4 Support Vector Regression with Large Dimensional Embeddings from Universal Sentence Encoder

Following is the logic of the 4th model-
1. Load USE model from the Tensorflow Hub(tfhub)
2. Create the higher dimensional embeddings for Train and Test
4. Feed the embeddings to a MultiLayerPercepton and train-<br>
  <br>Below is the pictorial representation of the architecture.
  
  <img src="images/svm.png" width="900" height="200">

In [None]:
# prepare a support vector maschine with radial basis function kernels
from sklearn import svm

model4 = svm.SVR(kernel='rbf', gamma='auto')
model4.fit(USE_for_m4_xtrain ,USE_for_m4_target)

predictions4 = model4.predict(USE_for_m4_xtest)
predictions4 = np.mat(predictions4)
predictions4 = predictions4.T

## Weighted Voting of Predictions 

Below is the weights for predictions from the above 4 models:

<br>
(0.5 * Predictions from BERT model with Embeddings from BERT) + <br>
(0.5 * Predictions from SVM with Lower Dimensional Embeddings from Universal Sentence Encoder) +  <br>
(0.1 * Predictions from Perceptron with Large Dimensional Embeddings from Universal Sentence Encoder) +  <br>
(0.3 * Predictions from SVM with Higher Dimensional Embeddings from Universal Sentence Encoder)  

In [None]:
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

submission['target'] =((0.5*predictions1+0.5*predictions2+0.1*predictions3+0.3*predictions4)*0.8).round().astype(int)

submission.to_csv('submission_bert_svm_conv.csv', index=False)