# Bidirectional Encoder Representations from Transformers (BERT)
## Introduction
Bidirectional Encoder Representations from Transformers (BERT) is a pre-training technique for Natural Language Processing (NLP) developed at Google and published in 2018 by Jacob Devlin and his colleagues. BERT’s key technical innovation is applying **bidirectional training of the Transformer**, a popular attention model, to language modelling. As opposed to directional models, which **read the text input sequentially (left-to-right or right-to-left)**, the Transformer encoder **reads the entire sequence of words at once**. This allows the model to learn the context of a word from all of its surroundings (both left and right of the word). BERT is described in full in the paper https://arxiv.org/pdf/1810.04805.pdf.
## Model Architecture

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of the Transformer are described in the paper https://arxiv.org/pdf/1706.03762.pdf. The following picture illustrates BERT's architecture:

<img src="pic1.png">

## Attention Layer
### The central idea behind Attention
The central idea behind Attention is to keep the intermediate encoder states rather than throw them away, and to use all of them to construct the context vectors that the decoder needs to generate the output sequence.
### How does Attention work?
For illustrative purposes, consider the following example:
Input (English) Sentence: “Rahul is a good boy”
Target (Marathi) Sentence: “राहुल चांगला मुलगा आहे”

Let’s say we now want our decoder to predict the first word of the target sequence, i.e. “राहुल”.
At time step 1, we can break the entire process into five steps as below:

<img src="pic5.jpg">
Now, in order to generate the next word “चांगला”, the decoder repeats the same procedure. The input internal state is the hidden state output by the previous step. These steps are repeated until the final word is translated.

Note that traditional Seq2Seq models use one fixed context vector for all decoder time steps, whereas with Attention we compute a separate context vector for each time step by recomputing the attention weights every time.
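
As a minimal sketch of the per-step context vector, the NumPy snippet below scores every encoder hidden state against the current decoder state, normalizes the scores with softmax, and takes the weighted sum. The dot-product scoring function, the toy dimensions, the random values, and the names `attention_context`, `encoder_states`, and `decoder_state_t1`/`decoder_state_t2` are assumptions made for illustration; the procedure in the figure may use a different scoring function.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(encoder_states, decoder_state):
    """Return (weights, context) for one decoder time step.

    encoder_states: array of shape (T, d), one hidden state per input word.
    decoder_state:  array of shape (d,), the decoder's current hidden state.
    """
    scores = encoder_states @ decoder_state   # (T,) one score per input word
    weights = softmax(scores)                 # (T,) attention weights, sum to 1
    context = weights @ encoder_states        # (d,) weighted sum of encoder states
    return weights, context

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))   # 5 input words: "Rahul is a good boy"
decoder_state_t1 = rng.normal(size=4)      # hypothetical decoder state at step 1
decoder_state_t2 = rng.normal(size=4)      # hypothetical decoder state at step 2

# A different decoder state yields different weights and a different context.
weights_t1, context_t1 = attention_context(encoder_states, decoder_state_t1)
weights_t2, context_t2 = attention_context(encoder_states, decoder_state_t2)
```

Because the weights are recomputed from the decoder state at every step, `context_t1` and `context_t2` differ even though the encoder states are the same.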


## Bidirectional Training
### Input/Output
The input is a sequence of tokens that can represent a single sentence or a pair of sentences. The first token of every sequence is always a special classification token ([CLS]).

The output is a sequence of vectors of the same length as the input sequence, in which each vector corresponds to the input token with the same index. The output corresponding to the [CLS] token is used as the aggregate sequence representation for classification tasks.
### Pre-Training
#### Masked LM (MLM)
Before word sequences are fed into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence. Predicting the output words requires:

1. Adding a classification layer on top of the encoder output.

2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.

3. Calculating the probability of each word in the vocabulary with softmax.
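
The masking that precedes this prediction could look roughly like the sketch below. It follows the simplified description in this section, in which every selected token is replaced with [MASK]; the original paper additionally swaps some selected tokens for random tokens or leaves them unchanged, and that refinement is omitted here. The helper name `mask_tokens`, the fixed seed, and the example sentence are illustrative assumptions.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace roughly `mask_prob` of the ordinary tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the labels the model would be trained to predict.
    """
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]"}
    # Only ordinary tokens are candidates for masking.
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    if not candidates:
        return list(tokens), {}
    # Mask at least one token so that short sequences still give a training signal.
    n_to_mask = max(1, round(mask_prob * len(candidates)))
    positions = sorted(rng.sample(candidates, n_to_mask))

    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]   # remember the original word
        masked[pos] = "[MASK]"      # hide it from the model
    return masked, labels

tokens = ["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]"]
masked, labels = mask_tokens(tokens)
```

Only the positions stored in `labels` would contribute to the loss; the model still sees every other token unchanged.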

<img src="pic2.png">
The loss function for Masked LM pre-training is the standard cross-entropy loss. Let the embedding layer be a matrix $G \in \mathbb{R}^{n\times d}$, where $n$ and $d$ are the number of words in the dictionary and the embedding dimension, respectively. As shown in the picture above, the output for the masked word $w_4$ is the vector $o_4 \in \mathbb{R}^H$, where $H$ is the dimension of the Transformer encoder's output. If the classification layer is a matrix $A \in \mathbb{R}^{d\times H}$, then the output after the classification layer is $\sigma(Ao_4) \in \mathbb{R}^d$, where $\sigma$ is an activation function. We then multiply this vector by the embedding matrix and apply the softmax function to obtain $\mathrm{softmax}(G\sigma(Ao_4)) \in \mathbb{R}^n$, the predicted probability of each word in the dictionary.
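
The prediction steps and the loss for one masked position can be sketched in NumPy as follows. The toy sizes, the random weights, the choice of `tanh` for $\sigma$, and the index assumed for the true word are all illustrative; `A` is stored with shape $(d, H)$ so that the product $Ao_4$ is defined.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
n, d, H = 10, 4, 6              # vocabulary size, embedding dim, encoder output dim

G = rng.normal(size=(n, d))     # embedding matrix
A = rng.normal(size=(d, H))     # classification layer
o_4 = rng.normal(size=H)        # encoder output at the masked position

hidden = np.tanh(A @ o_4)       # step 1: sigma(A o_4), shape (d,)
logits = G @ hidden             # step 2: project to the vocabulary dimension, shape (n,)
probs = softmax(logits)         # step 3: probability of each word in the vocabulary

true_word_id = 3                # hypothetical index of the original word w_4
loss = -np.log(probs[true_word_id])   # cross-entropy loss for this masked position
```

The loss is small when the model assigns high probability to the original word and grows as that probability approaches zero.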

#### Next Sentence Prediction
In the BERT training process, the model learns to predict if the second sentence in the pair is the subsequent sentence in the original document. 
The input is processed in the following way before entering the model:
1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
2. A sentence embedding indicating Sentence A or Sentence B is added to each token. 
3. A positional embedding is added to each token to indicate its position in the sequence.
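
A minimal sketch of this preprocessing for a sentence pair, assuming whitespace tokenization in place of BERT's WordPiece tokenizer; the helper `build_pair_input` and the example sentences are hypothetical. It produces the token sequence together with the segment and position indices that would be used to look up the corresponding embeddings.

```python
def build_pair_input(sentence_a, sentence_b):
    """Build tokens, segment ids and position ids for a sentence pair."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()

    # Step 1: [CLS] at the start, [SEP] at the end of each sentence.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Step 2: segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Step 3: positions simply count up from 0.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_pair_input(
    "the man went to the store", "he bought a gallon of milk"
)
```

In the model, each of the three index lists selects rows from its own embedding table, and the three embeddings are summed per token.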

<img src="pic3.png">

To predict if the second sentence is indeed connected to the first, the following steps are performed:
1. The entire input sequence goes through the Transformer model.
2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
3. The probability of IsNextSequence is calculated with softmax.

A similar idea is applied when computing the loss for next sentence prediction. The difference is that we use the output of the [CLS] token as the signal to compute the probability that the second sentence is the next sentence.
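
A sketch of the classification head and its cross-entropy loss, under toy assumptions: random weights, a made-up [CLS] output vector, and the convention that index 0 means "is next" (an arbitrary choice made here).

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(1)
H = 6                                  # encoder output dimension (toy value)
cls_output = rng.normal(size=H)        # encoder output for the [CLS] token

W = rng.normal(size=(2, H))            # weights of the classification layer
b = np.zeros(2)                        # biases of the classification layer

logits = W @ cls_output + b            # two scores: [IsNext, NotNext]
probs = softmax(logits)                # probabilities over the two classes

label = 0                              # assume sentence B really follows sentence A
loss = -np.log(probs[label])           # cross-entropy loss for this pair
```
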
### Fine-tuning
Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.

<img src="pic4.png">


In [None]:
# We will use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py 

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
import tokenization

In [3]:
def bert_encode(texts, tokenizer, max_len=50):
    all_tokens = []
    all_masks = []
    all_segments = []
    
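    for text in texts:
        text = tokenizer.tokenize(text)

        # Truncate so that [CLS] and [SEP] still fit within max_len
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)

        # Convert tokens to vocabulary ids and pad with 0 up to max_len
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        # 1 marks a real token, 0 marks padding
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        # Single-sentence input, so every token belongs to segment 0
        segment_ids = [0] * max_len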
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

In [1]:
def build_model(bert_layer, max_len=100):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]  # output of the [CLS] token
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    # `lr` is a deprecated alias; `learning_rate` is the supported argument name
    model.compile(Adam(learning_rate=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [5]:
%%time
# Download BERT layer from the internet (the %%time cell magic must be the first line of the cell)
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

CPU times: user 14.3 s, sys: 1.68 s, total: 16 s
Wall time: 16.4 s


In [6]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submission = pd.read_csv("sample_submission.csv")

In [7]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [8]:
train_input = bert_encode(train.text.values, tokenizer, max_len=50)
test_input = bert_encode(test.text.values, tokenizer, max_len=50)
train_labels = train.target.values

In [9]:
model = build_model(bert_layer, max_len=50)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 50)]         0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 50)]         0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 50)]         0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

In [10]:
train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=3,
    batch_size=16
)

model.save('model.h5')

Train on 6090 samples, validate on 1523 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


## References
1. https://arxiv.org/pdf/1810.04805.pdf
2. https://arxiv.org/pdf/1706.03762.pdf
3. https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
4. https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub
5. https://medium.com/@Petuum/embeddings-a-matrix-of-meaning-4de877c9aa27
6. https://en.wikipedia.org/wiki/BERT_(language_model)
7. https://www.youtube.com/watch?v=iDulhoQ2pro&t=907s
8. https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f