# 02-Model: BERT

**Topic:** Real or Not? NLP with Disaster Tweets
<br>
**Class:** MSCA 31009 Machine Learning 
<br>
**Professor:** Dr. Arnab Bose
<br>
**Link:** https://www.kaggle.com/c/nlp-getting-started/overview


# Why BERT?

**Disadvantages of LSTM**
<br>
1. Slow to train. Words are fed in sequentially and outputs are generated sequentially, so the work cannot be parallelized across the sequence.
<br>
2. Not the best at capturing the true meaning of words in context (even the bidirectional ones).

**Transformer model** - Introduced in the 2017 paper "Attention Is All You Need", the Transformer addresses both problems. It is **faster** because all the words in a sequence can be processed simultaneously, and it captures contextual meaning better because attention lets every word look at every other word. BERT builds on this to be deeply bidirectional.

<br>

Let's have a look at the architecture.

<br>

<img src ="https://cdn.analyticsvidhya.com/wp-content/uploads/2019/06/Screenshot-from-2019-06-17-20-01-32.png">


**Encoder**: Takes all the words in simultaneously and generates their embeddings simultaneously. Embeddings are vectors that encapsulate the meaning of each word.

**Decoder**: Takes in the embeddings from the encoder along with the outputs the decoder has generated so far.

Since the two parts learn different things, they can also be used individually. In English-to-French translation, the **encoder** learns what English is and what context is, while the **decoder** learns how to map English words to French words. **Stacked encoders = BERT**. **Stacked decoders = GPT**.







# Understanding a Transformer

- **Input Embedding** - The input text is first converted into embeddings, i.e. numerical vectors that encapsulate the meaning of each word.
- **Positional Encoding** - Vectors that give context based on the position of a word in the sequence. The original Transformer uses sine and cosine functions for the positional encoding.
- **Encoder Block**
  - **Multi-Head Attention** - Attention answers the question: which part of the input should we focus on? For the i-th word in the sentence, we want to know how relevant every other word in the sentence is to it. This is represented by the i-th attention vector. A single attention head tends to give each word the highest weight on itself, so several heads are computed in parallel and their attention vectors are combined (concatenated and passed through a linear layer) into one vector per word.
  - **Feed-Forward Layer** - A feed-forward network is applied to each of the attention vectors obtained above, which also puts them in the shape accepted by the next block.
- **Decoder Block**
  - **Output Embedding** + **Positional Encoding** - The same as on the input side: the outputs generated so far are converted into embeddings and fed to the decoder block.
  - **Masked Multi-Head Attention** - How much each word in the output is related to the other words in the output.
  - **Multi-Head Attention / Encoder-Decoder Block** - Vectors from the encoder and vectors from the output embeddings are passed into this block. This is where the mapping happens: each resulting vector represents the relation between words in the input and words in the output.
  - **Feed Forward** - Makes the output more digestible for the linear layer.
- **Linear Layer** - A feed-forward layer that converts the decoder output to the expected output length (one score per word in the vocabulary).
- **Softmax** - Turns the scores into a probability for each word.
- **Final Word** - The word with the highest probability.
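The sine/cosine positional encoding mentioned in the list can be sketched in a few lines of NumPy. This is an illustrative sketch following the formula from the Transformer paper; the function name and the toy sizes are assumptions, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sin/cos position vectors.

    Hypothetical helper for illustration; assumes d_model is even.
    """
    positions = np.arange(max_len)[:, np.newaxis]      # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # 0, 2, 4, ...
    # Each pair of dimensions uses a different wavelength
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)
```

Each row is added to the embedding of the word at that position, so two identical words at different positions end up with different input vectors.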


**NOTE** - The attention block on the encoder side uses all the words of the input, whereas the masked attention block on the decoder side only uses the previous words of the output. The mask sets the attention weights for the next words to 0.
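The difference between unmasked and masked attention can be illustrated with a single attention head in NumPy. This is a minimal sketch under simplifying assumptions: the random matrix `x` stands in for word vectors, the same matrix is used for queries, keys, and values, and no learned projections or multiple heads are included.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=False):
    """Single-head attention; returns (output, attention_weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # relevance of every word to every other word
    if causal:
        # Hide later positions: word i may only attend to words 0..i
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 words, 8-dimensional toy vectors

_, encoder_weights = scaled_dot_product_attention(x, x, x)
_, decoder_weights = scaled_dot_product_attention(x, x, x, causal=True)
print(np.round(decoder_weights, 2))  # upper triangle is all zeros
```

Setting the masked scores to negative infinity before the softmax is what makes the corresponding attention weights exactly 0 afterwards.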

# Limitation of a Transformer

- It can only deal with fixed-length text strings. Longer text has to be split into a certain number of segments or chunks before being fed into the model as input. This chunking causes context fragmentation. For example, if a sentence is split in the middle, a significant amount of context is lost. In other words, the text is split without respecting sentence boundaries or any other semantic boundary.
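A small sketch makes the fragmentation concrete. The helper name, the example sentence, and the chunk size of 4 are all made up for illustration; real models use much longer chunks, but the effect is the same.

```python
def split_into_chunks(tokens, max_len):
    """Naively split a token list into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

tokens = "the storm destroyed the bridge . rescue teams arrived within an hour .".split()
chunks = split_into_chunks(tokens, max_len=4)

for chunk in chunks:
    print(chunk)
```

The second chunk starts with the end of the first sentence and ends in the middle of the next one, so a model that sees each chunk in isolation no longer knows what was destroyed or what the rescue teams did.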

# What is BERT?

### Bidirectional Encoder Representations from Transformers

- Encoder blocks from the above architecture stacked on top of each other

- BERT is a deeply bidirectional model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase.

- **Variants**
  - BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
  - BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters
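The parameter counts can be roughly reproduced with some arithmetic. The sketch below assumes values from the published uncased BERT configurations that are not stated in this notebook: a vocabulary of 30,522 WordPiece tokens, 512 positions, hidden sizes of 768 and 1024, and feed-forward sizes of 3072 and 4096. The function name is hypothetical, and the count covers the encoder plus the pooling layer only, without any pretraining or task heads.

```python
def bert_encoder_param_count(vocab_size, hidden, layers, ffn, max_pos=512, segments=2):
    """Approximate parameter count of a BERT encoder (embeddings + blocks + pooler)."""
    layer_norm = 2 * hidden                                # scale and bias
    embeddings = (vocab_size + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)             # query, key, value, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm  # one layer norm after each sub-layer
    pooler = hidden * hidden + hidden                      # dense layer on the [CLS] vector
    return embeddings + layers * per_layer + pooler

base = bert_encoder_param_count(vocab_size=30522, hidden=768, layers=12, ffn=3072)
large = bert_encoder_param_count(vocab_size=30522, hidden=1024, layers=24, ffn=4096)
print(f"BERT Base:  {base:,}")   # roughly 110 million
print(f"BERT Large: {large:,}")  # roughly 335 million
```

The number of attention heads does not appear in the formula because the heads split the hidden size between them rather than adding parameters.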


# Text processing for BERT

<img src= "https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/bert_emnedding.png">

- **Token Embeddings:** These are the embeddings learned for the specific token from the WordPiece token vocabulary
- **Segment Embeddings:** BERT can also take sentence pairs as inputs for tasks such as question answering. That’s why it learns a unique embedding for the first and the second sentence, to help the model distinguish between them. In the above example, all the tokens marked as EA belong to sentence A (and similarly for EB)
- **Position Embeddings:** BERT learns and uses positional embeddings to express the position of words in a sentence. These are added to overcome the limitation of Transformer which, unlike an RNN, is not able to capture “sequence” or “order” information
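The input representation is the element-wise sum of the three embeddings. A toy NumPy sketch, where the table sizes, the random tables, and the token ids are all made-up stand-ins for what a trained model would have learned:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 100, 16, 8           # toy sizes, not BERT's real ones

# Lookup tables that a real model would learn during pretraining
token_table = rng.normal(size=(vocab_size, hidden))
segment_table = rng.normal(size=(2, hidden))       # sentence A / sentence B
position_table = rng.normal(size=(max_len, hidden))

# Hypothetical ids for: [CLS] a b [SEP] c d [SEP]
token_ids = np.array([1, 7, 42, 2, 9, 13, 2])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])      # first four tokens belong to sentence A
position_ids = np.arange(len(token_ids))           # 0, 1, 2, ...

input_embeddings = (
    token_table[token_ids]
    + segment_table[segment_ids]
    + position_table[position_ids]
)
print(input_embeddings.shape)  # (7, 8): one vector per token
```

Note that the two `[SEP]` tokens share a token embedding but still get different input vectors, because their segment and position embeddings differ.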






# How to train BERT?

- **Pretrain BERT** to understand language. 

**Goal** - *What is language? What is context?*

1. **Masked Language Model** - Takes a sentence in which random words have been replaced with masks. It's like a fill-in-the-blanks exercise for learning the language.
<br>
<br>
Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires:
<br>
- Adding a classification layer on top of the encoder output.
- Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
- Calculating the probability of each word in the vocabulary with softmax.
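The masking and prediction steps can be sketched with NumPy. This is a simplified illustration, not the actual pretraining code: the tiny vocabulary is made up, the random `encoder_output` stands in for what the trained encoder would produce, and every selected word is replaced by `[MASK]` (the published recipe also leaves some selected words unchanged or swaps in random words, which is omitted here).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Made-up toy vocabulary and sentence
vocab = ["[PAD]", "[MASK]", "the", "fire", "spread", "across", "forest", "last", "night"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
tokens = ["the", "fire", "spread", "across", "the", "forest", "last", "night"]
input_ids = np.array([token_to_id[t] for t in tokens])

# Select about 15% of the positions (at least one) and replace them with [MASK]
n_mask = max(1, int(round(0.15 * len(input_ids))))
masked_positions = rng.choice(len(input_ids), size=n_mask, replace=False)
masked_ids = input_ids.copy()
masked_ids[masked_positions] = token_to_id["[MASK]"]

# Random stand-ins for the learned embedding matrix and the encoder output
hidden = 4
embedding_matrix = rng.normal(size=(len(vocab), hidden))
encoder_output = rng.normal(size=(len(input_ids), hidden))

# Multiply by the embedding matrix to get one score per vocabulary word, then softmax
logits = encoder_output @ embedding_matrix.T   # shape (seq_len, vocab_size)
probs = softmax(logits)

# Cross-entropy computed on the masked positions only
true_ids = input_ids[masked_positions]
loss = -np.mean(np.log(probs[masked_positions, true_ids]))
print(masked_positions, float(loss))
```

With only one of the eight positions contributing to the loss, each training example carries relatively little learning signal.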

<img src="https://miro.medium.com/max/986/0*ViwaI3Vvbnd-CJSQ.png">

The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words. As a consequence, the model converges more slowly than directional models.

2. **Next Sentence Prediction** - Takes two sentences and predicts whether the second sentence follows the first.
<br>

<img src="https://miro.medium.com/max/1321/0*m_kXt3uqZH9e7H4w.png">

- A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
- A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
- A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.
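The three steps above can be sketched as a small packing function. The function name, the `[PAD]` padding convention, and the example sentences are assumptions for illustration; tokens are kept as strings here, whereas a real pipeline would convert them to vocabulary ids.

```python
def pack_sentence_pair(tokens_a, tokens_b, max_len):
    """Build the token, segment, and mask lists for a sentence pair (hypothetical helper)."""
    # [CLS] at the start, [SEP] at the end of each sentence
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sentence pair is longer than max_len")

    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # Pad everything to max_len; the mask marks real tokens with 1
    pad_len = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * pad_len
    tokens = tokens + ["[PAD]"] * pad_len
    segment_ids = segment_ids + [0] * pad_len
    return tokens, segment_ids, attention_mask

tokens, segment_ids, attention_mask = pack_sentence_pair(
    ["fire", "in", "town"], ["people", "evacuated"], max_len=10
)
print(tokens)
print(segment_ids)
print(attention_mask)
```

The position of each token is simply its index in the packed list, which is what the positional embedding is looked up with.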


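# Fine Tuning BERT

We need to change the output layer of the model for the respective task:

- **Classification** - Add a softmax layer on top of the output vector of the [CLS] token (for a binary label, a single sigmoid unit does the same job).
- **NER** - Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.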
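The difference between the two heads comes down to which output vectors are fed to the classification layer. A shape-level NumPy sketch, where the random `sequence_output` stands in for BERT's per-token output and the weights, sizes, and number of tags are made up:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, hidden = 6, 8        # toy sizes
n_classes, n_tags = 2, 5      # e.g. disaster / not disaster, and 5 hypothetical NER tags

# Stand-in for BERT's output: one vector per token, [CLS] at position 0
sequence_output = rng.normal(size=(seq_len, hidden))

# Classification: only the [CLS] vector goes through the output layer
w_cls = rng.normal(size=(hidden, n_classes))
class_probs = softmax(sequence_output[0] @ w_cls)   # shape (n_classes,)

# NER: the same output layer is applied to every token vector
w_ner = rng.normal(size=(hidden, n_tags))
tag_probs = softmax(sequence_output @ w_ner)        # shape (seq_len, n_tags)
predicted_tags = tag_probs.argmax(axis=-1)          # one tag per token

print(class_probs.shape, tag_probs.shape, predicted_tags.shape)
```

During fine-tuning, both the new output layer and the pretrained BERT weights are updated on the task data.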

In [7]:
# We will use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
!pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/e5/2d/6d4ca4bef9a67070fa1cac508606328329152b1df10bdf31fb6e4e727894/sentencepiece-0.1.94-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.94


In [8]:
#importing libraries
import numpy as np
import pandas as pd

#libraries for Deep Learning
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
import tokenization

In [9]:
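def bert_encode(texts, tokenizer, max_len=512):
    """Convert raw texts into the three fixed-length input arrays BERT expects."""
    all_tokens = []
    all_masks = []
    all_segments = []

    for text in texts:
        # Split the text into WordPiece tokens
        text = tokenizer.tokenize(text)

        # Truncate to leave room for the [CLS] and [SEP] special tokens
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)

        # Token ids, padded with 0 up to max_len
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        # Mask: 1 for real tokens, 0 for padding
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        # Single-sentence input, so every token belongs to segment 0
        segment_ids = [0] * max_len

        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)

    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)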

In [10]:
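def build_model(bert_layer, max_len=512):
    # The three inputs produced by bert_encode
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    # This version of the hub layer returns (pooled_output, sequence_output)
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # Use the output vector of the [CLS] token (position 0) to represent the tweet
    clf_output = sequence_output[:, 0, :]
    # Single sigmoid unit for the binary target
    out = Dense(1, activation='sigmoid')(clf_output)

    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    # learning_rate replaces the deprecated lr argument
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

    return model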

In [11]:
%%time
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

CPU times: user 39.7 s, sys: 11.9 s, total: 51.6 s
Wall time: 3min 51s


In [12]:
#importing the dataset
from google.colab import drive
drive.mount('/content/drive',force_remount=True)
%cd /content/drive/My Drive/Data_MSCA/

Mounted at /content/drive
/content/drive/My Drive/Data_MSCA


In [13]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submission = pd.read_csv("sample_submission.csv")

In [14]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [15]:
train_input = bert_encode(train.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, tokenizer, max_len=160)
train_labels = train.target.values

In [16]:
model = build_model(bert_layer, max_len=160)
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 160)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 160)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 160)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]      

In [20]:
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)

train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=3,
    callbacks=[checkpoint],
    batch_size=16
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [21]:
model.load_weights('model.h5')
test_pred = model.predict(test_input)

In [22]:
submission['target'] = test_pred.round().astype(int)
submission.to_csv('submission_colab.csv', index=False)