# 03-Model: BERT

**Topic:** Real or Not? NLP with Disaster Tweets
<br>
**Class:** MSCA 31009 Machine Learning 
<br>
**Professor:** Dr. Arnab Bose
<br>
**Link:** https://www.kaggle.com/c/nlp-getting-started/overview


# Why BERT?

**Disadvantages of LSTM**
<br>
1. Slow to train. Words are passed in sequentially and generated sequentially.
<br>
2. Not the best at capturing the true meaning of words (even the bidirectional ones).

**Transformer model** - Introduced in the 2017 paper "Attention Is All You Need", the Transformer addresses both problems above. It is **faster** because all the words are processed simultaneously, and encoder-based models such as BERT understand true contextual meaning because they are deeply bidirectional.

<br>

Let's have a look at the architecture.

<br>

<img src ="https://cdn.analyticsvidhya.com/wp-content/uploads/2019/06/Screenshot-from-2019-06-17-20-01-32.png">


**Encoder** : Takes the words simultaneously and generates the embeddings simultaneously. Embeddings are vectors that encapsulate the meaning of the word. 

**Decoder** : Takes in the embeddings along with the outputs the decoder has generated so far.

Since the two parts learn different things, they can also be used individually. In English-to-French translation, for example, the **Encoder** learns what English is and what context is, while the **Decoder** learns how to map English words to French words. **Stacked Encoders = BERT**. **Stacked Decoders = GPT**.







# Understanding a Transformer

- **Input Embedding** - We first input language data in the form of embeddings, i.e. numerical vectors that encapsulate the meaning of each word.
- **Positional Encoding** - Vectors that give context based on the position of a word. The original Transformer uses sine and cosine functions for positional encoding.
- **Encoder Block**
  - **Multi Head Attention** - Attention answers the question: which part should we focus on? We want to know how relevant the i-th word in the sentence is to every other word in the same sentence. This is represented by the i-th attention vector. Because a single attention vector tends to give the word itself the highest weight, several attention vectors are computed per word and combined with a weighted average.
  - **Feed forward layer** - We apply feed forward nets to all the attention vectors obtained above, and also convert them to a shape accepted by the decoder block.
- **Decoder Block**
  - **Output Embedding** + **Positional Encoding**- We do the same thing. Convert outputs into embeddings and feed it to the decoder block 
  - **Multi Head Attention**- How much each word is related to other words in the embedding
  - **Multi Head Attention / Encoder Decoder Block** - Vectors from the encoder and vectors from the output embedding are passed into this block. This is where the mapping happens: each resulting vector represents the relation between words in the input and words in the output.
  - **Feed Forward** - Makes the output more digestible for the linear layer
- **Linear Layer** - Feed forward network that converts the output into the expected output length
- **Softmax** - Gives the probability of each candidate word
- **Final Word** - Word with the highest probability
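
As a minimal sketch of the sine/cosine positional encoding mentioned in the list above: the formula is the one from the original Transformer paper, while the function name `positional_encoding` and the tiny sizes are illustrative assumptions, not part of this notebook.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Each position gets a distinct vector, which is added to the word embedding so the model can tell word order apart.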


**NOTE** - The encoder's attention block uses all the words of the input, whereas the decoder's masked attention block uses only the previous words of the output. The mask sets the attention on the following words to 0.
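
The difference between unmasked and masked attention can be sketched with a single attention head in NumPy. This is a simplified illustration under stated assumptions: there are no learned query/key/value projections, the input is random, and the name `scaled_dot_product_attention` is chosen here for illustration.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, causal=False):
    """Single-head attention; q, k, v have shape (seq_len, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Hide future positions: position i may only attend to positions <= i
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax turns the scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 toy token vectors of size 8
_, enc_weights = scaled_dot_product_attention(x, x, x)               # encoder style
_, dec_weights = scaled_dot_product_attention(x, x, x, causal=True)  # decoder style
print(np.round(dec_weights, 2))  # upper triangle is 0: no attention on later words
```

In the masked case every row still sums to 1, but all weight on later positions is exactly 0.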

# Limitation of a Transformer

- A Transformer can only deal with fixed-length text strings. The text has to be split into a certain number of segments or chunks before being fed into the system as input. This chunking of text causes context fragmentation. For example, if a sentence is split in the middle, a significant amount of context is lost. In other words, the text is split without respecting sentence or other semantic boundaries.
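
A small sketch of this fragmentation, assuming a naive whitespace tokenizer and a hypothetical helper `chunk_tokens` (neither is used elsewhere in this notebook):

```python
def chunk_tokens(tokens, max_len):
    """Split a token list into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

text = "the fire spread quickly . residents were evacuated by noon ."
chunks = chunk_tokens(text.split(), max_len=4)
for chunk in chunks:
    print(chunk)
```

The second sentence ends up split across two chunks, so neither chunk sees who was evacuated and when in the same context.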

# What is BERT?

### Bidirectional Encoder Representations from Transformers

- Encoder blocks from the above architecture, stacked on top of each other

- BERT is a deeply bidirectional model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase.

- **Variants**
  - BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
  - BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters
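
The parameter counts can be roughly reproduced with some arithmetic. The sketch below assumes values that are not stated above: hidden sizes of 768 (Base) and 1024 (Large), feed-forward sizes of 3072 and 4096, a 30,522-token vocabulary, 512 positions, 2 segment types, and a pooler layer on top. The helper `bert_param_count` is hypothetical.

```python
def bert_param_count(layers, hidden, ffn, vocab=30522, max_pos=512, segments=2):
    """Count the trainable parameters of a BERT encoder with a pooler."""
    layer_norm = 2 * hidden                                   # scale + shift
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base = bert_param_count(layers=12, hidden=768, ffn=3072)
large = bert_param_count(layers=24, hidden=1024, ffn=4096)
print(f"BERT Base:  {base:,}")   # 109,482,240
print(f"BERT Large: {large:,}")  # 335,141,888
```

The number of attention heads does not change the count, because the heads split the same hidden dimension. The Large figure matches the `Param #` reported for `TFBertModel` in the model summary later in this notebook.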


# Text processing for BERT

<img src= "https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/bert_emnedding.png">

- **Token Embeddings:** These are the embeddings learned for the specific token from the WordPiece token vocabulary
- **Segment Embeddings:** BERT can also take sentence pairs as inputs for tasks such as question answering. That’s why it learns a unique embedding for the first and the second sentence, to help the model distinguish between them. In the above example, all the tokens marked as EA belong to sentence A (and similarly EB for sentence B)
- **Position Embeddings:** BERT learns and uses positional embeddings to express the position of words in a sentence. These are added to overcome the limitation of Transformer which, unlike an RNN, is not able to capture “sequence” or “order” information
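
How the three embeddings combine can be sketched in NumPy: the input representation of each token is the element-wise sum of its token, segment, and position embeddings. The tables below are random stand-ins for learned weights and the hidden size of 8 is a toy assumption. The token ids are the ones the tokenizer produces for `[CLS] first night [SEP]` later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8                                        # toy size, far smaller than real BERT
token_table = rng.normal(size=(30522, hidden))    # one row per WordPiece token
segment_table = rng.normal(size=(2, hidden))      # sentence A / sentence B
position_table = rng.normal(size=(512, hidden))   # one row per position

input_ids = np.array([101, 2034, 2305, 102])      # [CLS] first night [SEP]
segment_ids = np.zeros_like(input_ids)            # single sentence: all sentence A
positions = np.arange(len(input_ids))

# Look up each table and add the three vectors per token
embeddings = (token_table[input_ids]
              + segment_table[segment_ids]
              + position_table[positions])
print(embeddings.shape)  # (4, 8)
```

The real model additionally applies layer normalization and dropout to this sum before the first encoder block.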






# Let's start with coding

In [1]:
#importing libraries
#libraries for DL
import tensorflow as tf
import tensorflow_hub as hub
import nltk
from nltk.corpus import stopwords
import keras
from tqdm import tqdm
import pickle
from keras.models import Model
import keras.backend as K
from sklearn.metrics import confusion_matrix,f1_score,classification_report
import matplotlib.pyplot as plt
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from sklearn.utils import shuffle
#!pip install transformers
from transformers import BertTokenizer, TFBertModel, BertConfig
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model

#libraries for data manipulation
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import re
import unicodedata
import itertools


In [2]:
#importing the dataset
from google.colab import drive
drive.mount('/content/drive',force_remount=True)
%cd /content/drive/My Drive/Data_MSCA/

Mounted at /content/drive
/content/drive/My Drive/Data_MSCA


In [3]:
#importing the dataset
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
submission = pd.read_csv("sample_submission.csv")

In [4]:
#let's have a look at the data
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [6]:
#cleaning the dataset and getting it into shape

#removing the duplicates 
print(train_df.duplicated().sum())
train_df = train_df.drop_duplicates()

0


In [7]:
#lower case all the text
train_df['text']=train_df.apply(lambda row:row['text'].lower(),axis=1)
test_df['text']=test_df.apply(lambda row:row['text'].lower(),axis=1)

In [8]:
#Removing the t.co links (scheme included) from the text

train_df['text']=train_df.apply(lambda row:re.sub(r"https?://t\.co/\w+","",row['text']),axis=1)
test_df['text']=test_df.apply(lambda row:re.sub(r"https?://t\.co/\w+","",row['text']),axis=1)

In [9]:
#for #
train_df['text']=train_df.apply(lambda row:re.sub(r"#","",row['text']),axis=1)
test_df['text']=test_df.apply(lambda row:re.sub(r"#","",row['text']),axis=1)

In [10]:
#for @
train_df['text']=train_df.apply(lambda row:re.sub(r"@\w+","",row['text']),axis=1)
test_df['text']=test_df.apply(lambda row:re.sub(r"@\w+","",row['text']),axis=1)

In [11]:
#Removing the HTML Tags

train_df['text']=train_df.apply(lambda row:re.sub(r"<.*?>","",row['text']),axis=1)
test_df['text']=test_df.apply(lambda row:re.sub(r"<.*?>","",row['text']),axis=1)

In [12]:
#Removing URLs

train_df['text']=train_df.apply(lambda row:re.sub(r"https?://\S+|www\.\S+","",row['text']),axis=1)
test_df['text']=test_df.apply(lambda row:re.sub(r"https?://\S+|www\.\S+","",row['text']),axis=1)

In [13]:
#Removing emojis

# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("Omg another Earthquake 😔😔 ")
train_df['text']=train_df['text'].apply(lambda x: remove_emoji(x))
test_df['text']=test_df['text'].apply(lambda x: remove_emoji(x))

In [14]:
#Replacing line breaks and tabs with a space so that adjacent words are not glued together

train_df['text']=train_df.apply(lambda row:re.sub(r"\n"," ",row['text']),axis=1)
test_df['text']=test_df.apply(lambda row:re.sub(r"\n"," ",row['text']),axis=1)
train_df['text']=train_df.apply(lambda row:re.sub(r"\t"," ",row['text']),axis=1)
test_df['text']=test_df.apply(lambda row:re.sub(r"\t"," ",row['text']),axis=1)

In [15]:
#Removing extra spaces: collapse every run of whitespace into a single space

train_df['text']=train_df.apply(lambda row:re.sub(r"\s+"," ",row['text'].strip()),axis=1)
test_df['text']=test_df.apply(lambda row:re.sub(r"\s+"," ",row['text'].strip()),axis=1)
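
The cleaning steps above can also be condensed into one function that works on a single string. This is only a sketch: `clean_tweet` is a hypothetical helper, it leaves out the emoji step, and the sample tweet is made up for illustration.

```python
import re

def clean_tweet(text):
    """Apply the main cleaning steps to a single tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs, including t.co links
    text = re.sub(r"<.*?>", "", text)                  # HTML tags
    text = re.sub(r"@\w+", "", text)                   # @mentions
    text = text.replace("#", "")                       # keep the hashtag word itself
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

sample = "Forest #fire near @user  http://t.co/abc123\n<b>Stay safe</b>"
print(clean_tweet(sample))  # forest fire near stay safe
```

A function like this could be applied with `train_df['text'].apply(clean_tweet)`, which keeps the train and test sets on exactly the same steps.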

In [16]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,our deeds are the reason of this earthquake ma...,1
1,4,,,forest fire near la ronge sask. canada,1
2,5,,,all residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive wildfires evacuation ord...",1
4,7,,,just got sent this photo from ruby alaska as s...,1


# Let's understand the BERT Tokenizer

BERT-Base, uncased uses a vocabulary of 30,522 tokens. Tokenization splits the input text into a list of tokens that are available in the vocabulary. To deal with words that are not in the vocabulary, BERT uses WordPiece tokenization, a sub-word technique similar to BPE. In this approach, an out-of-vocabulary word is progressively split into sub-words, and the word is then represented by that group of sub-words. Since the sub-words are part of the vocabulary, the model has learned representations for them, and the representation of the word is built from the representations of its sub-words.
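
The splitting step can be sketched as a greedy longest-match-first search over the vocabulary. The function `wordpiece_tokenize` and the four-entry `toy_vocab` below are illustrative assumptions. The real tokenizer works on the full 30,522-token vocabulary and also handles punctuation and casing.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter candidate
        if piece is None:
            return [unk]                      # no piece fits: the word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"retain", "##ers", "night", "##s"}  # tiny illustrative vocabulary
print(wordpiece_tokenize("retainers", toy_vocab))  # ['retain', '##ers']
print(wordpiece_tokenize("night", toy_vocab))      # ['night']
```

The `##` prefix marks a piece that continues the previous one, so the original word can be reassembled by joining the pieces.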


In [17]:
#finding the number of classes
num_classes=len(train_df.target.unique())
print("The number of classes are: {}".format(num_classes))

The number of classes are: 2


In [19]:
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [20]:
#looking at how BERT tokenizes the sentences
print("Sentence is: ")
print(train_df['text'][49])
tokens=bert_tokenizer.tokenize(train_df['text'][49])
print(tokens)

Sentence is: 
first night with retainers in. it's quite weird. better get used to it; i have to wear them every single night for the next year at least.
['first', 'night', 'with', 'retain', '##ers', 'in', '.', 'it', "'", 's', 'quite', 'weird', '.', 'better', 'get', 'used', 'to', 'it', ';', 'i', 'have', 'to', 'wear', 'them', 'every', 'single', 'night', 'for', 'the', 'next', 'year', 'at', 'least', '.']


As we can see, the word 'retainers' is not in the vocabulary and has therefore been broken down into two sub-word tokens, 'retain' and '##ers'.

# Let's have a look at the parameters required for BERT


1. **Input IDs**
  - Input IDs: Index of input tokens in the BERT vocabulary
  - Batch Size: Number of examples
  - Sequence Length: Number of tokens in a sentence
2. **Attention Masks**
  - Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked (0 if the token is added by padding).

3. **Label**
  - Indices should be in [0, ..., num_classes - 1]. If num_classes == 1 a regression loss is computed (Mean-Square loss); if num_classes > 1 a classification loss is computed (Cross-Entropy). This rule applies to the ready-made sequence-classification heads in `transformers`; the model built below uses its own sigmoid output with binary cross-entropy on the 0/1 target.
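
How input IDs and attention masks fit together can be sketched without the library. The helper `pad_and_mask` is hypothetical. The ids 101 and 102 for `[CLS]` and `[SEP]` are the ones visible in the tokenizer output in this notebook, and 0 is assumed to be the `[PAD]` id of the uncased vocabulary.

```python
def pad_and_mask(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Add [CLS]/[SEP], truncate or pad to max_len, and build the attention mask."""
    body = token_ids[:max_len - 2]           # leave room for [CLS] and [SEP]
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)                    # 1 = real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding

ids, mask = pad_and_mask([2034, 2305], max_len=6)
print(ids)   # [101, 2034, 2305, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every sequence in a batch ends up with the same length, and the mask tells the model which positions to ignore.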

In [21]:
tokenized_sequence= bert_tokenizer.encode_plus(train_df['text'][49],add_special_tokens = True,max_length =30,padding = 'max_length',truncation = True, return_attention_mask = True)
tokenized_sequence


{'input_ids': [101, 2034, 2305, 2007, 9279, 2545, 1999, 1012, 2009, 1005, 1055, 3243, 6881, 1012, 2488, 2131, 2109, 2000, 2009, 1025, 1045, 2031, 2000, 4929, 2068, 2296, 2309, 2305, 2005, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [22]:
#let's see the decoded words
bert_tokenizer.decode(tokenized_sequence['input_ids'])

"[CLS] first night with retainers in. it's quite weird. better get used to it ; i have to wear them every single night for [SEP]"

Special tokens 
- Classifier [CLS] 
- Separator [SEP] 
- Padding [PAD] 

are added by the tokenizer.

In [23]:
#Let's load the sentences into the tokenizer
def bert_encoder(dataset,max_l):
  #encode every sentence to exactly max_l token ids plus a matching attention mask
  input_ids=[]
  attention_masks=[]
  for sent in dataset:
    bert_inp=bert_tokenizer.encode_plus(sent,add_special_tokens = True,max_length =max_l,padding = 'max_length',truncation = True,return_attention_mask = True)
    input_ids.append(bert_inp['input_ids'])
    attention_masks.append(bert_inp['attention_mask'])

  input_ids=np.asarray(input_ids)
  attention_masks=np.array(attention_masks)

  return input_ids,attention_masks



In [24]:
train_ids,train_masks=bert_encoder(train_df['text'],64)
test_ids,test_masks=bert_encoder(test_df['text'],64)



In [29]:
def create_model(max_l):
  #bert-large-uncased uses the same 30,522-token uncased vocabulary as the bert-base-uncased tokenizer loaded above
  bert_model = TFBertModel.from_pretrained('bert-large-uncased')
  input_ids = tf.keras.Input(shape=(max_l,),dtype='int32')
  attention_masks = tf.keras.Input(shape=(max_l,),dtype='int32')
  
  output = bert_model([input_ids,attention_masks])
  #index 1 is the pooled output for the [CLS] token
  output = output[1]
  output = tf.keras.layers.Dense(32,activation='relu')(output)
  output = tf.keras.layers.Dropout(0.2)(output)

  #single sigmoid unit for the binary target
  output = tf.keras.layers.Dense(1,activation='sigmoid')(output)
  model = tf.keras.models.Model(inputs = [input_ids,attention_masks],outputs = output)
  model.compile(Adam(learning_rate=6e-6), loss='binary_crossentropy', metrics=['accuracy'])
  return model

In [30]:
model = create_model(64)
model.summary()

Some layers from the model checkpoint at bert-large-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-large-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 64)]         0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, 64)]         0                                            
__________________________________________________________________________________________________
tf_bert_model_1 (TFBertModel)   TFBaseModelOutputWit 335141888   input_3[0][0]                    
                                                                 input_4[0][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 32)           32800       tf_bert_model_1[0][1] 

In [33]:
#creating checkpoint to save model
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True,save_weights_only=True)

history = model.fit([train_ids,train_masks],train_df.target,validation_split=0.2, epochs=3,batch_size=16,callbacks=[checkpoint])

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [34]:
#get the best weights
model.load_weights('model.h5')

test_prediction = model.predict([test_ids,test_masks])
submission['target'] = np.round(test_prediction).astype(int)
submission.to_csv('submission_Trasnformers.csv', index=False)
