<center>
    <h1>BERT</h1>
</center>

# Brief Recap
**BERT** was introduced by Google in 2018 in the research paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)".

BERT revolutionized natural language processing (NLP) by enabling deep bidirectional context in language models. This means it considers the context from both the left and right of a word during training, allowing for better understanding of word meanings in context.

# Architecture

<img src='https://www.researchgate.net/publication/349546860/figure/fig2/AS:994573320994818@1614136166736/The-Transformer-based-BERT-base-architecture-with-twelve-encoder-blocks.ppm' width=500>

1. **Input Layer:**

  * The input text is tokenized into individual words or subwords (t1, t2, t3, etc.).
  * Each token is represented by a numerical embedding.
  * Position embeddings are added to the token embeddings to give the model information about the position of each token in the sequence. BERT learns these embeddings, together with segment embeddings that mark which sentence a token belongs to.
2. **Transformer Encoder:**

  * This is the core of the BERT model. It consists of a stack of identical layers (12 in BERT-base, 24 in BERT-large); within each layer, all tokens of the sequence are processed in parallel.
  * Each layer contains two main components:
    * Multi-Head Attention: This component allows the model to weigh the importance of different parts of the input sequence when processing a particular token. It can capture long-range dependencies in the text.
    * Feed-Forward Neural Network: This component applies non-linear transformations to each token representation, helping the model learn complex patterns.
  * After each of these two components, an "Add & Norm" operation is applied: the component's input is added to its output (a residual connection) and the result is layer-normalized. This helps stabilize the training process.
3. **Classification Layer:**

  * For downstream tasks, a task-specific classification layer is placed on top of the encoder to make predictions from the encoded sequence.
  * It typically consists of a dense layer with a softmax activation function, which outputs probabilities for the different classes.
4. **Masked Language Modeling (MLM):**

  * One of the pre-training tasks used to train BERT.
  * Some tokens in the input sequence are randomly masked (represented as "[MASK]" in the image).
  * The model is then trained to predict the original masked tokens based on the context provided by the surrounding tokens.
5. **Next Sentence Prediction (NSP):**

  * Another pre-training task used to train BERT.
  * Two sentences are input to the model, and the model is trained to predict whether the second sentence is the actual next sentence in the original text.

Overall, the BERT model is trained on a massive amount of text data using the MLM and NSP tasks. This pre-training allows the model to learn a deep understanding of language and context, which can then be applied to a wide range of natural language processing tasks.
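
The masking step of MLM can be illustrated with a small NumPy sketch. It follows the scheme described in the BERT paper (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged). The function name `mask_tokens`, the `-100` "ignore" marker, and the example token IDs are assumptions made for this illustration, not part of any library.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mask_prob=0.15, seed=0):
    """Return (corrupted_ids, labels) for masked language modeling.

    labels holds the original ID at selected positions and -100 elsewhere
    (-100 is just the "ignore" marker chosen for this sketch).
    """
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    corrupted = token_ids.copy()
    labels = np.full(token_ids.shape, -100)

    # Select roughly mask_prob of the non-special positions as prediction targets.
    candidates = ~np.isin(token_ids, list(special_ids))
    selected = candidates & (rng.random(token_ids.shape) < mask_prob)
    labels[selected] = token_ids[selected]

    # Of the selected positions: 80% -> [MASK], 10% -> random token, 10% unchanged.
    roll = rng.random(token_ids.shape)
    corrupted[selected & (roll < 0.8)] = mask_id
    random_pos = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[random_pos] = rng.integers(0, vocab_size, size=int(random_pos.sum()))
    return corrupted, labels

# Illustrative IDs; 101, 102 and 103 are [CLS], [SEP] and [MASK] in bert-base-uncased.
ids = np.array([101, 7592, 1010, 2129, 2024, 2017, 1029, 102])
corrupted, labels = mask_tokens(ids, mask_id=103, vocab_size=30522,
                                special_ids=(101, 102), mask_prob=0.5)
print(corrupted)
print(labels)
```

A higher `mask_prob` is used in the call only so that a short example sentence is likely to contain some masked positions.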

# Applications
BERT has diverse applications across several domains:

1. **Text Classification:**

  * **Sentiment analysis:** Determining the sentiment of a piece of text (positive, negative, neutral).
  * **Topic classification:** Categorizing text into predefined topics.
  * **Intent classification:** Identifying the intent behind a user's query.
2. **Question Answering:**

  * Answering questions based on a given context, such as a document or a knowledge base.
  * **Extractive question answering:** Identifying the specific span of text that answers the question.
  * **Generative question answering:** Generating a textual answer to a question.
3. **Text Generation:**

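  * Generating text, such as articles, poems, or code. BERT is an encoder-only model, so in these tasks it is typically used as the encoder of a larger encoder-decoder system rather than as a generator on its own.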
  * **Text summarization:** Condensing long texts into shorter summaries.
  * **Text translation:** Translating text from one language to another.

4. **Named Entity Recognition (NER):**

  * Identifying named entities in text, such as people, organizations, and locations.

5. **Search Engines:**

  * Improving search engine results by understanding the semantic meaning of queries.
6. **Chatbots and Virtual Assistants:**

  * Enhancing the ability of chatbots and virtual assistants to understand and respond to natural language queries.
7. **Biomedicine:**

  * Analyzing medical text, such as clinical notes and research papers.
  * Identifying drug-drug interactions and adverse effects.


# Implementation of BERT using TensorFlow



## Approach 1




### **Step 1: Define Positional Encoding**

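Positional encoding helps the model understand the order of tokens in the input sequence. The original BERT learns its position embeddings during pre-training; this from-scratch walkthrough uses the fixed sinusoidal encoding of the original Transformer instead.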

```python
get_positional_encoding(max_seq_len, d_model)
```
* `max_seq_len` determines the maximum length of input sequences.
* `d_model` is the dimensionality of the embedding space.

We create a grid of positions and dimensions, applying the sine and cosine transformations to populate the encoding.
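
For reference, the sinusoidal encoding from the original Transformer paper, which the function in the next cell appears to follow, can be written as:

$$ PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

Here $pos$ is the token position and $i$ indexes pairs of embedding dimensions, so each even/odd pair of dimensions shares one frequency.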



In [None]:
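import numpy as np
import tensorflow as tf

def get_positional_encoding(max_seq_len, d_model):
    pos = np.arange(max_seq_len)[:, np.newaxis]  # Positions, shape (max_seq_len, 1)
    i = np.arange(d_model)[np.newaxis, :]  # Dimension indices, shape (1, d_model)
    # Each pair of dimensions (2k, 2k+1) shares one frequency
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    positional_encoding = pos * angle_rates
    positional_encoding[:, 0::2] = np.sin(positional_encoding[:, 0::2])  # Sine on even dimensions
    positional_encoding[:, 1::2] = np.cos(positional_encoding[:, 1::2])  # Cosine on odd dimensions
    return tf.cast(positional_encoding, dtype=tf.float32)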


### Step 2: Define the Multi-Head Attention Layer

Multi-head attention allows the model to jointly attend to information from different representation subspaces. This helps the model capture various aspects of the input sequences.

**Key Components:**

1. Linear Transformations:

  * Input $x$ is projected into three different spaces: Query (Q), Key (K), and Value (V) using dense layers.
2. Splitting Heads:

  * The model splits the projected vectors into multiple heads. Each head independently performs attention calculations.
3. Scaled Dot-Product Attention:

  * The attention weights are computed using the scaled dot-product of Q and K:
  $$ \text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V $$
  * The dot product is divided by $\sqrt{d_k}$ (the square root of the key depth) so that large dot products do not push the softmax into regions with vanishingly small gradients.
4. Concatenation and Final Dense Layer:

  * The outputs of all heads are concatenated and passed through a final dense layer.
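
The head splitting and the attention formula above can be checked on a tiny example. The following is a minimal NumPy sketch, not the Keras layer itself: it skips the learned Q/K/V and output projections and feeds the same random matrix in as queries, keys and values. All names and sizes are chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # (heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
depth = d_model // num_heads
x = rng.normal(size=(seq_len, d_model))

# Split into heads: (seq_len, d_model) -> (num_heads, seq_len, depth)
heads = x.reshape(seq_len, num_heads, depth).transpose(1, 0, 2)
out, weights = scaled_dot_product_attention(heads, heads, heads)

# Concatenate heads back: (num_heads, seq_len, depth) -> (seq_len, d_model)
concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
print(weights.shape, concat.shape)
```

Each head produces its own `seq_len x seq_len` attention matrix, which is why different heads can attend to different parts of the sequence.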



In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, LayerNormalization

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, num_heads, d_model):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % self.num_heads == 0  # Check if divisible

        self.depth = d_model // self.num_heads

        self.wq = Dense(d_model)  # Query
        self.wk = Dense(d_model)  # Key
        self.wv = Dense(d_model)  # Value
        self.dense = Dense(d_model)  # Final output layer

    def split_heads(self, x):
        batch_size = tf.shape(x)[0]
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])  # (batch_size, num_heads, seq_len, depth)

    def call(self, inputs):
        q = self.wq(inputs)
        k = self.wk(inputs)
        v = self.wv(inputs)

        q = self.split_heads(q)  # Split into heads
        k = self.split_heads(k)
        v = self.split_heads(v)

        scaled_attention_logits = tf.matmul(q, k, transpose_b=True)
        scaled_attention_logits /= tf.math.sqrt(tf.cast(self.depth, tf.float32))
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

        output = tf.matmul(attention_weights, v)  # Compute weighted sum of values
        output = tf.transpose(output, perm=[0, 2, 1, 3])  # (batch_size, seq_len, num_heads, depth)
        output = tf.reshape(output, (tf.shape(output)[0], -1, self.d_model))  # Concatenate heads

        return self.dense(output)  # Final projection


### Step 3: Define the Feed-Forward Layer

The feed-forward network applies a transformation to each position independently and identically, helping to learn complex representations.

**Structure:**

The feed-forward layer consists of two dense layers:
* The first dense layer expands the dimensionality (with a non-linear activation, typically ReLU).
* The second layer projects back to the original model dimension.

In [None]:
class FeedForward(tf.keras.layers.Layer):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.dense1 = Dense(d_ff, activation='relu')  # Expanding dimension
        self.dense2 = Dense(d_model)  # Projecting back to d_model

    def call(self, inputs):
        return self.dense2(self.dense1(inputs))  # Apply the two dense layers


### Step 4: Define the Encoder Layer

An encoder layer consists of the multi-head attention mechanism and the feed-forward network. It also includes residual connections and layer normalization, which help in stabilizing and improving the training of deep networks.

**Components:**

1. **Multi-Head Attention:**
Computes the attention outputs for the input sequence.
2. **Add & Norm:**
The output of the attention layer is added to the input (residual connection) and then normalized.
3. **Feed-Forward Network:**
The result from the previous step goes through the feed-forward network, followed by another residual connection and normalization.
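
The "Add & Norm" step can be sketched in a few lines of NumPy. This is a simplified illustration: it omits the learnable scale and shift that a real layer normalization has, and the stand-in sub-layer (a random ReLU projection) is an assumption used only to have something to wrap.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))   # 3 positions, d_model = 8
w = rng.normal(size=(8, 8))   # weights of the stand-in sub-layer
out = add_and_norm(x, lambda h: np.maximum(h @ w, 0.0))
print(out.mean(axis=-1), out.var(axis=-1))
```

Because the input is added back before normalizing, the sub-layer only has to learn a correction to its input, which is one reason deep stacks of such layers remain trainable.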

In [None]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads, d_model, d_ff, rate=0.1):
        super(EncoderLayer, self).__init__()
        self.attention = MultiHeadAttention(num_heads, d_model)
        self.ffn = FeedForward(d_model, d_ff)
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.attention(inputs)  # Multi-head attention output
        out1 = self.layernorm1(inputs + self.dropout1(attn_output, training=training))  # Add & Norm
        ffn_output = self.ffn(out1)  # Feed-forward network
        return self.layernorm2(out1 + self.dropout2(ffn_output, training=training))  # Add & Norm


### Summary


* **Positional Encoding** helps the model understand the order of tokens.
* **Multi-Head Attention** allows the model to focus on different parts of the input.
* **Feed-Forward Layers** help transform and process the output from the attention mechanism.
* **Encoder Layers** stack attention and feed-forward mechanisms with normalization and residual connections for stability.
* The **BERT Model** integrates these components to process sequences effectively, making it suitable for various NLP tasks.

## Approach 2

Instead of building the encoder by hand, we can load a pre-trained BERT from the Hugging Face `transformers` library, which gives a solid foundation for further exploration.


```
from transformers import TFBertModel

model = TFBertModel.from_pretrained('bert-base-uncased', **kwargs)

```
The bare BERT model transformer, outputting raw hidden states without any specific head on top.

This model inherits from [TFPreTrainedModel](https://huggingface.co/docs/transformers/v4.45.2/en/main_classes/model#transformers.TFPreTrainedModel). Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it as a regular TF 2.0 Keras model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.


**Key Arguments**

* `pretrained_model_name_or_path:`

  The name of the pre-trained model or the path to a directory containing model weights. For example, 'bert-base-uncased'.
* `config:`

  An instance of BertConfig or a configuration dictionary to customize the model architecture.
* `cache_dir:`

  Optional. Directory to cache the pre-trained models.
* `from_pt:`

  Optional. If True, loads the model from a PyTorch checkpoint.


**Key Call Arguments**
```python
outputs = model(inputs, **kwargs)

```
* `input_ids:`

  A tensor of shape (batch_size, sequence_length) containing token IDs.
* `attention_mask:`

  (Optional) A tensor of the same shape as input_ids, where 1 indicates a token to be attended to, and 0 indicates a padding token.
* `token_type_ids:`

  (Optional) A tensor that distinguishes between different sentences in tasks like question answering. It's typically of the same shape as input_ids.
* `training:`

  (Optional) A boolean that specifies whether the model should be in training mode. If set to True, dropout will be applied.

**Outputs**

The outputs of `TFBertModel` are returned as a `TFBaseModelOutputWithPoolingAndCrossAttentions` object (or a plain tuple if `return_dict=False`), which includes:

* `last_hidden_state:`

  A tensor of shape (batch_size, sequence_length, hidden_size), representing the hidden states of the last layer of the model. Each token's representation can be used for downstream tasks.
* `pooler_output:`

  (Optional) A tensor of shape (batch_size, hidden_size) that contains the hidden state of the first token (usually the [CLS] token) after a linear transformation and Tanh activation. This can be used for classification tasks.
* `hidden_states:`

  (Optional) If output_hidden_states=True, this will contain the hidden states from all layers.
* `attentions:`

  (Optional) If output_attentions=True, this will contain the attention weights from all layers.


### Example Usage

In [None]:
from transformers import TFBertModel, BertTokenizer
import tensorflow as tf

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

# Example text input
text = "Hello, how are you?"

# Tokenize and encode the input
inputs = tokenizer(text, return_tensors='tf')

# Get model outputs
outputs = model(inputs)

# The outputs include the last hidden state (one vector per token) and the pooled [CLS] output
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)  # Shape: (batch_size, sequence_length, hidden_size)



# Named Entity Recognition using BERT

<img src='https://miro.medium.com/v2/resize:fit:1400/0*gs2eAAiVleveib9x' width=500>

Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.

**How NER works:**

* **Tokenization:** The text is broken down into individual words or tokens.
* **Part-of-Speech Tagging:** Each token is assigned a part-of-speech tag (e.g., noun, verb, adjective).
* **Entity Recognition:** The system identifies sequences of tokens that form named entities.
* **Entity Classification:** The identified entities are classified into predefined categories (e.g., person, organization, location).
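
The last two steps, grouping tagged tokens into entities and reading off their category, can be sketched in plain Python. The sketch assumes IOB-style tags (`B-` starts an entity, `I-` continues it, `O` is outside any entity); the function name `extract_entities` and the example sentence are made up for this illustration.

```python
def extract_entities(tokens, tags):
    """Group token-level IOB tags into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if there is one.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the entity currently being built.
            current.append(token)
        else:
            # "O" or an inconsistent "I-" tag ends the current entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Barack", "Obama", "visited", "New", "York", "on", "Monday"]
tags = ["B-per", "I-per", "O", "B-geo", "I-geo", "O", "B-tim"]
entities = extract_entities(tokens, tags)
print(entities)
# [('Barack Obama', 'per'), ('New York', 'geo'), ('Monday', 'tim')]
```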

## Dataset Description


[Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/datasets/abhinavwalia95/entity-annotated-corpus)
 is built on the GMB (Groningen Meaning Bank) corpus, with added NLP-derived features for entity classification (for example, the part-of-speech column).
The dataset is an extract from GMB that has been tagged and annotated specifically for training a classifier to predict named entities such as names, locations, etc.

**Number of tagged entities:**
```python
'O': 1146068, 'geo-nam': 58388, 'org-nam': 48034, 'per-nam': 23790, 'gpe-nam': 20680, 'tim-dat': 12786, 'tim-dow': 11404, 'per-tit': 9800, 'per-fam': 8152, 'tim-yoc': 5290, 'tim-moy': 4262, 'per-giv': 2413, 'tim-clo': 891, 'art-nam': 866, 'eve-nam': 602, 'nat-nam': 300, 'tim-nam': 146, 'eve-ord': 107, 'per-ini': 60, 'org-leg': 60, 'per-ord': 38, 'tim-dom': 10, 'per-mid': 1, 'art-add': 1
```

**Essential info about entities:**

* geo = Geographical Entity
* org = Organization
* per = Person
* gpe = Geopolitical Entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural Phenomenon

<br>
Total Words Count = 1354149 <br>
Target Data Column: "Tag"

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from tqdm import tqdm

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping


import transformers
from transformers import BertTokenizerFast
from transformers import TFBertModel

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from helpers import *

## Load and Preprocess Data

In [None]:
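df = pd.read_csv("ner_dataset.csv", encoding='ISO-8859-1')
# In this CSV the "Sentence #" column is typically filled only on the first word of each
# sentence, so forward-fill it first; otherwise dropna() would discard all the other words.
df["Sentence #"] = df["Sentence #"].ffill()
df = df.dropna()
df.head()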

In [None]:
print(f"Number of Tags : {len(df.Tag.unique())}")

In [None]:
pie = df['Tag'].value_counts()
px.pie(names = pie.index,values= pie.values,hole = 0.5,title ='Total Count of Tags')

### Grouping, Tokenizing and Padding

We group, tokenize and pad the data for the BERT model in three steps: organize the rows into sentences, convert the text into numerical IDs with the BERT tokenizer, and pad or truncate every sentence to the same length.


In [None]:
enc_pos = preprocessing.LabelEncoder()
enc_tag = preprocessing.LabelEncoder()

df.loc[:, "POS"] = enc_pos.fit_transform(df["POS"])
df.loc[:, "Tag"] = enc_tag.fit_transform(df["Tag"])

sentences = df.groupby("Sentence #")["Word"].apply(list).values
pos = df.groupby("Sentence #")["POS"].apply(list).values
tag = df.groupby("Sentence #")["Tag"].apply(list).values
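
To see what the grouping produces, here is a toy example with pandas. It assumes, as in the toy frame itself, that the sentence identifier appears only on the first row of each sentence and has to be forward-filled; the data and variable names are invented for this illustration.

```python
import pandas as pd

# Two tiny sentences in the one-word-per-row layout.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["London", "is", "big", "Paris", "too"],
    "Tag": ["B-geo", "O", "O", "B-geo", "O"],
})

# Propagate each sentence identifier down to the rows that follow it.
toy["Sentence #"] = toy["Sentence #"].ffill()

# One list of words and one list of tags per sentence, in order of appearance.
grouped = toy.groupby("Sentence #", sort=False)
toy_sentences = grouped["Word"].apply(list).tolist()
toy_tags = grouped["Tag"].apply(list).tolist()
print(toy_sentences)
print(toy_tags)
```

Each sentence ends up with a word list and a tag list of equal length, which is the property the later padding steps rely on.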


The following tokenize method transforms raw text sentences into numerical data that the BERT model can understand and process. This is an essential preprocessing step for applying BERT to natural language processing tasks.

In [None]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

MAX_LEN = 128
def tokenize(data,max_len = MAX_LEN):
    input_ids = list()
    attention_mask = list()
    for i in tqdm(range(len(data))):
        encoded = tokenizer.encode_plus(data[i],
                                        add_special_tokens = True,
                                        max_length = max_len,
                                        is_split_into_words=True,
                                        return_attention_mask=True,
                                        padding = 'max_length',
                                        truncation=True,return_tensors = 'np')


        input_ids.append(encoded['input_ids'])
        attention_mask.append(encoded['attention_mask'])
    return np.vstack(input_ids),np.vstack(attention_mask)

For each sentence, it uses the `tokenizer.encode_plus` method from the `transformers` library to perform the following:

* **Tokenization**: Breaks down the sentence into individual words or subwords.
* **Encoding**: Converts each token into a unique numerical ID.
* **Special Tokens**: Adds special tokens like [CLS] (classification) and [SEP] (separator) to the beginning and end of the sequence.
* **Padding and Truncation**: Ensures all sequences have the same length by padding shorter sequences with zeros and truncating longer sequences to the maximum length (MAX_LEN).
* **Attention Mask**: Creates an attention mask where 1 indicates tokens to attend to and 0 indicates padding tokens.
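
One consequence of subword tokenization and special tokens is that the token sequence is usually longer than the word sequence, so word-level tags no longer line up with token positions one to one. A common remedy is to expand the tags using the word index of each token, which fast tokenizers expose through `BatchEncoding.word_ids()`. The sketch below works on a hand-written `word_ids` list so that it runs without downloading a tokenizer; the helper name `align_labels` and the `-100` sentinel are assumptions for this illustration, and the loss would still need to be told to ignore that sentinel.

```python
def align_labels(word_ids, word_labels, pad_label=-100):
    """Expand word-level labels to token level.

    word_ids has one entry per token: the index of the word the token
    came from, or None for special tokens and padding.
    """
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            aligned.append(pad_label)             # [CLS], [SEP], [PAD]
        else:
            aligned.append(word_labels[word_id])  # every subword inherits its word's label
    return aligned

# Hand-written example: 3 words, the second one split into two subwords,
# wrapped in [CLS] ... [SEP] and padded to length 8.
word_ids = [None, 0, 1, 1, 2, None, None, None]
word_labels = [2, 0, 5]
aligned = align_labels(word_ids, word_labels)
print(aligned)
# [-100, 2, 0, 0, 5, -100, -100, -100]
```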


In [None]:
# Train test split our dataset
X_train,X_test,y_train,y_test = train_test_split(sentences,tag,random_state=42,test_size=0.1)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

In [None]:
input_ids,attention_mask = tokenize(X_train,max_len = MAX_LEN)
val_input_ids,val_attention_mask = tokenize(X_test,max_len = MAX_LEN)


In [None]:
# TEST: Checking padding and truncation lengths
was = list()
for i in range(len(input_ids)):
    was.append(len(input_ids[i]))
set(was)

In [None]:
# Test Padding: pad each tag list with 0 up to MAX_LEN (and truncate longer ones)
test_tag = list()
for i in range(len(y_test)):
    test_tag.append(np.array((y_test[i] + [0] * MAX_LEN)[:MAX_LEN]))

# TEST:  Checking Padding Length
was = list()
for i in range(len(test_tag)):
    was.append(len(test_tag[i]))
set(was)

In [None]:
# Train Padding: pad each tag list with 0 up to MAX_LEN (and truncate longer ones)
train_tag = list()
for i in range(len(y_train)):
    train_tag.append(np.array((y_train[i] + [0] * MAX_LEN)[:MAX_LEN]))

# TEST:  Checking Padding Length
was = list()
for i in range(len(train_tag)):
    was.append(len(train_tag[i]))
set(was)

# Model Building

In [None]:
from transformers import TFBertModel
from tensorflow.keras.layers import Layer

class BertLayer(Layer):
    def __init__(self, bert_model, **kwargs):
        super(BertLayer, self).__init__(**kwargs)
        self.bert_model = bert_model

    def call(self, inputs):
        input_ids, attention_mask = inputs
        # Call the TFBertModel within the call method
        bert_output = self.bert_model(input_ids, attention_mask=attention_mask, return_dict=True)
        return bert_output["last_hidden_state"]

def create_model(bert_model, max_len=MAX_LEN):
    input_ids = tf.keras.Input(shape=(max_len,), dtype='int32')
    attention_masks = tf.keras.Input(shape=(max_len,), dtype='int32')

    # Use the custom BertLayer
    bert_output = BertLayer(bert_model)([input_ids, attention_masks])

    embedding = tf.keras.layers.Dropout(0.3)(bert_output)
    output = tf.keras.layers.Dense(17, activation='softmax')(embedding)
    model = tf.keras.models.Model(inputs=[input_ids, attention_masks], outputs=[output])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

In [None]:
bert_model = TFBertModel.from_pretrained('bert-base-uncased')
model = create_model(bert_model,MAX_LEN)

**Key Features:**
* **BertLayer:** A custom layer encapsulating the TFBertModel from the transformers library. It takes token IDs and attention masks as input and outputs the last hidden state from BERT.
* **Input Layers:** Two input layers are defined for token IDs (input_ids) and attention masks (attention_masks).
* **BERT Output:** The BertLayer is called with the input layers to obtain BERT's output.
* **Dropout:** A dropout layer is added to reduce overfitting.
* **Dense Layer:** A dense layer with a softmax activation is used for classification, predicting the entity tag for each token.

In [None]:
model.summary()

**Reflection**

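* The model has over 100 million parameters, nearly all of which belong to the pre-trained BERT encoder (BERT-base has roughly 110 million).
* Only the final dense layer is trained from scratch; the large pre-trained encoder is what lets the model capture the complex patterns in the data.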

# Training and Evaluation

In [None]:
early_stopping = EarlyStopping(monitor='val_loss', mode='min', patience=5)
history_bert = model.fit([input_ids, attention_mask], np.array(train_tag),
                         validation_data=([val_input_ids, val_attention_mask], np.array(test_tag)),
                         epochs=25, batch_size=30*2, callbacks=[early_stopping], verbose=True)

In [None]:
plot_metrics(history_bert)
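
When reading the accuracy curve, keep in mind that the `accuracy` metric is averaged over every position of every padded sequence, so correctly predicted padding counts as a hit. A rough way to see the difference is to restrict the comparison to positions where the attention mask is 1. The NumPy sketch below uses invented arrays and a hypothetical helper name, `masked_accuracy`; note that the attention mask still includes the `[CLS]` and `[SEP]` positions.

```python
import numpy as np

def masked_accuracy(y_true, y_pred, attention_mask):
    """Accuracy over positions where attention_mask == 1 only."""
    mask = attention_mask.astype(bool)
    correct = (y_true == y_pred) & mask
    return correct.sum() / mask.sum()

# One sequence of length 6: three real tokens followed by three padding positions.
y_true = np.array([[3, 1, 2, 0, 0, 0]])
y_pred = np.array([[3, 2, 2, 0, 0, 0]])
mask = np.array([[1, 1, 1, 0, 0, 0]])

plain = (y_true == y_pred).mean()               # 5 of 6 positions match
masked = masked_accuracy(y_true, y_pred, mask)  # 2 of 3 real tokens match
print(plain, masked)
```

With sequences padded to 128 positions, the gap between the two numbers can be considerably larger than in this toy case.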

**Reflection**
> Within 25 epochs, the model reaches about 95% accuracy. This figure is computed over all 128 positions of every sequence, including padding, so it likely overstates the accuracy on real tokens.

# Inference

In [None]:
def pred(val_input_ids,val_attention_mask):
    return model.predict([val_input_ids,val_attention_mask])

In [None]:
def testing(val_input_ids,val_attention_mask,enc_tag,y_test):
    val_input = val_input_ids.reshape(1,128)
    val_attention = val_attention_mask.reshape(1,128)

    # Print Original Sentence
    sentence = tokenizer.decode(val_input_ids[val_input_ids > 0])
    print("Original Text : ",str(sentence))
    print("\n")
    true_enc_tag = enc_tag.inverse_transform(y_test)

    print("Original Tags : " ,str(true_enc_tag))
    print("\n")

    pred_with_pad = np.argmax(pred(val_input,val_attention),axis = -1)
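    # Drop predictions equal to 0 (the padding value); this also drops any real tag that LabelEncoder mapped to 0
    pred_without_pad = pred_with_pad[pred_with_pad>0]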
    pred_enc_tag = enc_tag.inverse_transform(pred_without_pad)
    print("Predicted Tags : ",pred_enc_tag)

In [None]:
testing(val_input_ids[0],val_attention_mask[0],enc_tag,y_test[0])

**Reflection**

> The original and predicted tags appear to match for this example, which suggests the model performs well on it.