# **Objectives**
- To grasp the concept of bidirectional context in language understanding.
- To learn about Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) as BERT's pre-training tasks.
- To understand BERT's architecture and how it leverages bidirectionality.
- To explore how BERT is fine-tuned for downstream NLP tasks such as classification, NER, and QA.


-----
##  **Bidirectional Context in BERT**
Traditional models like GPT are unidirectional (left-to-right): each token's representation can use only the tokens that come before it. BERT is **bidirectional**, meaning each token's representation draws on context from both directions.

Example:
For the sentence: "The bank was flooded after the storm."

A unidirectional model encodes "bank" before it has seen "flooded" and "storm", so it may not be able to tell whether "bank" means **financial institution** or **riverbank**. BERT considers both sides of the word and better captures meaning.
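To make the difference concrete, the toy sketch below lists which tokens are available when encoding "bank" in each setting. The whitespace tokenization and the helper name `visible_context` are assumptions made for this illustration; they are not part of BERT.

```python
def visible_context(tokens, index, bidirectional):
    """Return the tokens a model may look at when encoding tokens[index]."""
    left = tokens[:index]
    # A left-to-right model cannot see anything after the current position.
    right = tokens[index + 1:] if bidirectional else []
    return left + right


tokens = "The bank was flooded after the storm".split()
bank = tokens.index("bank")

print("left-to-right:", visible_context(tokens, bank, bidirectional=False))
print("bidirectional:", visible_context(tokens, bank, bidirectional=True))
```

Only the bidirectional view contains "flooded" and "storm", the words that help settle which sense of "bank" is meant.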


---
#  **Introduction to BERT**

Before BERT, most language models could only read text in one direction, either from left to right or from right to left. But many tasks in Natural Language Processing (NLP), like **question answering** or **named entity recognition**, need context from **both directions** to fully understand the meaning of words and sentences.

Some earlier models, like **ELMo**, tried to fix this using two separate models, one reading left to right and one reading right to left. But this wasn't ideal, because each model still sees only one side of the context while it encodes a word.

That's where **BERT** comes in.  
BERT stands for **Bidirectional Encoder Representations from Transformers**.

It introduced a new way to train language models using:

1. **Masked Language Modeling (MLM)**: randomly hiding words and asking the model to guess them.
2. **Next Sentence Prediction (NSP)**: helping the model understand the relationship between two sentences.

BERT uses a **transformer encoder** that looks at both the left and right sides of a word at the same time. This is called a **bidirectional** approach and makes BERT much better at understanding context.

After pre-training on large amounts of text (like Wikipedia), BERT can be fine-tuned on smaller datasets for tasks like:

- Text classification  
- Paraphrase detection  
- Named entity recognition  
- Question answering

All you need is to add a small output layer on top of BERT and train it for your task.

<center>
<img src="https://i.postimg.cc/LsSVbTgj/image.png" width="500"/>

**Figure 1**: Different model directions:  
(a) Left-to-right (unidirectional)  
(b) Separate left-to-right and right-to-left models, combined afterwards (each still unidirectional)  
(c) True bidirectional (BERT's approach)
</center>






---

###  **BERT Achievements**

BERT made a big impact when it was released. It achieved **state-of-the-art results** on several NLP tasks (absolute improvement over the previous best result in parentheses):

- **GLUE score**: 80.5% (↑7.7%)
- **MultiNLI accuracy**: 86.7% (↑4.6%)
- **SQuAD v1.1 (F1 score)**: 93.2 (↑1.5)
- **SQuAD v2.0 (F1 score)**: 83.1 (↑5.1)

----

###  **Real-World Impact**

Google adopted BERT into its **Search engine** to better understand the meaning behind search queries. In 2020, Google announced that **almost every English search query** was being processed by BERT, and the model has since been expanded to many other languages.

---

##  **Working of BERT**

BERT is built on top of the **Transformer encoder architecture**, introduced in the paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). Unlike traditional models that read text from left to right (or right to left), BERT reads the **entire sequence in both directions at once** using **self-attention**. This allows it to understand the full context of a word based on the words before and after it.
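The following is a minimal sketch of single-head scaled dot-product self-attention in plain Python, meant only to show that every position is scored against every other position, left and right alike. It is a simplification under stated assumptions: the queries, keys, and values are the raw input vectors (real Transformer layers first apply learned projections and run several heads in parallel), and the toy vectors are made-up numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention with Q = K = V = vectors.

    Simplification: no learned query/key/value projections, one head only.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Score the query against every position, left and right alike.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all input vectors.
        outputs.append([sum(w * v[d] for w, v in zip(row, vectors))
                        for d in range(dim)])
    return outputs, weights


# Three toy 2-dimensional token vectors (made-up numbers).
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(toy)
for position, row in enumerate(weights):
    print(position, [round(w, 3) for w in row])
```

Each printed row is an attention distribution for one position. No entry is forced to zero, which is what distinguishes this from the masked (causal) attention used in left-to-right models.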

---



##  **Model Architecture**

BERT uses **multiple stacked Transformer encoder layers** to process text:

- **BERT-Base**: 12 layers, 768 hidden units, 12 attention heads
- **BERT-Large**: 24 layers, 1024 hidden units, 16 attention heads

Each layer contains:
- **Multi-head self-attention mechanism**
- **Feed-forward neural network**
- **Layer normalization and residual connections**

These components allow BERT to capture relationships between words in a sentence, regardless of their position or distance.
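As a rough check on what the two configurations above mean for model size, the sketch below counts weights and biases for a BERT-style encoder. Several values are assumptions not stated in this document: a 30,522-entry WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward inner size of 4 times the hidden size, and a pooling layer on top of [CLS]. The function name `bert_parameter_count` is hypothetical.

```python
def bert_parameter_count(layers, hidden, vocab=30522, max_positions=512,
                         segments=2, ffn_multiplier=4):
    """Count weights and biases of a BERT-style encoder (no task head)."""
    layer_norm = 2 * hidden  # one scale and one shift vector

    # Token, position, and segment embedding tables plus their LayerNorm.
    embeddings = (vocab + max_positions + segments) * hidden + layer_norm

    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers: hidden -> inner -> hidden.
    inner = ffn_multiplier * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden) + layer_norm

    pooler = hidden * hidden + hidden  # dense layer applied to [CLS]
    return embeddings + layers * (attention + feed_forward) + pooler


for name, layers, hidden in [("BERT-Base", 12, 768), ("BERT-Large", 24, 1024)]:
    millions = bert_parameter_count(layers, hidden) / 1e6
    print(f"{name}: about {millions:.0f}M parameters")
```

Under these assumptions the totals land close to the roughly 110M and 340M parameters reported for the two models. The number of attention heads does not appear in the formula because the heads split the hidden dimension rather than adding to it.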



###  **Attention Mechanism Visualization with BERTViz**

BERT uses a **self-attention mechanism** to understand the relationship between words in a sentence. This allows the model to focus on relevant words when processing each token.

To visualize how attention works in BERT, we can use [**BERTViz**](https://github.com/jessevig/bertviz), a visualization tool that shows attention patterns across layers and attention heads.

---

### **Example Input**

`[CLS] The rabbit quickly hopped [SEP] the turtle slowly crawled [SEP]`

Below is a **head view** visualization from BERTViz, which shows attention weights between tokens:

![Head View](https://raw.githubusercontent.com/jessevig/bertviz/master/images/head-view.gif)

---

### **Why Use BERTViz?**

- Understand which words the model attends to
- Gain insights into BERT's decision-making
- Useful for debugging and model interpretability


---

###  **Input Format**

In BERT, the term "sentence" refers to any span of contiguous text, not necessarily a linguistic sentence, and a "sequence" refers to the tokenized input, which may contain one sentence or two sentences packed together.

The input to BERT always follows a specific format:

- The first token is always **[CLS]** (used for classification tasks)
- Every sentence is followed by **[SEP]**, so when there are two sentences, **[SEP]** both separates them and ends the input
- The input also includes:
  - **Token embeddings**: representations of each word/subword
  - **Segment embeddings**: to distinguish sentence A from sentence B
  - **Position embeddings**: to keep track of token positions

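Example (a two-sentence input):

`[CLS] The rabbit quickly hopped [SEP] the turtle slowly crawled [SEP]`

For each token, these three embeddings are added together and the result is passed into the encoder layers.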
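The input layout described above can be sketched as follows. This is a simplified illustration: the tokens are already split into words (real BERT uses WordPiece subword tokenization), and the helper names `build_bert_input` and `sum_embeddings` are hypothetical.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble tokens, segment ids, and position ids for one BERT input."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # sentence A, with [CLS] and its [SEP]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def sum_embeddings(token_vec, segment_vec, position_vec):
    """Element-wise sum of the three embedding vectors for one token."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]


tokens, segment_ids, position_ids = build_bert_input(
    ["the", "rabbit", "quickly", "hopped"],
    ["the", "turtle", "slowly", "crawled"],
)
for row in zip(position_ids, segment_ids, tokens):
    print(row)
```

In a real model, each id indexes a learned embedding table, and the three looked-up vectors for a token are combined as in `sum_embeddings` before entering the first encoder layer.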



---
###  **Contextual Embeddings**

BERT creates **contextual embeddings**, meaning that the representation of a word depends on its surrounding words.

For example:
- "bank" in "river bank" will have a different embedding than "bank" in "savings bank".

This is in contrast to traditional word embeddings like Word2Vec or GloVe, where a word has the same embedding no matter where it appears.

Because BERT considers both the left and right context of a word, its embeddings are much more powerful and flexible.


----


###  **Outputs from BERT**

After processing the input through all encoder layers, BERT produces:

- A contextual embedding for **each token** in the input
- A special embedding for the **[CLS]** token that represents the entire sequence (useful for classification tasks)

These outputs can then be fine-tuned for various downstream NLP tasks such as sentiment analysis, named entity recognition, question answering, etc.



<center>
<img src="https://i.postimg.cc/28yLm4F5/image.png" width="600"/>

**Figure 2**: BERT Model Architecture, showing input tokens, special tokens like [CLS] and [SEP], and the stacked Transformer encoders.
</center>

---



#  **Pretraining**

The first step in BERT's process is **unsupervised pretraining**. The goal here is to teach the model a deep understanding of language by learning from a large amount of unlabeled text.

BERT uses two main training tasks during pretraining:

- **Masked Language Model (MLM)**  
- **Next Sentence Prediction (NSP)**

Let's take a quick look at what these mean.


* * *
## **1. Masked Language Model (MLM)**

Before we dive into MLM, try to fill in the blank in this sentence:

<center> **We went to the library to ____ books.** </center>

You might guess words like *read*, *study*, or *borrow*, but probably not *play*. This kind of test is called a **cloze test** (or occlusion test). To get the right answer, you need to understand the context and the meaning of words around the blank.

The **Masked Language Model** is based on this idea.

During BERT's pretraining, some words in the input are randomly selected and replaced with a special **[MASK]** token or other tokens. The model's job is to predict the original words that were masked out.

Specifically, 15% of the tokens in each sequence are randomly selected. Of these 15% selected tokens:

- **80%** are replaced with the `[MASK]` token.
- **10%** are replaced with a random token from the vocabulary.
- **10%** are left unchanged.

This strategy prevents the model from simply learning to predict `[MASK]` tokens and encourages it to rely on the surrounding context.
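The selection and replacement rule above can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: it selects each token independently with probability 0.15 rather than picking exactly 15% of the sequence, it assumes special tokens are never selected, and the tiny vocabulary and the function name `mask_tokens` are made up for the example.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply the 80/10/10 masking rule; return (corrupted tokens, labels).

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss can be restricted to the selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue  # special tokens and unselected tokens stay untouched
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


vocab = ["the", "man", "went", "to", "store", "dog", "park"]
sentence = ["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]"]
corrupted, labels = mask_tokens(sentence, vocab, seed=0)
print(corrupted)
print(labels)
```

Note that a selected position always gets a label, even when the token is left unchanged or swapped for a random one. The model therefore cannot tell from the input alone which positions it will be asked to predict.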

For example, consider the sentence:

<center> "The **man** went to the **store**." </center>

If we mask "store", the input might become:

<center> "The man went to the **[MASK]**." </center>

The model tries to guess the missing word based on the surrounding words.

Instead of predicting the entire sequence, BERT only predicts these masked words. It uses the hidden representations of the masked tokens and applies a softmax function over the vocabulary to predict the most likely word. The loss is only calculated for the masked tokens.
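A minimal sketch of a loss that counts only the masked positions is shown below. It assumes the model has already produced one score per vocabulary entry for every position; the toy three-word vocabulary, the numbers, and the function name `masked_lm_loss` are made up for illustration.

```python
import math


def masked_lm_loss(logits, labels):
    """Mean cross-entropy over positions whose label is not None.

    logits: one list of vocabulary scores per token position.
    labels: target vocabulary index per position, or None if not selected.
    """
    losses = []
    for scores, target in zip(logits, labels):
        if target is None:
            continue  # unselected positions contribute nothing to the loss
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])  # -log softmax(scores)[target]
    return sum(losses) / len(losses) if losses else 0.0


# Toy 3-word vocabulary; only the last position was selected (target index 2).
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0]]
labels = [None, None, 2]
print(round(masked_lm_loss(logits, labels), 4))
```

Changing the scores at the first two positions leaves the result unchanged, because those positions carry no label.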

<center>
<img src="https://i.postimg.cc/MHTH6SBL/image.png" width="500"/>

**Figure 3:** Masked Language Model (MLM), predicting masked words from context.
</center>

* * *

## **2.  Next Sentence Prediction (NSP)**

While objectives like MLM teach the model about individual words and their context, they don't directly capture the relationship between **sentences**. To address this, BERT uses a task called **Next Sentence Prediction** during pretraining.

This task is designed to help BERT understand the relationship between two sentences by training it on a **binary classification problem**. For each pre-training example, two sentences, Sentence A and Sentence B, are constructed.

- In 50% of the examples, Sentence B **actually follows** Sentence A in the original text. These pairs are labeled as **IsNext**.
- In the other 50%, Sentence B is a **random sentence** taken from a different document. These pairs are labeled as **NotNext**.

BERT's objective is to learn to predict whether Sentence B logically follows Sentence A based on the context provided by both sentences.

To perform this prediction, the input to BERT is formatted as: `[CLS] Sentence A [SEP] Sentence B [SEP]`. The special **[CLS]** token's final hidden state is used as the aggregate representation of the entire sequence pair. This vector is then passed through a simple classifier (a linear layer followed by a softmax function) to predict either **IsNext** or **NotNext**. The model is trained to minimize the cross-entropy loss on this binary classification task.
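The construction of NSP training pairs can be sketched as follows. This is an illustrative sketch, not the original preprocessing code: the example sentences, the two-document corpus, and the helper names `make_nsp_example` and `format_pair` are assumptions.

```python
import random


def make_nsp_example(documents, doc_index, sent_index, rng):
    """Build one (sentence_a, sentence_b, label) training example.

    documents: list of documents, each a list of sentences.
    Assumes sent_index is not the last sentence of its document and that
    at least two documents are available.
    """
    sentence_a = documents[doc_index][sent_index]
    if rng.random() < 0.5:
        # Positive case: the sentence that really follows A.
        return sentence_a, documents[doc_index][sent_index + 1], "IsNext"
    # Negative case: a random sentence from a different document.
    other = rng.choice([i for i in range(len(documents)) if i != doc_index])
    return sentence_a, rng.choice(documents[other]), "NotNext"


def format_pair(sentence_a, sentence_b):
    """Render the pair in the [CLS] A [SEP] B [SEP] layout."""
    return f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]"


docs = [
    ["The man went to the store.", "He bought a gallon of milk."],
    ["Penguins are flightless birds.", "They live in the Southern Hemisphere."],
]
rng = random.Random(0)
for _ in range(4):
    a, b, label = make_nsp_example(docs, 0, 0, rng)
    print(label, "|", format_pair(a, b))
```

The label is what the classifier on top of the [CLS] vector is trained to predict for each formatted pair.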

<center>
<figure>
<img src="https://i.postimg.cc/43QxpdbM/image.png" width="550"/>
<figcaption>(a) IsNext example: Sentence B follows Sentence A.</figcaption>
</figure>
<figure>
<img src="https://i.postimg.cc/sXnf0NL4/image.png" width="550"/>
<figcaption>(b) NotNext example: Sentence B is random, unrelated to Sentence A.</figcaption>
</figure>

**Figure 4:** Next Sentence Prediction (NSP): (a) IsNext case, (b) NotNext case.
</center>

* * *
## **Finetuning**

One of BERT's most powerful features is its ability to **transfer the knowledge** it learned during pretraining to many different NLP tasks quickly and effectively. This process is called **finetuning**.

We start with a **pre-trained BERT model** and then train it a little more on a specific task, like sentiment analysis, named entity recognition (NER), or question answering. Sometimes, we add a small task-specific layer on top of BERT's output to help with that task.

<center>
<img src="https://i.postimg.cc/ZYMK8XpT/image.png" width="750"/>

**Figure 5:** BERT Framework: (a) Pretraining on MLM & NSP, (b) Finetuning for different NLP tasks (e.g., MNLI, NER, SQuAD).
</center>

During finetuning, **all layers** of BERT are trained together with this task-specific layer, usually for just a few epochs. This allows the model to quickly adapt and perform well on the new task.

---
## **Handling [MASK] during Fine-tuning**

It's important to note that the `[MASK]` token, used during the MLM pre-training task, is **not** used during fine-tuning. The model has already learned rich contextual representations of words by predicting masked tokens during pre-training. In the fine-tuning phase, these learned representations (the hidden states of the original tokens) are used as input to the task-specific output layer for the downstream task. The model no longer performs the masking and prediction of individual words; instead, it uses the full sequence context to solve the target task.

----
### **Example: Finetuning BERT for Named Entity Recognition (NER)**

In the NER task, BERT produces contextual embeddings for every token. These token embeddings are passed through a **feed-forward network (FFN)** followed by a **softmax layer** to classify each token into categories like person, location, organization, or none.

<center>
<img src="https://drive.google.com/uc?export=view&id=18G20zWoN8nJTMMlTCM--ojkSJfLNFRe4" width="700"/>

**Figure 6:** Fine-tuning BERT for the NER task
</center>


----

###  **Finetuning for Different NLP Tasks**

The input format during finetuning depends on the task but often follows the same structure as in pretraining:

- The two sentences (sentence A and sentence B) from pretraining can represent:
  - Sentence pairs in **paraphrasing**
  - Hypothesis-premise pairs in **natural language inference (entailment)**
  - Question and passage pairs in **question answering**
  - Or just a single sentence paired with a dummy input in **text classification** or **sequence tagging**

The output depends on the task type:

- For **token-level tasks** (e.g., NER, POS tagging, question answering), the model uses the final hidden vectors for each token and feeds them into an output layer.
- For **sentence-level classification tasks** (e.g., sentiment analysis, entailment), the model uses the hidden vector corresponding to the **[CLS]** token and feeds it into the output layer.
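The two output styles above differ only in which hidden vectors reach the output layer. The sketch below shows both with a single dense layer followed by softmax. All numbers are made up, the weights are untrained, and the function names `sequence_head` and `token_head` are hypothetical; a real setup would learn these weights during finetuning.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vector, weights, bias):
    """Dense layer: one output score per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def sequence_head(hidden_states, weights, bias):
    """Sentence-level task: classify from the [CLS] vector (position 0)."""
    return softmax(linear(hidden_states[0], weights, bias))


def token_head(hidden_states, weights, bias):
    """Token-level task: classify every position independently."""
    return [softmax(linear(h, weights, bias)) for h in hidden_states]


# Made-up final hidden states for a 4-token input (hidden size 2).
hidden_states = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.9], [0.0, 0.1]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # 2 labels x hidden size 2, untrained
bias = [0.0, 0.0]

print(sequence_head(hidden_states, weights, bias))    # one label distribution
print(len(token_head(hidden_states, weights, bias)))  # one distribution per token
```

The same encoder output feeds both heads, which is why switching tasks mostly means swapping this last layer.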


This flexible finetuning approach makes BERT suitable for a wide range of NLP problems, often achieving state-of-the-art results with relatively little additional training.


* * *
## **Effect of Removing NSP: Ablation Study on MNLI and QNLI**

The **Next Sentence Prediction (NSP)** task was introduced in the original BERT paper ([Devlin et al., 2018](https://arxiv.org/pdf/1810.04805.pdf)) to help the model understand relationships between sentence pairs. However, later studies and experiments questioned its necessity for achieving state-of-the-art results on various downstream tasks.

To investigate the contribution of the NSP task, researchers conducted **ablation studies**. An ablation study involves removing a specific component (in this case, the NSP pre-training objective) from a model to assess its impact on performance.

The original BERT paper presented results comparing models trained **with** and **without** the NSP objective on several tasks, including **Natural Language Inference (NLI)** tasks like **MNLI** and **QNLI**.

### **What are MNLI and QNLI?**

*   **MNLI (Multi-Genre Natural Language Inference)**: A dataset for determining the relationship between a premise sentence and a hypothesis sentence. The task is to classify the relationship as **entailment**, **contradiction**, or **neutral**. This requires the model to understand how sentence pairs relate to each other logically.
*   **QNLI (Question-answering Natural Language Inference)**: An NLI task derived from question-answering data. Given a question and a potential answer sentence from a text, the model must determine if the sentence contains the answer (entailment) or not (not entailment). This task also relies on understanding the relationship between a question and a potential answer.

These tasks are good benchmarks for evaluating a model's ability to understand sentence-pair relationships, which is what the NSP task was designed to improve.

### **Ablation Study Results: With vs. Without NSP**

The original BERT paper (Table 5) reported the following development-set results for BERT-Base:

| Model Variant           | MNLI-m Accuracy (%) | QNLI Accuracy (%) |
| ----------------------- | ------------------- | ----------------- |
| BERT-Base (with NSP)    | 84.4                | 88.4              |
| BERT-Base (without NSP) | 83.9                | 84.9              |

### **Interpretation and Why NSP is Sometimes Removed**

In this ablation, removing NSP lowered accuracy by 0.5 points on MNLI and by 3.5 points on QNLI, so the original authors concluded that NSP helps on sentence-pair tasks.

Later work questioned whether that gain comes from the NSP objective itself. **RoBERTa** re-ran the comparison with inputs made of full, contiguous text spans and found that dropping the NSP loss matched or slightly improved downstream performance. It therefore removed the NSP pre-training task completely and focused on more robust MLM training with more data, larger batches, and longer sequences. RoBERTa achieved state-of-the-art results across many benchmarks without NSP, which suggests that explicit sentence-pair pre-training might not be strictly necessary if the model learns strong contextual representations through MLM.

In summary, NSP is sometimes removed because later empirical evidence showed that its benefit depends on the training setup, and a well-tuned MLM objective on its own proved to be an effective pre-training strategy for many downstream NLP tasks.




###  **Loading BERT from Hugging Face and Performing MASK Filling**

One of BERT's pre-training tasks is **Masked Language Modeling (MLM)**: predicting the masked word in a sentence. The released pre-trained checkpoints include the MLM output layer, so they can fill in masked words without any fine-tuning.

With the Hugging Face 🤗 Transformers library, it's very easy to use BERT for such tasks.

---

###  Step 1: Install Transformers Library

```bash
# The pipeline also needs a deep learning backend; PyTorch is used here
pip install transformers torch
```

---

###  Step 2: Load Pretrained BERT Model and Tokenizer

```python
from transformers import pipeline

# Load BERT with a fill-mask pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")
```

---

###  Step 3: MASK Filling Example (Masked Language Modeling)

```python
# Provide a sentence with a [MASK] token
result = unmasker("The capital of France is [MASK].")

# Print top predictions
for res in result:
    print(f"{res['sequence']} (score: {res['score']:.4f})")
```

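####  Output (illustrative):

Because `bert-base-uncased` lowercases its input, the returned sequences are lowercase. The lines below only show the shape of the output; the actual ranking and scores come from the model.

```
the capital of france is paris. (score: ...)
the capital of france is lyon. (score: ...)
the capital of france is marseille. (score: ...)
...
```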

---

###  **What's Going On?**

BERT tries to predict the masked word using context from both the **left and right** of the mask token. In this example, it correctly predicts "**Paris**" as the missing word.

---

###  **Other Tasks BERT Can Do (using Hugging Face pipelines)**

| Task                           | Pipeline Name           | Example Use                            |
| ------------------------------ | ----------------------- | -------------------------------------- |
| Masked word filling            | `"fill-mask"`           | Predict `[MASK]` in a sentence         |
| Text classification            | `"text-classification"` | Sentiment analysis or intent detection |
| Named Entity Recognition (NER) | `"ner"`                 | Identify people, places, etc.          |
| Question Answering             | `"question-answering"`  | Answer questions from context          |

---

###  **Conclusion**

Hugging Face Transformers make it very easy to experiment with BERT. You can quickly load the model and perform **masked language modeling** and many other NLP tasks without much setup.



---
## **Influence of BERT**

Since its release, BERT has revolutionized Natural Language Processing (NLP) and inspired many follow-up models and applications.

### Popular BERT Variants and Domain Adaptations

- **SciBERT**  
  Pre-trained on a large scientific publications corpus, SciBERT improves performance on scientific NLP tasks like paper classification and entity recognition.

- **BioBERT**  
  Trained on vast biomedical texts, BioBERT excels in biomedical named entity recognition, relation extraction, and question answering, outperforming previous models in the biomedical domain.

- **RoBERTa**  
  A robustly optimized BERT variant that trains longer with more data and removes the Next Sentence Prediction (NSP) task, achieving better results on many benchmarks.

- **ALBERT**  
  A "lite" version of BERT designed to reduce model size and increase training speed, using parameter-sharing and factorized embeddings.

- **DistilBERT**  
  A smaller, faster version of BERT that retains much of its performance but is more efficient for deployment.

- **ViLBERT**  
  Extends BERT to a **multi-modal** model combining image and text understanding with two separate streams, enabling joint reasoning over vision and language.

- **ClinicalBERT**  
  Adapted for clinical notes and healthcare data, improving medical text understanding.


----
### **Real-World Applications of BERT**

- **Search Engines**  
  Google uses BERT to better understand search queries, improving the relevance of results.

- **Chatbots and Virtual Assistants**  
  BERT powers natural and context-aware conversations.

- **Text Summarization and Translation**  
  BERT-based models help generate more accurate summaries and translations.

- **Sentiment Analysis**  
  Widely used in social media monitoring, customer feedback analysis, and brand management.



###  **Limitations and Challenges**

- **Resource Intensive**  
  Training BERT requires large computational power and memory.

- **Input Length Limit**  
  BERT can only process input sequences up to a certain length (usually 512 tokens), which limits handling very long documents.

- **Inference Speed**  
  Large models can be slow for real-time applications, prompting research into lighter versions.
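One common workaround for the input length limit is to split a long document into overlapping windows and run BERT on each window separately. The sketch below illustrates the idea under stated assumptions: the overlap of 128 tokens is an arbitrary choice, the function name `chunk_tokens` is hypothetical, and how per-window predictions are merged afterwards depends on the task.

```python
def chunk_tokens(tokens, max_len=512, stride=128):
    """Split a long token list into overlapping windows that fit BERT.

    Two slots per window are reserved for [CLS] and [SEP]; consecutive
    windows share `stride` tokens so context at the borders is not lost.
    """
    body = max_len - 2
    if not 0 <= stride < body:
        raise ValueError("stride must be non-negative and smaller than max_len - 2")
    windows, start = [], 0
    while True:
        piece = tokens[start:start + body]
        windows.append(["[CLS]"] + piece + ["[SEP]"])
        if start + body >= len(tokens):
            return windows
        start += body - stride


document = [f"tok{i}" for i in range(1200)]
for window in chunk_tokens(document):
    print(len(window), window[1], window[-2])
```

Each printed line shows a window's length together with its first and last content token, which makes the overlap between neighbouring windows visible.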



###  **Future Directions**

- **Efficient Transformers**  
  Models like Longformer and Performer address BERT's input length and speed limitations by optimizing attention mechanisms.

- **Multimodal Learning**  
  Combining text with images, audio, or video to build richer understanding (e.g., extensions of ViLBERT).

- **Self-Supervised Learning Advances**  
  New pretraining objectives and architectures continue to improve language understanding.

---

##  **Key Takeaways**

- BERT generates **contextual embeddings** for each token by considering the entire input bidirectionally.

- It is based on the **Transformer encoder architecture**.

- BERT's framework has two main phases:
  1. **Pretraining** on large unlabeled text corpora using **Masked Language Modeling (MLM)** and **Next Sentence Prediction (NSP)**.
  2. **Finetuning** on specific downstream NLP tasks by adding small task-specific layers.

- Variants like RoBERTa, ALBERT, and DistilBERT improve performance, efficiency, and scalability.

- BERT has transformed many real-world applications, including search, chatbots, summarization, and biomedical NLP.

- Despite its success, BERT faces challenges like high computational cost and input length limits, which ongoing research aims to solve.



---
# References

- Devlin J., Chang M., Lee K., Toutanova K. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL, https://arxiv.org/pdf/1810.04805.pdf

- Vaswani A. et al. (2017), Attention Is All You Need, NIPS, https://arxiv.org/abs/1706.03762
    - See Section 3, "Model Architecture" (page 2), for details on the encoder.

- Nayak P. (2019), Understanding searches better than ever before, https://blog.google/products/search/search-language-understanding-bert/

