## BERT Embeddings

### 🧠 What Are BERT Embeddings?

BERT (Bidirectional Encoder Representations from Transformers) embeddings are contextual word representations generated by the BERT model. Unlike Word2Vec, which gives a single vector per word regardless of context, **BERT generates different vectors for the same word depending on its usage in a sentence.**

For example:

- “He sat by the bank of the river.”
- “She went to the bank to deposit money.”

BERT produces a different embedding for “bank” in each sentence, because each token's vector is computed from the words around it.


### ⚙️ How Do BERT Embeddings Work?

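- **1. Tokenization**

    BERT uses WordPiece tokenization, which breaks words that are not in its vocabulary into subword pieces and marks continuation pieces with “##” (e.g., “playing” could be split into “play” + “##ing”; frequent words often remain a single token).

    Special tokens are added:
    - [CLS] at the beginning (its final vector is used for classification tasks)
    - [SEP] at the end of each sentence, which also separates the two sentences of a pair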
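
The tokenization step can be sketched with a toy greedy longest-match-first splitter. This is only an illustration: `TOY_VOCAB`, `wordpiece`, and `encode` are hypothetical names, and the vocabulary is made up rather than taken from BERT's real WordPiece vocabulary.

```python
# Toy greedy longest-match-first WordPiece tokenizer.
# TOY_VOCAB is a made-up vocabulary for illustration, not BERT's real one.
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "she", "is", "play", "##ing", "##ed"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


def encode(sentence, vocab):
    """Lowercase, split on whitespace, apply WordPiece, and add [CLS]/[SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


tokens = encode("She is playing", TOY_VOCAB)
print(tokens)  # ['[CLS]', 'she', 'is', 'play', '##ing', '[SEP]']
```

With this toy vocabulary, “playing” is split because only “play” and “##ing” are present. A real tokenizer may keep the word whole if its vocabulary contains it.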

- **2. Input Representation**

    Each token is represented as the sum of three embeddings:

    - Token embeddings (from the WordPiece vocabulary)
    - Segment embeddings (to distinguish sentence A from sentence B)
    - Position embeddings (to encode each token's position in the sequence)

    So the input to BERT for one sequence is a matrix of shape [sequence_length x hidden_size]. For BERT-base, hidden_size = 768.
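
The sum of the three embeddings can be sketched with NumPy lookup tables. The tables here are randomly initialized stand-ins for BERT's learned matrices, and the token IDs, `vocab_size`, and `max_positions` are made-up toy values; only `hidden_size = 768` comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# hidden_size matches BERT-base; the other sizes are toy values for illustration.
vocab_size, max_positions, num_segments, hidden_size = 100, 16, 2, 768

# Randomly initialized lookup tables stand in for BERT's learned embedding matrices.
token_table = rng.normal(size=(vocab_size, hidden_size))
segment_table = rng.normal(size=(num_segments, hidden_size))
position_table = rng.normal(size=(max_positions, hidden_size))

# Hypothetical token IDs for "[CLS] sentence A [SEP] sentence B [SEP]".
token_ids = np.array([1, 17, 42, 2, 58, 9, 2])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])  # 0 = sentence A, 1 = sentence B
position_ids = np.arange(len(token_ids))       # 0, 1, 2, ...

# Each row is the element-wise sum of the three embeddings for one token.
input_embeddings = (
    token_table[token_ids]
    + segment_table[segment_ids]
    + position_table[position_ids]
)

print(input_embeddings.shape)  # (7, 768)
```

In the actual model, this sum is also passed through layer normalization and dropout before it reaches the first encoder layer. The sketch leaves that out.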

- **3. Transformer Layers (Encoder Stack)**

    BERT processes the entire sentence at once with a stack of bidirectional transformer encoder layers (12 in BERT-base, 24 in BERT-large), so each token sees both its left and right context.
    - Each layer has:
        - Multi-head self-attention
        - Feed-forward neural network
        - Layer normalization and residual connections

    The self-attention mechanism allows each token to attend to every other token in the sequence, so its representation reflects the words around it.
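
A minimal single-head version of scaled dot-product self-attention can be sketched in NumPy. It is a simplification of a real encoder layer: the weights are random rather than trained, and multiple heads, the feed-forward network, residual connections, and layer normalization are omitted. The function name `self_attention` and the toy size `d = 8` are assumptions for illustration.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a [seq_len, d] input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # [seq_len, seq_len]
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8  # toy hidden size
# Random projection matrices, scaled so the attention scores stay moderate.
w_q, w_k, w_v = (rng.normal(scale=1 / np.sqrt(d), size=(d, d)) for _ in range(3))

bank = rng.normal(size=d)  # the same input vector is used for "bank" in both sequences
river_context = np.vstack([bank, rng.normal(size=(3, d))])
money_context = np.vstack([bank, rng.normal(size=(3, d))])

out_river, weights_river = self_attention(river_context, w_q, w_k, w_v)
out_money, _ = self_attention(money_context, w_q, w_k, w_v)

print(weights_river.sum(axis=-1))                # attention weights per token sum to 1
print(np.abs(out_river[0] - out_money[0]).max()) # gap between the two "bank" outputs
```

Both sequences start with the identical “bank” vector, yet its output row differs between them because the attention output mixes in the other tokens. This is the mechanism behind the context-dependent embeddings described above.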

- **4. Contextual Embeddings**

    - After passing through all layers, each token has a contextualized embedding.
    - For example, the word “bank” will have different embeddings in:
        - “He sat by the river bank”
        - “She deposited money in the bank”

- **5. Output**

    Each token gets a 768-dimensional vector (for BERT-base). These vectors can also be pooled into a single sentence embedding.


### Types of BERT Embeddings You Can Extract

| Embedding Type | Description |
|----------------|-------------|
| Token-level | Contextualized embedding for each token in the input |
| [CLS] token | Embedding of the [CLS] token, often used as a sentence-level representation |
| Mean pooling | Average of all token embeddings (excluding special tokens) |
| Max pooling | Element-wise maximum over all token embeddings, taken per dimension |
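
The four extraction strategies in the table can be compared on a toy array of hidden states. The values, the 3-dimensional hidden size, and the name `real_token_mask` are made up for illustration; a real BERT-base output would have 768 dimensions per token.

```python
import numpy as np

# Toy hidden states for "[CLS] a b [SEP] [PAD]" with hidden_size = 3.
hidden = np.array([
    [9.0, 9.0, 9.0],  # [CLS]
    [1.0, 4.0, 0.0],  # token "a"
    [3.0, 2.0, 2.0],  # token "b"
    [7.0, 7.0, 7.0],  # [SEP]
    [5.0, 5.0, 5.0],  # [PAD]
])
# True marks real tokens; special and padding tokens are left out of the pooling.
real_token_mask = np.array([False, True, True, False, False])

token_level = hidden                                # one vector per token
cls_embedding = hidden[0]                           # vector at the [CLS] position
mean_pooled = hidden[real_token_mask].mean(axis=0)  # average over real tokens
max_pooled = hidden[real_token_mask].max(axis=0)    # per-dimension maximum

print(cls_embedding)  # [9. 9. 9.]
print(mean_pooled)    # [2. 3. 1.]
print(max_pooled)     # [3. 4. 2.]
```

Masking matters here: without it, the [CLS], [SEP], and padding rows would pull the mean and the maximum toward values that carry no word content.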


##### Python Code Example: Sentence Embedding with Mean Pooling

    from transformers import BertTokenizer, BertModel
    import torch

    # Load pre-trained model and tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    # Tokenize input and also request a mask that flags special tokens ([CLS], [SEP])
    sentence = "The bank will not be open tomorrow."
    inputs = tokenizer(sentence, return_tensors='pt', return_special_tokens_mask=True)
    special_tokens_mask = inputs.pop('special_tokens_mask')  # not a model input

    # Get hidden states
    with torch.no_grad():
        outputs = model(**inputs)
        last_hidden_state = outputs.last_hidden_state  # [1, seq_len, 768]

    # Mean pooling over real tokens only (excluding [CLS], [SEP], and padding)
    pooling_mask = inputs['attention_mask'] * (1 - special_tokens_mask)
    mask_expanded = pooling_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    sum_embeddings = torch.sum(last_hidden_state * mask_expanded, 1)
    sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)  # avoid division by zero
    sentence_embedding = sum_embeddings / sum_mask  # [1, 768]

### Why Use BERT Embeddings?
- ✅ Context-aware: Represents the same word differently depending on its sense (polysemy) and syntactic role
- ✅ Powerful for downstream tasks: Classification, QA, NER, etc.
- ✅ Transferable: Pretrained on massive corpora, useful even with little labeled data
