# Embeddings

The **Embedding Layer** is the bridge between raw integers (the token IDs produced by a tokenizer) and the vector space in which the model operates.\
In simpler NLP models (like Word2Vec), this is a single lookup table. In BERT, it is the sum of three distinct embeddings:

1.  **Token Embeddings:** The meaning of the word itself (e.g., "bank" $\to$ vector).
2.  **Position Embeddings:** The position in the sequence (e.g., Position 0 vs Position 5).
    * *Why?* The Transformer's self-attention is, strictly speaking, "permutation equivariant": it has no inherent sense of order, so shuffling the input tokens only shuffles the outputs in the same way. If we didn't add this, "Man bites dog" and "Dog bites man" would yield the same set of token vectors.
3.  **Token Type (Segment) Embeddings:** Indicates which sentence the token belongs to.
    * *Why?* BERT is pre-trained on sentence pairs. Tokens in `Sentence A` get type ID 0, tokens in `Sentence B` get type ID 1.
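
The order-blindness described under Position Embeddings can be checked numerically. The sketch below is an illustration only: `self_attention` is a hypothetical single-head attention with identity projections (not the attention module of this project), and the token and position vectors are random stand-ins.

```python
import torch

torch.manual_seed(0)

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Toy single-head scaled dot-product self-attention (identity projections)."""
    scores = x @ x.T / x.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ x

tokens = torch.randn(3, 8)        # stand-ins for "man", "bites", "dog"
perm = torch.tensor([2, 1, 0])    # reversed order: "dog", "bites", "man"

# Without position information, permuting the input just permutes the output.
out = self_attention(tokens)
out_perm = self_attention(tokens[perm])
same_up_to_order = torch.allclose(out[perm], out_perm, atol=1e-6)

# With position vectors added, the two word orders are no longer equivalent.
positions = torch.randn(3, 8)     # hypothetical position embeddings
out_pos = self_attention(tokens + positions)
out_pos_perm = self_attention(tokens[perm] + positions)
differs_with_positions = not torch.allclose(out_pos[perm], out_pos_perm, atol=1e-6)

print(same_up_to_order, differs_with_positions)
```

Each token's vector is the same in both orderings until position vectors are added, which is why the position signal has to be injected at the embedding stage.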

The combined representation for a token (which the layer then passes through LayerNorm and dropout) is:
$$ E_{\text{final}} = E_{\text{token}} + E_{\text{position}} + E_{\text{type}} $$

We use element-wise **addition**, not concatenation. This keeps the vector size constant (e.g., 768) throughout the model.
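
To make the addition-versus-concatenation point concrete, here is a minimal sketch built from three bare `nn.Embedding` tables. The sizes and the example IDs are illustrative assumptions, and this is not the project's `Embeddings` module, only the summation step in isolation.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, max_positions, type_vocab_size, hidden_size = 120, 50, 2, 16

token_emb = nn.Embedding(vocab_size, hidden_size)
position_emb = nn.Embedding(max_positions, hidden_size)
type_emb = nn.Embedding(type_vocab_size, hidden_size)

input_ids = torch.tensor([[5, 17, 42, 8]])       # (batch=1, seq=4), made-up IDs
position_ids = torch.arange(4).unsqueeze(0)      # [[0, 1, 2, 3]]
token_type_ids = torch.tensor([[0, 0, 1, 1]])    # first two tokens in A, last two in B

parts = [token_emb(input_ids), position_emb(position_ids), type_emb(token_type_ids)]

summed = parts[0] + parts[1] + parts[2]          # hidden size stays 16
concatenated = torch.cat(parts, dim=-1)          # hidden size grows to 48

print(summed.shape)        # torch.Size([1, 4, 16])
print(concatenated.shape)  # torch.Size([1, 4, 48])
```

With concatenation, every downstream layer would need to accept a vector three times as wide; with addition, the hidden size chosen for the token table carries through unchanged.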

Let's start by adding _src_ to the Python system path, in case your notebook is not already running with it on the path.

In [1]:
import sys
from pathlib import Path
sys.path.append((Path('').resolve().parent / 'src').as_posix())

### Initializing the Layer
Let's verify the structure of the module. You will see the three distinct `nn.Embedding` layers.

In [2]:
import torch
from torch import nn
from modules.embeddings import Embeddings
from settings import EmbeddingSettings

settings = EmbeddingSettings(
    vocab_size=120,
    hidden_size=16,           # Small vector size for readability
    max_position_embeddings=50,
    type_vocab_size=2,
    hidden_dropout_prob=0.0   # Disable dropout for deterministic results
)

embeddings_layer = Embeddings(settings)
print(embeddings_layer)


print(f"\n{'Token Embeddings:':<25} {str(embeddings_layer.word_embeddings.weight.shape):<20} (Vocab x Hidden)")
print(f"{'Positional Embeddings:':<25} {str(embeddings_layer.position_embeddings.weight.shape):<20} (MaxPos x Hidden)")
print(f"{'Type Embeddings:':<25} {str(embeddings_layer.token_type_embeddings.weight.shape):<20} (Types x Hidden)")

Embeddings(
  (word_embeddings): Embedding(120, 16, padding_idx=0)
  (position_embeddings): Embedding(50, 16)
  (token_type_embeddings): Embedding(2, 16)
  (LayerNorm): LayerNorm((16,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.0, inplace=False)
)

Token Embeddings:         torch.Size([120, 16]) (Vocab x Hidden)
Positional Embeddings:    torch.Size([50, 16]) (MaxPos x Hidden)
Type Embeddings:          torch.Size([2, 16])  (Types x Hidden)


### The Forward Pass (Summation)

We will pass a batch of sentences.
* **Batch Size:** 2
* **Max Sequence Length in batch:** 5
* **Sentences:** `[10, 20, 30]` and `[25, 40, 42, 22, 33]`

The first sentence is padded with the `[PAD]` ID (0) up to length 5, resulting in `[10, 20, 30, 0, 0]`.

In [3]:
input_ids = torch.tensor([[10, 20, 30, 0, 0], [25, 40, 42, 22, 33]])
# Note: We are NOT passing position_ids manually. The model generates them automatically.
output = embeddings_layer(input_ids)
print(f"Input Shape:  {input_ids.shape}")
print(f"Output Shape: {output.shape}")
# Expected: (2, 5, 16) -> (Batch, Seq, Hidden)
assert output.shape == (2, 5, 16)

Input Shape:  torch.Size([2, 5])
Output Shape: torch.Size([2, 5, 16])


This tensor of shape (B, S, H), i.e. (Batch, Seq, Hidden), is what enters the Attention Layer.
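
Since no `position_ids` were passed in, the layer has to build them itself. The sketch below shows one common way a BERT-style layer might do this; it is an assumption about the general technique, not a copy of what `Embeddings` does internally, and defaulting the token type IDs to zeros is likewise an assumption.

```python
import torch

input_ids = torch.tensor([[10, 20, 30, 0, 0], [25, 40, 42, 22, 33]])
batch_size, seq_len = input_ids.shape

# One row [0, 1, ..., seq_len - 1], broadcast to every sequence in the batch.
position_ids = torch.arange(seq_len).unsqueeze(0).expand(batch_size, seq_len)

# Assumed default: treat every token as belonging to Sentence A.
token_type_ids = torch.zeros_like(input_ids)

print(position_ids)
# tensor([[0, 1, 2, 3, 4],
#         [0, 1, 2, 3, 4]])
```

Under this scheme the padded slots still receive position IDs 3 and 4, so their output vectors are not zero even when the `[PAD]` token embedding is.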


### Padding Index

As discussed, we want the token embedding for the `[PAD]` token (ID 0) to be **strictly zero**, and to stay zero during training, so that padding adds nothing from the token table. However, for position embeddings, ID 0 means "the first position", so it *must* be learned.

Let's verify that the model blocks gradient updates to the word embedding of `[PAD]` but still learns position `0`.\
We will switch the layer to training mode, run a backward pass, and then inspect the gradients.

In [4]:
embeddings_layer.train()

# A single-token batch: word ID 0 ([PAD]) at position 0 (the first slot, assigned implicitly).
test_input = torch.tensor([[0]])

out = embeddings_layer(test_input)
loss = out.abs().sum()  # Dummy scalar loss, just so there is something to backpropagate.
loss.backward()

grad_word_0 = embeddings_layer.word_embeddings.weight.grad[0]
grad_pos_0 = embeddings_layer.position_embeddings.weight.grad[0]

print("Gradient for Word ID 0 ([PAD]):")
print(grad_word_0)
print("\nGradient for Position ID 0 (Start):")
print(grad_pos_0)

if torch.sum(torch.abs(grad_word_0)) == 0:
    print("\n✅ Correct: Word ID 0 is frozen (all zeros).")
if torch.sum(torch.abs(grad_pos_0)) > 0:
    print("✅ Correct: Position 0 is learning.")

Gradient for Word ID 0 ([PAD]):
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Gradient for Position ID 0 (Start):
tensor([-0.1768,  0.1899, -0.4213, -0.4927, -0.2531, -0.4156,  0.2297,  0.4303,
         0.5700, -0.0317, -0.3234,  0.7812,  0.7491, -0.3775, -0.3158, -0.1423])

✅ Correct: Word ID 0 is frozen (all zeros).
✅ Correct: Position 0 is learning.
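
The frozen `[PAD]` row comes from the `padding_idx=0` argument visible in the printed module above. As a minimal sketch of that mechanism in isolation, the same behaviour can be reproduced with a bare `nn.Embedding`; the table size and the example IDs are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)

# The padding row is initialised to zeros.
pad_row_is_zero = bool(torch.all(emb.weight[0] == 0))

# Two real tokens followed by two [PAD] tokens.
ids = torch.tensor([[3, 7, 0, 0]])
emb(ids).sum().backward()

pad_grad = emb.weight.grad[0]    # stays all zeros, so the row is never updated
real_grad = emb.weight.grad[3]   # all ones: d(sum)/d(row 3)

print(pad_row_is_zero)
print(pad_grad)
print(real_grad)
```

Because the row starts at zero and never receives a gradient, an optimizer step leaves it at zero (as long as nothing else, such as weight decay on a non-zero row or a manual re-initialisation, overwrites it).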
