# Encoder Layer 

Now that we have the components (**Embeddings**, **Attention**, and **Feed-Forward**), we can combine them into the fundamental building block of BERT: the **Encoder Layer**.

A single Encoder Layer does two things to the input vectors:
1.  **Contextualize:** The Attention mechanism mixes information between tokens ("Socializing").
2.  **Process:** The Feed-Forward network processes each token individually ("Thinking").
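
To make the difference between the two steps concrete, here is a small sketch. It uses a simplified single-head attention with no learned projections and a toy hidden size of 8; both are assumptions for illustration, not the notebook's `MultiHeadAttention`. We change only token 0 and check whether the outputs for the *other* tokens move.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden = 8  # toy size for illustration (BERT Base uses 768)

x = torch.randn(1, 4, hidden)   # (Batch, Seq_Len, Hidden)
x_perturbed = x.clone()
x_perturbed[0, 0] += 1.0        # modify token 0 only

# Position-wise feed-forward: the same weights applied to each token independently.
ffn = nn.Sequential(
    nn.Linear(hidden, 4 * hidden),
    nn.GELU(),
    nn.Linear(4 * hidden, hidden),
)

def self_attention(t: torch.Tensor) -> torch.Tensor:
    """Simplified scaled dot-product self-attention (no learned Q/K/V projections)."""
    scores = t @ t.transpose(-2, -1) / hidden ** 0.5
    return torch.softmax(scores, dim=-1) @ t

with torch.no_grad():
    # Compare the outputs of tokens 1..3 before and after perturbing token 0.
    ffn_changed = not torch.allclose(ffn(x)[0, 1:], ffn(x_perturbed)[0, 1:])
    attn_changed = not torch.allclose(
        self_attention(x)[0, 1:], self_attention(x_perturbed)[0, 1:]
    )

print(f"Feed-forward outputs of other tokens changed: {ffn_changed}")
print(f"Attention outputs of other tokens changed:    {attn_changed}")
```

The feed-forward outputs for tokens 1 to 3 stay the same because each token is processed on its own, while the attention outputs shift because every token reads from token 0.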

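Crucially, each of these two steps is wrapped in a **Residual Connection** (skip connection) followed by **Layer Normalization**: the step's output is added back to its input, and the sum is then normalized.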
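
The residual-plus-normalization pattern can be sketched roughly as follows. `AddAndNorm` is a hypothetical helper written for this illustration, not a class from the notebook's `src` package, and applying dropout to the sub-layer output before the residual addition is an assumption based on the usual BERT recipe.

```python
import torch
from torch import nn

class AddAndNorm(nn.Module):
    """Hypothetical helper computing LayerNorm(x + Dropout(sublayer(x)))."""

    def __init__(self, hidden_size: int = 768, dropout: float = 0.1, eps: float = 1e-12):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size, eps=eps)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        # Residual connection first, then normalization over the hidden dimension.
        return self.norm(x + self.dropout(sublayer(x)))

torch.manual_seed(0)
hidden_size = 768
x = torch.randn(2, 10, hidden_size)  # (Batch, Seq_Len, Hidden)

# Stand-in sub-layer: a feed-forward network with the 768 -> 3072 -> 768 shape.
feed_forward = nn.Sequential(
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.GELU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

block = AddAndNorm(hidden_size).eval()  # eval() disables dropout
with torch.no_grad():
    y = block(x, feed_forward)

print(f"Input Shape:  {x.shape}")
print(f"Output Shape: {y.shape}")
```

Because the residual addition requires the sub-layer output to have the same shape as its input, this pattern is also what keeps the layer's input and output shapes identical.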

The most important feature of this layer is that its **Input Shape equals its Output Shape**.
* Input: `(Batch, Seq_Len, 768)`
* Output: `(Batch, Seq_Len, 768)`

Because the shapes match, we can stack $N$ of these layers (BERT Base uses 12) without any dimension mismatch between them.

Let's start by adding `src` to the Python system path, in case your notebook is not already running with it on the path.

In [1]:
import sys
from pathlib import Path
sys.path.append((Path('').resolve().parent / 'src').as_posix())

In [2]:
import torch
from torch import nn
from modules.encoders.encoder import Encoder 
from settings import EncoderSettings

encoder = Encoder(EncoderSettings())
print(encoder)

Encoder(
  (attention): MultiHeadAttention(
    (query): Linear(in_features=768, out_features=768, bias=True)
    (key): Linear(in_features=768, out_features=768, bias=True)
    (value): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (feed_forward): FeedForwardLayer(
    (dense_expansion): Linear(in_features=768, out_features=3072, bias=True)
    (activation): GELU(approximate='none')
    (dense_contraction): Linear(in_features=3072, out_features=768, bias=True)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
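
As a quick sanity check on the printout, we can tally the learnable parameters by hand. This sketch counts only the sub-modules visible above (three attention projections, two feed-forward projections, and one `LayerNorm`); `linear_params` is a helper name introduced here for illustration. `Dropout` and `GELU` contribute no learnable parameters.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameters in a Linear layer: weight matrix plus optional bias vector."""
    return in_features * out_features + (out_features if bias else 0)

hidden, intermediate = 768, 3072

# query, key and value projections, each 768 -> 768 with bias
attention_params = 3 * linear_params(hidden, hidden)

# expansion (768 -> 3072), contraction (3072 -> 768), and LayerNorm weight + bias
feed_forward_params = (
    linear_params(hidden, intermediate)
    + linear_params(intermediate, hidden)
    + 2 * hidden
)

per_layer = attention_params + feed_forward_params
print(f"Attention:    {attention_params:,}")
print(f"Feed-forward: {feed_forward_params:,}")
print(f"Per layer:    {per_layer:,}")
```

If the notebook's classes match the printout, `sum(p.numel() for p in encoder.parameters())` should give the same per-layer figure.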


### Forward Pass
We will pass a batch of data through the layer.
* **Batch:** 2 sentences
* **Length:** 10 tokens
* **Dim:** 768 (Hidden size)

In [3]:
batch_size = 2
seq_len = 10
hidden_size = 768
input_tensor = torch.randn(batch_size, seq_len, hidden_size)
output = encoder(input_tensor)

assert input_tensor.shape == output.shape

print(f"Input Shape:  {input_tensor.shape}")
print(f"Output Shape: {output.shape}")

Input Shape:  torch.Size([2, 10, 768])
Output Shape: torch.Size([2, 10, 768])


# Stacked Encoder Layer

So far we have a single Encoder layer. Now we stack several copies of it, so the vectors pass through multiple rounds of Attention and FFN.

In [4]:
from modules.encoders.layer import StackedEncoder 
from settings import EncoderSettings

stacked_encoder = StackedEncoder(EncoderSettings())
print(stacked_encoder)
print(f"\nFirst Layer: {type(stacked_encoder.layer[0])}")
print(f"Last Layer:  {type(stacked_encoder.layer[-1])}")

StackedEncoder(
  (layer): ModuleList(
    (0-11): 12 x Encoder(
      (attention): MultiHeadAttention(
        (query): Linear(in_features=768, out_features=768, bias=True)
        (key): Linear(in_features=768, out_features=768, bias=True)
        (value): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (feed_forward): FeedForwardLayer(
        (dense_expansion): Linear(in_features=768, out_features=3072, bias=True)
        (activation): GELU(approximate='none')
        (dense_contraction): Linear(in_features=3072, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
)

First Layer: <class 'modules.encoders.encoder.Encoder'>
Last Layer:  <class 'modules.encoders.encoder.Encoder'>
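
The printout shows an `nn.ModuleList` holding 12 identical layers. A minimal sketch of that stacking pattern might look like the following. `TinyLayer` and `TinyStack` are hypothetical stand-ins invented for this example, and the hidden size of 32 is a toy value; the actual `StackedEncoder` implementation may differ.

```python
import torch
from torch import nn

class TinyLayer(nn.Module):
    """Hypothetical stand-in for Encoder: maps (B, T, H) -> (B, T, H)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.dense(x))

class TinyStack(nn.Module):
    """Applies num_layers independent layers one after another."""

    def __init__(self, hidden_size: int = 768, num_layers: int = 12):
        super().__init__()
        # ModuleList registers every layer so its parameters are tracked.
        self.layer = nn.ModuleList([TinyLayer(hidden_size) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layer:
            x = layer(x)  # the output of one layer is the input to the next
        return x

stack = TinyStack(hidden_size=32, num_layers=12)
x = torch.randn(2, 10, 32)
y = stack(x)

print(f"Layers:       {len(stack.layer)}")
print(f"Input Shape:  {x.shape}")
print(f"Output Shape: {y.shape}")
```

Each layer has its own weights; only the architecture is repeated. The loop works without any reshaping because every layer returns a tensor of the same shape it received.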


### Forward Pass

In [5]:
output = stacked_encoder(input_tensor)

assert input_tensor.shape == output.shape

print(f"Input Shape:  {input_tensor.shape}")
print(f"Output Shape: {output.shape}")

Input Shape:  torch.Size([2, 10, 768])
Output Shape: torch.Size([2, 10, 768])
