# BERT Step by Step: The Position-Wise Feedforward Layer

The feedforward layer in the encoder of a transformer model is a key component responsible for processing the information obtained from the self-attention mechanism.

It consists of two fully connected (dense) layers with a GELU activation function in between, and it operates independently on each position in the input sequence. The first linear transformation projects the representations produced by the self-attention mechanism into a higher-dimensional space (in bert-base-uncased, from a hidden size of 768 to an intermediate size of 3072), and the activation function then introduces non-linearity. The second linear transformation restores the original dimensionality. This configuration lets the feedforward layer apply a non-linear transformation to each token representation, which increases the model's capacity to learn complex features. Because every position is processed separately, the layer does not mix information across tokens; that is the job of self-attention. Together, the two sub-layers are vital to the transformer's strong performance on natural language processing tasks.
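To make the shapes concrete, the following minimal sketch builds the two dense layers with the sizes used by bert-base-uncased (hidden size 768, intermediate size 3072) and traces a dummy tensor through them. The sizes are hard-coded here as an assumption so that the snippet runs without downloading a checkpoint, and the names `expand` and `contract` are illustrative only.

```python
import torch
from torch import nn

# Sizes of bert-base-uncased, hard-coded so the sketch needs no checkpoint.
hidden_size, intermediate_size = 768, 3072

expand = nn.Linear(hidden_size, intermediate_size)    # 768 -> 3072
contract = nn.Linear(intermediate_size, hidden_size)  # 3072 -> 768
gelu = nn.GELU()

x = torch.rand(1, 6, hidden_size)   # (batch_size, seq_len, hidden_size)
h = gelu(expand(x))                 # wider intermediate representation
y = contract(h)                     # back to the model width

# Weights plus biases of both dense layers.
n_params = sum(p.numel() for m in (expand, contract) for p in m.parameters())
print(h.shape, y.shape, n_params)
```

With these sizes, the two dense layers hold roughly 4.7 million parameters per encoder layer.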

In [None]:
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoConfig, AutoTokenizer
from transformers import BertForPreTraining

In [None]:
model_checkpoint = 'bert-base-uncased'

In [None]:
model = BertForPreTraining.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
config = AutoConfig.from_pretrained(model_checkpoint)

In [None]:
encoding = tokenizer.encode("let's tokenize something?", return_tensors="pt")
seq_embedding = model.bert.embeddings.word_embeddings(encoding)
seq_embedding.shape   # (batch_size, seq_len, hidden_size)

In [None]:
def scaled_dot_product_attention(seq_embedding):
    # One attention head with freshly initialised (untrained) projections
    head_dim = config.hidden_size // config.num_attention_heads
    query = nn.Linear(config.hidden_size, head_dim)(seq_embedding)
    key = nn.Linear(config.hidden_size, head_dim)(seq_embedding)
    value = nn.Linear(config.hidden_size, head_dim)(seq_embedding)
    dim_k = query.size(-1)
    # Scale the dot products by sqrt(d_k) before the softmax
    scores = torch.bmm(query, key.transpose(1, 2)) / dim_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

In [None]:
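# Concatenate the outputs of all heads along the feature dimension
att_concat = torch.cat([scaled_dot_product_attention(seq_embedding) for _ in range(config.num_attention_heads)], dim=-1)
att_concat.shape   # (batch_size, seq_len, hidden_size)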

In [None]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x
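
The GELU activation used in this module weights each input by the standard normal CDF, GELU(x) = x · Φ(x). As a quick sanity check, the sketch below compares `nn.GELU()` in its default (exact) form against that formula written with `torch.erf`. The helper name `gelu_reference` is made up for this example.

```python
import math
import torch
from torch import nn

def gelu_reference(x):
    # x * Phi(x), with the normal CDF expressed through the error function
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3.0, 3.0, steps=13)
matches = torch.allclose(nn.GELU()(x), gelu_reference(x), atol=1e-6)
print(matches)
```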

In [None]:
feedforward = FeedForward(config)

In [None]:
feedforward(att_concat).shape

The feedforward layer operates independently on each position in the input sequence.

Note that a feed-forward layer such as nn.Linear is usually applied to a tensor of shape (batch_size, input_dim), where it acts on each element of the batch dimension independently. This is actually true for any dimension except the last one, so when we pass a tensor of shape (batch_size, seq_len, hidden_size), the layer is applied to all token embeddings of the batch and sequence independently, which is exactly what we want. The same module therefore also accepts a plain two-dimensional batch of shape (batch_size, hidden_size), which we can test with a random tensor:

In [None]:
_batch = torch.rand((64, config.hidden_size))
feedforward(_batch).shape
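
The position-wise behaviour can also be checked directly. The sketch below uses a small stand-in for the `FeedForward` module, with made-up sizes (hidden size 8, intermediate size 32) and no dropout so that the outputs are deterministic. It applies the block once to a whole (batch_size, seq_len, hidden_size) tensor and once to each position separately, then compares the two results.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the feed-forward block; sizes are illustrative, not BERT's.
ffn = nn.Sequential(nn.Linear(8, 32), nn.GELU(), nn.Linear(32, 8))

x = torch.rand(2, 5, 8)   # (batch_size, seq_len, hidden_size)

with torch.no_grad():
    full = ffn(x)   # whole sequence at once
    # One position at a time, then stacked back along the sequence axis.
    per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

same = torch.allclose(full, per_position, atol=1e-5)
print(full.shape, same)
```

To repeat this check with the notebook's own `FeedForward` module, call `feedforward.eval()` first: dropout is active in training mode and would make the two results differ.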