Introduction to Transformers
============================

Welcome to the first notebook in the “Transformers Explained” series!
In this notebook, we will explore the basics of the Transformer
architecture, a model that has revolutionized natural language
processing (NLP) and other domains.

What is a Transformer?
-----------------------

The Transformer is a neural network architecture introduced in 2017 by
Vaswani et al. in the paper “Attention Is All You Need.” Unlike
previous models like RNNs and CNNs, the Transformer relies entirely on
attention mechanisms, removing the need for recurrence or convolutions.

![Transformer Architecture](https://media.datacamp.com/legacy/v1704797298/image_7b08f474e7.png)
*Figure: General Transformer Architecture (Source: Jalammar)*

Why Are Transformers Important?
-------------------------------

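Before Transformers, recurrent models such as LSTMs and GRUs dominated NLP.
They struggled with long-range dependencies and, because they process tokens one at a time, were hard to parallelize.
Transformers address both problems with **self-attention**, which lets every position attend directly to every other position and allows all positions in a sequence to be processed in parallel.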
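
To make self-attention concrete, here is a minimal sketch of scaled dot-product attention. It is an illustration under simplifying assumptions, not a full attention layer: the function name `scaled_dot_product_attention` and the toy tensor `tokens` are made up for this example, and the learned query/key/value projections used in a real Transformer are left out.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value):
    """Return the attention output and the attention weights."""
    d_k = query.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    # Each row becomes a probability distribution over positions.
    weights = torch.softmax(scores, dim=-1)
    # Each output row is a weighted average of the value vectors.
    return weights @ value, weights

torch.manual_seed(0)
tokens = torch.rand(4, 8)  # hypothetical input: 4 tokens, 8 features each

# Self-attention: queries, keys and values all come from the same sequence.
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)

print(output.shape)   # one 8-dimensional vector per token
print(weights.shape)  # one weight per (query token, key token) pair
```

All four tokens are handled by the same two matrix multiplications, with no loop over positions. That is what makes the computation easy to parallelize.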

General Architecture
--------------------

The original Transformer consists of two parts:
1.  **Encoder**: Processes the input and builds a contextual representation.
2.  **Decoder**: Uses the encoder output to generate the final sequence.

![Encoder Decoder](https://media.datacamp.com/legacy/v1704797298/image_3aa5aef3db.png)
*Figure: Encoder and Decoder structure*
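
The encoder/decoder split can be sketched with PyTorch's built-in `nn.Transformer`. The sizes below are arbitrary small values chosen for illustration, and the random tensors stand in for embedded source and target tokens. This is a shape walkthrough, not a trained model.

```python
import torch
import torch.nn as nn

# Small, arbitrary hyperparameters so the example runs quickly.
model = nn.Transformer(
    d_model=32,
    nhead=4,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.0,
)

# Default layout is (sequence length, batch size, d_model).
src = torch.rand(10, 2, 32)  # stand-in for an embedded source sequence
tgt = torch.rand(7, 2, 32)   # stand-in for an embedded target sequence

# The encoder turns the source into a contextual representation ("memory").
memory = model.encoder(src)
# The decoder combines the target sequence with that memory.
output = model.decoder(tgt, memory)

print(memory.shape)  # same shape as src
print(output.shape)  # same shape as tgt
```

The encoder output keeps the source length, while the decoder output follows the target length. A real sequence-to-sequence model would also pass a causal mask for the target.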

Encoder Blocks
--------------

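Each encoder block has two sub-layers:
-   **Multi-head Self-Attention**
-   **Feed-Forward Neural Network**

In the original design, each sub-layer is wrapped in a residual connection followed by layer normalization.

Decoder Blocks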
--------------

Each decoder block contains:
-   **Masked Multi-head Self-Attention**
-   **Encoder-Decoder Attention**
-   **Feed-Forward Layer**
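
The "masked" part of the decoder's self-attention can be sketched with `nn.TransformerDecoderLayer` and a hand-built causal mask. The sizes and random inputs are assumptions made for this illustration. The check at the end shows that, with the mask in place, changing the last target token leaves the outputs at earlier positions untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads, tgt_len, src_len, batch = 16, 4, 5, 7, 2

decoder_layer = nn.TransformerDecoderLayer(
    d_model=d_model, nhead=num_heads, dim_feedforward=32, dropout=0.0
)
decoder_layer.eval()

# Boolean causal mask: True marks positions that may NOT be attended to,
# so position i only sees positions 0..i.
causal_mask = torch.triu(
    torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1
)

memory = torch.rand(src_len, batch, d_model)  # stand-in for encoder output
tgt = torch.rand(tgt_len, batch, d_model)     # stand-in for embedded targets

with torch.no_grad():
    out = decoder_layer(tgt, memory, tgt_mask=causal_mask)

    # Change only the last target token and run the layer again.
    tgt_changed = tgt.clone()
    tgt_changed[-1] = torch.rand(batch, d_model)
    out_changed = decoder_layer(tgt_changed, memory, tgt_mask=causal_mask)

# Earlier positions cannot see the last token, so their outputs match.
earlier_unchanged = torch.allclose(out[:-1], out_changed[:-1], atol=1e-5)
print(out.shape)
print(earlier_unchanged)
```

The encoder-decoder attention inside the layer is not masked here, so every target position can still look at the whole encoder output.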

![Multi-Head Attention](https://media.datacamp.com/cms/google/82s2vzpkd8l-bvn5nzlyol98qr7yjcmieudlmn5qvgnofxo4eajw_vpvx-suwmitx4yiebkhyzztq6vmw15j_so_-xiwvc5_d76irx1hlhky4giknbx2pff9rxydcuv3akzvwhl-pvyn7b7eszul9n4.png)
*Figure: Multi-head Attention explained*

Positional Encoding
--------------------

Transformers don’t use recurrence or convolutions, so on their own they have no notion of token order.
To supply it, we add **positional encodings** to the token embeddings before the first layer.

![Positional Encoding](https://media.datacamp.com/cms/google/aa6uuy3t-iknfuwcguorfwsud60oza4ptjuotfmk0ce1p1pp_o-dr0k8dxqubp4xfk7yme8vx3tlliorja-afownqyoeggkxey3nv0arqyrwnwpeqzyx0dsyavjdodgysmonaxryhwcqf0b-in1zkki.png)
*Figure: Positional encoding visualized as sinusoidal functions*
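
For reference, the sinusoidal encodings proposed in “Attention Is All You Need” can be written as follows. Here $pos$ is the token position, $i$ indexes a pair of embedding dimensions, and $d_{\text{model}}$ is the embedding size. The symbol names follow the paper rather than anything defined earlier in this notebook.

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Each pair of dimensions holds a sine and a cosine at one wavelength, and the wavelengths grow geometrically from $2\pi$ to $10000 \cdot 2\pi$ across the embedding.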

Simplified Example (PyTorch)
----------------------------


In [ ]:
import torch
import torch.nn as nn

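class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to (seq_len, batch, d_model) inputs."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        # Column vector of positions 0..max_len-1, shape (max_len, 1).
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # One frequency per pair of dimensions: 1 / 10000^(2i / d_model).
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions (assumes d_model is even)
        # Reshape to (max_len, 1, d_model) so it broadcasts over the batch dimension.
        pe = pe.unsqueeze(0).transpose(0, 1)
        # A buffer moves with the module across devices but is not a trainable parameter.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (seq_len, batch, d_model); add the encodings for the first seq_len positions.
        return x + self.pe[:x.size(0), :]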

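class SimpleEncoderLayer(nn.Module):
    """One encoder block: self-attention and a feed-forward network,
    each followed by a residual connection and layer normalization."""

    def __init__(self, d_model, num_heads, dim_feedforward, dropout):
        super().__init__()
        # Expects inputs shaped (seq_len, batch, d_model), the default layout.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout)
        # Position-wise feed-forward network: d_model -> dim_feedforward -> d_model.
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention: query, key and value are all the same sequence.
        # The module returns (output, attention weights); keep only the output.
        src2 = self.self_attn(src, src, src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask)[0]
        # Residual connection, then layer normalization.
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        # Feed-forward network applied independently at every position.
        src2 = self.linear2(self.dropout(torch.relu(self.linear1(src))))
        # Second residual connection and layer normalization.
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src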

# Hyperparameters matching the base model in "Attention Is All You Need".
d_model = 512
num_heads = 8
dim_feedforward = 2048
dropout = 0.1

encoder_layer = SimpleEncoderLayer(d_model, num_heads, dim_feedforward, dropout)
pos_encoder = PositionalEncoding(d_model)

# Random stand-in for embedded tokens, shaped (seq_len=10, batch=2, d_model).
input_sequence = torch.rand(10, 2, d_model)
input_with_pos = pos_encoder(input_sequence)
output = encoder_layer(input_with_pos)

# The encoder layer preserves the input shape.
print(f"Input shape: {input_sequence.shape}")
print(f"Encoder output shape: {output.shape}")

Conclusion
----------

This notebook introduced the basic concepts behind the Transformer architecture.
In the next notebooks, we’ll look more closely at the **attention mechanism** and build
a full Transformer step by step.
