# Transformer Architecture Flow Documentation

## Building the Transformer (build_transformer function)

### Input Parameters:
- src_vocab_size: Size of source vocabulary
- tgt_vocab_size: Size of target vocabulary
- src_seq_len: Maximum length of source sequence
- tgt_seq_len: Maximum length of target sequence
- d_model: Dimension of the model (default: 512)
- N: Number of encoder/decoder blocks (default: 6)
- h: Number of attention heads (default: 8)
- dropout: Dropout rate (default: 0.1)
- d_ff: Dimension of feed-forward network (default: 2048)

### Construction Flow:
1. Create embedding layers:
   - Source embedding: (batch_size, src_seq_len) → (batch_size, src_seq_len, d_model)
   - Target embedding: (batch_size, tgt_seq_len) → (batch_size, tgt_seq_len, d_model)

2. Create positional encoding layers:
   - Adds positional information to embeddings
   - Output shape remains same as input: (batch_size, seq_len, d_model)
   - Uses sine and cosine functions for different frequencies

3. Build Encoder:
   - Creates N identical encoder blocks
   - Each encoder block contains:
     a. Self-attention block (MultiHeadAttentionBlock)
     b. Feed-forward block
     c. Two residual connections with layer normalization

4. Build Decoder:
   - Creates N identical decoder blocks
   - Each decoder block contains:
     a. Self-attention block
     b. Cross-attention block
     c. Feed-forward block
     d. Three residual connections with layer normalization

5. Create projection layer:
   - Maps decoder output to one score per token in the target vocabulary (logits or log-probabilities, depending on the implementation)
   - Input: (batch_size, seq_len, d_model)
   - Output: (batch_size, seq_len, tgt_vocab_size)
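
Step 2 of the construction flow can be sketched as follows. This is a minimal illustration, assuming the standard sinusoidal formulation in which even feature indices hold sine values and odd feature indices hold cosine values. The function name `sinusoidal_table` is hypothetical and is not taken from the implementation described here.

```python
import math
import torch

def sinusoidal_table(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (1, seq_len, d_model) table of sine/cosine position encodings."""
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
    # One frequency per pair of feature indices: 10000^(-2i / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )  # (d_model / 2,)
    pe[:, 0::2] = torch.sin(position * div_term)  # even feature indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd feature indices
    return pe.unsqueeze(0)  # leading batch dimension of size 1 for broadcasting

pe = sinusoidal_table(seq_len=6, d_model=8)
x = torch.zeros(2, 6, 8)          # stand-in for a batch of embeddings
out = x + pe[:, : x.shape[1], :]  # the same table is added to every sample
print(out.shape)
```

Because the table has a leading dimension of 1, it broadcasts over the batch, which is why the output shape equals the input shape.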

## Forward Pass Flow

### 1. Encoding Process:
```
Input → Embedding → Positional Encoding → Encoder Blocks → Encoder Output
(batch, seq_len) → (batch, seq_len, d_model) → (batch, seq_len, d_model) → (batch, seq_len, d_model)
```

#### Encoder Block Processing:
1. Self-Attention:
   - Input is projected to Q, K, V, and each is split into h heads: (batch, seq_len, d_model) → (batch, h, seq_len, d_k)
   - Attention computation: (batch, h, seq_len, seq_len)
   - Output: (batch, seq_len, d_model)
2. Feed-Forward:
   - First linear: (batch, seq_len, d_model) → (batch, seq_len, d_ff)
   - ReLU activation
   - Second linear: (batch, seq_len, d_ff) → (batch, seq_len, d_model)
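
The feed-forward shapes listed above can be checked with a small sketch. The class name `FeedForwardSketch` is hypothetical; the layer names `linear_1`, `linear_2`, and `dropout` follow the snippet used later in this document, and the small dimensions are chosen only to keep the example fast.

```python
import torch
import torch.nn as nn

class FeedForwardSketch(nn.Module):
    """Position-wise feed-forward network: Linear -> ReLU -> Dropout -> Linear."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))

ff = FeedForwardSketch(d_model=16, d_ff=64)
x = torch.randn(2, 5, 16)
y = ff(x)
print(y.shape)
```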

### 2. Decoding Process:
```
Target → Embedding → Positional Encoding → Decoder Blocks → Projection → Output Probabilities
(batch, seq_len) → (batch, seq_len, d_model) → (batch, seq_len, d_model) → (batch, seq_len, tgt_vocab_size)
```

#### Decoder Block Processing:
1. Self-Attention (masked):
   - Prevents attending to future tokens
   - Same shape transformations as encoder self-attention
2. Cross-Attention:
   - Q: from decoder, K,V: from encoder output
   - Allows decoder to focus on relevant parts of input sequence
3. Feed-Forward:
   - Identical to encoder feed-forward network

### 3. Multi-Head Attention Details:
1. Shape flow, from the linear projections through head concatenation:
   ```
   Input(batch, seq_len, d_model) → 
   Split into h heads(batch, h, seq_len, d_k) →
   Attention(batch, h, seq_len, seq_len) →
   Values(batch, h, seq_len, d_k) →
   Concat(batch, seq_len, d_model)
   ```

2. Attention formula:
   ```
   Attention(Q,K,V) = softmax(QK^T/√d_k)V
   ```
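
The attention formula can be written as a short function. This is a sketch under stated assumptions: the function name `scaled_dot_product_attention` and the convention that a mask value of 0 means "do not attend" are chosen for illustration, and filling with `-1e9` is one common way to suppress masked positions.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, h, seq_len, d_k). mask must broadcast to (batch, h, seq_q, seq_k)."""
    d_k = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, h, seq_q, seq_k)
    if mask is not None:
        # Masked positions get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = scores.softmax(dim=-1)  # each row sums to 1 over the keys
    return weights @ v, weights

batch, h, seq_len, d_k = 2, 8, 5, 64
q = torch.randn(batch, h, seq_len, d_k)
k = torch.randn(batch, h, seq_len, d_k)
v = torch.randn(batch, h, seq_len, d_k)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)
```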

### 4. Important Shape Transformations:
- Source input: (batch_size, src_seq_len)
- Target input: (batch_size, tgt_seq_len)
- Embeddings: (..., d_model)
- Attention heads: (batch_size, num_heads, seq_len, d_k)
- Feed-forward: (batch_size, seq_len, d_ff)
- Final output: (batch_size, tgt_seq_len, tgt_vocab_size)

## Training Flow:
1. Forward pass:
   - Source sequence → Encoder → Encoder Output
   - Target sequence → Decoder (using encoder output) → Predictions
2. Loss calculation:
   - Compare predictions with shifted target sequence
3. Backward pass:
   - Gradient calculation and parameter updates
4. Key masks:
   - Source mask: Handles padding in input sequence
   - Target mask: Prevents attending to future tokens (causal mask)
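
The "shifted target sequence" in step 2 can be illustrated with a tiny example. The token IDs below are hypothetical; the pattern (start token prepended for the decoder input, end token appended for the label) mirrors the `decoder_input` and `label` tensors built by `BilingualDataset` later in this document.

```python
import torch

SOS, EOS = 1, 2                      # hypothetical special-token IDs
sentence = torch.tensor([15, 6, 9])  # hypothetical target token IDs

# The decoder sees <sos> + sentence; the label is sentence + <eos>.
tgt_input = torch.cat([torch.tensor([SOS]), sentence])
tgt_output = torch.cat([sentence, torch.tensor([EOS])])

print(tgt_input.tolist())   # [1, 15, 6, 9]
print(tgt_output.tolist())  # [15, 6, 9, 2]
```

At every position the label is the token that follows the decoder input, so the model is trained to predict the next token.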

## Implementation Notes:
- Layer normalization is applied before attention and feed-forward (pre-norm)
- Residual connections help with gradient flow
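- Xavier uniform initialization is applied to every parameter with more than one dimension (the weight matrices); 1-D parameters such as biases keep their default initialization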
- Dropout is applied:
  1. After positional encoding
  2. After attention softmax
  3. After each sublayer before residual connection
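
A minimal sketch of the initialization note, assuming the common pattern of looping over the parameters and skipping 1-D tensors (`nn.init.xavier_uniform_` raises an error for tensors with fewer than two dimensions). The `nn.Sequential` model is only a stand-in for the transformer.

```python
import torch.nn as nn

# Stand-in model; the real transformer would be used here instead
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))

for p in model.parameters():
    if p.dim() > 1:                 # weight matrices only; biases are 1-D
        nn.init.xavier_uniform_(p)

# For a (16, 8) weight, the Xavier bound is sqrt(6 / (8 + 16)) = 0.5
print(model[0].weight.abs().max().item())
```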

# Detailed Transformer Execution Flow - Step by Step

## 1. Initial Construction (build_transformer function call)

### 1.1 Component Creation
```python
transformer = build_transformer(
    src_vocab_size=32000,  # example values
    tgt_vocab_size=32000,
    src_seq_len=512,
    tgt_seq_len=512
)
```

1. Creates embedding layers:
   ```python
   src_embed = InputEmbeddings(d_model, src_vocab_size)
   tgt_embed = InputEmbeddings(d_model, tgt_vocab_size)
   ```

2. Creates positional encoding layers:
   ```python
   src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
   tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)
   ```
   - During initialization, creates the positional encoding matrix (`pe`):
     - Shape: (1, seq_len, d_model)
     - Contains sine/cosine patterns for each position

3. Creates N encoder blocks and N decoder blocks
4. Initializes projection layer

## 2. Forward Pass Execution Flow

When you input source and target sequences:
```python
output = transformer(src_sequence, tgt_sequence, src_mask, tgt_mask)
```

### 2.1 Source Input Processing (Encoding)

1. **Embedding Layer** (`src_embed`):
   ```python
   # Input: (batch_size, src_seq_len)
   embedded = self.src_embed(src)
   # Output: (batch_size, src_seq_len, d_model)
   ```

2. **Positional Encoding** (`src_pos`):
   ```python
   # Forward method of PositionalEncoding
   position_encoded = self.src_pos(embedded)
   # Adds pre-computed positional encodings to embeddings
   # Output: (batch_size, src_seq_len, d_model)
   ```

3. **Encoder Processing**:
   ```python
   encoder_output = self.encoder(position_encoded, src_mask)
   ```
   
   For each encoder block:
   
   a. **Self-Attention** (`MultiHeadAttentionBlock`):
      ```python
      # 1. Linear projections
      Q = self.w_q(x)  # (batch, seq_len, d_model)
      K = self.w_k(x)  # (batch, seq_len, d_model)
      V = self.w_v(x)  # (batch, seq_len, d_model)
      
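      # 2. Reshape for multiple heads (the same reshape applies to Q, K, and V)
      Q = Q.view(batch, seq_len, h, d_k).transpose(1, 2)
      K = K.view(batch, seq_len, h, d_k).transpose(1, 2)
      V = V.view(batch, seq_len, h, d_k).transpose(1, 2)
      # Now: (batch, h, seq_len, d_k)
      
      # 3. Attention computation
      attention_scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)
      # Apply mask, then softmax over the key dimension
      attention_scores = attention_scores.masked_fill(mask == 0, -1e9)
      attention_weights = attention_scores.softmax(dim=-1)
      attention_output = attention_weights @ V  # (batch, h, seq_len, d_k)
      
      # 4. Concatenate heads and project
      attention_output = attention_output.transpose(1, 2).contiguous().view(batch, seq_len, h * d_k)
      attention_output = self.w_o(attention_output)  # (batch, seq_len, d_model)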
      ```

   b. **Feed-Forward Network**:
      ```python
      # Two linear transformations with ReLU
      ff_output = self.linear_2(self.dropout(
          torch.relu(self.linear_1(attention_output))
      ))
      ```

### 2.2 Target Input Processing (Decoding)

1. **Embedding Layer** (`tgt_embed`):
   ```python
   # Input: (batch_size, tgt_seq_len)
   embedded = self.tgt_embed(tgt)
   # Output: (batch_size, tgt_seq_len, d_model)
   ```

2. **Positional Encoding** (`tgt_pos`):
   ```python
   position_encoded = self.tgt_pos(embedded)
   ```

3. **Decoder Processing**:
   For each decoder block:
   
   a. **Masked Self-Attention**:
      - Same as encoder self-attention but with causal mask
      - Prevents attending to future positions
   
   b. **Cross-Attention**:
      ```python
      # Q from decoder, K,V from encoder
      cross_attention = self.cross_attention_block(
          decoder_output,  # Q
          encoder_output,  # K
          encoder_output   # V
      )
      ```
   
   c. **Feed-Forward Network**:
      - Same structure as encoder

### 2.3 Final Projection

```python
# Input: (batch_size, tgt_seq_len, d_model)
output = self.projection_layer(decoder_output)
# Output: (batch_size, tgt_seq_len, tgt_vocab_size)
```
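
A minimal sketch of a projection layer, assuming a single linear map followed by log-softmax; an implementation may equally return raw logits and leave the softmax to the loss function. The class name `ProjectionSketch` and the small sizes are hypothetical.

```python
import torch
import torch.nn as nn

class ProjectionSketch(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, d_model) -> (batch, seq_len, vocab_size)
        return torch.log_softmax(self.proj(x), dim=-1)

proj = ProjectionSketch(d_model=16, vocab_size=100)
decoder_output = torch.randn(2, 5, 16)
log_probs = proj(decoder_output)
# Greedy choice of the next token from the last position of each sequence
next_token = log_probs[:, -1].argmax(dim=-1)
print(log_probs.shape, next_token.shape)
```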

## 3. Residual Connections and Layer Normalization

Throughout the network:

```python
class ResidualConnection(nn.Module):
    def forward(self, x, sublayer):
        # 1. Layer normalization
        normalized = self.norm(x)
        # 2. Apply sublayer (attention or feed-forward)
        sublayer_output = sublayer(normalized)
        # 3. Apply dropout
        dropout_output = self.dropout(sublayer_output)
        # 4. Add residual connection
        return x + dropout_output
```
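
The class above omits its constructor. A runnable sketch is shown here, assuming `nn.LayerNorm` for the normalization (an implementation may define its own layer normalization module). The name `ResidualConnectionSketch` and the `features` argument are hypothetical.

```python
import torch
import torch.nn as nn

class ResidualConnectionSketch(nn.Module):
    def __init__(self, features: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(features)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Pre-norm: normalize, apply the sublayer, apply dropout, then add the input back
        return x + self.dropout(sublayer(self.norm(x)))

residual = ResidualConnectionSketch(features=16)
feed_forward = nn.Linear(16, 16)  # stand-in sublayer
x = torch.randn(2, 5, 16)
y = residual(x, feed_forward)
print(y.shape)
```

Passing the sublayer as a callable lets the same wrapper serve self-attention, cross-attention, and the feed-forward block.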

## 4. Training Process

1. Forward pass runs when the model is called, and the loss is computed from its output:
   ```python
   # Example training step
   optimizer.zero_grad()
   output = transformer(src, tgt_input, src_mask, tgt_mask)
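   # Assuming criterion is nn.CrossEntropyLoss: flatten to (batch * seq_len, vocab) and (batch * seq_len,)
   loss = criterion(output.reshape(-1, output.size(-1)), tgt_output.reshape(-1))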
   ```

2. Backward pass:
   ```python
   loss.backward()
   optimizer.step()
   ```
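
The two steps can be combined into one runnable training step. This is a sketch under stated assumptions: the tiny `nn.Sequential` model stands in for the transformer, the vocabulary size and padding ID are hypothetical, and `nn.CrossEntropyLoss` with `ignore_index` is assumed as the criterion so that padding positions do not contribute to the loss.

```python
import torch
import torch.nn as nn

vocab_size, pad_id = 20, 0  # hypothetical values
# Stand-in for the transformer: maps token IDs straight to per-token logits
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)

tgt_input = torch.tensor([[1, 5, 6, 0], [1, 7, 8, 9]])   # (batch, seq_len)
tgt_output = torch.tensor([[5, 6, 2, 0], [7, 8, 9, 2]])  # (batch, seq_len)

optimizer.zero_grad()
logits = model(tgt_input)  # (batch, seq_len, vocab_size)
# CrossEntropyLoss expects (N, C) logits and (N,) class indices, so flatten first
loss = criterion(logits.reshape(-1, vocab_size), tgt_output.reshape(-1))
loss.backward()
optimizer.step()
print(loss.item())
```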

## 5. Mask Types and Creation

1. **Source Padding Mask**:
   - Created for source sequences
   - Masks padding tokens (usually 0s)
   ```python
   # Shape: (batch_size, 1, 1, src_seq_len)
   src_mask = (src != pad_token).unsqueeze(1).unsqueeze(2)
   ```

2. **Target Causal Mask**:
   - Combines padding mask and causal mask
   - Prevents attending to future tokens
   ```python
   # Shape: (batch_size, 1, tgt_seq_len, tgt_seq_len)
   causal = torch.tril(torch.ones(tgt_seq_len, tgt_seq_len)).bool()
   tgt_mask = (tgt != pad_token).unsqueeze(1).unsqueeze(2) & causal
   ```
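
A worked example of both masks on a tiny batch, assuming a padding ID of 0 and the convention that `True` means "may attend". The token IDs are hypothetical.

```python
import torch

pad_token = 0  # hypothetical padding ID
src = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])  # (batch_size=2, src_seq_len=4)
tgt = torch.tensor([[1, 5, 6, 0], [1, 7, 0, 0]])  # (batch_size=2, tgt_seq_len=4)

# Padding mask: True where the token is real
src_mask = (src != pad_token).unsqueeze(1).unsqueeze(2)  # (2, 1, 1, 4)

# Causal mask: lower triangle is True, so position i sees positions <= i
tgt_len = tgt.shape[1]
causal = torch.tril(torch.ones(tgt_len, tgt_len)).bool()  # (4, 4)

# Combined target mask: a key must be non-padding and not in the future
tgt_mask = (tgt != pad_token).unsqueeze(1).unsqueeze(2) & causal  # (2, 1, 4, 4)

print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[0, 0].int())
```

The singleton dimensions let each mask broadcast across the attention heads when it is applied to scores of shape (batch, h, seq_len, seq_len).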

## 6. Memory Flow and Attention Patterns

1. **Encoder Self-Attention**:
   - Each position can attend to all positions in the source sequence
   - Memory complexity: O(n²) where n is sequence length

2. **Decoder Self-Attention**:
   - Each position can only attend to previous positions
   - Memory complexity: O(n²)

3. **Cross-Attention**:
   - Each decoder position can attend to all encoder positions
   - Memory complexity: O(n·m), where n is the target length and m is the source length (O(n²) when the two are equal)
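
The memory cost comes from the attention score tensor, which has one entry per query-key pair in every head. The arithmetic below is a small sketch with hypothetical sizes; the helper name `score_elements` is not part of the implementation.

```python
def score_elements(batch: int, heads: int, query_len: int, key_len: int) -> int:
    """Entries in an attention score tensor of shape (batch, heads, query_len, key_len)."""
    return batch * heads * query_len * key_len

batch, heads = 8, 8
src_len, tgt_len = 300, 200  # hypothetical sequence lengths

encoder_self = score_elements(batch, heads, src_len, src_len)  # source attends to source
decoder_self = score_elements(batch, heads, tgt_len, tgt_len)  # target attends to target
cross = score_elements(batch, heads, tgt_len, src_len)         # target attends to source

print(encoder_self, decoder_self, cross)  # 5760000 2560000 3840000
```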

## 7. Dimension Transitions Throughout Network

```
Input → Embedding → Positional → Attention → FF → Output
(L) → (L,D) → (L,D) → (L,D) → (L,D) → (L,V)

Where:
L = sequence length
D = d_model (512 default)
V = vocabulary size
```

Each attention head: D/h dimensions (64 if d_model=512 and h=8)
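
The per-head dimension and the split into heads can be checked with a short round trip. This sketch uses the default sizes from above; the batch size and sequence length are arbitrary.

```python
import torch

batch, seq_len, d_model, h = 2, 5, 512, 8
d_k = d_model // h  # 64

x = torch.randn(batch, seq_len, d_model)
# Split: (batch, seq_len, d_model) -> (batch, seq_len, h, d_k) -> (batch, h, seq_len, d_k)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)
# Merge: transpose back, make memory contiguous, then flatten the head dimensions
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)

print(d_k, heads.shape, merged.shape)
```

Splitting and merging only rearrange the same values, so `merged` equals `x` exactly; `d_model` must be divisible by `h` for the reshape to work.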

In [3]:
import torch
import torch.nn as nn
import math
class InputEmbeddings(nn.Module):

    def __init__(self, d_model: int, vocab_size: int) -> None:  # d_model: embedding dimension; vocab_size: number of tokens in the vocabulary
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # (batch, seq_len) --> (batch, seq_len, d_model)
        # Multiply by sqrt(d_model) to scale the embeddings, as in the original Transformer paper
        return self.embedding(x) * math.sqrt(self.d_model)

In [2]:
vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "I": 3, "am": 4, "learning": 5, "transformers": 6}
vocab_size = len(vocab)

# Reverse vocabulary for token-to-text mapping
reverse_vocab = {v: k for k, v in vocab.items()}

# Example tokenizer
def tokenize(text):
    return [vocab[token] for token in text.split() if token in vocab]
import torch

# Input sentence
text = "I am learning transformers"

# Tokenize the sentence
token_ids = tokenize(text)
print(f"Token IDs: {token_ids}")  # Output: [3, 4, 5, 6]

# Add <sos> and <eos> tokens
token_ids = [vocab["<sos>"]] + token_ids + [vocab["<eos>"]]
print(f"Token IDs with <sos> and <eos>: {token_ids}")  # Output: [1, 3, 4, 5, 6, 2]

# Convert to tensor
input_ids = torch.tensor([token_ids])  # Shape: (batch_size=1, seq_len=6)

Token IDs: [3, 4, 5, 6]
Token IDs with <sos> and <eos>: [1, 3, 4, 5, 6, 2]


In [3]:
import math
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)
d_model = 8  # Example embedding dimension
embedding_layer = InputEmbeddings(d_model, vocab_size)

In [4]:
embedding_layer(input_ids)

tensor([[[ 4.3973, -0.3792,  2.3362,  1.0594,  5.3727,  0.4745, -1.6741,
           0.0453],
         [ 3.4330, -0.0212,  3.6404, -0.4299, -3.4045,  1.3409, -1.3734,
           2.6941],
         [ 0.1255,  1.5592, -2.0894,  2.4839, -0.4168,  2.5544, -3.7248,
           0.7863],
         [ 0.0383, -3.0224, -1.6236,  5.3485,  3.2719, -2.2278,  1.6502,
           2.0719],
         [-0.7482,  0.5400, -1.1554, -3.9622,  4.8737, -5.5352,  0.0768,
           0.8087],
         [-0.2199, -0.3985, -3.0491,  4.4211,  5.0618,  0.3720, -3.8904,
           0.9432]]], grad_fn=<MulBackward0>)

In [5]:
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

In [6]:
div_term

tensor([1.0000e+00, 1.0000e-01, 1.0000e-02, 1.0000e-03])

In [17]:
pe = torch.zeros(3, 512)
# Column vector of positions
position = torch.arange(0, 3, dtype=torch.float).unsqueeze(1) # (seq_len, 1)
# One frequency per pair of feature indices; 512 is used here to match the width of pe
div_term = torch.exp(torch.arange(0, 512, 2).float() * (-math.log(10000.0) / 512)) # (d_model / 2)
# Apply sine to even indices
pe[:, 0::2] = torch.sin(position * div_term)
# Apply cosine to odd indices
pe[:, 1::2] = torch.cos(position * div_term)
pe.shape

torch.Size([3, 512])

In [15]:
div_term.shape

torch.Size([256])

In [16]:
position.shape

torch.Size([3, 1])

In [18]:
math.sqrt(256)

16.0

In [19]:
2/16 , 1/16 , 0/16

(0.125, 0.0625, 0.0)

In [27]:
x = torch.ones(1,2,3,3)
x

tensor([[[[1., 1., 1.],
          [1., 1., 1.],
          [1., 1., 1.]],

         [[1., 1., 1.],
          [1., 1., 1.],
          [1., 1., 1.]]]])

In [4]:

def causal_mask(size):
    mask = torch.triu(torch.ones(1,size,size),diagonal=1).type(torch.int)
    return mask == 0

In [15]:
import torch

# Example token values
seq_len = 4
self_sos_token = torch.tensor([1], dtype=torch.int64)  # start of sequence token
dec_input_tokens = [15,6]  # the input tokens
self_pad_token = 0  # padding token
dec_num_padding_tokens = seq_len - len(dec_input_tokens) - 1  # only <sos> is added to the decoder input, so subtract 1

# Concatenate the tokens: <sos>, sentence tokens, then padding up to seq_len
decoder_input = torch.cat(
    [
        self_sos_token,
        torch.tensor(dec_input_tokens, dtype=torch.int64),
        torch.tensor([self_pad_token] * dec_num_padding_tokens, dtype=torch.int64),
    ],
    dim=0,
)

print(decoder_input)


tensor([ 1, 15,  6,  0])


In [30]:
len(decoder_input)

4

In [31]:
decoder_input.size(0)

4

In [32]:
decoder_pre_mask = (decoder_input != self_pad_token).unsqueeze(0).int()

In [33]:
decoder_pre_mask.shape

torch.Size([1, 4])

In [34]:
decoder_input.size(0)

4

In [35]:
cm = causal_mask(decoder_input.size(0))

In [36]:
# returns = {"decoder_mask": (decoder_input != self_pad_token).unsqueeze(0).int() & causal_mask(decoder_input.size(0))} # (1, seq_len) & (1, seq_len, seq_len),
print(decoder_pre_mask)
print(cm)

print(decoder_pre_mask & cm)

tensor([[1, 1, 1, 0]], dtype=torch.int32)
tensor([[[ True, False, False, False],
         [ True,  True, False, False],
         [ True,  True,  True, False],
         [ True,  True,  True,  True]]])
tensor([[[1, 0, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 1, 0],
         [1, 1, 1, 0]]], dtype=torch.int32)


In [1]:
import torch
import torch.nn as nn
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from pathlib import Path
from torch.utils.data import Dataset, DataLoader,random_split
def get_all_sentences(ds,lang):
    # Helper used by get_or_build_tokenizer (assumed shape of ds items: item['translation'][lang])
    for item in ds:
        yield item['translation'][lang]

def get_or_build_tokenizer(config,ds,lang):
    # config['tokenizer_file'] = '..tokenizers/tokenizer_{0}.json'
    tokenizer_path = Path(config['tokenizer_file'].format(lang))
    if not tokenizer_path.exists():
        tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))
        tokenizer.pre_tokenizer = Whitespace()
        # min_frequency=2: a word must appear at least twice to be added to the vocabulary
        trainer = WordLevelTrainer(special_tokens=["[UNK]","[PAD]","[SOS]","[EOS]"], min_frequency=2)
        tokenizer.train_from_iterator(get_all_sentences(ds,lang),trainer=trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer

class BilingualDataset(Dataset):

    def __init__(self, ds, tokenizer_src, tokenizer_tgt, src_lang, tgt_lang, seq_len):
        super().__init__()
        self.seq_len = seq_len

        self.ds = ds
        self.tokenizer_src = tokenizer_src
        self.tokenizer_tgt = tokenizer_tgt
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang

        self.sos_token = torch.tensor([tokenizer_tgt.token_to_id("[SOS]")], dtype=torch.int64)
        self.eos_token = torch.tensor([tokenizer_tgt.token_to_id("[EOS]")], dtype=torch.int64)
        self.pad_token = torch.tensor([tokenizer_tgt.token_to_id("[PAD]")], dtype=torch.int64)

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        src_target_pair = self.ds[idx]
        src_text = src_target_pair['translation'][self.src_lang]
        tgt_text = src_target_pair['translation'][self.tgt_lang]

        # Transform the text into tokens
        enc_input_tokens = self.tokenizer_src.encode(src_text).ids
        dec_input_tokens = self.tokenizer_tgt.encode(tgt_text).ids

        # Add sos, eos and padding to each sentence
        enc_num_padding_tokens = self.seq_len - len(enc_input_tokens) - 2  # -2: the encoder input gets both <s> and </s>
        # The decoder input gets only <s>; </s> is added only to the label
        dec_num_padding_tokens = self.seq_len - len(dec_input_tokens) - 1  # -1: one special token is added to each

        # Make sure the number of padding tokens is not negative. If it is, the sentence is too long
        if enc_num_padding_tokens < 0 or dec_num_padding_tokens < 0:
            raise ValueError("Sentence is too long")

        # Add <s> and </s> token
        encoder_input = torch.cat(
            [
                self.sos_token,
                torch.tensor(enc_input_tokens, dtype=torch.int64),
                self.eos_token,
                torch.tensor([self.pad_token] * enc_num_padding_tokens, dtype=torch.int64),
            ],
            dim=0,
        )

        # Add only <s> token
        decoder_input = torch.cat(
            [
                self.sos_token,
                torch.tensor(dec_input_tokens, dtype=torch.int64),
                torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),
            ],
            dim=0,
        )

        # Add only </s> token
        label = torch.cat(
            [
                torch.tensor(dec_input_tokens, dtype=torch.int64),
                self.eos_token,
                torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),
            ],
            dim=0,
        )

        # Double check the size of the tensors to make sure they are all seq_len long
        assert encoder_input.size(0) == self.seq_len
        assert decoder_input.size(0) == self.seq_len
        assert label.size(0) == self.seq_len

        return {
            "encoder_input": encoder_input,  # (seq_len)
            "decoder_input": decoder_input,  # (seq_len)
            "encoder_mask": (encoder_input != self.pad_token).unsqueeze(0).unsqueeze(0).int(), # (1, 1, seq_len)
            "decoder_mask": (decoder_input != self.pad_token).unsqueeze(0).int() & causal_mask(decoder_input.size(0)), # (1, seq_len) & (1, seq_len, seq_len),
            "label": label,  # (seq_len)
            "src_text": src_text,
            "tgt_text": tgt_text,
        }

def causal_mask(size):
    mask = torch.triu(torch.ones(1,size,size),diagonal=1).type(torch.int)
    return mask == 0



def get_ds(config):
    ds_raw = load_dataset('opus_books',f"{config['lang_src']}-{config['lang_tgt']}",split='train')

    # build the tokenizer
    tokenizer_src = get_or_build_tokenizer(config,ds_raw,config['lang_src'])
    tokenizer_tgt = get_or_build_tokenizer(config,ds_raw,config['lang_tgt'])

    # keep 90% for training and 10% for validation
    train_ds_size = int(0.9 * len(ds_raw))
    val_ds_size = len(ds_raw) - train_ds_size
    train_ds_raw , val_ds_raw = random_split(ds_raw,[train_ds_size,val_ds_size])

    print("train_ds_raw",next(iter(train_ds_raw)))
    # train_ds and val_ds
    train_ds = BilingualDataset(train_ds_raw,tokenizer_src,tokenizer_tgt,config['lang_src'],config['lang_tgt'],config['seq_len'])
    val_ds  = BilingualDataset(val_ds_raw,tokenizer_src,tokenizer_tgt,config['lang_src'],config['lang_tgt'],config['seq_len'])
    max_len_src = 0
    max_len_tgt = 0
    for item in ds_raw:
        src_ids = tokenizer_src.encode(item['translation'][config['lang_src']]).ids
        tgt_ids = tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids
        max_len_src = max(max_len_src , len(src_ids))
        max_len_tgt = max(max_len_tgt,len(tgt_ids))
    print(f"Max length of Soure sentence : {max_len_src}")
    print(f"Max length of target sentence : {max_len_tgt}")

    train_dataloader = DataLoader(train_ds,batch_size=config['batch_size'],shuffle= True)
    val_dataloader = DataLoader(val_ds, batch_size = 1 , shuffle = True)
    print(train_dataloader)
    return train_dataloader , val_dataloader, tokenizer_src , tokenizer_tgt



In [2]:
%%writefile config2.py
from pathlib import Path
def get_config():
    return {
        "batch_size" : 8,
        "num_epochs" : 20,
        "lr" : 10**-4,
        "seq_len" : 350,
        "d_model" : 512,
        "lang_src" : "en",
        "lang_tgt" : "it",
        "model_folder" : "weights",
        "model_basename" : "tmodel_",
        "preload" : None,
        "tokenizer_file" : "tokenizer_{0}.json",
        "experiment_name" : "runs/tmodel"
    }
def get_weights_file_path(config,epoch:str):
    model_folder = config['model_folder']
    model_basename = config['model_basename']
    model_filename = f"{model_basename}{epoch}.pt"
    return str(Path('.') / model_folder / model_filename)

Overwriting config2.py


In [3]:
from config2 import *
config = get_config()
train_dataloader , val_dataloader, tokenizer_src , tokenizer_tgt = get_ds(config)

train_ds_raw {'id': '14822', 'translation': {'en': 'And every gentle air that dallied,', 'it': 'E, ad ogni dolce venticello'}}
Max length of Soure sentence : 309
Max length of target sentence : 274
<torch.utils.data.dataloader.DataLoader object at 0x0000027977B2B430>


In [71]:
print(tds)

<__main__.BilingualDataset object at 0x000002EC7EC941C0>


In [72]:
tokenizer_src

Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":1, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":2, "content":"[SOS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":3, "content":"[EOS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=None, pre_tokenizer=Whitespace(), post_processor=None, decoder=None, model=WordLevel(vocab={"[UNK]":0, "[PAD]":1, "[SOS]":2, "[EOS]":3, ",":4, "the":5, "and":6, ".":7, "to":8, "I":9, "of":10, "a":11, "'":12, "in":13, "was":14, "that":15, "he":16, "it":17, ";":18, "had":19, "his":20, "not":21, "with":22, "her":23, "you":24, "as":25, "for":26, "she":27, "my":28, "-":29, "at":30, "but":31, "him":32, "me":33, "is":34, """:35, "on":36, "be":37, ":

22463

In [4]:
# Visualization of variables in the DataLoader
def visualize_dataloader(train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt):
    # Get a single batch from the train DataLoader
    print("Visualizing the Training DataLoader...\n")
    for batch_idx, batch in enumerate(train_dataloader):
        print(f"Batch {batch_idx + 1}:\n")
        
        for key, value in batch.items():
            if isinstance(value, torch.Tensor):
                print(f"Key: {key}")
                print(f"Shape: {value.shape}")
                print(f"Content (First Sample): {value[0]}\n")
            else:
                print(f"Key: {key} - Type: {type(value)}\n")
        
        print("\n--- Text Reconstruction from Token IDs ---")
        print("Source Text:")
        print(batch["src_text"][0])  # Original source text
        print("Reconstructed Source Tokens:")
        print(tokenizer_src.decode(batch["encoder_input"][0].tolist()))
        
        print("Target Text:")
        print(batch["tgt_text"][0])  # Original target text
        print("Reconstructed Target Tokens:")
        print(tokenizer_tgt.decode(batch["decoder_input"][0].tolist()))
        
        # Break after visualizing the first batch
        break

    print("\nVisualizing the Validation DataLoader...\n")
    for batch_idx, batch in enumerate(val_dataloader):
        print(f"Batch {batch_idx + 1}:\n")
        
        for key, value in batch.items():
            if isinstance(value, torch.Tensor):
                print(f"Key: {key}")
                print(f"Shape: {value.shape}")
                print(f"Content (First Sample): {value[0]}\n")
            else:
                print(f"Key: {key} - Type: {type(value)}\n")
        
        print("\n--- Text Reconstruction from Token IDs ---")
        print("Source Text:")
        print(batch["src_text"][0])  # Original source text
        print("Reconstructed Source Tokens:")
        print(tokenizer_src.decode(batch["encoder_input"][0].tolist()))
        
        print("Target Text:")
        print(batch["tgt_text"][0])  # Original target text
        print("Reconstructed Target Tokens:")
        print(tokenizer_tgt.decode(batch["decoder_input"][0].tolist()))
        
        # Break after visualizing the first batch
        break

# Call the visualization function
visualize_dataloader(train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt)

Visualizing the Training DataLoader...

Batch 1:

Key: encoder_input
Shape: torch.Size([8, 350])
Content (First Sample): tensor([    2,  4724,   878,   110,   260, 11501,   746,  3888, 11501,  9176,
           13,     5, 14352,    10,   150,    43,   197,   809,     7,     3,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,