In [None]:
print("hello")
print()


### Decoder Causal Language Models

### GPT vs. ChatGPT

GPT: General-purpose language model, used for text generation, translation, summarization, etc.

ChatGPT: Fine-tuned GPT optimized for conversational AI, maintaining dialogue history for human-like conversations.

Key Difference: GPT does one-time text generation, while ChatGPT is designed for interactive dialogue.

In [None]:
!pip install torch==2.2.2
!pip install torchtext==0.17.2
!pip install portalocker==2.8.2
!pip install torchdata==0.7.1
!pip install pandas
!pip install matplotlib==3.9.0 scikit-learn==1.5.0
!pip install numpy==1.26.0
!pip install transformers==4.40.0

torchdata: Enhances data loading and preprocessing functionalities for PyTorch, streamlining the workflow for machine learning models.

portalocker: Provides a mechanism to lock files, ensuring that only one process can access a file at a time, useful for managing file resources in concurrent applications.

torchtext: Offers utilities for text processing and datasets in PyTorch, simplifying the preparation of data for NLP tasks.

matplotlib: A plotting library for creating static, interactive, and animated visualizations in Python, commonly used for data visualization and graphical plotting tasks.

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [None]:
from torch.utils.data import DataLoader
import torch
from typing import Iterable, List
import matplotlib.pyplot as plt
from torch import Tensor
import torch.nn as nn
from torch.nn import Transformer
import math
from torchtext.vocab import Vocab
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence
from torchtext.datasets import IMDB,PennTreebank
import time
from torch.optim import Adam


# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

In [None]:
# Load the dataset
train_iter, val_iter = IMDB()

In [None]:
data_itr=iter(train_iter)
# retrieve the third record by advancing the iterator three times
next(data_itr)
next(data_itr)
next(data_itr)

In [None]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
DEVICE


## Preprocessing data
The provided code is used for preprocessing text data, particularly for NLP tasks, with a focus on tokenization and vocabulary building.

- **Special Symbols and Indices**: Initializes special tokens (`<unk>`, `<pad>`, and `<|endoftext|>`) with their corresponding indices (`0`, `1`, and `2`). These tokens represent unknown words, padding, and end of sequence, respectively.
    - `UNK_IDX`: Index for unknown words.
    - `PAD_IDX`: Index used for padding shorter sentences in a batch to ensure uniform length.
    - `EOS_IDX`: Index of the end-of-sequence token `<|endoftext|>`, which marks where a text ends.

- **`yield_tokens` Function**: A generator function that iterates through a dataset (`data_iter`), tokenizing each data sample using a `tokenizer` function, and yields one tokenized sample at a time.

- **Vocabulary building**: Constructs a vocabulary from the tokenized dataset. The `build_vocab_from_iterator` function processes tokens generated by `yield_tokens`, places the special tokens (`special_symbols`) at the beginning of the vocabulary, and keeps the default minimum frequency (`min_freq=1`), so every token that appears at least once is included.

- **Default index for unknown tokens**: Sets a default index for tokens not found in the vocabulary (`UNK_IDX`), ensuring that out-of-vocabulary words are handled as unknown tokens.

- **`text_to_index` function**: Converts a given text into a sequence of indices based on the built vocabulary. This function is essential for transforming raw text into a numerical format that can be processed by machine learning models.

- **`index_to_en` function**: Transforms a sequence of indices back into a readable string. It's useful for interpreting the outputs of models and converting numerical predictions back into text.

- **Check functionality**: Demonstrates the use of `index_to_en` by converting a tensor of indices `[0,1,2,3]` back into text: the three special symbols followed by the first regular vocabulary token. This helps verify that the vocabulary and index conversion functions are working as expected.
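
The steps above can be sketched without torchtext. The following is a minimal, dependency-free illustration rather than the notebook's actual pipeline: `build_toy_vocab`, `toy_text_to_index`, and `toy_index_to_text` are hypothetical names, and plain whitespace splitting stands in for the `basic_english` tokenizer.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<|endoftext|>"]  # occupy indices 0, 1, 2
UNK = 0

def build_toy_vocab(texts, min_freq=1):
    """Return (stoi, itos) with the special symbols first, then tokens by frequency."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Sort by descending frequency, then alphabetically, for a deterministic order
    ordered = sorted((t for t, c in counts.items() if c >= min_freq),
                     key=lambda t: (-counts[t], t))
    itos = SPECIALS + ordered
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def toy_text_to_index(text, stoi):
    # Tokens missing from the vocabulary fall back to the <unk> index
    return [stoi.get(tok, UNK) for tok in text.lower().split()]

def toy_index_to_text(indices, itos):
    return " ".join(itos[i] for i in indices)

stoi, itos = build_toy_vocab(["the movie was great", "the plot was thin"])
ids = toy_text_to_index("the movie was awful", stoi)
print(ids)
print(toy_index_to_text(ids, itos))
```

The word "awful" never appears in the toy corpus, so it maps to index `0` and decodes back to `<unk>`, which is the role `vocab.set_default_index(UNK_IDX)` plays in the notebook.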


In [None]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, EOS_IDX = 0, 1, 2
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<|endoftext|>' ]

In [None]:
tokenizer = get_tokenizer("basic_english")

In [None]:
def yield_tokens(data_iter):

    for _,data_sample in data_iter:
        yield  tokenizer(data_sample)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=special_symbols, special_first=True)
vocab.set_default_index(UNK_IDX)


In [None]:
# vocab expects a list of tokens, so pass the whole tokenized text at once
text_to_index=lambda text: vocab(tokenizer(text))
index_to_en = lambda seq_en: " ".join([vocab.get_itos()[index] for index in seq_en])

In [None]:
#check
index_to_en(torch.tensor([0,1,2,3]))

### Collate function
For our decoder model, we need a collate function that turns raw text into (source, target) pairs. The sampling itself is done by `get_sample(block_size, text)`, which draws a random window of `block_size` tokens (`src_sequence`) together with the same window shifted one token ahead (`tgt_sequence`). When the text is not longer than the block size, the whole text is used and the target is completed with `<|endoftext|>` so that both sequences have the same length.


The source (`src_sequence`) and target (`tgt_sequence`) sequences are offset by one position because the data is prepared for next-token prediction.

Reason for the one-token shift (supervised learning setup):

The model is trained to predict the next token in the sequence given the tokens up to the current position, so the target at each position is simply the token that follows it in the text.
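
To make the offset concrete, here is a small sketch on a hand-written token list. The helper `shifted_pair` is a hypothetical name used only for illustration; it takes a fixed start position instead of a random one so the result is reproducible.

```python
def shifted_pair(tokens, start, block_size):
    """Return (src, tgt) where tgt is the src window moved one token ahead."""
    src = tokens[start:start + block_size]
    tgt = tokens[start + 1:start + block_size + 1]
    return src, tgt

tokens = "i rented this movie last night".split()
src, tgt = shifted_pair(tokens, start=1, block_size=3)

# Each source token is paired with the token the model should predict next
for s, t in zip(src, tgt):
    print(f"given {s!r:10} predict {t!r}")
```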

In [None]:
def get_sample(block_size, text):
    # Determine the length of the input text
    sample_leg = len(text)
    # Calculate the stopping point for randomly selecting a sample
    # This ensures the selected sample doesn't exceed the text length
    random_sample_stop = sample_leg - block_size


    # Check if a random sample can be taken (if the text is longer than block_size)
    if random_sample_stop >= 1:
        # Randomly select a starting point for the sample
        random_start = torch.randint(low=0, high=random_sample_stop, size=(1,)).item()
        # Define the endpoint of the sample
        stop = random_start + block_size

        # Create the input and target sequences
        src_sequence = text[random_start:stop]
        tgt_sequence= text[random_start + 1:stop + 1]

    # Handle the case where the text is no longer than the block size
    else:
        # Start from the beginning and use the entire text
        random_start = 0
        stop = sample_leg
        src_sequence= text[random_start:stop]
        tgt_sequence = text[random_start + 1:stop]
        # Append the end-of-text token so the target has the same length as the source
        tgt_sequence.append( '<|endoftext|>')

    return src_sequence, tgt_sequence

In [None]:
BATCH_SIZE=1

batch_of_tokens=[]

for i in range(BATCH_SIZE):
  _,text =next(iter(train_iter))
  batch_of_tokens.append(tokenizer(text))

In [None]:
text=batch_of_tokens[0][0:100]
text[0:100]
batch_of_tokens

To test the `get_sample` function with a block size of 10, where the output includes both the source sequence and the target sequence, with the target sequence being the source sequence shifted by one token, you can use the following code as an example:


In [None]:
block_size=10
src_sequences, tgt_sequence=get_sample( block_size, text)

In [None]:
print("src: ",src_sequences)
print("tgt: ",tgt_sequence)

In [None]:
# Initialize empty lists to store source and target sequences
src_batch, tgt_batch = [], []

# Define the batch size
BATCH_SIZE = 2

# Loop to create batches of source and target sequences
for i in range(BATCH_SIZE):
    # Retrieve the next data point from the training iterator
    _,text = next(iter(train_iter))

    # Generate source and target sequences using the get_sample function
    src_sequence_text, tgt_sequence_text = get_sample(block_size, tokenizer(text))

    # Convert source and target sequences to tokenized vocabulary indices
    src_sequence_indices = vocab(src_sequence_text)
    tgt_sequence_indices = vocab(tgt_sequence_text)

    # Convert the sequences to PyTorch tensors with dtype int64
    src_sequence = torch.tensor(src_sequence_indices, dtype=torch.int64)
    tgt_sequence = torch.tensor(tgt_sequence_indices, dtype=torch.int64)

    # Append the source and target sequences to their respective batches
    src_batch.append(src_sequence)
    tgt_batch.append(tgt_sequence)

    # Print the details of each sample
    print(f"Sample {i}:")
    print("Source Sequence (Text):", src_sequence_text)
    print("Source Sequence (Indices):", src_sequence_indices)
    print("Source Sequence (Shape):", src_sequence.shape)
    print("Target Sequence (Text):", tgt_sequence_text)
    print("Target Sequence (Indices):", tgt_sequence_indices)
    print("Target Sequence (Shape):", tgt_sequence.shape)

The collate_batch function prepares batches of source and target sequences for training by processing each text sample in a given batch. It generates source and target sequences using the get_sample function with a specified block size, converts these sequences to indices using a vocabulary, and transforms them into PyTorch tensors. The sequences are then padded to ensure uniform length across the batch. Finally, it returns the padded source and target batches, ready for training on the specified device (DEVICE).
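
The padding step can be illustrated in plain Python. This is only a sketch of what the padding achieves, not of how `pad_sequence` is implemented: `pad_batch` is a hypothetical helper, and `PAD = 1` mirrors `PAD_IDX` in this notebook.

```python
PAD = 1  # same value as PAD_IDX in the notebook

def pad_batch(sequences, pad_value=PAD):
    """Pad every sequence to the longest one and return rows indexed by time step."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # Transpose to (max_len, batch_size), matching batch_first=False
    return [list(step) for step in zip(*padded)]

batch = pad_batch([[5, 6, 7], [8, 9]])
print(batch)
```

Each inner list holds one time step across the batch, so the shorter sequence contributes a `PAD` value in the last row.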

In [None]:
BLOCK_SIZE=30
def collate_batch(batch):
    src_batch, tgt_batch = [], []
    for _,_textt in batch:
      src_sequence,tgt_sequence=get_sample(BLOCK_SIZE,tokenizer(_textt))
      src_sequence=vocab(src_sequence)
      tgt_sequence=vocab(tgt_sequence)
      src_sequence= torch.tensor(src_sequence, dtype=torch.int64)
      tgt_sequence = torch.tensor(tgt_sequence, dtype=torch.int64)
      src_batch.append(src_sequence)
      tgt_batch.append(tgt_sequence)


    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX, batch_first=False)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX, batch_first=False)

    return src_batch.to(DEVICE), tgt_batch.to(DEVICE)

The code sets up data loaders for the training and validation sets using the `DataLoader` class, each with the custom `collate_batch` function for batch processing. The data loaders use a batch size of 1 for simplicity and shuffle the data. After initializing the training data loader, the code draws ten batches of source and target sequences, converts each back to text with `index_to_en`, and prints the result, demonstrating how to access and display preprocessed data ready for model training.

In [None]:
BATCH_SIZE=1
dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
val_dataloader= DataLoader(val_iter , batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)

In [None]:
dataset=iter(dataloader)
for sample in range(10):
  src,trt=next(dataset)
  print("sample",sample)
  print("source:",index_to_en(src))
  print("\n")
  print("target:",index_to_en(trt))
  print("\n")

In [None]:
for  src,trt in dataset:
    print(trt.shape)
    print(src.shape)
    print(index_to_en(src[0,:]))
    print(index_to_en(trt[0,:]))
    break

In [None]:
print("source:",index_to_en(src))
print("target:",index_to_en(trt))

### Masking

In transformers, masking is crucial for ensuring certain positions are not attended to. The function `generate_square_subsequent_mask` produces a square matrix with `-inf` in the upper triangle (above the main diagonal) and `0` elsewhere, which ensures that during decoding a token cannot attend to the tokens that come after it.

How This Works in Transformers

Assume we are generating a sequence:

Input: ["A", "B", "C", "D", "E"]

At each timestep:

Token "A" can only attend to itself.

Token "B" can attend to "A" but not "C", "D", "E".

Token "C" can attend to "A, B" but not "D, E", and so on.

This prevents the model from seeing future tokens when generating output during autoregressive decoding.
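
The same pattern can be reproduced with nested lists. The sketch below is a plain-Python illustration of the mask layout, with `causal_mask` as a hypothetical helper name; it prints which tokens each position is allowed to attend to.

```python
NEG_INF = float("-inf")

def causal_mask(size):
    """Row i holds 0.0 for positions <= i and -inf for positions > i."""
    return [[0.0 if col <= row else NEG_INF for col in range(size)]
            for row in range(size)]

tokens = ["A", "B", "C", "D", "E"]
mask = causal_mask(len(tokens))

for row, tok in enumerate(tokens):
    # A 0.0 entry leaves the attention score unchanged; -inf removes it after softmax
    visible = [tokens[col] for col in range(len(tokens)) if mask[row][col] == 0.0]
    print(tok, "attends to", visible)
```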

In [None]:
# This function creates an upper triangular mask for transformer models to prevent a token from attending to future tokens in decoder self-attention.
def generate_square_subsequent_mask(sz,device=DEVICE):
    mask = (torch.triu(torch.ones((sz, sz), device=device)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

### Padding Mask (src_padding_mask):

Masks out padding tokens in the input, ensuring that the model doesn't attend to these irrelevant tokens.
This is crucial for sequences of different lengths, where padding is added to equalize sequence lengths.
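
As a small illustration, the padding mask is just an element-wise comparison with the padding index. The sketch below uses plain lists and a hypothetical helper `padding_mask`; the batch is written batch-first for readability, which corresponds to the layout after the `transpose(0, 1)` in `create_mask`.

```python
PAD = 1  # same value as PAD_IDX in the notebook

def padding_mask(batch, pad_value=PAD):
    """True marks a padding position that attention should ignore."""
    return [[tok == pad_value for tok in seq] for seq in batch]

# Two sequences of length 4; the second one was padded from length 2
batch = [[12, 7, 30, 4], [9, 15, PAD, PAD]]
print(padding_mask(batch))
```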

In [None]:
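def create_mask(src,device=DEVICE):
    src_seq_len = src.shape[0]
    # Causal mask of shape (seq_len, seq_len), built on the requested device
    src_mask = generate_square_subsequent_mask(src_seq_len, device=device)
    # Boolean mask of shape (batch, seq_len); True marks padding positions
    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    return src_mask,src_padding_mask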

In [None]:
#Replace first four tokens with PAD token so we can also check how pad tokens are masked using padding_mask
src[0:4]=PAD_IDX

In [None]:
mask,padding_mask = create_mask(src)
src

## Positional encoding

The Transformer model doesn't have built-in knowledge of the order of tokens in the sequence. To give the model this information, positional encodings are added to the embeddings of the tokens. These encodings have a fixed pattern based on their position in the sequence.

GPT uses trainable positional encodings. Unlike fixed positional encodings (such as sinusoidal encodings used in the original Transformer paper), trainable positional encodings are learned during the model training process.

Trainable positional encodings are implemented as a set of learnable parameters, one for each position in the input sequence. These parameters have the same dimensionality as the token embeddings. During training, the model updates the positional encoding parameters along with the other model parameters to capture the positional information more effectively.

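The use of trainable positional encodings in GPT allows the model to learn more flexible and task-specific positional representations, potentially improving its performance on various natural language processing tasks. Note that the `PositionalEncoding` class in this notebook implements the fixed sinusoidal variant, stored as a non-trainable buffer, rather than learned position embeddings.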
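
For reference, the fixed sinusoidal pattern can be computed in a few lines of plain Python. This is a sketch of the formula only, with `sinusoidal_encoding` as a hypothetical helper name; it uses the same base of 10000 as the original Transformer formulation.

```python
import math

def sinusoidal_encoding(max_len, emb_size):
    """For each even index k: table[pos][k] = sin(pos / 10000**(k / emb_size)),
    and table[pos][k + 1] = cos of the same angle."""
    table = [[0.0] * emb_size for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, emb_size, 2):
            angle = pos / (10000 ** (k / emb_size))
            table[pos][k] = math.sin(angle)
            if k + 1 < emb_size:
                table[pos][k + 1] = math.cos(angle)
    return table

pe = sinusoidal_encoding(max_len=4, emb_size=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

A trainable alternative would replace this table with a learned lookup such as `nn.Embedding(max_len, emb_size)` indexed by position; that variant is not implemented in this notebook.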



In [None]:
# add positional information to the input tokens
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

In [None]:
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

Token embedding

Token embedding, also known as word embedding or word representation, is a way to convert words or tokens from a text corpus into numerical vectors in a continuous vector space. Each unique word or token in the corpus is assigned a fixed-length vector where the numerical values represent various linguistic properties of the word, such as its meaning, context, or relationships with other words.

The `TokenEmbedding` class above converts token indices into embeddings and scales them by the square root of the embedding size.

## Custom GPT model architecture

The `CustomGPTModel` class defines a transformer-based model architecture for generative pre-trained models. This model aims to generate text and perform various NLP tasks. Below is an explanation of the main components of the class:

- **Initialization (`__init__`)**: The constructor takes several parameters including `embed_size`, `vocab_size`, `num_heads`, `num_layers`, `max_seq_len`, and `dropout`. It initializes the embedding layer, positional encoding, transformer encoder layers, and a linear layer (`lm_head`) for generating logits over the vocabulary.

- **Weight initialization (`init_weights`)**: This method initializes the weights of the model for better training convergence. The Xavier uniform initialization is used, which is a common practice for initializing weights in deep learning.

- **Decoder (`decoder`)**: Although named `decoder`, this method currently functions as the forward pass through the transformer encoder layers, followed by the generation of logits for the language modeling task. It handles the addition of positional encodings to the embeddings and applies a mask if necessary.

- **Forward pass (`forward`)**: This method is similar to the `decoder` method and defines the forward computation of the model. It processes the input through embedding layers, positional encoding, transformer encoder layers, and produces the final output using the `lm_head`.

- **Mask generation**: Both `decoder` and `forward` methods contain logic to generate a square causal mask if no source mask is provided. This mask ensures that the prediction for a position does not depend on the future tokens in the sequence, which is important for the autoregressive nature of GPT models.

- **Encoder layers instead of decoder layers**: The implementation uses `nn.TransformerEncoder` layers with a causal mask rather than `nn.TransformerDecoder` layers. Since there is no separate encoder output to attend to, this is a common simplification for models focusing on language modeling and generation.

This class effectively encapsulates the necessary components to create a GPT-like model, allowing for training on language modeling tasks and text generation applications.


In [None]:
class CustomGPTModel(nn.Module):
    def __init__(self, embed_size,vocab_size, num_heads, num_layers, max_seq_len=500,dropout=0.1):

        super().__init__()

        self.embed = nn.Embedding(vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size, dropout=dropout)

        # A stack of encoder layers combined with a causal mask acts as a decoder-only model
        encoder_layers = nn.TransformerEncoderLayer(d_model=embed_size, nhead=num_heads, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers=num_layers)
        self.embed_size = embed_size
        self.lm_head = nn.Linear(embed_size, vocab_size)

        # Initialize weights only after all layers have been created;
        # calling this earlier would find no parameters to initialize
        self.init_weights()

    def init_weights(self):
      for p in self.parameters():
          if p.dim() > 1:
              nn.init.xavier_uniform_(p)

    @staticmethod
    def create_mask(src,device=DEVICE):
        src_seq_len = src.shape[0]
        src_mask = nn.Transformer.generate_square_subsequent_mask(src_seq_len).to(device)
        src_padding_mask = (src == PAD_IDX).transpose(0, 1)
        return src_mask,src_padding_mask

    def decoder(self, x,src_mask):
        seq_length = x.size(0)

        # Embed the tokens and add positional encodings
        x = self.embed(x)* math.sqrt(self.embed_size)
        x = self.positional_encoding(x)

        if src_mask is None:
            # Generate a square causal mask for the sequence. The masked positions are
            # filled with float('-inf'). Unmasked positions are filled with float(0.0).
            src_mask = generate_square_subsequent_mask(seq_length, device=x.device)

        output = self.transformer_encoder(x, src_mask)
        # Project the transformer output (not the raw embeddings) onto the vocabulary
        logits = self.lm_head(output)
        return logits

    def forward(self,x,src_mask=None,key_padding_mask=None):

        seq_length = x.size(0)

        # Embed the tokens and add positional encodings
        x = self.embed(x)* math.sqrt(self.embed_size)
        x = self.positional_encoding(x)

        if src_mask is None:
            # Generate a square causal mask for the sequence. The masked positions are
            # filled with float('-inf'). Unmasked positions are filled with float(0.0).
            src_mask = generate_square_subsequent_mask(seq_length, device=x.device)

        output = self.transformer_encoder(x, mask=src_mask, src_key_padding_mask=key_padding_mask)
        # Project the transformer output (not the raw embeddings) onto the vocabulary
        return self.lm_head(output)


What happens in each layer?

1️⃣ Multi-Head Self-Attention: Understands relationships between tokens.

2️⃣ Layer Normalization: Stabilizes training.

3️⃣ Feed-Forward Network (FFN): Adds additional transformation capability.

4️⃣ Residual Connections: Helps with training deep models.

5️⃣ Dropout: Prevents overfitting.

Each encoder layer processes the input sequentially, refining token representations before passing them to the next layer.

In [None]:
ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
nlayers = 2  # number of ``nn.TransformerEncoderLayer`` in ``nn.TransformerEncoder``
nhead = 2  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability

model = CustomGPTModel(embed_size=emsize, num_heads=nhead, num_layers=nlayers, vocab_size=ntokens,dropout=dropout).to(DEVICE)

In [None]:
def encode_prompt(prompt, block_size=BLOCK_SIZE):
    # Handle None prompt
    while prompt is None:
        prompt = input("Sorry, prompt cannot be empty. Please enter a valid prompt: ")

    tokens = tokenizer(prompt)
    number_of_tokens = len(tokens)

    # Handle long prompts
    if number_of_tokens > block_size:
        tokens = tokens[-block_size:]  # Keep the last block_size tokens

    prompt_indices = vocab(tokens)
    prompt_encoded = torch.tensor(prompt_indices, dtype=torch.int64).reshape(-1, 1)
    return prompt_encoded

In [None]:
print(index_to_en(encode_prompt("This is a prompt to get model generate next words." ) ))

In [None]:
prompt_encoded=encode_prompt("This is a prompt to get model generate next words.").to(DEVICE)
prompt_encoded

In [None]:
logits = model.decoder(prompt_encoded,src_mask=None).to(DEVICE)

In [None]:
logits

In [None]:
logits = logits.transpose(0, 1)
logits.shape

seq_len = 11 → The input sequence has 11 tokens.

batch_size = 1 → Only one sequence is being processed.

vocab_size = 68813 → The vocabulary contains 68,813 unique tokens, meaning the model predicts a probability distribution over this many possible tokens at each position.

In [None]:
logit_prediction = logits[:,-1]
logit_prediction.shape

In [None]:
_, next_word_index = torch.max(logit_prediction, dim=1)
next_word_index

## Autoregressive text generation

In decoder models, we simply append the output to the input to generate the next response. We stop this process when we encounter the end-of-sequence tag <|endoftext|> or if the input becomes too large. We will implement it as a function later in this notebook.
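
Before applying this to the real model, the loop can be sketched with a toy stand-in. Everything here is hypothetical and for illustration only: `toy_next_token_scores` replaces the model with a rule that favors "last token + 1", and `greedy_generate` shows the append-and-repeat logic with an end-of-sequence check and a sliding context window.

```python
EOS = 2  # same value as EOS_IDX in the notebook

def toy_next_token_scores(sequence, vocab_size=6):
    """Hypothetical stand-in for a model: favors (last token + 1), else EOS."""
    scores = [0.0] * vocab_size
    nxt = sequence[-1] + 1
    scores[nxt if nxt < vocab_size else EOS] = 1.0
    return scores

def greedy_generate(prompt_ids, max_new_tokens=10, block_size=4):
    sequence = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        # Only the last block_size tokens are shown to the "model"
        scores = toy_next_token_scores(sequence[-block_size:])
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_id == EOS:
            break  # stop as soon as the end-of-sequence token is predicted
        sequence.append(next_id)
        generated.append(next_id)
    return generated

print(greedy_generate([3]))
```

Starting from token `3`, the toy rule produces `4` and `5` and then predicts the end-of-sequence index, which ends the loop before `max_new_tokens` is reached.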


In [None]:
prompt="this is the beginning of"

In [None]:
prompt_encoded = encode_prompt(prompt).to(DEVICE)
print("Shape of prompt_encoded:", prompt_encoded.shape)

In [None]:
max_new_tokens=10

In [None]:
for i in range(max_new_tokens):
    logits = model.decoder(prompt_encoded,src_mask=None)
    logits = logits.transpose(0, 1)
    print(" ")
    print(f"Shape of logits at step {i}: {logits.shape}")

    logit_prediction = logits[:, -1]
    print(f"Shape of logit_prediction at step {i}: {logit_prediction.shape}")

    next_token_encoded = torch.argmax(logit_preiction, dim=-1).reshape(-1, 1)
    print(f"Shape of next_token_encoded at step {i}: {next_token_encoded.shape}")

    prompt_encoded = torch.cat((prompt_encoded, next_token_encoded), dim=0).to(DEVICE)
    print(f"Sequence for step {i}: {[index_to_en(j) for j in prompt_encoded]}")
    print(f"Shape of prompt_encoded after concatenation at step {i}: {prompt_encoded.shape}")

In [None]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, EOS_IDX = 0, 1, 2
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<|endoftext|>' ]
BLOCK_SIZE

In [None]:

#auto-regressive Language Model text generation
def generate(model, prompt=None, max_new_tokens=500, block_size=BLOCK_SIZE, vocab=vocab, tokenizer=tokenizer):
    # Move model to the specified device (e.g., GPU or CPU) and disable dropout for inference
    model.to(DEVICE)
    model.eval()

    # Encode the input prompt, truncating it to the last block_size tokens if needed
    prompt_encoded = encode_prompt(prompt, block_size).to(DEVICE)
    tokens = []

    # Generate new tokens up to max_new_tokens
    for _ in range(max_new_tokens):
        # Decode the encoded prompt using the model's decoder
        logits = model(prompt_encoded,src_mask=None,key_padding_mask=None)

        # Transpose the logits to (batch, seq_len, vocab_size) so the batch dimension comes first
        logits = logits.transpose(0, 1)

        # Select the logits of the last token in the sequence
        logit_prediction = logits[:, -1]

        # Choose the most probable next token from the logits(greedy decoding)
        next_token_encoded = torch.argmax(logit_prediction, dim=-1).reshape(-1, 1)

        # If the next token is the end-of-sequence (EOS) token, stop generation
        if next_token_encoded.item() == EOS_IDX:
            break

        # Append the next token to the prompt_encoded and keep only the last 'block_size' tokens
        prompt_encoded = torch.cat((prompt_encoded, next_token_encoded), dim=0)[-block_size:]

        # Convert the next token index to a token string using the vocabulary
        # Move the tensor back to CPU for vocab lookup if needed
        token_id = next_token_encoded.to('cpu').item()
        tokens.append(vocab.get_itos()[token_id])

    # Join the generated tokens into a single string and return
    return ' '.join(tokens)

In [None]:
generate(model,prompt="this is the beginning of",max_new_tokens=30,vocab=vocab,tokenizer=tokenizer)

### Decoding the differences: Training vs. inference

The key difference between the training and inference stages lies in the inputs to the decoder. During training, the decoder is fed the ground-truth target tokens at every position, a technique known as "teacher forcing," so it never has to rely on its own earlier predictions. During inference no ground truth is available, so the model generates one token at a time and each prediction is appended to the input for the next step, as in the `generate` function above. Apart from this, the datasets and the training procedure resemble those used for more conventional neural network models.

To start the training, first create a Cross Entropy Loss object. The loss will not consider PAD tokens.
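
To show what ignoring PAD tokens means in practice, here is a plain-Python sketch of a mean cross-entropy that skips masked targets. It illustrates the idea and is not PyTorch's implementation; `cross_entropy_ignore_pad` is a hypothetical name, and the sketch assumes at least one target is not padding.

```python
import math

PAD = 1  # same value as PAD_IDX in the notebook

def cross_entropy_ignore_pad(logits, targets, ignore_index=PAD):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        # log-sum-exp with the maximum subtracted for numerical stability
        top = max(row)
        log_norm = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_norm - row[target]
        count += 1
    return total / count

# Uniform logits over 4 classes; the second position is padding and is skipped
logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [3, PAD]
loss = cross_entropy_ignore_pad(logits, targets)
print(loss, math.exp(loss))
```

With uniform logits over four classes the loss equals `log(4)`, so its exponential, the perplexity, is 4: the model is as uncertain as a uniform guess among four tokens.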


In [None]:
from torch.nn import CrossEntropyLoss
loss_fn = CrossEntropyLoss(ignore_index=PAD_IDX)

In [None]:
src,tgt=next(iter(dataloader))

mask,padding_mask = create_mask(src)

In [None]:
logits = model(src,src_mask=mask,key_padding_mask=padding_mask)
print(logits.shape)

In [None]:
print("output shape",logits.shape)
print("source shape ",src.shape)

In [None]:
tgt
print(tgt.shape)

print(logits.reshape(-1, logits.shape[-1]).shape)
print(tgt.reshape(-1).shape)

loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt.reshape(-1))
print(loss.item())

def evaluate(model: nn.Module, eval_data) -> float:
    model.eval()  # turn on evaluation mode
    total_loss = 0.
    num_batches = 0
    with torch.no_grad():
        for src,tgt in eval_data:
            tgt = tgt.to(DEVICE)
            logits = model(src,src_mask=None,key_padding_mask=None)
            total_loss +=  loss_fn(logits.reshape(-1, logits.shape[-1]), tgt.reshape(-1)).item()
            num_batches += 1
    # Average over the batches actually seen, without iterating the data a second time
    return total_loss / max(num_batches, 1)

evaluate(model,val_dataloader)

## Training the model
Incorporating the previously outlined steps, we proceed to train the model. Apart from these specific procedures, the overall training process conforms to the conventional methods employed in neural network training.

**Please be aware that training the model on CPUs can be time-consuming. If you don't have access to GPUs, you can skip ahead to the section `Loading the Saved Model` and load the pre-trained model using the code provided there. We have trained the model for 30 epochs and saved it for your convenience.**

The `train` function is defined to train the `CustomGPTModel` on a given training dataset. It is structured as follows:

- **Optimizer**: Initializes an Adam optimizer, together with a `StepLR` learning-rate scheduler.

Within the `train` function:

- The model is set to train mode, which enables dropout.
- A loop iterates over the training data, which is loaded in batches. For each batch:
    - The source (`src`) and target (`tgt`) sequences are extracted.
    - The model performs a forward pass to get logits.
    - The logits are reshaped for loss calculation.
    - The loss is computed using `loss_fn`, the cross-entropy loss defined earlier (with padding positions ignored), which measures the difference between the predicted logits and the target sequences.
- Gradient clipping is applied to prevent exploding gradients, which is common in training deep neural networks.
- The optimizer updates the model parameters based on the computed gradients.

Logging occurs every `10000` steps, or when reaching a specific batch (batch `42060` is hardcoded as an example). During logging:

- The average loss and the perplexity (a measure of how well the probability model predicts a sample) are calculated and printed, providing insights into the model's performance.
- The elapsed time per batch since the last log interval is measured and reported, giving an indication of training efficiency.
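
Two of the quantities above are easy to check by hand. The sketch below is a plain-Python illustration under stated assumptions, not the code used in training: `clip_by_global_norm` and `perplexity` are hypothetical helper names, and the clipping rule (one shared scale factor, with a small epsilon in the denominator) is meant to mirror the behavior of `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=0.5)
print("norm before:", norm)
print("clipped:", [round(g, 4) for g in clipped])
print("perplexity at loss log(4):", perplexity(math.log(4)))
```

Clipping keeps the direction of the gradient and only shrinks its length, which is why a single scale factor is applied to every component.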



In [None]:
optimizer = Adam(model.parameters(), lr=1e-2, weight_decay=0.01, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 10000, gamma=0.9)

def train(model: nn.Module,train_data) -> None:
    model.train()  # turn on train mode
    total_loss = 0.
    log_interval = 10000
    start_time = time.time()

    num_batches = len(list(train_data)) // block_size
    for batch,srctgt in enumerate(train_data):
        src= srctgt[0]
        tgt= srctgt[1]
        logits = model(src,src_mask=None)
        logits_flat = logits.reshape(-1, logits.shape[-1])
        loss = loss_fn(logits_flat, tgt.reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
        scheduler.step()  # advance the StepLR schedule once per batch so the learning rate actually decays
        total_loss += loss.item()

        if (batch % log_interval == 0 and batch > 0) or batch==42060:
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            #cur_loss = total_loss / log_interval
            cur_loss = total_loss / batch
            ppl = math.exp(cur_loss)
            print(f'| epoch {epoch:3d} | {batch//block_size:5d}/{num_batches:5d} batches | '
                  f'lr {lr:02.4f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            start_time = time.time()

    return total_loss

In [None]:
best_val_loss = float('inf')
epochs = 30
Train_losses= []
Val_losses = []
for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train_loss = train(model,dataloader)
    val_loss = evaluate(model, val_dataloader)
    val_ppl = math.exp(val_loss)
    Train_losses.append(train_loss)
    Val_losses.append(val_loss)

    elapsed = time.time() - epoch_start_time
    print('-' * 89)
    print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
        f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'model_best_val_loss.pt')

In [None]:
# Number of completed epochs (Train_losses and Val_losses have one entry per epoch)
num_epochs = len(Train_losses)
epoch_numbers = range(1, num_epochs + 1)  # epochs are numbered from 1 in the training loop

# Create a figure and a set of subplots
fig, ax = plt.subplots()

# Plot the training losses
ax.plot(epoch_numbers, Train_losses, label='Training Loss', color='blue')

# Plot the validation losses
ax.plot(epoch_numbers, Val_losses, label='Validation Loss', color='orange')

# Set the x-axis label
ax.set_xlabel('Epoch')

# Set the y-axis label
ax.set_ylabel('Loss')

# Set the title of the plot
ax.set_title('Training and Validation Losses')

# Add a legend to the plot
ax.legend()

# Show the plot
plt.show()

## Loading the GPT-2 model from Hugging Face
Let's now load the pretrained GPT-2 model from Hugging Face to check how it performs at text generation:


In [None]:
# Load the tokenizer and model
tokenizer1 = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Define the input prompt
#input_text = "Once upon a time in a faraway land,"
input_text = "the movie was"

# Tokenize the input text and prepare the input for the model
input_ids = tokenizer1.encode(input_text, return_tensors="pt")

# Generate text using the model
# Set the desired length of the generated text (max_length, which counts the prompt tokens too),
# and other generation parameters like temperature, top_k, and top_p
max_length = 15
temperature = 0.7
top_k = 50
top_p = 0.95

generated_ids = model.generate(
    input_ids,
    max_length=max_length,
    do_sample=True,  # enable sampling; otherwise decoding is greedy and temperature/top_k/top_p are ignored
    temperature=temperature,
    top_k=top_k,
    top_p=top_p,
    pad_token_id=tokenizer1.eos_token_id,
)

# Decode the generated text
generated_text = tokenizer1.decode(generated_ids[0], skip_special_tokens=True)

# Print the input prompt and the generated text
print(f"Input: {input_text}")
print(f"Generated Text: {generated_text}")

# Summary

1️⃣ Encoder vs. Decoder in Transformers

The Transformer model, introduced in the research paper "Attention Is All You Need", has two main components:

- Encoder: Used in models like BERT, processes the entire input at once.
- Decoder: Used in models like GPT, generates text one token at a time.

2️⃣ Key Difference Between Encoder and Decoder

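The decoder is similar to the encoder in structure, but its self-attention is masked:

Masked Multi-Head Self-Attention.

In encoder-decoder models, the decoder additionally contains an encoder-decoder (cross-) attention sub-layer.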

This masking ensures that the model cannot see future tokens when generating text.

Without masking, the model could "peek" at the future words, which would break the idea of autoregressive text generation.

3️⃣ What is Masked Multi-Head Self-Attention?

Multi-Head Attention is a mechanism that allows the model to focus on different parts of the input when predicting the next word.

Masking ensures causality:
At each step, the model can only see the past and present tokens, but not future tokens.

This is done by adding -∞ to the attention scores of future positions before the softmax, which forces the attention weights for future tokens to zero.

This prevents data leakage during training and matches how text is generated one token at a time.

🔹 Example of Masking: If the input is "The movie was amazing", the model should only attend to "The", "movie", and "was" when predicting the word that follows "was". It should not see "amazing" before generating it.
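
The masking step can be sketched in plain Python as follows. This is a simplified, single-head illustration that assumes the raw attention scores are already computed; the function name `causal_attention_weights` and the all-zero scores are placeholders chosen for the example.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get -inf, so they contribute exp(-inf) = 0
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

tokens = ["The", "movie", "was", "amazing"]
scores = [[0.0] * len(tokens) for _ in tokens]  # placeholder scores
weights = causal_attention_weights(scores)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

In the printed rows, the entry for "was" puts weight only on "The", "movie", and "was", while the column for "amazing" stays at zero.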

4️⃣ How the Decoder Works

The decoder follows these steps:

1️⃣ Takes an input query (Q):

If it's used without an encoder (like in GPT), the Query (Q), Key (K), and Value (V) all come from the decoder's own token representations.

If it's used with an encoder (like in translation models), the encoder-decoder attention takes Q from the decoder, but the Key (K) and Value (V) come from the encoder.

2️⃣ Processes Input with Multi-Head Attention:

First, it applies Masked Self-Attention to process previously generated words.
Then, it applies Encoder-Decoder Attention (if applicable, like in translation).

3️⃣ Performs Layer Normalization & Feed-Forward Processing:

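Each attention and feed-forward sub-layer is combined with a residual connection and layer normalization, and the position-wise feed-forward network transforms each token's representation independently.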

4️⃣ Predicts the Next Token:

The final output is a probability distribution over the vocabulary (e.g., GPT-2 has 50,257 tokens).

The next token is then chosen with a decoding strategy such as greedy search, nucleus sampling, or beam search.

5️⃣ Decoder-Only Models (GPT Family)

Models like GPT-2, GPT-3, GPT-4, and Gemini use only the decoder part.
These models are called autoregressive models because they generate text one token at a time.

Unlike BERT, which encodes all input at once, GPT predicts one token at a time and feeds it back as input.

🔹 Example of Autoregressive Generation:

Step 1: Start with "The movie was"

Step 2: GPT predicts "amazing"

Step 3: Now input becomes "The movie was amazing"

Step 4: GPT predicts "!", making the final output "The movie was amazing!"

The process repeats until the desired length is reached or the model emits an end-of-sequence token.
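
The feedback loop in this example can be sketched with a toy stand-in for the model. The lookup table `TOY_NEXT_TOKEN` and the function `generate_tokens` are hypothetical and exist only for this illustration; a real model would compute a probability distribution from the whole sequence instead of looking up the last token.

```python
# Hypothetical lookup table standing in for a trained language model:
# it maps the last token to the "predicted" next token.
TOY_NEXT_TOKEN = {"was": "amazing", "amazing": "!", "!": "<eos>"}

def generate_tokens(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = TOY_NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break  # stop early when the end-of-sequence token is predicted
        tokens.append(next_token)  # feed the prediction back as input
    return tokens

result = generate_tokens(["The", "movie", "was"])
print(" ".join(result))
```

Each pass through the loop corresponds to one of the steps listed above: predict one token, append it, and run the model again on the longer sequence.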

6️⃣ Embeddings & Positional Encoding

Before feeding text into the decoder, the model converts tokens into embeddings.
Positional encoding is added so that the model understands word order.
The original Transformer uses sinusoidal functions (sine and cosine) to encode position information, while GPT-2 learns its position embeddings during training.
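
For reference, the sinusoidal scheme from "Attention Is All You Need" can be sketched in plain Python as below. The function name `sinusoidal_positional_encoding` and the tiny sizes are chosen for illustration; in practice the table is built as a tensor and added to the token embeddings.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table; even dims use sine, odd dims use cosine."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair (2k, 2k+1) shares one frequency
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=3, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Because every position gets a distinct pattern of values, adding this table to the embeddings lets attention layers tell apart identical tokens at different positions.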

7️⃣ How the Model Chooses Words

After predicting probabilities for all words in the vocabulary, it selects words using different strategies:

| Method | Description |
|---|---|
| Greedy Search | Picks the word with the highest probability at each step (can be repetitive). |
| Beam Search | Explores multiple possibilities and selects the best sequence. |
| Top-K Sampling | Picks from the top K most probable words (adds randomness). |
| Top-P Sampling (Nucleus Sampling) | Chooses words from the smallest set whose probabilities sum to at least p (e.g., 95%). |
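
To show how greedy search and beam search can disagree, here is a small sketch on a made-up model. The table `TOY_MODEL`, its probabilities, and the function names are all hypothetical; the toy model conditions only on the previous token, which a real language model does not.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
TOY_MODEL = {
    "<s>":  {"nice": 0.6, "very": 0.4},
    "nice": {"film": 0.5, "day": 0.5},
    "very": {"good": 0.9, "bad": 0.1},
}

def greedy_search(start, steps):
    seq, logp = [start], 0.0
    for _ in range(steps):
        probs = TOY_MODEL[seq[-1]]
        token = max(probs, key=probs.get)  # locally most probable token
        logp += math.log(probs[token])
        seq.append(token)
    return seq[1:], math.exp(logp)

def beam_search(start, steps, beam_width):
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in TOY_MODEL[seq[-1]].items():
                candidates.append((seq + [token], logp + math.log(p)))
        # Keep only the beam_width highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_seq, best_logp = beams[0]
    return best_seq[1:], math.exp(best_logp)

greedy_seq, greedy_prob = greedy_search("<s>", steps=2)
beam_seq, beam_prob = beam_search("<s>", steps=2, beam_width=2)
print(greedy_seq, round(greedy_prob, 2))
print(beam_seq, round(beam_prob, 2))
```

Greedy search commits to "nice" because it is the best first token, while beam search keeps "very" alive and ends up with a sequence that has a higher overall probability.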

# How Text Generation Works Step-by-Step

1️⃣ Start with an input sentence: "The movie was".

2️⃣ Tokenize and embed it.

3️⃣ Pass through masked self-attention in the decoder.

4️⃣ Predict the next token (e.g., "amazing").

5️⃣ Add the predicted token back into input ("The movie was amazing").

6️⃣ Repeat steps 3-5 until reaching the desired text length.

🔹 This loop continues, generating text token by token.



1️⃣ What is Top-K Sampling?

🔹 "Keep the K most probable words and ignore the rest"

Instead of picking from all words, it only considers the top K most probable words at each step.

Example of Top-K Sampling

Imagine GPT-2 is generating text and needs to predict the next word after:
📝 "The food was"

The model gives probabilities for the next word:

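| Word | Probability |
|---|---|
| delicious | 40% |
| amazing | 30% |
| terrible | 15% |
| overpriced | 10% |
| blue | 4.89% |
| dog | 0.1% |
| spaceship | 0.01% |

🔹 Top-K with K=3 → Keep only the top 3 words:

✅ "delicious", "amazing", "terrible"

❌ "overpriced", "blue", "dog", "spaceship" (discarded)

👉 The model then samples among the top 3 words in proportion to their renormalized probabilities (about 47%, 35%, and 18%), making the text more diverse than always picking the single most likely word.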
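
A minimal sketch of this filtering step is shown below, using roughly the probabilities from the example. The function name `top_k_filter` is made up for this illustration, and "blue" is set to 4.89% here so that the toy distribution sums to 1.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

probs = {
    "delicious": 0.40, "amazing": 0.30, "terrible": 0.15, "overpriced": 0.10,
    "blue": 0.0489, "dog": 0.001, "spaceship": 0.0001,
}
filtered = top_k_filter(probs, k=3)

random.seed(0)  # fixed seed only so the example is repeatable
next_word = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered)
print(next_word)
```

The sampled word always comes from the three kept candidates, and more probable candidates are still chosen more often than less probable ones.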


2️⃣ What is Top-P (Nucleus Sampling)?

🔹 "Keep the smallest set of words whose probabilities sum to at least P"

Instead of choosing a fixed K, this method dynamically selects the top words until their combined probability reaches a threshold (e.g., 0.95 or 95%).

Example of Top-P Sampling (P = 0.95)

Using the same scenario ("The food was"), here are the probabilities of different words:

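| Word | Probability | Cumulative Probability |
|---|---|---|
| delicious | 40% | 40% |
| amazing | 30% | 70% |
| terrible | 15% | 85% |
| overpriced | 10% | 95% ✅ (Stop here) |
| blue | 4.89% | 99.89% ❌ (Ignored) |
| dog | 0.1% | 99.99% ❌ (Ignored) |
| spaceship | 0.01% | 100% ❌ (Ignored) |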

🔹 Top-P with P=0.95 → Keep words until total probability reaches 95%:

✅ "delicious", "amazing", "terrible", "overpriced"

❌ "blue", "dog", "spaceship" (discarded)

👉 Unlike Top-K, the number of words considered is not fixed!
👉 It varies depending on the probability distribution of the words.
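
The nucleus selection can be sketched as follows. The function name `top_p_filter`, the small tolerance, and the second "peaked" distribution are assumptions made for this illustration; "blue" is set to 4.89% so that the toy distribution sums to 1.

```python
def top_p_filter(probs, p, tol=1e-9):
    """Keep the smallest set of most probable tokens whose total probability is at least p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        # The tolerance guards against floating-point rounding in the running sum
        if cumulative >= p - tol:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {
    "delicious": 0.40, "amazing": 0.30, "terrible": 0.15, "overpriced": 0.10,
    "blue": 0.0489, "dog": 0.001, "spaceship": 0.0001,
}
nucleus = top_p_filter(probs, p=0.95)
print(list(nucleus))

# A sharply peaked (hypothetical) distribution keeps far fewer words
peaked = {"delicious": 0.97, "amazing": 0.02, "terrible": 0.01}
print(list(top_p_filter(peaked, p=0.95)))
```

With the first distribution four words survive, while the peaked one keeps a single word, which illustrates how the candidate set grows and shrinks with the model's confidence.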



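Which method is better?

✅ Use Top-K when you want a fixed number of candidate words at every step (e.g., K=50).

✅ Use Top-P when you want the number of candidates to adapt to how confident the model is (e.g., P=0.9).

The two can also be combined, as in the `generate` call above, which sets both `top_k=50` and `top_p=0.95`.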