# Projet GPT - Train

This notebook contains the code used to train a small language model using PyTorch from scratch. The model is inspired by the GPT architecture.


#### Hardware
- RTX3060 12GB VRAM
- AMD Ryzen 7 5800X 8-Core
- 32GB RAM
- Ubuntu 22.04 LTS

In [7]:
import os
CACHE_DIR = "/media/rob/RobsDisk/cache_data_llm"
os.environ['HF_HOME'] = CACHE_DIR
os.environ['HF_DATASETS_CACHE'] = os.path.join(CACHE_DIR, "datasets")
os.environ['HF_METRICS_CACHE'] = os.path.join(CACHE_DIR, "metrics")
os.environ['HF_MODULES_CACHE'] = os.path.join(CACHE_DIR, "modules")


In [8]:
import torch, torch.nn as nn, torch.optim as optim
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import random, math

from datasets import load_dataset,concatenate_datasets
import tiktoken
from tqdm import tqdm

## Datasets

### Common knowledge datasets:

##### English Wikipedia crawled dataset

In [9]:
# English Wikipedia crawled dataset
# path to store the dataset cache: /Volumes/RobertsDisk
wiki_en = load_dataset("wikimedia/wikipedia", "20231101.en", split='train', cache_dir=CACHE_DIR) 
print("English Wikipedia dataset loaded.")
print("dataset size in gb:", wiki_en.dataset_size / (1024**3))
print("Number of entries:", len(wiki_en))
print("-"*50)
print("Example entry:")
print(wiki_en[random.randint(0, len(wiki_en)-1)]['text'])


English Wikipedia dataset loaded.
dataset size in gb: 18.812774107791483
Number of entries: 6407814
--------------------------------------------------
Example entry:
Xestia laxa is a species of cutworm or dart moth in the family Noctuidae. It was described by J. Donald Lafontaine and Kauri Mikkola in 1998 and is found in North America.

The MONA or Hodges number for Xestia laxa is 10963.1.

References

 Crabo, L.; Davis, M.; Hammond, P.; Mustelin, T. & Shepard, J. (2013). "Five new species and three new subspecies of Erebidae and Noctuidae (Insecta, Lepidoptera) from Northwestern North America, with notes on Chytolita Grote (Erebidae) and Hydraecia Guenée (Noctuidae)". ZooKeys. 264: 85-123.
 Lafontaine, D. & Troubridge, J. (2010). "Two new species of the Euxoa westermanni species-group from Canada (Lepidoptera, Noctuidae, Noctuinae)". ZooKeys. 39: 255-262.
 Lafontaine, J. Donald & Schmidt, B. Christian (2010). "Annotated check list of the Noctuoidea (Insecta, Lepidoptera) of North Amer

#### Simple stories dataset

In [10]:
# Simple stories dataset
stories = load_dataset("SimpleStories/SimpleStories", split='train', cache_dir=CACHE_DIR)
print("Simple stories dataset loaded.")
print("dataset size in mb:", stories.dataset_size / (1024**2))
print("Number of entries:", len(stories))
print("-"*50)
print("Example entry:")
print(stories[random.randint(0, len(stories)-1)]['story'])

Simple stories dataset loaded.
dataset size in mb: 3030.012650489807
Number of entries: 2115696
--------------------------------------------------
Example entry:
Gentle winds blew as the summer festival began in the village. Emmanuel wanted to win the kite-flying contest. He worked all week to make the best kite ever. He chose bright colors and added a long tail. But just before the contest, he realized he lost the special tail that made the kite fly better!

Feeling worried, Emmanuel looked everywhere but couldn't find it. Then, he spotted his sister, Lena, making her own kite. She saw him looking sad and asked what was wrong. Emmanuel explained about the missing tail. Lena smiled and said, "You can use my kite's tail!" 

With his sister's help, Emmanuel fixed his kite. The sun shined brightly as the contest started. Each kite flew high in the sky, dancing with the wind. Emmanuel held his breath as he launched his kite. To his surprise, it flew even higher than he imagined! 

The crow

##### FineWeb-Edu dataset

In [11]:
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT",  split='train', cache_dir=CACHE_DIR)

print("FineWeb-Edu is ready.")
print("dataset size in gb:", fineweb_edu.dataset_size / (1024**3))
print("Number of entries:", len(fineweb_edu))
print("-"*50)
print("Example entry:")
print(fineweb_edu[random.randint(0, len(fineweb_edu)-1)]['text']) 


FineWeb-Edu is ready.
dataset size in gb: 45.730818568728864
Number of entries: 9672101
--------------------------------------------------
Example entry:
When Can I Wean my Baby onto Solids?
Weaning is the process of introducing solids to your little one (not to be confused with switching from breastfeeding to bottle feeding, which is for some reason, also called weaning!) The NHS guideline for when you should wean your baby onto solids is when he is 6 months old. Babies younger than this don't have digestive systems which have developed enough to cope with solids.
More Milk, Less Weaning
Some parents notice their baby still seems hungry after milk feeds and worry that milk isn't enough for them. If this is the case with your baby, give him extra milk feeds. Don't be tempted to wean early. Pre-2003, official guidelines advised parents to wean between 4-6 months old. However, this advice was changed to 6 months after the World Health Organisation and the Department of Health conducted r

In [12]:
# OpenWebText2 dataset
owt2 = load_dataset("Skylion007/openwebtext", split="train", cache_dir=CACHE_DIR)
print("OpenWebText2 dataset loaded.")
print("Dataset size in GB:", owt2.dataset_size / (1024**3))
print("Number of entries:", len(owt2))
print("-"*50)
print("Example entry:")
print(owt2[random.randint(0, len(owt2)-1)]['text'])

OpenWebText2 dataset loaded.
Dataset size in GB: 37.03822539001703
Number of entries: 8013769
--------------------------------------------------
Example entry:
A pro-Trump boat in Vallejo, California, on Wednesday. Stephen Lam/Reuters

Last week Mother Jones reported that the Trump campaign had nominated a prominent California white supremacist to be a national convention delegate. (That individual, William Johnson, has since resigned his spot.) On Thursday, MoJo has another, equally disturbing story: A Maryland Trump delegate was indicted Tuesday on weapons and child pornography charges. From the magazine:

The federal indictment alleges that Caleb Andrew Bailey, 30, of Waldorf, Maryland, illegally mailed a cache of ammunition and explosives through the US Postal service, and illegally possessed a machine gun and child pornography. The indictment further alleges that Bailey “attempted to use and did use a minor to engage in sexually explicit conduct to produce child pornography.”

The

#### Some Q&A data to improve the model's ability to answer questions:

In [13]:
q_a1 = load_dataset("agentlans/text-sft-questions-answers-only", split='train', cache_dir=CACHE_DIR)
print("Q&A dataset loaded.")
print("dataset size in mb:", q_a1.dataset_size / (1024**2))
print("Number of entries:", len(q_a1))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(q_a1)-1)
print(q_a1[index]['question'][:500], "\n", q_a1[index]['answer'])

Q&A dataset loaded.
dataset size in mb: 46.480509757995605
Number of entries: 120959
--------------------------------------------------
Example entry:
Who initiated the proposal for a memorial to Spalding's war dead, and how did they approach it? 
 Barbara McLaren, the wife of Francis McLaren, who was killed in a flying accident during WWI, initated the proposal for a memorial in January 1918. She engaged Sir Edwin Lutyens through a family connection to design the memorial.


In [14]:
#euclaise/reddit-instruct
reddit_instruct = load_dataset("euclaise/reddit-instruct", split='train', cache_dir=CACHE_DIR)
# reddit_instruct = load_dataset("Felladrin/ChatML-reddit-instruct-curated", split='train', cache_dir=CACHE_DIR)
print("Reddit Instruct dataset loaded.")
print("dataset size in gb:", reddit_instruct.dataset_size / (1024**3))
print("Number of entries:", len(reddit_instruct))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(reddit_instruct)-1)
print(reddit_instruct[index]['post_title'][:500], reddit_instruct[index]['post_text'][:500]), "\n", reddit_instruct[index]['comment_text'][:500]

Reddit Instruct dataset loaded.
dataset size in gb: 0.09901080373674631
Number of entries: 84784
--------------------------------------------------
Example entry:
ELI5: How is it possible that athletic results keep getting better and better; will they ever plateau? It seems like every Olympics, World Championship, etc, at least one record is broken, usually more. It's to the point now where gold-medal-winning 100m sprinters of a few decades ago would barely even rank in the top 10 today.

I know some of this is down to better training and nutrition, better equipment (more so in some sports than others), but surely there must be some limit that results from the human body? Will we ever hit a point where we have seen basically the fastest a person can run, t


(None,
 '\n',
 'Theoretically, the human body has finite limits that it cannot exceed. Bones can take only so much force without breaking, the human circulatory system has limits due to size, etc... However, there can always be outliers that are genetically better suited than the average human for certain sports. So in other words, we could plateau, but someone better could always be born to break the record.')

In [15]:
# tatsu-lab/alpaca ( for Q&A fine-tuning )
alpaca = load_dataset("tatsu-lab/alpaca", split='train')
print("Alpaca dataset loaded.")
print("dataset size in mb:", alpaca.dataset_size / (1024**2))
print("Number of entries:", len(alpaca))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(alpaca)-1)
print(alpaca[index]['instruction'][:500], "\n", alpaca[index]['output'][:500])

Alpaca dataset loaded.
dataset size in mb: 44.06797695159912
Number of entries: 52002
--------------------------------------------------
Example entry:
Find the second root of 9. 
 The second root of 9 is -3.


## Data Preprocessing

#### Tokenizer setup

For this project i use tiktoken for the tokenizer, as it is the same tokenizer used by OpenAI for their models.

I use the "gpt2" encoding which is a byte pair encoding (BPE) tokenizer.

In [16]:
tokenizer_base = tiktoken.get_encoding("gpt2")

tokenizer = tiktoken.Encoding(
    name="rob-tokenizer",
    pat_str=tokenizer_base._pat_str,
    mergeable_ranks=tokenizer_base._mergeable_ranks,
    special_tokens={
        **tokenizer_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
        "<|pad|>": 0,
    }
)

#### Test of the byte pair encoding tokenizer 

In [17]:
# test of tokenizer on reddit_instruct
sample_text = reddit_instruct[0]['post_title'] + " " + reddit_instruct[0]['post_text'] + " " + reddit_instruct[0]['comment_text']
tokens = tokenizer.encode(sample_text)
print(tokens)
print("Decoded text:")
print(tokenizer.decode(tokens)) 
print(f"Sample text length in characters: {len(sample_text)}")
print(f"Sample text length in tokens: {len(tokens)}")   

[2061, 318, 24207, 1616, 2587, 30, 314, 2342, 257, 7684, 286, 1097, 5861, 290, 484, 1561, 546, 275, 32512, 7021, 290, 884, 11, 1312, 373, 11263, 644, 275, 32512, 318, 290, 1312, 18548, 1064, 597, 2562, 7468, 284, 644, 340, 318, 24207, 1616, 318, 655, 262, 1438, 329, 257, 16058, 286, 6147, 13, 554, 262, 29393, 995, 340, 338, 1690, 973, 355, 257, 1790, 1021, 329, 3354, 326, 547, 3235, 1389, 503, 286, 257, 1263, 2512, 286, 2587, 11, 355, 6886, 284, 11721, 3350, 654, 810, 44030, 6147, 318, 19036, 656, 257, 15936, 12070, 503, 286, 9629, 6147, 13, 7080, 3191, 318, 517, 5789, 329, 1588, 17794, 475, 340, 460, 779, 1365, 3081, 286, 21782, 290, 318, 4577, 284, 787, 329, 4833, 17794, 588, 3234, 3354, 13]
Decoded text:
What is Billet material? I watch a bunch of car videos and they talk about billet blocks and such, i was wondering what billet is and i cant find any easy explanation to what it is Billet is just the name for a chunk of metal. In the automotive world it's often used as a short hand 

### Formatting datasets functions

#### Merging datasets

In [18]:
from datasets import concatenate_datasets
combined_train_dataset = concatenate_datasets([  
    wiki_en,
    stories,
    fineweb_edu,
    owt2,  
])  

combined_finetune_dataset = concatenate_datasets([
    q_a1,
    reddit_instruct,
    alpaca,
])

# Shuffle the combined dataset
train_dataset = combined_train_dataset.shuffle(seed=42)
finetune_dataset = combined_finetune_dataset.shuffle(seed=42)
print(f"Train dataset size: {len(combined_train_dataset)}")
print(f"Finetune dataset size: {len(combined_finetune_dataset)}")

# Exemple 

print("Example entry from train dataset:")
index = random.randint(0, len(train_dataset)-1)
print(train_dataset[index])   

Train dataset size: 26209380
Finetune dataset size: 257745
Example entry from train dataset:
{'id': '<urn:uuid:ce36fb03-de82-4bbe-8149-452235797548>', 'url': 'https://nyfamilydentalcare.com/best-dental-implants-clinic-near-flatbush-ny-11210-tel-718-630-1030/', 'title': None, 'text': 'A root canal is the naturally occurring structural space within the root of a tooth. It includes the pulp chamber (within the coronal component of the tooth), the primary canal(s), and also much more intricate anatomical branches that may connect the origin canals to every other or to the surface area of the origin.\nAt the facility of every tooth is a hollow area that houses soft tissues, such as the nerve, blood vessels, as well as connective cells. This hollow location includes a fairly wide area in the coronal portion of the tooth called the pulp chamber. These canals run with the center of the origins, comparable to the way pencil lead runs via a pencil. The pulp gets nourishment with the blood vessel

#### Custom Dataset class

Inspired by the dataloader from the "LLMs from scratch" repository. But adapted for multi-row text arrays.

https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/dataloader.ipynb

In [19]:
class GPTDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=512):
        """
        Args:
            dataset: Dataset of the combined hugginface entries
            tokenizer: the initiatokenizer to process text
            max_length: Context window size
        """
        self.data = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

        self.input_tokens = []
        self.target_tokens = []

        self.pad_token_id = 0         # <|pad|>
        self.bos_token_id = 100264    # <|im_start|>
        self.eos_token_id = 100265    # <|im_end|>

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Raw text
        
        # Data format handling
        entry = self.data[idx]
        if 'text' in entry:
            text = entry['text']
        elif 'story' in entry:
            text = entry['story']
        elif 'question' in entry and 'answer' in entry: 
            text = "User: " + entry['question'] + " Assistant:" + entry['answer']
        elif 'post_title' in entry and 'post_text' in entry and 'comment_text' in entry:
            text = "User: " + entry['post_title'] + " Assistant:" + entry['post_text'] + " " + entry['comment_text']
        elif 'instruction' in entry and 'output' in entry:
            text = "User: " + entry['instruction'] + " Assistant:" + entry['output']
        else:
            raise ValueError("Unknown data entry format")
        
        text = str(text) # Ensure text is a string
        #print(text)

        # Adding Start and End tokens
        text = "<|im_start|>" + text + "<|im_end|>" 

        # Tokenization
        tokens = self.tokenizer.encode(text, allowed_special="all")

        # Truncation
        tokens = tokens[:self.max_length] #Data is loost here ( fix later with sliding window )

        input_ids = torch.tensor(tokens[:-1], dtype=torch.long)  # All tokens except last
        labels = torch.tensor(tokens[1:], dtype=torch.long)      # All tokens except first


        #Padding 
        padding_length = self.max_length - len(tokens)
        if padding_length > 0:
            input_ids = torch.cat([input_ids, torch.full((padding_length,), self.pad_token_id)])
            labels = torch.cat([labels, torch.full((padding_length,), -100)])


        attention_mask = (input_ids != self.pad_token_id).long() # 1 for real tokens, 0 for padding

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

### DataLoader setup

In [20]:
train_dataset = GPTDataset(combined_train_dataset, tokenizer, max_length=512)
print(f"Train dataset size: {len(train_dataset)}")

batch_size = 16
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

Train dataset size: 26209380


##### Test of a entry from dataloader

In [21]:
print("##### Test of a entry from dataloader")
batch = next(iter(train_dataloader))
print(batch)    

##### Test of a entry from dataloader
{'input_ids': tensor([[100264,  14202,      0,  ...,      0,      0,      0],
        [100264,  28065,     13,  ...,     11,   4453,     11],
        [100264,    818,    674,  ...,     11,   1301,   7546],
        ...,
        [100264,     32,   9815,  ...,   1002,    612,    318],
        [100264,  21816,   2579,  ...,      0,      0,      0],
        [100264,     35,    672,  ...,      0,      0,      0]]), 'attention_mask': tensor([[1, 1, 0,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[ 14202, 100265,   -100,  ...,   -100,   -100,   -100],
        [ 28065,     13,   5436,  ...,   4453,     11,    318],
        [   818,    674,   2457,  ...,   1301,   7546,    287],
        ...,
        [    32,   9815,   1351,  ...,    612,    318,    257],
        [ 21816,   2579,     11, 

## GPT Model 

### GPT config 

This is the configuration for the GPT model i am going to train. It is a smaller version of the GPT-2 model. 

- Context length: 512 tokens
- Embedding dimension: 512
- Number of attention heads: 8
- Number of layers: 8

In [30]:
GPT_CONFIG = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 512,
    "number_heads": 8,
    "number_layers": 8,
    "drop_rate": 0.1,
}

#### Pytroch model implementation

For this first implementation, i am using the transformer and embedding modules from PyTorch. Later i will try to implement the attention mechanism from scratch for better understanding.

https://docs.pytorch.org/docs/stable/generated/torch.nn.Transformer.html

In [52]:
class GPTModel(nn.Module):
    """
    Gpt model class using transformer library
    """
    def __init__(self, config):
        super().__init__()
        self.config = config

        # Network components 
        ## Embedding layers
        self.embedding = nn.Embedding(config['vocab_size'], config['emb_dim'])
        self.positional_encoding = nn.Embedding(config['context_length'], config['emb_dim'])
        ## Transformer
        self.transformer = nn.Transformer(
            d_model=config['emb_dim'],
            nhead=config['number_heads'],
            num_encoder_layers=0,
            num_decoder_layers=config['number_layers'],
            dim_feedforward=4*config['emb_dim'],
            dropout=config['drop_rate'],
            activation='gelu',
            batch_first=True, # outputs are in (batch, seq, feature) format
        )
        ## Output layer
        self.output_layer = nn.Linear(config['emb_dim'], config['vocab_size'], bias=False)

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)



    def forward(self, input_ids, attention_mask, label_ids=None):   
        batch_size, seq_length = input_ids.shape

        # Embedding
        token_embeddings = self.embedding(input_ids)  
        pos_ids = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)

        position_embeddings = self.positional_encoding(pos_ids)  # (batch_size, seq_length, emb_dim)

        embeddings = token_embeddings + position_embeddings  # (batch_size, seq_length, emb_dim)

        # Prevent attending to future tokens
        causal_mask = torch.triu(torch.full((seq_length, seq_length), float('-inf')), diagonal=1).to(input_ids.device)

        # Combining with the attention mask for padding tokens 
        if attention_mask is not None:
            attention_mask = (attention_mask == 0).float() * -1e9  # (batch_size, seq_length)
            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, seq_length)
            combined_mask = causal_mask.unsqueeze(0) + attention_mask
        else:
            combined_mask = causal_mask.unsqueeze(0)

        embeddings = embeddings.permute(1, 0, 2)  # (seq_length, batch_size, emb_dim)
        x = self.transformer(embeddings, mask=combined_mask)  # (batch_size, seq_length, emb_dim)
        x = x.permute(1, 0, 2)  # (batch_size, seq_length, emb_dim)
        logits = self.output_layer(x)

        # Computing loss 
        if label_ids is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
            loss = loss_fct(logits.view(-1, self.config['vocab_size']), label_ids.view(-1)) # Applies loss to predictions
            return logits, loss

        return logits, None




### Model instantiation 

In [53]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GPTModel(GPT_CONFIG).to(device)
print(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4) # AdamW optimizer
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10) # Cosine annealing learning rate scheduler

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


## Training setup

In [48]:
def inference(model, tokenizer, prompt, max_length=512, device='cpu'):
    model.eval()
    input_ids = tokenizer.encode(prompt, allowed_special="all")
    input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0).to(device)  # (1, seq_length)

    generated_ids = input_ids
    max_length = max_length - input_ids.shape[1]  # Remaining length for generation
    with torch.no_grad():
        for _ in range(max_length):
            attention_mask = torch.ones_like(generated_ids)  # All tokens are real (no padding)
            logits, _ = model(generated_ids, attention_mask)
            next_token_logits = logits[:, -1, :]  # (1, vocab_size)
            next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(0)  # (1, 1)

            generated_ids = torch.cat([generated_ids, next_token_id], dim=-1)  # Append to sequence

            if next_token_id.item() == 100265:  # Stop if <|im_end|> token is generated
                break

    generated_text = tokenizer.decode(generated_ids.squeeze().tolist())
    return generated_text


In [51]:
def train_loop(model, dataloader, optimizer, scheduler, device, num_epochs=3, question_interval=500):
    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        progress_bar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
        step_count = 0

        for batch in progress_bar:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            optimizer.zero_grad()
            logits, loss = model(input_ids, attention_mask, labels)
            loss.backward()
            optimizer.step()
            scheduler.step()  # Step scheduler every batch

            epoch_loss += loss.item()
            progress_bar.set_postfix(loss=loss.item())

            # Inference check
            step_count += 1
            if step_count % question_interval == 0:
                model.eval()
                with torch.no_grad():
                    prompt = "You are an AI being trained. How are you doing? Please answer briefly."
                    generated_text = inference(model, tokenizer, prompt, max_length=50, device=device)
                    print(f"\n[Inference at step {step_count}]: {generated_text}\n")
                model.train()

        avg_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch+1} completed. Average Loss: {avg_loss:.4f}")


### First training on combined train dataset

In [38]:
train_loop(model, train_dataloader, optimizer, scheduler, device, num_epochs=3)

Epoch 1/3:   0%|          | 0/1638087 [00:01<?, ?it/s]


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


## Sources 

### Principal references: 
- https://arxiv.org/abs/2005.14165 (GPT-3 paper)
- https://arxiv.org/abs/2002.05709 (Attention is all you need paper)
- Build a Large Language Model (from scratch) by Sebastian Raschka

### About Padding tokens in Language Modeling
- https://arxiv.org/html/2510.01238v1 