# GPT Project - Train

This notebook contains the code used to train a small language model using PyTorch from scratch. The model is inspired by the GPT architecture.


#### Hardware
- RTX3060 12GB VRAM
- AMD Ryzen 7 5800X 8-Core
- 32GB RAM
- Ubuntu 22.04 LTS

In [None]:
import os
CACHE_DIR = "/media/rob/RobsDisk/cache_data_llm"
os.environ['HF_HOME'] = CACHE_DIR
os.environ['HF_DATASETS_CACHE'] = os.path.join(CACHE_DIR, "datasets")
os.environ['HF_METRICS_CACHE'] = os.path.join(CACHE_DIR, "metrics")
os.environ['HF_MODULES_CACHE'] = os.path.join(CACHE_DIR, "modules")


In [2]:
import torch, torch.nn as nn, torch.optim as optim
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import random, math

from datasets import load_dataset, concatenate_datasets
import tiktoken
from tqdm import tqdm



## Datasets

### Common knowledge datasets:

##### English Wikipedia crawled dataset

In [3]:
# English Wikipedia crawled dataset
# The dataset cache is stored under CACHE_DIR
wiki_en = load_dataset("wikimedia/wikipedia", "20231101.en", split='train', cache_dir=CACHE_DIR) 
print("English Wikipedia dataset loaded.")
print("dataset size in gb:", wiki_en.dataset_size / (1024**3))
print("Number of entries:", len(wiki_en))
print("-"*50)
print("Example entry:")
print(wiki_en[random.randint(0, len(wiki_en)-1)]['text'][:500])


English Wikipedia dataset loaded.
dataset size in gb: 18.812774107791483
Number of entries: 6407814
--------------------------------------------------
Example entry:
Kopi John (10 October 1993 – 27 August 2019) was a Papua New Guinean cricketer. In July 2018, she was named in Papua New Guinea's squad for the 2018 ICC Women's World Twenty20 Qualifier tournament. She made her Women's Twenty20 International (WT20I) for Papua New Guinea against Bangladesh in the World Twenty20 Qualifier on 7 July 2018. In April 2019, she was named in Papua New Guinea's squad for the 2019 ICC Women's Qualifier EAP tournament in Vanuatu.

John died on 27 August 2019 following a sh


##### Simple stories dataset

In [4]:
# Simple stories dataset
stories = load_dataset("SimpleStories/SimpleStories", split='train', cache_dir=CACHE_DIR)
print("Simple stories dataset loaded.")
print("dataset size in mb:", stories.dataset_size / (1024**2))
print("Number of entries:", len(stories))
print("-"*50)
print("Example entry:")
print(stories[random.randint(0, len(stories)-1)]['story'][:500])



Simple stories dataset loaded.
dataset size in mb: 3030.012650489807
Number of entries: 2115696
--------------------------------------------------
Example entry:
Down by the river, a boy named Alex loved to fish. One sunny day, he sat on the bank with his fishing pole. While waiting for a bite, he noticed something shiny in the water. He leaned in closer and saw a treasure chest! Alex couldn't believe his luck. He quickly pulled out his fishing net to catch the chest. 

After a bit of splashing, he managed to get it out. The chest was heavy, and Alex struggled to lift it. "This better be worth it," he said, panting. He found a spot to sit and examined th


##### FineWeb-Edu dataset

In [5]:
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT",  split='train', cache_dir=CACHE_DIR)

print("FineWeb-Edu is ready.")
print("dataset size in gb:", fineweb_edu.dataset_size / (1024**3))
print("Number of entries:", len(fineweb_edu))
print("-"*50)
print("Example entry:")
print(fineweb_edu[random.randint(0, len(fineweb_edu)-1)]['text'][:500]) 


FineWeb-Edu is ready.
dataset size in gb: 45.730818568728864
Number of entries: 9672101
--------------------------------------------------
Example entry:
Images of Corn (Clavus)
Corns are thickenings of the skin composed of keratin that are typically found on the toes caused by repeated friction or pressure to the area. The base of the corn is seen on the surface of the skin while the top points inward, causing discomfort.
Corns are classified as either hard or soft, depending upon their location and appearance. Hard corns typically affect the tops of the toes and are composed of a dense core that presses on sensory nerves, causing extreme pain. 


#### Some Q&A data to improve the model's ability to answer questions:

In [6]:
q_a1 = load_dataset("agentlans/text-sft-questions-answers-only", split='train', cache_dir=CACHE_DIR)
print("Q&A dataset loaded.")
print("dataset size in mb:", q_a1.dataset_size / (1024**2))
print("Number of entries:", len(q_a1))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(q_a1)-1)
print(q_a1[index]['question'][:500], "\n", q_a1[index]['answer'][:500])

Q&A dataset loaded.
dataset size in mb: 46.480509757995605
Number of entries: 120959
--------------------------------------------------
Example entry:
What is the value of exploring the intersection between sound and visual art through Chunity? 
 The combination of audio processing and 3D graphics allows developers to explore the intersection between sound and visual art, creating immersive environments where objects emit sound, instruments react visually, and scenes come alive through synchronized audio-visual elements.


In [7]:
#euclaise/reddit-instruct
reddit_instruct = load_dataset("euclaise/reddit-instruct", split='train', cache_dir=CACHE_DIR)
# reddit_instruct = load_dataset("Felladrin/ChatML-reddit-instruct-curated", split='train', cache_dir=CACHE_DIR)
print("Reddit Instruct dataset loaded.")
print("dataset size in gb:", reddit_instruct.dataset_size / (1024**3))
print("Number of entries:", len(reddit_instruct))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(reddit_instruct)-1)
print(reddit_instruct[index]['post_title'][:500], reddit_instruct[index]['post_text'][:500], "\n", reddit_instruct[index]['comment_text'][:500])

Reddit Instruct dataset loaded.
dataset size in gb: 0.09901080373674631
Number of entries: 84784
--------------------------------------------------
Example entry:
ELI5: How exactly did people kill soldiers wearing heavy armour(the type from and medieval movies among others) with only swords and arrows throughout history? Plus, how did the people wearing these armours have the stamina to fight in battles? Wouldn't they be exhausted after fighting only one person who also happened to be wearing similar armour?

Assuming all these movies accurately represent how battles and wars were.

Example of the armour I'm talking about:

http://www.medievalcollectibles.com/images/Product/large/ED6223.png

As well as other types.

 There were usually some gaps between the armor pieces. Also you could kill these soldiers with a heavy blow (such as a mace to  the head, or falling off a moving horse). Also they could become bogged down in mud and easy to clobber. Also they could just become exhausted.

However, very few soldiers actually wore heavy, thick armor. It was incredibly expensive. Incomplete armor or chain mail were more common.

And then, you could just shoot them with a crossbow. This was effective against all but

In [8]:
# tatsu-lab/alpaca (for Q&A fine-tuning)
alpaca = load_dataset("tatsu-lab/alpaca", split='train', cache_dir=CACHE_DIR)
print("Alpaca dataset loaded.")
print("dataset size in mb:", alpaca.dataset_size / (1024**2))
print("Number of entries:", len(alpaca))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(alpaca)-1)
print(alpaca[index]['instruction'][:500], "\n", alpaca[index]['output'][:500])

Alpaca dataset loaded.
dataset size in mb: 44.06797695159912
Number of entries: 52002
--------------------------------------------------
Example entry:
Given a real-world scenario, design a system to automate the task. 
 The restaurant food delivery system will allow customers to order online or through a mobile app. Customers will be able to select items from the restaurant's menu and choose the delivery option. Drivers will receive an alert once the order has been placed, and they can navigate to the restaurant to pick up the order and deliver it to the customer's address. The system must also include payment options and order tracking features.


## Data Preprocessing

#### Tokenizer setup

For this project I use tiktoken, the open-source tokenizer library that OpenAI uses for its models.

I use the "gpt2" encoding, a byte pair encoding (BPE) tokenizer with a vocabulary of 50,257 tokens.
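To make the idea of byte pair encoding concrete, here is a toy sketch of a single BPE merge step on raw bytes. It is not how tiktoken is implemented (the real "gpt2" encoding also pre-splits text with a regex and ships a fixed, pre-trained merge table); the helper names `most_frequent_pair` and `merge_pair` are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of ids that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes (ids 0..255) and create one new token, id 256.
ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)
merged = merge_pair(ids, pair, 256)

print("most frequent pair:", pair)
print("before:", len(ids), "ids, after:", len(merged), "ids")
```

Repeating this step many times over a large corpus is what produces a table of merges; an encoding such as "gpt2" is the frozen result of that process.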

In [9]:
tokenizer_base = tiktoken.get_encoding("gpt2")

tokenizer = tiktoken.Encoding(
    name="rob-tokenizer",
    pat_str=tokenizer_base._pat_str,
    mergeable_ranks=tokenizer_base._mergeable_ranks,
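    special_tokens={
        **tokenizer_base._special_tokens,  # keeps "<|endoftext|>" (id 50256)
        # Chat-style delimiters. These ids lie above the 50257-entry GPT-2 vocabulary,
        # so the model's embedding table must cover ids up to 100265.
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
        # Caution: id 0 is already the rank of "!" in the gpt2 encoding,
        # so padding and "!" share an id here.
        "<|pad|>": 0,
    }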
)
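One way to keep special-token ids from colliding with the base vocabulary is to hand out the next free ids after it. The sketch below is an alternative to the ids chosen above, not the notebook's method; `extend_special_tokens` is a hypothetical helper, and the figures passed in are the GPT-2 ones (50257 entries, `<|endoftext|>` at id 50256).

```python
def extend_special_tokens(n_base_vocab, base_special, new_names):
    """Give each new special token the next free id after the existing vocabulary."""
    special = dict(base_special)
    next_id = max([n_base_vocab - 1, *special.values()]) + 1
    for name in new_names:
        if name not in special:
            special[name] = next_id
            next_id += 1
    return special

# GPT-2 figures: 50257 entries in total, with <|endoftext|> at id 50256.
special_tokens = extend_special_tokens(
    n_base_vocab=50257,
    base_special={"<|endoftext|>": 50256},
    new_names=["<|im_start|>", "<|im_end|>", "<|pad|>"],
)

# The embedding table needs one row per id, so vocab_size is the largest id + 1.
vocab_size = max(special_tokens.values()) + 1

print(special_tokens)
print("vocab_size:", vocab_size)
```

With contiguous ids the embedding table only grows by three rows, and the padding id no longer overlaps an ordinary token.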

#### Test of the byte pair encoding tokenizer 

In [10]:
# test of tokenizer on reddit_instruct
sample_text = reddit_instruct[0]['post_title'] + " " + reddit_instruct[0]['post_text'] + " " + reddit_instruct[0]['comment_text']
tokens = tokenizer.encode(sample_text)
print(tokens)
print("Decoded text:")
print(tokenizer.decode(tokens)) 
print(f"Sample text length in characters: {len(sample_text)}")
print(f"Sample text length in tokens: {len(tokens)}")   

[2061, 318, 24207, 1616, 2587, 30, 314, 2342, 257, 7684, 286, 1097, 5861, 290, 484, 1561, 546, 275, 32512, 7021, 290, 884, 11, 1312, 373, 11263, 644, 275, 32512, 318, 290, 1312, 18548, 1064, 597, 2562, 7468, 284, 644, 340, 318, 24207, 1616, 318, 655, 262, 1438, 329, 257, 16058, 286, 6147, 13, 554, 262, 29393, 995, 340, 338, 1690, 973, 355, 257, 1790, 1021, 329, 3354, 326, 547, 3235, 1389, 503, 286, 257, 1263, 2512, 286, 2587, 11, 355, 6886, 284, 11721, 3350, 654, 810, 44030, 6147, 318, 19036, 656, 257, 15936, 12070, 503, 286, 9629, 6147, 13, 7080, 3191, 318, 517, 5789, 329, 1588, 17794, 475, 340, 460, 779, 1365, 3081, 286, 21782, 290, 318, 4577, 284, 787, 329, 4833, 17794, 588, 3234, 3354, 13]
Decoded text:
What is Billet material? I watch a bunch of car videos and they talk about billet blocks and such, i was wondering what billet is and i cant find any easy explanation to what it is Billet is just the name for a chunk of metal. In the automotive world it's often used as a short hand 

### Dataset formatting functions

#### Merging datasets

In [None]:
from datasets import concatenate_datasets
combined_train_dataset = concatenate_datasets([  
    wiki_en,
    stories,
    fineweb_edu,
])  

combined_finetune_dataset = concatenate_datasets([
    q_a1,
    reddit_instruct,
    alpaca,
])

# Shuffle the combined dataset
train_dataset = combined_train_dataset.shuffle(seed=42)
finetune_dataset = combined_finetune_dataset.shuffle(seed=42)
print(f"Train dataset size: {len(combined_train_dataset)}")
print(f"Finetune dataset size: {len(combined_finetune_dataset)}")

# Example

print("Example entry from train dataset:")
index = random.randint(0, len(train_dataset)-1)
print(train_dataset[index])   

Train dataset size: 18195611
Finetune dataset size: 257745
Example entry from train dataset:
{'id': '<urn:uuid:b7562f79-9e71-4c40-890c-8a24f0133759>', 'url': 'http://unitedexplanations.org/english/2016/03/08/migrant-working-women-the-main-victims-of-job-insecurity/', 'title': None, 'text': 'Originally published in Spanish here.\nIn Spain, a high percentage of female migrant workers face precarious situations of informal labor and, often, undergo the consequences of poorly or unregulated domestic work. This situation exacerbates gender inequalities, not only for native female workers who have yet to see a real redistribution of domestic tasks, but also for the female migrant workers, who become increasingly vulnerable because of the lack of efficient migration policies including an integral gender approach.\nMigrant women globally: Triple discrimination through gender, nationality and social class\nAccording to recent United Nations data, women represent approximately half of the 200 mi

#### Custom Dataset class

Inspired by the dataloader from the "LLMs from scratch" repository, but adapted for multi-row text arrays.

https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/dataloader.ipynb

In [None]:
class GPTDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=512):
        """
        Args:
            dataset: Combined Hugging Face dataset to draw entries from
            tokenizer: Initialized tokenizer used to encode the text
            max_length: Context window size, in tokens
        """
        self.data = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

        self.input_tokens = []
        self.target_tokens = []

        self.pad_token_id = 0         # <|pad|>
        self.bos_token_id = 100264    # <|im_start|>
        self.eos_token_id = 100265    # <|im_end|>

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Raw text, selected according to the entry's format.
        # The merged dataset fills columns that a source lacks with None
        # (see 'title': None in the example entry above), so test the values
        # rather than the mere presence of a key.
        entry = self.data[idx]
        if entry.get('text'):
            text = entry['text']
        elif entry.get('story'):
            text = entry['story']
        elif entry.get('question') and entry.get('answer'):
            text = "User: " + entry['question'] + " Assistant:" + entry['answer']
        elif entry.get('post_title') and entry.get('comment_text'):
            # The post body belongs to the user's turn; the comment is the reply.
            question = entry['post_title'] + " " + (entry.get('post_text') or "")
            text = "User: " + question + " Assistant:" + entry['comment_text']
        elif entry.get('instruction') and entry.get('output'):
            text = "User: " + entry['instruction'] + " Assistant:" + entry['output']
        else:
            raise ValueError("Unknown data entry format")
        

        # Adding Start and End tokens
        text = "<|im_start|>" + text + "<|im_end|>" 

        # Tokenization
        tokens = self.tokenizer.encode(text, allowed_special="all")

        # Truncation and padding
        if len(tokens) > self.max_length:
            tokens = tokens[:self.max_length]
            tokens[-1] = self.eos_token_id  # ensure last token is eos

        input_ids = torch.tensor(tokens, dtype=torch.long)
        attention_mask = torch.ones_like(input_ids)


        # Padding
        padding_length = self.max_length - input_ids.size(0)
        if padding_length > 0:
            # Create padding tensors
            pad_ids = torch.full((padding_length,), self.pad_token_id, dtype=torch.long)
            pad_mask = torch.zeros((padding_length,), dtype=torch.long) # 0 = ignore
            
            # Concatenate
            input_ids = torch.cat([input_ids, pad_ids])
            attention_mask = torch.cat([attention_mask, pad_mask])


        labels = input_ids.clone()

        # Set the ignore index (-100) on padding positions so they do not count toward the loss
        if padding_length > 0:
            # We know the padding is at the end
            labels[-padding_length:] = -100

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }
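The class returns `labels` as a copy of `input_ids`, so the one-position shift needed for next-token prediction has to happen somewhere else. The notebook does not show where; the sketch below assumes it is done in the training loop. It uses plain lists and made-up token ids (1 = start marker, 2 = end marker, 0 = padding), and `shift_for_next_token` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # same value the dataset writes into padded label positions

def shift_for_next_token(input_ids, labels):
    """Pair each position t with the token at position t + 1 as its target."""
    return input_ids[:-1], labels[1:]

# Hypothetical ids: 1 = start marker, 2 = end marker, 0 = padding.
input_ids = [1, 11, 12, 13, 2, 0, 0, 0]
labels = [1, 11, 12, 13, 2, IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX]

model_inputs, targets = shift_for_next_token(input_ids, labels)

# Positions whose target is IGNORE_INDEX are skipped by the loss.
supervised = [t for t in targets if t != IGNORE_INDEX]

print(model_inputs)
print(targets)
print(len(supervised), "positions contribute to the loss")
```

In PyTorch, `torch.nn.functional.cross_entropy` skips targets equal to its `ignore_index` argument, whose default is -100, which is why that value is used for the padded positions.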


### DataLoader setup

In [13]:
# Wrap the shuffled Hugging Face dataset in the GPTDataset class defined above
train_data = GPTDataset(train_dataset, tokenizer, max_length=512)
print(f"Train dataset size: {len(train_data)}")

batch_size = 16
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

### GPT config 

In [None]:
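GPT_CONFIG_124M = {
    "vocab_size": 50257,     # GPT-2 BPE vocabulary size; must exceed the largest token id the tokenizer can emit
    "context_length": 512,   # Maximum sequence length, matching max_length in the dataset
    "emb_dim": 768,          # Embedding (hidden) dimension
    "n_heads": 12,           # Attention heads per layer
    "n_layers": 12,          # Number of transformer blocks
    "drop_rate": 0.1,        # Dropout probability
    "qkv_bias": False        # No bias term in the query/key/value projections
}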
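As a sanity check on the "124M" in the config name, the parameter count can be estimated from the config alone. The model code is not shown in this notebook, so the sketch assumes a standard GPT-2-style block: learned positional embeddings, an MLP with a 4x hidden size, biases on the output projection and MLP layers, two LayerNorms per block plus a final one, and an output head tied to the token embedding. `estimate_gpt_params` is a hypothetical helper.

```python
def estimate_gpt_params(cfg, tie_embeddings=True):
    """Rough parameter count for a GPT-2-style decoder (assumed layout, see lead-in)."""
    d, n_layers = cfg["emb_dim"], cfg["n_layers"]

    tok_emb = cfg["vocab_size"] * d          # token embedding table
    pos_emb = cfg["context_length"] * d      # learned positional embeddings

    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)   # query/key/value projections
    attn_out = d * d + d                                  # attention output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)           # two linear layers, 4x expansion
    norms = 2 * 2 * d                                     # two LayerNorms (scale + shift)
    per_layer = qkv + attn_out + mlp + norms

    final_norm = 2 * d
    out_head = 0 if tie_embeddings else cfg["vocab_size"] * d
    return tok_emb + pos_emb + n_layers * per_layer + final_norm + out_head

cfg = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 12,   # does not change the count: heads split emb_dim, they do not add to it
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

tied = estimate_gpt_params(cfg, tie_embeddings=True)
untied = estimate_gpt_params(cfg, tie_embeddings=False)
print(f"tied output head:   {tied / 1e6:.1f}M parameters")
print(f"untied output head: {untied / 1e6:.1f}M parameters")
```

Under these assumptions the tied variant lands close to 124M; without weight tying the separate output head adds another `vocab_size * emb_dim` parameters.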

## Sources 

### Principal references: 
- https://arxiv.org/abs/2005.14165 (GPT-3 paper)
- https://arxiv.org/abs/1706.03762 (Attention is all you need paper)
- Build a Large Language Model (from scratch) by Sebastian Raschka

### About Padding tokens in Language Modeling
- https://arxiv.org/html/2510.01238v1 