# Project GPT - Train

This notebook contains the code used to train a small language model from scratch with PyTorch. The model is inspired by the GPT architecture.


#### Hardware
- RTX3060 12GB VRAM
- AMD Ryzen 7 5800X 8-Core
- 32GB RAM
- Ubuntu 22.04 LTS

In [1]:
import os
CACHE_DIR = "/media/rob/RobsDisk/cache_data_llm"
os.environ['HF_HOME'] = CACHE_DIR
os.environ['HF_DATASETS_CACHE'] = os.path.join(CACHE_DIR, "datasets")
os.environ['HF_METRICS_CACHE'] = os.path.join(CACHE_DIR, "metrics")
os.environ['HF_MODULES_CACHE'] = os.path.join(CACHE_DIR, "modules")


In [2]:
import torch, torch.nn as nn, torch.optim as optim
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import random, math

from datasets import load_dataset,concatenate_datasets
import tiktoken
from tqdm import tqdm



## Datasets

### Common knowledge datasets:

##### English Wikipedia crawled dataset

In [3]:
# English Wikipedia crawled dataset
# the dataset cache is stored under CACHE_DIR
wiki_en = load_dataset("wikimedia/wikipedia", "20231101.en", split='train', cache_dir=CACHE_DIR) 
print("English Wikipedia dataset loaded.")
print("dataset size in gb:", wiki_en.dataset_size / (1024**3))
print("Number of entries:", len(wiki_en))
print("-"*50)
print("Example entry:")
print(wiki_en[random.randint(0, len(wiki_en)-1)]['text'])


English Wikipedia dataset loaded.
dataset size in gb: 18.812774107791483
Number of entries: 6407814
--------------------------------------------------
Example entry:
Michael Edward Krukow (born January 21, 1952), nicknamed "Kruk", is an American former professional baseball player and sportscaster. As a starting pitcher, he played in Major League Baseball (MLB) for the Chicago Cubs, Philadelphia Phillies, and San Francisco Giants. He has been a television and radio broadcaster for the Giants since 1990, and is one half of the popular "Kruk and Kuip" duo, alongside his friend and former teammate Duane Kuiper. He was an All-Star in 1986.

Early life
Krukow was born in Long Beach, California, and attended San Gabriel High School in San Gabriel, California, where he played as a catcher. Krukow was a fan of the Los Angeles Dodgers, the Giants' archrival, and attended many games at Dodger Stadium with his father. He was drafted as a catcher by the California Angels in the 32nd round of the 1

##### Simple stories dataset

In [4]:
# Simple stories dataset
stories = load_dataset("SimpleStories/SimpleStories", split='train', cache_dir=CACHE_DIR)
print("Simple stories dataset loaded.")
print("dataset size in mb:", stories.dataset_size / (1024**2))
print("Number of entries:", len(stories))
print("-"*50)
print("Example entry:")
print(stories[random.randint(0, len(stories)-1)]['story'])



Simple stories dataset loaded.
dataset size in mb: 3030.012650489807
Number of entries: 2115696
--------------------------------------------------
Example entry:

In the night, I snuck up to the star and grabbed it. The moment I touched it, I felt a strange pull. It whispered to me, promising to grant wishes. But deep down, I knew that the wishes would bring sadness. My heart raced at the thought of taking happiness from others.

As I wandered through the village, I listened to their hopes. They wished for peace, for joy, and for love. I could twist those wishes into shadows, pulling them into my darkness. But with each wish I stole, I felt a little less like myself. The star seemed to glow dimmer with every wish I took.

At dawn, the villagers gathered, their faces filled with worry. They looked up at the sky, searching for the light I had taken. Suddenly, the star shimmered in my hand, and I saw the girl with the golden hair again. She stood there, determined. With her light, she rea

##### FineWeb-Edu dataset

In [5]:
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT",  split='train', cache_dir=CACHE_DIR)

print("FineWeb-Edu is ready.")
print("dataset size in gb:", fineweb_edu.dataset_size / (1024**3))
print("Number of entries:", len(fineweb_edu))
print("-"*50)
print("Example entry:")
print(fineweb_edu[random.randint(0, len(fineweb_edu)-1)]['text']) 


FineWeb-Edu is ready.
dataset size in gb: 45.730818568728864
Number of entries: 9672101
--------------------------------------------------
Example entry:
- Bitcoin is different than any currency you’ve used before, so it's very important to understand some key points.
- Unlike government issued money, that can be inflated at will,
- The supply of bitcoin is mathematically limited to twenty one million bitcoins
- Bitcoins are impossible to be counterfeited or inflated.
- Use them to send or receive any amount of money, with anyone, anywhere in the world, at very low cost.
- Bitcoin payments are impossible to be blocked, and bitcoin wallets can’t be frozen.
- Unless the entire world's internet is turned off, the Bitcoin network is unstoppable and cannot be censored.
- Learn even MORE!
BitCoins - Be Informed


##### OpenWebText dataset

In [6]:
# OpenWebText dataset (Skylion007/openwebtext is the original OpenWebText corpus, not OpenWebText2)
owt2 = load_dataset("Skylion007/openwebtext", split="train", cache_dir=CACHE_DIR)
print("OpenWebText dataset loaded.")
print("Dataset size in GB:", owt2.dataset_size / (1024**3))
print("Number of entries:", len(owt2))
print("-"*50)
print("Example entry:")
print(owt2[random.randint(0, len(owt2)-1)]['text'])

OpenWebText dataset loaded.
Dataset size in GB: 37.03822539001703
Number of entries: 8013769
--------------------------------------------------
Example entry:
MEMORANDUM FOR: The President

FROM: Veteran Intelligence Professionals for Sanity (VIPS)

SUBJECT: Veteran Intelligence Professionals Challenge CIA’s “Rebuttal” on Torture

Former CIA leaders responsible for allowing torture to become part of the 21st Century legacy of the CIA are trying to rehabilitate their tarnished reputations with the release of a new book, Rebuttal: The CIA Responds to the Senate Intelligence Committee’s Study of Its Detention and Interrogation Program. They are pushing the lie that the only allegations against them are from a partisan report issued by Democrats from the Senate Intelligence Committee.

We recall the answer of General John Kimmons, the former Deputy Director of Operations for the Joint Chiefs of Staff, who was asked if good intelligence could be obtained from abusive practices. He replied:

#### Some Q&A data to improve the model's ability to answer questions:

In [7]:
q_a1 = load_dataset("agentlans/text-sft-questions-answers-only", split='train', cache_dir=CACHE_DIR)
print("Q&A dataset loaded.")
print("dataset size in mb:", q_a1.dataset_size / (1024**2))
print("Number of entries:", len(q_a1))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(q_a1)-1)
print(q_a1[index]['question'][:500], "\n", q_a1[index]['answer'])

Q&A dataset loaded.
dataset size in mb: 46.480509757995605
Number of entries: 120959
--------------------------------------------------
Example entry:
What was the outcome of the British Cabinet's discussions on a draft declaration concerning Palestine in 1917? 
 The discussions led to the creation of Balfour's declaration, drafting of which involved Rothschild, Weizmann, and other key figures within the British and Zionist groups, culminating in a final version in 1917.


In [8]:
#euclaise/reddit-instruct
reddit_instruct = load_dataset("euclaise/reddit-instruct", split='train', cache_dir=CACHE_DIR)
# reddit_instruct = load_dataset("Felladrin/ChatML-reddit-instruct-curated", split='train', cache_dir=CACHE_DIR)
print("Reddit Instruct dataset loaded.")
print("dataset size in gb:", reddit_instruct.dataset_size / (1024**3))
print("Number of entries:", len(reddit_instruct))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(reddit_instruct)-1)
print(reddit_instruct[index]['post_title'][:500], reddit_instruct[index]['post_text'][:500], "\n", reddit_instruct[index]['comment_text'][:500])

Reddit Instruct dataset loaded.
dataset size in gb: 0.09901080373674631
Number of entries: 84784
--------------------------------------------------
Example entry:
Why isn't HPV vaccine recommended for adults past a certain age? I was recently looking over the [CDC recommended immunization schedule for adults](https://www.cdc.gov/vaccines/schedules/downloads/adult/adult-combined-schedule.pdf) and I noticed a couple of oddities. Most notably is the HPV vaccine. Why isn't this recommended for males past age 21 and for females past age 26? Is there a biological reason that people become less susceptible as they age? On a related note, why is 1957 the cutoff birth year for MMR vaccine? 
 No, there’s no biological reason. They don’t recommend the vaccine past a certain age because they assume that by that age that both men and women have had enough sex where they have already been exposed to the virus and wouldn’t benefit from the vaccine. 

Also, there’s a cutoff age for the MMR vaccine because the original vaccine came out in ‘63, meaning people who were kids before that year were most likely exposed to measles’s and have life long immunity due to the antibodies already being p

In [9]:
# tatsu-lab/alpaca ( for Q&A fine-tuning )
alpaca = load_dataset("tatsu-lab/alpaca", split='train')
print("Alpaca dataset loaded.")
print("dataset size in mb:", alpaca.dataset_size / (1024**2))
print("Number of entries:", len(alpaca))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(alpaca)-1)
print(alpaca[index]['instruction'][:500], "\n", alpaca[index]['output'][:500])

Alpaca dataset loaded.
dataset size in mb: 44.06797695159912
Number of entries: 52002
--------------------------------------------------
Example entry:
Rewrite the sentence to reveal the metaphor. 
 The sun was shining brightly in the clear blue sky, as bright as a golden coin.


## Data Preprocessing

#### Tokenizer setup

For this project I use tiktoken, the tokenizer library that OpenAI uses for its models.

I use the "gpt2" encoding, which is a byte pair encoding (BPE) tokenizer, and extend it with three special tokens: `<|im_start|>`, `<|im_end|>` and `<|pad|>`.

In [10]:
tokenizer_base = tiktoken.get_encoding("gpt2")

tokenizer = tiktoken.Encoding(
    name="rob-tokenizer",
    pat_str=tokenizer_base._pat_str,
    mergeable_ranks=tokenizer_base._mergeable_ranks,
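    special_tokens={
        **tokenizer_base._special_tokens,  # keeps <|endoftext|> (id 50256)
        "<|im_start|>": 100264,            # marks the start of an entry
        "<|im_end|>": 100265,              # marks the end of an entry
        "<|pad|>": 0,                      # caution: id 0 is also the regular GPT-2 token "!"
    }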
)
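When a base vocabulary is extended with extra special tokens, it is usually safest to give each new token an id that no existing token uses. In the GPT-2 encoding the regular tokens occupy ids 0 to 50255 and `<|endoftext|>` is 50256, so id 0 already belongs to a regular token. The sketch below shows one possible way to pick the next free ids. `assign_special_ids` is a hypothetical helper, not part of tiktoken, and the tiny vocabulary is a stand-in so the example runs without downloading anything.

```python
def assign_special_ids(mergeable_ranks, special_tokens, new_names):
    """Return a copy of special_tokens extended with collision-free ids.

    Hypothetical helper: new tokens get consecutive ids starting right
    after the largest id already in use.
    """
    used = set(mergeable_ranks.values()) | set(special_tokens.values())
    next_id = max(used) + 1 if used else 0
    extended = dict(special_tokens)
    for name in new_names:
        if name in extended:
            continue  # keep the id of a token that already exists
        extended[name] = next_id
        next_id += 1
    return extended


# Tiny stand-in vocabulary (assumption: real ranks come from the base encoding).
toy_ranks = {b"!": 0, b"a": 1, b"b": 2, b"ab": 3}
toy_special = {"<|endoftext|>": 4}

specials = assign_special_ids(
    toy_ranks, toy_special, ["<|im_start|>", "<|im_end|>", "<|pad|>"]
)
print(specials)
```

Applied to the real GPT-2 ranks, this scheme would yield the ids 50257, 50258 and 50259, and the model's `vocab_size` would then need to be at least 50260.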

#### Test of the byte pair encoding tokenizer 

In [11]:
# test of tokenizer on reddit_instruct
sample_text = reddit_instruct[0]['post_title'] + " " + reddit_instruct[0]['post_text'] + " " + reddit_instruct[0]['comment_text']
tokens = tokenizer.encode(sample_text)
print(tokens)
print("Decoded text:")
print(tokenizer.decode(tokens)) 
print(f"Sample text length in characters: {len(sample_text)}")
print(f"Sample text length in tokens: {len(tokens)}")   

[2061, 318, 24207, 1616, 2587, 30, 314, 2342, 257, 7684, 286, 1097, 5861, 290, 484, 1561, 546, 275, 32512, 7021, 290, 884, 11, 1312, 373, 11263, 644, 275, 32512, 318, 290, 1312, 18548, 1064, 597, 2562, 7468, 284, 644, 340, 318, 24207, 1616, 318, 655, 262, 1438, 329, 257, 16058, 286, 6147, 13, 554, 262, 29393, 995, 340, 338, 1690, 973, 355, 257, 1790, 1021, 329, 3354, 326, 547, 3235, 1389, 503, 286, 257, 1263, 2512, 286, 2587, 11, 355, 6886, 284, 11721, 3350, 654, 810, 44030, 6147, 318, 19036, 656, 257, 15936, 12070, 503, 286, 9629, 6147, 13, 7080, 3191, 318, 517, 5789, 329, 1588, 17794, 475, 340, 460, 779, 1365, 3081, 286, 21782, 290, 318, 4577, 284, 787, 329, 4833, 17794, 588, 3234, 3354, 13]
Decoded text:
What is Billet material? I watch a bunch of car videos and they talk about billet blocks and such, i was wondering what billet is and i cant find any easy explanation to what it is Billet is just the name for a chunk of metal. In the automotive world it's often used as a short hand 

### Formatting datasets functions

#### Merging datasets

In [12]:
from datasets import concatenate_datasets
combined_train_dataset = concatenate_datasets([  
    wiki_en,
    stories,
    fineweb_edu,
    owt2,  
])  

combined_finetune_dataset = concatenate_datasets([
    q_a1,
    reddit_instruct,
    alpaca,
])

# Shuffle the combined dataset
train_dataset = combined_train_dataset.shuffle(seed=42)
finetune_dataset = combined_finetune_dataset.shuffle(seed=42)
print(f"Train dataset size: {len(combined_train_dataset)}")
print(f"Finetune dataset size: {len(combined_finetune_dataset)}")

# Example

print("Example entry from train dataset:")
index = random.randint(0, len(train_dataset)-1)
print(train_dataset[index])   

Train dataset size: 26209380
Finetune dataset size: 257745
Example entry from train dataset:
{'id': '51408024', 'url': 'https://en.wikipedia.org/wiki/Josef%20Sch%C3%A4ffer', 'title': 'Josef Schäffer', 'text': 'Josef Schäffer (born July 2, 1891 in Moravia) was an Austrian track and field athlete who competed in the 1912 Summer Olympics. He competed in the decathlon, shot put, discus throw and two-handed discus throw. He finished tenth in the decathlon, throwing the second-furthest in the discus on his way to his score of 6568.585. In the shot put, he finished thirteenth. In the discus throw, he only managed to come twenty-ninth in the regular discus throw, but came sixteenth in the two-handed discus.\n\nSee also \n Austria at the 1912 Summer Olympics\n\nReferences\n\nExternal links\n \n\n1891 births\nAustrian decathletes\nAustrian shot putters\nAustrian male discus throwers\nOlympic athletes for Austria\nAthletes (track and field) at the 1912 Summer Olympics\nYear of death missing\nOlym
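Since the four pretraining sources are simply concatenated, each source's weight in the mixture is proportional to its number of entries. The sketch below computes those shares from the entry counts printed in the dataset cells above. `mixture_shares` is a hypothetical helper name, and the shares are by entry rather than by token, so the token-level mixture may differ because documents vary in length.

```python
# Entry counts copied from the outputs of the dataset cells above.
pretrain_counts = {
    "wiki_en": 6_407_814,
    "stories": 2_115_696,
    "fineweb_edu": 9_672_101,
    "owt2": 8_013_769,
}


def mixture_shares(counts):
    """Return each source's fraction of the total number of entries."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = mixture_shares(pretrain_counts)
for name, share in shares.items():
    print(f"{name:12s} {share:6.1%}")
```

The counts sum to 26209380, which matches the train dataset size printed above.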

#### Custom Dataset class

Inspired by the dataloader from the "LLMs from scratch" repository, but adapted for datasets made of many separate text rows.

https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/dataloader.ipynb

In [36]:
class GPTDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=512):
        """
        Args:
            dataset: Combined Hugging Face dataset
            tokenizer: Initialized tokenizer used to encode the text
            max_length: Context window size in tokens
        """
        self.data = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

        self.pad_token_id = 0         # <|pad|>
        self.bos_token_id = 100264    # <|im_start|>
        self.eos_token_id = 100265    # <|im_end|>

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        entry = self.data[idx]

        # Data format handling. After concatenation, a column that a source
        # dataset lacks may still be present with value None, so test values.
        def has(*keys):
            return all(entry.get(k) is not None for k in keys)

        if has('text'):
            text = entry['text']
        elif has('story'):
            text = entry['story']
        elif has('question', 'answer'):
            text = "User: " + entry['question'] + " Assistant:" + entry['answer']
        elif has('post_title', 'post_text', 'comment_text'):
            # The post title and body form the question; the comment is the answer
            text = "User: " + entry['post_title'] + " " + entry['post_text'] + " Assistant:" + entry['comment_text']
        elif has('instruction', 'output'):
            text = "User: " + entry['instruction'] + " Assistant:" + entry['output']
        else:
            raise ValueError("Unknown data entry format")

        text = str(text)  # Ensure text is a string

        # Adding start and end tokens
        text = "<|im_start|>" + text + "<|im_end|>"

        # Tokenization
        tokens = self.tokenizer.encode(text, allowed_special="all")

        # Truncation
        tokens = tokens[:self.max_length]  # Data is lost here (fix later with sliding window)

        input_ids = torch.tensor(tokens[:-1], dtype=torch.long)  # All tokens except last
        labels = torch.tensor(tokens[1:], dtype=torch.long)      # All tokens except first
        n_real = input_ids.size(0)                               # Number of real input tokens

        # Padding: every returned sequence has max_length - 1 positions
        padding_length = self.max_length - len(tokens)
        if padding_length > 0:
            input_ids = torch.cat([input_ids, torch.full((padding_length,), self.pad_token_id, dtype=torch.long)])
            labels = torch.cat([labels, torch.full((padding_length,), -100, dtype=torch.long)])

        # 1 for real tokens, 0 for padding. Built from lengths rather than from
        # input_ids != pad_token_id, because id 0 is also the regular token "!".
        attention_mask = torch.cat([
            torch.ones(n_real, dtype=torch.long),
            torch.zeros(padding_length, dtype=torch.long),
        ])

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }
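The truncation step in `__getitem__` discards everything past `max_length` tokens. One possible way to keep that text is a sliding window, in the spirit of the referenced "LLMs from scratch" dataloader. The sketch below is an illustration on plain Python lists, not the notebook's implementation: `sliding_windows` and `stride` are hypothetical names, and the toy token ids stand in for real tokenizer output.

```python
def sliding_windows(tokens, max_length, stride):
    """Split a token list into (input, target) pairs of at most max_length tokens.

    Targets are the inputs shifted one position to the right. Consecutive
    windows start `stride` tokens apart (stride == max_length: no overlap).
    """
    if max_length < 1 or stride < 1:
        raise ValueError("max_length and stride must be positive")
    pairs = []
    # A window needs at least one input token and one target token.
    for start in range(0, len(tokens) - 1, stride):
        chunk = tokens[start:start + max_length + 1]
        pairs.append((chunk[:-1], chunk[1:]))
        if start + max_length + 1 >= len(tokens):
            break  # this window already reached the end of the text
    return pairs


toy_tokens = list(range(10))
for inputs, targets in sliding_windows(toy_tokens, max_length=4, stride=4):
    print(inputs, "->", targets)
```

The last window can be shorter than `max_length` and would still need padding. Each window would also become its own dataset item, so `__len__` could no longer simply return the number of rows.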

### DataLoader setup

In [37]:
train_dataset = GPTDataset(combined_train_dataset, tokenizer, max_length=512)
print(f"Train dataset size: {len(train_dataset)}")

batch_size = 16
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

Train dataset size: 26209380
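The labels produced by `GPTDataset` are set to -100 at padded positions. This is the default `ignore_index` of PyTorch's `nn.CrossEntropyLoss`, so those positions do not contribute to the loss. The sketch below spells out that idea in plain Python on toy values. `masked_cross_entropy` is a hypothetical helper for illustration, not the loss used in this notebook.

```python
import math

IGNORE_INDEX = -100  # same value as the padded labels in GPTDataset


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: list of per-position score lists (one score per vocabulary id)
    labels: list of target ids, with ignore_index at padded positions
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: contributes nothing to the loss
        # Numerically stable log of the softmax normaliser.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[label]
        count += 1
    # Simplification of this sketch: return 0.0 if every position is ignored.
    return total / count if count else 0.0


toy_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 5.0, 5.0]]
toy_labels = [0, 1, IGNORE_INDEX]
print(masked_cross_entropy(toy_logits, toy_labels))
```

Changing the scores at the third position leaves the result unchanged, because its label is ignored.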


##### Test of an entry from dataloader

In [38]:
print("##### Test of an entry from dataloader")
batch = next(iter(train_dataloader))
print(batch)

##### Test of an entry from dataloader
{'input_ids': tensor([[100264,  20588,    642,  ...,      0,      0,      0],
        [100264,    818,    428,  ...,      0,      0,      0],
        [100264,   1722,   8536,  ...,      0,      0,      0],
        ...,
        [100264,    818,  36864,  ...,      0,      0,      0],
        [100264,  14478,   5451,  ...,      0,      0,      0],
        [100264,  14040,    271,  ...,    198,  23675,   4796]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[20588,   642,    11,  ...,  -100,  -100,  -100],
        [  818,   428,  4326,  ...,  -100,  -100,  -100],
        [ 1722,  8536,   287,  ...,  -100,  -100,  -100],
        ...,
        [  818, 36864, 29803,  ...,  -100,  -100,  -100],
        [14478,  5451,   318,  ...,  -100,  -100,  -100],

### GPT config 

In [None]:
GPT_CONFIG_124M = {
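    "vocab_size": 100266,   # must exceed the largest token id (100265 for <|im_end|>); the base GPT-2 vocabulary alone has 50257 entries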
    "context_length": 512,
    "emb_dim": 512,
    "n_heads": 8,
    "n_layers": 8,
    "drop_rate": 0.1,
    "qkv_bias": False
}
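The name `GPT_CONFIG_124M` follows the GPT-2 small naming, but with `emb_dim` 512 and 8 layers this model is smaller than GPT-2 small (768 dimensions, 12 layers). A rough parameter count can be estimated from the config alone. The sketch below assumes a GPT-2 style architecture: learned positional embeddings, attention with an output projection, a 4x feed-forward layer and two LayerNorms per block. The model code is not shown in this section, so both that architecture and the helper name `estimate_gpt_params` are assumptions.

```python
def estimate_gpt_params(cfg, tie_embeddings=False):
    """Estimate the parameter count of a GPT-2 style model from its config."""
    d, vocab, ctx = cfg["emb_dim"], cfg["vocab_size"], cfg["context_length"]
    tok_emb = vocab * d                                   # token embedding table
    pos_emb = ctx * d                                     # learned positional embeddings
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)   # query, key, value projections
    attn_out = d * d + d                                  # attention output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)           # two feed-forward layers
    norms = 2 * (2 * d)                                   # two LayerNorms (scale and shift)
    block = qkv + attn_out + mlp + norms
    final_norm = 2 * d
    out_head = 0 if tie_embeddings else vocab * d         # output layer, unless weights are tied
    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head


# Reference point: GPT-2 small settings (added for comparison, not from the notebook).
gpt2_small = {
    "vocab_size": 50257, "context_length": 1024, "emb_dim": 768,
    "n_heads": 12, "n_layers": 12, "drop_rate": 0.1, "qkv_bias": True,
}
print(f"{estimate_gpt_params(gpt2_small, tie_embeddings=True):,}")  # about 124M
```

In the notebook, `estimate_gpt_params(GPT_CONFIG_124M)` gives the corresponding estimate for the configuration above. The token embedding and the output layer each scale with `vocab_size`, so they account for a large share of the parameters in a model this small.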

## Sources 

### Principal references: 
- https://arxiv.org/abs/2005.14165 (GPT-3 paper)
- https://arxiv.org/abs/1706.03762 (Attention Is All You Need paper)
- Build a Large Language Model (from scratch) by Sebastian Raschka

### About Padding tokens in Language Modeling
- https://arxiv.org/html/2510.01238v1 