# GPT Project - Training

This notebook contains the code used to train a small language model from scratch with PyTorch. The model is inspired by the GPT architecture.


#### Hardware
- RTX3060 12GB VRAM
- AMD Ryzen 7 5800X 8-Core
- 32GB RAM
- Ubuntu 22.04 LTS

In [None]:
import os
CACHE_DIR = "/home/rob/projet_gpt/cache_huggingface"
os.environ['HF_HOME'] = CACHE_DIR
os.environ['HF_DATASETS_CACHE'] = os.path.join(CACHE_DIR, "datasets")
os.environ['HF_METRICS_CACHE'] = os.path.join(CACHE_DIR, "metrics")
os.environ['HF_MODULES_CACHE'] = os.path.join(CACHE_DIR, "modules")


In [1]:
import torch, torch.nn as nn, torch.optim as optim
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import random, math

from datasets import load_dataset, concatenate_datasets
import tiktoken
from tqdm import tqdm



## Datasets

### Common knowledge datasets:

##### English Wikipedia crawled dataset

In [2]:
# English Wikipedia crawled dataset
# The dataset is cached under CACHE_DIR, which is defined in the first cell (run that cell first)
wiki_en = load_dataset("wikimedia/wikipedia", "20231101.en", split='train', cache_dir=CACHE_DIR) 
print("English Wikipedia dataset loaded.")
print("dataset size in gb:", wiki_en.dataset_size / (1024**3))
print("Number of entries:", len(wiki_en))
print("-"*50)
print("Example entry:")
print(wiki_en[random.randint(0, len(wiki_en)-1)]['text'][:500])


NameError: name 'CACHE_DIR' is not defined

##### Simple stories dataset

In [None]:
# Simple stories dataset
stories = load_dataset("SimpleStories/SimpleStories", split='train', cache_dir=CACHE_DIR)
print("Simple stories dataset loaded.")
print("dataset size in mb:", stories.dataset_size / (1024**2))
print("Number of entries:", len(stories))
print("-"*50)
print("Example entry:")
print(stories[random.randint(0, len(stories)-1)]['story'][:500])

Generating train split: 100%|██████████| 2115696/2115696 [00:06<00:00, 336346.05 examples/s]
Generating test split: 100%|██████████| 21371/21371 [00:00<00:00, 302676.62 examples/s]

Simple stories dataset loaded.
dataset size in mb: 3030.012650489807
Number of entries: 2115696
--------------------------------------------------
Example entry:
A storm raged across the sea one night. Pirates on the ship called the Stormy Wind huddled below deck, afraid. Their captain, known for his bold plans, had an idea. He said, "If we can make a shield from the storm, we will be safe!" The crew was scared but trusted him. 

They gathered old sails and ropes, tying them together. As the ship rocked, they worked hard. The captain shouted, "Hold on! We can do this!" The wind howled, but they kept going. After many tries, they created a large sail that





##### FineWeb-Edu dataset

In [None]:
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT",  split='train', cache_dir=CACHE_DIR)

print("FineWeb-Edu is ready.")
print("dataset size in gb:", fineweb_edu.dataset_size / (1024**3))
print("Number of entries:", len(fineweb_edu))
print("-"*50)
print("Example entry:")
print(fineweb_edu[random.randint(0, len(fineweb_edu)-1)]['text'][:500]) 


Generating train split: 100%|██████████| 9672101/9672101 [02:59<00:00, 54002.58 examples/s]


FineWeb-Edu is ready.
dataset size in gb: 45.730818568728864
Number of entries: 9672101
--------------------------------------------------
Example entry:
Lesson 1 (from May 4, 1771)
May 4, 1771:
The first letter of the novel introduces the reader to the protagonist, a young man named Werther. This lesson will encourage the student to formulate a character sketch of Werther by considering Goethe's initial presentation of him.
1. Reading Journal Activity: Begin a reading journal, which you will add to throughout your study of this novel. In this journal, you will write your own impressions of the novel: these can be opinions, reflections, questions


### Q&A datasets to improve the model's ability to answer questions:

In [None]:
q_a1 = load_dataset("agentlans/text-sft-questions-answers-only", split='train', cache_dir=CACHE_DIR)
print("Q&A dataset loaded.")
print("dataset size in mb:", q_a1.dataset_size / (1024**2))
print("Number of entries:", len(q_a1))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(q_a1)-1)
print(q_a1[index]['question'][:500], "\n", q_a1[index]['answer'][:500])

Generating train split: 100%|██████████| 120959/120959 [00:00<00:00, 909910.54 examples/s]
Generating validation split: 100%|██████████| 30240/30240 [00:00<00:00, 1498833.09 examples/s]

Q&A dataset loaded.
dataset size in mb: 46.480509757995605
Number of entries: 120959
--------------------------------------------------
Example entry:
What were Reshetnyak's contributions to the field of Phaeodorea systematics and morphology? 
 Reshetnyak made significant contributions to the study of Phaeodorea systematics and morphology, describing more than 20 new species and establishing a new family, Polypyramidae Reschetnjak, 1966.





In [None]:
# euclaise/reddit-instruct
reddit_instruct = load_dataset("euclaise/reddit-instruct", split='train', cache_dir=CACHE_DIR)
# reddit_instruct = load_dataset("Felladrin/ChatML-reddit-instruct-curated", split='train', cache_dir=CACHE_DIR)
print("Reddit Instruct dataset loaded.")
print("dataset size in gb:", reddit_instruct.dataset_size / (1024**3))
print("Number of entries:", len(reddit_instruct))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(reddit_instruct)-1)
print(reddit_instruct[index]['post_title'][:500], reddit_instruct[index]['post_text'][:500], "\n", reddit_instruct[index]['comment_text'][:500])

Generating train split: 100%|██████████| 84784/84784 [00:00<00:00, 534904.51 examples/s]
Generating test split: 100%|██████████| 2000/2000 [00:00<00:00, 286046.78 examples/s]

Reddit Instruct dataset loaded.
dataset size in gb: 0.09901080373674631
Number of entries: 84784
--------------------------------------------------
Example entry:
CMV: the left-right political spectrum is a largely useless classification system that does more harm than good. Politics, and its related topics, are an exceedingly multidimensional space and our obsession with mapping people and organizations to an approximate coordinate on a single dimensional scale does more to obfuscate our convictions and beliefs than it does to help identify them. In addition to being insufficient as a classifier, it imparts undue baggage due to "spectrum proximity" with unrelated views. As an example:

* If I am a strong proponent of states' rights and individual responsibility, th 
 Would you refuse to punch your father in the face, even if he told you to?

Do you find it disgusting to wash your hands in a public bathroom? 

Those questions are not of a political nature, and nobody campaigns on either issue. However, they track very well with political views: conservatives say yes, liberals say no. 

This shows that certain viewpoints do naturally cluster together. There are different outlooks on life, and they affect the opinions that people hold on many issues. 

Nobody i

In [None]:
# tatsu-lab/alpaca (for Q&A fine-tuning)
alpaca = load_dataset("tatsu-lab/alpaca", split='train', cache_dir=CACHE_DIR)
print("Alpaca dataset loaded.")
print("dataset size in mb:", alpaca.dataset_size / (1024**2))
print("Number of entries:", len(alpaca))
print("-"*50)
print("Example entry:")
index = random.randint(0, len(alpaca)-1)
print(alpaca[index]['instruction'][:500], "\n", alpaca[index]['output'][:500])

Generating train split: 100%|██████████| 52002/52002 [00:00<00:00, 760091.99 examples/s]

Alpaca dataset loaded.
dataset size in mb: 44.06797695159912
Number of entries: 52002
--------------------------------------------------
Example entry:
Describe one innovation in the automotive industry. 
 One of the major innovations in the automotive industry is autonomous driving technology. Autonomous vehicles are equipped with multiple advanced sensors and enhanced algorithms, which allow the car to detect and analyze its environment and to drive accordingly. This technology has the potential to drastically reduce the number of accidents caused by human error, while increasing the overall efficiency and sustainability of transport.





## Data Preprocessing

#### Tokenizer setup

For this project I use tiktoken as the tokenizer library, since it is the tokenizer OpenAI uses for its models.

I use the "gpt2" encoding which is a byte pair encoding (BPE) tokenizer.
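To make the idea of byte pair encoding concrete, here is a minimal toy sketch of the merge loop. It is not how tiktoken is implemented: the real "gpt2" encoding works on bytes and applies a fixed, pre-trained merge table, whereas this toy version works on characters and learns its merges on the fly. The function names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) are hypothetical and exist only for this illustration.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` greedy merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


print(toy_bpe("aaaa", 1))
print(toy_bpe("low lower lowest", 3))
```

Each merge shortens the sequence while keeping the text recoverable by simple concatenation, which is why frequent words end up as one or two tokens.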

In [None]:
tokenizer_base = tiktoken.get_encoding("gpt2")

tokenizer = tiktoken.Encoding(
    name="rob-tokenizer",
    pat_str=tokenizer_base._pat_str,
    mergeable_ranks=tokenizer_base._mergeable_ranks,
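    special_tokens={
        **tokenizer_base._special_tokens,  # keeps "<|endoftext|>": 50256
        # NOTE: these two ids lie beyond the 50257-entry gpt2 vocabulary, so the
        # model's embedding table needs at least 100266 rows to index them.
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
        # NOTE: id 0 is also the ordinary gpt2 BPE token "!".
        "<|pad|>": 0,
    }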
)
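One possible alternative, sketched below under stated assumptions, is to place the new special tokens directly after the base vocabulary so that ids stay contiguous and none of them collides with an ordinary token. This is not what the cell above does. The helper `append_special_tokens` is hypothetical, and the constants assume the standard gpt2 layout (50257 entries, with `<|endoftext|>` at id 50256). The sketch only computes the id mapping and does not call tiktoken.

```python
GPT2_N_VOCAB = 50257                      # size of the base gpt2 vocabulary
GPT2_SPECIAL = {"<|endoftext|>": 50256}   # special token already in gpt2


def append_special_tokens(base_special, n_base_vocab, new_tokens):
    """Assign each new special token the next free id after the base vocabulary."""
    special = dict(base_special)
    next_id = n_base_vocab
    for name in new_tokens:
        if name in special:
            continue  # keep the id of a token that already exists
        special[name] = next_id
        next_id += 1
    return special


special_tokens = append_special_tokens(
    GPT2_SPECIAL, GPT2_N_VOCAB, ["<|im_start|>", "<|im_end|>", "<|pad|>"]
)
print(special_tokens)

# The model's vocab_size would then have to be max id + 1.
required_vocab_size = max(special_tokens.values()) + 1
print(required_vocab_size)
```

With this layout the embedding table grows by only three rows, but any code that hard-codes the pad or end-of-sequence id would need to read the ids from this mapping instead.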

#### Test of the byte pair encoding tokenizer 

In [None]:
# test of tokenizer on reddit_instruct
sample_text = reddit_instruct[0]['post_title'] + " " + reddit_instruct[0]['post_text'] + " " + reddit_instruct[0]['comment_text']
tokens = tokenizer.encode(sample_text)
print(tokens)
print("Decoded text:")
print(tokenizer.decode(tokens)) 
print(f"Sample text length in characters: {len(sample_text)}")
print(f"Sample text length in tokens: {len(tokens)}")   

[2061, 318, 24207, 1616, 2587, 30, 314, 2342, 257, 7684, 286, 1097, 5861, 290, 484, 1561, 546, 275, 32512, 7021, 290, 884, 11, 1312, 373, 11263, 644, 275, 32512, 318, 290, 1312, 18548, 1064, 597, 2562, 7468, 284, 644, 340, 318, 24207, 1616, 318, 655, 262, 1438, 329, 257, 16058, 286, 6147, 13, 554, 262, 29393, 995, 340, 338, 1690, 973, 355, 257, 1790, 1021, 329, 3354, 326, 547, 3235, 1389, 503, 286, 257, 1263, 2512, 286, 2587, 11, 355, 6886, 284, 11721, 3350, 654, 810, 44030, 6147, 318, 19036, 656, 257, 15936, 12070, 503, 286, 9629, 6147, 13, 7080, 3191, 318, 517, 5789, 329, 1588, 17794, 475, 340, 460, 779, 1365, 3081, 286, 21782, 290, 318, 4577, 284, 787, 329, 4833, 17794, 588, 3234, 3354, 13]
Decoded text:
What is Billet material? I watch a bunch of car videos and they talk about billet blocks and such, i was wondering what billet is and i cant find any easy explanation to what it is Billet is just the name for a chunk of metal. In the automotive world it's often used as a short hand 

### Dataset formatting functions

#### Merging datasets

In [None]:
from datasets import concatenate_datasets
combined_train_dataset = concatenate_datasets([
    wiki_en,
    stories,
    fineweb_edu,
])  

combined_finetune_dataset = concatenate_datasets([
    q_a1,
    reddit_instruct,
    alpaca,
])

# Shuffle the combined dataset
train_dataset = combined_train_dataset.shuffle(seed=42)
finetune_dataset = combined_finetune_dataset.shuffle(seed=42)
print(f"Combined dataset size: {len(combined_train_dataset)}")

# Example

print("Example entry from combined dataset:")
index = random.randint(0, len(train_dataset)-1)
print(train_dataset[index])   

Combined dataset size: 18453356
Example entry from combined dataset:
{'id': '2685965', 'url': 'https://en.wikipedia.org/wiki/Roxbury%20High%20School%20%28New%20Jersey%29', 'title': 'Roxbury High School (New Jersey)', 'text': 'Roxbury High School is a four-year comprehensive public high school in the Succasunna section of Roxbury in Morris County, in the U.S. state of New Jersey, serving students in ninth grade through twelfth grades, operating as the lone secondary school of the Roxbury School District, which serves more than 3,500 students.\n\nThe school serves students from Roxbury, as well as from Mount Arlington, who attend as part of a sending/receiving relationship with the Mount Arlington School District.\n\nAs of the 2021–22 school year, the school had an enrollment of 1,195 students and 121.9 classroom teachers (on an FTE basis), for a student–teacher ratio of 9.8:1. There were 115 students (9.6% of enrollment) eligible for free lunch and 44 (3.7% of students) eligible for red

#### Custom Dataset class

In [None]:
class GPTDataset(Dataset):
    def __init__(self, hf_dataset, tokenizer, max_length=512):
        """
        Args:
            hf_dataset: The Hugging Face dataset object (from Part 1)
            tokenizer: The tokenizer to process text
            max_length: Context window size
        """
        self.data = hf_dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

        # Examples are tokenized lazily in __getitem__, not up front.
        self.pad_token_id = 0       # <|pad|>
        self.eos_token_id = 100265  # <|im_end|>

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        entry = self.data[idx]

        def has(*keys):
            # After concatenate_datasets, a column that a source dataset lacks
            # may still be present with value None, so test values, not keys.
            return all(entry.get(key) is not None for key in keys)

        # Data format handling. The instruction-style formats are checked before
        # 'text' because tatsu-lab/alpaca also ships a pre-formatted 'text' column.
        if has('question', 'answer'):
            text = "User: " + entry['question'] + " Assistant: " + entry['answer']
        elif has('post_title', 'post_text', 'comment_text'):
            # The post title and body form the user turn; the comment is the reply.
            text = ("User: " + entry['post_title'] + " " + entry['post_text']
                    + " Assistant: " + entry['comment_text'])
        elif has('instruction', 'output'):
            text = "User: " + entry['instruction'] + " Assistant: " + entry['output']
        elif has('story'):
            text = entry['story']
        elif has('text'):
            text = entry['text']
        else:
            raise ValueError("Unknown data entry format")
        

        # Adding Start and End tokens
        text = "<|im_start|>" + text + "<|im_end|>"

        # Tokenization
        tokens = self.tokenizer.encode(text, allowed_special="all")

        # Truncation and padding
        if len(tokens) > self.max_length:
            tokens = tokens[:self.max_length]
            tokens[-1] = self.eos_token_id  # ensure last token is eos

        input_ids = torch.tensor(tokens, dtype=torch.long)
        attention_mask = torch.ones_like(input_ids)


        #Padding 
        padding_length = self.max_length - input_ids.size(0)
        if padding_length > 0:
            # Create padding tensors
            pad_ids = torch.full((padding_length,), self.pad_token_id, dtype=torch.long)
            pad_mask = torch.zeros((padding_length,), dtype=torch.long) # 0 = ignore
            
            # Concatenate
            input_ids = torch.cat([input_ids, pad_ids])
            attention_mask = torch.cat([attention_mask, pad_mask])


        labels = input_ids.clone()

        #adding the ignore index for padding tokens
        if padding_length > 0:
            # We know the padding is at the end
            labels[-padding_length:] = -100

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }
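The truncation, padding and label-masking steps of `__getitem__` can be illustrated on a tiny sequence with plain Python lists. This is a minimal sketch rather than the class itself: `pad_and_mask` is a hypothetical helper, and the token ids in the usage lines are made up. The value -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so positions carrying it do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_and_mask(tokens, max_length, pad_id, eos_id):
    """Truncate or right-pad `tokens` to `max_length` and build mask and labels."""
    tokens = list(tokens)
    if len(tokens) > max_length:
        tokens = tokens[:max_length]
        tokens[-1] = eos_id  # make sure a truncated sequence still ends with EOS

    n_pad = max_length - len(tokens)
    input_ids = tokens + [pad_id] * n_pad
    attention_mask = [1] * len(tokens) + [0] * n_pad   # 0 = padding position
    labels = tokens + [IGNORE_INDEX] * n_pad           # padding is ignored by the loss
    return input_ids, attention_mask, labels


# Short sequence: two padding positions are appended and masked out.
print(pad_and_mask([5, 6, 7], max_length=5, pad_id=0, eos_id=9))

# Long sequence: truncated to three tokens, the last one replaced by EOS.
print(pad_and_mask([1, 2, 3, 4], max_length=3, pad_id=0, eos_id=9))
```

Note that the labels are a copy of the inputs and are not shifted here, so the one-position shift needed for next-token prediction is assumed to happen later, in the model or the loss computation.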

### DataLoader setup

In [None]:
# Wrap the shuffled Hugging Face dataset in the GPTDataset class defined above
train_data = GPTDataset(train_dataset, tokenizer, max_length=512)
print(f"Train dataset size: {len(train_data)}")

batch_size = 16
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

Train dataset size: 18453356
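As a rough sanity check on what one epoch means with these settings, the arithmetic below derives the number of optimizer steps and the number of token positions per batch from the printed dataset size, the batch size of 16 and the context length of 512. It is a back-of-the-envelope sketch: the token count is an upper bound because it also counts padding positions, and the variable names are only illustrative.

```python
import math

num_examples = 18_453_356   # dataset size printed above
batch_size = 16
context_length = 512

# DataLoader keeps the last, smaller batch by default (drop_last=False).
steps_per_epoch = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size

# Every example is padded or truncated to context_length positions.
positions_per_batch = batch_size * context_length
positions_per_epoch = num_examples * context_length  # upper bound, includes padding

print(f"steps per epoch:      {steps_per_epoch:,}")
print(f"size of last batch:   {last_batch_size}")
print(f"positions per batch:  {positions_per_batch:,}")
print(f"positions per epoch:  {positions_per_epoch:,}")
```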


### GPT config 

In [5]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}
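Two quick checks can be run against this configuration without building the model. The sketch below estimates the parameter count and verifies that the largest token id fits into `vocab_size`. It assumes a GPT-2-style block (4x-wide feed-forward layer, two LayerNorms per block, learned positional embeddings, biases on the output projection and feed-forward layers), which the notebook has not defined at this point, so the numbers are an estimate. `estimate_params` and `vocab_covers` are hypothetical helper names.

```python
GPT_CONFIG = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}


def estimate_params(cfg, tie_weights=True):
    """Estimate the parameter count of an assumed GPT-2-style decoder."""
    d = cfg["emb_dim"]
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)   # query, key, value
    attn_out = d * d + d                                   # attention output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)            # two feed-forward layers
    norms = 2 * (2 * d)                                    # two LayerNorms (scale + shift)
    per_layer = qkv + attn_out + mlp + norms

    total = cfg["vocab_size"] * d          # token embeddings
    total += cfg["context_length"] * d     # positional embeddings
    total += cfg["n_layers"] * per_layer
    total += 2 * d                         # final LayerNorm
    if not tie_weights:
        total += d * cfg["vocab_size"]     # separate output head without bias
    return total


def vocab_covers(cfg, max_token_id):
    """True if every id up to `max_token_id` indexes a row of the embedding table."""
    return max_token_id < cfg["vocab_size"]


print(f"tied output head:   {estimate_params(GPT_CONFIG):,}")
print(f"untied output head: {estimate_params(GPT_CONFIG, tie_weights=False):,}")
print("covers <|endoftext|> (50256):", vocab_covers(GPT_CONFIG, 50256))
print("covers <|im_end|> (100265):  ", vocab_covers(GPT_CONFIG, 100265))
```

Under these assumptions the tied variant comes out at roughly 124M parameters, matching the name of the config. The second check fails for id 100265, which suggests that either `vocab_size` or the special-token ids need to change before training.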

## Sources 

### Principal references: 
- https://arxiv.org/abs/2005.14165 (GPT-3 paper)
- https://arxiv.org/abs/1706.03762 (Attention Is All You Need paper)
- Build a Large Language Model (from scratch) by Sebastian Raschka

### About Padding tokens in Language Modeling
- https://arxiv.org/html/2510.01238v1 