<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width="35%" align="right">


# yoctoGPT — Minimal GPT from Scratch

## _Advanced Training — Token-level Training_

**&copy; Dr. Yves J. Hilpisch**<br>AI-Powered by OpenAI & Gemini.

## How to Use This Notebook

- **Goal**: Train a GPT model using a Byte Pair Encoding (BPE) tokenizer.
- **Hardware**: Tuned for a Google Colab L4 GPU (24 GB VRAM); the training cells use bf16 mixed precision.
- **Persistence**: Mounts Google Drive for persistent checkpoint storage.

### Roadmap

1. **Setup**: Install dependencies, clone repo, and mount Google Drive.
2. **Tokenization**: Train a BPE tokenizer and encode the corpus into subword tokens.
3. **Training**: Train a model using the `gpt_fast` architecture for maximum throughput.
4. **Sampling & Resume**: Generate text using the tokenizer and resume from checkpoints.

In [None]:
#@title Mount Google Drive for checkpoint storage (repo stays local)
from google.colab import drive
from pathlib import Path

drive.mount("/content/drive")

CKPT_DIR = Path("/content/drive/MyDrive/yocto/checkpoints/colab_fast")
CKPT_DIR.mkdir(parents=True, exist_ok=True)
print("Checkpoints dir:", CKPT_DIR)

In [None]:
#@title Setup: Install Dependencies and Clone Repository
!nvidia-smi || true
!pip -q install tokenizers tqdm textstat

import os, pathlib, subprocess, textwrap

repo_root = pathlib.Path("/content/yoctoGPT")
if repo_root.exists():
    print("Repo exists, pulling latest...")
    subprocess.run(["git", "pull"], cwd=repo_root, check=False)
else:
    subprocess.run(
        [
            "git",
            "clone",
            "https://github.com/yhilpisch/yoctoGPT.git",
            str(repo_root),
        ],
        check=True,  # fail fast: os.chdir below needs the clone to succeed
    )
os.chdir(repo_root)

if os.path.exists("requirements.txt"):
    !pip -q install -r requirements.txt || true

data_dir = pathlib.Path("data")
data_dir.mkdir(exist_ok=True)
txts = list(data_dir.glob("*.txt"))
if not txts:
    sample = textwrap.dedent('''
    Philosophy is the study of general and fundamental questions,
    such as those about existence, reason, knowledge, values, mind,
    and language. It often poses questions rather than providing
    answers, inviting us to think.
    ''').strip()
    (data_dir / "philosophy.txt").write_text(sample, encoding="utf-8")
    print("Created sample data/philosophy.txt")
else:
    names = [p.name for p in txts][:5]
    print(f"Found {len(txts)} text files in data/: {names}")

### Tokenization

Unlike character-level models, token-level models use a subword encoding such as Byte Pair Encoding (BPE), which merges frequently co-occurring character sequences into single tokens. The vocabulary is larger (8,000 entries in the cell below), but each token covers more text, so a fixed `block_size` spans a longer stretch of the corpus at each step.
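
To make the merging idea concrete, here is a toy sketch of the BPE training loop in plain Python. It is not the code behind `scripts.prepare_tokenizer`; the helper names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) are hypothetical, and production tokenizers additionally weight pairs by word frequency and handle bytes, whitespace, and special tokens.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Count adjacent symbol pairs across all sequences; return the top one."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def toy_bpe(words, num_merges):
    """Start from characters and greedily apply `num_merges` pair merges."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs


words = ["reason", "reasons", "reasoning"]
merges, seqs = toy_bpe(words, num_merges=5)
for word, seq in zip(words, seqs):
    print(f"{word:10} -> {seq}")
```

Each merge shortens the sequences that contain the pair, which is why a subword model sees fewer, longer units than a character model for the same text.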

In [None]:
#@title Prepare Token-level Dataset
!python -m scripts.prepare_tokenizer \
  --all_txt_in_dir \
  --text_dir data \
  --out_dir data/token \
  --vocab_size 8000 \
  --random_split \
  --split_seed 1337

In [None]:
#@title Pick L4-friendly block_size/batch_size
import numpy as np
from pathlib import Path

train_path = Path('data/token/train.bin')
val_path = Path('data/token/val.bin')
train_tokens = int(np.fromfile(train_path, dtype=np.int32).shape[0])
val_tokens = int(np.fromfile(val_path, dtype=np.int32).shape[0])
min_tokens = min(train_tokens, val_tokens)

block_candidates = [512, 384, 256, 192, 128, 96, 64, 48, 32, 24, 16]
block_size = next((b for b in block_candidates if min_tokens > b + 2), max(8, min_tokens - 2))

target_tokens = min(32768, max(2048, min_tokens))
batch_size = max(1, min(96, target_tokens // block_size))

print(f'Using block_size={block_size}, batch_size={batch_size}')


### Training

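We use `gpt_fast` with bf16 mixed precision for efficient token-level training on a Colab L4. Note that the command below hard-codes `--block_size 512 --batch_size 128` rather than the values computed in the previous cell; on a very small corpus, substitute the computed `block_size` and `batch_size`.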
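
The training command in this section combines `--cosine_lr`, `--warmup_iters`, `--lr`, `--min_lr`, and `--max_iters`. The sketch below shows the common linear-warmup plus cosine-decay shape these flags usually describe. That yoctoGPT implements exactly this formula is an assumption, and `lr_at` is a hypothetical helper, not a function from the repository.

```python
import math


def lr_at(it, lr=1.8e-4, min_lr=1e-5, warmup_iters=300, max_iters=2000):
    """Learning rate at iteration `it` (assumed schedule, see lead-in)."""
    # Linear warmup from lr / warmup_iters up to the peak lr.
    if it < warmup_iters:
        return lr * (it + 1) / warmup_iters
    # After the schedule ends, stay at the floor.
    if it >= max_iters:
        return min_lr
    # Cosine decay from lr down to min_lr.
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (lr - min_lr)


for it in (0, 150, 300, 1150, 2000):
    print(f"iter {it:5d}: lr = {lr_at(it):.2e}")
```

Under this shape the warmup covers 300 of the 2,000 iterations, so the rate sits near its peak only briefly before decaying toward `min_lr`.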

In [None]:
#@title Train (gpt_fast) on Colab L4
from pathlib import Path

CKPT_DIR = Path('/content/drive/MyDrive/yocto/checkpoints/colab_fast')
CKPT_DIR.mkdir(parents=True, exist_ok=True)

!python -m yoctoGPT.train \
--mode token \
--data_dir data/token \
--tokenizer_path data/token/tokenizer.json \
--ckpt_dir {CKPT_DIR} \
--model_type gpt_fast \
--device cuda \
--n_layer 6 --n_head 6 --n_embd 384 \
--block_size 512 --batch_size 128 \
--dropout 0.12 --weight_decay 0.08 \
--tie_weights --label_smoothing 0.05 \
--amp --amp_dtype bf16 \
--auto_microbatch \
--eval_interval 250 --eval_iters 50 \
--cosine_lr --warmup_iters 300 \
--min_lr 1e-5 --lr 1.8e-4 \
--max_iters 2000 \
--ema --ema_decay 0.999


### Sampling and Resuming

Generate text using the BPE tokenizer and learn how to resume training from your checkpoints.
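
The sampling cell in this section passes `--temperature`, `--top_k`, and `--top_p`. As a rough illustration of what these knobs usually do to the next-token distribution, here is a minimal pure-Python sketch. The order of operations (temperature, then top-k, then top-p) and the helper names `filter_probs` and `sample` are assumptions, not the implementation in `yoctoGPT.sampler`.

```python
import math
import random


def filter_probs(logits, temperature=0.9, top_k=30, top_p=0.9):
    """Return {token_id: prob} after temperature, top-k, then top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the surviving tokens.
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]
dist = filter_probs(logits, temperature=0.9, top_k=3, top_p=0.9)
print(dist)
print("sampled id:", sample(logits, top_k=3))
```

Lower temperatures sharpen the distribution, while smaller `top_k` or `top_p` values remove more of the low-probability tail before sampling.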

In [None]:
#@title Sample Continuation
output_token = !python -m yoctoGPT.sampler \
--mode token \
--ckpt {CKPT_DIR}/best.pt \
--tokenizer_path data/token/tokenizer.json \
--prompt "In the beginning, philosophy sought to" \
--max_new_tokens 120 \
--temperature 0.9 --top_k 30 --top_p 0.9

generated_text_token = '\n'.join(output_token)
print(generated_text_token)


### Readability Assessment

This cell applies the same readability scoring as Notebook 01, so that token-level and character-level experiments can be compared on a common quality proxy.
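
To help interpret two of the scores, the sketch below evaluates the standard Flesch Reading Ease and Flesch-Kincaid Grade formulas from raw counts. The function names and the example counts are illustrative only; `textstat` does its own word, sentence, and syllable counting, so its numbers for real text may differ slightly.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher is easier; roughly 60-70 corresponds to plain English."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Approximate US school grade level needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Illustrative counts: 100 words in 5 sentences with 150 syllables.
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fkg = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"Flesch Reading Ease : {fre:.2f}")
print(f"Flesch-Kincaid Grade: {fkg:.2f}")
```

Both formulas depend only on average sentence length and average syllables per word, so they measure surface form rather than whether the generated text makes sense.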

In [None]:
#@title Analyze Generated Text Readability
from textstat import textstat

def readability_scores(text: str) -> dict:
    return {
        'Flesch Reading Ease': textstat.flesch_reading_ease(text),
        'Flesch-Kincaid Grade': textstat.flesch_kincaid_grade(text),
        'Dale-Chall Score': textstat.dale_chall_readability_score(text),
        'Text Standard': textstat.text_standard(text, float_output=False),
    }

scores = readability_scores(generated_text_token)
for metric, value in scores.items():
    print(f'{metric:20}: {value}')


In [None]:
#@title Resume training from best.pt (lower LR + early stopping)
best = CKPT_DIR / 'best.pt'
if not best.exists():
    print('No best.pt found; skipping resume cell.')
else:
    !python -m yoctoGPT.train \
      --mode token \
      --data_dir data/token \
      --tokenizer_path data/token/tokenizer.json \
      --ckpt_dir {CKPT_DIR} \
      --resume {best} \
      --model_type gpt_fast \
      --device cuda \
      --n_layer 6 --n_head 6 --n_embd 384 \
      --block_size 512 --batch_size 128 \
      --dropout 0.12 \
      --tie_weights \
      --amp --amp_dtype bf16 \
      --auto_microbatch \
      --cosine_lr --warmup_iters 80 \
      --lr 8e-5 --max_iters 1200 \
      --eval_interval 300 --eval_iters 80 \
      --early_stopping_patience 3 \
      --early_stopping_min_delta 0.01


### Exercises

1. **Vocabulary Size**: Experiment with a larger `vocab_size` (e.g., 16,000) during tokenization. How does it affect the model's parameter count and training speed?
2. **Different Architectures**: Try training with `--model_type gpt` or `gpt_plus` and compare the throughput (tokens per second) reported in the logs.
3. **Hyperparameter Tuning**: Adjust `dropout` and `lr` to see if you can achieve a lower validation loss in fewer iterations.
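
For the first exercise, a back-of-the-envelope parameter estimate can help frame expectations. The sketch below uses the usual GPT-2-style approximation of about `12 * n_embd**2` weights per transformer block and assumes learned positional embeddings. Both are assumptions about `gpt_fast`, whose actual layout may differ, and `approx_params` is a hypothetical helper.

```python
def approx_params(vocab_size, n_layer=6, n_embd=384, block_size=512,
                  tie_weights=True):
    """Rough parameter count; ignores biases and layer-norm weights."""
    tok_emb = vocab_size * n_embd            # token embedding table
    pos_emb = block_size * n_embd            # assumed learned positions
    blocks = n_layer * 12 * n_embd ** 2      # attention + MLP weights
    # With tied weights the output head reuses the token embedding.
    lm_head = 0 if tie_weights else vocab_size * n_embd
    return tok_emb + pos_emb + blocks + lm_head


for vocab in (8000, 16000):
    print(f"vocab_size={vocab:6d}: ~{approx_params(vocab):,} parameters")
```

Under these assumptions, only the embedding term scales with `vocab_size`; compare the estimate with the parameter count reported in the training logs.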

