<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width="35%" align="right">


# yoctoGPT — Minimal GPT from Scratch

## _Scaling Up — Character-level Training_

**&copy; Dr. Yves J. Hilpisch**<br>AI-Powered by OpenAI & Gemini.

## How to Use This Notebook

- **Goal**: Train a larger character-level GPT model on a combined corpus.
- **Hardware**: Tuned for a Google Colab L4 GPU (24 GB VRAM); the training cells use bf16 mixed precision.
- **Persistence**: Mounts Google Drive for persistent checkpoint storage.

### Roadmap

1. **Setup**: Install dependencies, clone repo, and mount Google Drive.
2. **Data**: Merge multiple text files into a large character-level corpus.
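3. **Training**: Train a larger model (8 layers, 8 heads, 512 embedding dim) with `gpt_fast`, bf16 mixed precision, and a cosine learning-rate schedule.
4. **Sampling & Resume**: Sample from the best checkpoint, score the output for readability, and resume training from a saved checkpoint.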

In [None]:
#@title Mount Google Drive for checkpoint storage (repo stays local)
from google.colab import drive
from pathlib import Path

drive.mount("/content/drive")

CKPT_DIR = Path("/content/drive/MyDrive/yocto/checkpoints/colab_char")
CKPT_DIR.mkdir(parents=True, exist_ok=True)
print("Checkpoints dir:", CKPT_DIR)

In [None]:
#@title Setup: Install Dependencies and Clone Repository
!nvidia-smi || true
!pip -q install tokenizers tqdm textstat

import os, pathlib, subprocess, textwrap

repo_root = pathlib.Path("/content/yoctoGPT")
if repo_root.exists():
    print("Repo exists, pulling latest...")
    subprocess.run(["git", "pull"], cwd=repo_root, check=False)
else:
    subprocess.run(
        [
            "git",
            "clone",
            "https://github.com/yhilpisch/yoctoGPT.git",
            str(repo_root),
        ],
        check=True,  # fail fast: os.chdir below needs the clone to exist
    )
os.chdir(repo_root)

if os.path.exists("requirements.txt"):
    !pip -q install -r requirements.txt || true

data_dir = pathlib.Path("data")
data_dir.mkdir(exist_ok=True)
txts = list(data_dir.glob("*.txt"))
if not txts:
    sample = textwrap.dedent('''
    Philosophy is the study of general and fundamental questions,
    such as those about existence, reason, knowledge, values, mind,
    and language. It often poses questions rather than providing
    answers, inviting us to think.
    ''').strip()
    (data_dir / "philosophy.txt").write_text(sample, encoding="utf-8")
    print("Created sample data/philosophy.txt")
else:
    names = [p.name for p in txts][:5]
    print(f"Found {len(txts)} text files in data/: {names}")

### Data Preparation

We merge all available text files in the `data/` directory to create a larger, more diverse training set for our character-level model.

In [None]:
#@title Merge Texts and Prepare Char-level Dataset
data_dir = pathlib.Path("data")
texts = []
for p in sorted(data_dir.glob("*.txt")):
    print("Including:", p)
    texts.append(p.read_text(encoding="utf-8"))

merged_path = data_dir / "all_texts_char.txt"
merged_path.write_text("\n\n".join(texts), encoding="utf-8")
print("Wrote merged corpus to", merged_path)

!python -m scripts.prepare_char_data \
  --text_path data/all_texts_char.txt \
  --out_dir data/char
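
To make the character-level encoding concrete, here is a minimal sketch of what a preparation step of this kind typically does: build a vocabulary from the unique characters, map the text to integer IDs, and split into train and validation parts. The function name `encode_chars` and the 90/10 split are assumptions for illustration; the actual `scripts.prepare_char_data` may differ in both.

```python
import numpy as np

def encode_chars(text, val_fraction=0.1):
    """Map each character to an integer ID and split into train/val arrays."""
    chars = sorted(set(text))                     # vocabulary = unique characters
    stoi = {ch: i for i, ch in enumerate(chars)}  # character -> ID
    ids = np.array([stoi[ch] for ch in text], dtype=np.int32)
    split = int(len(ids) * (1.0 - val_fraction))  # assumed 90/10 split
    return stoi, ids[:split], ids[split:]

text = "hello world, hello yocto"
stoi, train_ids, val_ids = encode_chars(text)

# Decoding inverts the mapping and recovers the original characters.
itos = {i: ch for ch, i in stoi.items()}
decoded = "".join(itos[int(i)] for i in train_ids)
print(len(stoi), len(train_ids), len(val_ids), repr(decoded))
```

The sketch uses `np.int32` to match the dtype the next cell uses when it reads `train.bin` and `val.bin`.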

In [None]:
#@title Pick L4-friendly block_size/batch_size
import numpy as np
from pathlib import Path

train_path = Path('data/char/train.bin')
val_path = Path('data/char/val.bin')
train_tokens = int(np.fromfile(train_path, dtype=np.int32).shape[0])
val_tokens = int(np.fromfile(val_path, dtype=np.int32).shape[0])
min_tokens = min(train_tokens, val_tokens)

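# Largest candidate context length that still fits inside the smaller split.
block_candidates = [512, 384, 256, 192, 128, 96, 64, 48, 32, 24, 16]
block_size = next(
    (b for b in block_candidates if min_tokens > b + 2),
    max(8, min_tokens - 2),  # fallback for very small corpora
)

# Aim for at most 49,152 tokens per batch, capped at 128 sequences.
target_tokens = min(49152, max(2048, min_tokens))
batch_size = max(1, min(128, target_tokens // block_size))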

print(f'Using block_size={block_size}, batch_size={batch_size}')
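
The selection rule in the cell above is easier to follow with a few concrete inputs. The sketch below wraps the same arithmetic in a function (the name `pick_sizes` is ours, not part of the repository) and evaluates it for three hypothetical corpus sizes.

```python
def pick_sizes(min_tokens):
    """Same heuristic as the cell above, as a reusable function."""
    block_candidates = [512, 384, 256, 192, 128, 96, 64, 48, 32, 24, 16]
    block_size = next(
        (b for b in block_candidates if min_tokens > b + 2),
        max(8, min_tokens - 2),
    )
    target_tokens = min(49152, max(2048, min_tokens))
    batch_size = max(1, min(128, target_tokens // block_size))
    return block_size, batch_size

# Hypothetical sizes of the smaller split (in tokens).
for n in (1_000_000, 300, 10):
    print(n, pick_sizes(n))
```

For a large corpus the rule settles on `block_size=512` and `batch_size=96`, which is 49,152 tokens per batch. Tiny corpora fall back to short contexts so that a full block always fits inside the validation split.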


### Training

We train with `gpt_fast` and bf16 mixed precision, a configuration aimed at the L4 GPU that is meant to raise throughput while preserving model quality.

In [None]:
#@title Train (char-level GPT) on Colab L4
!python -m yoctoGPT.train \
--mode char \
--data_dir data/char \
--ckpt_dir {CKPT_DIR} \
--model_type gpt_fast \
--device cuda \
--n_layer 8 --n_head 8 --n_embd 512 \
--block_size {block_size} --batch_size {batch_size} \
--dropout 0.1 --weight_decay 0.1 \
--tie_weights --label_smoothing 0.05 \
--amp --amp_dtype bf16 \
--auto_microbatch \
--eval_interval 500 --eval_iters 30 \
--cosine_lr --warmup_iters 200 \
--min_lr 1e-5 --lr 1.2e-4 \
--max_iters 4000 \
--ema --ema_decay 0.999
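
The flags `--cosine_lr`, `--warmup_iters`, `--lr`, and `--min_lr` describe a learning-rate schedule with linear warmup followed by cosine decay. The following is a minimal sketch of that common schedule using the values from the cell above; it is an assumption about the general shape, and the exact formula inside `yoctoGPT.train` may differ.

```python
import math

def lr_at(it, lr=1.2e-4, min_lr=1e-5, warmup_iters=200, max_iters=4000):
    """Linear warmup to `lr`, then cosine decay down to `min_lr`."""
    if it < warmup_iters:
        return lr * (it + 1) / warmup_iters           # linear ramp-up
    if it >= max_iters:
        return min_lr                                 # floor after the schedule ends
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (lr - min_lr)

for it in (0, 199, 2100, 4000):
    print(it, f"{lr_at(it):.2e}")
```

Halfway through the decay phase (iteration 2100) the rate sits midway between `lr` and `min_lr`, at 6.5e-5.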


### Sampling and Resuming

After training, you can generate text or continue training from your saved checkpoints on Google Drive.

In [None]:
#@title Sample Continuation
output_large_char = !python -m yoctoGPT.sampler \
--mode char \
--ckpt {CKPT_DIR}/best.pt \
--vocab_path data/char/vocab.json \
--prompt "Philosophy is" \
--max_new_tokens 300 \
--temperature 0.8 --top_k 50 --top_p 0.95

generated_text_large_char = '\n'.join(output_large_char)
print(generated_text_large_char)
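
The sampler flags `--temperature`, `--top_k`, and `--top_p` shape the next-token distribution before a token is drawn. Below is a NumPy sketch of how these three filters are commonly combined; the function `filter_logits` is hypothetical, and the repository's sampler may apply the steps in a different order or with different tie handling.

```python
import numpy as np

def filter_logits(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Return sampling probabilities after temperature, top-k, and top-p filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k: mask everything below the k-th largest logit.
    if top_k is not None and top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cum, top_p)) + 1
    out = np.zeros_like(probs)
    out[order[:n_keep]] = probs[order[:n_keep]]
    return out / out.sum()

# Toy distribution over three tokens with probabilities 0.5, 0.3, 0.2.
p = filter_logits(np.log([0.5, 0.3, 0.2]), temperature=1.0, top_k=None, top_p=0.7)
rng = np.random.default_rng(0)
token = rng.choice(len(p), p=p)  # draw one token ID from the filtered distribution
print(p, token)
```

With `top_p=0.7` the least likely token is dropped and the remaining mass is renormalized to 0.625 and 0.375. Lower temperatures sharpen the distribution before either filter is applied.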


### Readability Assessment

We score the generated text with the same readability metrics used in Notebook 01, so results are directly comparable across experiments.

In [None]:
#@title Analyze Generated Text Readability
from textstat import textstat

def readability_scores(text: str) -> dict:
    return {
        'Flesch Reading Ease': textstat.flesch_reading_ease(text),
        'Flesch-Kincaid Grade': textstat.flesch_kincaid_grade(text),
        'Dale-Chall Score': textstat.dale_chall_readability_score(text),
        'Text Standard': textstat.text_standard(text, float_output=False),
    }

scores = readability_scores(generated_text_large_char)
for metric, value in scores.items():
    print(f'{metric:20}: {value}')
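
For orientation, the two Flesch metrics are simple functions of word, sentence, and syllable counts. The sketch below implements the standard published formulas from raw counts; `textstat` derives those counts with its own heuristics, so its scores for real text will not match a hand count exactly.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher is easier; roughly 60-70 corresponds to plain English."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Approximate US school grade level needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(round(ease, 3), round(grade, 2))
```

Both scores depend only on sentence length and syllables per word, so for character-level samples, which can contain invented words, they are best read as a rough signal rather than a measure of meaning.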


In [None]:
#@title Resume training from best.pt (lower LR + early stopping)
best = CKPT_DIR / 'best.pt'
if not best.exists():
    print('No best.pt found; skipping resume cell.')
else:
    !python -m yoctoGPT.train \
        --mode char \
        --data_dir data/char \
        --ckpt_dir {CKPT_DIR} \
        --resume {best} \
        --model_type gpt_fast \
        --device cuda \
        --n_layer 8 --n_head 8 --n_embd 512 \
        --block_size {block_size} --batch_size {batch_size} \
        --amp --amp_dtype bf16 \
        --auto_microbatch \
        --cosine_lr --warmup_iters 50 \
        --lr 8e-5 --max_iters 1000 \
        --eval_interval 250 --eval_iters 60 \
        --early_stopping_patience 3 \
        --early_stopping_min_delta 0.01
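
The two early-stopping flags are typically interpreted as follows: training stops once the validation loss has failed to improve by at least `min_delta` for `patience` consecutive evaluations. The sketch below illustrates that common rule on a made-up loss history; the function name and the exact comparison are assumptions, and `yoctoGPT.train` may implement the check differently.

```python
def early_stop_index(val_losses, patience=3, min_delta=0.01):
    """Return the evaluation index at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:   # meaningful improvement: reset the counter
            best = loss
            bad_evals = 0
        else:                         # no (or too small an) improvement
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Hypothetical validation losses, one per evaluation.
print(early_stop_index([2.0, 1.8, 1.797, 1.795, 1.80]))  # stops at index 4
print(early_stop_index([2.0, 1.9, 1.8, 1.7]))            # keeps improving: None
```

With `--eval_interval 250` and `--early_stopping_patience 3`, a rule like this would tolerate up to 750 iterations without meaningful improvement before stopping.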


### Exercises

1. **Architecture Trade-offs**: Try `gpt_plus` (accuracy-focused) instead of `gpt_fast`. Does it reach a lower validation loss for the same number of iterations?
2. **Longer Context**: Increase `block_size` to 1024 (if memory and corpus size permit) and observe the coherence of the generated text. The selection cell above caps the value at 512, so set it manually.
3. **Regularization**: Experiment with different `dropout` and `weight_decay` values to mitigate overfitting if the gap between train and val loss becomes too large.
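
For the regularization exercise, a small grid over `dropout` and `weight_decay` can be scripted. The sketch below only builds the command lines, using flags that appear in the training cell above; the checkpoint directory layout (`checkpoints/sweep/...`) and the grid values are hypothetical, and the remaining training flags (block size, batch size, AMP, schedule) are left out for brevity.

```python
import itertools
import shlex

BASE = [
    "python", "-m", "yoctoGPT.train",
    "--mode", "char",
    "--data_dir", "data/char",
    "--model_type", "gpt_fast",
    "--device", "cuda",
    "--n_layer", "8", "--n_head", "8", "--n_embd", "512",
    "--max_iters", "4000",
]

def sweep_commands(dropouts, weight_decays, ckpt_root="checkpoints/sweep"):
    """Build one training command per (dropout, weight_decay) pair."""
    cmds = []
    for dropout, wd in itertools.product(dropouts, weight_decays):
        run_dir = f"{ckpt_root}/do{dropout}_wd{wd}"  # hypothetical layout
        cmds.append(BASE + [
            "--ckpt_dir", run_dir,
            "--dropout", str(dropout),
            "--weight_decay", str(wd),
        ])
    return cmds

commands = sweep_commands([0.0, 0.1], [0.01, 0.1])
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a notebook cell with a leading `!`. Keeping a separate checkpoint directory per run avoids one configuration overwriting another's `best.pt`.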

