Releases: workofart/ml-by-hand
GPT-2 124M pretrained on OpenWebText (56k steps)
GPT-2 124M — OpenWebText Baseline Model Card
A 124M-parameter GPT-2 trained from scratch on OpenWebText data using a hand-written deep learning library (no PyTorch in the model or training path).
Training Metrics
| Metric | Value |
|---|---|
| Validation loss (cross-entropy, nats) | 2.764 |
| Validation perplexity (exp(loss)) | 15.87 |
| Bits per token (loss / ln 2) | 3.99 |
| Steps trained | 56,000 (of 600,000 planned) |
| Tokens seen | ~27.5B (491,520 tok/step) |
| Start -> end val loss | 5.18 -> 2.76 |
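
The derived rows in the table follow from the reported loss and batch settings by simple arithmetic. The sketch below recomputes them from the numbers in this card; the variable names are illustrative and not taken from the training code. Because the loss is reported to three decimals, the last digit of a derived value may differ slightly from the table.

```python
import math

# Values copied from this model card.
val_loss_nats = 2.764            # validation cross-entropy, in nats
steps_trained = 56_000
micro_batch = 60                 # sequences per micro-batch
grad_accum = 8                   # gradient accumulation steps
seq_len = 1024                   # max sequence length

# Tokens processed per optimizer step.
tokens_per_step = micro_batch * grad_accum * seq_len

# Perplexity is the exponential of the cross-entropy in nats.
perplexity = math.exp(val_loss_nats)

# Dividing by ln 2 converts nats to bits.
bits_per_token = val_loss_nats / math.log(2)

tokens_seen = steps_trained * tokens_per_step

print(f"tokens per step = {tokens_per_step:,}")
print(f"perplexity      = {perplexity:.2f}")
print(f"bits per token  = {bits_per_token:.2f}")
print(f"tokens seen     = {tokens_seen / 1e9:.1f}B")
```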
Zero-shot evaluation
Zero-shot evaluation results for checkpoint openwebtext_gpt2_124m_baseline_GPT2_56000 via lm-evaluation-harness.
bits_per_byte, byte_perplexity, and word_perplexity are normalized by bytes or words rather than by tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). acc is also comparable.
Caveats:
- The BPE vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) were trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples comparison under identical tokenization. For a fully tokenizer-independent number, use bits-per-byte.
- The LAMBADA perplexity is token-level and so carries the usual tokenizer dependence.
| Task | Metric | Direction (↑ = higher is better, ↓ = lower is better) | Value | Baseline[1] |
|---|---|---|---|---|
| lambada_openai | acc | ↑ | 0.2989 | 0.3256 |
| lambada_openai | perplexity | ↓ | 52.7521 | 40.0554 |
| wikitext | bits_per_byte | ↓ | 1.0037 | 0.9769 |
| wikitext | byte_perplexity | ↓ | 2.0052 | 1.9682 |
| wikitext | word_perplexity | ↓ | 41.2834 | 37.3698 |
[1] The baseline was obtained by downloading GPT-2 small from https://huggingface.co/openai-community/gpt2/tree/main and evaluating it with lm-evaluation-harness.
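
The two byte-level wikitext metrics carry the same information. Assuming both are derived from the same per-byte log-likelihood, byte_perplexity equals 2 raised to bits_per_byte. The sketch below checks that relationship against the values reported in the table; the helper name is hypothetical and is not part of lm-evaluation-harness.

```python
def byte_perplexity_from_bpb(bits_per_byte: float) -> float:
    """Convert bits per byte to byte-level perplexity (hypothetical helper)."""
    return 2.0 ** bits_per_byte

# (bits_per_byte, byte_perplexity) pairs copied from the table above.
reported = {
    "this model": (1.0037, 2.0052),
    "baseline": (0.9769, 1.9682),
}

# Absolute gap between the derived and the reported byte perplexity.
gaps = {}
for name, (bpb, byte_ppl) in reported.items():
    derived = byte_perplexity_from_bpb(bpb)
    gaps[name] = abs(derived - byte_ppl)
    print(f"{name}: derived {derived:.4f}, reported {byte_ppl:.4f}")
```

Any small gap between the derived and reported figures comes from the four-decimal rounding of the table entries.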
Architecture (GPT-2 Small, 124 million parameters)
| Setting | Value |
|---|---|
| Layers / heads / hidden | 12 / 12 / 768 |
| Max sequence length | 1024 |
| Vocab size | 50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64) |
| Dropout | 0.0 |
| Parameter dtype | bfloat16 |
| Notable | packed QKV projection |
The tokenizer is a custom byte-pair encoding (BPE) tokenizer trained from scratch on OpenWebText (49,990 merges, 50,257 tokens, the same vocabulary size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.
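
The padding arithmetic can be reproduced with a round-up-to-multiple helper. This is a minimal sketch; `pad_to_multiple` is an illustrative name and not a function from the library.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple of `multiple` (illustrative helper)."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50_257                       # tokenizer vocabulary size from this card
padded_vocab = pad_to_multiple(vocab_size)
unused_rows = padded_vocab - vocab_size   # logit rows that no token id maps to

print(padded_vocab, unused_rows)
```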
Training configuration
| Setting | Value |
|---|---|
| Dataset | OpenWebText |
| Optimizer | AdamW (lr 6e-4, beta 0.95, weight_decay 0.1) |
| LR schedule | cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps |
| Grad clipping | max-norm 1.0 |
| Global batch | 480 sequences (micro 60 × 8 grad-accum) |
| Tokens / step | 491,520 |
| Eval | mean val loss over 100 batches |
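
The learning-rate schedule in the table can be sketched as follows. This is a rough illustration and not the library's code: it assumes a linear warmup (the card does not state the warmup shape) and a cosine decay that starts once warmup ends. The constants come from the table; the function name `lr_at` is hypothetical.

```python
import math

MAX_LR = 6e-4          # peak learning rate
MIN_LR = 1e-4          # floor reached at the end of the decay
WARMUP_STEPS = 1_000
DECAY_STEPS = 600_000  # planned schedule length

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical sketch)."""
    if step < WARMUP_STEPS:
        # Linear warmup: an assumption, the card only says "1,000-step warmup".
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)

for step in (0, 1_000, 56_000, 300_000, 600_000):
    print(f"step {step:>7,}: lr = {lr_at(step):.2e}")
```

Under these assumptions, the rate at step 56,000 is still close to its peak, because less than a tenth of the planned decay has elapsed.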
Limitations
- Research / educational artifact: it demonstrates that a deep learning library written from scratch can train GPT-2 to a level reasonably close to the GPT-2 small checkpoint from Hugging Face.
- Undertrained relative to its own schedule (56k of 600k planned steps).