Releases: workofart/ml-by-hand

GPT-2 124M pretrained on OpenWebText (56k steps)

11 Jun 03:48
c576280

GPT-2 124M — OpenWebText Baseline Model Card

A 124M-parameter GPT-2 trained from scratch on OpenWebText data using a hand-written deep learning library (no PyTorch in the model or training path).

[Figure: openwebtext_gpt2_124m_baseline_training]

Training Metrics

| Metric | Value |
| --- | --- |
| Validation loss (cross-entropy, nats) | 2.764 |
| Validation perplexity (exp(loss)) | 15.87 |
| Bits per token (loss / ln 2) | 3.99 |
| Steps trained | 56,000 (of 600,000 planned) |
| Tokens seen | ~27.5B (491,520 tok/step) |
| Start -> end val loss | 5.18 -> 2.76 |
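
The derived rows above follow from the validation loss and the batch configuration. A minimal sketch of that arithmetic, using only numbers reported in this card (the variable names are illustrative and not taken from the library):

```python
import math

val_loss_nats = 2.764      # reported validation cross-entropy
steps_trained = 56_000     # reported optimizer steps

# Batch shape from the training configuration: micro batch x grad-accum x sequence length
micro_batch, grad_accum, seq_len = 60, 8, 1024
tokens_per_step = micro_batch * grad_accum * seq_len

perplexity = math.exp(val_loss_nats)           # per-token perplexity
bits_per_token = val_loss_nats / math.log(2)   # convert nats to bits
tokens_seen = steps_trained * tokens_per_step

print(f"tokens per step: {tokens_per_step:,}")
print(f"perplexity:      {perplexity:.2f}")
print(f"bits per token:  {bits_per_token:.2f}")
print(f"tokens seen:     {tokens_seen / 1e9:.1f}B")
```

Because the loss is rounded to three decimals here, the recomputed perplexity can differ from the table in the last digit.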

Zero-shot evaluation

Zero-shot evaluation results for checkpoint openwebtext_gpt2_124m_baseline_GPT2_56000 via lm-evaluation-harness.

bits_per_byte, byte_perplexity, and word_perplexity are normalized by bytes or words rather than tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). acc is also comparable.

Caveats:

  • The BPE vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) were trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples comparison under identical tokenization. For a fully tokenizer-independent number, use bits-per-byte.
  • The LAMBADA perplexity is token-level and so carries the usual tokenizer dependence.

| Task | Metric | Direction (↑ = higher is better, ↓ = lower is better) | Value | Baseline[1] |
| --- | --- | --- | --- | --- |
| lambada_openai | acc | ↑ | 0.2989 | 0.3256 |
| lambada_openai | perplexity | ↓ | 52.7521 | 40.0554 |
| wikitext | bits_per_byte | ↓ | 1.0037 | 0.9769 |
| wikitext | byte_perplexity | ↓ | 2.0052 | 1.9682 |
| wikitext | word_perplexity | ↓ | 41.2834 | 37.3698 |

[1] The baseline was obtained by downloading GPT-2 small from https://huggingface.co/openai-community/gpt2/tree/main and evaluating it with lm-evaluation-harness.
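
To make the byte- and word-normalized metrics concrete, here is a rough sketch of how they relate to a summed log-likelihood. The function and argument names are hypothetical, not the harness's API; the definitions are the usual ones (negative log-likelihood divided by the number of bytes or words), which is an assumption about how the reported numbers were produced.

```python
import math

def normalized_metrics(total_log_likelihood, n_bytes, n_words):
    """Turn a summed log-likelihood (in nats) into length-normalized metrics."""
    nll = -total_log_likelihood
    return {
        "bits_per_byte": nll / (n_bytes * math.log(2)),
        "byte_perplexity": math.exp(nll / n_bytes),
        "word_perplexity": math.exp(nll / n_words),
    }

# Under these definitions, byte_perplexity equals 2 ** bits_per_byte.
# Compare that identity against the reported wikitext rows.
reported = {"this model": (1.0037, 2.0052), "baseline": (0.9769, 1.9682)}
for name, (bits_per_byte, byte_perplexity) in reported.items():
    print(name, round(2 ** bits_per_byte, 4), byte_perplexity)
```

None of these quantities depends on how the text was split into tokens, which is why they are the safer numbers to compare across tokenizers.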

Architecture (GPT-2 Small, 124 million parameters)

| Setting | Value |
| --- | --- |
| Layers / heads / hidden | 12 / 12 / 768 |
| Max sequence length | 1024 |
| Vocab size | 50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64) |
| Dropout | 0.0 |
| Parameter dtype | bfloat16 |
| Notable | packed QKV projection |
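
As a sanity check on the 124M figure, the parameter count can be reproduced from the dimensions above. This sketch assumes the standard GPT-2 layout, which this card does not spell out: learned positional embeddings, an output head tied to the token embedding, biases on every linear layer, and a 4x MLP expansion. The function name is illustrative.

```python
def gpt2_param_count(n_layer=12, d_model=768, n_ctx=1024, vocab=50_257):
    """Count parameters for an assumed standard GPT-2 layout."""
    tok_emb = vocab * d_model          # token embedding (assumed tied with the output head)
    pos_emb = n_ctx * d_model          # learned positional embedding
    layer_norm = 2 * d_model           # scale + bias
    qkv = d_model * 3 * d_model + 3 * d_model   # packed QKV projection: one matrix for Q, K and V
    attn_out = d_model * d_model + d_model      # attention output projection
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    block = 2 * layer_norm + qkv + attn_out + mlp
    return tok_emb + pos_emb + n_layer * block + layer_norm  # plus final layer norm

print(f"{gpt2_param_count():,}")               # unpadded vocabulary
print(f"{gpt2_param_count(vocab=50_304):,}")   # vocabulary padded to 50,304
```

Under these assumptions both counts round to 124M, so the padding does not change the headline size.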

The tokenizer is a custom byte pair encoder (BPE) trained from scratch on OpenWebText (49,990 merges, 50,257 tokens, the same vocab size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 (next multiple of 64) for efficiency; the extra 47 logit rows are unused.
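
The padding arithmetic can be sketched as follows; `pad_to_multiple` is a hypothetical helper written for illustration, not a function from the library.

```python
def pad_to_multiple(n, multiple=64):
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50_257
padded_vocab = pad_to_multiple(vocab_size)
unused_rows = padded_vocab - vocab_size

print(padded_vocab, unused_rows)
```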

Training configuration

| Setting | Value |
| --- | --- |
| Dataset | OpenWebText |
| Optimizer | AdamW (lr 6e-4, beta 0.95, weight_decay 0.1) |
| LR schedule | cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps |
| Grad clipping | max-norm 1.0 |
| Global batch | 480 sequences (micro 60 × 8 grad-accum) |
| Tokens / step | 491,520 |
| Eval | mean val loss over 100 batches |
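
For readers unfamiliar with the schedule, here is a minimal sketch of cosine decay with warmup using the values listed above. The exact form is an assumption: warmup is taken to be linear, and the cosine is taken to run from the end of warmup to step 600,000. The library's own implementation may differ, and `lr_at` is a hypothetical name.

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=1e-4, warmup_steps=1_000, decay_steps=600_000):
    """Learning rate at a given step (assumed linear warmup, then cosine decay)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps   # linear ramp up to max_lr
    if step >= decay_steps:
        return min_lr                               # floor after the decay window
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + cosine * (max_lr - min_lr)

for step in (0, 1_000, 56_000, 300_000, 600_000):
    print(step, f"{lr_at(step):.2e}")
```

Under this assumed schedule the learning rate at step 56,000 is still close to its peak, which fits the note that the checkpoint is early in its planned run.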

Limitations

  • Research / educational artifact demonstrating that a deep learning library written from scratch can train GPT-2 to a level reasonably close to the GPT-2 small checkpoint from Hugging Face
  • Undertrained relative to its own schedule (56k/600k steps)