GPT-2 124M — OpenWebText Baseline Model Card

A 124M-parameter GPT-2 trained from scratch on OpenWebText data using a hand-written deep learning library (no PyTorch in the model or training path).

Training Metrics

Metric	Value
Validation loss (cross-entropy, nats)	2.764
Validation perplexity (`exp(loss)`)	15.87
Bits per token (`loss / ln 2`)	3.99
Steps trained	56,000 (of 600,000 planned)
Tokens seen	~27.5B (491,520 tok/step)
Start -> end val loss	5.18 -> 2.76
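
The derived rows in the table follow from the validation loss and the step count. Here is a minimal sketch of that arithmetic, using only the values reported above. The variable names are illustrative and are not taken from the training code.

```python
import math

# Values copied from the table above.
val_loss_nats = 2.764        # validation cross-entropy, in nats
tokens_per_step = 491_520
steps_trained = 56_000
steps_planned = 600_000

perplexity = math.exp(val_loss_nats)           # per-token perplexity
bits_per_token = val_loss_nats / math.log(2)   # convert nats to bits
tokens_seen = tokens_per_step * steps_trained
schedule_fraction = steps_trained / steps_planned

print(f"perplexity     : {perplexity:.2f}")
print(f"bits per token : {bits_per_token:.2f}")
print(f"tokens seen    : {tokens_seen / 1e9:.1f}B")
print(f"schedule done  : {schedule_fraction:.1%}")
```

Recomputing from the rounded loss can differ from the table in the last digit, presumably because the table was derived from the unrounded loss.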

Zero-shot evaluation

Zero-shot evaluation results for checkpoint openwebtext_gpt2_124m_baseline_GPT2_56000 via lm-evaluation-harness.

bits_per_byte, byte_perplexity, and word_perplexity are normalized by bytes or words rather than tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). acc is also comparable.

Caveats:

The tokenizer's vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) were trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples comparison under identical tokenization. For a fully tokenizer-independent number, use bits-per-byte.
The LAMBADA perplexity is token-level, so it carries the usual tokenizer dependence.

Task	Metric	Direction (↑ = higher is better, ↓ = lower is better)	Value	Baseline [1]
lambada_openai	acc	↑	0.2989	0.3256
	perplexity	↓	52.7521	40.0554
wikitext	bits_per_byte	↓	1.0037	0.9769
	byte_perplexity	↓	2.0052	1.9682
	word_perplexity	↓	41.2834	37.3698

[1] The baseline was obtained by downloading the GPT-2 small checkpoint from https://huggingface.co/openai-community/gpt2/tree/main and evaluating it with lm-evaluation-harness.
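
The three WikiText metrics are tied together by simple identities, which gives a quick sanity check on the table. The sketch below assumes all three are computed from the same summed log-likelihood, normalized by bytes or words. Under that assumption, byte_perplexity equals 2 raised to bits_per_byte, and ln(word_perplexity) / ln(byte_perplexity) is the average number of bytes per word in the evaluation text. The helper names are illustrative.

```python
import math

# (bits_per_byte, byte_perplexity, word_perplexity) copied from the table.
results = {
    "this model": (1.0037, 2.0052, 41.2834),
    "baseline": (0.9769, 1.9682, 37.3698),
}

def byte_ppl_from_bpb(bits_per_byte: float) -> float:
    """byte_perplexity implied by bits_per_byte (base-2 exponent)."""
    return 2.0 ** bits_per_byte

def implied_bytes_per_word(byte_ppl: float, word_ppl: float) -> float:
    """Average bytes per word implied by the two perplexities."""
    return math.log(word_ppl) / math.log(byte_ppl)

for name, (bpb, byte_ppl, word_ppl) in results.items():
    print(
        f"{name:>10}: 2**bpb = {byte_ppl_from_bpb(bpb):.4f} "
        f"(reported {byte_ppl}), "
        f"bytes/word = {implied_bytes_per_word(byte_ppl, word_ppl):.3f}"
    )
```

Both rows should imply the same bytes-per-word ratio, since that ratio depends only on the evaluation text and not on the model or tokenizer.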

Architecture (GPT-2 Small, 124 million parameters)


Layers / heads / hidden	12 / 12 / 768
Max sequence length	1024
Vocab size	50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64)
Dropout	0.0
Parameter dtype	bfloat16
Notable	packed QKV projection
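
As a rough check on the "124M" figure, the parameter count can be estimated from the table. This sketch assumes the standard GPT-2 layout, which the card does not spell out: tied input/output embeddings, learned positional embeddings, biases on every linear layer, a 4x MLP expansion, and two LayerNorms per block plus a final one. The function name is hypothetical.

```python
def gpt2_param_count(n_layer: int, d_model: int, n_ctx: int,
                     vocab_size: int, mlp_ratio: int = 4) -> int:
    """Approximate parameter count for a GPT-2 style decoder.

    Assumes tied token embedding / output projection, so the output
    layer adds no parameters of its own.
    """
    d_ff = mlp_ratio * d_model
    tok_emb = vocab_size * d_model
    pos_emb = n_ctx * d_model
    qkv = d_model * 3 * d_model + 3 * d_model   # packed QKV weight + bias
    attn_out = d_model * d_model + d_model
    mlp_in = d_model * d_ff + d_ff
    mlp_out = d_ff * d_model + d_model
    layer_norms = 2 * (2 * d_model)             # two LayerNorms, gain + bias
    per_block = qkv + attn_out + mlp_in + mlp_out + layer_norms
    final_ln = 2 * d_model
    return tok_emb + pos_emb + n_layer * per_block + final_ln

unpadded = gpt2_param_count(12, 768, 1024, 50_257)
padded = gpt2_param_count(12, 768, 1024, 50_304)
print(f"vocab 50,257: {unpadded / 1e6:.1f}M parameters")
print(f"vocab 50,304: {padded / 1e6:.1f}M parameters")
```

If the library omits biases or unties the embeddings, the total will differ from this estimate.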

The tokenizer is a custom byte-pair encoding (BPE) tokenizer trained from scratch on OpenWebText (49,990 merges, 50,257 tokens, the same vocabulary size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.
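
The padding arithmetic can be sketched as follows. Both helper names are hypothetical, and slicing off the padded logits before sampling is an assumption about how the unused rows might be handled, not a description of the library's code.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

def drop_padding(logits: list[float], vocab_size: int) -> list[float]:
    """Keep only the logits that correspond to real tokens."""
    return logits[:vocab_size]

vocab_size = 50_257
padded_vocab = pad_to_multiple(vocab_size, 64)
unused_rows = padded_vocab - vocab_size

print(f"padded vocab: {padded_vocab}, unused rows: {unused_rows}")

# Illustration with dummy logits: the padded tail is discarded.
dummy_logits = [0.0] * padded_vocab
real_logits = drop_padding(dummy_logits, vocab_size)
print(f"logits kept for sampling: {len(real_logits)}")
```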

Training configuration


Dataset	OpenWebText
Optimizer	AdamW (lr 6e-4, beta 0.95, weight_decay 0.1)
LR schedule	cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps
Grad clipping	max-norm 1.0
Global batch	480 sequences (micro 60 × 8 grad-accum)
Tokens / step	491,520
Eval	mean val loss over 100 batches
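
Two pieces of this configuration can be sketched in a few lines: the batch arithmetic and the learning-rate schedule. The sketch assumes a linear warmup followed by a cosine decay from the peak LR to min_lr over the full 600k steps. The card says only "cosine + 1,000-step warmup", so the warmup shape and the exact step offsets are assumptions, and the function name is illustrative.

```python
import math

# Values from the tables above.
MAX_LR = 6e-4
MIN_LR = 1e-4
WARMUP_STEPS = 1_000
DECAY_STEPS = 600_000
MICRO_BATCH = 60
GRAD_ACCUM = 8
SEQ_LEN = 1024

sequences_per_step = MICRO_BATCH * GRAD_ACCUM
tokens_per_step = sequences_per_step * SEQ_LEN

def learning_rate(step: int) -> float:
    """Linear warmup (assumed), then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return MIN_LR + cosine * (MAX_LR - MIN_LR)

print(f"sequences/step: {sequences_per_step}, tokens/step: {tokens_per_step}")
for step in (0, 1_000, 56_000, 300_000, 600_000):
    print(f"step {step:>7}: lr = {learning_rate(step):.3e}")
```

Under these assumptions the learning rate at step 56,000 is still close to its peak (roughly 5.9e-4), which is consistent with the checkpoint being early in its planned schedule.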

Limitations

This is a research/educational artifact demonstrating that a deep learning library written from scratch can train GPT-2 to a level close to the GPT-2 small checkpoint from Hugging Face (0.2989 vs. 0.3256 LAMBADA accuracy; 1.0037 vs. 0.9769 WikiText bits per byte).
The model is undertrained relative to its own schedule (56k of 600k planned steps).

Releases: workofart/ml-by-hand

GPT-2 124M pretrained on OpenWebText (56k steps)
