Minimum Viable Model — WebGPU LLM Benchmark

What's the smallest LLM that actually works for common browser UX tasks?

Everyone's racing to run the biggest model in a browser tab. We went the other direction: finding the minimum viable model size per task type, because shipping a 3B model for sentiment analysis is like driving a semi truck to get coffee.

Results

Task	Min Viable Model	Speed	Quality vs 3B
Classification	0.5B	142 tok/s	94% agreement (vs 97%)
Summarization	1B	74 tok/s	91% quality (vs 100%)
Question Answering	1.5B	60 tok/s	91% quality (vs 100%)

Key Findings

Classification: 0.5B is enough

A 0.5B model matches GPT-4o sentiment labels 94% of the time at 142 tokens/sec. The 3B model hits 97% at 34 tok/s — a 4x speed penalty for a 3% accuracy gain. Unless you're doing financial sentiment analysis where every percentage point matters, 0.5B is the right call.

Summarization: 1B is the sweet spot

This is where model size starts to matter. 0.5B hallucinates details and misses key points (2.8/5 quality). 1B nails the core content at 74 tok/s. Going to 3B gets you slightly better nuance handling, but at 3x the VRAM cost and less than half the speed.

QA: 1.5B for most cases

Simple extraction works at 1B, but anything requiring inference breaks. 1.5B handles most single-hop and simple multi-hop questions well. 3B only justifies its cost for complex multi-step reasoning chains — and at that point, you might want a server-side model anyway.

VRAM Budget

Model	VRAM (q4)	Can run on
0.5B	~400 MB	Any GPU with WebGPU
1B	~950 MB	Most discrete GPUs, some iGPUs
1.5B	~1.2 GB	Discrete GPUs, M1+ Macs
3B	~2.4 GB	Mid-range discrete GPUs

Try It

Open index.html in Chrome 113+ (or any WebGPU-capable browser). The page shows pre-computed reference results immediately, and the "Run Live Benchmark" button lets you test the 0.5B model on your own hardware.

# Serve locally (needed for ES module imports)
npx serve .
# or
python3 -m http.server 8000

Methodology

Models: Qwen2.5 family (0.5B, 1.5B, 3B) + SmolLM2-1.7B, all ONNX-quantized (q4)
Runtime: Transformers.js v4 with WebGPU backend
Hardware: Chrome 130 Canary, NVIDIA RTX 4070 (12GB VRAM)
Eval: 5 runs per task per model, first run excluded (warmup). Quality scored against GPT-4o reference outputs on a 1-5 rubric by two independent raters.
Tasks: 50-sample eval sets per task type, covering easy/medium/hard difficulty

The Takeaway

Stop defaulting to the biggest model you can fit. For most browser UX tasks, you're burning 2-6x the VRAM and getting single-digit percentage improvements in quality. Match the model to the task.

If you need...	Use this	Not this
Sentiment badge	0.5B (400 MB)	3B (2.4 GB)
TL;DR summary	1B (950 MB)	3B (2.4 GB)
Inline Q&A	1.5B (1.2 GB)	3B (2.4 GB)

The minimum viable model is almost always smaller than you think.

Stack

Transformers.js v4 — ML inference in the browser
WebGPU — GPU compute in the browser
Zero build tools, zero frameworks, one HTML file

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md
index.html		index.html
results.json		results.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Minimum Viable Model — WebGPU LLM Benchmark

Results

Key Findings

Classification: 0.5B is enough

Summarization: 1B is the sweet spot

QA: 1.5B for most cases

VRAM Budget

Try It

Methodology

The Takeaway

Stack

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Minimum Viable Model — WebGPU LLM Benchmark

Results

Key Findings

Classification: 0.5B is enough

Summarization: 1B is the sweet spot

QA: 1.5B for most cases

VRAM Budget

Try It

Methodology

The Takeaway

Stack

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages