ramuncle/webgpu-llm-benchmark

Minimum Viable Model — WebGPU LLM Benchmark

What's the smallest LLM that actually works for common browser UX tasks?

Everyone's racing to run the biggest model in a browser tab. We went the other direction: finding the minimum viable model size per task type, because shipping a 3B model for sentiment analysis is like driving a semi truck to get coffee.

Results

| Task | Min Viable Model | Speed | Quality vs 3B |
|---|---|---|---|
| Classification | 0.5B | 142 tok/s | 94% agreement (vs 97%) |
| Summarization | 1B | 74 tok/s | 91% quality (vs 100%) |
| Question Answering | 1.5B | 60 tok/s | 91% quality (vs 100%) |

Key Findings

Classification: 0.5B is enough

A 0.5B model matches GPT-4o sentiment labels 94% of the time at 142 tokens/sec. The 3B model hits 97% at 34 tok/s: roughly a 4x speed penalty for a gain of 3 percentage points in agreement. Unless you're doing financial sentiment analysis where every percentage point matters, 0.5B is the right call.
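
As a rough illustration of how an agreement figure like the 94% above could be computed, the sketch below compares a model's labels with reference labels. The function name `agreementRate` and the case-insensitive matching are assumptions made for illustration, not the benchmark's actual scoring code.

```typescript
// Hypothetical helper: fraction of predicted labels that match the reference labels.
// Matching is case-insensitive and ignores surrounding whitespace (an assumption).
function agreementRate(predicted: string[], reference: string[]): number {
  if (predicted.length === 0 || predicted.length !== reference.length) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  let matches = 0;
  for (let i = 0; i < predicted.length; i++) {
    const a = predicted[i].trim().toLowerCase();
    const b = reference[i].trim().toLowerCase();
    if (a === b) matches++;
  }
  return matches / predicted.length;
}

// Toy data, not benchmark data: three of four labels agree.
const rate = agreementRate(
  ["positive", "negative", "neutral", "positive"],
  ["positive", "negative", "positive", "positive"],
);
console.log(`agreement: ${(rate * 100).toFixed(0)}%`);
```

On a 50-sample eval set, each disagreement moves this figure by 2 percentage points, so small differences between models should be read with that granularity in mind.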

Summarization: 1B is the sweet spot

This is where model size starts to matter. 0.5B hallucinates details and misses key points (2.8/5 quality). 1B nails the core content at 74 tok/s. Going to 3B gets you slightly better nuance handling, but at about 2.5x the VRAM cost (2.4 GB vs ~950 MB) and less than half the speed.

QA: 1.5B for most cases

Simple extraction works at 1B, but anything that requires drawing an inference from the passage breaks. 1.5B handles most single-hop and simple multi-hop questions well. 3B only justifies its cost for complex multi-step reasoning chains, and at that point you might want a server-side model anyway.

VRAM Budget

| Model | VRAM (q4) | Can run on |
|---|---|---|
| 0.5B | ~400 MB | Any GPU with WebGPU |
| 1B | ~950 MB | Most discrete GPUs, some iGPUs |
| 1.5B | ~1.2 GB | Discrete GPUs, M1+ Macs |
| 3B | ~2.4 GB | Mid-range discrete GPUs |
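
A minimal sketch of how the table above might be used to filter models by a memory budget. The budget is supplied by the caller because, as far as we know, WebGPU does not report total VRAM. The names `VRAM_TABLE` and `modelsThatFit` and the 80% headroom factor are assumptions for illustration, not part of the benchmark.

```typescript
interface ModelSpec {
  name: string;
  vramMB: number; // approximate q4 footprint from the table above
}

const VRAM_TABLE: ModelSpec[] = [
  { name: "0.5B", vramMB: 400 },
  { name: "1B", vramMB: 950 },
  { name: "1.5B", vramMB: 1200 },
  { name: "3B", vramMB: 2400 },
];

// Returns the names of models whose footprint fits within a fraction
// (headroom) of the budget, leaving the rest for the browser and page.
// The default headroom of 0.8 is a guess, not a measured value.
function modelsThatFit(budgetMB: number, headroom: number = 0.8): string[] {
  const usable = budgetMB * headroom;
  return VRAM_TABLE.filter((m) => m.vramMB <= usable).map((m) => m.name);
}

console.log(modelsThatFit(2000)); // usable budget is 1600 MB
```

With a 2000 MB budget this keeps the three smaller models and drops 3B; tighten or loosen `headroom` depending on what else the page allocates.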

Try It

Serve the repository folder locally (commands below), then open index.html in Chrome 113+ or any other WebGPU-capable browser. The page shows pre-computed reference results immediately, and the "Run Live Benchmark" button lets you test the 0.5B model on your own hardware.

# Serve locally (needed for ES module imports)
npx serve .
# or
python3 -m http.server 8000
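
Before running the live benchmark it can help to confirm that the browser actually exposes WebGPU. The sketch below checks for `navigator.gpu` and then asks for an adapter. The `NavigatorLike` interface and the function name `webgpuAvailable` are assumptions so the code stays self-contained and testable outside a browser; this is not taken from the repository's index.html.

```typescript
// Minimal structural types so the sketch compiles without WebGPU type packages.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Resolves to true only if the WebGPU entry point exists and an adapter
// is returned. requestAdapter() resolves to null when no adapter is available.
async function webgpuAvailable(nav: NavigatorLike | undefined): Promise<boolean> {
  if (!nav || !nav.gpu) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch (_err) {
    return false;
  }
}

// Example with a stand-in object; in a page you would pass the global navigator.
webgpuAvailable({ gpu: { requestAdapter: async () => ({}) } }).then((ok) => {
  console.log(ok ? "WebGPU adapter found" : "WebGPU not available");
});
```

Checking for an adapter, and not only for the `gpu` property, matters because a browser can expose the API while having no usable GPU behind it.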

Methodology

  • Models: Qwen2.5 family (0.5B, 1.5B, 3B) + SmolLM2-1.7B, all ONNX-quantized (q4)
  • Runtime: Transformers.js v4 with WebGPU backend
  • Hardware: Chrome 130 Canary, NVIDIA RTX 4070 (12GB VRAM)
  • Eval: 5 runs per task per model, first run excluded (warmup). Quality scored against GPT-4o reference outputs on a 1-5 rubric by two independent raters.
  • Tasks: 50-sample eval sets per task type, covering easy/medium/hard difficulty
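
The warmup rule above (5 runs, first excluded) can be sketched as follows. The `RunSample` shape, the function name `meanTokensPerSecond`, and the choice to average per-run rates (as opposed to dividing total tokens by total time) are assumptions for illustration; the source does not say which aggregation the benchmark uses.

```typescript
interface RunSample {
  tokens: number;  // tokens generated in this run
  seconds: number; // wall-clock time for this run
}

// Drops the first run as warmup, then averages tokens/sec over the rest.
function meanTokensPerSecond(runs: RunSample[]): number {
  const measured = runs.slice(1); // first run excluded (warmup)
  if (measured.length === 0) {
    throw new Error("need at least one run after the warmup run");
  }
  const rates = measured.map((r) => r.tokens / r.seconds);
  return rates.reduce((sum, r) => sum + r, 0) / rates.length;
}

// Toy numbers, not benchmark data: the slow first run is ignored.
const toyRuns: RunSample[] = [
  { tokens: 100, seconds: 5 },
  { tokens: 100, seconds: 1 },
  { tokens: 200, seconds: 2 },
  { tokens: 300, seconds: 3 },
  { tokens: 400, seconds: 4 },
];
console.log(meanTokensPerSecond(toyRuns));
```

Excluding the first run keeps one-off costs such as shader compilation and weight upload out of the steady-state throughput figure.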

The Takeaway

Stop defaulting to the biggest model you can fit. For most browser UX tasks, you're burning 2-6x the VRAM and getting single-digit percentage improvements in quality. Match the model to the task.

| If you need... | Use this | Not this |
|---|---|---|
| Sentiment badge | 0.5B (400 MB) | 3B (2.4 GB) |
| TL;DR summary | 1B (950 MB) | 3B (2.4 GB) |
| Inline Q&A | 1.5B (1.2 GB) | 3B (2.4 GB) |
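
The 2-6x VRAM figure follows directly from the table above. A small sketch of that arithmetic, with the hypothetical names `RECOMMENDED` and `vramRatioVs3B` and the sizes copied from the table:

```typescript
type UxTask = "sentiment" | "summary" | "qa";

// Recommended model and approximate q4 footprint per task, from the table above.
const RECOMMENDED: Record<UxTask, { model: string; vramMB: number }> = {
  sentiment: { model: "0.5B", vramMB: 400 },
  summary: { model: "1B", vramMB: 950 },
  qa: { model: "1.5B", vramMB: 1200 },
};

const VRAM_3B_MB = 2400;

// How many times more VRAM the 3B model needs than the recommended one.
function vramRatioVs3B(task: UxTask): number {
  return VRAM_3B_MB / RECOMMENDED[task].vramMB;
}

for (const task of ["sentiment", "summary", "qa"] as UxTask[]) {
  const { model } = RECOMMENDED[task];
  console.log(`${task}: ${model}, 3B costs ${vramRatioVs3B(task).toFixed(1)}x the VRAM`);
}
```

The ratios come out to 6x for sentiment, about 2.5x for summaries, and 2x for Q&A.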

The minimum viable model is almost always smaller than you think.

Stack

  • Transformers.js v4 — ML inference in the browser
  • WebGPU — GPU compute in the browser
  • Zero build tools, zero frameworks, one HTML file
