What's the smallest LLM that actually works for common browser UX tasks?
Everyone's racing to run the biggest model in a browser tab. We went the other direction: finding the minimum viable model size per task type, because shipping a 3B model for sentiment analysis is like driving a semi truck to get coffee.
| Task | Min Viable Model | Speed | Quality vs 3B |
|---|---|---|---|
| Classification | 0.5B | 142 tok/s | 94% agreement (vs 97%) |
| Summarization | 1B | 74 tok/s | 91% quality (vs 100%) |
| Question Answering | 1.5B | 60 tok/s | 91% quality (vs 100%) |
For classification, a 0.5B model matches GPT-4o sentiment labels 94% of the time at 142 tok/s. The 3B model hits 97% at 34 tok/s: roughly a 4x speed penalty for a 3-percentage-point accuracy gain. Unless you're doing financial sentiment analysis where every percentage point matters, 0.5B is the right call.
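The trade-off is easy to check from the table. This is a minimal sketch using only the classification figures quoted above; the variable names are made up for illustration.

```javascript
// Figures copied from the classification row of the results table.
const small = { name: "0.5B", tokPerSec: 142, agreementPct: 94 };
const large = { name: "3B", tokPerSec: 34, agreementPct: 97 };

// How many times faster the small model generates tokens.
const speedup = small.tokPerSec / large.tokPerSec;

// Accuracy given up, in percentage points (not relative percent).
const pointsLost = large.agreementPct - small.agreementPct;

console.log(
  `${small.name} is ${speedup.toFixed(1)}x faster and ${pointsLost} points less accurate`
);
```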
Summarization is where model size starts to matter. 0.5B hallucinates details and misses key points (2.8/5 quality). 1B nails the core content at 74 tok/s. Going to 3B gets you slightly better nuance handling, but at roughly 2.5x the VRAM cost and less than half the speed.
For question answering, simple extraction works at 1B, but anything requiring inference breaks. 1.5B handles most single-hop and simple multi-hop questions well. 3B only justifies its cost for complex multi-step reasoning chains, and at that point you might want a server-side model anyway.
| Model | VRAM (q4) | Can run on |
|---|---|---|
| 0.5B | ~400 MB | Any GPU with WebGPU |
| 1B | ~950 MB | Most discrete GPUs, some iGPUs |
| 1.5B | ~1.2 GB | Discrete GPUs, M1+ Macs |
| 3B | ~2.4 GB | Mid-range discrete GPUs |
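One way to use this table is as a lookup when deciding what to load. The sketch below is illustrative only: `modelsThatFit` and `headroomMB` are hypothetical names, and the budget is a number you choose yourself. The headroom parameter is an assumed reserve for activations and the KV cache, which the table's weight-only figures do not cover.

```javascript
// Approximate q4 VRAM footprints in MB, taken from the table above.
const VRAM_MB = { "0.5B": 400, "1B": 950, "1.5B": 1200, "3B": 2400 };

// Hypothetical helper: list the models that fit in a VRAM budget, smallest first.
// headroomMB is an assumed safety margin added to every model's footprint.
function modelsThatFit(budgetMB, headroomMB = 0) {
  return Object.entries(VRAM_MB)
    .filter(([, mb]) => mb + headroomMB <= budgetMB)
    .sort((a, b) => a[1] - b[1])
    .map(([name]) => name);
}

console.log(modelsThatFit(1000));      // models that fit in 1000 MB
console.log(modelsThatFit(1000, 100)); // same budget, 100 MB reserved
```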
Open index.html in Chrome 113+ (or any WebGPU-capable browser), served from a local web server rather than straight from disk. The page shows pre-computed reference results immediately, and the "Run Live Benchmark" button lets you test the 0.5B model on your own hardware.
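If you want to check for WebGPU before offering a live run, something like the following should work. It is a sketch, not the code this page ships: `hasWebGPU` is a hypothetical helper, and the `nav` parameter exists only so the check can be exercised outside a browser. `navigator.gpu.requestAdapter()` is the standard WebGPU entry point and resolves to `null` when no adapter is available.

```javascript
// Resolves to true only if the WebGPU API exists and an adapter is available.
// `nav` defaults to the global navigator; pass a stub to test outside a browser.
async function hasWebGPU(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    // Treat any failure while requesting an adapter as "not supported".
    return false;
  }
}
```

A page could call `hasWebGPU()` on load and disable the benchmark button when it resolves to `false`.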
```bash
# Serve locally (needed for ES module imports)
npx serve .
# or
python3 -m http.server 8000
```

- Models: Qwen2.5 family (0.5B, 1.5B, 3B) + SmolLM2-1.7B, all ONNX-quantized (q4)
- Runtime: Transformers.js v4 with WebGPU backend
- Hardware: Chrome 130 Canary, NVIDIA RTX 4070 (12GB VRAM)
- Eval: 5 runs per task per model, first run excluded (warmup). Quality scored against GPT-4o reference outputs on a 1-5 rubric by two independent raters.
- Tasks: 50-sample eval sets per task type, covering easy/medium/hard difficulty
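The aggregation described above can be sketched as follows. The function names are hypothetical, and averaging the post-warmup runs is an assumption, since the list does not say how the four measured runs are combined. The sample numbers in the usage lines are made up for illustration and are not benchmark data.

```javascript
// Mean of a metric across runs, dropping the first (warmup) run.
// Assumption: the remaining runs are combined with a plain arithmetic mean.
function meanExcludingWarmup(runs) {
  const measured = runs.slice(1);
  if (measured.length === 0) throw new Error("need at least two runs");
  return measured.reduce((sum, x) => sum + x, 0) / measured.length;
}

// Fraction of samples where the model's label equals the reference label.
function agreementRate(predicted, reference) {
  if (predicted.length !== reference.length) throw new Error("length mismatch");
  if (predicted.length === 0) throw new Error("empty eval set");
  const matches = predicted.filter((label, i) => label === reference[i]).length;
  return matches / predicted.length;
}

// Illustrative inputs only.
const tokPerSec = meanExcludingWarmup([90, 140, 142, 144, 142]);
const agreement = agreementRate(
  ["pos", "neg", "pos", "neg"],
  ["pos", "neg", "neg", "neg"]
);
console.log(tokPerSec, agreement);
```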
Stop defaulting to the biggest model you can fit. For most browser UX tasks, you're burning 2-6x the VRAM and getting single-digit percentage improvements in quality. Match the model to the task.
| If you need... | Use this | Not this |
|---|---|---|
| Sentiment badge | 0.5B (400 MB) | 3B (2.4 GB) |
| TL;DR summary | 1B (950 MB) | 3B (2.4 GB) |
| Inline Q&A | 1.5B (1.2 GB) | 3B (2.4 GB) |
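The 2-6x VRAM figure follows directly from this table. Here is a small sketch that derives it; `RECOMMENDED` and `vramOverhead` are names invented for this example, and the megabyte values are the approximate q4 footprints listed above.

```javascript
// 3B at q4, the "Not this" column.
const BASELINE_MB = 2400;

// The "Use this" column, keyed by the need it serves.
const RECOMMENDED = {
  "Sentiment badge": { model: "0.5B", vramMB: 400 },
  "TL;DR summary": { model: "1B", vramMB: 950 },
  "Inline Q&A": { model: "1.5B", vramMB: 1200 },
};

// How many times more VRAM the 3B model needs than the recommended one.
function vramOverhead(task) {
  const pick = RECOMMENDED[task];
  if (!pick) throw new Error(`unknown task: ${task}`);
  return BASELINE_MB / pick.vramMB;
}

for (const task of Object.keys(RECOMMENDED)) {
  console.log(`${task}: 3B costs ${vramOverhead(task).toFixed(1)}x the VRAM`);
}
```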
The minimum viable model is almost always smaller than you think.
- Transformers.js v4 — ML inference in the browser
- WebGPU — GPU compute in the browser
- Zero build tools, zero frameworks, one HTML file