Run Gemma 3 1B locally in the browser via WebGPU. Q8_0 quantized, streaming generation, multi-turn chat with KV cache reuse. Zero dependencies.
Live Demo · 50KB min · 12KB gzip
Compared against transformers.js (ONNX Runtime WebGPU):
| Model | Engine | Quant | Generation | TTFT |
|---|---|---|---|---|
| Gemma 3 270M | gemma-webgpu | Q8_0 | 136.8 t/s | 0.11s |
| Gemma 3 270M | transformers.js | q4 | 41.7 t/s | 0.51s |
| Gemma 3 1B | gemma-webgpu | Q8_0 | 59.8 t/s | 0.28s |
| Gemma 3 1B | transformers.js | q4 | crashes | — |
3.3x faster on 270M with higher-fidelity Q8_0 quantization. transformers.js can't load the 1B model (ONNX WebGPU abort).
Real-device testing on an iPhone via LambdaTest:
| Model | Generation | TTFT | Total time |
|---|---|---|---|
| Gemma 3 1B | 34.4 t/s | 0.45s | 4.1s |
| Gemma 3 270M | 101.1 t/s | 0.14s | 1.4s |
1B model running at 34 tok/s on a phone — streamed via Range requests, never holding the full 1GB in JS memory. 270M hits 100+ tok/s. WebGPU on iOS 26 Safari.
- Gemma 3 1B and 270M — runs entirely in-browser, no server needed
- Q8_0 quantization — high quality inference at ~1GB model size
- Streaming generation — async iterator API, tokens streamed as generated
- Multi-turn chat — KV cache reuse for fast follow-up messages
- Range request loading — streams weights layer-by-layer, works on iPhone
- 12KB gzipped — zero dependencies, pure WebGPU compute shaders
```sh
npm install gemma-webgpu
```

```js
import { createGemmaEngine } from 'gemma-webgpu'

const engine = await createGemmaEngine({
  model: '1b', // '1b', '270m', or a full URL to a .gguf file
  onProgress: (p) => console.log(p.status),
});

// Multi-turn conversation
engine.addUserMessage('What is the capital of France?');
for await (const token of engine.generate({ temperature: 0.7 })) {
  process.stdout.write(token);
}

// Follow-up reuses KV cache — near-instant prefill
engine.addUserMessage('And what about Germany?');
for await (const token of engine.generate()) {
  process.stdout.write(token);
}

// Reset conversation
engine.resetConversation();

// Cleanup
engine.dispose();
```

`createGemmaEngine(options)` creates and initializes a Gemma engine, downloading and loading the model weights.
| Option | Type | Default | Description |
|---|---|---|---|
| `model` | `string` | `'1b'` | Model to load: `'1b'`, `'270m'`, or a URL to a `.gguf` file |
| `onProgress` | `function` | — | Progress callback: `({ loaded, total, status }) => void` |
| `contextLength` | `number` | `2048` | Maximum context length in tokens |
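For illustration, the `onProgress` callback's `{ loaded, total, status }` payload can drive a simple percent readout. The `formatProgress` helper below is not part of gemma-webgpu, just a sketch of one way to consume it:

```js
// Hypothetical helper (not part of the library): turn the
// { loaded, total, status } progress payload into a one-line readout.
function formatProgress({ loaded, total, status }) {
  if (!total) return status; // total may be unknown early on
  const pct = Math.round((loaded / total) * 100);
  return `${status} ${pct}%`;
}

// Sketch of wiring it in (browser only):
// const engine = await createGemmaEngine({
//   model: '270m',
//   onProgress: (p) => console.log(formatProgress(p)),
// });
```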
`engine.addUserMessage(message)` adds a user message to the conversation history.
`engine.generate(options)` returns an `AsyncGenerator<string>` that yields decoded tokens.
| Option | Type | Default | Description |
|---|---|---|---|
| `temperature` | `number` | `0.7` | Sampling temperature. `0` = greedy |
| `topP` | `number` | `0.9` | Top-P nucleus sampling threshold |
| `repPenalty` | `number` | `1.2` | Repetition penalty. `1.0` = none |
| `maxTokens` | `number` | `32768` | Maximum tokens to generate |
| `toolsJson` | `string` | `'[]'` | JSON array of tool declarations for function calling |
| `signal` | `AbortSignal` | — | AbortSignal to cancel generation mid-stream |
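The `signal` option stops a stream partway through. Since the engine itself needs a browser with WebGPU, the sketch below uses a stand-in async generator (`fakeTokens` is hypothetical, not part of the library) to show the cancellation pattern:

```js
// Stand-in token stream that checks an AbortSignal between tokens,
// mimicking how a stream driven by generate({ signal }) stops early.
async function* fakeTokens(signal) {
  const tokens = ['Paris', ' is', ' the', ' capital', ' of', ' France', '.'];
  for (const t of tokens) {
    if (signal?.aborted) return; // stop yielding once aborted
    yield t;
  }
}

// Consume only the first n tokens, then cancel mid-stream.
async function firstN(n) {
  const controller = new AbortController();
  const out = [];
  for await (const token of fakeTokens(controller.signal)) {
    out.push(token);
    if (out.length === n) controller.abort();
  }
  return out;
}
```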
`engine.resetConversation()` clears the conversation history and resets the KV cache.
`engine.dispose()` releases all GPU resources.
Read-only model configuration (hidden size, layers, vocab size, etc.).
- A browser with WebGPU support (Chrome 113+, Edge 113+, Safari 18+)
- For the 1B model: ~1GB download + ~1.5GB GPU memory
- For the 270M model: ~300MB download + ~500MB GPU memory
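To fail gracefully when WebGPU is missing, a page can check for `navigator.gpu` before creating an engine. The helper below takes the navigator object as a parameter so the logic is testable outside a browser; it is a sketch, not part of the library:

```js
// Hypothetical feature check: WebGPU is exposed as navigator.gpu,
// and a usable implementation provides requestAdapter().
function hasWebGPU(nav) {
  return typeof nav?.gpu?.requestAdapter === 'function';
}

// In a real page:
// if (!hasWebGPU(navigator)) showFallbackMessage();
```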
- GGUF parsing — reads model metadata and tokenizer vocabulary from the GGUF header
- Range request streaming — fetches weights layer-by-layer via HTTP Range requests (~44MB each), uploads to GPU, frees JS memory. Peak JS memory is ~50MB instead of ~1GB
- WebGPU compute shaders — 18 custom WGSL shaders for embedding lookup, RMS norm, RoPE, attention, FFN, and sampling
- KV cache reuse — follow-up messages only prefill new tokens, making multi-turn conversations fast
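The layer-by-layer loading above boils down to plain byte-range math. The helper below is a simplified sketch: the real loader cuts at layer boundaries (~44MB each), while this version just divides a file of `totalBytes` into consecutive `Range` headers of a fixed `chunkBytes`:

```js
// Sketch: compute the HTTP Range headers needed to stream a file
// in fixed-size chunks (byte ranges are inclusive on both ends).
function rangeHeaders(totalBytes, chunkBytes) {
  const headers = [];
  for (let start = 0; start < totalBytes; start += chunkBytes) {
    const end = Math.min(start + chunkBytes, totalBytes) - 1;
    headers.push(`bytes=${start}-${end}`);
  }
  return headers;
}

// Each chunk would then be fetched, uploaded to the GPU, and freed:
// fetch(url, { headers: { Range: header } })
```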
MIT