Run Gemma 3 1B locally in the browser via WebGPU. Q8_0 quantized, streaming generation, multi-turn chat with KV cache reuse. Zero dependencies.
Live Demo · 50KB min · 12KB gzip
Compared against transformers.js (ONNX Runtime WebGPU):
| Model | Engine | Quant | Generation | TTFT |
|---|---|---|---|---|
| Gemma 3 270M | gemma-webgpu | Q8_0 | 136.8 t/s | 0.11s |
| Gemma 3 270M | transformers.js | q4 | 41.7 t/s | 0.51s |
| Gemma 3 1B | gemma-webgpu | Q8_0 | 59.8 t/s | 0.28s |
| Gemma 3 1B | transformers.js | q4 | crashes | — |
3.3x faster on 270M with higher-fidelity Q8_0 quantization. transformers.js can't load the 1B model (ONNX WebGPU abort).
Real-device testing on an iPhone via LambdaTest:
| Model | Generation | TTFT | Total time |
|---|---|---|---|
| Gemma 3 1B | 34.4 t/s | 0.45s | 4.1s |
| Gemma 3 270M | 101.1 t/s | 0.14s | 1.4s |
1B model running at 34 tok/s on a phone — streamed via Range requests, never holding the full 1GB in JS memory. 270M hits 100+ tok/s. WebGPU on iOS 26 Safari.
- Gemma 3 1B and 270M — runs entirely in-browser, no server needed
- Q8_0 quantization — high quality inference at ~1GB model size
- Streaming generation — async iterator API, tokens streamed as generated
- Multi-turn chat — KV cache reuse for fast follow-up messages
- Range request loading — streams weights layer-by-layer, works on iPhone
- 12KB gzipped — zero dependencies, pure WebGPU compute shaders
```sh
npm install gemma-webgpu
```

```js
import { createGemmaEngine } from 'gemma-webgpu'

const engine = await createGemmaEngine({
  model: '1b', // '1b', '270m', or a full URL to a .gguf file
  onProgress: (p) => console.log(p.status),
});

// Multi-turn conversation
engine.addUserMessage('What is the capital of France?');
for await (const token of engine.generate({ temperature: 0.7 })) {
  process.stdout.write(token);
}

// Follow-up reuses KV cache — near-instant prefill
engine.addUserMessage('And what about Germany?');
for await (const token of engine.generate()) {
  process.stdout.write(token);
}

// Reset conversation
engine.resetConversation();

// Cleanup
engine.dispose();
```

`createGemmaEngine(options)` creates and initializes a Gemma engine, downloading and loading the model weights.
| Option | Type | Default | Description |
|---|---|---|---|
| `model` | `string` | `'1b'` | Model to load: `'1b'`, `'270m'`, or a URL to a `.gguf` file |
| `onProgress` | `function` | — | Progress callback: `({ loaded, total, status }) => void` |
| `contextLength` | `number` | `2048` | Maximum context length in tokens |
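For illustration, the `onProgress` callback's `{ loaded, total, status }` payload can drive a simple percent readout. The `formatProgress` helper below is not part of gemma-webgpu, just a sketch of one way to consume it:

```js
// Hypothetical helper (not part of the library): turn the
// { loaded, total, status } progress payload into a one-line readout.
function formatProgress({ loaded, total, status }) {
  if (!total) return status; // total may be unknown early on
  const pct = Math.round((loaded / total) * 100);
  return `${status} ${pct}%`;
}

// Sketch of wiring it in (browser only):
// const engine = await createGemmaEngine({
//   model: '270m',
//   onProgress: (p) => console.log(formatProgress(p)),
// });
```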
`engine.addUserMessage(message)` adds a user message to the conversation history.
`engine.generate(options)` returns an `AsyncGenerator<string>` that yields decoded tokens.
| Option | Type | Default | Description |
|---|---|---|---|
| `temperature` | `number` | `0.7` | Sampling temperature. `0` = greedy |
| `topP` | `number` | `0.9` | Top-P nucleus sampling threshold |
| `repPenalty` | `number` | `1.2` | Repetition penalty. `1.0` = none |
| `maxTokens` | `number` | `32768` | Maximum tokens to generate |
| `toolsJson` | `string` | `'[]'` | JSON array of tool declarations for function calling |
| `signal` | `AbortSignal` | — | AbortSignal to cancel generation mid-stream |
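The `signal` option stops a stream partway through. Since the engine itself needs a browser with WebGPU, the sketch below uses a stand-in async generator (`fakeTokens` is hypothetical, not part of the library) to show the cancellation pattern:

```js
// Stand-in token stream that checks an AbortSignal between tokens,
// mimicking how a stream driven by generate({ signal }) stops early.
async function* fakeTokens(signal) {
  const tokens = ['Paris', ' is', ' the', ' capital', ' of', ' France', '.'];
  for (const t of tokens) {
    if (signal?.aborted) return; // stop yielding once aborted
    yield t;
  }
}

// Consume only the first n tokens, then cancel mid-stream.
async function firstN(n) {
  const controller = new AbortController();
  const out = [];
  for await (const token of fakeTokens(controller.signal)) {
    out.push(token);
    if (out.length === n) controller.abort();
  }
  return out;
}
```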
`engine.resetConversation()` clears the conversation history and resets the KV cache.
`engine.dispose()` releases all GPU resources.
Read-only model configuration (hidden size, layers, vocab size, etc.).
- A browser with WebGPU support (Chrome 113+, Edge 113+, Safari 18+)
- For the 1B model: ~1GB download + ~1.5GB GPU memory
- For the 270M model: ~300MB download + ~500MB GPU memory
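To fail gracefully when WebGPU is missing, a page can check for `navigator.gpu` before creating an engine. The helper below takes the navigator object as a parameter so the logic is testable outside a browser; it is a sketch, not part of the library:

```js
// Hypothetical feature check: WebGPU is exposed as navigator.gpu,
// and a usable implementation provides requestAdapter().
function hasWebGPU(nav) {
  return typeof nav?.gpu?.requestAdapter === 'function';
}

// In a real page:
// if (!hasWebGPU(navigator)) showFallbackMessage();
```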
- GGUF parsing — reads model metadata and tokenizer vocabulary from the GGUF header
- Range request streaming — fetches weights layer-by-layer via HTTP Range requests (~44MB each), uploads to GPU, frees JS memory. Peak JS memory is ~50MB instead of ~1GB
- WebGPU compute shaders — 18 custom WGSL shaders for embedding lookup, RMS norm, RoPE, attention, FFN, and sampling
- KV cache reuse — follow-up messages only prefill new tokens, making multi-turn conversations fast
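The layer-by-layer loading above boils down to plain byte-range math. The helper below is a simplified sketch: the real loader cuts at layer boundaries (~44MB each), while this version just divides a file of `totalBytes` into consecutive `Range` headers of a fixed `chunkBytes`:

```js
// Sketch: compute the HTTP Range headers needed to stream a file
// in fixed-size chunks (byte ranges are inclusive on both ends).
function rangeHeaders(totalBytes, chunkBytes) {
  const headers = [];
  for (let start = 0; start < totalBytes; start += chunkBytes) {
    const end = Math.min(start + chunkBytes, totalBytes) - 1;
    headers.push(`bytes=${start}-${end}`);
  }
  return headers;
}

// Each chunk would then be fetched, uploaded to the GPU, and freed:
// fetch(url, { headers: { Range: header } })
```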
MIT