The local UI-grounding specialist for hybrid AI agents.
Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
Today, most AI agents route every screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) — just to figure out where to click. That's a $0.01–0.05 multimodal call adding 800ms–2s of round-trip latency, repeated 20–50 times per agent run. The bill compounds. The latency compounds. And screenshots full of private UI leave your machine.
A general-purpose 200B-parameter LLM is overkill for the question "where is the Submit button?", which is a narrow vision task. The right architecture is hybrid: cheap, fast local specialist models for the narrow tasks they handle better, and the cloud LLM only for the planning and reasoning it is uniquely good at.
That's exactly what browserground is — the click-grounding specialist.
| | Pure-cloud (status quo) | Hybrid (with browserground) |
|---|---|---|
| Per-screenshot cost | $0.01–0.05 | $0 |
| Latency | 800ms–2s round-trip | ~1.8s local, no network |
| Tokens billed by cloud | 1,500+ multimodal | ~40 text tokens |
| Screenshots leave machine | yes | no |
| Rate limits | yes | no |
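
To make the compounding concrete, the per-run totals can be worked out from the per-call figures quoted above. This is a back-of-the-envelope sketch: the function name `per_run_range` is illustrative, and it assumes every screenshot in a run costs and takes the same.

```python
def per_run_range(per_call_low, per_call_high, calls_low, calls_high):
    """Return (best case, worst case) totals for one agent run."""
    return per_call_low * calls_low, per_call_high * calls_high

# Figures quoted above: $0.01-0.05 and 0.8-2 s per call, 20-50 calls per run.
cost_low, cost_high = per_run_range(0.01, 0.05, 20, 50)
latency_low, latency_high = per_run_range(0.8, 2.0, 20, 50)

print(f"cloud cost per run:    ${cost_low:.2f} to ${cost_high:.2f}")
print(f"cloud latency per run: {latency_low:.0f} s to {latency_high:.0f} s")
```

Under these assumptions a pure-cloud run spends roughly $0.20 to $2.50 and 16 to 100 seconds on grounding alone.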
ScreenSpot-v2 point-grounding accuracy (300 items, 100/split):
| Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
|---|---|---|---|---|---|---|
| GPT-4o (cloud) | — | 18.3% | — | — | — | — |
| browserground v0.1 | 2 B | 45.3% | 64.0% | 28.0% | 44.0% | 100% |
| SeeClick | 9.6 B | 55.1% | — | — | — | — |
| ShowUI-2B | 2 B | 75.5% | — | — | — | — |
| UI-TARS-2B-SFT | 2 B | 89.5% | — | — | — | — |
| OS-Atlas-Base-7B | 7 B | ~91% | — | — | — | — |
| zero-shot Qwen3-VL-2B | 2 B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
- Beats GPT-4o by 2.5× and zero-shot Qwen3-VL by 7× on the same benchmark
- 100% strict-JSON format compliance — no fences, no commentary
- v0.2 (target ≥ 60%) on the roadmap
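The two headline columns, accuracy and Format-OK, can be scored along the following lines. This is a minimal sketch, not the project's evaluation script. It assumes the usual point-grounding rule that a prediction is a hit when the center of the predicted box falls inside the ground-truth box, and it assumes "strict" means the raw output parses as a JSON object whose only key is `bbox_2d`, holding four numbers. The helper names `parse_strict`, `is_hit`, and `score` are illustrative.

```python
import json

def parse_strict(raw):
    """Return [x1, y1, x2, y2] if raw is exactly the expected JSON, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # code fences, commentary, or truncated output land here
    if not isinstance(obj, dict) or set(obj) != {"bbox_2d"}:
        return None
    box = obj["bbox_2d"]
    if not isinstance(box, list) or len(box) != 4:
        return None
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box):
        return None
    return box

def is_hit(pred_box, gt_box):
    """True if the center of pred_box lies inside gt_box (assumed scoring rule)."""
    cx = (pred_box[0] + pred_box[2]) / 2
    cy = (pred_box[1] + pred_box[3]) / 2
    return gt_box[0] <= cx <= gt_box[2] and gt_box[1] <= cy <= gt_box[3]

def score(raw_outputs, gt_boxes):
    """Return (accuracy, format_ok_rate); unparseable outputs count as misses."""
    parsed = [parse_strict(raw) for raw in raw_outputs]
    format_ok = sum(box is not None for box in parsed)
    hits = sum(box is not None and is_hit(box, gt) for box, gt in zip(parsed, gt_boxes))
    n = len(raw_outputs)
    return hits / n, format_ok / n
```

Keeping the format check separate from the hit test makes it possible to tell a model that points at the wrong place from one that simply fails to emit parseable output.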
```bash
npm install -g browserground
browserground parse screenshot.png --target "Submit button"
# {"bbox_2d": [344, 612, 478, 658]}
```

Daemon mode for fast subsequent calls:
```bash
browserground serve &
browserground parse a.png --target "Chrome icon"
browserground parse b.png --target "the back arrow"
browserground stop
```

Install as a Claude Code skill:

```bash
mkdir -p .claude/skills/browserground
curl -sL https://raw.githubusercontent.com/renezander030/browserground/main/plugins/claude-code/SKILL.md \
  > .claude/skills/browserground/SKILL.md
```

Claude routes screen-grounding prompts to the CLI. Spec at `plugins/claude-code/SKILL.md`.
```yaml
# Add to ~/.codex/AGENTS.md
tools:
  - name: browserground
    command: browserground parse "$IMAGE_PATH" --target "$TARGET"
    description: Locate a UI element on a screenshot. Returns {"bbox_2d":[x1,y1,x2,y2]}.
```

From Python:

```python
import json
import subprocess

def ground(screenshot_path, target):
    """Return the [x1, y1, x2, y2] box that browserground reports for target."""
    out = subprocess.check_output(["browserground", "parse", screenshot_path, "--target", target])
    return json.loads(out)["bbox_2d"]
```

Model details:

- Base: `Qwen/Qwen3-VL-2B-Instruct`
- Method: LoRA rank 16 (17.4 M trainable params, 0.81% of base) on all linear modules of the LM
- Training mix (12k records): 4k OS-Atlas macOS desktop + 4k Android + 4k UIBert mobile
- Output: strict JSON `{"bbox_2d": [x1, y1, x2, y2]}`; the system prompt plus LoRA produce 100% parseable output
Training scripts and eval JSONs: renezander030/imgparse-tier1 (private — request access).
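The 17.4 M trainable-parameter figure in the model details can be sanity-checked with a little arithmetic. A rank-r LoRA adapter on a linear layer mapping d_in to d_out adds r · (d_in + d_out) weights. The sketch below assumes the language model's dimensions (hidden size 2048, 28 decoder layers, 1024-wide key/value projections, MLP width 6144) and its set of linear modules. None of these numbers are stated in this README, so check them against the base model's config before relying on the result.

```python
RANK = 16
NUM_LAYERS = 28   # assumed, check the base model's config
HIDDEN = 2048     # assumed
KV_DIM = 1024     # assumed: 8 key/value heads x 128 head dim
MLP_DIM = 6144    # assumed

# (d_in, d_out) for every linear module in one decoder layer (assumed layout).
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_params(rank, shapes, num_layers):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted linear layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

total = lora_params(RANK, LINEAR_SHAPES, NUM_LAYERS)
print(f"{total:,} trainable parameters ({total / 1e6:.1f} M)")
```

Under these assumed dimensions the count comes to 17,432,576, which agrees with the 17.4 M quoted above.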
- v0.2 — Tier 2 LoRA: 26k mixed incl. web, rank 32, 2 epochs, target ScreenSpot-v2 ≥ 60%
- MLX-native build — ~1-2s on Apple Silicon (currently ~14s via MPS+transformers)
- GGUF build — for llama.cpp / Ollama
- Batch mode — many targets per screenshot in one call
More in v0.2.
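Until batch mode lands, several targets on one screenshot can be grounded by looping over the existing `parse` command, ideally with the daemon already running. The wrapper below is a convenience sketch, not part of browserground: `ground_many` and its `run` parameter are hypothetical names, and `run` exists only so the CLI call can be swapped out in tests.

```python
import json
import subprocess

def ground_many(screenshot_path, targets, run=subprocess.check_output):
    """Map each target description to its [x1, y1, x2, y2] box, one CLI call each."""
    boxes = {}
    for target in targets:
        out = run(["browserground", "parse", screenshot_path, "--target", target])
        boxes[target] = json.loads(out)["bbox_2d"]
    return boxes
```

Calling `ground_many("screenshot.png", ["Submit button", "the back arrow"])` would return a dictionary keyed by target description.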
Pure-cloud AI agents are bottlenecked on vision-LLM cost and latency. Open-source 2B–7B specialist models can match cloud LLMs on narrow tasks (UI-TARS-2B hits 89.5% on ScreenSpot-v2 vs GPT-4o's 18.3%). The composition pattern — specialist local models for narrow tasks + cloud LLMs for general reasoning — is the cost-effective architecture for 2026 AI agents. browserground is one specialist piece. Bring your own orchestrator.
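One way to picture the composition pattern: the cloud model only reads and emits text, and the local specialist turns each target description into coordinates. The following is a deliberately small sketch under stated assumptions. `plan_next_action`, `take_screenshot`, and `click` are hypothetical callables supplied by your own orchestrator, and the action dictionary format is invented for illustration. Only the `browserground parse` invocation comes from this README.

```python
import json
import subprocess

def ground(screenshot_path, target):
    """Ask the local browserground CLI for the target's bounding box."""
    out = subprocess.check_output(
        ["browserground", "parse", screenshot_path, "--target", target]
    )
    return json.loads(out)["bbox_2d"]

def run_agent(goal, plan_next_action, take_screenshot, click, ground=ground, max_steps=50):
    """Loop: the cloud planner names a target in text, the local model finds it."""
    history = []
    for _ in range(max_steps):
        # Hypothetical planner contract:
        # {"type": "click", "target": "..."} or {"type": "done"}.
        action = plan_next_action(goal, history)
        if action["type"] == "done":
            break
        x1, y1, x2, y2 = ground(take_screenshot(), action["target"])
        click((x1 + x2) // 2, (y1 + y2) // 2)  # click the center of the box
        history.append(action)
    return history
```

In this arrangement the planner call carries only text, so the screenshot itself never has to be sent to the cloud.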
Apache 2.0.
```bibtex
@misc{browserground-2026,
  title  = {browserground: Qwen3-VL-2B LoRA for hybrid AI agent UI grounding},
  author = {Zander, René},
  year   = {2026},
  url    = {https://huggingface.co/renezander030/browserground}
}
```