renezander030/browserground

browserground

The local UI-grounding specialist for hybrid AI agents.
Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.


The hybrid AI argument

Today, most AI agents route every screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) — just to figure out where to click. That's a $0.01–0.05 multimodal call adding 800ms–2s of round-trip latency, repeated 20–50 times per agent run. The bill compounds. The latency compounds. And screenshots full of private UI leave your machine.

A general-purpose 200B-parameter LLM is overkill for the question "where is the Submit button?" — that's a narrow vision task. The right architecture is a hybrid one: cheap fast specialist local models for the dedicated tasks they handle better, and the cloud LLM only for the planning and reasoning it's actually uniquely good at.

That's exactly what browserground is — the click-grounding specialist.

Hybrid AI agent architecture diagram

| | Pure-cloud (status quo) | Hybrid (with browserground) |
|---|---|---|
| Per-screenshot cost | $0.01–0.05 | $0 |
| Latency | 800ms–2s round-trip | ~1.8s local, no network |
| Tokens billed by cloud | 1,500+ multimodal | ~40 text tokens |
| Screenshots leave machine | yes | no |
| Rate limits | yes | no |
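
To make the compounding concrete, here is a back-of-the-envelope sketch that multiplies the per-call cloud figures quoted above by the 20 to 50 grounding calls of a single agent run. The inputs are the numbers from this README; the helper name `per_run_range` is purely illustrative.

```python
def per_run_range(calls, per_call):
    """Multiply a (low, high) call count by a (low, high) per-call figure."""
    low = calls[0] * per_call[0]
    high = calls[1] * per_call[1]
    return round(low, 2), round(high, 2)


CALLS_PER_RUN = (20, 50)        # screenshots grounded per agent run
CLOUD_COST_USD = (0.01, 0.05)   # cost of one multimodal call
CLOUD_LATENCY_S = (0.8, 2.0)    # round-trip time of one call, in seconds

cost_low, cost_high = per_run_range(CALLS_PER_RUN, CLOUD_COST_USD)
wait_low, wait_high = per_run_range(CALLS_PER_RUN, CLOUD_LATENCY_S)

print(f"cloud grounding cost per run: ${cost_low:.2f} to ${cost_high:.2f}")
print(f"cloud grounding wait per run: {wait_low:.0f}s to {wait_high:.0f}s")
```

Under these figures a single run spends $0.20 to $2.50 and 16 to 100 seconds on grounding calls alone.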

Status: v0.1 (Tier 1.5 LoRA)

ScreenSpot-v2 point-grounding accuracy (300 items, 100/split):

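| Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
|---|---|---|---|---|---|---|
| GPT-4o (cloud) | - | 18.3% | - | - | - | - |
| browserground v0.1 | 2 B | 45.3% | 64.0% | 28.0% | 44.0% | 100% |
| SeeClick | 9.6 B | 55.1% | - | - | - | - |
| ShowUI-2B | 2 B | 75.5% | - | - | - | - |
| UI-TARS-2B-SFT | 2 B | 89.5% | - | - | - | - |
| OS-Atlas-Base-7B | 7 B | ~91% | - | - | - | - |
| zero-shot Qwen3-VL-2B | 2 B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
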
  • Beats GPT-4o by about 2.5× (45.3% vs 18.3%) and zero-shot Qwen3-VL-2B by about 7× (45.3% vs 6.3%) on the same benchmark
  • 100% strict-JSON format compliance — no fences, no commentary
  • v0.2 (target ≥ 60%) on the roadmap
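
For readers new to point-grounding benchmarks, the sketch below shows one common way such accuracy is scored: a prediction counts as a hit when the center of the predicted box falls inside the ground-truth box. This scoring rule is an assumption about the evaluation, not the project's actual eval script, and the helper names are hypothetical. The last lines reproduce the overall figure as the mean of three equally sized splits.

```python
def center(bbox):
    """Return the (x, y) center of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return (x1 + x2) / 2, (y1 + y2) / 2


def is_hit(pred_bbox, gt_bbox):
    """True when the predicted box's center lies inside the ground-truth box."""
    cx, cy = center(pred_bbox)
    gx1, gy1, gx2, gy2 = gt_bbox
    return gx1 <= cx <= gx2 and gy1 <= cy <= gy2


def accuracy(pairs):
    """Fraction of (prediction, ground truth) pairs that are hits."""
    return sum(is_hit(pred, gt) for pred, gt in pairs) / len(pairs)


# Toy data, made up for illustration only.
pairs = [
    ([344, 612, 478, 658], [340, 600, 480, 660]),  # center inside: hit
    ([10, 10, 50, 50], [100, 100, 200, 200]),      # center outside: miss
]
toy_accuracy = accuracy(pairs)

# With 100 items per split, the overall score is the plain mean of the splits.
splits = {"mobile": 64.0, "desktop": 28.0, "web": 44.0}
overall = sum(splits.values()) / len(splits)
print(f"toy accuracy: {toy_accuracy:.2f}, overall: {overall:.1f}%")
```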

Quick start

npm install -g browserground
browserground parse screenshot.png --target "Submit button"
# {"bbox_2d": [344, 612, 478, 658]}
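
An agent usually needs a click point rather than a box. Here is a minimal sketch, assuming the `bbox_2d` values are pixel coordinates in the screenshot's own resolution. This README does not state the coordinate space, so if the model emits normalized coordinates instead, pass the normalization scale as `norm` together with the image size. `click_point` is a hypothetical helper, not part of the CLI.

```python
def click_point(bbox, image_size=None, norm=None):
    """Return the integer (x, y) center of an [x1, y1, x2, y2] box.

    Assumption: coordinates are pixels unless `norm` is given, in which
    case they are rescaled from a 0..norm range to `image_size` pixels.
    """
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    if norm is not None:
        if image_size is None:
            raise ValueError("image_size is required when norm is set")
        width, height = image_size
        cx, cy = cx * width / norm, cy * height / norm
    return round(cx), round(cy)


point = click_point([344, 612, 478, 658])
print(point)  # (411, 635)
```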

Daemon mode for fast subsequent calls:

browserground serve &
browserground parse a.png --target "Chrome icon"
browserground parse b.png --target "the back arrow"
browserground stop

Hook into your agent stack

Claude Code

mkdir -p .claude/skills/browserground
curl -sL https://raw.githubusercontent.com/renezander030/browserground/main/plugins/claude-code/SKILL.md \
  > .claude/skills/browserground/SKILL.md

Claude routes screen-grounding prompts to the CLI. Spec at plugins/claude-code/SKILL.md.

Codex CLI

# Add to ~/.codex/AGENTS.md
tools:
  - name: browserground
    command: browserground parse "$IMAGE_PATH" --target "$TARGET"
    description: Locate a UI element on a screenshot. Returns {"bbox_2d":[x1,y1,x2,y2]}.

browser-use / Skyvern (Python)

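import json
import subprocess

def ground(screenshot_path, target):
    """Return [x1, y1, x2, y2] for `target` in the screenshot."""
    # check_output raises CalledProcessError if the CLI exits non-zero.
    out = subprocess.check_output(["browserground", "parse", screenshot_path, "--target", target])
    return json.loads(out)["bbox_2d"]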
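
Because the agent acts on whatever the CLI prints, it can be worth validating the payload before clicking. The sketch below checks only the shape documented in this README, `{"bbox_2d": [x1, y1, x2, y2]}`. The ordering check (x1 < x2 and y1 < y2) is an assumption about what a well-formed box looks like, and `parse_bbox` is a hypothetical helper.

```python
import json


def parse_bbox(raw):
    """Parse CLI output and return [x1, y1, x2, y2], or raise ValueError."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    bbox = payload.get("bbox_2d") if isinstance(payload, dict) else None
    if not (isinstance(bbox, list) and len(bbox) == 4):
        raise ValueError("expected a 4-element 'bbox_2d' list")

    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox):
        raise ValueError("bbox values must be numbers")

    x1, y1, x2, y2 = bbox
    if x1 >= x2 or y1 >= y2:  # assumed ordering: top-left, then bottom-right
        raise ValueError("degenerate or inverted box")
    return bbox


print(parse_bbox('{"bbox_2d": [344, 612, 478, 658]}'))
```

Swapping `json.loads(out)["bbox_2d"]` for `parse_bbox(out)` in a wrapper like `ground` turns malformed output into a single, catchable error type.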

How it works

  • Base: Qwen/Qwen3-VL-2B-Instruct
  • Method: LoRA rank 16 (17.4 M trainable params, 0.81% of base) on all linear modules of the LM
  • Training mix (12k records): 4k OS-Atlas macOS desktop + 4k Android + 4k UIBert mobile
  • Output: strict JSON {"bbox_2d": [x1, y1, x2, y2]} — system prompt + LoRA produce 100% parseable output
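
The trainable-parameter figure can be sanity-checked by hand. A rank-r LoRA adapter on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights. The sketch below assumes the language model has 28 decoder layers, hidden size 2048, MLP size 6144, and 1024-wide key/value projections. These dimensions are assumptions based on the Qwen3 family's published configs, not something stated here, so verify them against the base model's `config.json`.

```python
RANK = 16

# Assumed language-model dimensions (check against the base model's config).
NUM_LAYERS = 28
HIDDEN = 2048
KV_DIM = 1024
MLP = 6144

# (d_in, d_out) for each linear module in one decoder layer.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in, d_out, rank):
    """Weights added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in LINEAR_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters (~{total / 1e6:.1f} M)")
```

Under these assumed dimensions the total comes to 17,432,576, which lines up with the 17.4 M quoted above.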

Training scripts and eval JSONs: renezander030/imgparse-tier1 (private — request access).

What's planned

  • v0.2 — Tier 2 LoRA: 26k mixed incl. web, rank 32, 2 epochs, target ScreenSpot-v2 ≥ 60%
  • MLX-native build — ~1-2s on Apple Silicon (currently ~14s via MPS+transformers)
  • GGUF build — for llama.cpp / Ollama
  • Batch mode — many targets per screenshot in one call

More in v0.2.

Why this exists

Pure-cloud AI agents are bottlenecked on vision-LLM cost and latency. Open-source 2B–7B specialist models can match or beat cloud LLMs on narrow tasks: UI-TARS-2B hits 89.5% on ScreenSpot-v2 versus GPT-4o's 18.3%. The composition pattern (specialist local models for narrow tasks, cloud LLMs for general reasoning) is the cost-effective architecture for 2026 AI agents. browserground is one specialist piece. Bring your own orchestrator.

License

Apache 2.0.


@misc{browserground-2026,
  title  = {browserground: Qwen3-VL-2B LoRA for hybrid AI agent UI grounding},
  author = {Zander, René},
  year   = {2026},
  url    = {https://huggingface.co/renezander030/browserground}
}
