The design vocabulary for AI surfaces that are honest about uncertainty.
This repository is the design-systems layer behind a connected trilogy of agentic-UX code prototypes — Sentinel, Recourse, and Helm — and an evolving set of notes on the patterns that should be re-used across any product surface where a probabilistic system meets a human decision.
It is not a component library you can install yet. It is the explicit vocabulary the three demos already share, written down so it can be re-used.
Probabilistic systems — LLMs, classifiers, search ranking, retrievers, agents — produce outputs that are correct with some probability and wrong with the rest. Most products today render that probability badly: a raw percentage, a "low confidence" highlight, or worse, nothing at all. The user is then asked to trust the output, override it, or escalate it without the information that decision actually requires.
Probabilistic-UI is the attempt to fix that at the primitive layer. The thesis is short:
AI claims become trustworthy only when their uncertainty is legible and their basis is checkable.
The same handful of primitives — applied with discipline — can carry that across radiology workstations, contract review, insurance appeals, agentic developer tools, and every other surface where a confident-looking output might be wrong.
These are the load-bearing pieces that already recur across all three trilogy demos. Naming and visual treatment are deliberately shared so a reviewer who has seen one project recognizes the same primitive in the next.
A small badge in three or four bands — High / Likely / Unsure / Low — with the exact percentage one hover away. The bands match how clinicians, lawyers, and analysts already speak; the percentage is for the moment they need it.
Why not raw percentages by default: numbers create false precision and reviewer fatigue. 73% and 74% look different and are not, behaviorally, different. Bands compress that without losing the signal.
Audience-shifted variant: for non-experts (Recourse), the bands flip to action verbs — Settled / You verify / Ask a lawyer. Same scale, different vocabulary. Confidence is for the reader, not the model.
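The band-plus-hover idea can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not code from the trilogy repos: the numeric cut-offs, the function names `toBand` and `confidenceBadge`, and the way four expert bands collapse into three action verbs are all hypothetical.

```typescript
// Band names follow the README; everything else here is an illustrative assumption.
type Band = "High" | "Likely" | "Unsure" | "Low";
type Audience = "expert" | "layperson";

// Hypothetical cut-offs: a real product would tune these against calibration data.
function toBand(probability: number): Band {
  if (!(probability >= 0 && probability <= 1)) {
    throw new RangeError("probability must be between 0 and 1");
  }
  if (probability >= 0.9) return "High";
  if (probability >= 0.7) return "Likely";
  if (probability >= 0.4) return "Unsure";
  return "Low";
}

// Assumed mapping from the four expert bands to the three action verbs.
const LAYPERSON_LABEL: Record<Band, string> = {
  High: "Settled",
  Likely: "You verify",
  Unsure: "You verify",
  Low: "Ask a lawyer",
};

interface ConfidenceBadge {
  label: string;   // rendered inline
  tooltip: string; // exact percentage, revealed on hover
}

function confidenceBadge(probability: number, audience: Audience): ConfidenceBadge {
  const band = toBand(probability);
  return {
    label: audience === "expert" ? band : LAYPERSON_LABEL[band],
    tooltip: `${Math.round(probability * 100)}%`,
  };
}
```

With these assumed thresholds, 0.73 and 0.74 both render as "Likely" for an expert audience, and the exact figure only appears in the tooltip.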
A diagonal cross-hatch background — visually distinct from any solid color in the system — used to mark a category of failure that demands a different response, not just a degree of worry.
The same primitive is used in three contexts across the trilogy:
| Project | Cross-hatch means |
|---|---|
| Sentinel | This claim is ungrounded / hallucinated — not just "low confidence." Don't treat it like the others. |
| Recourse | This citation is fabricated or unverified. Don't put it in the appeal letter. |
| Helm | This tool call is irreversible — pushes, deletes, network. Can never be auto-allowed. |
The point is that the cross-hatch is learned once and recognized everywhere. A designer adding a new probabilistic surface to a product doesn't get to pick their own way of flagging the dangerous category: the cross-hatch already means "different kind, respond differently."
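One way to keep the cross-hatch identical across surfaces is to generate it from a single helper. The sketch below is an assumption about how that could look, not the trilogy's actual styling: `crossHatchStyle`, its parameters, and the default spacing are hypothetical names and values. It layers two CSS `repeating-linear-gradient` stripes at opposite diagonals.

```typescript
interface HatchStyle {
  backgroundImage: string;
}

// Hypothetical helper: builds a diagonal cross-hatch from two striped gradients.
function crossHatchStyle(lineColor: string, spacingPx: number = 6): HatchStyle {
  // One set of 1px lines at the given angle, repeating every `spacingPx` pixels.
  const stripe = (angleDeg: number): string =>
    `repeating-linear-gradient(${angleDeg}deg, ${lineColor} 0 1px, transparent 1px ${spacingPx}px)`;
  // Two stripe layers at opposite diagonals produce the cross-hatch.
  return { backgroundImage: `${stripe(45)}, ${stripe(-45)}` };
}
```

The returned object has the shape of an inline style, so a chip component could spread it onto its root element. Keeping the pattern in one function is what stops each surface from inventing its own danger marker.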
The antidote to hallucination is provenance. Every probabilistic claim should be paired with a checkable source the reader can verify in the same view — a clinical reference, a statute excerpt, a file path, a search result. If the surface can't produce an anchor, the claim should be flagged as ungrounded (see cross-hatch).
Anchors are bidirectional where possible: hovering the claim lights up the region it came from; clicking the region jumps to the claim.
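The rule "no anchor means ungrounded" is simple enough to encode directly. The following is a minimal sketch with hypothetical names (`EvidenceAnchor`, `Claim`, `groundingOf`); the anchor kinds mirror the examples in the text, and treating a blank locator as "not checkable" is an assumption.

```typescript
// Anchor kinds taken from the examples above; the type itself is hypothetical.
interface EvidenceAnchor {
  kind: "clinical-reference" | "statute-excerpt" | "file-path" | "search-result";
  locator: string; // where the reader can check the claim
}

interface Claim {
  text: string;
  anchors: EvidenceAnchor[];
}

type Grounding = "grounded" | "ungrounded";

// A claim counts as grounded only if at least one anchor has a non-blank locator.
function groundingOf(claim: Claim): Grounding {
  const checkable = claim.anchors.filter((a) => a.locator.trim().length > 0);
  return checkable.length > 0 ? "grounded" : "ungrounded";
}
```

A renderer could then apply the cross-hatch treatment whenever `groundingOf` returns `"ungrounded"`, independently of the claim's confidence band.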
For agent actions specifically, the right question isn't "is this safe?" but "what is the recovery cost?"
| Band | Definition | Treatment |
|---|---|---|
| trivial | read-only, no side effect | Auto-allow always |
| reversible | can be undone with another step | Auto-allow if policy permits |
| danger | affects shared state (push, network, send) | Never auto-allow; explicit human approval |
| destructive | cannot be undone (force-push, delete) | Explicit approval + extra confirm |
This is the spine of Helm's ApprovalGate and translates one-for-one into any agentic surface.
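The reversibility table reads naturally as a small decision function. The sketch below is not Helm's ApprovalGate; `gateDecision`, `GatePolicy`, and the decision labels are hypothetical names chosen to mirror the four bands and their treatments.

```typescript
type Reversibility = "trivial" | "reversible" | "danger" | "destructive";
type GateDecision = "auto-allow" | "ask-human" | "ask-human-and-confirm";

// Hypothetical policy object: the only tunable is whether reversible steps auto-run.
interface GatePolicy {
  autoAllowReversible: boolean;
}

function gateDecision(band: Reversibility, policy: GatePolicy): GateDecision {
  switch (band) {
    case "trivial":
      return "auto-allow"; // read-only, no side effect
    case "reversible":
      return policy.autoAllowReversible ? "auto-allow" : "ask-human";
    case "danger":
      return "ask-human"; // policy cannot override this
    case "destructive":
      return "ask-human-and-confirm"; // approval plus an extra confirmation
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a band is added.
      const unreachable: never = band;
      throw new Error(`unknown band: ${String(unreachable)}`);
    }
  }
}
```

Note that the policy argument is only consulted for the `reversible` band. That keeps "never auto-allow" a property of the code rather than of configuration.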
A single, reusable verdict object for each thing the human decides on:
verdict = Accept | Edit | Reject | Abort
+ rationale (free text, optional but encouraged for edits/rejects)
+ timestamp + reviewer
+ before / after (if edit)
This is the unit of the audit drawer in Sentinel, the decision record in Helm, and the mailing record in Recourse. Three different surfaces, same shape — and the same shape feeds into compliance, reviewer-performance metrics, and continuous-improvement loops for the underlying model.
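As a rough illustration, the verdict shape maps onto a TypeScript discriminated union in which only an `Edit` carries `before` and `after`. The type and function names (`Verdict`, `auditLine`) and the ISO 8601 timestamp format are assumptions, not the canonical types the future package will ship.

```typescript
interface VerdictBase {
  rationale?: string; // free text, optional but encouraged for edits and rejects
  timestamp: string;  // assumed to be an ISO 8601 string
  reviewer: string;
}

interface PlainVerdict extends VerdictBase {
  decision: "Accept" | "Reject" | "Abort";
}

interface EditVerdict extends VerdictBase {
  decision: "Edit";
  before: string;
  after: string;
}

type Verdict = PlainVerdict | EditVerdict;

// Hypothetical helper: flattens a verdict into one line for an audit view.
function auditLine(v: Verdict): string {
  const change = v.decision === "Edit" ? ` (${v.before} -> ${v.after})` : "";
  const why = v.rationale ? `: ${v.rationale}` : "";
  return `${v.timestamp} ${v.reviewer} ${v.decision}${change}${why}`;
}
```

Modeling `Edit` as its own variant means the compiler rejects an edit that forgets its before/after pair, which is the property an audit trail depends on.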
A probabilistic surface lies if it shows everything at once or shows nothing.
The pattern: the minimum that's load-bearing renders inline (confidence band, the claim, a single evidence chip). Hover reveals the next layer (exact percentage, top-k alternatives, statute excerpt). Click opens the full panel (model rationale, full provenance, edit history). The reader pulls more detail only when something asks for it.
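The three layers can be expressed as data, so that each interaction level only adds fields to what is already visible. This is a sketch under assumptions: the field names, `DisclosureLevel`, and `visibleFields` are hypothetical, chosen to match the examples in the paragraph above.

```typescript
type DisclosureLevel = "inline" | "hover" | "click";

// Hypothetical field names mirroring the examples in the text.
type Field =
  | "confidenceBand" | "claim" | "evidenceChip"
  | "exactPercentage" | "topKAlternatives" | "sourceExcerpt"
  | "modelRationale" | "fullProvenance" | "editHistory";

const LEVEL_ORDER: DisclosureLevel[] = ["inline", "hover", "click"];

// What each level adds on top of the levels before it.
const FIELDS_ADDED_AT: Record<DisclosureLevel, Field[]> = {
  inline: ["confidenceBand", "claim", "evidenceChip"],
  hover: ["exactPercentage", "topKAlternatives", "sourceExcerpt"],
  click: ["modelRationale", "fullProvenance", "editHistory"],
};

// Returns every field visible at `level`: its own fields plus all shallower ones.
function visibleFields(level: DisclosureLevel): Field[] {
  const depth = LEVEL_ORDER.indexOf(level);
  let fields: Field[] = [];
  for (const current of LEVEL_ORDER.slice(0, depth + 1)) {
    fields = fields.concat(FIELDS_ADDED_AT[current]);
  }
  return fields;
}
```

Because deeper levels are strict supersets of shallower ones, nothing shown inline ever disappears when the reader asks for more detail.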
| Primitive | Sentinel | Recourse | Helm |
|---|---|---|---|
| Calibrated confidence | Likely / Unsure / Low (expert language) | Settled / You verify / Ask a lawyer (action verbs) | ConfidenceTag on agent intent |
| Cross-hatch | Hallucination chip | Fabricated-statute chip | Irreversibility chip on danger/destructive |
| Evidence anchors | EvidenceLink to clinical / legal source | Statute chip with operative excerpt + verified-on date | Diff or command surfaced inside the ApprovalGate |
| Verdict shape | Accept / Edit / Reject per claim | Verify / Send per claim | Allow / Edit / Reject / Abort per tool call |
| Audit object | AuditEntry per decision | MailingRecord + deadline ledger | DecisionRecord per step |
| Progressive disclosure | Inline → hover top-k → drawer rationale | Plain-language gloss → excerpt → full statute | Tool kind → diff → cost notice |
A reviewer who learns the vocabulary on one project can read the others at a glance. That's the whole point.
This is the vocabulary, not a published component library. The shipped components live in their respective trilogy repos for now:
- React primitives → sentinel-react (in Human-in-the-Loop)
- ApprovalGate / ReversibilityChip → Helm
- Statute / cadence primitives → Recourse
A future iteration of this repo will extract the shared bits into a single installable package with the canonical types and CSS variables. For now, the README is the contract.
The same vocabulary applies cleanly to:
- Search & retrieval surfaces — "this result is plausible / this result is hallucinated" is the Sentinel pattern applied to RAG output.
- Agentic workflows in any domain (code, ops, support, design tooling) — Helm's reversibility-as-axis carries over without modification.
- Personalized / recommendation surfaces — calibrated bands beat raw scores for the same reasons.
- Compliance & audit-heavy enterprise SaaS — the verdict shape is already the audit primitive.
If you're building any of those, the patterns here are designed to be picked up and re-used. Open an issue or message me — happy to talk specifics.