sinhaankur/Probabilistic-UI

Probabilistic-UI

The design vocabulary for AI surfaces that are honest about uncertainty.

This repository is the design-systems layer behind a connected trilogy of agentic-UX code prototypes — Sentinel, Recourse, and Helm — and an evolving set of notes on the patterns that should be re-used across any product surface where a probabilistic system meets a human decision.

It is not a component library you can install yet. It is the explicit vocabulary the three demos already share, written down so it can be re-used.


The premise

Probabilistic systems — LLMs, classifiers, search ranking, retrievers, agents — produce outputs that are correct with some probability and wrong with the rest. Most products today render that probability badly: a raw percentage, a "low confidence" highlight, or worse, nothing at all. The user is then asked to trust the output, override it, or escalate it without the information that decision actually requires.

Probabilistic-UI is the attempt to fix that at the primitive layer. The thesis is short:

AI claims become trustworthy only when their uncertainty is legible and their basis is checkable.

The same handful of primitives — applied with discipline — can carry that across radiology workstations, contract review, insurance appeals, agentic developer tools, and every other surface where a confident-looking output might be wrong.


The primitives, so far

These are the load-bearing pieces that already recur across all three trilogy demos. Naming and visual treatment are deliberately shared so a reviewer who has seen one project recognizes the same primitive in the next.

1. Calibrated confidence (language over numbers)

A small badge in three or four bands — High / Likely / Unsure / Low — with the exact percentage one hover away. The bands match how clinicians, lawyers, and analysts already speak; the percentage is for the moment they need it.

Why not raw percentages by default: numbers create false precision and reviewer fatigue. 73% and 74% look different and are not, behaviorally, different. Bands compress that without losing the signal.

Audience-shifted variant: for non-experts (Recourse), the bands flip to action verbs: Settled / You verify / Ask a lawyer. Same scale, different vocabulary. Confidence is for the reader, not the model.
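
As a minimal sketch of the banding idea, the snippet below maps one probability onto either vocabulary. The cut-off values, the `Band` and `Badge` types, and the `badgeFor` name are all assumptions made for illustration; this README names the bands but does not specify thresholds or an API.

```typescript
// Hypothetical thresholds: the README names the bands but not the cut-offs.
interface Band {
  min: number;   // inclusive lower bound on the probability
  label: string; // what the reader sees on the badge
}

// Expert vocabulary, highest band first.
const EXPERT_SCALE: Band[] = [
  { min: 0.9, label: "High" },
  { min: 0.7, label: "Likely" },
  { min: 0.4, label: "Unsure" },
  { min: 0.0, label: "Low" },
];

// Action-verb vocabulary for non-experts, highest band first.
const PLAIN_SCALE: Band[] = [
  { min: 0.9, label: "Settled" },
  { min: 0.5, label: "You verify" },
  { min: 0.0, label: "Ask a lawyer" },
];

interface Badge {
  label: string;  // rendered inline
  detail: string; // exact percentage, revealed on hover
}

function badgeFor(probability: number, scale: Band[]): Badge {
  // The negated form also rejects NaN.
  if (!(probability >= 0 && probability <= 1)) {
    throw new RangeError(`probability out of range: ${probability}`);
  }
  // Scales are ordered highest-first, so the first match is the right band.
  for (const band of scale) {
    if (probability >= band.min) {
      return { label: band.label, detail: `${Math.round(probability * 100)}%` };
    }
  }
  throw new Error("scale must end with a band whose min is 0");
}
```

With these assumed thresholds, `badgeFor(0.73, EXPERT_SCALE)` and `badgeFor(0.74, EXPERT_SCALE)` both render as "Likely", which is the compression the section argues for; the exact figure stays available in `detail`.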

2. The cross-hatch primitive

A diagonal cross-hatch background — visually distinct from any solid color in the system — used to mark a category of failure that demands a different response, not just a degree of worry.

The same primitive is used in three contexts across the trilogy:

| Project | Cross-hatch means |
| --- | --- |
| Sentinel | This claim is ungrounded / hallucinated, not just "low confidence." Don't treat it like the others. |
| Recourse | This citation is fabricated or unverified. Don't put it in the appeal letter. |
| Helm | This tool call is irreversible: pushes, deletes, network. Can never be auto-allowed. |

The point is the cross-hatch is learned once and recognized everywhere. A designer adding a new probabilistic surface to a product doesn't get to pick their own way of flagging the dangerous category — the cross-hatch already means "different kind, respond differently."
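
One way a diagonal cross-hatch could be produced is by layering two CSS `repeating-linear-gradient` backgrounds at opposite angles. The helper below is only a sketch: the `crossHatch` name, its parameters, and the default spacing are assumptions, since this repo does not yet publish canonical CSS variables.

```typescript
// Builds a CSS `background-image` value: two sets of thin diagonal lines,
// one per direction, layered into a cross-hatch.
// Name, parameters, and defaults are hypothetical, not taken from the trilogy code.
function crossHatch(lineColor: string, spacingPx = 6, lineWidthPx = 1): string {
  if (lineWidthPx <= 0 || spacingPx <= lineWidthPx) {
    throw new RangeError("spacingPx must be larger than a positive lineWidthPx");
  }
  const stripe = (angle: string): string =>
    `repeating-linear-gradient(${angle}, ` +
    `${lineColor} 0 ${lineWidthPx}px, ` +
    `transparent ${lineWidthPx}px ${spacingPx}px)`;
  // Two layers at opposite angles cross each other.
  return `${stripe("45deg")}, ${stripe("-45deg")}`;
}

// Example value, suitable for assigning to an element's `background-image`.
const dangerHatch: string = crossHatch("#b00020");
```

Keeping the pattern behind one function (or one shared CSS variable) is what lets every surface render the same hatch instead of inventing its own.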

3. Evidence anchors

The antidote to hallucination is provenance. Every probabilistic claim should be paired with a checkable source the reader can verify in the same view — a clinical reference, a statute excerpt, a file path, a search result. If the surface can't produce an anchor, the claim should be flagged as ungrounded (see cross-hatch).

Anchors are bidirectional where possible: hovering the claim lights up the region it came from; clicking the region jumps to the claim.
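
The rule above (no anchor means ungrounded) and the two-way lookup can be sketched with a small data model. Everything here is illustrative: `EvidenceAnchor`, `Claim`, `treatmentFor`, and `indexByLocator` are hypothetical names, not the types used in the trilogy repos.

```typescript
// Hypothetical shapes for a claim and the sources that back it.
interface EvidenceAnchor {
  kind: "clinical-reference" | "statute-excerpt" | "file-path" | "search-result";
  locator: string; // identifies the region the reader can check in the same view
}

interface Claim {
  id: string;
  text: string;
  anchors: EvidenceAnchor[];
}

type ClaimTreatment = "grounded" | "ungrounded";

// A claim with no checkable source gets the ungrounded (cross-hatch) treatment.
function treatmentFor(claim: Claim): ClaimTreatment {
  return claim.anchors.length > 0 ? "grounded" : "ungrounded";
}

// Reverse index for the region-to-claim direction:
// given a source region, which claims cite it?
function indexByLocator(claims: Claim[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const claim of claims) {
    for (const anchor of claim.anchors) {
      const ids = index.get(anchor.locator) ?? [];
      ids.push(claim.id);
      index.set(anchor.locator, ids);
    }
  }
  return index;
}
```

The claim-to-region direction is just `claim.anchors`; the reverse index covers clicking a region to jump back to the claims that rely on it.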

4. Reversibility as the policy axis

For agent actions specifically, the right question isn't "is this safe?" but "what is the recovery cost?"

| Band | Definition | Treatment |
| --- | --- | --- |
| trivial | read-only, no side effect | Auto-allow always |
| reversible | can be undone with another step | Auto-allow if policy permits |
| danger | affects shared state (push, network, send) | Never auto-allow; explicit human approval |
| destructive | cannot be undone (force-push, delete) | Explicit approval + extra confirm |

This is the spine of Helm's ApprovalGate and translates one-for-one into any agentic surface.
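
The table reads naturally as a small decision function. The sketch below is not Helm's actual ApprovalGate; the `gate` function, the `GateDecision` values, and the `policyAllowsAutoReversible` flag are assumed names that only mirror the four rows above.

```typescript
type ReversibilityBand = "trivial" | "reversible" | "danger" | "destructive";

// Hypothetical outcome names for the three treatments in the table.
type GateDecision = "auto-allow" | "ask-human" | "ask-human-and-confirm";

function gate(
  band: ReversibilityBand,
  policyAllowsAutoReversible: boolean,
): GateDecision {
  switch (band) {
    case "trivial":
      // Read-only, no side effect.
      return "auto-allow";
    case "reversible":
      // Can be undone, so the policy decides.
      return policyAllowsAutoReversible ? "auto-allow" : "ask-human";
    case "danger":
      // Affects shared state: no policy setting can auto-allow it.
      return "ask-human";
    case "destructive":
      // Cannot be undone: approval plus an extra confirmation.
      return "ask-human-and-confirm";
  }
}
```

Note that the policy flag is only consulted for the `reversible` band. That is the design point: permissive configuration can widen auto-allow for recoverable actions but can never reach `danger` or `destructive`.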

5. The verdict shape

A single, reusable verdict object for each thing the human decides on:

verdict = Accept | Edit | Reject | Abort
+ rationale (free text, optional but encouraged for edits/rejects)
+ timestamp + reviewer
+ before / after (if edit)

This is the unit of the audit drawer in Sentinel, the decision record in Helm, and the mailing record in Recourse. Three different surfaces, same shape — and the same shape feeds into compliance, reviewer-performance metrics, and continuous-improvement loops for the underlying model.
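
In TypeScript, the shape above could be written as a discriminated union so that `before` / `after` exist only on edits. This is a sketch under assumptions: the field names, the ISO-string timestamp, and both helper functions are hypothetical and are not the canonical types, which this repo has not published yet.

```typescript
// Fields every verdict carries.
interface VerdictBase {
  rationale?: string; // free text, optional
  timestamp: string;  // assumed here to be an ISO 8601 string
  reviewer: string;
}

// `before` / `after` are only present (and required) when the verdict is an edit.
type Verdict =
  | (VerdictBase & { kind: "accept" | "reject" | "abort" })
  | (VerdictBase & { kind: "edit"; before: string; after: string });

// One-line audit summary; narrowing on `kind` gives typed access to before/after.
function summarize(v: Verdict): string {
  const base = `${v.reviewer} ${v.kind} at ${v.timestamp}`;
  return v.kind === "edit" ? `${base}: "${v.before}" -> "${v.after}"` : base;
}

// Rationale is optional everywhere but encouraged for edits and rejects,
// so a surface might nudge the reviewer when it is missing or blank.
function shouldPromptForRationale(v: Verdict): boolean {
  const isEditOrReject = v.kind === "edit" || v.kind === "reject";
  return isEditOrReject && !v.rationale?.trim();
}
```

Because the union is keyed on `kind`, an audit drawer, a decision record, and a mailing record could all consume the same object and still get compile-time checks on the edit-only fields.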

6. Progressive disclosure

A probabilistic surface lies if it shows everything at once or shows nothing.

The pattern: the minimum that's load-bearing renders inline (confidence band, the claim, a single evidence chip). Hover reveals the next layer (exact percentage, top-k alternatives, statute excerpt). Click opens the full panel (model rationale, full provenance, edit history). The reader pulls more detail only when something asks for it.
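
The three layers can be sketched as a lookup from interaction depth to what is on screen. The names below (`DisclosureLevel`, `LAYERS`, `visibleFields`) are hypothetical, and the sketch assumes the layers are cumulative, meaning hover adds to the inline content rather than replacing it.

```typescript
type DisclosureLevel = "inline" | "hover" | "click";

// Ordered from least to most detail.
const DISCLOSURE_ORDER: DisclosureLevel[] = ["inline", "hover", "click"];

// What each layer adds, taken from the pattern described above.
const LAYERS: Record<DisclosureLevel, string[]> = {
  inline: ["confidence band", "claim", "evidence chip"],
  hover: ["exact percentage", "top-k alternatives", "statute excerpt"],
  click: ["model rationale", "full provenance", "edit history"],
};

// Everything visible once the reader has reached `level`
// (assumption: deeper layers include the shallower ones).
function visibleFields(level: DisclosureLevel): string[] {
  const fields: string[] = [];
  for (const current of DISCLOSURE_ORDER) {
    fields.push(...LAYERS[current]);
    if (current === level) break;
  }
  return fields;
}
```

Under that assumption, `visibleFields("inline")` returns only the three load-bearing items, and each deeper level appends its own layer.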


How the three demos use the same vocabulary

| Primitive | Sentinel | Recourse | Helm |
| --- | --- | --- | --- |
| Calibrated confidence | Likely / Unsure / Low (expert language) | Settled / You verify / Ask a lawyer (action verbs) | ConfidenceTag on agent intent |
| Cross-hatch | Hallucination chip | Fabricated-statute chip | Irreversibility chip on danger/destructive |
| Evidence anchors | EvidenceLink to clinical / legal source | Statute chip with operative excerpt + verified-on date | Diff or command surfaced inside the ApprovalGate |
| Verdict shape | Accept / Edit / Reject per claim | Verify / Send per claim | Allow / Edit / Reject / Abort per tool call |
| Audit object | AuditEntry per decision | MailingRecord + deadline ledger | DecisionRecord per step |
| Progressive disclosure | Inline → hover top-k → drawer rationale | Plain-language gloss → excerpt → full statute | Tool kind → diff → cost notice |

A reviewer who learns the vocabulary on one project can read the others at a glance. That's the whole point.


What's not in this repo yet

This is the vocabulary, not a published component library. The shipped components live in their respective trilogy repos (Sentinel, Recourse, and Helm) for now.

A future iteration of this repo will extract the shared bits into a single installable package with the canonical types and CSS variables. For now, the README is the contract.


Why it matters for product surfaces beyond the trilogy

The same vocabulary applies cleanly to:

  • Search & retrieval surfaces — "this result is plausible / this result is hallucinated" is the Sentinel pattern applied to RAG output.
  • Agentic workflows in any domain (code, ops, support, design tooling) — Helm's reversibility-as-axis carries over without modification.
  • Personalized / recommendation surfaces — calibrated bands beat raw scores for the same reasons.
  • Compliance & audit-heavy enterprise SaaS — the verdict shape is already the audit primitive.

If you're building any of those, the patterns here are designed to be picked up and re-used. Open an issue or message me — happy to talk specifics.


Related work

Ankur Sinha
