v0.2.0a2: runtime-value comparisons + benchmark libraries (alpha)

Pre-release

yfxiao16 released this 07 Jun 06:47
· 19 commits to main since this release

Sponsio 0.2.0a2: runtime-value comparisons + benchmark libraries

Released: 2026-06-07 · Status: alpha · pip install --pre sponsio==0.2.0a2

The 0.2.0a1 "softer landings" release made contracts more graceful when they fire. 0.2.0a2 makes them more expressive: contracts can now read runtime values out of tool arguments and context facts, compare them against each other, and prescribe the next action instead of only forbidding it.

It also ships the five hand-curated benchmark contract libraries that produce Sponsio's published RedCode-Exec, ODCV-Bench, τ²-bench, AgentDojo, and SWE-bench headline numbers, and brings the TypeScript SDK to parity on the new deterministic core.


What's new

1. Term abstraction: compare runtime values

What it is. The arithmetic comparison family (Eq, Le, Lt, Ge, Gt) now accepts any Term, not just Var or Const. Four runtime-bound term subclasses ship with this release:

  • ArgValue(tool, field): raw value of args[field] when the current event is a call to tool.
  • CtxValue(key): raw value of an externally pushed context fact (guard.observe_context).
  • ArgLength(tool, field): len(args[field]) shorthand.
  • UnaryFn(fn, term): apply a Python callable to another term's value.

from sponsio.formulas.formula import ArgValue, CtxValue, Eq, G, Implies, Atom

# "If we issue a refund, the amount must equal what the supervisor approved."
contract("refund matches approval").guarantees(
    G(Implies(
        Atom("called", "issue_refund"),
        Eq(ArgValue("issue_refund", "amount"), CtxValue("approved_amount")),
    ))
)

Why it exists. Until 0.2.0a2 the only way to compare a runtime arg against an out-of-band fact was to push the comparison up into Python and use a custom strategy callback. The Term abstraction lets the comparison live inside the contract, so it shows up in sponsio validate, in audit logs, and in the DFA-compiled fast path.

Why it's good for users.

  • Audit-friendly. The constraint is declarative, not buried in callback code. A security reviewer reads the contract and sees what's being compared.
  • Cheap. Polymorphic dispatch takes microseconds, with no per-event Python callback overhead.
  • Composable. UnaryFn(len, ArgValue(...)) and ArgLength(...) cover length caps; UnaryFn(str.lower, ...) covers case-insensitive matches; arbitrary callables cover the rest.
  • Safe on missing data. Either operand resolving to None evaluates the comparison to false (the comparison cannot decide) rather than raising. Wrap fragile comparisons in Implies(scope_predicate, comparison) to suppress them where the relevant arg is not applicable.
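
The resolution and missing-data rules above can be sketched in plain Python. This is a minimal illustration, not Sponsio's implementation: the classes below are hypothetical stand-ins that only borrow the names from this release, and the event shape ({"tool": ..., "args": {...}}), the plain-dict context, the resolve method, and the compare and arg_length helpers are all assumptions made for the example.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical stand-ins for illustration; these are NOT Sponsio's classes.
# Assumed shapes: event = {"tool": str, "args": dict}, ctx = plain dict.


@dataclass(frozen=True)
class Const:
    value: Any

    def resolve(self, event: dict, ctx: dict) -> Optional[Any]:
        return self.value


@dataclass(frozen=True)
class ArgValue:
    tool: str
    field: str

    def resolve(self, event: dict, ctx: dict) -> Optional[Any]:
        # Bound only when the current event is a call to `tool`.
        if event.get("tool") != self.tool:
            return None
        return event.get("args", {}).get(self.field)


@dataclass(frozen=True)
class CtxValue:
    key: str

    def resolve(self, event: dict, ctx: dict) -> Optional[Any]:
        return ctx.get(self.key)


@dataclass(frozen=True)
class UnaryFn:
    fn: Callable[[Any], Any]
    term: Any

    def resolve(self, event: dict, ctx: dict) -> Optional[Any]:
        inner = self.term.resolve(event, ctx)
        # Propagate "missing" instead of calling fn(None).
        return None if inner is None else self.fn(inner)


def arg_length(tool: str, field: str) -> UnaryFn:
    # Shorthand for len(args[field]), built from the two terms above.
    return UnaryFn(len, ArgValue(tool, field))


def compare(op: Callable[[Any, Any], bool], left, right,
            event: dict, ctx: dict) -> bool:
    a, b = left.resolve(event, ctx), right.resolve(event, ctx)
    # Either operand missing: the comparison cannot decide, so it is false.
    if a is None or b is None:
        return False
    return op(a, b)


event = {"tool": "issue_refund", "args": {"amount": 40, "reason": "Damaged"}}
amount = ArgValue("issue_refund", "amount")
approved = CtxValue("approved_amount")

matches = compare(operator.eq, amount, approved, event, {"approved_amount": 40})
undecided = compare(operator.eq, amount, approved, event, {})  # no approval pushed
short_reason = compare(operator.le, arg_length("issue_refund", "reason"),
                       Const(200), event, {})
case_insensitive = compare(operator.eq,
                           UnaryFn(str.lower, ArgValue("issue_refund", "reason")),
                           Const("damaged"), event, {})
```

In this sketch, undecided comes out false because the context never received approved_amount. That is the case the Implies(scope_predicate, comparison) wrapper is meant to scope away when the comparison should not apply.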

2. workflow_step(trigger, next_action): prescriptive next-step

What it is. A new pattern that says "when trigger holds at the current event, the next event must satisfy next_action". Compiles to G(trigger -> X(next_action)).

from sponsio.patterns import workflow_step
from sponsio.formulas.formula import Atom

contract("toggle roaming on disabled status").guarantees(
    workflow_step(
        Atom("ctx", "roaming_status", "disabled"),
        Atom("called", "toggle_roaming"),
    )
)

Why it exists. Sponsio's existing patterns are all block-style: "you must not do X", "X requires Y first". workflow_step is the prescriptive counterpart: "you must do X next". Workflow-style policies ("if you observe X, the next step is Y") map directly onto the pattern without bending the contract into an awkward never-followed-by.

Why it's good for users.

  • Both arguments are arbitrary atoms. called(...), ctx(k, v), arg_field_has(...) all work in either position, so the same factory covers tool ordering, ctx-driven remediation, and arg-conditional follow-ups.
  • One-step bounded. Unlike the F-style always_followed_by, workflow_step decides after a single event. No liveness obligation hanging at session end.
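
To make the G(trigger -> X(next_action)) reading concrete, here is a minimal finite-trace checker in plain Python. It is a sketch under stated assumptions, not Sponsio's evaluator: the function name workflow_step_violations, the event shape (a dict with "tool" and a "ctx" snapshot), and the two predicates are hypothetical. It also assumes that a trigger on the final event of a trace is not reported as a violation, since no next event exists to judge.

```python
from typing import Callable, List, Sequence

Event = dict
Predicate = Callable[[Event], bool]


def workflow_step_violations(trace: Sequence[Event],
                             trigger: Predicate,
                             next_action: Predicate) -> List[int]:
    """Return each index i where trigger holds at trace[i]
    but next_action does not hold at trace[i + 1]."""
    violations = []
    # Stop one short of the end: the last event has no successor to check
    # (assumption: a trailing trigger is left undecided, not flagged).
    for i in range(len(trace) - 1):
        if trigger(trace[i]) and not next_action(trace[i + 1]):
            violations.append(i)
    return violations


# Hypothetical predicates mirroring the roaming example above.
def roaming_disabled(event: Event) -> bool:
    return event["ctx"].get("roaming_status") == "disabled"


def calls_toggle_roaming(event: Event) -> bool:
    return event["tool"] == "toggle_roaming"


good_trace = [
    {"tool": "get_line_status", "ctx": {"roaming_status": "disabled"}},
    {"tool": "toggle_roaming", "ctx": {"roaming_status": "enabled"}},
]
bad_trace = [
    {"tool": "get_line_status", "ctx": {"roaming_status": "disabled"}},
    {"tool": "send_message", "ctx": {"roaming_status": "disabled"}},
    {"tool": "toggle_roaming", "ctx": {"roaming_status": "enabled"}},
]

good_result = workflow_step_violations(good_trace, roaming_disabled, calls_toggle_roaming)
bad_result = workflow_step_violations(bad_trace, roaming_disabled, calls_toggle_roaming)
```

Each obligation is settled by looking at exactly one following event, which is the "one-step bounded" property: in bad_trace the violation is pinned to index 0 as soon as send_message arrives, without waiting for the session to end.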

3. Five benchmark contract libraries

What they are. Hand-curated YAML libraries that reproduce Sponsio's published benchmark headline numbers:

Library                         | Benchmark                                             | Contracts
sponsio:benchmark/redcode_exec  | RedCode-Exec dangerous-snippet detection              | 26
sponsio:benchmark/odcv_bench    | ODCV-Bench KPI-pressure protection                    | 19 + per-scenario LLM-scan cache
sponsio:benchmark/tau2_bench    | τ²-bench procedural-correctness                       | 120 materialised contracts
sponsio:benchmark/agentdojo     | AgentDojo prompt-injection / lethal-trifecta defence  | 31
sponsio:benchmark/swebench      | SWE-bench Verified procedural-correctness             | ~20 per instance

Load like a capability pack:

agents:
  my_bot:
    include:
      - sponsio:benchmark/redcode_exec
      - sponsio:benchmark/odcv_bench

Why they exist. The numbers in the benchmark documents (95.6% on ODCV-Bench, 92% combined on RedCode, 0.746 AUC on τ²-bench) are reproducible only if the exact contracts are available. The libraries are the documentation-of-record for those results.

Why they're good for users.

  • Reproducibility. The published numbers stop being "trust us" and become "run this script on this YAML".
  • Forks-as-starting-points. Most rules tagged code-execution or code-quality generalise; a handful are calibrated to dataset-specific markers. The library is meant to be forked, edited, and pruned, not used verbatim in production.
  • Cross-runtime. The YAML loads identically on the Python guard and on the TypeScript SDK. Both runtimes ship the same five files.

4. TypeScript SDK reaches parity on the deterministic core

The TS SDK (@sponsio/sdk) now mirrors:

  • The Term abstraction and all four runtime-bound term classes (ArgValue, CtxValue, UnaryFn, ArgLength).
  • The workflowStep(trigger, nextAction, desc?) pattern factory.
  • The five benchmark contract YAML libraries under ts/packages/sdk/contracts/benchmark/.
  • Grounding emits arg_value(tool, field) and ctx_value(key) on every event.
  • The textual (formula, trace) -> verdict round-trip parser accepts the three new term tokens.

Verdicts agree on both runtimes for any contract built from primitives that exist in both: the same (formula, trace) pair always produces the same outcome.


Upgrading

pip install --pre sponsio==0.2.0a2

No breaking changes vs 0.2.0a1. Existing contracts continue to compile and behave identically. The new primitives are additive.

Compatibility

  • Var and Const are now Term subclasses. The ArithExpr type is an alias for Term, so existing type hints keep working.
  • Valuation (TS) is now Record<string, unknown>. If your TypeScript code stored boolean / number atoms with an explicit Record<string, boolean | number> typing, narrow at the call site or upcast as needed.
  • No CLI or config schema changes. sponsio validate, sponsio onboard, sponsio.yaml all unchanged.

Known limitations

  • TS's parseNl() does not yet recognise workflow_step or the Term comparison forms as natural-language strings. The factories ARE available for direct construction; only the NL parser is behind. See docs/reference/ts-sdk-parity.md.
  • The TypeScript SDK still does not ship a DFA-compiled evaluator (only the recursive one). Verdicts agree, but the DFA path is faster on long traces. This stays on the roadmap.

What's next

  • TS NL parser port for workflow_step and the Term forms.
  • TS DFA-compiled evaluator port.
  • Continue closing the v0.2 strategy system gap on TS (RedirectToSafe dispatch in @sponsio/sdk/langchain, EscalateToHuman.notify callback hooks).

If you are using 0.2.0a2 and hit something we did not predict, open an issue.