v0.2.0a2: runtime-value comparisons + benchmark libraries (alpha)
Sponsio 0.2.0a2: runtime-value comparisons + benchmark libraries
Released: 2026-06-07 · Status: alpha
```bash
pip install --pre sponsio==0.2.0a2
```
The 0.2.0a1 "softer landings" release made contracts more graceful when they fire. 0.2.0a2 makes them more expressive: contracts can now read runtime values out of tool arguments and context facts, compare them against each other, and prescribe the next action instead of only forbidding it.
It also ships the five hand-curated benchmark contract libraries that produce Sponsio's published RedCode-Exec, ODCV-Bench, τ²-bench, AgentDojo, and SWE-bench headline numbers, and brings the TypeScript SDK to parity with Python on the new deterministic core.
What's new
1. Term abstraction: compare runtime values
What it is. The arithmetic comparison family (`Eq`, `Le`, `Lt`, `Ge`, `Gt`) now accepts any `Term`, not just `Var` or `Const`. Four runtime-bound term subclasses ship with this release:
- `ArgValue(tool, field)`: raw value of `args[field]` when the current event is a call to `tool`.
- `CtxValue(key)`: raw value of an externally pushed context fact (`guard.observe_context`).
- `ArgLength(tool, field)`: shorthand for `len(args[field])`.
- `UnaryFn(fn, term)`: applies a Python callable to another term's value.
```python
from sponsio.formulas.formula import ArgValue, CtxValue, Eq, G, Implies, Atom

# `contract(...)` is assumed to be in scope; its import is not shown in these notes.
# "If we issue a refund, the amount must equal what the supervisor approved."
contract("refund matches approval").guarantees(
    G(Implies(
        Atom("called", "issue_refund"),
        Eq(ArgValue("issue_refund", "amount"), CtxValue("approved_amount")),
    ))
)
```
Why it exists. Until 0.2.0a2, the only way to compare a runtime arg against an out-of-band fact was to push the comparison up into Python through a custom strategy callback. The `Term` abstraction lets the comparison live inside the contract, so it shows up in `sponsio validate`, in audit logs, and in the DFA-compiled fast path.
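To make the binding rules concrete, here is a minimal, self-contained Python model of the four runtime-bound terms, reusing the refund example's names. It is a hypothetical sketch, not Sponsio's implementation: the `Event` shape, the `ctx` dict standing in for facts pushed through `guard.observe_context`, and the `resolve` method are all assumptions. Each term returns `None` whenever it is not bound at the current event, and `UnaryFn` passes that `None` through instead of calling its function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical model of the runtime-bound terms; not Sponsio's classes or API.


@dataclass
class Event:
    tool: Optional[str]  # tool being called at this event, if any
    args: dict           # that call's arguments


class Term:
    def resolve(self, event: Event, ctx: dict) -> Any:
        """Return the term's value at `event`, or None if it is not bound."""
        raise NotImplementedError


@dataclass
class ArgValue(Term):
    tool: str
    field: str

    def resolve(self, event, ctx):
        # Bound only while the current event is a call to `tool`.
        if event.tool != self.tool:
            return None
        return event.args.get(self.field)


@dataclass
class CtxValue(Term):
    key: str

    def resolve(self, event, ctx):
        # `ctx` stands in for facts pushed out of band via guard.observe_context.
        return ctx.get(self.key)


@dataclass
class UnaryFn(Term):
    fn: Callable[[Any], Any]
    term: Term

    def resolve(self, event, ctx):
        value = self.term.resolve(event, ctx)
        # Propagate "unbound" rather than calling fn on a missing value.
        return None if value is None else self.fn(value)


@dataclass
class ArgLength(Term):
    tool: str
    field: str

    def resolve(self, event, ctx):
        # Same result as UnaryFn(len, ArgValue(tool, field)).
        return UnaryFn(len, ArgValue(self.tool, self.field)).resolve(event, ctx)
```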
Why it's good for users.
- Audit-friendly. The constraint is declarative, not buried in callback code. A security reviewer reads the contract and sees what's being compared.
- Cheap. Evaluating a comparison is a polymorphic dispatch that costs microseconds, with no per-event Python callback overhead.
- Composable. `UnaryFn(len, ArgValue(...))` and `ArgLength(...)` cover length caps; `UnaryFn(str.lower, ...)` covers case-insensitive matches; arbitrary callables cover the rest.
- Safe on missing data. If either operand resolves to `None`, the comparison evaluates to false (it cannot decide) rather than raising. Inside `G(...)` that false counts as a violation, so wrap fragile comparisons in `Implies(scope_predicate, comparison)` to suppress them at events where the relevant arg does not apply.
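The missing-data rule and the `Implies` scoping advice interact as follows under `G(...)`. The sketch below is a hypothetical model, not Sponsio's evaluator: terms are plain functions from an event dict to a value (or `None`), `compare` returns false whenever either side is `None`, and `always` checks a per-event predicate across a finite trace. The event dict layout and every helper name are assumptions.

```python
import operator
from typing import Any, Callable, Optional

# Hypothetical model; event shape assumed: {"tool": ..., "args": {...}, "ctx": {...}}
Event = dict
TermFn = Callable[[Event], Optional[Any]]
Pred = Callable[[Event], bool]


def arg_value(tool: str, field: str) -> TermFn:
    # None unless the event is a call to `tool` that carries `field`.
    return lambda ev: ev["args"].get(field) if ev["tool"] == tool else None


def ctx_value(key: str) -> TermFn:
    return lambda ev: ev["ctx"].get(key)


def compare(op: Callable[[Any, Any], bool], lhs: TermFn, rhs: TermFn) -> Pred:
    def pred(ev: Event) -> bool:
        a, b = lhs(ev), rhs(ev)
        if a is None or b is None:
            return False  # cannot decide: false, never an exception
        return op(a, b)
    return pred


def called(tool: str) -> Pred:
    return lambda ev: ev["tool"] == tool


def implies(p: Pred, q: Pred) -> Pred:
    return lambda ev: (not p(ev)) or q(ev)


def always(pred: Pred, trace: list) -> bool:
    # G over a finite trace: the predicate must hold at every event.
    return all(pred(ev) for ev in trace)


amount_ok = compare(operator.eq, arg_value("issue_refund", "amount"), ctx_value("approved_amount"))

trace = [
    {"tool": "lookup_order", "args": {"id": 7}, "ctx": {"approved_amount": 50}},
    {"tool": "issue_refund", "args": {"amount": 50}, "ctx": {"approved_amount": 50}},
]
```

Checked unscoped, `always(amount_ok, trace)` fails on the `lookup_order` event, where no refund amount exists; scoped as `always(implies(called("issue_refund"), amount_ok), trace)`, only the refund call is compared.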
2. workflow_step(trigger, next_action): prescriptive next-step
What it is. A new pattern that says "when `trigger` holds at the current event, the next event must satisfy `next_action`". It compiles to `G(trigger -> X(next_action))`.
```python
from sponsio.patterns import workflow_step
from sponsio.formulas.formula import Atom

# `contract(...)` is assumed to be in scope; its import is not shown in these notes.
# "If roaming is observed as disabled, the very next action must toggle it."
contract("toggle roaming on disabled status").guarantees(
    workflow_step(
        Atom("ctx", "roaming_status", "disabled"),  # trigger
        Atom("called", "toggle_roaming"),           # required next action
    )
)
```
Why it exists. Sponsio's existing patterns are all block-style: "you must not do X", "X requires Y first". `workflow_step` is the prescriptive counterpart: "you must do X next". Workflow-style policies ("if you observe X, the next step is Y") map directly onto the pattern instead of being bent into an awkward never-followed-by contract.
Why it's good for users.
- Both arguments are arbitrary atoms. `called(...)`, `ctx(k, v)`, and `arg_field_has(...)` all work in either position, so the same factory covers tool ordering, ctx-driven remediation, and arg-conditional follow-ups.
- One-step bounded. Unlike the `F`-style `always_followed_by`, `workflow_step` decides after a single event, so no liveness obligation is left hanging at session end.
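The one-step semantics of `G(trigger -> X(next_action))` fit in a few lines. This is a hypothetical stand-alone monitor, not `workflow_step` itself: events are dicts, atoms are plain predicates, and a trigger on the final event is reported as pending rather than as a violation. That end-of-trace choice is an assumption; these notes do not spell out how Sponsio handles it.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical monitor for G(trigger -> X(next_action)); not Sponsio's code.
Event = dict
Pred = Callable[[Event], bool]


def check_workflow_step(trigger: Pred, next_action: Pred,
                        trace: List[Event]) -> Tuple[Optional[int], bool]:
    """Return (index of the first violating event or None, pending-at-end flag).

    An obligation raised at event i is discharged or violated by event i + 1
    alone, so each check looks exactly one step ahead.
    """
    for i in range(len(trace) - 1):
        if trigger(trace[i]) and not next_action(trace[i + 1]):
            return i + 1, False
    # A trigger on the last event has no successor yet (assumed: pending).
    pending = bool(trace) and trigger(trace[-1])
    return None, pending


roaming_disabled: Pred = lambda ev: ev.get("ctx", {}).get("roaming_status") == "disabled"
toggles_roaming: Pred = lambda ev: ev.get("tool") == "toggle_roaming"
```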
3. Five benchmark contract libraries
What they are. Hand-curated YAML libraries that reproduce Sponsio's published benchmark headline numbers:
| Library | Benchmark | Contracts |
|---|---|---|
| `sponsio:benchmark/redcode_exec` | RedCode-Exec dangerous-snippet detection | 26 |
| `sponsio:benchmark/odcv_bench` | ODCV-Bench KPI-pressure protection | 19 + per-scenario LLM-scan cache |
| `sponsio:benchmark/tau2_bench` | τ²-bench procedural-correctness | 120 materialised contracts |
| `sponsio:benchmark/agentdojo` | AgentDojo prompt-injection / lethal-trifecta defence | 31 |
| `sponsio:benchmark/swebench` | SWE-bench Verified procedural-correctness | ~20 per instance |
Load them like a capability pack:
```yaml
agents:
  my_bot:
    include:
      - sponsio:benchmark/redcode_exec
      - sponsio:benchmark/odcv_bench
```
Why they exist. The numbers in the benchmark documents (95.6% on ODCV-Bench, 92% combined on RedCode, 0.746 AUC on τ²-bench) are reproducible only if the exact contracts are available. The libraries are the documentation-of-record for those results.
Why they're good for users.
- Reproducibility. The published numbers stop being "trust us" and become "run this script on this YAML".
- Forks-as-starting-points. Most rules tagged `code-execution` or `code-quality` generalise; a handful are calibrated to dataset-specific markers. The library is meant to be forked, edited, and pruned, not used verbatim in production.
- Cross-runtime. The YAML loads identically on the Python guard and on the TypeScript SDK. Both runtimes ship the same five files.
4. TypeScript SDK reaches parity on the deterministic core
The TS SDK (@sponsio/sdk) now mirrors:
- The `Term` abstraction and all four runtime-bound term classes (`ArgValue`, `CtxValue`, `UnaryFn`, `ArgLength`).
- The `workflowStep(trigger, nextAction, desc?)` pattern factory.
- The five benchmark contract YAML libraries under `ts/packages/sdk/contracts/benchmark/`.
- Grounding that emits `arg_value(tool, field)` and `ctx_value(key)` on every event.
- The textual `(formula, trace) -> verdict` round-trip parser, which accepts the three new term tokens.
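Grounding turns one event into the valuation the evaluator reads. Once it emits raw `arg_value` and `ctx_value` entries, that valuation is no longer purely boolean, which is presumably why the TS `Valuation` type widened to `Record<string, unknown>`. The Python sketch below is hypothetical and matches neither SDK's code: the `ground` function, the `called(tool)` key, and the exact key strings are assumptions modelled on the token names above.

```python
from typing import Any, Dict, Optional


def ground(tool: Optional[str], args: Dict[str, Any], ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical grounding: map one event to a mixed boolean/raw-value valuation."""
    valuation: Dict[str, Any] = {}
    if tool is not None:
        valuation[f"called({tool})"] = True  # boolean atom
        for name, value in args.items():
            valuation[f"arg_value({tool}, {name})"] = value  # raw runtime value
    for key, value in ctx.items():
        valuation[f"ctx_value({key})"] = value  # raw context fact
    return valuation
```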
Verdicts agree on both runtimes for any contract built from primitives that exist in both: the same `(formula, trace)` pair always produces the same verdict.
Upgrading
```bash
pip install --pre sponsio==0.2.0a2
```
No breaking changes vs 0.2.0a1. Existing contracts continue to compile and behave identically. The new primitives are additive.
Compatibility
- `Var` and `Const` are now `Term` subclasses. The `ArithExpr` type is an alias for `Term`, so existing type hints keep working.
- `Valuation` (TS) is now `Record<string, unknown>`. If your TypeScript code stores boolean / number atoms under an explicit `Record<string, boolean | number>` typing, narrow at the call site or upcast as needed.
- No CLI or config schema changes: `sponsio validate`, `sponsio onboard`, and `sponsio.yaml` are all unchanged.
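In Python, the `ArithExpr` guarantee reduces to a plain type alias over a shared base class. The sketch below is a hypothetical reduction using toy classes, not Sponsio's source; it shows why an annotation written against `ArithExpr` keeps accepting `Var` and `Const` and also accepts the new runtime-bound terms.

```python
from dataclasses import dataclass
from typing import Any


class Term:
    """Toy stand-in for the new shared base class."""


@dataclass
class Var(Term):
    name: str


@dataclass
class Const(Term):
    value: Any


@dataclass
class CtxValue(Term):
    key: str


# The old name survives as a plain alias, so existing hints still type-check.
ArithExpr = Term


def describe(expr: ArithExpr) -> str:
    # Any Term subclass satisfies an ArithExpr annotation.
    return type(expr).__name__
```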
Known limitations
- TS's `parseNl()` does not yet recognise `workflow_step` or the `Term` comparison forms as natural-language strings. The factories ARE available for direct construction; only the NL parser is behind. See `docs/reference/ts-sdk-parity.md`.
- The TypeScript SDK still does not ship a DFA-compiled evaluator (only the recursive one). Verdicts agree, but the DFA path is faster on long traces. This stays on the roadmap.
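Why a recursive evaluator and a DFA can agree while differing in speed is easiest to see on the `workflow_step` shape. The sketch below is hypothetical and unrelated to either SDK's code: it evaluates `G(p -> X q)` once by recursion over trace suffixes and once with a three-state automaton, then checks that the two verdicts match on every trace over a small alphabet. It assumes weak-next semantics, so a trigger on the final event is not a violation.

```python
from itertools import product
from typing import Sequence, Tuple

Letter = Tuple[bool, bool]  # (p holds, q holds) at one event


def eval_recursive(trace: Sequence[Letter], i: int = 0) -> bool:
    """G(p -> X q) on trace[i:], with weak next at the final event (assumed)."""
    if i >= len(trace):
        return True  # G over an empty suffix holds vacuously
    p, _ = trace[i]
    has_next = i + 1 < len(trace)
    step_ok = (not p) or (not has_next) or trace[i + 1][1]
    return step_ok and eval_recursive(trace, i + 1)


def eval_dfa(trace: Sequence[Letter]) -> bool:
    # States: 0 = no obligation, 1 = q owed at this event, 2 = violated (sink).
    state = 0
    for p, q in trace:
        if state == 1 and not q:
            state = 2
        if state == 2:
            break
        state = 1 if p else 0
    return state != 2  # ending in state 1 is accepted under weak next


def agree_up_to(max_len: int) -> bool:
    # Exhaustive differential check over all traces up to max_len events.
    letters = list(product([False, True], repeat=2))
    for n in range(max_len + 1):
        for trace in product(letters, repeat=n):
            if eval_recursive(trace) != eval_dfa(trace):
                return False
    return True
```

The automaton does constant work per event and carries a single integer of state, whereas the recursive version here walks trace suffixes and would hit Python's recursion limit on very long traces.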
What's next
- TS NL parser port for `workflow_step` and the `Term` forms.
- TS DFA-compiled evaluator port.
- Continue closing the v0.2 strategy system gap on TS (`RedirectToSafe` dispatch in `@sponsio/sdk/langchain`, `EscalateToHuman.notify` callback hooks).
If you are using 0.2.0a2 and hit something we did not predict, open an issue.