v0.70.0: Eval Framework
Nika 0.70.0: Eval Framework
Inference as Code · April 5, 2026 · 16 commits
| Tests | Builtins | Transforms | Providers | Crates |
|---|---|---|---|---|
| 10,142 | 62 | 63 | 14 | 17 |
infer · exec · fetch · invoke · agent
This Release in 30 Seconds
Workflows deserve tests. Until now, validating a Nika workflow meant running it against a real provider and eyeballing the output. That era is over. v0.70.0 introduces `nika eval`, a dataset-driven evaluation framework that lets you define inputs, set assertions, and get a pass/fail report you can wire into any CI pipeline. Pair that with a new batch endpoint that accepts up to 50 workflows in a single POST, queryable job tags for organizing production runs, and a Storage V4 migration that makes it all filterable at the SQLite level, and you have the building blocks for serious workflow ops. This is the release where Nika stops being "run and hope" and starts being "test, tag, batch, observe."
nika eval: Dataset-Driven Workflow Evaluation
Every LLM workflow has a question nobody wants to answer: "Did the last refactor break anything?" Running manually is slow and expensive. nika eval solves this by letting you define a dataset of inputs + expectations, then running your workflow against each row with automatic assertions.
Four assertion types ship out of the box:
| Assertion | What it checks | Example |
|---|---|---|
| `output_contains` | Substring present in output | `output_contains: "ownership"` |
| `output_min_words` | Output has at least N words | `output_min_words: 50` |
| `output_max_words` | Output has at most N words | `output_max_words: 500` |
| `output_matches_schema` | Output validates against JSON Schema | `type: object, required: [summary]` |
```yaml
# eval-dataset.yaml
- inputs: { topic: "Rust memory safety" }
  expect:
    output_contains: "ownership"
    output_min_words: 50
- inputs: { topic: "Python async" }
  expect:
    output_contains: "asyncio"
    output_max_words: 500
- inputs: { topic: "WebAssembly" }
  expect:
    output_matches_schema:
      type: object
      required: [summary, key_points]
```
```bash
# Run eval: mock provider by default (zero cost, instant)
nika eval research.nika.yaml --dataset eval-dataset.yaml
# JSON output for CI pipeline gates
nika eval research.nika.yaml --dataset eval-dataset.yaml --format json
```
Example output
```
Eval: research.nika.yaml (3 rows)
Row 1: PASS (2/2 assertions)
Row 2: PASS (2/2 assertions)
Row 3: FAIL (output_matches_schema: missing field "key_points")
-----------------------------
Summary: 2/3 passed
Exit code: 1 (use in CI: any failure = non-zero)
```
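To make the four assertion types concrete, here is a minimal Python sketch of how a single dataset row could be checked. It is not Nika's implementation: `check_row` and `_check_schema` are hypothetical names, and the schema check is deliberately simplified to the `type: object` and `required` keywords used in the dataset above rather than full JSON Schema.

```python
import json


def _check_schema(output, schema):
    # Simplified on purpose: only `type: object` and `required` are handled,
    # which is far less than a real JSON Schema validator does.
    try:
        value = json.loads(output)
    except json.JSONDecodeError:
        return ["output_matches_schema: output is not valid JSON"]
    if schema.get("type") == "object" and not isinstance(value, dict):
        return ["output_matches_schema: expected an object"]
    if not isinstance(value, dict):
        return []  # `required` only constrains objects
    return [
        f'output_matches_schema: missing field "{name}"'
        for name in schema.get("required", [])
        if name not in value
    ]


def check_row(output, expect):
    """Return failure messages for one row; an empty list means PASS."""
    failures = []
    word_count = len(output.split())  # whitespace-separated words (assumed definition)
    if "output_contains" in expect:
        needle = expect["output_contains"]
        if needle not in output:
            failures.append(f'output_contains: "{needle}" not found')
    if "output_min_words" in expect and word_count < expect["output_min_words"]:
        failures.append(f"output_min_words: got {word_count} words")
    if "output_max_words" in expect and word_count > expect["output_max_words"]:
        failures.append(f"output_max_words: got {word_count} words")
    if "output_matches_schema" in expect:
        failures.extend(_check_schema(output, expect["output_matches_schema"]))
    return failures
```

A runner built on this would mark a row as failed when `check_row` returns any message and exit non-zero if any row failed, which matches the CI behavior described above.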
Tip
nika eval uses --provider mock by default, so your evaluation datasets run instantly with zero API cost. Switch to --provider anthropic when you want real LLM validation.
POST /v1/batch/run: Submit Up to 50 Workflows at Once
If you're orchestrating Nika from an external system (a Node.js backend, a Python script, a CI job), sending one HTTP request per workflow is painful. The new batch endpoint accepts up to 50 workflow executions in a single POST, with two-pass validation: all requests are syntax-checked first, and only then submitted for execution. No more partial failures where request 47 fails validation after 46 are already running.
```bash
curl -X POST http://localhost:3000/v1/batch/run \
  -H "Authorization: Bearer $NIKA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "requests": [
      {
        "workflow": "translate.nika.yaml",
        "inputs": { "lang": "fr", "source": "README.md" },
        "tags": { "env": "prod", "team": "i18n" }
      },
      {
        "workflow": "translate.nika.yaml",
        "inputs": { "lang": "de", "source": "README.md" },
        "tags": { "env": "prod", "team": "i18n" }
      }
    ]
  }'
```
Response structure
```json
{
  "jobs": [
    { "job_id": "abc-123", "workflow": "translate.nika.yaml", "state": "queued" },
    { "job_id": "def-456", "workflow": "translate.nika.yaml", "state": "queued" }
  ],
  "accepted": 2,
  "rejected": 0
}
```
Note
Two-pass validation means the batch endpoint checks every request's syntax and workflow existence before queueing any of them. Either the entire batch is accepted, or you get a detailed error report for each rejected request.
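The all-or-nothing behavior can be sketched in a few lines of Python. This is an illustration of the two-pass idea only, not the server's code: `validate`, `submit_batch`, the `enqueue` callback, and the `errors` field are hypothetical, and the per-request checks are assumptions based on the request shape shown above.

```python
MAX_BATCH = 50  # batch size limit stated in these release notes


def validate(request, known_workflows):
    """Return an error message for an invalid request, or None if it is valid."""
    if not isinstance(request, dict):
        return "request must be an object"
    workflow = request.get("workflow")
    if not isinstance(workflow, str) or not workflow:
        return "missing or invalid 'workflow'"
    if workflow not in known_workflows:
        return f"unknown workflow: {workflow}"
    if not isinstance(request.get("inputs", {}), dict):
        return "'inputs' must be an object"
    return None


def submit_batch(requests, known_workflows, enqueue):
    """Pass 1 validates every request; pass 2 enqueues only if all passed."""
    if len(requests) > MAX_BATCH:
        error = {"index": None, "error": f"batch exceeds {MAX_BATCH} requests"}
        return {"jobs": [], "accepted": 0, "rejected": len(requests), "errors": [error]}

    # Pass 1: collect every problem before touching the queue.
    errors = []
    for index, request in enumerate(requests):
        problem = validate(request, known_workflows)
        if problem is not None:
            errors.append({"index": index, "error": problem})
    if errors:
        return {"jobs": [], "accepted": 0, "rejected": len(requests), "errors": errors}

    # Pass 2: nothing is queued until the whole batch is known to be valid.
    jobs = [enqueue(request) for request in requests]
    return {"jobs": jobs, "accepted": len(jobs), "rejected": 0, "errors": []}
```

Collecting all errors in the first pass, instead of stopping at the first one, is what lets the caller fix the whole batch in a single round trip.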
Job Tags + Filtered Listing
Production workflows need metadata. Which environment? Which team? Which customer triggered this? Job tags let you attach arbitrary key-value pairs to any workflow execution, and the new GET /v1/jobs endpoint makes them filterable.
```bash
# List all completed jobs tagged for production
curl "http://localhost:3000/v1/jobs?state=completed&tag.env=prod&limit=20"
# Filter by team
curl "http://localhost:3000/v1/jobs?tag.team=i18n&workflow=translate.nika.yaml"
```
Tags are stored in SQLite (Storage V4 migration) and queried server-side, so no client-side filtering is needed. Cursor-based pagination handles large result sets with a `has_more` flag.
| Feature | Detail |
|---|---|
| Tag format | JSON key-value: `{ "env": "prod", "team": "i18n" }` |
| Query syntax | `?tag.key=value` in query string |
| Pagination | Cursor-based with `?cursor=xxx&limit=20` |
| Validation | Keys: 1-64 characters; alphanumeric plus `-` and `_` only |
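From a client's point of view, the query syntax and cursor loop might look like the Python sketch below. The `tag.key=value`, `cursor`, `limit`, and `has_more` names come from these notes; the `jobs` and `next_cursor` response fields, and the helper names `jobs_url` and `iter_jobs`, are assumptions made for illustration. The HTTP call is injected as `fetch_page` so the sketch stays independent of any particular HTTP library.

```python
from urllib.parse import urlencode


def jobs_url(base, *, state=None, workflow=None, tags=None, cursor=None, limit=20):
    """Build a GET /v1/jobs URL with optional state, workflow, and tag filters."""
    params = {}
    if state:
        params["state"] = state
    if workflow:
        params["workflow"] = workflow
    for key, value in (tags or {}).items():
        params[f"tag.{key}"] = value  # ?tag.key=value syntax from the table above
    if cursor:
        params["cursor"] = cursor
    params["limit"] = limit
    return f"{base}/v1/jobs?{urlencode(params)}"


def iter_jobs(fetch_page, base, **filters):
    """Yield jobs across pages until the server reports has_more: false."""
    cursor = None
    while True:
        page = fetch_page(jobs_url(base, cursor=cursor, **filters))
        yield from page["jobs"]  # "jobs" is an assumed response field
        if not page.get("has_more"):
            break
        cursor = page["next_cursor"]  # "next_cursor" is an assumed response field
```

For example, `jobs_url("http://localhost:3000", state="completed", tags={"env": "prod"})` produces the same URL as the first `curl` call above.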
Tip
Tags are set at submission time via the `tags` field in `RunRequest`. They're immutable after submission: think of them as labels, not mutable state.
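If you want to reject bad tags before a request ever reaches the server, the key rule (1-64 characters, alphanumeric plus `-` and `_`) is easy to mirror client-side. The Python sketch below assumes "alphanumeric" means ASCII letters and digits; `invalid_tag_keys` is a hypothetical helper, not part of any Nika client.

```python
import re

# 1-64 characters from ASCII letters, digits, "-" and "_" (ASCII is an assumption).
TAG_KEY = re.compile(r"[A-Za-z0-9_-]{1,64}")


def invalid_tag_keys(tags):
    """Return the keys of `tags` that would fail the documented key rule."""
    return [
        key
        for key in tags
        if not (isinstance(key, str) and TAG_KEY.fullmatch(key))
    ]
```

An empty result means every key is acceptable; anything else can be reported to the caller before submission.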
Lint Rules L080 + L090
Two new lint rules join the nika lint suite, bringing the total to 10 rules (L001 through L090). These catch workflow quality issues that aren't syntax errors but will bite you in production.
Fixes (7)
- Golden file comparison: Previously a stub that always passed. Now performs real byte-level comparison against reference files, making `nika test` actually useful for regression testing.
- `sum` transform: Restricted to numeric arrays only. Previously silently coerced non-numbers, producing garbage results without warning.
- `min_by`/`max_by` debug logging: Removed noisy debug output that leaked into production builds. These transforms now operate silently as expected.
- L060 lint rule: Corrected terminal vs orphan node detection logic. It was flagging valid patterns and missing actual orphans.
- Batch two-pass validation: First pass now correctly rejects malformed requests before any execution begins, preventing partial batch failures.
- Tag key validation: Rejects empty keys, keys with special characters, and keys exceeding 64 characters. Previously accepted anything.
- `has_more` pagination: Cursor-based pagination correctly reports `has_more: false` on the final page instead of requiring one extra empty request.
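One common way to get an accurate `has_more` on the final page is to read one row more than the requested limit and check whether it exists. The Python sketch below shows that technique over a sorted list of job IDs; it is not necessarily how Nika implements the fix, and `page_of` and `next_cursor` are hypothetical names.

```python
def page_of(job_ids, cursor, limit):
    """Return one page from a sorted list of job IDs using the limit+1 technique."""
    # Keep only rows strictly after the cursor (the last ID the client has seen).
    remaining = [job_id for job_id in job_ids if cursor is None or job_id > cursor]
    window = remaining[: limit + 1]  # read one extra row
    has_more = len(window) > limit   # the extra row exists only if more data follows
    items = window[:limit]           # never return the extra row itself
    return {
        "jobs": items,
        "has_more": has_more,
        "next_cursor": items[-1] if has_more else None,
    }
```

With four jobs and `limit=2`, the second page is exactly full yet still reports `has_more: False`, so the client does not need an extra empty request to learn that it has reached the end.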
Upgrade Notes
Note
Storage V4 migration runs automatically on first start. Your existing jobs database is preserved: the migration adds a `tags` column and creates an index. No action needed.
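For readers curious what a migration of this shape involves, here is a minimal Python sketch using the standard `sqlite3` module. The table name `jobs`, the column type and default, the index name, and the use of `PRAGMA user_version` for version tracking are all assumptions; the notes only state that a `tags` column and an index are added.

```python
import sqlite3


def migrate_to_v4(conn):
    """Add a tags column and index once; return True if the migration ran."""
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    if version >= 4:
        return False  # already migrated, nothing to do

    # Existing rows are preserved and receive an empty JSON object as their tags.
    conn.execute("ALTER TABLE jobs ADD COLUMN tags TEXT NOT NULL DEFAULT '{}'")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_tags ON jobs(tags)")
    conn.execute("PRAGMA user_version = 4")
    conn.commit()
    return True
```

Checking the stored version first makes the function safe to call on every start, which is the behavior the note above describes.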
Warning
If you relied on nika test golden file comparison always passing (it was a no-op stub), your tests may now fail if the golden files don't match actual output. Regenerate golden files with nika test --update.
Install
| Method | Command |
|---|---|
| Quick | `curl -fsSL https://raw.githubusercontent.com/supernovae-st/nika/main/install.sh \| sh` |
| Homebrew | `brew install supernovae-st/tap/nika` |
| npm | `npx @supernovae-st/nika` |
| Cargo | `cargo install nika` |
| Docker | `docker run --rm ghcr.io/supernovae-st/nika:0.70.0` |
| VS Code | Search "Nika" or `ext install supernovae.nika-lang` |
Made by SuperNovae Studio · Open Source, AGPL-3.0
Full Changelog: v0.69.0...v0.70.0