Release v0.70.0 — Eval Framework · supernovae-st/nika

🦋 Nika 0.70.0 — Eval Framework

Inference as Code · April 5, 2026 · 16 commits

| 🧪 Tests | 🔧 Builtins | 📦 Transforms | 🌐 Providers | 🦀 Crates |
| --- | --- | --- | --- | --- |
| 10,142 | 62 | 63 | 14 | 17 |

✧ infer · ⎈ exec · ☄ fetch · ⊛ invoke · ❋ agent

✨ This Release in 30 Seconds

Workflows deserve tests. Until now, validating a Nika workflow meant running it against a real provider and eyeballing the output. That era is over. v0.70.0 introduces nika eval — a dataset-driven evaluation framework that lets you define inputs, set assertions, and get a pass/fail report you can wire into any CI pipeline. Pair that with a brand-new batch endpoint that accepts up to 50 workflows in a single POST, queryable job tags for organizing production runs, and a Storage V4 migration that makes it all filterable at the SQLite level — and you've got the building blocks for serious workflow ops. This is the release where Nika stops being "run and hope" and starts being "test, tag, batch, observe."

🧪 `nika eval` — Dataset-Driven Workflow Evaluation

Every LLM workflow has a question nobody wants to answer: "Did the last refactor break anything?" Rerunning the workflow by hand against a real provider is slow and expensive. nika eval solves this by letting you define a dataset of inputs and expectations, then running your workflow against each row and checking the assertions automatically.

Four assertion types ship out of the box:

| Assertion | What it checks | Example |
| --- | --- | --- |
| `output_contains` | Substring present in output | `output_contains: "ownership"` |
| `output_min_words` | Output has at least N words | `output_min_words: 50` |
| `output_max_words` | Output has at most N words | `output_max_words: 500` |
| `output_matches_schema` | Output validates against JSON Schema | `type: object, required: [summary]` |
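To make the semantics of the four assertions concrete, here is a minimal Python sketch of how one row's expectations could be checked. It is not Nika's implementation: the function names are hypothetical, the substring match is assumed to be case-sensitive, words are assumed to be whitespace-separated, and the schema check covers only the `type: object` and `required` keywords rather than full JSON Schema.

```python
import json


def check_schema_subset(output: str, schema: dict) -> list[str]:
    """Check only `type: object` and `required` (assumption: not full JSON Schema)."""
    try:
        value = json.loads(output)
    except json.JSONDecodeError:
        return ["output_matches_schema: output is not valid JSON"]
    if schema.get("type") == "object" and not isinstance(value, dict):
        return ["output_matches_schema: expected an object"]
    present = value if isinstance(value, dict) else {}
    return [
        f'output_matches_schema: missing field "{name}"'
        for name in schema.get("required", [])
        if name not in present
    ]


def check_expectations(output: str, expect: dict) -> list[str]:
    """Return failure messages for one dataset row; an empty list means PASS."""
    failures = []
    word_count = len(output.split())  # assumption: words are whitespace-separated

    needle = expect.get("output_contains")
    if needle is not None and needle not in output:
        failures.append(f'output_contains: "{needle}" not found')

    min_words = expect.get("output_min_words")
    if min_words is not None and word_count < min_words:
        failures.append(f"output_min_words: {word_count} < {min_words}")

    max_words = expect.get("output_max_words")
    if max_words is not None and word_count > max_words:
        failures.append(f"output_max_words: {word_count} > {max_words}")

    schema = expect.get("output_matches_schema")
    if schema is not None:
        failures.extend(check_schema_subset(output, schema))

    return failures
```

A row passes when every assertion listed under `expect` holds, which is why the function collects all failures instead of stopping at the first one.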

# eval-dataset.yaml
- inputs: { topic: "Rust memory safety" }
  expect:
    output_contains: "ownership"
    output_min_words: 50

- inputs: { topic: "Python async" }
  expect:
    output_contains: "asyncio"
    output_max_words: 500

- inputs: { topic: "WebAssembly" }
  expect:
    output_matches_schema:
      type: object
      required: [summary, key_points]

# Run eval — mock provider by default (zero cost, instant)
nika eval research.nika.yaml --dataset eval-dataset.yaml

# JSON output for CI pipeline gates
nika eval research.nika.yaml --dataset eval-dataset.yaml --format json

💡 Example output

🧪 Eval: research.nika.yaml (3 rows)

  ✅ Row 1: PASS  (2/2 assertions)
  ✅ Row 2: PASS  (2/2 assertions)
  ❌ Row 3: FAIL  (output_matches_schema: missing field "key_points")
  ─────────────────────────────
  Summary: 2/3 passed

Exit code: 1 (use in CI: any failure = non-zero)

Tip

nika eval uses --provider mock by default, so your evaluation datasets run instantly with zero API cost. Switch to --provider anthropic when you want real LLM validation.
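Because any failing row produces a non-zero exit code, a CI gate can be as small as a wrapper that forwards that code. The sketch below is one way to do this from Python. The helper names are hypothetical, and combining `--format` with `--provider` in one invocation is an assumption; only the individual flags appear in these notes.

```python
import subprocess


def build_eval_command(workflow, dataset, fmt=None, provider=None):
    """Assemble the argv for `nika eval` using only flags shown in these notes."""
    command = ["nika", "eval", workflow, "--dataset", dataset]
    if fmt is not None:
        command += ["--format", fmt]
    if provider is not None:
        # Assumption: --provider can be combined with the other flags.
        command += ["--provider", provider]
    return command


def run_eval_gate(workflow, dataset, **options):
    """Run the evaluation and return its exit code (0 means every row passed)."""
    completed = subprocess.run(build_eval_command(workflow, dataset, **options))
    return completed.returncode
```

A CI step could then call `sys.exit(run_eval_gate("research.nika.yaml", "eval-dataset.yaml", fmt="json"))` so the pipeline fails whenever a row fails.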

📦 `POST /v1/batch/run` — Submit 50 Workflows at Once

If you're orchestrating Nika from an external system — a Node.js backend, a Python script, a CI job — sending one HTTP request per workflow is painful. The new batch endpoint accepts up to 50 workflow executions in a single POST, with two-pass validation: syntax check all requests first, then submit for execution. No more partial failures where request 47 fails validation after 46 are already running.

curl -X POST http://localhost:3000/v1/batch/run \
  -H "Authorization: Bearer $NIKA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "requests": [
      {
        "workflow": "translate.nika.yaml",
        "inputs": { "lang": "fr", "source": "README.md" },
        "tags": { "env": "prod", "team": "i18n" }
      },
      {
        "workflow": "translate.nika.yaml",
        "inputs": { "lang": "de", "source": "README.md" },
        "tags": { "env": "prod", "team": "i18n" }
      }
    ]
  }'

💡 Response structure

{
  "jobs": [
    { "job_id": "abc-123", "workflow": "translate.nika.yaml", "state": "queued" },
    { "job_id": "def-456", "workflow": "translate.nika.yaml", "state": "queued" }
  ],
  "accepted": 2,
  "rejected": 0
}
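For callers with more than 50 executions to submit, the 50-request cap means splitting the work into several POSTs. Here is a rough Python client sketch using only the standard library. The helper names are hypothetical, and it assumes the response carries a `jobs` array with `job_id` fields as in the example above.

```python
import json
import urllib.request

MAX_BATCH_SIZE = 50  # limit stated for POST /v1/batch/run


def chunk_requests(requests, size=MAX_BATCH_SIZE):
    """Split a list of run requests into batches of at most `size` items."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]


def build_batch_request(base_url, token, requests):
    """Build (but do not send) the HTTP request for one batch."""
    body = json.dumps({"requests": requests}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/batch/run",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_all(base_url, token, requests):
    """Send every batch in order and collect the returned job ids."""
    job_ids = []
    for chunk in chunk_requests(requests):
        with urllib.request.urlopen(build_batch_request(base_url, token, chunk)) as response:
            payload = json.load(response)
        job_ids.extend(job["job_id"] for job in payload["jobs"])
    return job_ids
```

Note that each chunk is validated on its own, so a rejection in a later chunk does not undo jobs queued by an earlier one.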

Note

Two-pass validation means the batch endpoint checks every request's syntax and workflow existence before queueing any of them. Either the entire batch is accepted, or you get a detailed error report for each rejected request.
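The two-pass idea can be sketched in a few lines of Python. This illustrates the pattern only, not the server's actual code: `validate` and `enqueue` are hypothetical callables supplied by the caller, and the shape of the returned dictionary is an assumption.

```python
MAX_BATCH_SIZE = 50  # limit stated for the batch endpoint


def run_batch(requests, validate, enqueue):
    """Validate every request first; queue jobs only if all of them pass.

    validate(request) returns an error string, or None when the request is fine.
    enqueue(request) queues one job and returns its id.
    """
    if len(requests) > MAX_BATCH_SIZE:
        message = f"batch has {len(requests)} requests, limit is {MAX_BATCH_SIZE}"
        return {"jobs": [], "errors": [{"index": None, "error": message}]}

    # Pass 1: collect every problem without starting any work.
    errors = []
    for index, request in enumerate(requests):
        problem = validate(request)
        if problem is not None:
            errors.append({"index": index, "error": problem})
    if errors:
        return {"jobs": [], "errors": errors}

    # Pass 2: the whole batch is known to be valid, so queue it.
    return {"jobs": [enqueue(request) for request in requests], "errors": []}
```

Keeping the two loops separate is what rules out the "request 47 fails after 46 are running" case: nothing is queued until the first loop has finished cleanly.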

🏷️ Job Tags + Filtered Listing

Production workflows need metadata. Which environment? Which team? Which customer triggered this? Job tags let you attach arbitrary key-value pairs to any workflow execution, and the new GET /v1/jobs endpoint makes them filterable.

# List all completed jobs tagged for production
curl "http://localhost:3000/v1/jobs?state=completed&tag.env=prod&limit=20"

# Filter by team
curl "http://localhost:3000/v1/jobs?tag.team=i18n&workflow=translate.nika.yaml"

Tags are stored in SQLite (Storage V4 migration) and queried server-side — no client-side filtering needed. Cursor-based pagination handles large result sets with a has_more flag.

| Feature | Detail |
| --- | --- |
| 🏷️ Tag format | JSON key-value: `{ "env": "prod", "team": "i18n" }` |
| 🔍 Query syntax | `?tag.key=value` in query string |
| 📄 Pagination | Cursor-based with `?cursor=xxx&limit=20` |
| ✅ Validation | Keys: 1-64 chars, alphanumeric + `-_` only |
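The query syntax and cursor pagination above lend themselves to a small client loop. The Python sketch below builds the query string and walks pages until `has_more` is false. The `jobs` and `next_cursor` response fields are assumptions (only `has_more` is named in these notes), and `fetch_page` is a hypothetical callable that performs the GET and returns the decoded JSON.

```python
from urllib.parse import urlencode


def build_jobs_query(state=None, workflow=None, tags=None, cursor=None, limit=20):
    """Build the path and query string for GET /v1/jobs."""
    params = {}
    if state is not None:
        params["state"] = state
    if workflow is not None:
        params["workflow"] = workflow
    for key, value in (tags or {}).items():
        params[f"tag.{key}"] = value  # ?tag.key=value syntax
    if cursor is not None:
        params["cursor"] = cursor
    params["limit"] = limit
    return "/v1/jobs?" + urlencode(params)


def iter_jobs(fetch_page, **filters):
    """Yield jobs across all pages, following the cursor until has_more is false."""
    cursor = None
    while True:
        page = fetch_page(build_jobs_query(cursor=cursor, **filters))
        yield from page["jobs"]  # assumption: jobs are returned under "jobs"
        if not page.get("has_more"):
            return
        cursor = page["next_cursor"]  # assumption: field name for the next cursor
```

Passing `fetch_page` in from outside keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned pages.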

Tip

Tags are set at submission time via the tags field in RunRequest. They're immutable after submission — think of them as labels, not mutable state.
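Since tags cannot be changed after submission, it can be worth checking keys on the client before sending a request. The Python sketch below mirrors the stated rule of 1-64 characters, alphanumeric plus `-` and `_`. It assumes "alphanumeric" means ASCII letters and digits, and the function name is hypothetical.

```python
import re

# 1-64 characters drawn from ASCII letters, digits, "-" and "_".
# Assumption: "alphanumeric" in the validation rule means ASCII only.
TAG_KEY_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def invalid_tag_keys(tags):
    """Return the tag keys that would fail the documented key rule."""
    return [key for key in tags if not TAG_KEY_PATTERN.fullmatch(key)]
```

For example, `invalid_tag_keys({"env": "prod", "team name": "i18n"})` returns `["team name"]` because of the space.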

🔧 Lint Rules L080 + L090

Two new lint rules join the nika lint suite, bringing the total to 10 rules (L001 through L090). These catch workflow quality issues that aren't syntax errors but will bite you in production.

🐛 Fixes (7)

🎯 Golden file comparison — Previously a stub that always passed. Now performs real byte-level comparison against reference files, making nika test actually useful for regression testing.
🔢 sum transform — Restricted to numeric arrays only. Previously silently coerced non-numbers, producing garbage results without warning.
📊 min_by / max_by debug logging — Removed noisy debug output that leaked into production builds. These transforms now operate silently as expected.
🔍 L060 lint rule — Corrected terminal vs orphan node detection logic. Was flagging valid patterns and missing actual orphans.
📦 Batch two-pass validation — First pass now correctly rejects malformed requests before any execution begins, preventing partial batch failures.
🏷️ Tag key validation — Rejects empty keys, keys with special characters, and keys exceeding 64 characters. Previously accepted anything.
📄 has_more pagination — Cursor-based pagination correctly reports has_more: false on the final page instead of requiring one extra empty request.
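To illustrate the difference the `sum` fix makes, here is a small Python sketch of a strict numeric sum that refuses non-numbers instead of coercing them. It is an illustration of the behavior described above, not the transform's source. Raising an error and treating booleans as non-numeric are both assumptions.

```python
from numbers import Real


def strict_sum(values):
    """Sum a list of numbers, rejecting anything that is not numeric."""
    for index, value in enumerate(values):
        # bool is a subclass of int in Python; rejecting it is an assumption.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise TypeError(f"sum: element {index} is not a number: {value!r}")
    return sum(values)
```

With this shape, an input such as `[1, "2", 3]` fails loudly at element 1 rather than producing a plausible but wrong total.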

⬆️ Upgrade Notes

Note

Storage V4 migration runs automatically on first start. Your existing jobs database is preserved — the migration adds a tags column and creates an index. No action needed.
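For readers curious what an additive migration of this kind looks like, here is a rough Python sketch against an in-memory style SQLite schema. Everything in it is hypothetical: the `jobs` table, the `tags` column type and default, the index name, and the use of `PRAGMA user_version` for versioning are assumptions for illustration, not Nika's actual schema.

```python
import sqlite3

TARGET_VERSION = 4


def migrate_to_v4(conn: sqlite3.Connection) -> None:
    """Add a tags column and an index, leaving existing rows untouched."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    if current >= TARGET_VERSION:
        return  # already migrated, so running again is a no-op
    with conn:  # commit on success, roll back on error
        # Hypothetical schema: tags stored as a JSON string, empty object by default.
        conn.execute("ALTER TABLE jobs ADD COLUMN tags TEXT NOT NULL DEFAULT '{}'")
        conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_tags ON jobs(tags)")
        conn.execute(f"PRAGMA user_version = {TARGET_VERSION}")
```

Because the change only adds a column with a default and an index, rows written before the upgrade stay readable and simply carry empty tags.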

Warning

If you relied on nika test golden file comparison always passing (it was a no-op stub), your tests may now fail if the golden files don't match actual output. Regenerate golden files with nika test --update.

📦 Install

|  | Method | Command |
| --- | --- | --- |
| 🚀 | Quick | `curl -fsSL https://raw.githubusercontent.com/supernovae-st/nika/main/install.sh \| sh` |
| 🍺 | Homebrew | `brew install supernovae-st/tap/nika` |
| 📦 | npm | `npx @supernovae-st/nika` |
| 🦀 | Cargo | `cargo install nika` |
| 🐳 | Docker | `docker run --rm ghcr.io/supernovae-st/nika:0.70.0` |
| 💻 | VS Code | Search "Nika" or `ext install supernovae.nika-lang` |

Made with 💜 by SuperNovae Studio — Open Source, AGPL-3.0

Full Changelog: v0.69.0...v0.70.0
