v0.70.0 - Eval Framework

@github-actions github-actions released this 05 Apr 10:13

🦋 Nika 0.70.0 - Eval Framework

Inference as Code · April 5, 2026 · 16 commits

| 🧪 Tests | 🔧 Builtins | 📦 Transforms | 🌐 Providers | 🦀 Crates |
| --- | --- | --- | --- | --- |
| 10,142 | 62 | 63 | 14 | 17 |

✧ infer · ⎈ exec · ☄ fetch · ⊛ invoke · ❋ agent


✨ This Release in 30 Seconds

Workflows deserve tests. Until now, validating a Nika workflow meant running it against a real provider and eyeballing the output. That era is over. v0.70.0 introduces nika eval, a dataset-driven evaluation framework that lets you define inputs, set assertions, and get a pass/fail report you can wire into any CI pipeline. Pair that with a brand-new batch endpoint that accepts up to 50 workflows in a single POST, queryable job tags for organizing production runs, and a Storage V4 migration that makes it all filterable at the SQLite level, and you've got the building blocks for serious workflow ops. This is the release where Nika stops being "run and hope" and starts being "test, tag, batch, observe."


🧪 nika eval - Dataset-Driven Workflow Evaluation

Every LLM workflow has a question nobody wants to answer: "Did the last refactor break anything?" Re-running the workflow by hand against a real provider is slow and expensive. nika eval solves this by letting you define a dataset of inputs + expectations, then running your workflow against each row with automatic assertions.

Four assertion types ship out of the box:

| Assertion | What it checks | Example |
| --- | --- | --- |
| output_contains | Substring present in output | output_contains: "ownership" |
| output_min_words | Output has at least N words | output_min_words: 50 |
| output_max_words | Output has at most N words | output_max_words: 500 |
| output_matches_schema | Output validates against JSON Schema | type: object, required: [summary] |

# eval-dataset.yaml
- inputs: { topic: "Rust memory safety" }
  expect:
    output_contains: "ownership"
    output_min_words: 50

- inputs: { topic: "Python async" }
  expect:
    output_contains: "asyncio"
    output_max_words: 500

- inputs: { topic: "WebAssembly" }
  expect:
    output_matches_schema:
      type: object
      required: [summary, key_points]

# Run eval - mock provider by default (zero cost, instant)
nika eval research.nika.yaml --dataset eval-dataset.yaml

# JSON output for CI pipeline gates
nika eval research.nika.yaml --dataset eval-dataset.yaml --format json

💡 Example output
🧪 Eval: research.nika.yaml (3 rows)

  ✅ Row 1: PASS  (2/2 assertions)
  ✅ Row 2: PASS  (2/2 assertions)
  ❌ Row 3: FAIL  (output_matches_schema: missing field "key_points")
  ─────────────────────────────
  Summary: 2/3 passed

Exit code: 1 (use in CI: any failure = non-zero)
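
To make the four assertion types concrete, here is a minimal Python sketch of how they could be interpreted against a single output string. It is an illustration, not Nika's implementation: the function name check_row is hypothetical, word counting is assumed to be whitespace splitting, and the schema check only covers the type: object and required keys used in the example dataset rather than full JSON Schema.

```python
import json

def check_row(output: str, expect: dict) -> list[str]:
    """Return a list of failure messages for one dataset row (empty list = PASS)."""
    failures = []
    words = len(output.split())  # assumption: words are whitespace-separated

    if "output_contains" in expect and expect["output_contains"] not in output:
        failures.append("output_contains")
    if "output_min_words" in expect and words < expect["output_min_words"]:
        failures.append("output_min_words")
    if "output_max_words" in expect and words > expect["output_max_words"]:
        failures.append("output_max_words")

    schema = expect.get("output_matches_schema")
    if schema is not None:
        # Simplified: only `type: object` and `required` are checked here.
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            data = None
        if schema.get("type") == "object" and not isinstance(data, dict):
            failures.append("output_matches_schema: not an object")
        elif isinstance(data, dict):
            for field in schema.get("required", []):
                if field not in data:
                    failures.append(f'output_matches_schema: missing field "{field}"')
    return failures
```

Under these assumptions, an output of {"summary": "..."} checked against required: [summary, key_points] yields a single failure naming key_points, which mirrors Row 3 in the example report.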

Tip

nika eval uses --provider mock by default, so your evaluation datasets run instantly with zero API cost. Switch to --provider anthropic when you want real LLM validation.
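
One way to wire this into CI from a Python script is to shell out to nika eval and rely on its exit code. The sketch below uses only flags that appear in these notes (--dataset, --format json, --provider) and assumes they can be combined; the helper names are hypothetical, and the JSON report is passed through unparsed because its shape is not documented here.

```python
import subprocess
import sys

def build_eval_command(workflow: str, dataset: str, provider: str = "mock") -> list[str]:
    """Assemble the nika eval invocation as an argument list (no shell quoting needed)."""
    return [
        "nika", "eval", workflow,
        "--dataset", dataset,
        "--provider", provider,
        "--format", "json",
    ]

def run_gate(workflow: str, dataset: str, provider: str = "mock") -> int:
    """Run the eval and return its exit code; non-zero signals a failure."""
    result = subprocess.run(
        build_eval_command(workflow, dataset, provider),
        capture_output=True,
        text=True,
    )
    # Forward the report unchanged so the CI log keeps the full details.
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    return result.returncode
```

A CI step could then end with sys.exit(run_gate("research.nika.yaml", "eval-dataset.yaml")) so that the job fails whenever the eval does.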


📦 POST /v1/batch/run - Submit 50 Workflows at Once

If you're orchestrating Nika from an external system (a Node.js backend, a Python script, a CI job), sending one HTTP request per workflow is painful. The new batch endpoint accepts up to 50 workflow executions in a single POST, with two-pass validation: syntax check all requests first, then submit for execution. No more partial failures where request 47 fails validation after 46 are already running.

curl -X POST http://localhost:3000/v1/batch/run \
  -H "Authorization: Bearer $NIKA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "requests": [
      {
        "workflow": "translate.nika.yaml",
        "inputs": { "lang": "fr", "source": "README.md" },
        "tags": { "env": "prod", "team": "i18n" }
      },
      {
        "workflow": "translate.nika.yaml",
        "inputs": { "lang": "de", "source": "README.md" },
        "tags": { "env": "prod", "team": "i18n" }
      }
    ]
  }'

💡 Response structure
{
  "jobs": [
    { "job_id": "abc-123", "workflow": "translate.nika.yaml", "state": "queued" },
    { "job_id": "def-456", "workflow": "translate.nika.yaml", "state": "queued" }
  ],
  "accepted": 2,
  "rejected": 0
}
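
From a Python client, the 50-request cap suggests splitting larger workloads into chunks before posting. The following is a sketch under assumptions: the URL and bearer token mirror the curl example, chunk_requests and submit_batch are hypothetical helper names, the language codes are placeholders, and HTTP error handling is left to the caller.

```python
import json
import urllib.request

MAX_BATCH = 50  # per-POST limit stated in these release notes

def chunk_requests(requests: list[dict], size: int = MAX_BATCH) -> list[list[dict]]:
    """Split a list of run requests into batches of at most `size` items."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]

def submit_batch(batch: list[dict], token: str,
                 base_url: str = "http://localhost:3000") -> dict:
    """POST one batch to /v1/batch/run and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/batch/run",
        data=json.dumps({"requests": batch}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Build 120 translation requests and split them into 50 + 50 + 20.
langs = [f"lang-{n}" for n in range(120)]  # placeholder language codes
requests = [
    {"workflow": "translate.nika.yaml",
     "inputs": {"lang": lang, "source": "README.md"},
     "tags": {"env": "prod", "team": "i18n"}}
    for lang in langs
]
batches = chunk_requests(requests)
print([len(b) for b in batches])  # [50, 50, 20]
```

Each chunk would then go through submit_batch(chunk, token) in turn. Note that the all-or-nothing guarantee applies per POST, so one rejected chunk does not undo chunks that were already accepted.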

Note

Two-pass validation means the batch endpoint checks every request's syntax and workflow existence before queueing any of them. Either the entire batch is accepted, or you get a detailed error report for each rejected request.
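
The all-or-nothing behaviour can be pictured as two separate loops: one that only collects errors, and one that only runs if there were none. This is a conceptual Python sketch, not the server's code; known_workflows, validate_request, submit_two_pass, the result keys, and the error strings are all invented for illustration.

```python
def validate_request(req: dict, known_workflows: set) -> list:
    """Pass 1 helper: return the problems found in one request (empty = valid)."""
    errors = []
    workflow = req.get("workflow")
    if not isinstance(workflow, str) or not workflow:
        errors.append("missing workflow")
    elif workflow not in known_workflows:
        errors.append(f"unknown workflow: {workflow}")
    if not isinstance(req.get("inputs", {}), dict):
        errors.append("inputs must be an object")
    return errors

def submit_two_pass(requests: list, known_workflows: set) -> dict:
    """Validate every request first; queue them only if all are valid."""
    # Pass 1: validate everything, queue nothing.
    report = {}
    for index, req in enumerate(requests):
        errors = validate_request(req, known_workflows)
        if errors:
            report[index] = errors
    if report:
        return {"queued": [], "errors": report}
    # Pass 2: only reached when every request passed validation.
    return {"queued": [req["workflow"] for req in requests], "errors": {}}
```

The point of the split is that a bad request at position 47 is discovered before anything at positions 1 to 46 has started running.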


🏷️ Job Tags + Filtered Listing

Production workflows need metadata. Which environment? Which team? Which customer triggered this? Job tags let you attach arbitrary key-value pairs to any workflow execution, and the new GET /v1/jobs endpoint makes them filterable.

# List all completed jobs tagged for production
curl "http://localhost:3000/v1/jobs?state=completed&tag.env=prod&limit=20"

# Filter by team
curl "http://localhost:3000/v1/jobs?tag.team=i18n&workflow=translate.nika.yaml"

Tags are stored in SQLite (Storage V4 migration) and queried server-side, so no client-side filtering is needed. Cursor-based pagination handles large result sets with a has_more flag.
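
If you build these URLs programmatically, the tag.<key>=<value> convention maps cleanly onto a small helper. This is a sketch using only the Python standard library; jobs_url is a hypothetical name, and the base URL is the one from the curl examples.

```python
from typing import Optional
from urllib.parse import urlencode

def jobs_url(base_url: str, tags: Optional[dict] = None, **filters) -> str:
    """Build a GET /v1/jobs URL with plain filters plus tag.<key>=<value> pairs."""
    params = dict(filters)
    for key, value in (tags or {}).items():
        params[f"tag.{key}"] = value
    return f"{base_url}/v1/jobs?{urlencode(params)}"

print(jobs_url("http://localhost:3000", tags={"env": "prod"}, state="completed", limit=20))
# http://localhost:3000/v1/jobs?state=completed&limit=20&tag.env=prod
```

Using urlencode rather than string concatenation keeps tag values with spaces or symbols correctly escaped.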

| Feature | Detail |
| --- | --- |
| 🏷️ Tag format | JSON key-value: { "env": "prod", "team": "i18n" } |
| 🔍 Query syntax | ?tag.key=value in query string |
| 📄 Pagination | Cursor-based with ?cursor=xxx&limit=20 |
| ✅ Validation | Keys: 1-64 chars, alphanumeric + -_ only |
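
A client-side loop over the cursor could look like the sketch below. Only has_more and the ?cursor= parameter come from these notes; the jobs and next_cursor response fields are assumed names, and the HTTP call is replaced by a lookup table so the loop can be exercised without a server.

```python
from typing import Callable, Iterator, Optional

def iter_jobs(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every job by following cursors until has_more is false."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["jobs"]        # assumed field name
        if not page.get("has_more"):   # has_more is named in these notes
            break
        cursor = page["next_cursor"]   # assumed field name

# Stand-in for GET /v1/jobs?cursor=...: maps a cursor to a canned page.
fake_pages = {
    None: {"jobs": [{"job_id": "abc-123"}], "has_more": True, "next_cursor": "c1"},
    "c1": {"jobs": [{"job_id": "def-456"}], "has_more": False},
}
print([job["job_id"] for job in iter_jobs(fake_pages.__getitem__)])
# ['abc-123', 'def-456']
```

In real use, fetch_page would issue the request with the given cursor and return the decoded JSON body.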

Tip

Tags are set at submission time via the tags field in RunRequest. They're immutable after submission: think of them as labels, not mutable state.
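
Because tags cannot be changed later, it can be worth checking keys before submitting. The sketch below encodes the stated rule (1-64 characters, alphanumeric plus - and _) as a regular expression; it assumes "alphanumeric" means ASCII letters and digits, and invalid_tag_keys is a hypothetical helper name.

```python
import re

# 1-64 characters drawn from ASCII letters, digits, hyphen, and underscore.
TAG_KEY_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")

def invalid_tag_keys(tags: dict) -> list:
    """Return the keys that would fail the documented tag key rule."""
    return [
        key for key in tags
        if not (isinstance(key, str) and TAG_KEY_RE.fullmatch(key))
    ]

print(invalid_tag_keys({"env": "prod", "bad key!": "x", "": "y"}))
# ['bad key!', '']
```

The server remains the source of truth; a local check like this only saves a round trip.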


🔧 Lint Rules L080 + L090

Two new lint rules join the nika lint suite, bringing the total to 10 rules (L001 through L090). These catch workflow quality issues that aren't syntax errors but will bite you in production.


🐛 Fixes (7)
  • 🎯 Golden file comparison: Previously a stub that always passed. Now performs real byte-level comparison against reference files, making nika test actually useful for regression testing.
  • 🔢 sum transform: Restricted to numeric arrays only. Previously silently coerced non-numbers, producing garbage results without warning.
  • 📊 min_by / max_by debug logging: Removed noisy debug output that leaked into production builds. These transforms now operate silently as expected.
  • 🔍 L060 lint rule: Corrected terminal vs orphan node detection logic. Was flagging valid patterns and missing actual orphans.
  • 📦 Batch two-pass validation: First pass now correctly rejects malformed requests before any execution begins, preventing partial batch failures.
  • 🏷️ Tag key validation: Rejects empty keys, keys with special characters, and keys exceeding 64 characters. Previously accepted anything.
  • 📄 has_more pagination: Cursor-based pagination correctly reports has_more: false on the final page instead of requiring one extra empty request.
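
To illustrate the sum fix, here is what "numeric arrays only" could look like in a Python analogue. This is not Nika's transform code: strict_sum is a hypothetical name, and raising an error is an assumption about how a non-numeric element is rejected.

```python
from numbers import Real

def strict_sum(values: list):
    """Sum a list, raising instead of coercing when an element is not a number."""
    for index, value in enumerate(values):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise TypeError(f"sum: element {index} is not a number: {value!r}")
    return sum(values)

print(strict_sum([1, 2.5, 3]))  # 6.5
```

With a list such as [1, "2"], this version fails loudly at element 1 rather than producing a silently coerced total.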

⬆️ Upgrade Notes

Note

Storage V4 migration runs automatically on first start. Your existing jobs database is preserved: the migration adds a tags column and creates an index. No action needed.
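
For readers curious what an additive migration of this kind involves, the sketch below reproduces the idea against an in-memory SQLite database. The jobs table layout, the TEXT column holding JSON, the default value, and the index name are all assumptions made for illustration; the real schema is not shown in these notes.

```python
import sqlite3

def migrate_to_v4(conn: sqlite3.Connection) -> None:
    """Add a tags column and an index if they are not already present."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(jobs)")}
    if "tags" not in columns:
        # Existing rows receive the default, so no data is rewritten or lost.
        conn.execute("ALTER TABLE jobs ADD COLUMN tags TEXT NOT NULL DEFAULT '{}'")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_tags ON jobs(tags)")
    conn.commit()

conn = sqlite3.connect(":memory:")
# Hypothetical pre-V4 schema with one existing job.
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO jobs VALUES ('abc-123', 'completed')")
migrate_to_v4(conn)
migrate_to_v4(conn)  # safe to run twice: the second call changes nothing
print(conn.execute("SELECT job_id, state, tags FROM jobs").fetchall())
# [('abc-123', 'completed', '{}')]
```

The existing row survives with an empty tag set, which is the property the upgrade note describes.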

Warning

If you relied on nika test golden file comparison always passing (it was a no-op stub), your tests may now fail if the golden files don't match actual output. Regenerate golden files with nika test --update.
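
Byte-level comparison is stricter than it may sound: differences in trailing newlines or line endings count as mismatches. The Python sketch below shows the general idea with temporary files; matches_golden and the file names are hypothetical and say nothing about how nika test locates its golden files.

```python
import tempfile
from pathlib import Path

def matches_golden(actual: Path, golden: Path) -> bool:
    """Byte-level comparison: any difference, including a trailing newline, fails."""
    return actual.read_bytes() == golden.read_bytes()

with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp, "summary.golden")
    actual = Path(tmp, "summary.out")
    golden.write_bytes(b"Rust uses ownership.\n")
    actual.write_bytes(b"Rust uses ownership.")  # same text, no trailing newline
    print(matches_golden(actual, golden))  # False
```

If previously "passing" tests start failing after the upgrade, whitespace-only differences like this one are a likely cause to check before regenerating.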


📦 Install

Method Command
🚀 Quick curl -fsSL https://raw.githubusercontent.com/supernovae-st/nika/main/install.sh | sh
🍺 Homebrew brew install supernovae-st/tap/nika
📦 npm npx @supernovae-st/nika
🦀 Cargo cargo install nika
🐳 Docker docker run --rm ghcr.io/supernovae-st/nika:0.70.0
💻 VS Code Search "Nika" or ext install supernovae.nika-lang

Made with 💜 by SuperNovae Studio - Open Source, AGPL-3.0

Full Changelog: v0.69.0...v0.70.0