test(pipelines): add CI validation for official oneshot samples#542
Conversation
The official samples/pipelines/oneshot/*.yml had zero automated coverage — only the small samples/pipelines/test/ fixtures were validated. This adds a manifest-driven harness that runs every oneshot sample against a live skit server and validates its output (ffprobe for media, JSON parse for transcription/VAD), wired into both e2e pipeline-validation jobs. - New oneshot test target + oneshot-samples.toml manifest covering all 38 samples; a sample without a manifest entry fails the harness so new ones get triaged. - Generalise run_pipeline into run_pipeline_parts to support multi-field uploads (dual-input mixers) and text inputs (TTS); add JSON output validation; make the HTTP timeout configurable via PIPELINE_TIMEOUT_SECS. - Skip controls: requires_node, optional_node (plugins / VA-API — skip even under PIPELINE_REQUIRE_NODES=1), requires_env (S3 creds), and slow (opt in via PIPELINE_INCLUDE_SLOW=1 for realtime-paced / software-AV1 showcases). - Add a just test-oneshot-samples recipe. All 38 samples validated locally (Tier-A/B/EXT pass; HW skipped without a GPU). No sample was broken. Signed-off-by: streamkit-devin <devin@streamkit.dev>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
| requires_node = "plugin::native::nllb" | ||
| optional_node = true |
There was a problem hiding this comment.
🟡 Single-node gating lets multi-plugin samples run with missing dependencies
Several manifest entries gate an optional sample on only one node even though the YAML uses additional optional nodes, so a partially installed server can run the test instead of skipping it and then fail during pipeline construction/execution. For example, this entry only requires plugin::native::nllb, but samples/pipelines/oneshot/speech_to_text_translate.yml:29-47 also uses plugin::native::whisper and plugin::native::piper; the same pattern affects vad-filtered-stt (samples/pipelines/oneshot/vad-filtered-stt.yml:29-36), speech_to_text_translate_helsinki (samples/pipelines/oneshot/speech_to_text_translate_helsinki.yml:39-56), and video_dav1d_compositor_demo (samples/pipelines/oneshot/video_dav1d_compositor_demo.yml:103-116).
Prompt for agents
The oneshot sample manifest and harness currently model only a single requires_node per sample, but several samples need multiple optional nodes to be present. Update the manifest schema in tests/pipeline-validation/src/lib.rs and the check in tests/pipeline-validation/tests/oneshot.rs so an entry can require multiple node kinds, then update tests/pipeline-validation/oneshot-samples.toml for multi-plugin/multi-codec samples such as speech_to_text_translate, speech_to_text_translate_helsinki, vad-filtered-stt, and video_dav1d_compositor_demo. The intended behavior is that if any required node is missing, optional samples skip, while non-optional samples still fail under PIPELINE_REQUIRE_NODES=1.
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Good catch — fixed in a1dd8f0. Added a requires_nodes list to OneshotEntry so a sample can require multiple node kinds; the harness now skips (or fails under PIPELINE_REQUIRE_NODES=1) if any required node is missing. Populated the secondary nodes for speech_to_text_translate (whisper, piper), speech_to_text_translate_helsinki (whisper, piper), vad-filtered-stt (vad), and video_dav1d_compositor_demo (svt_av1::encoder). Verified all 38 samples still pass locally.
| # --------------------------------------------------------------------------- | ||
| # Tier A — pure software, fast enough for the default CI job. | ||
| # --------------------------------------------------------------------------- |
There was a problem hiding this comment.
🟡 Section-divider comment violates repository comment rules
AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added dashed TOML section header is a direct violation of that mandatory repository rule.
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Fixed in a1dd8f0 — dropped the dashed divider rules; the tier sections now use a single plain comment.
| # --------------------------------------------------------------------------- | ||
| # Tier A — heavyweight showcase pipelines (realtime pacers / slow SW AV1). | ||
| # Opt in with PIPELINE_INCLUDE_SLOW=1. | ||
| # --------------------------------------------------------------------------- |
There was a problem hiding this comment.
🟡 Section-divider comment violates repository comment rules
AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added dashed TOML section header is a direct violation of that mandatory repository rule.
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Fixed in a1dd8f0 — dropped the dashed divider rules here too.
| # --------------------------------------------------------------------------- | ||
| # Tier B — marketplace plugins + models. optional_node = true so the GPU job | ||
| # (PIPELINE_REQUIRE_NODES=1, no plugins) skips rather than fails. Run locally | ||
| # after installing the plugin/model. | ||
| # --------------------------------------------------------------------------- |
There was a problem hiding this comment.
🟡 Section-divider comment violates repository comment rules
AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added dashed TOML section header is a direct violation of that mandatory repository rule.
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Fixed in a1dd8f0 — dropped the dashed divider rules here too.
| # --------------------------------------------------------------------------- | ||
| # Tier C — hardware codecs. nvcodec / vulkan_video are compiled by the | ||
| # pipeline-validation-gpu CI job, so they are NOT optional_node (a missing node | ||
| # there is a real regression). VA-API is compiled by no CI job, so it is | ||
| # optional_node = true. | ||
| # --------------------------------------------------------------------------- |
There was a problem hiding this comment.
🟡 Section-divider comment violates repository comment rules
AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added dashed TOML section header is a direct violation of that mandatory repository rule.
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Fixed in a1dd8f0 — dropped the dashed divider rules here too.
| #[serde(default)] | ||
| pub output_kind: OutputKind, | ||
|
|
||
| // --- Media expectations (used when `output_kind = media`). --- |
There was a problem hiding this comment.
🟡 Section-divider comment violates repository comment rules
AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added implementation section marker is a direct violation of that mandatory repository rule.
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Fixed in a1dd8f0 — removed this // --- divider (per AGENTS.md comment guidelines).
| #[serde(default)] | ||
| pub json_contains: Vec<String>, | ||
|
|
||
| // --- Skip controls. --- |
There was a problem hiding this comment.
🟡 Section-divider comment violates repository comment rules
AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added implementation section marker is a direct violation of that mandatory repository rule.
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Fixed in a1dd8f0 — removed this // --- Skip controls. --- divider.
| let first_line = body | ||
| .lines() | ||
| .find(|line| !line.trim().is_empty()) | ||
| .ok_or("JSON output is empty")?; | ||
|
|
||
| serde_json::from_str::<serde_json::Value>(first_line) | ||
| .map_err(|e| format!("Output is not valid JSON: {e}\n first line: {first_line}"))?; |
There was a problem hiding this comment.
🚩 JSON validation only checks the first NDJSON record
The new JSON path validates that the first non-empty line parses as JSON and then optionally checks raw substrings. That matches the current NDJSON-style transcription/VAD samples, but it means later malformed records would not be caught if a node emits valid JSON first and corrupt output later. If the intent is to smoke-test JSON framing rather than payload completeness this is acceptable; otherwise consider validating every non-empty line for output_kind = "json".
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Strengthened in a1dd8f0. validate_json_output now first tries to parse the whole body as a single JSON document; if that fails it treats the body as NDJSON and requires every non-empty line to parse (previously only the first line was checked). Substring checks still scan the full body.
| - name: Validate official oneshot samples (SW codecs only) | ||
| run: | | ||
| cd tests/pipeline-validation | ||
| PIPELINE_TEST_URL=http://127.0.0.1:4545 cargo test --test oneshot |
There was a problem hiding this comment.
📝 Info: Default CI intentionally skips slow showcase samples
The workflow invokes the oneshot harness without PIPELINE_INCLUDE_SLOW, so entries marked slow = true will not run in either the software or GPU CI jobs. This appears intentional from the justfile guidance and manifest comments, but reviewers should be aware that several long-running compositor/pacer samples remain manifest-covered rather than CI-executed by default.
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Correct, and intentional: the slow showcase samples (realtime pacers, rav1e SW AV1 ~200s) are gated behind PIPELINE_INCLUDE_SLOW=1 to keep CI cheap. They remain manifest-covered and runnable on demand via just test-oneshot-samples. Happy to add a periodic (nightly) job that runs them with the flag set if you'd like that coverage in CI too.
| pub fn available_nodes() -> &'static HashSet<String> { | ||
| static NODES: OnceLock<HashSet<String>> = OnceLock::new(); | ||
| NODES.get_or_init(|| { | ||
| get_available_nodes(base_url()).unwrap_or_else(|err| { | ||
| if std::env::var("PIPELINE_REQUIRE_NODES").as_deref() == Ok("1") { | ||
| panic!( | ||
| "FATAL: Could not query available nodes: {err}\n \ | ||
| PIPELINE_REQUIRE_NODES=1 requires a reachable server at {}", | ||
| base_url() | ||
| ); | ||
| } | ||
| eprintln!("WARNING: Could not query available nodes: {err}"); | ||
| eprintln!(" Node-gated tests will be skipped."); | ||
| eprintln!(" Is skit running at {}?", base_url()); | ||
| HashSet::new() | ||
| }) | ||
| }) |
There was a problem hiding this comment.
📝 Info: Node availability is cached for the process lifetime
Moving available_nodes() into the shared library means both harnesses now use a once-per-process snapshot of /api/v1/schema/nodes. This is fine for the CI pattern where skit is built and started before tests, but a local run against a server that loads/unloads plugins during the same test process will not see those changes. That is a behavioral constraint worth keeping in mind for plugin-focused local validation.
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
There was a problem hiding this comment.
Yes — node availability is queried once and cached for the process lifetime via OnceLock. That's fine for these batch test runs: the server's registered node set is static for the duration of a cargo test invocation, and caching avoids re-querying /api/v1/... for every sample. No change needed.
- Support multiple required node kinds per sample via requires_nodes, so multi-plugin/codec samples (translate chains, vad-filtered-stt, dav1d compositor) skip when any dependency is missing instead of running and failing during pipeline construction. - Validate every NDJSON record in JSON outputs, not just the first line. - Drop section-divider comments per AGENTS.md comment guidelines. Signed-off-by: streamkit-devin <devin@streamkit.dev>
…htly slow job Extract the shared ffprobe checks into MediaExpectations, flattened into both Expected and OneshotEntry, so a new media field is declared once instead of copied by hand in media_expected() (which also dead-copied output_extension). Mark video_nv_av1_colorbars slow: its core::pacer paces 900 frames at real time (~30s), and NVENC AV1 is already covered cheaply by the test/nv_av1_colorbars fixture, so the GPU job no longer pays ~30s per run. Add a nightly workflow that runs the slow samples with PIPELINE_INCLUDE_SLOW=1. Signed-off-by: streamkit-devin <devin@streamkit.dev>
Document the two suites, local run recipes, the per-sample manifest rule, skip-control semantics, when to mark a sample slow, and that MediaExpectations is the single source of truth for ffprobe checks. Signed-off-by: streamkit-devin <devin@streamkit.dev>
slow is checked before the node check, so the oneshot suite skips nv_av1 and only guards vulkan_video; NVENC-AV1 registration is guarded by the validate.rs nv_av1_colorbars fixture. Signed-off-by: streamkit-devin <devin@streamkit.dev>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #542 +/- ##
==========================================
+ Coverage 79.96% 79.98% +0.01%
==========================================
Files 234 234
Lines 68061 68061
Branches 1933 1933
==========================================
+ Hits 54426 54438 +12
+ Misses 13629 13617 -12
Partials 6 6
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Summary
samples/pipelines/oneshot/*.ymlhad zero automated coverage — only the hand-writtensamples/pipelines/test/fixtures were validated. This adds a manifest-driven harness that runs every oneshot sample against a liveskitserver and validates the output, wired into bothpipeline-validatione2e jobs (SW + GPU).optional_node(so the GPU job'sPIPELINE_REQUIRE_NODES=1doesn't fail on them); NVENC/Vulkan run in the GPU job; S3 samples gate onrequires_env; and heavyweight showcase pipelines (realtime pacers, software AV1 — up to ~200s) are behindslow/PIPELINE_INCLUDE_SLOW=1.run_pipeline→run_pipeline_partsfor multi-field uploads (dual-input mixers) and text inputs (TTS), added JSON-body validation (transcription / VAD), and made the request timeout configurable (PIPELINE_TIMEOUT_SECS).tests/pipeline-validation/README.mddocumenting how to maintain the harness (manifest rule, skip controls, when to mark a sampleslow).This is the test-harness half of the S2 oneshot review. No oneshot sample was broken — all 38 validated locally (Tier-A/B + S3/Servo pass; HW skipped without a GPU), so there are no sample fixes or removals here.
Review-feedback follow-ups
MediaExpectationsstruct, flattened (#[serde(flatten)]) into bothExpectedandOneshotEntry. This removes the hand-writtenmedia_expected()copy (which also dead-copiedoutput_extension), so a new ffprobe field is declared once instead of in three places. A unit test pins the flattened TOML deserialization (toml 0.8 + integer fields).video_nv_av1_colorbarsslow. Itscore::pacerpaces 900 frames at real time (~30s wall-clock), and NVENC AV1 is already covered cheaply by thetest/nv_av1_colorbarsfixture (30 frames), so the GPU job no longer pays ~30s/run for redundant coverage. (The sibling Vulkan sample has no pacer, so it stays in the default GPU run.)pipeline-validation-nightly.yml, schedule + manual dispatch) that runs the slow samples withPIPELINE_INCLUDE_SLOW=1on SW codecs.Two registry/model-packaging issues surfaced during testing (not repo bugs, not in this PR) and are now tracked: parakeet ships flat model files instead of the expected named subdir (#548), and the nllb model is XZ-compressed but labeled
.tar.bz2so the installer picks the wrong decompressor (#549).Review & Validation
oneshot-samples.toml:optional_node = truefor marketplace plugins + VA-API;requires_nodeonly for HW codecs the GPU job builds (nvcodec/vulkan);slow = truenow also onvideo_nv_av1_colorbars.video_vulkan_video_h264_colorbarsat 1280×720 (the remaining non-slow HW sample) — validated here only viarequires_nodeskip (no local GPU).just test-oneshot-samples http://localhost:4545against a server with plugins reproduces the full pass; addPIPELINE_INCLUDE_SLOW=1 --test-threads=1for the showcase samples.Notes
video_av1_compositor_demo(rav1e SW AV1) alone is ~200s, the four realtime-paced compositor demos are ~30s each, andvideo_nv_av1_colorbars~30s. All runnable on demand viaPIPELINE_INCLUDE_SLOW=1and nightly.tests/is excluded from the workspace, so this crate isn't covered byjust lint/just test; correctness is enforced by the e2e job. Re-ran the harness locally after the refactor (38/38, incl. Tier-B model samples and a slow compositor sample) andcargo clippyclean.Link to Devin session: https://staging.itsdev.in/sessions/66d442e49aca4390adb6749e19cb5dec
Requested by: @streamer45
Devin Review
00b41e8(HEAD is3461138)