fix(api): repair showcase safer promote cascade#325
Conversation
Reviewer's GuideFixes the showcase_rich demo pipeline cascade by resolving the scenario planning step against the recorded champion run instead of the live alias, giving safer_promote a parseable artifact URI, and adding a best-effort alias-restore safeguard, with tests and runbook docs updated accordingly. Sequence diagram for updated demo pipeline failure handling and alias restoresequenceDiagram
participant run_pipeline
participant step_safer_promote_flow
participant step_scenario_simulate_and_save
participant step_cleanup
participant _restore_demo_alias_after_failure
participant Registry
run_pipeline->>step_safer_promote_flow: step_safer_promote_flow(ctx, client)
step_safer_promote_flow->>Registry: client.request(POST /registry/aliases)
Registry-->>step_safer_promote_flow: alias swapped to worse run
step_safer_promote_flow-->>run_pipeline: StepResult
run_pipeline->>step_scenario_simulate_and_save: step_scenario_simulate_and_save(ctx, client)
alt ctx.winning_run_id is None
step_scenario_simulate_and_save->>Registry: client.request(GET /registry/aliases/demo-production)
Registry-->>step_scenario_simulate_and_save: alias_body(run_id)
else ctx.winning_run_id is set
step_scenario_simulate_and_save-->>step_scenario_simulate_and_save: winner_run_id = ctx.winning_run_id
end
step_scenario_simulate_and_save->>Registry: client.request(GET /registry/runs/{winner_run_id})
Registry-->>step_scenario_simulate_and_save: run_body(artifact_uri)
step_scenario_simulate_and_save-->>run_pipeline: StepResult
alt some_step (not cleanup) fails
run_pipeline->>_restore_demo_alias_after_failure: _restore_demo_alias_after_failure(ctx, client)
_restore_demo_alias_after_failure->>Registry: client.request(POST /registry/aliases)
Registry-->>_restore_demo_alias_after_failure: response
_restore_demo_alias_after_failure-->>run_pipeline: return
run_pipeline-->>run_pipeline: pipeline aborted
else all steps succeed
run_pipeline->>step_cleanup: step_cleanup(ctx, client)
step_cleanup->>Registry: client.request(POST /registry/aliases)
Registry-->>step_cleanup: response
step_cleanup-->>run_pipeline: StepResult
end
Flow diagram for champion resolution in scenario_simulate_and_saveflowchart TD
A[start step_scenario_simulate_and_save] --> B{ctx.winning_run_id is not None}
B -- Yes --> C[Set winner_run_id = ctx.winning_run_id]
B -- No --> D["client.request(GET /registry/aliases/DEMO_ALIAS)"]
D --> E{alias_run_id is str}
E -- No --> F[return fail: DEMO_ALIAS alias has no run_id]
E -- Yes --> G[Set winner_run_id = alias_run_id]
C --> H["client.request(GET /registry/runs/{winner_run_id})"]
G --> H
H --> I[Continue with scenario_simulate_and_save]
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In
_restore_demo_alias_after_failure, thelogger.warningcall drops the underlying exception details; consider capturing and logging the exception (exc_info=Trueor including the error message) so alias-restore issues are diagnosable without changing behavior. - The explanatory comments in
step_scenario_simulate_and_save,_restore_demo_alias_after_failure, and thetest_run_demo_showcase_rich_full_epicassertions are quite long and tightly coupled to specific issue numbers; consider trimming or extracting the core invariants to reduce future drift while keeping the behavior-focused parts of the comments.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_restore_demo_alias_after_failure`, the `logger.warning` call drops the underlying exception details; consider capturing and logging the exception (`exc_info=True` or including the error message) so alias-restore issues are diagnosable without changing behavior.
- The explanatory comments in `step_scenario_simulate_and_save`, `_restore_demo_alias_after_failure`, and the `test_run_demo_showcase_rich_full_epic` assertions are quite long and tightly coupled to specific issue numbers; consider trimming or extracting the core invariants to reduce future drift while keeping the behavior-focused parts of the comments.
## Individual Comments
### Comment 1
<location path="app/features/demo/pipeline.py" line_range="1787-1788" />
<code_context>
+ # consumer that parses it via ``_parse_artifact_key`` does not choke
+ # on a placeholder. KEY is hex-only (dashes stripped) to satisfy the
+ # ``model_([0-9a-f]+)`` parser regex.
+ "artifact_uri": (
+ f"demo/seasonal_naive-model_{worse_run_id_raw.replace('-', '')[:12]}.joblib"
+ ),
"artifact_hash": "0" * 64,
</code_context>
<issue_to_address>
**suggestion:** Consider centralizing artifact key formatting to avoid relying on an inline magic slice and keep it aligned with the parser.
This logic tightly couples URI formatting to the `model_([0-9a-f]+)` regex and hard-codes the `[:12]` truncation. To make future changes safer (e.g., regex or key length changes), consider extracting artifact key generation into a helper shared with `_parse_artifact_key` (or at least a small local helper), so the format/truncation rules and assumptions about `worse_run_id_raw` are defined in one place.
Suggested implementation:
```python
artifact_key = _format_demo_artifact_key(worse_run_id_raw)
run_body = await client.request(
json_body={
"status": "success",
"metrics": {"wape": 99.0},
# issue #324 — write a real-shape, parseable artifact_uri (V1 demo
# shape ``demo/{model_type}-model_{KEY}.joblib``) so any downstream
# consumer that parses it via ``_parse_artifact_key`` does not choke
# on a placeholder. KEY is hex-only (dashes stripped) to satisfy the
# ``model_([0-9a-f]+)`` parser regex.
"artifact_uri": f"demo/seasonal_naive-model_{artifact_key}.joblib",
"artifact_hash": "0" * 64,
"artifact_size_bytes": 1,
},
)
```
```python
def _format_demo_artifact_key(run_id_raw: str) -> str:
"""Format the artifact key for demo runs.
This keeps the formatting aligned with the ``model_([0-9a-f]+)`` parser
regex by stripping dashes and truncating to the expected key length.
"""
normalized = run_id_raw.replace("-", "")
# NOTE: if the parser regex or expected key length changes, update this helper.
return normalized[:12]
async def _restore_demo_alias_after_failure(ctx: DemoContext, client: _Client) -> None:
"""Best-effort restore of the demo-production alias after a mid-run failure.
issue #324 — when a step fails the pipeline aborts before the trailing
``cleanup`` row runs, which would otherwise leave ``demo-production``
```
To fully centralize formatting with `_parse_artifact_key`, consider:
1. Updating `_parse_artifact_key` (wherever it is defined) to either:
- Use `_format_demo_artifact_key` for constructing/validating keys, or
- Move `_format_demo_artifact_key` into a shared module (e.g., a `utils` or `artifacts` helper) and have both this pipeline code and `_parse_artifact_key` import it.
2. If the `model_([0-9a-f]+)` regex or the required key length changes, adjust `_format_demo_artifact_key` accordingly and ensure `_parse_artifact_key`’s regex stays in sync.
</issue_to_address>
### Comment 2
<location path="app/features/demo/pipeline.py" line_range="1975-1980" />
<code_context>
+ "description": ("Restored by the showcase pipeline failure safeguard (#324)."),
+ },
+ )
+ except (_StepError, httpx.HTTPError, OSError):
+ # Best-effort — a restore failure must never mask the original failure.
+ logger.warning(
+ "demo.cleanup.alias_restore_safeguard_failed",
+ run_id=ctx.original_demo_alias_run_id,
+ )
</code_context>
<issue_to_address>
**suggestion:** Log the exception details when the alias restore safeguard fails.
The safeguard correctly avoids masking the original failure by swallowing `_StepError`, `httpx.HTTPError`, and `OSError`, but the warning only logs the event name and `run_id`, so you lose the exception details needed to debug intermittent restore issues. Please include the exception context (e.g., `exc_info=True` on `logger.warning`) so stack traces are captured while keeping the current failure semantics unchanged.
```suggestion
except (_StepError, httpx.HTTPError, OSError):
# Best-effort — a restore failure must never mask the original failure.
logger.warning(
"demo.cleanup.alias_restore_safeguard_failed",
run_id=ctx.original_demo_alias_run_id,
exc_info=True,
)
```
</issue_to_address>
### Comment 3
<location path="tests/test_e2e_demo.py" line_range="503-512" />
<code_context>
+ scenario_step = by_name.get("scenario_simulate_and_save")
</code_context>
<issue_to_address>
**suggestion (testing):** Consider asserting that the overall pipeline status is pass when there are no env-dependent failures
You could add an assertion that when `failed` is empty, `result["overall_status"] == "pass"`. This would help catch future regressions where the overall status diverges from the per-step statuses and keep this e2e test aligned with the intended behavior.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| scenario_step = by_name.get("scenario_simulate_and_save") | ||
| assert scenario_step is not None, "scenario_simulate_and_save did not run on showcase_rich" | ||
| assert scenario_step["status"] == "pass", ( | ||
| "scenario_simulate_and_save must pass after #324, got " | ||
| f"status={scenario_step['status']!r} detail={scenario_step['detail']!r}" | ||
| ) | ||
|
|
||
| # Any OTHER failed step must be an environment-dependent knowledge-phase step | ||
| # (embedding provider unreachable / misconfigured key). Those are designed to | ||
| # skip gracefully when the provider is absent (RUNBOOKS entry 20-22); a real |
There was a problem hiding this comment.
suggestion (testing): Consider asserting that the overall pipeline status is pass when there are no env-dependent failures
You could add an assertion that when failed is empty, result["overall_status"] == "pass". This would help catch future regressions where the overall status diverges from the per-step statuses and keep this e2e test aligned with the intended behavior.
Summary
Repairs the
showcase_richdemo pipeline cascade where the planning phase failed on the safer-promote placeholder artifact. Fix is contained entirely in theapp/features/demoslice (noscenarios/registrychanges).Closes #324.
Root cause
Two compounding issues in
app/features/demo/pipeline.py:step_safer_promote_flow(PRP-39, decision phase) swapped thedemo-productionalias to a deliberately-worse run whoseartifact_uriwas the literal placeholderdemo/safer-promote-placeholder.joblib.step_scenario_simulate_and_save(PRP-40, planning phase, runs after the swap) resolved that live alias, read the placeholder URI, and_parse_artifact_key()(regexmodel_([0-9a-f]+)(?:\.joblib)?$) raisedCannot parse artifact-key…→ the step failed and the pipeline aborted at step 16.Aggravating factor:
run_pipelinestops on the first failed step andstep_cleanupis the last row, so a failure between the swap and cleanup leftdemo-productionpermanently pointing at the worse-WAPE run (the R15 restore never ran).Fix
ctx.winning_run_id(recorded bystep_register, never touched by the swap) instead of the live alias instep_scenario_simulate_and_save; fall back to the alias only when no champion was recorded. This routes replay to a run with a real, parseable, loadable artifact — fixing both the parse error and the latent "model not found" failure.step_safer_promote_flownow writes a real-shape, parseableartifact_uri(demo/seasonal_naive-model_{hex12}.joblib) instead of the placeholder._restore_demo_alias_after_failuresafeguard wired into the orchestrator fail-branch — restoresdemo-productionto the champion on any non-cleanup mid-run failure (best-effort, never raises), so the alias is never left on the worse run.Tests / validation
app/features/demo/tests/test_pipeline.py: updated the 3 scenario tests for the new champion-resolution path; added 5 regressions —_parse_artifact_keyrejects the old placeholder / accepts the new shape, the step ignores the corrupted alias, and the alias-restore safeguard (repoint / no-op / swallows-errors).tests/test_e2e_demo.py: removed theKNOWN_PREEXISTING_FAILURES = {"scenario_simulate_and_save"}tolerance; the step must now pass (only environment-dependent knowledge-phase steps are tolerated).docs/_base/RUNBOOKS.md: entry 18 marked FIXED.Gates run locally:
ruff check .— passedruff format --check .— passedpytest -m "not integration"— 1807 passed, 7 skipped, 3 deselectedpytest -m integration tests/test_e2e_demo.py::test_run_demo_showcase_rich_full_epic— passed (pipeline reaches agents + ops + cleanup; cascade gone)mypy app/— only pre-existingxgbooststub errors; none in changed filespyright— clean (0 errors) on the changed filesNotes
.gitignoreanddocs/user-guide/showcase-manual-demo-guide.md.rag_index_subsetfails rather than skipping gracefully when the embedding provider returns HTTP 401 from a placeholder key — a separate pre-existing RAG-phase robustness gap.Summary by Sourcery
Fix the showcase_rich demo pipeline cascade by resolving scenario simulation against the recorded champion run instead of the mutable demo-production alias, hardening alias handling, and tightening end-to-end expectations.
Bug Fixes:
Enhancements:
Documentation:
Tests: