Make StepFailed pickles portable across processes (closes #110)#129

Merged
rafacm merged 2 commits into main from rafacm/stepfailed-portable
May 4, 2026

Conversation

@rafacm
Owner

@rafacm rafacm commented May 4, 2026

Summary

  • StepFailed.__reduce__ now returns (RuntimeError, (str(self),)) so DBOS-stored exceptions deserialize as a plain stdlib RuntimeError in any process — the standalone dbos workflow steps CLI and DBOS Conductor can rehydrate them without importing episodes.workflows. Worker-side typed hierarchy unchanged (raise DownloadStepFailed(...) still matches except StepFailed).
  • Admin's _decode_dbos_payload() legacy-row fallback now renders <could not deserialize: <80-char preview>> instead of the raw b64 blob.
  • Picked Option 3 from the issue (smallest blast radius, preserves typed shape at raise time). Verified no caller catches by typed subclass — the hierarchy is for log readability, not control flow.

Builds on the partial fix from 9ff6719. Closes #110.
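
A minimal sketch of the reduction described in the summary, assuming a simplified StepFailed whose constructor takes episode_id and error_message. The constructor signature and the message format are illustrative guesses, not the code in episodes/workflows.py:

```python
import pickle


class StepFailed(Exception):
    """Simplified stand-in; the real class lives in episodes/workflows.py."""

    def __init__(self, episode_id, error_message):
        self.episode_id = episode_id
        self.error_message = error_message
        # Assumed message format, for illustration only.
        super().__init__(f"episode {episode_id}: {error_message}")

    def __reduce__(self):
        # Pickle as a plain stdlib RuntimeError carrying the formatted message,
        # so the reading process never needs to import this module.
        return (RuntimeError, (str(self),))


class DownloadStepFailed(StepFailed):
    """Typed subclass; still caught by `except StepFailed` at raise time."""


def round_trip(exc):
    """Serialize and deserialize the exception the way a store would."""
    return pickle.loads(pickle.dumps(exc))


try:
    raise DownloadStepFailed(42, "HTTP 503 from feed")
except StepFailed as caught:
    # The typed hierarchy still works here; __reduce__ only matters at pickle time.
    rehydrated = round_trip(caught)
```

In this sketch the typed attributes (episode_id, error_message) do not survive the round trip. Only the formatted string does, which fits the PR's position that consumers render the message and never branch on the subclass.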

Test plan

  • uv run python manage.py test — 372 tests passing
  • New unit test: pickle.loads(pickle.dumps(DownloadStepFailed(...))) returns a RuntimeError carrying the formatted message
  • End-to-end: uv run dbos workflow steps episode-<id>-run-<n> against a freshly-failed workflow renders the formatted error with no "could not be deserialized" warning (not exercised in this PR — would require a live DBOS instance)

🤖 Generated with Claude Code

@github-actions

github-actions Bot commented May 4, 2026

Unit Test Results

374 tests  +3   374 ✅ +3   24s ⏱️ -1s
 79 suites +1     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 9d1897f. ± Comparison against base commit 69a0081.

♻️ This comment has been updated with latest results.

Owner Author

@rafacm rafacm left a comment

Looks correct in the runtime path, but I would fix the session transcript documentation before merging.

Correctness

episodes/workflows.py:82-88 builds Exception.args from one formatted string, and episodes/workflows.py:90-114 now reduces every StepFailed subclass to builtins.RuntimeError(message). I checked the DBOS CLI path: workflow steps calls DBOSClient.list_workflow_steps(), DBOS safe-deserializes the error, and the CLI prints JSON with a default encoder that falls back to str(obj). It does not inspect episode_id, error_message, or subclass type, so RuntimeError(str(self)) is sufficient for the CLI/Conductor portability goal.

I also checked typed-shape usage on origin/main. There are no except FetchDetailsFailed / subclass control-flow catches; the only typed checks are tests around _raise_if_failed and class inheritance. Admin rendering also uses string output, not exception class branching.

Minor concern: the legacy fallback at episodes/admin.py:71-75 no longer shows a raw 4 KB blob, but it also does not recover the readable legacy error message from pre-fix pickle bytes. Issue #110's body says legacy rows should be handled gracefully, “probably via a fallback that extracts the readable substring from the stored bytes when full unpickle fails.” This implementation gives an opaque base64 preview inside <could not deserialize: ...>. That is better than the previous wall of base64, but it is not equivalent to rendering the old human-readable failure message.
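
To make the concern concrete, here is a rough sketch of the fallback behavior as described in this PR. The function name decode_dbos_payload, the module name legacy_workflows_demo, and the way the legacy row is simulated are all hypothetical; the real code is in episodes/admin.py and may differ:

```python
import base64
import pickle
import sys
import types

PREVIEW_CHARS = 80  # preview length quoted in the PR summary


def decode_dbos_payload(value):
    """Hypothetical stand-in for the admin's _decode_dbos_payload()."""
    # Plain values pass through; only strings that look like base64 pickles are decoded.
    if not isinstance(value, str) or not value.startswith("gAS"):
        return value
    try:
        # pickle.loads must only ever be given trusted data.
        return pickle.loads(base64.b64decode(value))
    except Exception:
        # Legacy rows reference a class this process cannot import.
        return f"<could not deserialize: {value[:PREVIEW_CHARS]}>"


# Simulate a legacy row: pickle a class from a module that later disappears.
legacy_module = types.ModuleType("legacy_workflows_demo")
LegacyStepFailed = type(
    "LegacyStepFailed", (Exception,), {"__module__": "legacy_workflows_demo"}
)
legacy_module.LegacyStepFailed = LegacyStepFailed
sys.modules["legacy_workflows_demo"] = legacy_module
legacy_blob = base64.b64encode(
    pickle.dumps(LegacyStepFailed("old failure"), protocol=4)
).decode("ascii")
del sys.modules["legacy_workflows_demo"]  # the reading process cannot import it

rendered = decode_dbos_payload(legacy_blob)
passthrough = decode_dbos_payload("plain text")
```

The rendered value is a short base64 preview wrapped in the marker text. The original "old failure" message is inside the bytes but is not surfaced, which is the gap noted above.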

Tests

The new test at episodes/tests/test_admin.py:132-153 is meaningful: if someone reverts __reduce__ to (self.__class__, (...)), the type(rehydrated) is RuntimeError assertion fails even in-process. I also manually disassembled a representative pickle and confirmed it contains builtins.RuntimeError, not episodes.workflows.
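
The disassembly check can also be done programmatically. A small sketch using the stdlib pickletools module, with simplified stand-in classes (not the real ones from episodes/workflows.py) and a hypothetical helper name:

```python
import pickle
import pickletools


class StepFailed(Exception):
    """Simplified stand-in for the real base class."""

    def __reduce__(self):
        return (RuntimeError, (str(self),))


class DownloadStepFailed(StepFailed):
    pass


def pickled_strings(payload):
    """Collect every string argument that appears in the pickle opcode stream."""
    return [arg for _op, arg, _pos in pickletools.genops(payload) if isinstance(arg, str)]


# Protocol 4 is pinned so module and class names appear as separate strings.
payload = pickle.dumps(DownloadStepFailed("download failed"), protocol=4)
symbols = pickled_strings(payload)
```

An assertion on symbols then states the portability property directly: the stream names builtins and RuntimeError, and nothing from the defining module.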

I ran the three touched admin tests locally:

uv run python manage.py test episodes.tests.test_admin.EpisodeAdminTests.test_step_failed_pickle_round_trip_yields_runtime_error episodes.tests.test_admin.EpisodeAdminTests.test_decode_dbos_payload_unpickles_step_output episodes.tests.test_admin.EpisodeAdminTests.test_decode_dbos_payload_passthrough_for_plain_values

They pass. I did not rerun the full 372-test suite.

Suggested test follow-up: add a small subprocess/no-source-path assertion or pickle disassembly assertion so the cross-process property is explicit rather than inferred from type(...) is RuntimeError.

Documentation

The required files exist: plan (doc/plans/2026-05-04-stepfailed-portable-pickle.md), feature doc (doc/features/2026-05-04-stepfailed-portable-pickle.md), both session transcripts, and the changelog entry at CHANGELOG.md:7-11.

Blocking documentation issue: the session transcripts do not include verbatim user messages. The planning transcript's user section is a parenthesized summary at doc/sessions/2026-05-04-stepfailed-portable-pickle-planning-session.md:13-15, and the implementation transcript's user section is another summary/reference at doc/sessions/2026-05-04-stepfailed-portable-pickle-implementation-session.md:13-15. The repo instructions require user messages to be “verbatim, unedited text.” These should be replaced with the actual prompt text, or explicitly corrected if the exact text is unrecoverable.

Other findings

No unused imports or obvious regression risks in the code diff. The heuristic at episodes/admin.py:67, which treats strings starting with gAS as base64-encoded pickles, is consistent with DBOS's current pickle.dumps(...) default protocol behavior in this environment.
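
For context on why that prefix works: a framed protocol 4 pickle begins with the bytes 0x80 0x04 0x95, and those bytes base64-encode to a string starting with gAS. A quick sketch, where the helper name is made up for illustration:

```python
import base64
import pickle


def b64_pickle_prefix(obj, protocol):
    """Return the first three base64 characters of the pickled object."""
    raw = pickle.dumps(obj, protocol=protocol)
    return base64.b64encode(raw).decode("ascii")[:3]


prefix_v4 = b64_pickle_prefix(RuntimeError("boom"), 4)
# Other protocols produce a different prefix, so the heuristic is protocol-specific.
prefix_v5 = b64_pickle_prefix(RuntimeError("boom"), 5)
```

The heuristic therefore depends on the pickle protocol in use, which is presumably why the finding is scoped to "this environment".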

Suggested follow-ups:

  • Replace the summarized transcript user sections with the actual verbatim user messages before merge.
  • Consider extracting printable strings from failed legacy pickle bytes so old rows show the original step failure text, not only a base64 preview.
  • Add an explicit subprocess or pickle-symbol test for the no-episodes.workflows import scenario.

Given the transcript requirement is explicit repo policy for feature PRs, I would not merge until that documentation issue is fixed. The runtime change itself looks sound.

— Codex review (independent second-opinion pass)

@rafacm rafacm force-pushed the rafacm/stepfailed-portable branch from 5ed7a35 to 3d2c71e Compare May 4, 2026 09:23
@github-actions

github-actions Bot commented May 4, 2026

AI Checks summary

❌ 1 fail · ✅ 6 pass · ⏭️ 0 skip

❌ Branching & PR Strategy

No direct commits to main. Feature branches off latest main. Rebase merge only — squash and merge-commit are forbidden on this repo.

PR is from a feature branch and violates the branching strategy.

Details

The PR branch is not up-to-date with the current main branch; it does not appear to be rebased onto main, which is required for compliance.


Other checks (6 passing · 0 skipped)

  • Comment Discipline — Rule does not apply.
  • Entity Creation Race Safety — Rule does not apply.
  • Feature PR Documentation Bundle — Rule does not apply.
  • Pipeline Step Documentation Sync — Rule does not apply.
  • RAGTIME_* Env Var Sync — Rule does not apply.
  • gh api Shell Escaping & Endpoints — Rule does not apply.

rafacm added a commit that referenced this pull request May 4, 2026
- Replace summarized user messages in session transcripts with verbatim text.
  Implementation session declared as agent-orchestrated; parent prompt
  reproduced verbatim.
- Add subprocess pickle-portability test asserting StepFailed deserializes
  to RuntimeError without episodes.workflows on sys.path.
- CHANGELOG: BREAKING note that pre-fix StepFailed pickles will not render
  human-readable text after this change.

Codex follow-up #2 (extract printable strings from legacy pickle bytes)
intentionally not addressed — user will clear the dev DB and accepts the
breaking-change tradeoff.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rafacm
Owner Author

rafacm commented May 4, 2026

Follow-up commit addressing the codex review:

  • Verbatim transcripts — planning session now embeds the actual user messages from the parent conversation. Implementation session is explicitly declared agent-orchestrated; the parent-agent's launching prompt is reproduced verbatim.
  • Subprocess pickle test — new episodes/tests/test_stepfailed_subprocess.py spawns a Python subprocess without episodes.workflows on sys.path, asserts the pickled StepFailed deserializes to RuntimeError carrying the original message. Locks the cross-process property in by force.
  • Breaking change CHANGELOG note — pre-fix StepFailed pickles will not render human-readable text after this change. User has accepted clearing the dev DB; no production impact (pre-prod project).

Codex follow-up #2 (extract printable strings from legacy pickle bytes) intentionally skipped.
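
A rough sketch of what such a subprocess check can look like. It is not the contents of episodes/tests/test_stepfailed_subprocess.py; the classes are simplified stand-ins, and rehydrate_in_child and CHILD_CODE are hypothetical names:

```python
import base64
import pickle
import subprocess
import sys


class StepFailed(Exception):
    """Simplified stand-in for the real base class."""

    def __reduce__(self):
        return (RuntimeError, (str(self),))


class DownloadStepFailed(StepFailed):
    pass


# The child only knows the stdlib; it has no access to the classes above.
CHILD_CODE = (
    "import base64, pickle, sys\n"
    "obj = pickle.loads(base64.b64decode(sys.argv[1]))\n"
    "print(type(obj).__name__ + '|' + str(obj))\n"
)


def rehydrate_in_child(exc):
    """Pickle in this process, unpickle in a fresh isolated interpreter."""
    blob = base64.b64encode(pickle.dumps(exc)).decode("ascii")
    result = subprocess.run(
        # -I runs Python in isolated mode: no PYTHON* env vars, no cwd on sys.path.
        [sys.executable, "-I", "-c", CHILD_CODE, blob],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


child_view = rehydrate_in_child(DownloadStepFailed("download failed"))
```

If __reduce__ were reverted to return the typed class, the child would fail to import it and subprocess.run would raise CalledProcessError because of check=True.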

rafacm added a commit that referenced this pull request May 4, 2026
AGENTS.md's transcript policy assumed the human-with-Claude workflow
where every session has real `### User` messages. Conductor's parallel-
launch model breaks that assumption: the implementation session has no
direct user, only an agent-orchestrated launching prompt.

Add a new "Agent-orchestrated sessions" subsection to AGENTS.md and a
matching note to the Feature PR Documentation Bundle AI check. The new
convention: use `### Parent agent (orchestrator)` headings instead of
`### User`, reproduce the parent-agent prompt verbatim, and declare the
session as agent-orchestrated at the top of `## Detailed conversation`.

Same verbatim rule applies — summarized parent prompts are still
rejected. This eliminates a recurring false positive on the docs check
that was raised on PRs #129, #130, #131, #133.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rafacm and others added 2 commits May 4, 2026 15:28
DBOS persists per-step error payloads as base64-encoded pickle bytes.
The standalone ``dbos workflow steps`` CLI and DBOS Conductor run
outside the Django process and can't import ``episodes.workflows``,
so unpickling a typed ``StepFailed`` subclass raised
``ModuleNotFoundError: No module named 'episodes'`` and the CLI
printed "exception object could not be deserialized".

Switch ``StepFailed.__reduce__`` to return ``(RuntimeError,
(str(self),))``. The on-the-wire pickle is now a plain stdlib
``RuntimeError`` carrying the formatted message — any Python process
can deserialize it. Worker-side semantics unchanged: ``raise
DownloadStepFailed(...)`` still matches ``except StepFailed`` because
``__reduce__`` only kicks in at pickle time, after the typed
exception has done its job. Verified no caller catches by typed
subclass — the typed hierarchy is for log readability, not control
flow.

Admin's ``_decode_dbos_payload()`` previously returned the raw b64
blob when ``pickle.loads`` failed. Now returns ``<could not
deserialize: <80-char preview>>`` so legacy rows whose typed-class
wire format can no longer be rehydrated render an explanation
instead of a wall of base64.

Tests: new pickle round-trip asserting RuntimeError-on-the-wire;
existing legacy-row test updated for the new fallback message.
372 tests passing.

Closes #110.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace summarized user messages in session transcripts with verbatim text.
  Implementation session declared as agent-orchestrated; parent prompt
  reproduced verbatim.
- Add subprocess pickle-portability test asserting StepFailed deserializes
  to RuntimeError without episodes.workflows on sys.path.
- CHANGELOG: BREAKING note that pre-fix StepFailed pickles will not render
  human-readable text after this change.

Codex follow-up #2 (extract printable strings from legacy pickle bytes)
intentionally not addressed — user will clear the dev DB and accepts the
breaking-change tradeoff.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rafacm rafacm force-pushed the rafacm/stepfailed-portable branch from ebff7f5 to 9d1897f Compare May 4, 2026 13:30
@rafacm
Owner Author

rafacm commented May 4, 2026

Merging with AI Check: Branching & PR Strategy failing — known false positive (#125)

This branch was rebased onto main (69a0081) and force-pushed prior to merge:

* 9d1897f Address codex review followups (PR #129)
* fc54e8d Make StepFailed pickles portable across processes

git merge-base HEAD origin/main equals origin/main's tip exactly — the branch is fully rebased.

Why the check fails anyway. GitHub Actions uses actions/checkout@v4's default behavior for pull_request events, which checks out refs/pull/N/merge — a synthetic two-parent merge commit GitHub creates regardless of whether the PR is fast-forwardable. From this run's log:

fetch ... +e8b57f8...:refs/remotes/pull/129/merge
checkout refs/remotes/pull/129/merge
HEAD is now at e8b57f8 Merge 9d1897f... into 69a008...

The Branching & PR Strategy rule's driver (gpt-4o-mini) inspects the checked-out history, sees a Merge X into Y commit on top, and concludes "branch is not rebased" — every time. This is structurally a false positive on every rebased PR until #125 lands a fix (either pin actions/checkout to ref: ${{ github.event.pull_request.head.sha }} so the runner sees the real rebased head, or harden the rule prompt to ignore synthetic merge commits).
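
The second option (ignoring the synthetic merge commit) could be approximated with a deterministic pre-filter before the history reaches the rule. This is only a sketch under stated assumptions: the commit subject pattern "Merge <sha> into <sha>" is inferred from the log excerpt above, and the function names and the (subject, parent_count) input shape are hypothetical:

```python
import re

# Assumed subject format of the refs/pull/N/merge commit, based on the log above.
SYNTHETIC_MERGE = re.compile(r"^Merge [0-9a-f]{7,40} into [0-9a-f]{7,40}$")


def is_synthetic_pr_merge(subject, parent_count):
    """True for a two-parent commit whose subject matches the synthetic pattern."""
    return parent_count == 2 and SYNTHETIC_MERGE.match(subject) is not None


def history_is_linear(commits):
    """commits: list of (subject, parent_count) tuples, newest first.

    Drops a synthetic merge at the tip, then requires every remaining
    commit to have exactly one parent.
    """
    if commits and is_synthetic_pr_merge(*commits[0]):
        commits = commits[1:]
    return all(parent_count == 1 for _subject, parent_count in commits)


checked_out = [
    ("Merge 9d1897f into 69a0081", 2),
    ("Address codex review followups (PR #129)", 1),
    ("Make StepFailed pickles portable across processes", 1),
]
linear = history_is_linear(checked_out)
```

A real merge commit further down the history would still fail this check, so the filter only excuses the one commit GitHub adds on its own.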

All other signals are green: Unit Test Results, the other 6 AI Checks, aggregate, list, and test all pass. Merging via rebase strategy.

@rafacm rafacm merged commit 829c2a3 into main May 4, 2026
10 of 11 checks passed
@rafacm rafacm deleted the rafacm/stepfailed-portable branch May 4, 2026 13:39
rafacm added a commit that referenced this pull request May 4, 2026
- Trim verbose multi-paragraph comments in episodes/apps.py:_init_dbos
  per .ai-checks/comment-discipline.md (one-line WHY only).
- Add test_apps_dbos_init.py covering role detection for uvicorn,
  runserver, submit_episode, enrich_entities, migrate, test, shell.
- Merge duplicate ### Fixed section under 2026-05-04 in CHANGELOG.md
  after rebasing onto main (PR #129's StepFailed fix).

Development

Successfully merging this pull request may close these issues.

Replace typed StepFailed exceptions with serializable failure objects so the dbos CLI can render them

1 participant