Fail stale jobs and cull zombie pending txns#106

Merged
0xFirekeeper merged 3 commits into main from firekeeper/stale
Apr 21, 2026

@0xFirekeeper 0xFirekeeper commented Apr 21, 2026

Bound job lifetimes and remove stale pending transactions to prevent zombie retries and unbounded resource growth. Adds a 24h max age check for confirmation and send jobs (EIP-7702 and external bundler), including a job_age_seconds helper and logging, so long-lived retrying jobs are permanently failed. Implements peek_pending_transactions_older_than in the EOA store to fetch pending entries older than a cutoff (and clean up missing data). Adds EOA worker logic to cull stale pending transactions (24h cutoff, max 500 per cycle) by batch-failing them and enqueuing failure webhooks. Small logging/error messages added to surface these events.
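
The age guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Job` struct, the `check_job_age` function, the explicit `now_unix_secs` parameter, and the constant name `MAX_JOB_AGE_SECONDS` are assumptions made for the example. Only the `job_age_seconds` name and the 24h limit come from the description.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// 24 hours in seconds, the maximum job age stated in the PR description.
/// The constant name is an assumption for this sketch.
const MAX_JOB_AGE_SECONDS: u64 = 24 * 60 * 60;

/// Hypothetical stand-in for a queued job; only its creation time matters here.
struct Job {
    created_at_unix_secs: u64,
}

/// Age of the job in seconds at `now_unix_secs`.
/// Saturates to 0 if the clock appears to have gone backwards.
fn job_age_seconds(job: &Job, now_unix_secs: u64) -> u64 {
    now_unix_secs.saturating_sub(job.created_at_unix_secs)
}

/// Returns `Err(age)` when the job is older than the limit, so the caller can
/// log the age and fail the job permanently instead of retrying it.
fn check_job_age(job: &Job, now_unix_secs: u64) -> Result<(), u64> {
    let age = job_age_seconds(job, now_unix_secs);
    if age > MAX_JOB_AGE_SECONDS {
        Err(age)
    } else {
        Ok(())
    }
}

/// Current wall-clock time in Unix seconds (0 if the clock is before the epoch).
fn now_unix_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

A handler would call something like `check_job_age(&job, now_unix_secs())` before doing any work and turn an `Err` into a permanent failure. Passing the current time in as a parameter is a choice made here to keep the check easy to test.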

Summary by CodeRabbit

  • New Features
    • Jobs and transactions now expire after a configured age (24 hours by default), causing permanent failure to stop endless retries.
    • Worker cycles now detect and mark stale pending transactions (capped per cycle), logging warnings for visibility.
    • Confirmation and send handlers immediately fail overly stale jobs to avoid “zombie” retries and noisy retry loops.

coderabbitai Bot commented Apr 21, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 04377023-f268-434b-bd1c-ebdc97c4280a

📥 Commits

Reviewing files that changed from the base of the PR and between 1cf1f24 and f773afe.

📒 Files selected for processing (1)
  • executors/src/eoa/store/mod.rs


Walkthrough

Adds time-based staleness guards: send/confirm handlers now permanently fail jobs older than configured thresholds; EOA store gains age-based peek and centralized hydration; EOA worker culls and fails stale pending transactions at start of its loop.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| EIP7702 Executor: `executors/src/eip7702_executor/confirm.rs`, `executors/src/eip7702_executor/send.rs` | Added job-age checks in process methods. Confirm uses `MAX_CONFIRMATION_JOB_AGE_SECONDS` and `job_age_seconds`; send enforces a 24h limit. Stale jobs are logged and returned as permanent failures via `.map_err_fail()`. |
| External Bundler: `executors/src/external_bundler/confirm.rs`, `executors/src/external_bundler/send.rs` | Added confirmation job-age check with `MAX_CONFIRMATION_JOB_AGE_SECONDS`, `job_age_seconds`, and new error variant `UserOpConfirmationError::StaleJob { ... }`. Send handler enforces 24h pre-send guard. Stale jobs are logged and permanently failed. |
| EOA Store: `executors/src/eoa/store/mod.rs` | Replaced earlier pagination API with `peek_pending_transactions_older_than(older_than_unix_ms, limit)` (ZRANGEBYSCORE). Introduced private `hydrate_pending_transactions(...)` to HGET+deserialize and ZREM orphan IDs. Reintroduced `peek_pending_transactions_paginated(offset, limit)` implemented via `zrange_withscores` + hydration. Hydration now preallocates result vec and runs deletion pipeline via `deletion_pipe.query_async::<()>(conn)`. |
| EOA Worker: `executors/src/eoa/worker/mod.rs` | Added constants for max pending age (24h in ms) and per-cycle cap (500). New `cull_stale_pending_transactions` computes cutoff, peeks stale transactions (bounded), logs, builds failure messages, and calls `store.fail_pending_transactions_batch`. Culling runs at workflow start; errors are inspected and propagated. |
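
The cutoff computation and the bounded peek summarized for the EOA worker and store can be illustrated with an in-memory stand-in. This is a sketch under assumptions, not the PR's implementation: the slice of `(transaction_id, queued_at_unix_ms)` pairs replaces the Redis sorted set, the constant names and `select_stale_for_cull` are hypothetical, and the inclusive comparison mirrors the default behaviour of ZRANGEBYSCORE rather than anything confirmed by the source.

```rust
/// Maximum pending age: 24 hours in milliseconds (value from the PR summary;
/// the constant name is an assumption).
const MAX_PENDING_AGE_MS: u64 = 24 * 60 * 60 * 1000;
/// Per-cycle cap on culled transactions (value from the PR summary).
const MAX_CULL_PER_CYCLE: usize = 500;

/// In-memory stand-in for the store call: returns up to `limit` entries whose
/// score (queued-at time in ms) is at or below `older_than_unix_ms`, oldest first.
fn peek_pending_transactions_older_than(
    pending: &[(String, u64)],
    older_than_unix_ms: u64,
    limit: usize,
) -> Vec<(String, u64)> {
    let mut stale: Vec<(String, u64)> = pending
        .iter()
        .filter(|(_, queued_at)| *queued_at <= older_than_unix_ms)
        .cloned()
        .collect();
    // A sorted set already yields entries in score order; a plain slice does not.
    stale.sort_by_key(|(_, queued_at)| *queued_at);
    stale.truncate(limit);
    stale
}

/// Hypothetical helper: computes the cutoff and selects the entries one
/// worker cycle would fail.
fn select_stale_for_cull(pending: &[(String, u64)], now_unix_ms: u64) -> Vec<(String, u64)> {
    let cutoff = now_unix_ms.saturating_sub(MAX_PENDING_AGE_MS);
    peek_pending_transactions_older_than(pending, cutoff, MAX_CULL_PER_CYCLE)
}
```

Capping the batch keeps a single cycle bounded even when a large backlog has built up; any remaining stale entries would be picked up on later cycles.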

Sequence Diagram(s)

(Skipped — changes are primarily guards, store hydration/pagination, and worker culling without new multi-component sequential flows requiring visualization.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Fail stale jobs and cull zombie pending txns' directly describes the main changes: adding stale job failure logic across multiple handlers and implementing zombie pending transaction culling in the EOA worker. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 84.62% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@executors/src/eoa/worker/mod.rs`:
- Around line 314-321: The current code logs errors from
self.cull_stale_pending_transactions().await and continues to call send_flow(),
which allows stale >24h pending entries to remain; instead fail the worker when
cull_stale_pending_transactions() errors by propagating or returning the error
so the worker run is nacked and retried later. Locate the call to
cull_stale_pending_transactions() in mod.rs (the block that logs and then
proceeds to call send_flow()), remove the swallowing log-only behavior and
replace it with error propagation (use the ? operator or return Err(e) converted
to the function's Result type) so send_flow() is not invoked on cull failure.

In `@executors/src/external_bundler/confirm.rs`:
- Around line 167-181: The stale-age branch currently returns
UserOpConfirmationError::ReceiptNotAvailable which lacks any message/age, so
create and use a distinct error variant (e.g., UserOpConfirmationError::StaleJob
{ user_op_hash, attempt_number, age_seconds } or add a message/age field to
ReceiptNotAvailable) and return that variant from the branch where
job_age_seconds(job) > MAX_CONFIRMATION_JOB_AGE_SECONDS; update the failing call
site that currently uses ReceiptNotAvailable to construct the new variant and
ensure downstream handlers (e.g., on_fail) can detect and surface the "aged out
after X seconds" reason.
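
The first inline comment asks that a cull failure stop the cycle instead of being logged and ignored. A minimal synchronous sketch of that control flow is shown here. The `Worker` struct, its `cull_fails` and `sent` fields, the `WorkerError` type, and `run_cycle` are all hypothetical, and the real methods are async.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum WorkerError {
    Store(String),
}

/// Hypothetical worker. `cull_fails` simulates a store error during culling,
/// and `sent` counts how many times the send flow actually ran.
struct Worker {
    cull_fails: bool,
    sent: u32,
}

impl Worker {
    /// Stand-in for the culling step; returns how many entries were culled.
    fn cull_stale_pending_transactions(&self) -> Result<usize, WorkerError> {
        if self.cull_fails {
            Err(WorkerError::Store("redis unavailable".to_string()))
        } else {
            Ok(0)
        }
    }

    /// Stand-in for the send flow.
    fn send_flow(&mut self) -> Result<(), WorkerError> {
        self.sent += 1;
        Ok(())
    }

    /// Log the cull error, then propagate it with `?` so `send_flow` is never
    /// reached when culling fails.
    fn run_cycle(&mut self) -> Result<(), WorkerError> {
        self.cull_stale_pending_transactions().map_err(|e| {
            eprintln!("failed to cull stale pending transactions: {:?}", e);
            e
        })?;
        self.send_flow()
    }
}
```

With this shape, a failed cull returns early and the caller can nack and retry the whole run, which is the behaviour the comment requests.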

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eff66910-440f-484b-8e6c-a7171cba0ab7

📥 Commits

Reviewing files that changed from the base of the PR and between 8d18867 and cca2e05.

📒 Files selected for processing (6)
  • executors/src/eip7702_executor/confirm.rs
  • executors/src/eip7702_executor/send.rs
  • executors/src/eoa/store/mod.rs
  • executors/src/eoa/worker/mod.rs
  • executors/src/external_bundler/confirm.rs
  • executors/src/external_bundler/send.rs

Comment thread executors/src/eoa/worker/mod.rs Outdated
Comment thread executors/src/external_bundler/confirm.rs
Replace manual if-let logging with combinators in EoaExecutorWorker: call cull_stale_pending_transactions().await.inspect_err(...).map_err(|e| e.handle()) to log errors and convert them via handle().

Introduce a new UserOpConfirmationError::StaleJob variant carrying user_op_hash, attempt_number, and age_seconds to represent confirmation jobs that aged out; return this variant when a job exceeds MAX_CONFIRMATION_JOB_AGE_SECONDS and update the error-to-message mapping to include a message for StaleJob.
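
A simplified sketch of such a variant and its message mapping is shown here. The variant and field names follow the commit message, but the field types, the shape of `ReceiptNotAvailable`, the `Display` wording, the `check_confirmation_age` helper, and the 24h value of the constant are assumptions for illustration.

```rust
use std::fmt;

/// Assumed to be 24 hours, per the PR description.
const MAX_CONFIRMATION_JOB_AGE_SECONDS: u64 = 24 * 60 * 60;

/// Simplified stand-in for the real error enum; field types are assumptions.
#[derive(Debug, PartialEq)]
enum UserOpConfirmationError {
    ReceiptNotAvailable {
        user_op_hash: String,
        attempt_number: u32,
    },
    StaleJob {
        user_op_hash: String,
        attempt_number: u32,
        age_seconds: u64,
    },
}

impl fmt::Display for UserOpConfirmationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ReceiptNotAvailable { user_op_hash, attempt_number } => write!(
                f,
                "receipt not available for {} (attempt {})",
                user_op_hash, attempt_number
            ),
            Self::StaleJob { user_op_hash, attempt_number, age_seconds } => write!(
                f,
                "confirmation job for {} aged out after {} seconds (attempt {})",
                user_op_hash, age_seconds, attempt_number
            ),
        }
    }
}

/// Hypothetical guard: returns `StaleJob` once the job exceeds the age limit.
fn check_confirmation_age(
    user_op_hash: &str,
    attempt_number: u32,
    age_seconds: u64,
) -> Result<(), UserOpConfirmationError> {
    if age_seconds > MAX_CONFIRMATION_JOB_AGE_SECONDS {
        return Err(UserOpConfirmationError::StaleJob {
            user_op_hash: user_op_hash.to_string(),
            attempt_number,
            age_seconds,
        });
    }
    Ok(())
}
```

Carrying `age_seconds` in the variant lets a failure handler report why the job was dropped, which a bare `ReceiptNotAvailable` could not do.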
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
executors/src/eoa/store/mod.rs (1)

583-619: Extract pending hydration/cleanup into a shared helper to avoid drift.

This block duplicates near-identical logic from peek_pending_transactions_paginated and peek_pending_transactions_with_optimistic_nonce (same HGET user_request + deserialize + ZREM missing entries flow). A shared internal helper would reduce future divergence bugs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@executors/src/eoa/store/mod.rs` around lines 583 - 619, The code in
peek_pending_transactions_paginated and
peek_pending_transactions_with_optimistic_nonce duplicates the logic that HGETs
"user_request", deserializes it, collects PendingTransaction structs, and ZREM's
missing entries; extract this into a private helper (e.g., a method like
hydrate_pending_transactions or collect_and_cleanup_pending) that accepts the
Redis connection, an iterator/Vec of (transaction_id, queued_at), and uses
transaction_data_key_name to build keys, deserializes into
PendingTransaction.user_request, accumulates results, and issues ZREM calls
against keys.pending_transactions_zset_name() for missing entries; replace the
duplicated blocks with calls to that helper to centralize error handling and
deletion logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 39537e4f-cad8-477f-a1da-798bc08d8b01

📥 Commits

Reviewing files that changed from the base of the PR and between cca2e05 and 1cf1f24.

📒 Files selected for processing (3)
  • executors/src/eoa/store/mod.rs
  • executors/src/eoa/worker/mod.rs
  • executors/src/external_bundler/confirm.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • executors/src/eoa/worker/mod.rs
  • executors/src/external_bundler/confirm.rs

Introduce hydrate_pending_transactions to centralize the logic for pipelined HGET of transaction user_request fields, JSON deserialization into PendingTransaction, and cleanup (ZREM) of orphaned zset entries. Replace duplicated hydration code in several peek_pending_transactions* methods with calls to the new helper, allocate the result Vec with capacity, and adjust pipeline query calls accordingly. This refactor reduces duplication and consolidates deserialization/error handling and cleanup in one place for easier maintenance.
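
The hydrate-and-cleanup pattern can be sketched with plain collections standing in for Redis. Everything here is an assumption made for illustration: `InMemoryStore`, its fields, and the `PendingTransaction` layout are hypothetical, a `HashMap` lookup replaces the pipelined HGET, `Vec::retain` replaces ZREM, and the stored request is kept as a raw string instead of being deserialized from JSON.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified pending transaction record.
#[derive(Debug, PartialEq)]
struct PendingTransaction {
    transaction_id: String,
    queued_at: u64,
    user_request: String,
}

/// In-memory stand-in for the Redis-backed store.
struct InMemoryStore {
    /// Stand-in for the pending sorted set: (transaction_id, queued_at).
    pending_zset: Vec<(String, u64)>,
    /// Stand-in for the per-transaction hash holding `user_request`.
    transaction_data: HashMap<String, String>,
}

impl InMemoryStore {
    /// Looks up the stored request for each ID, collects the hydrated entries,
    /// and removes IDs whose data is missing from the pending set.
    fn hydrate_pending_transactions(&mut self, ids: Vec<(String, u64)>) -> Vec<PendingTransaction> {
        // Preallocate, as the refactor describes.
        let mut result = Vec::with_capacity(ids.len());
        let mut orphans: Vec<String> = Vec::new();

        for (transaction_id, queued_at) in ids {
            match self.transaction_data.get(&transaction_id) {
                Some(user_request) => result.push(PendingTransaction {
                    transaction_id,
                    queued_at,
                    user_request: user_request.clone(),
                }),
                None => orphans.push(transaction_id),
            }
        }

        // Clean up orphaned entries in one pass, only when there are any.
        if !orphans.is_empty() {
            self.pending_zset.retain(|(id, _)| !orphans.contains(id));
        }
        result
    }
}
```

Each peek method would then fetch its own `(transaction_id, queued_at)` pairs and delegate to the shared helper, so the cleanup rule lives in one place.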
@0xFirekeeper 0xFirekeeper merged commit b6f39f4 into main Apr 21, 2026
2 of 3 checks passed
@0xFirekeeper 0xFirekeeper deleted the firekeeper/stale branch April 21, 2026 11:08