Fail stale jobs and cull zombie pending txns#106
Conversation
Bound job lifetimes and remove stale pending transactions to prevent zombie retries and unbounded resource growth. Adds a 24h max age check for confirmation and send jobs (EIP-7702 and external bundler), including a job_age_seconds helper and logging, so long-lived retrying jobs are permanently failed. Implements peek_pending_transactions_older_than in the EOA store to fetch pending entries older than a cutoff (and clean up missing data). Adds EOA worker logic to cull stale pending transactions (24h cutoff, max 500 per cycle) by batch-failing them and enqueuing failure webhooks. Small logging/error messages added to surface these events.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
Disabled knowledge base sources:
WalkthroughAdds time-based staleness guards: send/confirm handlers now permanently fail jobs older than configured thresholds; EOA store gains age-based peek and centralized hydration; EOA worker culls and fails stale pending transactions at start of its loop. Changes
Sequence Diagram(s)(Skipped — changes are primarily guards, store hydration/pagination, and worker culling without new multi-component sequential flows requiring visualization.) Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@executors/src/eoa/worker/mod.rs`:
- Around line 314-321: The current code logs errors from
self.cull_stale_pending_transactions().await and continues to call send_flow(),
which allows stale >24h pending entries to remain; instead fail the worker when
cull_stale_pending_transactions() errors by propagating or returning the error
so the worker run is nacked and retried later. Locate the call to
cull_stale_pending_transactions() in mod.rs (the block that logs and then
proceeds to call send_flow()), remove the swallowing log-only behavior and
replace it with error propagation (use the ? operator or return Err(e) converted
to the function's Result type) so send_flow() is not invoked on cull failure.
In `@executors/src/external_bundler/confirm.rs`:
- Around line 167-181: The stale-age branch currently returns
UserOpConfirmationError::ReceiptNotAvailable which lacks any message/age, so
create and use a distinct error variant (e.g., UserOpConfirmationError::StaleJob
{ user_op_hash, attempt_number, age_seconds } or add a message/age field to
ReceiptNotAvailable) and return that variant from the branch where
job_age_seconds(job) > MAX_CONFIRMATION_JOB_AGE_SECONDS; update the failing call
site that currently uses ReceiptNotAvailable to construct the new variant and
ensure downstream handlers (e.g., on_fail) can detect and surface the "aged out
after X seconds" reason.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: eff66910-440f-484b-8e6c-a7171cba0ab7
📒 Files selected for processing (6)
executors/src/eip7702_executor/confirm.rsexecutors/src/eip7702_executor/send.rsexecutors/src/eoa/store/mod.rsexecutors/src/eoa/worker/mod.rsexecutors/src/external_bundler/confirm.rsexecutors/src/external_bundler/send.rs
Replace manual if-let logging with combinators in EoaExecutorWorker: call cull_stale_pending_transactions().await.inspect_err(...).map_err(|e| e.handle()) to log errors and convert them via handle(). Introduce a new UserOpConfirmationError::StaleJob variant carrying user_op_hash, attempt_number, and age_seconds to represent confirmation jobs that aged out; return this variant when a job exceeds MAX_CONFIRMATION_JOB_AGE_SECONDS and update the error-to-message mapping to include a message for StaleJob.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
executors/src/eoa/store/mod.rs (1)
583-619: Extract pending hydration/cleanup into a shared helper to avoid drift.This block duplicates near-identical logic from
peek_pending_transactions_paginatedandpeek_pending_transactions_with_optimistic_nonce(sameHGET user_request+ deserialize +ZREMmissing entries flow). A shared internal helper would reduce future divergence bugs.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@executors/src/eoa/store/mod.rs` around lines 583 - 619, The code in peek_pending_transactions_paginated and peek_pending_transactions_with_optimistic_nonce duplicates the logic that HGETs "user_request", deserializes it, collects PendingTransaction structs, and ZREM's missing entries; extract this into a private helper (e.g., a method like hydrate_pending_transactions or collect_and_cleanup_pending) that accepts the Redis connection, an iterator/Vec of (transaction_id, queued_at), and uses transaction_data_key_name to build keys, deserializes into PendingTransaction.user_request, accumulates results, and issues ZREM calls against keys.pending_transactions_zset_name() for missing entries; replace the duplicated blocks with calls to that helper to centralize error handling and deletion logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@executors/src/eoa/store/mod.rs`:
- Around line 583-619: The code in peek_pending_transactions_paginated and
peek_pending_transactions_with_optimistic_nonce duplicates the logic that HGETs
"user_request", deserializes it, collects PendingTransaction structs, and ZREM's
missing entries; extract this into a private helper (e.g., a method like
hydrate_pending_transactions or collect_and_cleanup_pending) that accepts the
Redis connection, an iterator/Vec of (transaction_id, queued_at), and uses
transaction_data_key_name to build keys, deserializes into
PendingTransaction.user_request, accumulates results, and issues ZREM calls
against keys.pending_transactions_zset_name() for missing entries; replace the
duplicated blocks with calls to that helper to centralize error handling and
deletion logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 39537e4f-cad8-477f-a1da-798bc08d8b01
📒 Files selected for processing (3)
executors/src/eoa/store/mod.rsexecutors/src/eoa/worker/mod.rsexecutors/src/external_bundler/confirm.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- executors/src/eoa/worker/mod.rs
- executors/src/external_bundler/confirm.rs
Introduce hydrate_pending_transactions to centralize the logic for pipelined HGET of transaction user_request fields, JSON deserialization into PendingTransaction, and cleanup (ZREM) of orphaned zset entries. Replace duplicated hydration code in several peek_pending_transactions* methods with calls to the new helper, allocate the result Vec with capacity, and adjust pipeline query calls accordingly. This refactor reduces duplication and consolidates deserialization/error handling and cleanup in one place for easier maintenance.
Bound job lifetimes and remove stale pending transactions to prevent zombie retries and unbounded resource growth. Adds a 24h max age check for confirmation and send jobs (EIP-7702 and external bundler), including a job_age_seconds helper and logging, so long-lived retrying jobs are permanently failed. Implements peek_pending_transactions_older_than in the EOA store to fetch pending entries older than a cutoff (and clean up missing data). Adds EOA worker logic to cull stale pending transactions (24h cutoff, max 500 per cycle) by batch-failing them and enqueuing failure webhooks. Small logging/error messages added to surface these events.
Summary by CodeRabbit