fix: handle recovery in queue processing #57

Merged
ren0503 merged 1 commit into master from fix/ren/56-recovery-job-in-queue-process
Oct 20, 2025

Conversation


@ren0503 ren0503 commented Oct 20, 2025

No description provided.

@ren0503 ren0503 added this to the Queue v2.1.1 milestone Oct 20, 2025
@ren0503 ren0503 linked an issue Oct 20, 2025 that may be closed by this pull request

coderabbitai bot commented Oct 20, 2025

Summary by CodeRabbit

  • Chores

    • Updated Go toolchain from v1.23.0 to v1.24.0 and bumped key dependencies for improved performance and stability.
  • Bug Fixes

    • Enhanced error handling in job execution with panic recovery and proper logging to prevent unexpected crashes.
  • Tests

    • Updated crash handling tests for improved error scenario coverage.

Walkthrough

Renamed the job error handler to exported HandlerError, added a defer-recover that logs panics in job execution goroutines, and bumped Go/tooling and several dependencies; tests updated to exercise an early panic path.

Changes

Cohort / File(s) Summary
Dependency Updates
go.mod
Bumped Go version (1.23.0 → 1.24.0) and toolchain (go1.24.1); upgraded github.com/go-redsync/redsync/v4 v4.13.0→v4.14.0, github.com/redis/go-redis/v9 v9.10.0→v9.14.1, github.com/tinh-tinh/tinhtinh/v2 v2.1.3→v2.3.4; removed indirect golang.org/x/sync v0.7.0.
Job Error Handling
job.go
Renamed handlerError → exported HandlerError; updated call sites to use Job.HandlerError(...); error storage key changed to use queue name directly.
Panic Recovery & Logging
queue.go
Added fmt import; wrapped job execution goroutines with defer+recover to convert panics into formatted logs via fmt.Sprintf and q.formatLog; refactored log dispatch to switch on log type.
Test Coverage
queue_test.go
Updated Test_Crash: enqueued a second job (Id == "2") and added an early panic path before Process() to validate recovery behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Queue
    participant Goroutine
    participant Job
    participant Logger

    Queue->>Goroutine: spawn job execution
    activate Goroutine
    Note right of Goroutine: defer { recover -> format log } (NEW)
    Goroutine->>Job: invoke Process()
    alt Process returns error
        Job->>Job: HandlerError(reason)
        Job-->>Goroutine: status handled
    else Process panics
        Goroutine->>Logger: recover panic -> fmt.Sprintf -> q.formatLog (NEW)
        Logger-->>Goroutine: fatal/log emitted
    end
    deactivate Goroutine

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through code where panics played,

Caught by a defer and softly laid.
Logs now bloom where crashes stood,
Gentle fixes, tidy and good. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description Check (❓ Inconclusive): No pull request description was provided, so there is nothing to evaluate against the changeset. While the title indicates the PR handles recovery in queue processing, a description covering the motivation for the changes, the dependency updates, the error handler renaming in job.go, and the panic recovery mechanism in queue.go would help reviewers.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The pull request title "fix: handle recovery in queue processing" directly aligns with the primary change in the changeset: queue.go adds a defer-recover block to the Run method to handle panics during job execution and convert them into formatted logs. While the changeset also includes dependency updates in go.mod and a method rename in job.go, the title appropriately captures the main purpose of the PR, as indicated by the branch name "fix/ren/56-recovery-job-in-queue-process" and the Codecov comments highlighting queue.go coverage.

@ren0503 ren0503 force-pushed the fix/ren/56-recovery-job-in-queue-process branch from 1c917fc to 278848f on October 20, 2025 at 13:54
@codecov-commenter

Codecov Report

❌ Patch coverage is 84.61538% with 2 lines in your changes missing coverage. Please review.

Files with missing lines:
  • queue.go: 77.77% patch coverage, 2 lines missing ⚠️



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
queue.go (1)

256-266: Data race on finishedJob slice in Retry().

Multiple goroutines append to finishedJob without synchronization. This is a race and can corrupt the slice.

-       var finishedJob []string
+       var finishedJob []string
+       var mu sync.Mutex
        for i := range numJobs {
            job := numJobs[i]
            wg.Add(1)
            go func(job *Job) {
                defer wg.Done()
                q.jobFnc(job)
                if job.IsFinished() {
-                   finishedJob = append(finishedJob, job.Id)
+                   mu.Lock()
+                   finishedJob = append(finishedJob, job.Id)
+                   mu.Unlock()
                }
            }(job)
        }
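
A minimal, self-contained sketch of the suggested fix (names are illustrative, not the library's actual API): guard the shared slice with a `sync.Mutex` so concurrent appends from the worker goroutines cannot race.

```go
package main

import (
	"fmt"
	"sync"
)

// collectFinished sketches the mutex-guarded append suggested above:
// each goroutine locks before appending to the shared slice, so no
// appends are lost or corrupted under concurrency.
func collectFinished(ids []string) []string {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		finished []string
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			mu.Lock()
			finished = append(finished, id)
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return finished
}

func main() {
	done := collectFinished([]string{"1", "2", "3"})
	fmt.Println(len(done)) // all three appends survive
}
```

Running the unguarded version under `go test -race` would flag the concurrent append; with the mutex the race detector is satisfied.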
🧹 Nitpick comments (4)
queue.go (1)

181-183: Avoid deferring cancel() inside the processing loop.

Deferring cancel in the loop delays timer cleanup until Run returns, leaking timers per batch. Call cancel() after the select instead.

-       ctx, cancel := context.WithTimeout(context.Background(), q.config.Timeout)
-       defer cancel()
+       ctx, cancel := context.WithTimeout(context.Background(), q.config.Timeout)
...
        select {
        case <-done:
            q.formatLog(LoggerInfo, "All jobs done\n")
        case <-ctx.Done():
            q.MarkJobFailedTimeout(numJobs)
        }
+       // Clean up the timer promptly per-iteration
+       cancel()
queue_test.go (1)

181-183: Test_Crash may still fail if recovered panics are logged with Fatal.

If the production code logs recovered panics with LoggerFatal, the process exits (os.Exit) and tests abort. After applying the non-fatal logging fix in queue.go, this should be stable. If you need a stopgap, set Logger to LoggerInfo or LoggerDisabled here.

 userQueue := queue.New("crash", &queue.Options{
@@
-    RetryFailures: 3,
+    RetryFailures: 3,
+    // Optional: keep CI resilient even if logging behavior changes
+    // Logger:        queue.LoggerInfo,
 })

Also applies to: 194-197

job.go (2)

73-79: Optional: capture stack trace for diagnostics on panic.

You already convert the panic into a failure via HandlerError. Capturing debug.Stack() here improves postmortem debugging.

 import (
     "context"
     "fmt"
+    "runtime/debug"
     "time"
 )
@@
     defer func() {
         if r := recover(); r != nil {
-            failedReason := fmt.Sprintf("%v", r)
-            job.HandlerError(failedReason)
+            job.Stacktrace = append(job.Stacktrace, string(debug.Stack()))
+            job.HandlerError(fmt.Sprintf("%v", r))
         }
     }()

89-105: Use the queue’s context when storing errors.

HandlerError should use job.queue.ctx (or a passed-in ctx) instead of context.Background(), to respect caller cancellation and timeouts.

-   _, err := client.HSet(context.Background(), job.queue.Name, job.Id, job.FailedReason).Result()
+   _, err := client.HSet(job.queue.ctx, job.queue.Name, job.Id, job.FailedReason).Result()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff38f41 and 1c917fc.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • go.mod (1 hunks)
  • job.go (2 hunks)
  • queue.go (2 hunks)
  • queue_test.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
queue.go (1)
constant.go (1)
  • LoggerFatal (8-8)
queue_test.go (1)
queue.go (1)
  • AddJobOptions (110-114)
🔇 Additional comments (1)
go.mod (1)

3-5: Verify minimum Go 1.24 across CI/dev; forward compatibility with 1.25+ confirmed.

The go 1.24.0 directive is correct, but the concern about "builds will fail" without go1.24.x is imprecise. Builds succeed on go 1.25+ (verified: system running go 1.25.1 resolves all modules and compiles successfully). The actual requirement is go ≥ 1.24.0; versions older than 1.24.0 will cause build failures.

Edge cases to verify in your CI fleet:

  • If any runner is on macOS 11: Go 1.25+ drops macOS 11 support; stay on 1.24.x or upgrade runners to macOS 12+.
  • If tests are floating-point sensitive (AMD64): Go 1.25+ uses FMA instructions, which may alter precision; confirm test stability.
  • If using cgo with C bindings: Go 1.25 enables ASAN leak detection by default; may flag previously silent leaks (disable with ASAN_OPTIONS if needed).

Otherwise, ensuring runners have go1.24.0+ (including 1.25+) is sufficient.

Comment on lines +192 to +197
defer func() {
if r := recover(); r != nil {
failedReason := fmt.Sprintf("%v", r)
q.formatLog(LoggerFatal, "Error when processing job: %v\n", failedReason)
}
}()

⚠️ Potential issue | 🔴 Critical

Recovered panic should not terminate the process; also mark the job failed/delayed.

  • log.Fatalf in a recover path exits the process, defeating recovery and likely breaking Test_Crash.
  • The job isn’t marked failed/delayed here, so an early panic (before job.Process) leaves it stuck in Wait.

Fix by logging non-fatally and recording the failure via HandlerError:

-               defer func() {
-                   if r := recover(); r != nil {
-                       failedReason := fmt.Sprintf("%v", r)
-                       q.formatLog(LoggerFatal, "Error when processing job: %v\n", failedReason)
-                   }
-               }()
+               defer func() {
+                   if r := recover(); r != nil {
+                       // Ensure the job transitions out of Wait/Active
+                       job.HandlerError(fmt.Sprint(r))
+                       // Do not exit the process; just log the incident
+                       q.formatLog(LoggerInfo, "Recovered panic while processing job %s: %v\n", job.Id, r)
+                   }
+               }()
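
The behavior difference can be sketched with a minimal stand-in for the library's Job type (field and method names here are illustrative, not the real API): on panic, the job is marked failed and the process keeps running, so an early panic no longer leaves the job stuck in Wait.

```go
package main

import "fmt"

// job is a minimal stand-in for the library's Job type.
type job struct {
	id     string
	status string
	reason string
}

// markFailed mirrors what HandlerError would do: record the reason and
// move the job out of the Wait/Active states.
func (j *job) markFailed(reason string) {
	j.status = "failed"
	j.reason = reason
}

// runSafely sketches the suggested fix: recover the panic, record the
// failure on the job, and log non-fatally so the process survives.
func runSafely(j *job, process func()) {
	defer func() {
		if r := recover(); r != nil {
			j.markFailed(fmt.Sprint(r))
			fmt.Printf("Recovered panic while processing job %s: %v\n", j.id, r)
		}
	}()
	j.status = "active"
	process()
	j.status = "completed"
}

func main() {
	j := &job{id: "2", status: "wait"}
	runSafely(j, func() { panic("early panic") })
	fmt.Println(j.status, j.reason)
}
```

By contrast, a fatal log call in the recover path would exit the process before the status transition could matter, which is the failure mode the comment flags.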

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
job.go (1)

73-78: Consider capturing the stack trace for panics.

The defer-recover block correctly handles panics, but it only captures the panic value without the stack trace. The Job struct has a Stacktrace field (line 27) that could be populated here for better debugging.

Apply this diff to capture the stack trace:

+import (
+	"runtime/debug"
+)
+
 defer func() {
 	if r := recover(); r != nil {
 		failedReason := fmt.Sprintf("%v", r)
+		job.Stacktrace = append(job.Stacktrace, string(debug.Stack()))
 		job.HandlerError(failedReason)
 	}
 }()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c917fc and 278848f.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • go.mod (1 hunks)
  • job.go (2 hunks)
  • queue.go (3 hunks)
  • queue_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • queue.go
  • queue_test.go
  • go.mod
🔇 Additional comments (3)
job.go (3)

69-87: LGTM: Consistent error handling flow.

The Process method now handles both panic recovery (line 76) and normal errors (line 85) through the same HandlerError method, providing a consistent error handling path.


89: Verify that exporting HandlerError is intentional and necessary.

Based on a search of the codebase, HandlerError is not called from any external packages—only internally within job.go. According to Go best practices, unexported methods are a perfectly good design choice when they're not needed to satisfy an interface, keeping them from becoming part of the externally visible API.

Consider whether this method should remain exported or be reverted to an unexported method, as minimizing the exported API surface keeps the package's usage clear and intentions explicit.


94-94: Breaking change confirmed: Redis error storage key was simplified, but collision concern is unsupported.

The change from <queue>store to <queue> is intentional (commit 278848f). However, the review's concern about collisions is unfounded—no other data in the codebase uses queue.Name as a Redis key. The application itself won't break since error data is never read back.

Verify: Is this simplification intentional? If so, document it as a breaking change for any external tools reading the old <queue>store key and provide migration guidance for existing deployments.

@ren0503 ren0503 merged commit a76f848 into master Oct 20, 2025
2 checks passed


Development

Successfully merging this pull request may close these issues.

Processing In Queue Not Recovery When Panic

2 participants