Fix client state not being reset after failed Start() #1136
mfly wants to merge 2 commits into riverqueue:master
Fixed the deadlock issue revealed by the stress tests - they pass locally now.
When Client.Start() failed (e.g., due to database connection errors or missing tables), the internal isRunning flag remained true. This caused subsequent Start() calls to return nil immediately without actually attempting to start the client, leaving the application in a non-functional state where jobs were never processed. The fix calls baseStartStop.Stop() after closing the stopped channel on startup failure, which properly resets the client's internal state so that Start() can be called again.
When Start() fails due to a real error (e.g., database connection failure), the client's internal state is left marked as running, which makes subsequent Start() calls return without starting anything. Add a StartFailed() method to BaseStartStop that properly resets internal state after a startup failure. This is separate from Stop() handling: when Stop() cancels the context (ErrStop), Stop() itself handles cleanup via finalizeStop(). Fixes the issue where a client could not be restarted after a transient startup failure.
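For readers who want to see the failure mode in isolation, here is a minimal sketch of the bug and of an explicit reset in the spirit of StartFailed(). It is not River's code: `startStop`, `client`, `markStopped`, `startFailed`, `resetOnError`, and `errConnect` are hypothetical names made up for the illustration, and the real BaseStartStop tracks more state than this.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// startStop is a hypothetical, simplified stand-in for a start/stop base
// type. It is NOT River's actual BaseStartStop.
type startStop struct {
	mu        sync.Mutex
	isRunning bool
	stopped   chan struct{}
}

// startInit reports whether the caller should go ahead with startup. It
// returns false when the service already believes it is running.
func (s *startStop) startInit() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.isRunning {
		return false
	}
	s.isRunning = true
	s.stopped = make(chan struct{})
	return true
}

// markStopped closes the stopped channel to signal that the run is over.
func (s *startStop) markStopped() {
	s.mu.Lock()
	defer s.mu.Unlock()
	close(s.stopped)
}

// startFailed clears the running state so that a later start can retry.
func (s *startStop) startFailed() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.isRunning = false
	s.stopped = nil
}

var errConnect = errors.New("database unavailable")

// client is a hypothetical service whose connect step may fail.
type client struct {
	startStop
	connect      func() error
	resetOnError bool // false reproduces the bug, true applies the reset
}

func (c *client) Start() error {
	if !c.startInit() {
		return nil // already running, or at least the flag says so
	}
	if err := c.connect(); err != nil {
		c.markStopped()
		if c.resetOnError {
			c.startFailed()
		}
		return err
	}
	return nil
}

func main() {
	for _, fixed := range []bool{false, true} {
		attempts := 0
		c := &client{
			resetOnError: fixed,
			connect: func() error {
				attempts++
				if attempts == 1 {
					return errConnect // only the first attempt fails
				}
				return nil
			},
		}
		first, second := c.Start(), c.Start()
		fmt.Printf("fixed=%t first=%v second=%v attempts=%d\n", fixed, first, second, attempts)
	}
}
```

Without the reset, the second Start() returns nil after a single connect attempt, so nothing is actually running. With the reset, the second call makes a second connect attempt and succeeds.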
ff3ea11 to f01b895
@mfly Thanks for this! Let me spend a little time looking a bit more closely - the idea behind the …
Yes, please do 🙏 - I did ponder a bit on a more automated solution, but opted for being more explicit. I'm obviously lacking the context here.
This one's presented as an alternative to #1136. Basically, a current problem with the start/stop infrastructure is that in the event of a partial start, where a service returns from its start function without `Stop` having been called on it, the start/stop's `isRunning` flag can still be set to true, and when the start/stop is started again, it falls through thinking it's already running. Here, we check for this condition on subsequent starts: if the `stopped` channel is non-nil but already closed, we reset all internal state, including `isRunning`, so the service can start again. To prove this works, I pull in the test case added in #1136 verbatim, and also add one more specific test in `start_stop_test.go` that exercises the condition more precisely.
Sorry for the delay on this one. I put up a variant at #1187 and copied your test out to make sure it also resolves the problem.
(And thanks for the original fix!)
Great, thanks! Closing this one!
…art (#1187)
When Client.Start() failed (e.g., due to database connection errors or missing tables), the internal isRunning flag remained true. This caused subsequent Start() calls to return nil immediately without actually attempting to start the client, leaving the application in a non-functional state where jobs were never processed.
The fix adds a new StartFailed() method to BaseStartStop that properly resets internal state after a startup failure. This is called when Start() encounters a real error (not when Stop() cancels the context).