-
Notifications
You must be signed in to change notification settings - Fork 809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(queue): fix ability to cancel a zombied execution #4473
fix(queue): fix ability to cancel a zombied execution #4473
Conversation
) | ||
queue.ensure(taskMessage, Duration.ZERO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the key change. If an execution is a zombie, queue.reschedule()
will not do anything. Calling ensure()
first puts a RunTask
message in the queue if it does not already exist.
As the execution is already flagged as cancelled at this point this is a safe operation, since the RunTaskHandler
checks this first before doing anything else.
@Mergifyio update |
✅ Branch has been successfully updated |
@@ -46,6 +46,18 @@ class ExecutionLatch(private val predicate: Predicate<ExecutionComplete>) : | |||
fun await() = latch.await(10, TimeUnit.SECONDS) | |||
} | |||
|
|||
fun ConfigurableApplicationContext.run(execution: PipelineExecution, launcher: (PipelineExecution) -> Unit) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dumb question: what's the purpose of this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runUntilCompletion()
starts the pipeline and then waits 10 seconds for it to finish, which doesn't work for the purposes of this test when we need to mutate the queue after starting the execution.
This method is the same as runUntilCompletion
without the part that waits. The sleep is to allow the messages for starting the execution to be processed before we continue on to deleting the queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, and thank you for fixing this! I do have some questions though. Apologies in advance for no good deed going unpunished. Now that you've figured this out, I'd love to try to get all the info in your brain written down.
- Cannot cancel pipeline spinnaker#6224 talks about not cancelling pipelines. This PR seems to fix cancelling of zombies...is it clear that 6224 is really about zombies?
- Any chance this fixes the "Failed to evaluate" messages, or the "hundreds of calls to clouddriver" from Cannot cancel pipeline spinnaker#6224 (comment)?
- Does this fix cancelling of
NOT_STARTED
stages (e.g. Cannot cancel pipeline spinnaker#6224 (comment))? - Is it worth logging a warning in RedisQueue.reschedule, similar to SqlQueue.reschedule? Maybe an error is more appropriate. Updating the javadoc to mention what happens in this case seems like a good idea too.
- Anything to update in https://spinnaker.io/docs/guides/runbooks/orca-zombie-executions/ to help people who might struggle with this? Maybe something about the kind of failure-to-cancel that could occur before spinnaker versions X.Y.Z (i.e. before this PR is merged/backported)?
I'd like to get people in spinnaker/spinnaker#6224 to confirm if they still run into this once this is merged. It's possible there are other issues that could prevent cancelling executions. I'm interested in @karlskewes account of running into this with the SQL queue as it's not clear to me how you could get zombies with SQL (assuming it's not running on Kubernetes..) I can't see anything obviously wrong with the actual cancel logic. It just (fairly not not always realistically) assumes there's a message to reschedule.
If zombies are the issue, kinda. That issue is the zombie lookup agent only considers stages that are running. This change should allow users to just cancel themselves rather than relying on the agent.
Probably. I'll add one.
Sure. PR incoming. |
Thank you but sorry, I don't have access to the environment to validate. |
@Mergifyio backport release-1.31.x release-1.30.x |
✅ Backports have been created
|
* fix(queue): fix ability to cancel a zombied execution * fix(queue): undo unintentional change * fix(queue): add more logging --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit 56c7206)
* fix(queue): fix ability to cancel a zombied execution * fix(queue): undo unintentional change * fix(queue): add more logging --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit 56c7206)
* fix(queue): fix ability to cancel a zombied execution * fix(queue): undo unintentional change * fix(queue): add more logging --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit 56c7206) Co-authored-by: Matt <6519811+mattgogerly@users.noreply.github.com>
* fix(queue): fix ability to cancel a zombied execution * fix(queue): undo unintentional change * fix(queue): add more logging --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit 56c7206) Co-authored-by: Matt <6519811+mattgogerly@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Tentative fix for spinnaker/spinnaker#6224.