fix(queue): fix ability to cancel a zombied execution #4473

mattgogerly · 2023-06-16T12:13:24Z

Tentative fix for spinnaker/spinnaker#6224.

mattgogerly · 2023-06-16T12:18:50Z

orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RescheduleExecutionHandler.kt

              )
+              queue.ensure(taskMessage, Duration.ZERO)


This is the key change. A zombie execution has no message in the queue, so queue.reschedule() finds nothing to reschedule and does nothing. Calling ensure() first puts a RunTask message in the queue if one does not already exist.

Since the execution is already flagged as cancelled at this point, this is safe: RunTaskHandler checks for cancellation before doing anything else.
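
To make the difference concrete, here is a minimal sketch using a toy in-memory queue. `ToyQueue`, `RunTask` as defined here, `cancelWithoutEnsure`, and `cancelWithEnsure` are hypothetical names for illustration only; they are not Orca or keiko classes, and the real queue implementations are backed by Redis or SQL. The sketch only assumes the two behaviours described above: rescheduling a missing message does nothing, and ensure adds a message only when it is absent.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical stand-in for a queue message; not Orca's real RunTask type.
data class RunTask(val executionId: String, val taskId: String)

// Toy in-memory queue modelling only the two behaviours discussed above.
class ToyQueue {
    // Maps each queued message to the time at which it becomes deliverable.
    private val scheduled = mutableMapOf<RunTask, Instant>()

    // Moves an existing message to a new delivery time; a no-op if it is absent.
    fun reschedule(message: RunTask, delay: Duration, now: Instant) {
        if (message in scheduled) scheduled[message] = now.plus(delay)
    }

    // Adds the message only if it is not already queued.
    fun ensure(message: RunTask, delay: Duration, now: Instant) {
        if (message !in scheduled) scheduled[message] = now.plus(delay)
    }

    fun contains(message: RunTask): Boolean = message in scheduled

    fun deliveryTime(message: RunTask): Instant? = scheduled[message]
}

// Cancel path that only reschedules: nothing happens for a zombie.
fun cancelWithoutEnsure(queue: ToyQueue, message: RunTask, now: Instant) {
    queue.reschedule(message, Duration.ZERO, now)
}

// Cancel path that ensures first: a zombie gets a message to act on.
fun cancelWithEnsure(queue: ToyQueue, message: RunTask, now: Instant) {
    queue.ensure(message, Duration.ZERO, now)
    queue.reschedule(message, Duration.ZERO, now)
}
```

In this model a healthy execution behaves the same on both paths, because its message already exists and is simply moved to "now". Only the zombie case differs: with ensure() the queue ends up holding a message that a handler can pick up and use to observe the cancellation.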

mattgogerly · 2023-06-16T14:08:27Z

@Mergifyio update

mergify · 2023-06-16T14:08:31Z

update

✅ Branch has been successfully updated

nemesisOsorio · 2023-06-16T20:40:10Z

orca-queue-tck/src/main/kotlin/com/netflix/spinnaker/orca/q/ExecutionLatch.kt

@@ -46,6 +46,18 @@ class ExecutionLatch(private val predicate: Predicate<ExecutionComplete>) :
  fun await() = latch.await(10, TimeUnit.SECONDS)
 }

+fun ConfigurableApplicationContext.run(execution: PipelineExecution, launcher: (PipelineExecution) -> Unit) {


dumb question: what's the purpose of this function?

mattgogerly · 2023-06-16

runUntilCompletion() starts the pipeline and then waits up to 10 seconds for it to finish. That doesn't work for this test, which needs to mutate the queue after starting the execution.

This method is the same as runUntilCompletion() without the part that waits. The sleep allows the messages that start the execution to be processed before we continue on to deleting the queue.
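
As a rough illustration of the two shapes, the sketch below contrasts "launch and wait for completion" with "launch, pause briefly, return". `ToyExecution`, `runUntilCompletionSketch`, and `runSketch` are hypothetical names; the real helpers are extension functions on ConfigurableApplicationContext and wait on an ExecutionLatch, which this sketch replaces with a plain CountDownLatch.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for a pipeline execution in a test.
class ToyExecution(val id: String) {
    // Counted down by whatever completes the execution.
    val completed = CountDownLatch(1)
}

// Launches the execution and blocks until it completes or the timeout elapses.
// Returns true if the execution completed in time.
fun runUntilCompletionSketch(
    execution: ToyExecution,
    timeoutMs: Long,
    launcher: (ToyExecution) -> Unit
): Boolean {
    launcher(execution)
    return execution.completed.await(timeoutMs, TimeUnit.MILLISECONDS)
}

// Launches the execution, pauses briefly so start-up work can be processed,
// then returns control to the test while the execution is still in flight.
fun runSketch(
    execution: ToyExecution,
    settleMs: Long,
    launcher: (ToyExecution) -> Unit
) {
    launcher(execution)
    Thread.sleep(settleMs)
}
```

A test that needs to tamper with the queue mid-execution would use the second shape, make its changes after the call returns, and only then assert on the outcome.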

dbyron-sf · 2023-06-19

LGTM, and thank you for fixing this! I do have some questions though. Apologies in advance for no good deed going unpunished. Now that you've figured this out, I'd love to try to get all the info in your brain written down.

- spinnaker/spinnaker#6224 talks about not cancelling pipelines. This PR seems to fix cancelling of zombies... is it clear that 6224 is really about zombies?
- Any chance this fixes the "Failed to evaluate" messages, or the "hundreds of calls to clouddriver" from spinnaker/spinnaker#6224 (comment)?
- Does this fix cancelling of NOT_STARTED stages (e.g. spinnaker/spinnaker#6224 (comment))?
- Is it worth logging a warning in RedisQueue.reschedule, similar to SqlQueue.reschedule? Maybe an error is more appropriate. Updating the javadoc to mention what happens in this case seems like a good idea too.
- Anything to update in https://spinnaker.io/docs/guides/runbooks/orca-zombie-executions/ to help people who might struggle with this? Maybe something about the kind of failure-to-cancel that could occur before spinnaker versions X.Y.Z (i.e. before this PR is merged/backported)?

mattgogerly · 2023-06-19T19:27:11Z

> spinnaker/spinnaker#6224 talks about not cancelling pipelines. This PR seems to fix cancelling of zombies... is it clear that 6224 is really about zombies?

I'd like to get people in spinnaker/spinnaker#6224 to confirm if they still run into this once this is merged.

It's possible there are other issues that could prevent cancelling executions. I'm interested in @karlskewes's account of running into this with the SQL queue, as it's not clear to me how you could get zombies with SQL (assuming it's not running on Kubernetes...).

I can't see anything obviously wrong with the actual cancel logic. It just (fairly, but not always realistically) assumes there's a message to reschedule.

> Does this fix cancelling of NOT_STARTED stages (e.g. spinnaker/spinnaker#6224 (comment))?

If zombies are the issue, kinda. The issue there is that the zombie lookup agent only considers stages that are running.

This change should allow users to cancel the execution themselves rather than relying on the agent.

> Is it worth logging a warning in RedisQueue.reschedule, similar to SqlQueue.reschedule? Maybe an error is more appropriate. Updating the javadoc to mention what happens in this case seems like a good idea too.

Probably. I'll add one.

> Anything to update in https://spinnaker.io/docs/guides/runbooks/orca-zombie-executions/ to help people who might struggle with this? Maybe something about the kind of failure-to-cancel that could occur before spinnaker versions X.Y.Z (i.e. before this PR is merged/backported)?

Sure. PR incoming.

karlskewes · 2023-06-20T01:58:48Z

> It's possible there are other issues that could prevent cancelling executions. I'm interested in @karlskewes's account of running into this with the SQL queue, as it's not clear to me how you could get zombies with SQL (assuming it's not running on Kubernetes...).

Thank you, but sorry, I don't have access to the environment to validate.
Orca (everything, including the queue) was backed by AWS Aurora MySQL 2.09.x at the time.
IIRC, it was more likely to happen when Orca pods were rolled mid-pipeline execution, but that was a couple of years ago.
We might have to see what others in the community say. Definitely not a blocker for this fix.

mattgogerly · 2023-06-20T13:00:24Z

@Mergifyio backport release-1.31.x release-1.30.x

mergify · 2023-06-20T13:00:32Z

backport release-1.31.x release-1.30.x

✅ Backports have been created

#4477 fix(queue): fix ability to cancel a zombied execution (backport #4473) has been created for branch release-1.31.x
#4478 fix(queue): fix ability to cancel a zombied execution (backport #4473) has been created for branch release-1.30.x

* fix(queue): fix ability to cancel a zombied execution * fix(queue): undo unintentional change * fix(queue): add more logging --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit 56c7206)

* fix(queue): fix ability to cancel a zombied execution * fix(queue): undo unintentional change * fix(queue): add more logging --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit 56c7206) Co-authored-by: Matt <6519811+mattgogerly@users.noreply.github.com>

* fix(queue): fix ability to cancel a zombied execution * fix(queue): undo unintentional change * fix(queue): add more logging --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit 56c7206) Co-authored-by: Matt <6519811+mattgogerly@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

mattgogerly added 2 commits June 16, 2023 13:12

fix(queue): fix ability to cancel a zombied execution

d10930f

fix(queue): undo unintentional change

6858afe

mattgogerly commented Jun 16, 2023


Merge branch 'master' into cancel-zombie-execution

de2f7e9

nemesisOsorio reviewed Jun 16, 2023


nemesisOsorio approved these changes Jun 16, 2023


dbyron-sf approved these changes Jun 19, 2023


fix(queue): add more logging

36121b1

mattgogerly added the ready to merge label Jun 20, 2023

mergify bot added the auto merged label Jun 20, 2023

mergify bot merged commit 56c7206 into spinnaker:master Jun 20, 2023
5 checks passed

mattgogerly deleted the cancel-zombie-execution branch June 20, 2023 10:23

spinnakerbot added the target-release/1.32 label Jun 20, 2023

This was referenced Jun 20, 2023

fix(queue): fix ability to cancel a zombied execution (backport #4473) #4477

Merged

fix(queue): fix ability to cancel a zombied execution (backport #4473) #4478

Merged

mattgogerly mentioned this pull request Jun 20, 2023

Cannot cancel pipeline spinnaker/spinnaker#6224

Open
