Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(queue): fix ability to cancel a zombied execution #4473

Merged
merged 4 commits into from
Jun 20, 2023

Conversation

mattgogerly
Copy link
Member

Tentative fix for spinnaker/spinnaker#6224.

)
queue.ensure(taskMessage, Duration.ZERO)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key change. If an execution is a zombie, queue.reschedule() will not do anything. Calling ensure() first puts a RunTask message in the queue if it does not already exist.

As the execution is already flagged as cancelled at this point this is a safe operation, since the RunTaskHandler checks this first before doing anything else.

@mattgogerly
Copy link
Member Author

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Jun 16, 2023

update

✅ Branch has been successfully updated

@@ -46,6 +46,18 @@ class ExecutionLatch(private val predicate: Predicate<ExecutionComplete>) :
fun await() = latch.await(10, TimeUnit.SECONDS)
}

fun ConfigurableApplicationContext.run(execution: PipelineExecution, launcher: (PipelineExecution) -> Unit) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumb question: what's the purpose of this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runUntilCompletion() starts the pipeline and then waits 10 seconds for it to finish, which doesn't work for the purposes of this test when we need to mutate the queue after starting the execution.

This method is the same as runUntilCompletion without the part that waits. The sleep is to allow the messages for starting the execution to be processed before we continue on to deleting the queue.

Copy link
Contributor

@dbyron-sf dbyron-sf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and thank you for fixing this! I do have some questions though. Apologies in advance for no good deed going unpunished. Now that you've figured this out, I'd love to try to get all the info in your brain written down.

@mattgogerly
Copy link
Member Author

spinnaker/spinnaker#6224 talks about not cancelling pipelines. This PR seems to fix cancelling of zombies...is it clear that 6224 is really about zombies?

I'd like to get people in spinnaker/spinnaker#6224 to confirm if they still run into this once this is merged.

It's possible there are other issues that could prevent cancelling executions. I'm interested in @karlskewes account of running into this with the SQL queue as it's not clear to me how you could get zombies with SQL (assuming it's not running on Kubernetes..)

I can't see anything obviously wrong with the actual cancel logic. It just (fairly not not always realistically) assumes there's a message to reschedule.

Does this fix cancelling of NOT_STARTED stages (e.g. spinnaker/spinnaker#6224 (comment))?

If zombies are the issue, kinda. That issue is the zombie lookup agent only considers stages that are running.

This change should allow users to just cancel themselves rather than relying on the agent.

Is it worth logging a warning in RedisQueue.reschedule, similar to SqlQueue.reschedule? Maybe an error is more appropriate. Updating the javadoc to mention what happens in this case seems like a good idea too.

Probably. I'll add one.

Anything to update in https://spinnaker.io/docs/guides/runbooks/orca-zombie-executions/ to help people who might struggle with this? Maybe something about the kind of failure-to-cancel that could occur before spinnaker versions X.Y.Z (i.e. before this PR is merged/backported)?

Sure. PR incoming.

@karlskewes
Copy link
Contributor

It's possible there are other issues that could prevent cancelling executions. I'm interested in @karlskewes account of running into this with the SQL queue as it's not clear to me how you could get zombies with SQL (assuming it's not running on Kubernetes..)

Thank you but sorry, I don't have access to the environment to validate.
Orca (all incl queue) was backed by AWS Aurora MySQL 2.09.x at the time.
IIRC, it was more likely to happen when Orca pods were rolled mid pipeline execution, but it was a couple of years ago.
Might have to see what others in community say. Definitely not a blocker for this fix.

@mattgogerly mattgogerly added the ready to merge Approved and ready for merge label Jun 20, 2023
@mergify mergify bot added the auto merged Merged automatically by a bot label Jun 20, 2023
@mergify mergify bot merged commit 56c7206 into spinnaker:master Jun 20, 2023
5 checks passed
@mattgogerly mattgogerly deleted the cancel-zombie-execution branch June 20, 2023 10:23
@mattgogerly
Copy link
Member Author

@Mergifyio backport release-1.31.x release-1.30.x

@mergify
Copy link
Contributor

mergify bot commented Jun 20, 2023

backport release-1.31.x release-1.30.x

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Jun 20, 2023
* fix(queue): fix ability to cancel a zombied execution

* fix(queue): undo unintentional change

* fix(queue): add more logging

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit 56c7206)
mergify bot pushed a commit that referenced this pull request Jun 20, 2023
* fix(queue): fix ability to cancel a zombied execution

* fix(queue): undo unintentional change

* fix(queue): add more logging

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit 56c7206)
mergify bot added a commit that referenced this pull request Jun 20, 2023
* fix(queue): fix ability to cancel a zombied execution

* fix(queue): undo unintentional change

* fix(queue): add more logging

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit 56c7206)

Co-authored-by: Matt <6519811+mattgogerly@users.noreply.github.com>
mergify bot added a commit that referenced this pull request Jun 29, 2023
* fix(queue): fix ability to cancel a zombied execution

* fix(queue): undo unintentional change

* fix(queue): add more logging

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit 56c7206)

Co-authored-by: Matt <6519811+mattgogerly@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto merged Merged automatically by a bot ready to merge Approved and ready for merge target-release/1.32
Projects
None yet
5 participants