Scylla manager controller deletes and creates tasks even without hitting conflicts #1752
Comments
Most likely related or a duplicate of #1729.
@scylladb/sig-operator we should discuss how to fix the logic behind task reconciliation. I'm wondering whether there's any reason we don't adopt manager tasks based on matching names instead of IDs?
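For illustration, a minimal sketch of what name-based adoption could look like; the types and the adoptTasksByName helper are hypothetical, not the operator's actual API:

```go
// Hypothetical sketch of name-based task adoption; types and helpers are
// illustrative and not taken from the operator's code.
package manager

// managerTask stands in for a task as reported by Scylla Manager.
type managerTask struct {
	ID   string
	Name string
}

// adoptTasksByName matches the desired task names against the tasks that
// already exist in Scylla Manager, instead of relying on IDs recorded in the
// ScyllaCluster status. Tasks found by name are adopted (their existing ID is
// reused); only names with no match need to be created.
func adoptTasksByName(desired []string, existing []managerTask) (adopted map[string]string, missing []string) {
	byName := make(map[string]string, len(existing))
	for _, t := range existing {
		byName[t.Name] = t.ID
	}

	adopted = make(map[string]string, len(desired))
	for _, name := range desired {
		if id, ok := byName[name]; ok {
			adopted[name] = id
			continue
		}
		missing = append(missing, name)
	}
	return adopted, missing
}
```

With an approach like this, a task that already exists in the manager under the expected name would not be deleted just because its ID is missing from the status.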
An additional risk is losing the task history, including e.g. its retention.
Since this is a prerequisite for #1671, I'm raising the priority and moving it to the current sprint. /priority critical-urgent
This bug was reproduced in a QA weekly test run here: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/functional/job/functional-eks-test/5/
Thanks @vponomaryov for reporting it. To the best of our knowledge this is a conceptual issue that has always been there and therefore is not a regression. As mentioned above, the impact is small and the cluster eventually recovers on its own. Please add a sleep or a similar workaround to your suite so it doesn't fail. We are tracking it here in order to fix it.
What happened?
There's an issue with manager task synchronisation in the scylla manager controller. On each reconciliation iteration, the controller goes through the tasks gathered from the current manager state and deletes the ones that are missing from the ScyllaCluster status.
scylla-operator/pkg/controller/manager/sync_action.go, line 273 at commit c8901da
In practice this means that tasks can be deleted right after they've been scheduled and saved in the object status. The status update doesn't even have to fail; it's enough for the next key in the queue to carry the previous generation of the object, without the updated status. This can result in many iterations of task creation and deletion.
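A simplified sketch of that deletion pattern (hypothetical types, not the actual sync_action.go code): any task reported by the manager whose ID is absent from the status gets selected for deletion, so processing a stale object without the updated status deletes the tasks that were just created.

```go
// Simplified illustration of the problematic pattern; the types and the helper
// below are hypothetical and not taken from sync_action.go.
package manager

type task struct {
	ID   string
	Name string
}

// tasksToDelete returns the manager tasks whose IDs are not recorded in the
// ScyllaCluster status. When the controller processes a stale object (previous
// generation, no updated status), this selects the tasks created by the
// previous reconciliation, which are then deleted and recreated.
func tasksToDelete(managerTasks []task, statusTaskIDs map[string]struct{}) []task {
	var toDelete []task
	for _, t := range managerTasks {
		if _, found := statusTaskIDs[t.ID]; !found {
			toDelete = append(toDelete, t)
		}
	}
	return toDelete
}
```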
Example logs:
What did you expect to happen?
The manager tasks should not be deleted after they've been created, at least not when the status update has succeeded.
How can we reproduce it (as minimally and precisely as possible)?
Scylla Operator version
master
Kubernetes platform name and version
n/a
Please attach the must-gather archive.
scylla-operator-must-gather-rc5jh7tf6fdn.tar.gz
Anything else we need to know?
While this isn't particularly dangerous, it's quite an annoyance for scripting. As a user, I'd expect to be able to wait for the status to be updated with the manager task ID and then consider the task scheduled. Unfortunately, due to this behaviour, the status of a particular task can be overwritten many times, and previously reported tasks can be deleted.
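For illustration, this is the kind of wait a script might attempt; getTaskIDFromStatus is a placeholder for however the status is read (e.g. kubectl or a typed client), not a real operator API. With the current behaviour this wait is unreliable, because a reported ID can later disappear and be replaced:

```go
// Hypothetical polling sketch; getTaskIDFromStatus is a placeholder and not
// part of the operator's API.
package example

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// getTaskIDFromStatus is a stand-in for reading the task ID from the
// ScyllaCluster status, e.g. via kubectl or a typed client.
func getTaskIDFromStatus(ctx context.Context, taskName string) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

// waitForTaskScheduled polls until the status reports an ID for the named
// task, or the context expires.
func waitForTaskScheduled(ctx context.Context, taskName string) (string, error) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		if id, err := getTaskIDFromStatus(ctx, taskName); err == nil && id != "" {
			return id, nil
		}
		select {
		case <-ctx.Done():
			return "", fmt.Errorf("gave up waiting for task %q: %w", taskName, ctx.Err())
		case <-ticker.C:
		}
	}
}
```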