Fix manager task reconciliation and status synchronisation #1850
/hold for #1851
thanks for the review @tnozicka, I believe I've addressed all of your comments
sounds like a good start, thanks for the updates
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: rzetelskik, tnozicka, zimnx
Description of your changes:
As described in #1752, manager tasks are currently deleted and recreated when the manager controller reconciles an older version of the object. The root cause is that the controller depends on the object's status to save and retrieve the task's ID, which it uses to identify the task. If an older generation of the object is reconciled and the task's ID is missing from its status, the controller deletes the task and recreates it, even if the task is present in the manager's state. The main change in this PR is to make the controller use tasks' names for identity instead of their IDs, which can't be retrieved if they're missing from the status.
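To illustrate the idea, here is a minimal sketch of name-based reconciliation. It is not the code from this PR: `Task`, `TaskKey`, `Plan` and `planSync` are hypothetical names, and `Spec` is a plain string standing in for the real task properties. The point is only that a task found in the manager's state under the same type and name is updated in place, whether or not its ID was ever recorded in the status.

```go
package main

// TaskKey identifies a task by its type and name instead of a manager-assigned ID.
// All types in this sketch are hypothetical and not taken from the operator codebase.
type TaskKey struct {
	Type string
	Name string
}

// Task is a simplified view of a manager task.
type Task struct {
	Key  TaskKey
	ID   string // assigned by the manager; empty for tasks that do not exist yet
	Spec string // stands in for the real task properties
}

// Plan lists the operations needed to bring the manager's state in line with the spec.
type Plan struct {
	Create []Task
	Update []Task
	Delete []Task
}

// planSync compares desired tasks (from the spec) with actual tasks (from the
// manager's state), matching them by TaskKey only.
func planSync(desired, actual []Task) Plan {
	actualByKey := make(map[TaskKey]Task, len(actual))
	for _, t := range actual {
		actualByKey[t.Key] = t
	}

	var p Plan
	wanted := make(map[TaskKey]bool, len(desired))
	for _, d := range desired {
		wanted[d.Key] = true
		a, found := actualByKey[d.Key]
		switch {
		case !found:
			p.Create = append(p.Create, d)
		case a.Spec != d.Spec:
			// Reuse the ID read from the manager's state, not from the status.
			d.ID = a.ID
			p.Update = append(p.Update, d)
		}
	}

	// Anything in the manager's state that the spec no longer names is stale.
	for _, a := range actual {
		if !wanted[a.Key] {
			p.Delete = append(p.Delete, a)
		}
	}
	return p
}
```

With this shape, reconciling an older generation of the object can at worst produce an update, because the lookup never consults an ID stored in the status.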
To make this possible, the PR modifies the validation logic to allow non-unique names across tasks of different types. This is aligned with Scylla Manager, which doesn't impose such a restriction, i.e. it is enough that the name is unique across tasks of the same type. This allows us to use tasks' names for identity and, in turn, to reconcile the tasks without retrieving their IDs.
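The relaxed rule can be sketched as follows. This is an assumption-laden illustration rather than the validation code in this PR: `NamedTask`, `validateUniqueNames` and `validateTasks` are hypothetical, and the two task types are assumed to be repair and backup.

```go
package main

import "fmt"

// NamedTask is a hypothetical stand-in for a repair or backup task spec.
type NamedTask struct {
	Name string
}

// validateUniqueNames returns one error for every name that occurs more than
// once within a single task type.
func validateUniqueNames(taskType string, tasks []NamedTask) []error {
	var errs []error
	seen := make(map[string]bool, len(tasks))
	reported := make(map[string]bool)
	for _, t := range tasks {
		if seen[t.Name] && !reported[t.Name] {
			errs = append(errs, fmt.Errorf("%s task name %q is duplicated", taskType, t.Name))
			reported[t.Name] = true
		}
		seen[t.Name] = true
	}
	return errs
}

// validateTasks checks each task type independently, so the same name may be
// used by one repair task and one backup task.
func validateTasks(repairs, backups []NamedTask) []error {
	errs := validateUniqueNames("repair", repairs)
	return append(errs, validateUniqueNames("backup", backups)...)
}
```

Under this rule a repair task and a backup task can both be called `weekly`, while two repair tasks called `weekly` are rejected.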
The reconciliation fix also addresses other issues that currently cause flakiness, such as stale tasks staying in the status after deletion when the controller hits an update conflict.
Furthermore, the tasks' statuses currently inline the tasks' specs as defined in the ScyllaCluster spec. Because the spec has some default values, the statuses can be out of sync with the manager's state, and can even display default values when task scheduling failed and the tasks are missing from the manager's state entirely.
This PR fixes it by:
On top of that, the existing e2e tests are extended with verification of error propagation.
As agreed internally, this PR tries to achieve all of the above while minimising the amount of refactoring of the existing codebase. Therefore the general flow of the manager controller is maintained as is. This also means that the wiggle room in terms of refactoring was very limited, and so this PR should be treated as an intermediate fix targeting the controller's stability and correctness.
Any additional features or overall refactoring of the integration-related code should only be considered when we merge this and establish a baseline stability of the integration.
Which issue is resolved by this Pull Request:
Resolves #1752 #1694
/kind bug
/priority critical-urgent
/cc