Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix manager task reconciliation and status synchronisation #1850

Merged

Conversation

rzetelskik
Copy link
Member

@rzetelskik rzetelskik commented Mar 19, 2024

Description of your changes: As described in #1752, there's currently an issue with manager tasks being deleted and recreated when the manager controller reconciles an older version of the object. The root cause of the issue is that the controller depends on the object's status to save and retrieve the task's ID, which it uses for identifying the task. If an older generation of the object is reconciled, and it's missing the task's ID from the status, it's going to delete the task and recreate it, even if the task is present in the manager's state. The main change in this PR is to make the controller use tasks' names for identity, instead of their ID, which can't be retrieved if it's missing from the status.

To make it possible, the PR modifies the validation logic to allow for non-unique names across tasks of different types. This is aligned with Scylla Manager, which doesn't impose such restrictions, i.e. it is enough that the name is unique across tasks of equal type. This allows us to use tasks' name for identity, and in turn, reconciling the tasks without retrieving their IDs.

The reconciliation fix also addresses other issues causing flakiness now, like stale tasks staying in status post deletion, having the controller hit an update conflict.

Futhermore, currently the tasks' statuses inline tasks' specs as defined in ScyllaCluster spec. Due to the spec having some default values, this currently causes the statuses to not be aligned with manager's state, and even display default values despite task scheduling errors and the tasks missing from manager state entirely.

This PR fixes it by:

  1. splitting tasks' statuses from specs in API definition to avoid the default values being displayed and making the fields optional
  2. propagating tasks' statuses as they're pertained in manager state, with an exception for appending client side errors

On top of that, the existing e2e tests are extended with verification of error propagation.

As agreed internally, this PR tries to achieve all of the above while minimising the amount of refactoring of the existing codebase. Therefore the general flow of the manager controller is maintained as is. This also means that the wiggle room in terms of refactoring was very limited, and so this PR should be treated as an intermediate fix targeting the controller's stability and correctness.

Any additional features or overall refactoring of the integration-related code should only be considered when we merge this and establish a baseline stability of the integration.

Which issue is resolved by this Pull Request:
Resolves #1752 #1694

/kind bug
/priority critical-urgent
/cc

Copy link
Contributor

@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.

Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #1752

/kind bug
/priority critical-urgent
/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@scylla-operator-bot scylla-operator-bot bot added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 19, 2024
@rzetelskik rzetelskik force-pushed the manager-controller-task-reclaim branch from dfec879 to d79db06 Compare March 19, 2024 23:11
@rzetelskik
Copy link
Member Author

rzetelskik commented Mar 19, 2024

/hold for #1851

@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 19, 2024
@rzetelskik rzetelskik force-pushed the manager-controller-task-reclaim branch 2 times, most recently from 221b411 to 6a2f1bb Compare March 26, 2024 09:34
@scylla-operator-bot scylla-operator-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 26, 2024
@rzetelskik rzetelskik force-pushed the manager-controller-task-reclaim branch 2 times, most recently from 31e78bd to 0df1c7c Compare March 26, 2024 09:42
@rzetelskik rzetelskik marked this pull request as draft April 15, 2024 09:19
@scylla-operator-bot scylla-operator-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 15, 2024
@rzetelskik rzetelskik force-pushed the manager-controller-task-reclaim branch from 0df1c7c to d151611 Compare April 16, 2024 19:58
@scylla-operator-bot scylla-operator-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024
Copy link
Contributor

@rzetelskik: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test build
  • /test docs
  • /test e2e-gke-parallel
  • /test e2e-gke-parallel-clusterip
  • /test e2e-gke-release-script-latest
  • /test e2e-gke-serial
  • /test helm-build
  • /test images
  • /test unit
  • /test verify
  • /test verify-deps
  • /test verify-vendorability

The following commands are available to trigger optional jobs:

  • /test docs-multiversion

Use /test all to run the following jobs that were automatically triggered:

  • pull-scylla-operator-master-build
  • pull-scylla-operator-master-docs
  • pull-scylla-operator-master-docs-multiversion
  • pull-scylla-operator-master-e2e-gke-parallel
  • pull-scylla-operator-master-e2e-gke-parallel-clusterip
  • pull-scylla-operator-master-e2e-gke-serial
  • pull-scylla-operator-master-helm-build
  • pull-scylla-operator-master-images
  • pull-scylla-operator-master-unit
  • pull-scylla-operator-master-verify
  • pull-scylla-operator-master-verify-deps
  • pull-scylla-operator-master-verify-vendorability

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rzetelskik
Copy link
Member Author

thanks for the review @tnozicka, I believe I've addressed all of your comments

@scylla-operator-bot scylla-operator-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2024
@rzetelskik rzetelskik force-pushed the manager-controller-task-reclaim branch from 31e3331 to 459737b Compare May 7, 2024 07:06
@scylla-operator-bot scylla-operator-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2024
@rzetelskik rzetelskik force-pushed the manager-controller-task-reclaim branch from 459737b to 2905d18 Compare May 7, 2024 07:10
@rzetelskik rzetelskik force-pushed the manager-controller-task-reclaim branch 2 times, most recently from 25f9252 to 7c2ca20 Compare May 9, 2024 17:45
Copy link
Contributor

scylla-operator-bot bot commented May 9, 2024

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-release-script-latest 0df1c7c link true /test e2e-gke-release-script-latest

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@rzetelskik rzetelskik changed the title Fix manager task reconciliation and status synchronisation [WIP] Fix manager task reconciliation and status synchronisation May 9, 2024
@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 9, 2024
@rzetelskik rzetelskik force-pushed the manager-controller-task-reclaim branch from 7c2ca20 to 98d22ff Compare May 9, 2024 21:25
@rzetelskik rzetelskik changed the title [WIP] Fix manager task reconciliation and status synchronisation Fix manager task reconciliation and status synchronisation May 10, 2024
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 10, 2024
Copy link
Member

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like a good start, thanks for the updates
/approve
/lgtm

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label May 15, 2024
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
3 participants