-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
db: config: make consistent_cluster_management mandatory #16334
Conversation
Docs Preview 📖Docs Preview for this pull request is available here Changed Files:
Note: This preview will be available for 30 days and will be automatically deleted after that period. You can manually trigger a new build by committing changes. |
s/6.0/next release/ it appears we plan to release 5.5, which may have this PR |
@gleb-cloudius adding you to reviewers just to add some heat to our discussion |
So, I basically need to replace all 6.0 occurrences with 5.5? #16254 was merged with 6.0 in the description... |
service/raft/raft_group0.cc
Outdated
@@ -1714,7 +1694,7 @@ void raft_group0::register_metrics() { | |||
namespace sm = seastar::metrics; | |||
_metrics.add_group("raft_group0", { | |||
sm::make_gauge("status", [this] { return static_cast<uint8_t>(_status_for_monitoring); }, | |||
sm::description("status of the raft group, 0 - disabled, 1 - normal, 2 - aborted")) | |||
sm::description("status of the raft group, 0 - normal, 1 - aborted")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this is a backwards incompatible user-visible change, we should probably keep 2 - aborted
and say that 1
is unused or sth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and say that
1
is unused or sth
Didn't you mean 0
instead of 1
? If yes, is reverting my changes and only renaming disabled
to unused
the solution here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to keep mentioning 0 in the description? I think it's implied that a value is not used if it is not mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right. Also, this unused
in the description may be understood as if it described the status of the group 0. Group 0 cannot be "unused", so it is misinformation.
docs/architecture/raft.rst
Outdated
@@ -44,7 +44,12 @@ Enabling Raft | |||
|
|||
See :doc:`the upgrade guide from 5.2 to 5.4 </upgrade/index>` for details. | |||
|
|||
ScyllaDB Open Source 5.2 and later, and ScyllaDB Enterprise 2023.1 and later come equipped with a procedure that can setup Raft-based consistent cluster management in an existing cluster. We refer to this as the **Raft upgrade procedure** (do not confuse with the :doc:`ScyllaDB version upgrade procedure </upgrade/index/>`). | |||
.. warning:: | |||
Raft is mandatory in ScyllaDB Open Source 6.0 and later, and ScyllaDB Enterprise 2024.2 and later. You cannot prevent Raft from getting enabled upon upgrade to 6.0 or 2024.2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, that's the question, how we should write this -- if this PR gets into 5.5
cc @annastuchlik -- please review the last commit (docs: update after making consistent_cluster_management mandatory
)
docs/architecture/raft.rst
Outdated
|
||
.. TODO: add reference to the upgrade guide from 5.4 to 6.0 | ||
|
||
ScyllaDB Open Source 5.2 and 5.4, and ScyllaDB Enterprise 2023.x and 2024.1 come equipped with a procedure that can setup Raft-based consistent cluster management in an existing cluster. We refer to this as the **Raft upgrade procedure** (do not confuse with the :doc:`ScyllaDB version upgrade procedure </upgrade/index/>`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what was wrong with "and later"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My (maybe incorrect) logic was that this section describes enabling Raft without a version upgrade, and that isn't possible in versions later than 5.4. If you have a properly working cluster in 5.5 or later, then you must have already enabled Raft (during rolling upgrade to 5.5 at the latest). So, I thought that this "and later" might be confusing.
But since we are going to remove the "Enabling Raft" section (the discussion below), it doesn't matter anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All versions starting from 5.2 have the "Raft upgrade procedure".
It's true that eventually, this procedure will not be needed (if we forget about recovery for a while) because the version you're upgrading from already went through the procedure.
But 5.5 still needs the procedure, because you can have a cluster in 5.4 with Raft disabled, then you do a rolling upgrade. As soon as all nodes have upgraded to version 5.5, that's when the Raft upgrade procedure starts. It starts after rolling upgrade, not in the middle of it. For versions which still have the consistent_cluster_management flag -- the procedure starts after all nodes enable the flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As soon as all nodes have upgraded to version 5.5, that's when the Raft upgrade procedure starts. It starts after rolling upgrade, not in the middle of it.
That's what I understood wrong. Thank you for the explanation.
docs/architecture/raft.rst
Outdated
.. warning:: | ||
Raft is mandatory in ScyllaDB Open Source 6.0 and later, and ScyllaDB Enterprise 2024.2 and later. You cannot prevent Raft from getting enabled upon upgrade to 6.0 or 2024.2. | ||
|
||
.. TODO: add reference to the upgrade guide from 5.4 to 6.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we add this new upgrade doc, we'll need to include a big note saying "Raft shall be enforced upon you with no way out. Prepare for quorum requirement" or sth (cc @kostja @gleb-cloudius )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... @annastuchlik @tzach what do you think of the following idea:
Let's have a "template" upgrade guide present at all times in master branch: "Upgrade guide from last-release
to NEW_RELEASE_PLACEHOLDER". It would be hidden from users, but visible to developers, so we can update it as we create new PRs. We can also refer to it in other docs; the paragraphs or sentences that refer to it would be hidden. Or maybe NEW_RELEASE_PLACEHOLDER would be automatically replaced with "upcoming release" or something that makes sense.
Once new release is branched, we would have an automatic process that copies this template guide and replaces NEW_RELEASE_PLACEHOLDER with the actual version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan is to have only one upgrade guide: from the previous version to the new version. When I add it, you won't have to worry about which version you should link to.
Also, it would solve the problem that cost us a lot of updates and backports in 5.4. It stems from the Raft page being both OSS and Enterprise. If you link to an OSS-only page from the Raft page, the Enterprise build will fail.
Even if you exclude it with the .. only:: opensource directive
, the build reports errors, although the documentation renders OK. See scylladb/sphinx-scylladb-theme#944 for details.
In summary, you should link to </upgrade/index>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, from the previous version to this version, where this version is the one being documented. So 6.0 documentation should explain how to document to 6.0.
docs/architecture/raft.rst
Outdated
@@ -65,6 +70,7 @@ When all the nodes in the cluster and updated and restarted, the cluster will st | |||
Alternatively, you can enable the ``consistent_cluster_management`` option when you are: | |||
|
|||
* Performing a rolling upgrade from version 5.1 to 5.2 or version 2022.x to 2023.1 by updating ``scylla.yaml`` before restarting each node. The Raft upgrade procedure will start as soon as the last node was upgraded and restarted. As above, this requires :ref:`verifying <verify-raft-procedure>` that the procedure successfully finishes. | |||
* Performing a rolling upgrade from version 5.4 to 6.0 or version 2024.1 to 2024.2. ``consistent_cluster_management`` is mandatory in 6.0 and 2024.2, so updating ``scylla.yaml`` is unnecessary. As above, the Raft upgrade procedure will start as soon as the last node was upgraded and restarted. This also requires :ref:`verifying <verify-raft-procedure>` that the procedure successfully finishes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should refrain from referring to non-existing (but predicted) versions in documentation for now?
@annastuchlik do you have some standard process for such issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bad practice, of course, but the page is both Open Source and Enterprise. We can update the info about Enterprise later, but we would have to a) remember about it and b) backport to all relevant versions.
Now that the documentation is versioned, we should avoid mentioning versions in the docs. The assumption is that if you selected version 6.0 (or use the default), you're reading the docs related to version 6.0. If you want to know some details about previous versions, go to the docs for that version.
Here's how I think we should change the Raft doc:
- In general, remove everything irrelevant in version 5.5/6.0.
- Remove the "Enabling Raft" section, including "Verifying that the Raft upgrade procedure finished successfully".
- You can leave "Verifying that Raft is enabled" section if you think it's useful (I think it's not, but it's up to you).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
including "Verifying that the Raft upgrade procedure finished successfully".
This is an important step that the cluster administrator must take when enabling Raft.
Even more important when we make Raft mandatory, which this PR does.
Otherwise -- if the procedure gets stuck -- they'll end up, without being aware of it, with blocked schema and topology operations, and only learn about it when dung hits the fan (e.g. they need to removenode a dead node, and it turns out they cannot because the upgrade procedure is stuck).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I get it. Let's say I'm an administrator, and I install/upgrade to version 5.5/6.0 where Raft is mandatory and enabled. I'm not aware of the consistent_cluster_management
option, as it's (correctly) removed from the docs.
WHEN should I verify that the Raft upgrade procedure was successful? Is it a necessary step after each installation/upgrade? If so, we should add it to the installation/upgrade guides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a necessary step after each installation/upgrade? If so, we should add it to the installation/upgrade guides.
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or leave a reference to the "Verifying" section on the Raft page.
Which there already is: https://opensource.docs.scylladb.com/master/upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/upgrade-guide-from-5.1-to-5.2-generic.html#validate-raft-setup
But I see it's missing in the ->5.4 upgrade guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a necessary step after each installation/upgrade?
It's necessary only once, the first time you enable Raft in the cluster.
Following upgrades don't need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I see it's missing in the ->5.4 upgrade guide.
I created #16347
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a necessary step after each installation/upgrade?
It's necessary only once, the first time you enable Raft in the cluster. Following upgrades don't need it.
OK, so we need the information in the upgrade guides:
- 5.2 to 5.4
- 5.4 to 5.5/6.0
In later versions, it won't be necessary because all clusters will have Raft enabled in a previous version(5.5/6.0 or earlier). Correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
No worries, this description is for developers, I'm only worried about making references in user-facing docs. |
Let's consult @annastuchlik and @tzach -- I pinged you on relevant code snippets. |
🟢 CI State: SUCCESS✅ - Build Build Details:
|
A similar section appears in the 5.1->5.2 upgrade procedure, we also need it in 5.2->5.4. These sections will have to be included in all upgrade guides until the version we're upgrading *from* is enabling Raft unconditionally. First such planned version is 5.5; therefore (if everything goes according to plan) the 5.5->later upgrade guide will no longer need the section, but 5.2->5.4 and 5.4->5.5 and corresponding enterprise versions still need it. This commit needs to be backported to 5.4 and corresponding enterprise release branch. Refs: scylladb#15854, scylladb#16334
A similar section appears in the 5.1->5.2 upgrade procedure, we also need it in 5.2->5.4. These sections will have to be included in all upgrade guides until the version we're upgrading *from* is enabling Raft unconditionally. First such planned version is 5.5; therefore (if everything goes according to plan) the 5.5->later upgrade guide will no longer need the section, but 5.2->5.4 and 5.4->5.5 and corresponding enterprise versions still need it. This commit needs to be backported to 5.4 and corresponding enterprise release branch. Refs: scylladb#15854, scylladb#16334
A similar section appears in the 5.1->5.2 upgrade procedure, we also need it in 5.2->5.4. These sections will have to be included in all upgrade guides until the version we're upgrading *from* is enabling Raft unconditionally. First such planned version is 5.5; therefore (if everything goes according to plan) the 5.5->later upgrade guide will no longer need the section, but 5.2->5.4 and 5.4->5.5 and corresponding enterprise versions still need it. This commit needs to be backported to 5.4 and corresponding enterprise release branch. Refs: scylladb#15854, scylladb#16334
A similar section appears in the 5.1->5.2 upgrade procedure, we also need it in 5.2->5.4. These sections will have to be included in all upgrade guides until the version we're upgrading *from* is enabling Raft unconditionally. First such planned version is 5.5; therefore (if everything goes according to plan) the 5.5->later upgrade guide will no longer need the section, but 5.2->5.4 and 5.4->5.5 and corresponding enterprise versions still need it. This commit needs to be backported to 5.4 and corresponding enterprise release branch. Refs: scylladb#15854, scylladb#16334
3ef75c8
to
ed7837b
Compare
v2:
Docs build started to fail after these changes. There are two references to the deleted "Enabling Raft" section:
I don't know how to deal with this for now. I'll come back to this later. |
Hmm. @annastuchlik do we need to keep this file on |
It's a regression. I somehow missed edit: I didn't miss anything, |
89ea509
to
34a7d72
Compare
v4: fix for the CI failures above (the new second commit) |
🟢 CI State: SUCCESS✅ - Build Build Details:
|
Please add upgrade information to the cover letter. Is there a change in what versions we can upgrade from? If so, what's the change? |
IIUC the limitations are no larger than in #16254, i.e. upgrade is possible from the version which first introduced schema commitlog feature (or later). So from 5.1 (or corresponding Enterprise version) or later. For OSS we only need to support upgrade from 5.4 and for Enterprise only from 2023.x (based on 5.2), so we're good with a spare. |
I added this note to the cover letter:
|
Do you mean all these 3: |
Yeah, let's do it |
89d4927
to
0b561ef
Compare
v5: changed |
In scylladb#16254, we made force_schema_commit_log unused. After this change, if someone passes this option as the command line argument, the boot fails. This behavior is undesired. We only want this option to be ignored. We can achieve this effect by making it deprecated.
The override_decommission option is supported only when consistent_cluster_management is disabled. In the following commit, we make consistent_cluster_management mandatory, which makes overwrite_decommission unusable.
In the following commits, we make consistent cluster management mandatory. This will make disable_raft_schema_config unusable, so we need to get rid of it. However, we don't want to remove tests that use it. The idea is to use the Raft RECOVERY mode instead of disabling consistent cluster management directly.
Code that executed only when consistent_cluster_management=false is removed. In particular, after this patch: - raft_group0 and raft_group_registry are always enabled, - raft_group0::status_for_monitoring::disabled becomes unused, - topology tests can only run with consistent_cluster_management.
Broken in the previous patch.
We remove Raft documentation irrelevant in 5.5. One of the changes is removing a part of the "Enabling Raft" section in raft.rst. Since Raft is mandatory in 5.5, the only way to enable it in this version is by performing a rolling upgrade from 5.4. We only need to have this case well-documented. In particular, we remove information that also appears in the upgrade guides like verifying schema synchronization. Similarly, we remove a sentence from the "Manual Recovery Procedure" section in handling-node-failures.rst because it mentions enabling Raft manually, which is impossible in 5.5. The rest of the changes are just removing information about checking or setting consistent_cluster_management, which has become unused.
0b561ef
to
f23f862
Compare
I think this diff got lost when I resent only a typo fix in the commit message: |
🔴 CI State: FAILURE✅ - Build Failed Tests (1/23442):Build Details:
|
Known flaky test -- #16219. I restarted CI. |
🟢 CI State: SUCCESS✅ - Build Build Details:
|
We make
consistent_cluster_management
mandatory in 5.5. Thisoption will be always unused and assumed to be true.
Additionally, we make
override_decommission
deprecated, as this optionhas been supported only with
consistent_cluster_management=false
.Making
consistent_cluster_management
mandatory also simplifiesthe code. Branches that execute only with
consistent_cluster_management
disabled are removed.We also update documentation by removing information irrelevant in 5.5.
Fixes #15854
Note about upgrades: this PR does not introduce any more limitations
to the upgrade procedure than there are already. As in
#16254, we can upgrade from the first version of Scylla
that supports the schema commitlog feature, i.e. from 5.1 (or
corresponding Enterprise release) or later. Assuming this PR ends up in
5.5, the documented upgrade support is from 5.4. For corresponding
Enterprise release, it's from 2023.x (based on 5.2), so all requirements
are met.