Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: config: make consistent_cluster_management mandatory #16334

Merged
merged 6 commits into from
Dec 19, 2023

Conversation

patjed41
Copy link
Contributor

@patjed41 patjed41 commented Dec 7, 2023

We make consistent_cluster_management mandatory in 5.5. This
option will be always unused and assumed to be true.

Additionally, we make override_decommission deprecated, as this option
has been supported only with consistent_cluster_management=false.

Making consistent_cluster_management mandatory also simplifies
the code. Branches that execute only with
consistent_cluster_management disabled are removed.

We also update documentation by removing information irrelevant in 5.5.

Fixes #15854

Note about upgrades: this PR does not introduce any more limitations
to the upgrade procedure than there are already. As in
#16254, we can upgrade from the first version of Scylla
that supports the schema commitlog feature, i.e. from 5.1 (or
corresponding Enterprise release) or later. Assuming this PR ends up in
5.5, the documented upgrade support is from 5.4. For corresponding
Enterprise release, it's from 2023.x (based on 5.2), so all requirements
are met.

@kbr-scylla
Copy link
Contributor

in 6.0

s/6.0/next release/

it appears we plan to release 5.5, which may have this PR

@kbr-scylla kbr-scylla requested review from avikivity and removed request for tgrabiec and nyh December 7, 2023 16:08
@kbr-scylla
Copy link
Contributor

@kbr-scylla
Copy link
Contributor

@gleb-cloudius adding you to reviewers just to add some heat to our discussion

@patjed41
Copy link
Contributor Author

patjed41 commented Dec 7, 2023

in 6.0

s/6.0/next release/

it appears we plan to release 5.5, which may have this PR

So, I basically need to replace all 6.0 occurrences with 5.5?

#16254 was merged with 6.0 in the description...

@@ -1714,7 +1694,7 @@ void raft_group0::register_metrics() {
namespace sm = seastar::metrics;
_metrics.add_group("raft_group0", {
sm::make_gauge("status", [this] { return static_cast<uint8_t>(_status_for_monitoring); },
sm::description("status of the raft group, 0 - disabled, 1 - normal, 2 - aborted"))
sm::description("status of the raft group, 0 - normal, 1 - aborted"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this is a backwards incompatible user-visible change, we should probably keep 2 - aborted and say that 1 is unused or sth

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and say that 1 is unused or sth

Didn't you mean 0 instead of 1? If yes, is reverting my changes and only renaming disabled to unused the solution here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to keep mentioning 0 in the description? I think it's implied that a value is not used if it is not mentioned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right. Also, this unused in the description may be understood as if it described the status of the group 0. Group 0 cannot be "unused", so it is misinformation.

service/raft/raft_group0.hh Outdated Show resolved Hide resolved
test/boost/group0_test.cc Outdated Show resolved Hide resolved
@@ -44,7 +44,12 @@ Enabling Raft

See :doc:`the upgrade guide from 5.2 to 5.4 </upgrade/index>` for details.

ScyllaDB Open Source 5.2 and later, and ScyllaDB Enterprise 2023.1 and later come equipped with a procedure that can setup Raft-based consistent cluster management in an existing cluster. We refer to this as the **Raft upgrade procedure** (do not confuse with the :doc:`ScyllaDB version upgrade procedure </upgrade/index/>`).
.. warning::
Raft is mandatory in ScyllaDB Open Source 6.0 and later, and ScyllaDB Enterprise 2024.2 and later. You cannot prevent Raft from getting enabled upon upgrade to 6.0 or 2024.2.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's the question, how we should write this -- if this PR gets into 5.5
cc @annastuchlik -- please review the last commit (docs: update after making consistent_cluster_management mandatory)


.. TODO: add reference to the upgrade guide from 5.4 to 6.0

ScyllaDB Open Source 5.2 and 5.4, and ScyllaDB Enterprise 2023.x and 2024.1 come equipped with a procedure that can setup Raft-based consistent cluster management in an existing cluster. We refer to this as the **Raft upgrade procedure** (do not confuse with the :doc:`ScyllaDB version upgrade procedure </upgrade/index/>`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was wrong with "and later"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My (maybe incorrect) logic was that this section describes enabling Raft without a version upgrade, and that isn't possible in versions later than 5.4. If you have a properly working cluster in 5.5 or later, then you must have already enabled Raft (during rolling upgrade to 5.5 at the latest). So, I thought that this "and later" might be confusing.

But since we are going to remove the "Enabling Raft" section (the discussion below), it doesn't matter anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All versions starting from 5.2 have the "Raft upgrade procedure".

It's true that eventually, this procedure will not be needed (if we forget about recovery for a while) because the version you're upgrading from already went through the procedure.

But 5.5 still needs the procedure, because you can have a cluster in 5.4 with Raft disabled, then you do a rolling upgrade. As soon as all nodes have upgraded to version 5.5, that's when the Raft upgrade procedure starts. It starts after rolling upgrade, not in the middle of it. For versions which still have the consistent_cluster_management flag -- the procedure starts after all nodes enable the flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As soon as all nodes have upgraded to version 5.5, that's when the Raft upgrade procedure starts. It starts after rolling upgrade, not in the middle of it.

That's what I understood wrong. Thank you for the explanation.

.. warning::
Raft is mandatory in ScyllaDB Open Source 6.0 and later, and ScyllaDB Enterprise 2024.2 and later. You cannot prevent Raft from getting enabled upon upgrade to 6.0 or 2024.2.

.. TODO: add reference to the upgrade guide from 5.4 to 6.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we add this new upgrade doc, we'll need to include a big note saying "Raft shall be enforced upon you with no way out. Prepare for quorum requirement" or sth (cc @kostja @gleb-cloudius )

Copy link
Contributor

@kbr-scylla kbr-scylla Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... @annastuchlik @tzach what do you think of the following idea:

Let's have a "template" upgrade guide present at all times in master branch: "Upgrade guide from last-release to NEW_RELEASE_PLACEHOLDER". It would be hidden from users, but visible to developers, so we can update it as we create new PRs. We can also refer to it in other docs; the paragraphs or sentences that refer to it would be hidden. Or maybe NEW_RELEASE_PLACEHOLDER would be automatically replaced with "upcoming release" or something that makes sense.

Once new release is branched, we would have an automatic process that copies this template guide and replaces NEW_RELEASE_PLACEHOLDER with the actual version.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is to have only one upgrade guide: from the previous version to the new version. When I add it, you won't have to worry about which version you should link to.

Also, it would solve the problem that cost us a lot of updates and backports in 5.4. It stems from the Raft page being both OSS and Enterprise. If you link to an OSS-only page from the Raft page, the Enterprise build will fail.
Even if you exclude it with the .. only:: opensource directive, the build reports errors, although the documentation renders OK. See scylladb/sphinx-scylladb-theme#944 for details.

In summary, you should link to </upgrade/index>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, from the previous version to this version, where this version is the one being documented. So 6.0 documentation should explain how to document to 6.0.

@@ -65,6 +70,7 @@ When all the nodes in the cluster and updated and restarted, the cluster will st
Alternatively, you can enable the ``consistent_cluster_management`` option when you are:

* Performing a rolling upgrade from version 5.1 to 5.2 or version 2022.x to 2023.1 by updating ``scylla.yaml`` before restarting each node. The Raft upgrade procedure will start as soon as the last node was upgraded and restarted. As above, this requires :ref:`verifying <verify-raft-procedure>` that the procedure successfully finishes.
* Performing a rolling upgrade from version 5.4 to 6.0 or version 2024.1 to 2024.2. ``consistent_cluster_management`` is mandatory in 6.0 and 2024.2, so updating ``scylla.yaml`` is unnecessary. As above, the Raft upgrade procedure will start as soon as the last node was upgraded and restarted. This also requires :ref:`verifying <verify-raft-procedure>` that the procedure successfully finishes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should refrain from referring to non-existing (but predicted) versions in documentation for now?
@annastuchlik do you have some standard process for such issues?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bad practice, of course, but the page is both Open Source and Enterprise. We can update the info about Enterprise later, but we would have to a) remember about it and b) backport to all relevant versions.

Now that the documentation is versioned, we should avoid mentioning versions in the docs. The assumption is that if you selected version 6.0 (or use the default), you're reading the docs related to version 6.0. If you want to know some details about previous versions, go to the docs for that version.

Here's how I think we should change the Raft doc:

  • In general, remove everything irrelevant in version 5.5/6.0.
  • Remove the "Enabling Raft" section, including "Verifying that the Raft upgrade procedure finished successfully".
  • You can leave "Verifying that Raft is enabled" section if you think it's useful (I think it's not, but it's up to you).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including "Verifying that the Raft upgrade procedure finished successfully".

This is an important step that the cluster administrator must take when enabling Raft.
Even more important when we make Raft mandatory, which this PR does.

Otherwise -- if the procedure gets stuck -- they'll end up, without being aware of it, with blocked schema and topology operations, and only learn about it when dung hits the fan (e.g. they need to removenode a dead node, and it turns out they cannot because the upgrade procedure is stuck).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I get it. Let's say I'm an administrator, and I install/upgrade to version 5.5/6.0 where Raft is mandatory and enabled. I'm not aware of the consistent_cluster_management option, as it's (correctly) removed from the docs.

WHEN should I verify that the Raft upgrade procedure was successful? Is it a necessary step after each installation/upgrade? If so, we should add it to the installation/upgrade guides.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a necessary step after each installation/upgrade? If so, we should add it to the installation/upgrade guides.

Yes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or leave a reference to the "Verifying" section on the Raft page.
Which there already is: https://opensource.docs.scylladb.com/master/upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/upgrade-guide-from-5.1-to-5.2-generic.html#validate-raft-setup
But I see it's missing in the ->5.4 upgrade guide.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a necessary step after each installation/upgrade?

It's necessary only once, the first time you enable Raft in the cluster.
Following upgrades don't need it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I see it's missing in the ->5.4 upgrade guide.

I created #16347

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a necessary step after each installation/upgrade?

It's necessary only once, the first time you enable Raft in the cluster. Following upgrades don't need it.

OK, so we need the information in the upgrade guides:

  • 5.2 to 5.4
  • 5.4 to 5.5/6.0

In later versions, it won't be necessary because all clusters will have Raft enabled in a previous version(5.5/6.0 or earlier). Correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@kbr-scylla
Copy link
Contributor

#16254 was merged with 6.0 in the description...

No worries, this description is for developers, I'm only worried about making references in user-facing docs.

@kbr-scylla
Copy link
Contributor

So, I basically need to replace all 6.0 occurrences with 5.5?

Let's consult @annastuchlik and @tzach -- I pinged you on relevant code snippets.

test/alternator/test_table.py Outdated Show resolved Hide resolved
test/cql-pytest/conftest.py Outdated Show resolved Hide resolved
@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 6 hr 54 min
  • Builder: monster

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this pull request Dec 8, 2023
A similar section appears in the 5.1->5.2 upgrade procedure, we also
need it in 5.2->5.4.

These sections will have to be included in all upgrade guides until the
version we're upgrading *from* is enabling Raft unconditionally. First
such planned version is 5.5; therefore (if everything goes according to
plan) the 5.5->later upgrade guide will no longer need the section, but
5.2->5.4 and 5.4->5.5 and corresponding enterprise versions still need
it.

This commit needs to be backported to 5.4 and corresponding enterprise
release branch.

Refs: scylladb#15854, scylladb#16334
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this pull request Dec 8, 2023
A similar section appears in the 5.1->5.2 upgrade procedure, we also
need it in 5.2->5.4.

These sections will have to be included in all upgrade guides until the
version we're upgrading *from* is enabling Raft unconditionally. First
such planned version is 5.5; therefore (if everything goes according to
plan) the 5.5->later upgrade guide will no longer need the section, but
5.2->5.4 and 5.4->5.5 and corresponding enterprise versions still need
it.

This commit needs to be backported to 5.4 and corresponding enterprise
release branch.

Refs: scylladb#15854, scylladb#16334
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this pull request Dec 8, 2023
A similar section appears in the 5.1->5.2 upgrade procedure, we also
need it in 5.2->5.4.

These sections will have to be included in all upgrade guides until the
version we're upgrading *from* is enabling Raft unconditionally. First
such planned version is 5.5; therefore (if everything goes according to
plan) the 5.5->later upgrade guide will no longer need the section, but
5.2->5.4 and 5.4->5.5 and corresponding enterprise versions still need
it.

This commit needs to be backported to 5.4 and corresponding enterprise
release branch.

Refs: scylladb#15854, scylladb#16334
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this pull request Dec 8, 2023
A similar section appears in the 5.1->5.2 upgrade procedure, we also
need it in 5.2->5.4.

These sections will have to be included in all upgrade guides until the
version we're upgrading *from* is enabling Raft unconditionally. First
such planned version is 5.5; therefore (if everything goes according to
plan) the 5.5->later upgrade guide will no longer need the section, but
5.2->5.4 and 5.4->5.5 and corresponding enterprise versions still need
it.

This commit needs to be backported to 5.4 and corresponding enterprise
release branch.

Refs: scylladb#15854, scylladb#16334
@patjed41
Copy link
Contributor Author

patjed41 commented Dec 8, 2023

v2:

  • changed 6.0 to 5.5 in the PR header,
  • reverted changes in status_for_monitoring and renamed disabled to unused,
  • reverted removal of the fails_without_consistent_cluster_management fixture in the alternator and cql-pytest suites,
  • changes in raft.rst:
    • removed most of the "Enabling Raft" section,
    • left the "Verifying that the Raft upgrade procedure finished successfully" subsection and changed it to a new section,
    • left the "Verifying that Raft is enabled" section and merged it into "Verifying that the Raft upgrade procedure finished successfully".

Docs build started to fail after these changes. There are two references to the deleted "Enabling Raft" section:

  • in upgrade-guide-from-5.1-to-5.2-generic.rst,
  • in handling-node-failures.rst.

I don't know how to deal with this for now. I'll come back to this later.

@kbr-scylla
Copy link
Contributor

in upgrade-guide-from-5.1-to-5.2-generic.rst,

Hmm. @annastuchlik do we need to keep this file on master branch? (Or in general -- keep guides for old versions forever?) I'd guess that we do want to keep it in case we'd like to update it which is done first through master and then backported. But alternatively we could always do the fix directly on the 5.2 branch if necessary. cc @avikivity

@patjed41
Copy link
Contributor Author

patjed41 commented Dec 12, 2023

🔴 CI State: FAILURE

✅ - Build ✅ - dtest ❌ - Unit Tests

Failed Tests (2/22274):

Build Details:

  • Duration: 2 hr 15 min
  • Builder: spider6.cloudius-systems.com

It's a regression. I somehow missed disable_raft_schema_config usage in these two tests. Fortunately, these tests started to fail after the rebase. I'll fix these tests and update this PR.

edit: I didn't miss anything, disable_raft_schema_config was added in #16242. Anyway, the fix should be simple.

@patjed41
Copy link
Contributor Author

v4: fix for the CI failures above (the new second commit)

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests
✅ - dtest

Build Details:

  • Duration: 2 hr 34 min
  • Builder: spider1.cloudius-systems.com

@avikivity
Copy link
Member

Please add upgrade information to the cover letter. Is there a change in what versions we can upgrade from? If so, what's the change?

@kbr-scylla
Copy link
Contributor

IIUC the limitations are no larger than in #16254, i.e. upgrade is possible from the version which first introduced schema commitlog feature (or later). So from 5.1 (or corresponding Enterprise version) or later.

For OSS we only need to support upgrade from 5.4 and for Enterprise only from 2023.x (based on 5.2), so we're good with a spare.

@kbr-scylla
Copy link
Contributor

I added this note to the cover letter:

Note about upgrades: this PR does not introduce any more limitations to the upgrade procedure than there are already. As in #16254, we can upgrade from the first version of Scylla that supports the schema commitlog feature, i.e. from 5.1 (or corresponding Enterprise release) or later. Assuming this PR ends up in 5.5, the documented upgrade support is from 5.4. For corresponding Enterprise release, it's from 2023.x (based on 5.2), so all requirements are met.

@kbr-scylla
Copy link
Contributor

@patjed41 according to our discussion with @ptrsmrn in #16338, we should use the new Deprecated instead of Unused for these config options.

@patjed41
Copy link
Contributor Author

patjed41 commented Dec 14, 2023

@patjed41 according to our discussion with @ptrsmrn in #16338, we should use the new Deprecated instead of Unused for these config options.

Do you mean all these 3: force_schema_commit_log, override_decommission, consistent_cluster_management? force_schema_commit_log has already been merged but I think I can change it in this PR.

@kbr-scylla
Copy link
Contributor

Yeah, let's do it

@patjed41 patjed41 force-pushed the raft-mandatory branch 2 times, most recently from 89d4927 to 0b561ef Compare December 14, 2023 15:50
@patjed41
Copy link
Contributor Author

v5: changed force_schema_commit_log, override_decommission, consistent_cluster_management from unused to deprecated (the first change is in the new first commit as it is unrelated)

In scylladb#16254, we made force_schema_commit_log unused.
After this change, if someone passes this option as the command line
argument, the boot fails. This behavior is undesired. We only want
this option to be ignored. We can achieve this effect by making it
deprecated.
The override_decommission option is supported only when
consistent_cluster_management is disabled. In the following commit,
we make consistent_cluster_management mandatory, which makes
overwrite_decommission unusable.
In the following commits, we make consistent cluster management
mandatory. This will make disable_raft_schema_config unusable,
so we need to get rid of it. However, we don't want to remove
tests that use it.

The idea is to use the Raft RECOVERY mode instead of disabling
consistent cluster management directly.
Code that executed only when consistent_cluster_management=false is
removed. In particular, after this patch:
- raft_group0 and raft_group_registry are always enabled,
- raft_group0::status_for_monitoring::disabled becomes unused,
- topology tests can only run with consistent_cluster_management.
We remove Raft documentation irrelevant in 5.5.

One of the changes is removing a part of the "Enabling Raft" section
in raft.rst. Since Raft is mandatory in 5.5, the only way to enable
it in this version is by performing a rolling upgrade from 5.4. We
only need to have this case well-documented. In particular, we
remove information that also appears in the upgrade guides like
verifying schema synchronization.

Similarly, we remove a sentence from the "Manual Recovery Procedure"
section in handling-node-failures.rst because it mentions enabling
Raft manually, which is impossible in 5.5.

The rest of the changes are just removing information about
checking or setting consistent_cluster_management, which has become
unused.
@patjed41
Copy link
Contributor Author

patjed41 commented Dec 14, 2023

I think this diff got lost when I resent only a typo fix in the commit message:
https://github.com/scylladb/scylladb/compare/34a7d7254bdaed11186765bb505b72739b1a5182..89d4927482cf51b1a3258d0f100559009616918d

@scylladb-promoter
Copy link
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests
❌ - dtest

Failed Tests (1/23442):

Build Details:

  • Duration: 2 hr 27 min
  • Builder: spider1.cloudius-systems.com

@patjed41
Copy link
Contributor Author

🔴 CI State: FAILURE

✅ - Build ✅ - Unit Tests ❌ - dtest

Failed Tests (1/23442):

* [test_disablebinary_and_disablegossip](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/5256/testReport/junit/nodetool_additional_test/TestNodetool/Tests___dtest___test_disablebinary_and_disablegossip) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_disablebinary_and_disablegossip)

Build Details:

* Duration: 2 hr 27 min

* Builder: spider1.cloudius-systems.com

Known flaky test -- #16219. I restarted CI.

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 10 min
  • Builder: spider3.cloudius-systems.com

@scylladb-promoter scylladb-promoter merged commit 3b108f2 into scylladb:master Dec 19, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make Raft (to be precise: schema changes based on group 0) mandatory
8 participants