
cluster: fast-forward feature table before joining cluster #8282

Merged: 9 commits into redpanda-data:dev from issue-8203-pt2 on Feb 24, 2023

Conversation

@jcsp (Contributor) commented Jan 17, 2023

This follows #8225

  • Introduce the concept of an "earliest version" in the feature table. This is the earliest version that the node will interact with, and it is the version that it will fast-forward its feature table to on an initial startup. The former property makes the latter safe: if our earliest version is 8 and we fast-forward to 8, then there is no risk of that being "too far", because we should never be allowed to join a cluster whose active version is any older than that.
  • Expose earliest+latest versions in the admin API: this makes it easy for an external program, such as a test, to reason about whether the node is fully updated.
  • Extend validation of node_join_request RPCs: the incoming RPC specifies its earliest-latest version range, and the RPC is rejected if the active version of the cluster is not within that range, inclusive (see the sketch after this list).
  • Fix locations where RPC clients were always being constructed with transport version 1; this included the clients used to generate node join requests. Transports for join requests are now always constructed at rpc version 2, which is safe because a 23.1 node should never be trying to join a <= 21.11 cluster.
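A minimal sketch of the range check from the third bullet; the helper name and types here are illustrative, not the actual redpanda code:

#include <cstdint>

using cluster_version = int64_t; // stand-in for cluster::cluster_version

// Accept a join request only if the cluster's active version lies inside
// the joiner's [earliest, latest] supported range, inclusive.
bool version_compatible(
  cluster_version active_version,
  cluster_version joiner_earliest,
  cluster_version joiner_latest) {
    return active_version >= joiner_earliest
           && active_version <= joiner_latest;
}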

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

None

Release Notes

Improvements

  • Additional safety checks on the version compatibility of nodes joining a cluster. Please ensure that when adding nodes to a cluster, the new nodes are from the same feature series, or at most one feature series newer (e.g. adding a 23.1.1 node to a cluster of 22.3.x nodes). Nodes from an older feature version will not be permitted to join a cluster that has previously been upgraded fully to a newer feature version.

Comment on lines 98 to 102
if (req.logical_version > expect_version) {
    vlog(
      clusterlog.warn,
      "Rejecting join request from node {}, logical version {} > {}",
      req.node,
      req.logical_version,
      expect_version);
Member

the interesting cases here would be

  1. a rolling upgrade: a newer upgraded node wouldn't be "joining" in this case (i was just thinking that the same api were used for new nodes joining and existing nodes just showing up to participate).
  2. we wouldn't be able to perform a rolling upgrade into a larger cluster (ie upgrade+node-add simultaneously)? not that we'd want to anyway...

? that's all i can think of

Contributor Author

> a rolling upgrade: a newer upgraded node wouldn't be "joining" in this case (i was just thinking that the same api were used for new nodes joining and existing nodes just showing up to participate).

Hmm, good point, need to think about the cases where folks "upgrade" a node by killing it and starting a brand new one. I guess this check is too strict for that. That does rather throw a spanner in the works. I'll think about this: maybe we need to do something a bit more nuanced, like giving nodes a "base feature version" that they'll fast-forward to before joining, and then having them advertise that in a join message so that we would still be safe, but would permit newer versions to join as long as their base version wasn't ahead of our current version.

Contributor

> existing nodes just showing up to participate

Right, the API is used to register members of an existing cluster, but we'd only fast forward the feature table if the joining node were completely empty. For the typical upgrade path, wouldn't we still have data and join whatever version existed before the upgrade?

Member

> For the typical upgrade path, wouldn't we still have data and join whatever version existed before the upgrade?

i'm not sure this is always the case. for local-ssd k8s in cloud, i'm pretty sure we nuke all the local data and the upgraded node joins as a new, empty node.

Comment on lines +1654 to +1661
// We do _not_ write a snapshot here: the persistent record of
// feature table state is only set for the first time in
// bootstrap_backend (or feature_backend). This is important,
// so that someone who starts a too-new Redpanda that can't join
// their cluster can easily stop it and run an older version,
// before we've committed any version info to disk.
Member

this makes sense. does node_join (the other crucial part of this gating) happen early enough that we can assume that other parts of redpanda, that may be feature-aware and enabled by an incorrectly-assumed-too-new-version, will also obey this rule and not send rpcs out or write data to disk that it shouldn't given the version mismatch?

@andrwng (Contributor) commented Feb 24, 2023

This should be safe -- this happens when we first start up, there's no data locally, and we haven't joined a cluster, so until we are able to join a cluster, we at least shouldn't be able to host new partitions. I don't think we have other persistent state that needs to be gated here -- the only things that come to mind would mandate having already joined the cluster and created some topics.

Comment on lines +1627 to +1630
// because we might have joined via an earlier version of redpanda, and
// just upgraded to a version that stores cluster and node UUIDs. We must
// also check for a controller log state on disk.
Contributor

Do we need to keep maintaining logic for upgrading directly from v22.2? Upon upgrading to v22.3, we will asynchronously write a cluster UUID if one doesn't exist (https://github.com/redpanda-data/redpanda/blob/dev/src/v/cluster/controller.cc#L645-L659). It's not foolproof, since even if we waited for the version to bump, there's no guarantee that we write the cluster UUID before attempting the next restart, but it's still worth pondering. Curious if you saw some tests failing without this check.


tests/rptest/tests/cluster_features_test.py: outdated review comment, resolved.
@jcsp force-pushed the issue-8203-pt2 branch 2 times, most recently from bb80398 to 1f0deae on February 22, 2023 14:28
@jcsp (Contributor Author) commented Feb 22, 2023

@dotnwat @andrwng thank you for reviews -- I've revised this quite a bit.

The basic fast forwarding is still there, but now we fast forward to a new "earliest logical version" rather than all the way to the latest version. That means that a brand new 23.1 node will fast-forward to a 22.3.x compatible feature table, and thereby be able to join a 22.3.x cluster (important for the kubernetes upgrades that nuke and replace nodes).

Now that we're only fast forwarding to an earliest supported version, the logic for doing the fast forward is less risky: if we get the "this is a bare node" check wrong, we are fast forwarding to a safe version rather than potentially skipping ahead of the active version.

When testing this, I also realized that we weren't actually respecting rpc_v2_by_default everywhere. This included node joins, which created connections directly rather than via the connection cache. That is fixed, so we really are doing serde everywhere by default now.
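For illustration, a hedged sketch of that startup fast-forward; the function, the node_is_bare flag, and the durability argument are hypothetical stand-ins for the PR's actual code:

// On first startup of a bare node (no feature table snapshot, no
// controller log on disk), bump the in-memory feature table only as far
// as the earliest supported version.  Nothing is persisted here, so a
// too-new binary that cannot join can still be stopped and replaced
// with an older one.
void maybe_fast_forward(features::feature_table& ft, bool node_is_bare) {
    if (node_is_bare) {
        ft.set_active_version(
          ft.get_earliest_logical_version(),
          features::feature_table::version_durability::ephemeral);
    }
}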

@andrwng (Contributor) left a comment

The approach makes sense to me -- I like the idea of encoding / explicitly supporting just an upgrade from the previous feature version, as we do elsewhere.

Comment on lines 1050 to 1052
// The lowest version that the joining node supports; it will already
// have its feature table initialized to this version.
cluster_version earliest_logical_version{cluster::invalid_version};
Contributor

IIRC the typical guidance for adding fields to envelopes is that if we don't want to bump the compat_version, this would need to be added to the end of the struct.

Contributor Author

In this instance the declaration was mid-struct, but the item was at the end of the serde_fields list, so encoding should work properly. However, you're right that this was a bit wonky-looking; I've moved the declaration to match the order of the serialized fields.
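For illustration, a simplified stand-in (not the real struct) showing why this works: with serde envelopes, the wire order is whatever serde_fields() returns, so a field declared mid-struct but listed last still lands at the end of the encoding:

struct example_join_request
  : serde::envelope<
      example_join_request,
      serde::version<1>,
      serde::compat_version<0>> {
    // Declared first, but serialized last (see serde_fields below), so
    // older decoders that stop reading early never see it.
    cluster_version earliest_logical_version{cluster::invalid_version};
    cluster_version latest_logical_version{cluster::invalid_version};

    auto serde_fields() {
        // The new field is appended at the end of the wire format.
        return std::tie(latest_logical_version, earliest_logical_version);
    }
};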

@@ -232,10 +232,10 @@ void feature_state::notify_version(cluster_version v) {
     }
 }
 
-void feature_table::set_active_version(cluster_version v) {
+void feature_table::set_active_version(cluster_version v, bool ephemeral) {
Contributor

nit: maybe consider some enum like version_setting_type{ cluster_driven, local_only } or somesuch?

Contributor Author

Good idea, I've switched this bool to an enum.
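A sketch of that change using the reviewer's suggested naming (the enum name, the _active_version member, and the persistence hook below are illustrative; the PR's actual identifiers may differ):

enum class version_setting_type {
    cluster_driven, // set via the controller log or a snapshot; persistent
    local_only      // ephemeral startup fast-forward; never written to disk
};

void feature_table::set_active_version(
  cluster_version v, version_setting_type type) {
    _active_version = v; // hypothetical member holding the active version
    if (type == version_setting_type::cluster_driven) {
        // Only cluster-driven updates reach persistent state
        // (feature_backend / snapshots); local_only updates do not.
        persist_version_update(v); // hypothetical persistence hook
    }
}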

Commit messages

  • This is to enable nodes to safely fast-forward their feature tables to this version when they first start. They cannot fast-forward all the way to their latest version, because they might be joining a cluster of older-versioned nodes.

  • This enables tests and tools to reason about the cluster_version and original_version in the context of the supported version range of the running code.

  • Before the node joins a cluster, it may assume that its own current logical version will be matched by the cluster. This enables the node to use modern RPC formats by default, without having to use legacy encoding for its initial cluster join.

  • The original version should reflect the first cluster-wide activated version, not the value of earliest_version when a node started. This change enables the feature table to have an active version without having an original_version: this is a sign that the active version was set ephemerally during startup on a fresh node, rather than set persistently via the controller log or loaded from a snapshot. (Version in this context means logical cluster version, not Redpanda package version.)

  • This is necessary because nodes now initialize their feature table to their earliest compat version before joining a cluster. If we allowed too-new nodes to join, then we would end up in an inconsistent state where the newly joined nodes had features enabled that other nodes did not. Update tests for this:
    - Cover the case of a too-new node trying to join a cluster
    - Update the existing test to use the previous version instead of a hardcoded 22.1
    - Add a variant of the "too old" test that uses synthetic versions rather than a physically older version. This gives more confidence that the version check is really happening, as previously the test could false-pass if there was some other incompatibility between the versions.

  • This is needed to select the RPC transport version when this class constructs its own RPC transport objects.

  • While we have an rpc_v2_by_default feature flag that is respected in connection_cache, there are various places that don't use the connection cache. Use transport_version::v2 for join requests, because we should never be trying to join a 23.1.x node to versions so old that they don't speak rpc v2. For the metadata dissemination service, be conservative and respect the feature flag, like the connection cache does. (See the sketch after these commit messages.)

  • This test used the latest-version environment variable override to impersonate an older Redpanda. Because the impersonated version was older than our current binary's earliest version, this wasn't working: we had to override the earliest version as well. We should consider removing this test, or modifying it to use real old binaries rather than relying on the current binary being able to authentically output structures in the logical-version-5 format for upgrade testing.
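A hedged sketch of the transport-version policy from the RPC commit above; the helper is illustrative, and only transport_version::v2 and the rpc_v2_by_default flag are named in the PR itself:

rpc::transport_version choose_transport_version(
  bool is_join_request, const features::feature_table& ft) {
    if (is_join_request) {
        // A 23.1.x node never joins a cluster too old to speak rpc v2,
        // so join-request transports can start at v2 unconditionally.
        return rpc::transport_version::v2;
    }
    // Elsewhere (e.g. metadata dissemination), stay conservative and
    // respect the feature flag, as connection_cache does.
    return ft.is_active(features::feature::rpc_v2_by_default)
             ? rpc::transport_version::v2
             : rpc::transport_version::v1;
}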
@andrwng (Contributor) left a comment

LGTM


@@ -26,11 +27,27 @@ def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self.admin = Admin(self.redpanda)
         self.installer = self.redpanda._installer
+        self.head_latest_logical_version = None
+        self.head_earliest = None
Contributor

nit: head_earliest_logical_version?

@dotnwat merged commit dc0f245 into redpanda-data:dev on Feb 24, 2023
@jcsp deleted the issue-8203-pt2 branch on February 24, 2023 09:47