
cluster: fast-forward feature table before joining cluster #8282

Merged: 9 commits into redpanda-data:dev from issue-8203-pt2 on Feb 24, 2023

Conversation

@jcsp (Contributor) commented Jan 17, 2023

This follows #8225

  • Introduce the concept of an "earliest version" in the feature table. This is the earliest version that the node will interact with, and it is the version that it will fast-forward its feature table to on an initial startup. The former property makes the latter safe: if our earliest version is 8 and we fast-forward to 8, then there is no risk of that being "too far", because we should never be allowed to join a cluster whose active version is any older than that.
  • Expose earliest+latest versions in the admin API: this makes it easy for an external program, such as a test, to reason about whether the node is fully updated.
  • Extend validation of node_join_request RPCs: the incoming RPC specifies its earliest-latest version range, and the RPC is rejected if the active version of the cluster is not within that range, inclusive (see the sketch after this list).
  • Fix locations where RPC clients were always being constructed with transport version 1; this included the clients used to generate node join requests. Transports for join requests are now always constructed at rpc version 2, which is safe because a 23.1 node should never be trying to join a <= 21.11 cluster.
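A minimal sketch of the range check from the third bullet; the helper name and types here are illustrative, not the actual redpanda code:

#include <cstdint>

using cluster_version = int64_t; // stand-in for cluster::cluster_version

// Accept a join request only if the cluster's active version lies inside
// the joiner's [earliest, latest] supported range, inclusive.
bool version_compatible(
  cluster_version active_version,
  cluster_version joiner_earliest,
  cluster_version joiner_latest) {
    return active_version >= joiner_earliest
           && active_version <= joiner_latest;
}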

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

None

Release Notes

Improvements

  • Additional safety checks on the version compatibility of nodes joining a cluster. Please ensure that when adding nodes to a cluster, the new nodes are from the same feature series, or at most one feature series newer (e.g. adding a 23.1.1 node to a cluster of 22.3.x nodes). Nodes from an older feature version will not be permitted to join a cluster that has previously been upgraded fully to a newer feature version.

Comment on lines 98 to 102
if (req.logical_version > expect_version) {
    vlog(
      clusterlog.warn,
      "Rejecting join request from node {}, logical version {} > {}",
      req.node,
      req.logical_version,
      expect_version);
Member

the interesting cases here would be

  1. a rolling upgrade: a newer upgraded node wouldn't be "joining" in this case (i was just thinking that the same api were used for new nodes joining and existing nodes just showing up to participate).
  2. we wouldn't be able to perform a rolling upgrade into a larger cluster (ie upgrade+node-add simultaneously)? not that we'd want to anyway...

? that's all i can think of

Contributor Author

> a rolling upgrade: a newer upgraded node wouldn't be "joining" in this case (i was just thinking that the same api were used for new nodes joining and existing nodes just showing up to participate).

Hmm, good point, need to think about the cases where folks "upgrade" a node by killing it and starting a brand new one. I guess this check is too strict for that. That does rather throw a spanner in the works. I'll think about this: maybe we need to do something a bit more nuanced, like giving nodes a "base feature version" that they'll fast-forward to before joining, and then having them advertise that in a join message so that we would still be safe, but would permit newer versions to join as long as their base version wasn't ahead of our current version.

Contributor

> existing nodes just showing up to participate

Right, the API is used to register members of an existing cluster, but we'd only fast forward the feature table if the joining node were completely empty. For the typical upgrade path, wouldn't we still have data and join whatever version existed before the upgrade?

Member

> For the typical upgrade path, wouldn't we still have data and join whatever version existed before the upgrade?

i'm not sure this is always the case. for local-ssd k8s in cloud, i'm pretty sure we nuke all the local data and the upgraded node joins as a new, empty node.

Comment on lines +1654 to +1661
// We do _not_ write a snapshot here: the persistent record of
// feature table state is only set for the first time in
// bootstrap_backend (or feature_backend). This is important,
// so that someone who starts a too-new Redpanda that can't join
// their cluster can easily stop it and run an older version,
// before we've committed any version info to disk.
Member

this makes sense. does node_join (the other crucial part of this gating) happen early enough that we can assume that other parts of redpanda, that may be feature-aware and enabled by an incorrectly-assumed-too-new-version, will also obey this rule and not send rpcs out or write data to disk that it shouldn't given the version mismatch?

@andrwng (Contributor) commented Feb 24, 2023

This should be safe -- this happens when we first start up, there's no data locally, and we haven't joined a cluster, so until we are able to join a cluster, we at least shouldn't be able to host new partitions. I don't think we have other persistent state that needs to be gated here -- the only things that come to mind would mandate having already joined the cluster and created some topics.

Comment on lines +1627 to +1630
// because we might have joined via an earlier version of redpanda, and
// just upgraded to a version that stores cluster and node UUIDs. We must
// also check for a controller log state on disk.
Contributor

Do we need to keep maintaining logic for upgrading directly from v22.2? Upon upgrading to v22.3, we will asynchronously write a cluster UUID if one doesn't exist (https://github.com/redpanda-data/redpanda/blob/dev/src/v/cluster/controller.cc#L645-L659). It's not foolproof, since even if we waited for the version to bump, there's no guarantee that we write the cluster UUID before attempting the next restart, but it's still worth pondering. Curious if you saw some tests failing without this check.


tests/rptest/tests/cluster_features_test.py: outdated review comment, resolved.
@jcsp force-pushed the issue-8203-pt2 branch 2 times, most recently from bb80398 to 1f0deae on February 22, 2023 14:28
@jcsp (Contributor Author) commented Feb 22, 2023

@dotnwat @andrwng thank you for reviews -- I've revised this quite a bit.

The basic fast forwarding is still there, but now we fast forward to a new "earliest logical version" rather than all the way to the latest version. That means that a brand new 23.1 node will fast-forward to a 22.3.x compatible feature table, and thereby be able to join a 22.3.x cluster (important for the kubernetes upgrades that nuke and replace nodes).

Now that we're only fast forwarding to an earliest supported version, the logic for doing the fast forward is less risky: if we get the "this is a bare node" check wrong, we are fast forwarding to a safe version rather than potentially skipping ahead of the active version.

When testing this, I also realized that we weren't actually respecting rpc_v2_by_default everywhere. This included node joins, which created connections directly rather than via the connection cache. That is fixed, so we really are doing serde everywhere by default now.
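For illustration, a hedged sketch of that startup fast-forward; the function, the node_is_bare flag, and the durability argument are hypothetical stand-ins for the PR's actual code:

// On first startup of a bare node (no feature table snapshot, no
// controller log on disk), bump the in-memory feature table only as far
// as the earliest supported version.  Nothing is persisted here, so a
// too-new binary that cannot join can still be stopped and replaced
// with an older one.
void maybe_fast_forward(features::feature_table& ft, bool node_is_bare) {
    if (node_is_bare) {
        ft.set_active_version(
          ft.get_earliest_logical_version(),
          features::feature_table::version_durability::ephemeral);
    }
}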

@andrwng (Contributor) left a comment

The approach makes sense to me -- I like the idea of encoding / explicitly supporting just an upgrade from the previous feature version, as we do elsewhere.

Comment on lines 1050 to 1052
// The lowest version that the joining node supports; it will already
// have its feature table initialized to this version.
cluster_version earliest_logical_version{cluster::invalid_version};
Contributor

IIRC the typical guidance for adding fields to envelopes is that if we don't want to bump the compat_version, this would need to be added to the end of the struct.

Contributor Author

In this instance the declaration was mid-struct, but the item was at the end of the serde_fields list, so encoding should work properly. However, you're right that this was a bit wonky-looking; I've moved the declaration to match the order of the serialized fields.
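For illustration, a simplified stand-in (not the real struct) showing why this works: with serde envelopes, the wire order is whatever serde_fields() returns, so a field declared mid-struct but listed last still lands at the end of the encoding:

struct example_join_request
  : serde::envelope<
      example_join_request,
      serde::version<1>,
      serde::compat_version<0>> {
    // Declared first, but serialized last (see serde_fields below), so
    // older decoders that stop reading early never see it.
    cluster_version earliest_logical_version{cluster::invalid_version};
    cluster_version latest_logical_version{cluster::invalid_version};

    auto serde_fields() {
        // The new field is appended at the end of the wire format.
        return std::tie(latest_logical_version, earliest_logical_version);
    }
};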

@@ -232,10 +232,10 @@ void feature_state::notify_version(cluster_version v) {
     }
 }
 
-void feature_table::set_active_version(cluster_version v) {
+void feature_table::set_active_version(cluster_version v, bool ephemeral) {
Contributor

nit: maybe consider some enum like version_setting_type{ cluster_driven, local_only } or somesuch?

Contributor Author

Good idea, I've switched this bool to an enum.
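A sketch of that change using the reviewer's suggested naming (the enum name, the _active_version member, and the persistence hook below are illustrative; the PR's actual identifiers may differ):

enum class version_setting_type {
    cluster_driven, // set via the controller log or a snapshot; persistent
    local_only      // ephemeral startup fast-forward; never written to disk
};

void feature_table::set_active_version(
  cluster_version v, version_setting_type type) {
    _active_version = v; // hypothetical member holding the active version
    if (type == version_setting_type::cluster_driven) {
        // Only cluster-driven updates reach persistent state
        // (feature_backend / snapshots); local_only updates do not.
        persist_version_update(v); // hypothetical persistence hook
    }
}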

Commit messages

  • This is to enable nodes to safely fast-forward their feature tables to this version when they first start. They cannot fast-forward all the way to their latest version, because they might be joining a cluster of older-versioned nodes.

  • This enables tests and tools to reason about the cluster_version and original_version in the context of the supported version range of the running code.

  • Before the node joins a cluster, it may assume that its own current logical version will be matched by the cluster. This enables the node to use modern RPC formats by default, without having to use legacy encoding for its initial cluster join.

  • The original version should reflect the first cluster-wide activated version, not the value of earliest_version when a node started. This change enables the feature table to have an active version without having an original_version: this is a sign that the active version was set ephemerally during startup on a fresh node, rather than set persistently via the controller log or loaded from a snapshot. (Version in this context means logical cluster version, not Redpanda package version.)

  • This is necessary because nodes now initialize their feature table to their earliest compat version before joining a cluster. If we allowed too-new nodes to join, then we would end up in an inconsistent state where the newly joined nodes had features enabled that other nodes did not. Update tests for this:
    - Cover the case of a too-new node trying to join a cluster
    - Update the existing test to use the previous version instead of a hardcoded 22.1
    - Add a variant of the "too old" test that uses synthetic versions rather than a physically older version. This gives more confidence that the version check is really happening, as previously the test could false-pass if there was some other incompatibility between the versions.

  • This is needed to select the RPC transport version when this class constructs its own RPC transport objects.

  • While we have an rpc_v2_by_default feature flag that is respected in connection_cache, there are various places that don't use the connection cache. Use transport_version::v2 for join requests, because we should never be trying to join a 23.1.x node to versions so old that they don't speak rpc v2. For the metadata dissemination service, be conservative and respect the feature flag, like the connection cache does. (See the sketch after these commit messages.)

  • This test used the latest-version environment variable override to impersonate an older Redpanda. Because the impersonated version was older than our current binary's earliest version, this wasn't working: we had to override the earliest version as well. We should consider removing this test, or modifying it to use real old binaries rather than relying on the current binary being able to authentically output structures in the logical-version-5 format for upgrade testing.
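A hedged sketch of the transport-version policy from the RPC commit above; the helper is illustrative, and only transport_version::v2 and the rpc_v2_by_default flag are named in the PR itself:

rpc::transport_version choose_transport_version(
  bool is_join_request, const features::feature_table& ft) {
    if (is_join_request) {
        // A 23.1.x node never joins a cluster too old to speak rpc v2,
        // so join-request transports can start at v2 unconditionally.
        return rpc::transport_version::v2;
    }
    // Elsewhere (e.g. metadata dissemination), stay conservative and
    // respect the feature flag, as connection_cache does.
    return ft.is_active(features::feature::rpc_v2_by_default)
             ? rpc::transport_version::v2
             : rpc::transport_version::v1;
}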
@andrwng (Contributor) left a comment

LGTM


@@ -26,11 +27,27 @@ def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self.admin = Admin(self.redpanda)
         self.installer = self.redpanda._installer
+        self.head_latest_logical_version = None
+        self.head_earliest = None
Contributor

nit: head_earliest_logical_version?

@dotnwat merged commit dc0f245 into redpanda-data:dev on Feb 24, 2023
@jcsp deleted the issue-8203-pt2 branch on February 24, 2023 09:47