Relax feature flag compat check during join cluster #9729

dumbbell · 2023-10-19T08:04:13Z

Why

When a node joins a cluster, we check its compatibility with the cluster, reset the node, copy the feature flags states from the remote cluster and add that node to the cluster.

However, the compatibility check is performed with the current feature flags states, even though they are about to be reset. Therefore, a node with an enabled feature flag that is unsupported by the cluster will refuse to join. This is incorrect because, after the reset and the copy of the states, it could have joined the cluster just fine.

How

We introduce a new variant of `check_node_compatibility/2` that takes an argument to indicate if the local node should be considered as a virgin node (i.e. like after a reset).

This way, the joining node will always be able to join, regardless of its initial feature flags states, as long as it doesn't require a feature flag that is unsupported by the cluster.

This also removes the need to use the `$RABBITMQ_FEATURE_FLAGS` environment variable to force a new node to leave stable feature flags disabled so that it can join a cluster running an older version.

References #9677.

... with older RabbitMQ versions which don't know about Khepri.

[Why]
When an older node wants to join a cluster, it calls `node_info/0` and `cluster_status_from_mnesia/0` directly using RPC calls. If it does that against a node already using Khepri, it will get an error telling it that Mnesia is not running. The error is reported to the end user, making it difficult to understand the actual problem: both nodes are simply incompatible.

It's better to leave the final decision to the Feature flags subsystem, but for that, `rabbit_mnesia` on the newer Khepri-based node still needs to return something the older version can accept.

[How]
`cluster_status_from_mnesia/0` and `node_info/0` are modified to check whether Khepri is enabled and, if it is, return a value based on Khepri's status as if it came from Mnesia.

This will let the remote older node continue all its checks and eventually refuse to join because the Feature flags subsystem will indicate they are incompatible.

…stency` is false

[Why]
`CheckNodesConsistency` is set to false when `check_cluster_consistency()` is called as part of a node joining a cluster. The generic compatibility check was already executed by `rabbit_db_cluster`, so there is no need to run it again.

Running it again is even counter-productive with the improvement to `rabbit_feature_flags:check_node_compatibility/2` that follows.

... that considers the local node as if it was reset.

[Why]
When a node joins a cluster, we check its compatibility with the cluster, reset the node, copy the feature flags states from the remote cluster and add that node to the cluster.

However, the compatibility check is performed with the current feature flags states, even though they are about to be reset. Therefore, a node with an enabled feature flag that is unsupported by the cluster will refuse to join. This is incorrect because, after the reset and the copy of the states, it could have joined the cluster just fine.

[How]
We introduce a new variant of `check_node_compatibility/2` that takes an argument to indicate if the local node should be considered as a virgin node (i.e. like after a reset).

This way, the joining node will always be able to join, regardless of its initial feature flags states, as long as it doesn't require a feature flag that is unsupported by the cluster.

This also removes the need to use the `$RABBITMQ_FEATURE_FLAGS` environment variable to force a new node to leave stable feature flags disabled so that it can join a cluster running an older version.

References #9677.

dumbbell added this to the 3.13.0 milestone Oct 19, 2023

dumbbell self-assigned this Oct 19, 2023

dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch from c0be3ee to 783ddce Compare October 19, 2023 10:57

dumbbell mentioned this pull request Oct 20, 2023

Feature flags need quality of life improvements #9677

Closed

9 tasks

dumbbell removed this from the 3.13.0 milestone Oct 20, 2023

dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch from 783ddce to 5870c34 Compare October 24, 2023 13:24

dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch 2 times, most recently from 194795b to ead4a05 Compare September 24, 2024 15:50

mergify bot added the bazel label Sep 24, 2024

dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch 3 times, most recently from c14d327 to dba4b6e Compare September 25, 2024 14:55

dumbbell marked this pull request as ready for review September 25, 2024 16:10

dumbbell requested a review from mkuratczyk September 25, 2024 16:10

dumbbell added 3 commits October 1, 2024 10:47

dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch from dba4b6e to f69c082 Compare October 1, 2024 08:52

mkuratczyk approved these changes Oct 1, 2024

dumbbell merged commit 6855ebc into main Oct 1, 2024
439 checks passed

dumbbell deleted the relax-feature-flag-compat-check-during-join_cluster branch October 1, 2024 09:52

mkuratczyk added a commit to rabbitmq/rabbitmq-website that referenced this pull request Oct 1, 2024

grow-then-shrink: remove FF note

5adaad3

rabbitmq/rabbitmq-server#9729 has been merged. Starting with 4.1, there's no need to disable the new FFs when starting a new node.

dumbbell added this to the 4.1.0 milestone Oct 3, 2024
