-
Notifications
You must be signed in to change notification settings - Fork 4k
Relax feature flag compat check during join cluster #9729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
dumbbell
merged 3 commits into
main
from
relax-feature-flag-compat-check-during-join_cluster
Oct 1, 2024
Merged
Relax feature flag compat check during join cluster #9729
dumbbell
merged 3 commits into
main
from
relax-feature-flag-compat-check-during-join_cluster
Oct 1, 2024
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
c0be3ee to
783ddce
Compare
9 tasks
783ddce to
5870c34
Compare
194795b to
ead4a05
Compare
c14d327 to
dba4b6e
Compare
... with older RabbitMQ versions which don't know about Khepri. [Why] When an older node wants to join a cluster, it calls `node_info/0` and `cluster_status_from_mnesia/0` directly using RPC calls. If it does that against a node already using Khepri, t will get an error telling it that Mnesia is not running. The error is reported to the end user, making it difficult to understand the problem: both nodes are simply incompatible. It's better to leave the final decision to the Feature flags subsystem, but for that, `rabbit_mnesia` on the newer Khepri-based node still needs to return something the older version can accept. [How] `cluster_status_from_mnesia/0` and `node_info/0` are modified to verify if Khepri is enabled and if it is, return a value based on Khepri's status as if it was from Mnesia. This will let the remote older node to continue all its checks and eventually refuse to join because the Feature flags subsystem will indicate they are incompatible.
…stency` is false [Why] `CheckNodesConsistency` is set to false when the `check_cluster_consistency()` is called as part of a node joining a cluster. And the generic compatibility check was already executed by `rabbit_db_cluster`. There is no need to run it again. This is even counter-productive with the improvement to `rabbit_feature_flags:check_node_compatibility/2` that follows.
... that considers the local node as if it was reset. [Why] When a node joins a cluster, we check its compatibility with the cluster, reset the node, copy the feature flags states from the remote cluster and add that node to the cluster. However, the compatibility check is performed with the current feature flags states, even though they are about to be reset. Therefore, a node with an enabled feature flag that is unsupported by the cluster will refuse to join. It's incorrect because after the reset and the states copy, it could have join the cluster just fine. [How] We introduce a new variant of `check_node_compatibility/2` that takes an argument to indicate if the local node should be considered as a virgin node (i.e. like after a reset). This way, the joining node will always be able to join, regardless of its initial feature flags states, as long as it doesn't require a feature flag that is unsupported by the cluster. This also removes the need to use `$RABBITMQ_FEATURE_FLAGS` environment variable to force a new node to leave stable feature flags disabled to allow it to join a cluster running an older version. References #9677.
dba4b6e to
f69c082
Compare
mkuratczyk
approved these changes
Oct 1, 2024
mkuratczyk
added a commit
to rabbitmq/rabbitmq-website
that referenced
this pull request
Oct 1, 2024
rabbitmq/rabbitmq-server#9729 has been merged. Starting with 4.1, there's no need to disable the new FFs when starting a new node.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Why
When a node joins a cluster, we check its compatibility with the cluster, reset the node, copy the feature flags states from the remote cluster and add that node to the cluster.
However, the compatibility check is performed with the current feature flags states, even though they are about to be reset. Therefore, a node with an enabled feature flag that is unsupported by the cluster will refuse to join. It's incorrect because after the reset and the states copy, it could have join the cluster just fine.
How
We introduce a new variant of
check_node_compatibility/2that takes an argument to indicate if the local node should be considered as a virgin node (i.e. like after a reset).This way, the joining node will always be able to join, regardless of its initial feature flags states, as long as it doesn't require a feature flag that is unsupported by the cluster.
This also removes the need to use
$RABBITMQ_FEATURE_FLAGSenvironment variable to force a new node to leave stable feature flags disabled to allow it to join a cluster running an older version.References #9677.