rabbit_db: Copy feature states early when joining a cluster #9682

dumbbell · 2023-10-11T16:35:18Z

Why

Until now, the feature states were copied from the cluster after the actual join. However, the join may reload the feature flags registry from the previous on-disk record, which defeats the purpose of copying the cluster's states.

This order was chosen to keep error handling simple.

How

Now, we copy the remote cluster's feature states right after the reset.

If the join fails, we reset the feature flags again, including the on-disk states.
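The ordering described above can be sketched as a small Python model. This is a hypothetical illustration, not RabbitMQ code: the `Registry` class, `join_cluster`, and the `do_join` callback are invented names, and the on-disk record is assumed to be a JSON file. The sketch shows why copying before the join matters: if the join reloads the registry from disk, it finds the cluster's states instead of the stale local record.

```python
import json
import os
import tempfile


class Registry:
    """Hypothetical model of a node's feature flags registry:
    in-memory states plus an on-disk record (a JSON file here)."""

    def __init__(self, path):
        self.path = path
        self.states = {}

    def write(self, states):
        # Update both the in-memory states and the on-disk record.
        self.states = dict(states)
        with open(self.path, "w") as f:
            json.dump(self.states, f)

    def reload(self):
        # Reload from the on-disk record, as a join might do.
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.states = json.load(f)
        else:
            self.states = {}

    def reset(self):
        # Reset the in-memory states and delete the on-disk record.
        self.states = {}
        if os.path.exists(self.path):
            os.remove(self.path)


def join_cluster(registry, remote_states, do_join):
    """Hypothetical join: reset, copy the remote states, then join."""
    registry.reset()
    # Copy the cluster's states right after the reset, so a reload
    # during the join reads them instead of the stale local record.
    registry.write(remote_states)
    try:
        do_join(registry)
    except Exception:
        # The join failed: reset again, including the on-disk states.
        registry.reset()
        raise


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "feature_flags")
    reg = Registry(path)
    reg.write({"example_flag": False})  # stale local record
    # Simulate a join that reloads the registry from disk.
    join_cluster(reg, {"example_flag": True}, lambda r: r.reload())
    print(reg.states)  # {'example_flag': True}
```

Writing the copied states before the join is also what makes the second reset necessary: without it, a failed join would leave the remote cluster's states on disk.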

This is probably the result of a copy-paste.

[Why] When a Khepri-based node joins a Mnesia-based cluster, it is reset and switches back from Khepri to Mnesia. If Mnesia files are left in its data directory, Mnesia restarts with stale or incorrect data and the operation fails. After a migration to Khepri, we need to make sure that no stale Mnesia files remain. [How] We use `rabbit_mnesia` to query the Mnesia files and delete them.
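A minimal sketch of the cleanup step, in Python for illustration only. The `list_mnesia_files` callable is a hypothetical stand-in for the query made through `rabbit_mnesia`; the real module's functions are not shown here, and the file names in the usage example are placeholders.

```python
import os
import tempfile


def delete_stale_mnesia_files(list_mnesia_files):
    """Delete every file reported by list_mnesia_files().

    list_mnesia_files is a hypothetical callable returning file paths;
    it stands in for the query made through rabbit_mnesia."""
    deleted = []
    for path in list_mnesia_files():
        try:
            os.remove(path)
            deleted.append(path)
        except FileNotFoundError:
            # Already gone: nothing stale to remove.
            pass
    return deleted


if __name__ == "__main__":
    data_dir = tempfile.mkdtemp()
    # Placeholder names simulating files left behind by Mnesia.
    files = [os.path.join(data_dir, n) for n in ("schema.DAT", "LATEST.LOG")]
    for path in files:
        open(path, "w").close()
    print(len(delete_stale_mnesia_files(lambda: files)))  # 2
```

Tolerating already-missing files keeps the cleanup idempotent, so running it twice after a migration is harmless.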

[Why] This relies on Mnesia and is only useful when using Mnesia. Indeed, there is no partition handling when Khepri is used.

[Why] So far, `reset_registry/0` only reset the in-memory states and left the on-disk record in place. This is inconsistent. [How] After resetting the in-memory states, we also remove the file on disk.

…disk [Why] Sometimes we need to reset only the in-memory registry, for instance when we restart the `rabbit` application but not the whole Erlang node. At other times, we also need to delete the feature states on disk, for instance when a node joins a cluster. [How] We expose a new `reset/0` function that covers both the in-memory and on-disk states. It will be used in a follow-up commit to correctly reset the feature flags states in `rabbit_db_cluster:join/2`.
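The distinction between the two kinds of reset can be modelled in a few lines of Python. This is a hypothetical sketch: `FeatureStates`, `reset_in_memory`, and `reset` are illustrative names (the last one only loosely mirrors the new `reset/0`), and the on-disk record is assumed to be a JSON file.

```python
import json
import os
import tempfile


class FeatureStates:
    """Hypothetical model: cached feature states backed by a file."""

    def __init__(self, path):
        self.path = path
        self.cache = None  # None means "not loaded yet"

    def load(self):
        # Return cached states, (re)loading from disk if needed.
        if self.cache is None:
            if os.path.exists(self.path):
                with open(self.path) as f:
                    self.cache = json.load(f)
            else:
                self.cache = {}
        return self.cache

    def reset_in_memory(self):
        # Like restarting the `rabbit` application: drop the cache but
        # keep the on-disk record, so the states are reloaded later.
        self.cache = None

    def reset(self):
        # Like joining a cluster: drop the cache AND delete the on-disk
        # record, so no stale state can be reloaded.
        self.cache = None
        if os.path.exists(self.path):
            os.remove(self.path)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "feature_flags")
    with open(path, "w") as f:
        json.dump({"example_flag": True}, f)
    fs = FeatureStates(path)
    fs.reset_in_memory()
    print(fs.load())  # {'example_flag': True}
    fs.reset()
    print(fs.load())  # {}
```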

dumbbell added this to the 3.13.0 milestone Oct 11, 2023

dumbbell self-assigned this Oct 11, 2023

dumbbell force-pushed the fix-feature-flags-reset-during-join_cluster branch 9 times, most recently from 8ebc018 to a010a79, October 16, 2023 14:55

dumbbell added 6 commits October 17, 2023 09:38

rabbit_khepri: Fix incorrect log level

cfdf9dc

rabbit_node_monitor: Don't monitor partitioned nodes when using Khepri

a98637e

rabbit_feature_flags: Remove feature_flags file as part of a reset

86e5431

dumbbell force-pushed the fix-feature-flags-reset-during-join_cluster branch from a010a79 to f95ccaf, October 17, 2023 07:38

dumbbell marked this pull request as ready for review October 17, 2023 10:58

dumbbell merged commit 549a87c into main Oct 17, 2023
16 checks passed

dumbbell deleted the fix-feature-flags-reset-during-join_cluster branch October 17, 2023 10:58

dumbbell mentioned this pull request Oct 19, 2023: Feature flags need quality of life improvements #9677 (open, 8 tasks)
