storage: introduce on_master_enable service #646

Open

mrForza wants to merge 12 commits into tarantool:master from mrForza:gh-214-stray-tcp-doubled-buckets

Conversation

@mrForza (Contributor) commented Mar 20, 2026

Before this patch the rebalancer and recovery services could start
right after a master switch (triggered by auto master detection or
manual reconfiguration), before the master had time to sync its vclock
with the other replicas in the replicaset. This could lead to doubled
buckets, as described in the "Doubled buckets" RFC.

To fix it we introduce a new storage service, on_master_enable. When
the master changes in a replicaset, this service is triggered and waits
until the newly elected master syncs its vclock with the other
replicas. The other storage services, rebalancer and recovery, can't
start until on_master_enable sets M.buckets_are_in_sync.

Closes #214

NO_TEST=bugfix
NO_DOC=bugfix
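
Below is a minimal Lua sketch of the gating idea; only `M.buckets_are_in_sync` comes from the description above, while the service-step shape and function name are illustrative, not the actual vshard code.

```lua
-- Hypothetical sketch: a service checks the sync flag before doing work.
-- M.buckets_are_in_sync is set by on_master_enable once the new master
-- has caught up with the other replicas' vclocks.
local function recovery_service_step()
    if not M.buckets_are_in_sync then
        -- The new master may not have received all _bucket updates yet;
        -- acting now could produce doubled buckets. Skip this iteration.
        return
    end
    -- ... the usual recovery iteration goes here ...
end
```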

@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from be90d04 to 2f96b14 on March 21, 2026 11:57
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/consts.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from 83458b3 to 396cc20 on March 25, 2026 13:08
@mrForza requested a review from Serpentian on March 25, 2026 14:43
@mrForza assigned Serpentian and unassigned mrForza on Mar 25, 2026
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread test/luatest_helpers/vtest.lua
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread vshard/storage/init.lua
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua
@Serpentian assigned mrForza and unassigned Serpentian on Mar 26, 2026
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch from 396cc20 to 78cf3e9 on March 31, 2026 14:56
@mrForza requested a review from Serpentian on April 1, 2026 10:14
@mrForza assigned Serpentian and unassigned mrForza on Apr 1, 2026
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch from 78cf3e9 to 4ba14a1 on April 1, 2026 13:01
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread test/router/router.result
Comment thread vshard/error.lua Outdated
Comment thread test/storage-luatest/storage_2_2_test.lua Outdated
@Serpentian requested a review from Gerold103 on April 9, 2026 10:04
@Serpentian assigned Gerold103 and mrForza and unassigned Serpentian on Apr 9, 2026
Comment thread test/storage-luatest/storage_2_2_test.lua Outdated
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from c2ca05c to 2740f3a on April 19, 2026 11:51
@mrForza requested a review from Serpentian on April 19, 2026 14:04
@mrForza assigned Serpentian and Gerold103 and unassigned Gerold103 and mrForza on Apr 19, 2026
@mrForza requested a review from Gerold103 on April 21, 2026 20:47
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from 2fd0348 to f94aad3 on April 22, 2026 13:21
Luatest 1.4.2 has broken compatibility with Tarantool 1.10. See
tarantool/luatest#453 for details. Let's use 1.4.1 for now.

NO_DOC=ci
NO_TEST=ci
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch from f94aad3 to 45fdc34 on April 23, 2026 09:12
@Serpentian (Collaborator) left a comment

Thank you for the patch! I'm impressed with the speed at which you produce these patches. I've left some comments, the most critical ones:

  1. We lose data during recovery (the last commit, with the recovery changes)
  2. We don't have an upgrade process (the commit with generation; I wonder how the upgrade tests pass)
  3. We must figure out what to do in #646 (comment), but for that we need an opinion from @Gerold103

end, {bids, uuid})

local msg = "Error during recovery of bucket"
local msg = "Error during recovery of bucket.*MISSING_MASTER"
Collaborator:

Looks like this doesn't help with the flakiness: the catch_flaky and 1.10 workflows fail, let's try to figure this out. And why did you decide to fix it in the first place? AFAICS, it's not changed in your patchset.

@mrForza (author):

It is fixed now, 100%

Collaborator:

That solution with suffixes looks very complicated. Why doesn't hardcoding the error.code help? For me the test is 100% stable with the following patch. Is it for you?

Details

diff --git a/test/storage-luatest/log_verbosity_2_2_test.lua b/test/storage-luatest/log_verbosity_2_2_test.lua
index 846d307..1e8a97a 100644
--- a/test/storage-luatest/log_verbosity_2_2_test.lua
+++ b/test/storage-luatest/log_verbosity_2_2_test.lua
@@ -59,7 +59,7 @@ test_group.test_recovery_do_not_spam_same_errors = function(g)
         box.commit()
     end, {bids, uuid})
 
-    local msg = "Error during recovery of bucket"
+    local msg = "Error during recovery of bucket.*MISSING_MASTER"
     g.replica_1_a:wait_log_exactly_once(msg, {timeout = 0.1,
                                               on_yield = function()
         ivshard.storage.recovery_wakeup()
@@ -118,7 +118,7 @@ test_group.test_rebalancer_do_not_spam_same_errors = function(g)
             error('TimedOut')
         end
     end)
-    local msg = "Error during downloading rebalancer states"
+    local msg = "Error during downloading rebalancer states.*TimedOut"
     g.replica_1_a:wait_log_exactly_once(msg, {timeout = 0.1,
         on_yield = function() ivshard.storage.rebalancer_wakeup() end})
     g.replica_2_a:exec(function()

local function bucket_recovery_continue()
vshard.storage.internal.errinj.ERRINJ_RECOVERY_PAUSE = false
vshard.storage.garbage_collector_wakeup()
vshard.storage.recovery_wakeup()
Collaborator:

And it's not refactoring, as stated in the commit message. We're fixing the test; refactoring should not change the behavior.

Comment thread vshard/storage/init.lua
}

local function recover_bucket_to_state(recovered_buckets, bucket_id, state)
if state == BACTIVE then
Collaborator:

This one is a good commit (with the refactoring of the recovery), I'll take it into my patchset in order to simplify the merging of our fixes in the future, since I'm changing these lines too. And I'll refactor it a little bit to better fit there.

Collaborator:

Please, let's take the commit from #633

Comment thread vshard/replicaset.lua Outdated
-- where the error happened.
--
-local function replicaset_map_call(replicaset, func, args, opts)
+local function replicas_map_send_and_wait(replicas, func, args, opts)
Collaborator:

I agree with Vlad here. replicas_map_call is the better name for the function. Maybe even just map_call. You still can't reuse it or cluster_map_call in map_callrw on router?

Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua
Comment thread vshard/storage/init.lua Outdated
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from c894811 to 5b5f664 on April 28, 2026 19:55
mrForza added 11 commits on April 29, 2026 16:13
This patch fixes the flakiness of the `log_verbosity_2_2_test`. In Vshard
some logs have a composite structure, for example "Error during recovery:
<ERR>", where ERR can change over time and have a different type. In
rare situations this can lead to bugs when we use the
`wait_log_exactly_once` function, because it ignores the volatile part of
the log.

To correctly determine whether a log with a composite structure was
duplicated, we need to compare not only its common (unchanging) part, but
also its volatile part. For this, in `wait_log_exactly_once` we compare
not only the common substrings of the logs, but also their suffixes,
which in some scenarios can change. The suffixes are taken into account
only when the log pattern ends with ".*"; otherwise the behaviour is
unchanged.

Needed for tarantool#214

NO_DOC=test
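A rough illustration of the suffix-aware duplicate check described above (a simplified, hypothetical helper; the real logic lives in `wait_log_exactly_once` in the luatest helpers):

```lua
-- Two log lines count as the same message only if the whole matched
-- span is identical, including the volatile tail captured by '.*'.
local function same_log_message(line_a, line_b, pattern)
    local a = string.match(line_a, '(' .. pattern .. ')')
    local b = string.match(line_b, '(' .. pattern .. ')')
    return a ~= nil and a == b
end

-- Both lines match the common part, but the suffixes differ, so they
-- are not duplicates of each other:
assert(not same_log_message(
    'Error during recovery of bucket: MISSING_MASTER',
    'Error during recovery of bucket: TimedOut',
    'Error during recovery of bucket.*'))
```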
Before this patch we woke up the GC service instead of the recovery
service in `bucket_recovery_pause`. This could lead to longer test
execution times. Fix it by changing `garbage_collector_wakeup` to
`recovery_wakeup`.

Needed for tarantool#214

NO_DOC=test
Before this patch the `bucket_move` and `bucket_wait_transfer` helper
functions were used only in `storage_1_1_1_test`. However, these helpers
are also applicable in future patches (e.g. in tarantool#214).

This patch moves `bucket_move` and `bucket_wait_transfer` into the
`vtest` module so that we can use them in other tests.

Needed for tarantool#214

NO_TEST=test
NO_DOC=test
Before this patch we compared vclocks only in the `wait_lsn` function in
the storage module. However, future patches (e.g. tarantool#214) will
need to do this in tests as well. Also, tarantool#214 will use very
similar vclock-waiting logic but with the opposite sign (all vclock
components of the current storage should be "greater or equal" to the
components of the replicas' vclocks instead of "less or equal").

To avoid code duplication we unify the vclock comparison and transform
`vclock_lesseq` into a more general `vclock_compare` function, which
allows different vclock comparisons via a comparator. We move this
function into the vshard `util` module.

Needed for tarantool#214

NO_DOC=internal
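A sketch of what the generalized function could look like (the exact signature in `vshard.util` may differ; skipping component 0 follows the review discussion below, since it is the local, non-replicated component):

```lua
-- Component-wise vclock comparison driven by a caller-supplied comparator.
local function vclock_compare(vclock1, vclock2, comparator)
    for id, lsn in pairs(vclock1) do
        -- Component 0 is local and not replicated - skip it.
        if id ~= 0 and not comparator(lsn, vclock2[id]) then
            return false
        end
    end
    return true
end

-- The old behaviour as a thin wrapper: every component of vc1 must be
-- reached by vc2.
local function vclock_lesseq(vc1, vc2)
    return vclock_compare(vc1, vc2,
                          function(c1, c2) return c1 <= (c2 or 0) end)
end
```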
Before this patch the recovery service applied the functions which
determine whether a bucket can be recovered one by one in
`recovery_step_by_type`. Since further patches will increase the number
of bucket statuses and make the recovery logic more complex, this would
degrade the readability of the codebase. In order to fix it we decided
to:
1) Use for-loop based checking of buckets' recoverability instead of
   checking them one by one.
2) Join the logic of saving recovered bucket ids and updating the
   `_bucket` space into one separate function, `recover_bucket_to_state`
   (sketched below).

Needed for tarantool#214

NO_TEST=refactoring
NO_DOC=refactoring
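A simplified sketch of point (2); the status field number and the shape of `recovered_buckets` are assumptions for illustration, not the actual implementation:

```lua
-- One place that both updates the _bucket space and records the result
-- of the recovery step.
local function recover_bucket_to_state(recovered_buckets, bucket_id, state)
    -- Assumption: field 2 of _bucket holds the bucket status.
    box.space._bucket:update({bucket_id}, {{'=', 2, state}})
    table.insert(recovered_buckets, {id = bucket_id, state = state})
end
```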
In this patch we extract the block of code responsible for sending
asynchronous requests and waiting for the corresponding responses from
`replicaset_map_call` into a separate function, `replicas_map_call`.

This is done because the next patch, tarantool#214, needs a function
which performs a map-reduce across the masters of the cluster. The
`replicas_map_call` can be reused there.

Needed for tarantool#214

NO_TEST=refactoring
NO_DOC=refactoring
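The shape of the split, roughly (a skeleton under assumptions: the request/response plumbing is elided and `replicaset.replicas` is taken to be the replica table):

```lua
-- Reusable part: fan out an async call to a plain table of replicas
-- and wait for all of the responses.
local function replicas_map_call(replicas, func, args, opts)
    -- 1. Send an async request to every replica in the table.
    -- 2. Wait for each response within opts.timeout.
    -- 3. Collect the results keyed the same way as `replicas`.
end

-- The old entry point becomes a thin wrapper over the new helper.
local function replicaset_map_call(replicaset, func, args, opts)
    return replicas_map_call(replicaset.replicas, func, args, opts)
end
```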
In this patch we modify the API of the `wait_masters_connect` function by
changing the `timeout` parameter into an `opts` parameter with a
`timeout` key. This makes it easier to pass extra parameters to this
function (e.g. `except` in tarantool#214).

Needed for tarantool#214

NO_TEST=refactoring
NO_DOC=refactoring
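An illustration of the call-site change; `this_replicaset_id` is an illustrative value, while `timeout` and `except` are the keys named above:

```lua
-- Before: a single positional parameter.
wait_masters_connect(timeout)

-- After: an options table, easy to extend with new keys,
-- e.g. `except` for gh-214.
wait_masters_connect({timeout = timeout, except = this_replicaset_id})
```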
Before this patch the `rebalancer` and `recovery` services could start
right after a master switch (by `auto` master detection or manual
reconfiguration), before the master had time to sync its vclock with
the other replicas in the replicaset. The newly elected master may not
have received the `_bucket` space updates yet, and when the recovery
service starts on another node it can mistakenly think that there is no
such bucket in this replicaset. This issue, together with others fixed
by further patches, leads to doubled buckets in the cluster.

To fix it we introduce a new storage service, `on_master_enable`. When
the master changes in a replicaset, this service is triggered and waits
until the newly elected master syncs its vclock with the other
replicas. The other storage services, `rebalancer` and `recovery`,
can't start until `on_master_enable` sets `M.is_bucket_in_sync`.

We also change `storage/storage.test`, `storage/recovery.test`,
`storage-luatest/log_verbosity_2_2_test` and `router/router.test` so
that they don't fail: the `rebalancer` and `recovery` services no
longer start immediately after a master switch, which can make some
tests flaky.

Part of tarantool#214

NO_TEST=bugfix
NO_DOC=bugfix
Before this patch the recovery service decided whether a bucket should
be recovered based only on its status on the sender and receiver nodes
and on the `rebalancer_transfering_buckets` table. This approach could
lead to doubled buckets when the recovery service didn't find the
remote bucket on the destination node and recovered the local bucket
into the "active" state.

To partially fix this issue we extend the `_bucket` system space with a
new field, `opts`, of map type. We make it nullable for backward
compatibility with old vshard versions and add one table key,
`generation`. When a bucket is transferred from one node to another,
`bucket_send` increments the bucket's generation by 1, and
`bucket_recv` persists the received generation.

Part of tarantool#214

NO_TEST=internal
NO_DOC=internal
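A sketch of the generation bookkeeping (helper names and the update form are illustrative; old-format tuples without `opts` are treated as generation 0, matching the backward-compatibility note above):

```lua
-- Read the generation, tolerating old-format tuples with no opts map.
local function bucket_generation_of(bucket)
    return bucket.opts ~= nil and bucket.opts.generation or 0
end

-- On the sender, bucket_send would bump the generation before the
-- transfer; bucket_recv then persists the received value.
local function bucket_generation_bump(bucket_id)
    local bucket = box.space._bucket:get{bucket_id}
    local gen = bucket_generation_of(bucket) + 1
    box.space._bucket:update({bucket_id}, {{'=', 'opts', {generation = gen}}})
    return gen
end
```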
In the tarantool#214 patch the storage needs to perform a cluster
map-reduce across the masters of the replicasets. To make this possible
we introduce a new function in the `replicaset` module,
`cluster_map_master_call`.

This function waits until all the given masters are connected, then
creates a table of them (except the one master which shouldn't
participate in the map-reduce) which is passed to
`replicas_map_send_and_wait`. It is crucial to support the `except`
field because some storage services need to perform the map-reduce
excluding their own storage instance.

Part of tarantool#214

NO_TEST=internal
NO_DOC=internal
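An outline matching the description above (hedged: option handling is simplified, and the helpers are the ones introduced earlier in this series):

```lua
-- Map-reduce over the cluster's masters, optionally skipping one
-- replicaset (opts.except), e.g. the caller's own one.
local function cluster_map_master_call(replicasets, func, args, opts)
    local masters = {}
    for id, rs in pairs(replicasets) do
        if id ~= opts.except then
            masters[id] = rs.master
        end
    end
    -- Make sure every selected master is connected before fanning out.
    wait_masters_connect({timeout = opts.timeout, except = opts.except})
    return replicas_map_send_and_wait(masters, func, args, opts)
end
```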
In this main part of the tarantool#214 patch we modify the
`recovery_step_by_type` function so that it can notice a missing bucket
(a bucket which is not found on the destination node).

Now the recovery service works as follows (condensed in the sketch
below):
1) If the remote bucket is not present on the destination node, the
   recovery service searches for it across all the masters except its
   own storage. If some storage responds with an error during the
   search, the recovery service moves on to the next iteration. If the
   bucket is found on another master with a greater generation, the
   local one can be restored to the `garbage` state. Otherwise, if the
   bucket is not found on any other master, it is restored to the
   `active` state.
2) If the remote bucket is present on the destination node and has a
   greater generation, the local bucket is restored to the `garbage`
   state.
3) In other cases we use the old logic of the recovery service.

Closes tarantool#214

NO_DOC=bugfix
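The decision logic from the list above, condensed into a hypothetical helper; `search_bucket_on_masters` and `bucket_generation_of` are illustrative names, and returning `nil` means "retry on the next iteration or fall back to the old status-based logic":

```lua
local function choose_recovery_state(local_bucket, remote_bucket)
    if remote_bucket == nil then
        -- (1) Not on the destination: look for it on the other masters.
        local found, err = search_bucket_on_masters(local_bucket.id)
        if err ~= nil then
            return nil -- some master failed to answer: retry later
        end
        if found == nil then
            return 'active' -- nobody else has it: safe to activate
        end
        if bucket_generation_of(found) >
           bucket_generation_of(local_bucket) then
            return 'garbage' -- a newer copy lives elsewhere
        end
        return nil
    end
    -- (2) The destination has it with a greater generation: drop ours.
    if bucket_generation_of(remote_bucket) >
       bucket_generation_of(local_bucket) then
        return 'garbage'
    end
    return nil -- (3) fall back to the old logic
end
```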
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch from 5b5f664 to 5d3bd46 on April 29, 2026 13:15
@mrForza requested a review from Serpentian on April 29, 2026 13:47
@mrForza assigned Serpentian and Gerold103 and unassigned Gerold103 and mrForza on Apr 29, 2026
@Serpentian (Collaborator) left a comment

I didn't review the last 3 commits yet. A little bit later)

Comment thread vshard/storage/init.lua
--------------------------------------------------------------------------------

-- Vclock comparing function
local function vclock_lesseq(vc1, vc2)
Collaborator:

Commit: util: introduce vclock_compare

Nit: I'd consider keeping the vclock_lesseq function, which from now on would use util.vclock_compare. It'll be way easier to read in the functions where it's used (e.g. wait_lsn); comparators are hard to parse, IMHO.

Comment thread vshard/storage/init.lua Outdated
-- superior to replicas' vclock components.
--
local function storage_wait_vclock_replicated(timeout, interval)
local comparator = function(c1, c2) return c1 <= (c2 or 0) end
Collaborator:

We actually go through vclock1 with a loop, so c1 cannot be nil; it doesn't depend on the local or remote vclock (moreover, we skip the 0 component in vclock_compare). But I don't mind adding the nil check for `c1` just for consistency, even though it's excessive.

Comment thread vshard/storage/init.lua
-- there was found a bug, and reload fixes it.
while module_version == M.module_version do
service:next_iter()
if storage_check_bucket_is_synced(service, limiter) then
@Serpentian (Collaborator) commented Apr 30, 2026:

Let's place the check for sync in the recovery service below the error injection. Right now it's impossible to pause the recovery if the buckets are not in sync.

Comment thread vshard/storage/init.lua
@@ -1215,6 +1248,21 @@ local function sync(timeout)
return wait_lsn(timeout or M.sync_timeout, 0.001)
end
Collaborator:

Commit: storage: introduce on_master_enable service

Let's also block the gc service until the bucket is synchronized. The motivation: the old leader could already have deleted the bucket, the new one will do the same, and we'll get a replication conflict.

Comment thread vshard/storage/init.lua
local function master_on_disable()
log.info("Stepping down from the master role")
M.is_master = false
M.is_bucket_in_sync = false
Collaborator:

Commit: storage: introduce on_master_enable service

Let's place the is_bucket_in_sync in the master_role_update, where the service is started.

And we have a problem now:

  1. Service starts, calls the function on replicas
  2. Node stops being a master
  3. Node becomes master once again
  4. The service still works and waits for old vclocks

Comment thread vshard/storage/init.lua
'vshard.storage._call',
{'storage_wait_vclock_persisted', wait_interval}, opts)
if err then
log.warn('Error during on_master_enable: %s ' ..
Collaborator:

Nit: log limiter, it may spam a lot

@Serpentian (Collaborator) left a comment

We're moving fast! Great work! The patchset looks way better now, we just need a little bit more polishing


test_group.after_all(function(g)
g.cluster:drop()
g.cluster:stop()
Collaborator:

-- Version 0.1.40.0
--------------------------------------------------------------------------------
local version_0_1_40_0 = table.deepcopy(version_0_1_16_0)
version_0_1_40_0.version = '0.1.40.0'
Collaborator:

Should be 41, the 40th release doesn't need these changes

Comment thread vshard/storage/schema.lua
local _bucket = box.space._bucket
local bucket_space_id = _bucket.id
local bucket_space_format = box.space._space:get(bucket_space_id).format
require('vshard.util').table_extend(bucket_space_format,
Collaborator:

Nit: let's not inline the requires. Moreover, we already have lutil in this file.

Comment thread vshard/storage/init.lua
--
local function storage_wait_vclock_persisted(wait_interval)
-- A guarantee that no new transactions will be written to this storage.
while M.is_master or rebalancing_is_in_progress() do
Collaborator:

Let's also wait for the recovery and gc services to stop, to be sure that none of the internal services can change the bucket space.

Comment thread vshard/storage/init.lua
res_bucket_info.id, res_id)
return res_id, res_bucket_info
end
return M.this_replicaset.id, nil
Collaborator:

Firstly, no need to return something and nil at the end. Secondly, why do we need to return that M.this_replicaset? It's obvious that the current one has it, since we're trying to recover it. Let's just return nil; returning M.this_replicaset complicates the code without any reason, IMHO.

Comment thread vshard/storage/init.lua
M.replicasets, 'vshard.storage._call',
{'recovery_bucket_stat', bucket_id}, opts)
if err then
limiter:log_error(err, 'Error %s during scanning the master: %s',
Collaborator:

This may spam a lot. We have very sophisticated logic for logging in recovery to avoid message spamming. Let's just return the err and log the message iff is_step_empty

Comment thread vshard/storage/init.lua
err = lerror.make(string.format(
'Cluster is invalid. bucket %s can\'t have generation %s',
remote_bucket.id, remote_bucket.generation))
limiter:log_error(err, err.message)
Collaborator:

Maybe just an assertion instead of the log message?

@Serpentian assigned mrForza and unassigned Serpentian on Apr 30, 2026
Successfully merging this pull request may close these issues:

Stray TCP message with big delay may duplicate a bucket (#214)

3 participants