storage: introduce on_master_enable service #646

Open

mrForza wants to merge 12 commits into tarantool:master from mrForza:gh-214-stray-tcp-doubled-buckets

Conversation

@mrForza (Contributor) commented Mar 20, 2026

Before this patch the rebalancer and recovery services could start
right after a master switch (triggered by auto master detection or
manual reconfiguration), before the master had time to sync its vclock
with the other replicas in the replicaset. This could lead to doubled
buckets, as described in the "Doubled buckets" RFC.

To fix it we introduce a new storage service, on_master_enable. When
the master changes in a replicaset, this service is triggered and waits
until the newly elected master syncs its vclock with the other
replicas. The other storage services, rebalancer and recovery, can't
start until on_master_enable sets M.buckets_are_in_sync.

Closes #214

NO_TEST=bugfix
NO_DOC=bugfix
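
Below is a minimal Lua sketch of the gating idea; only `M.buckets_are_in_sync` comes from the description above, while the service-step shape and function name are illustrative, not the actual vshard code.

```lua
-- Hypothetical sketch: a service checks the sync flag before doing work.
-- M.buckets_are_in_sync is set by on_master_enable once the new master
-- has caught up with the other replicas' vclocks.
local function recovery_service_step()
    if not M.buckets_are_in_sync then
        -- The new master may not have received all _bucket updates yet;
        -- acting now could produce doubled buckets. Skip this iteration.
        return
    end
    -- ... the usual recovery iteration goes here ...
end
```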

@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from be90d04 to 2f96b14 on March 21, 2026 11:57
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/consts.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from 83458b3 to 396cc20 on March 25, 2026 13:08
@mrForza requested a review from Serpentian on March 25, 2026 14:43
@mrForza assigned Serpentian and unassigned mrForza on Mar 25, 2026
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread test/luatest_helpers/vtest.lua
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread vshard/storage/init.lua
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua
@Serpentian assigned mrForza and unassigned Serpentian on Mar 26, 2026
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch from 396cc20 to 78cf3e9 on March 31, 2026 14:56
@mrForza requested a review from Serpentian on April 1, 2026 10:14
@mrForza assigned Serpentian and unassigned mrForza on Apr 1, 2026
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch from 78cf3e9 to 4ba14a1 on April 1, 2026 13:01
Comment thread test/luatest_helpers/vtest.lua Outdated
Comment thread test/router/router.result
Comment thread vshard/error.lua Outdated
Comment thread test/storage-luatest/storage_2_2_test.lua Outdated
@Serpentian requested a review from Gerold103 on April 9, 2026 10:04
@Serpentian assigned Gerold103 and mrForza and unassigned Serpentian on Apr 9, 2026
Comment thread test/storage-luatest/storage_2_2_test.lua Outdated
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from c2ca05c to 2740f3a on April 19, 2026 11:51
@mrForza requested a review from Serpentian on April 19, 2026 14:04
@mrForza assigned Serpentian and Gerold103 and unassigned Gerold103 and mrForza on Apr 19, 2026
@mrForza requested a review from Gerold103 on April 21, 2026 20:47
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from 2fd0348 to f94aad3 on April 22, 2026 13:21
Luatest 1.4.2 has broken compatibility with Tarantool 1.10. See
tarantool/luatest#453 for details. Let's use 1.4.1 for now.

NO_DOC=ci
NO_TEST=ci
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch from f94aad3 to 45fdc34 on April 23, 2026 09:12
@Serpentian (Collaborator) left a comment

Thank you for the patch! I'm impressed with the speed at which you produce these patches. I've left some comments, the most critical ones:

  1. We lose data during recovery (the last commit, with the recovery changes)
  2. We don't have an upgrade process (the commit with generation; I wonder how the upgrade tests pass)
  3. We must figure out what to do in #646 (comment), but for that we need an opinion from @Gerold103

end, {bids, uuid})

local msg = "Error during recovery of bucket"
local msg = "Error during recovery of bucket.*MISSING_MASTER"
Collaborator:

Looks like this doesn't help with the flakiness: the catch_flaky and 1.10 workflows fail, let's try to figure this out. And why did you decide to fix it in the first place? AFAICS, it's not changed in your patchset.

@mrForza (author):

It is fixed now, 100%

Collaborator:

That solution with suffixes looks very complicated. Why doesn't hardcoding the error.code help? For me the test is 100% stable with the following patch. Is it for you?

Details

diff --git a/test/storage-luatest/log_verbosity_2_2_test.lua b/test/storage-luatest/log_verbosity_2_2_test.lua
index 846d307..1e8a97a 100644
--- a/test/storage-luatest/log_verbosity_2_2_test.lua
+++ b/test/storage-luatest/log_verbosity_2_2_test.lua
@@ -59,7 +59,7 @@ test_group.test_recovery_do_not_spam_same_errors = function(g)
         box.commit()
     end, {bids, uuid})
 
-    local msg = "Error during recovery of bucket"
+    local msg = "Error during recovery of bucket.*MISSING_MASTER"
     g.replica_1_a:wait_log_exactly_once(msg, {timeout = 0.1,
                                               on_yield = function()
         ivshard.storage.recovery_wakeup()
@@ -118,7 +118,7 @@ test_group.test_rebalancer_do_not_spam_same_errors = function(g)
             error('TimedOut')
         end
     end)
-    local msg = "Error during downloading rebalancer states"
+    local msg = "Error during downloading rebalancer states.*TimedOut"
     g.replica_1_a:wait_log_exactly_once(msg, {timeout = 0.1,
         on_yield = function() ivshard.storage.rebalancer_wakeup() end})
     g.replica_2_a:exec(function()

local function bucket_recovery_continue()
vshard.storage.internal.errinj.ERRINJ_RECOVERY_PAUSE = false
vshard.storage.garbage_collector_wakeup()
vshard.storage.recovery_wakeup()
Collaborator:

And it's not refactoring, as stated in the commit message. We're fixing the test; refactoring should not change the behavior.

Comment thread vshard/storage/init.lua
}

local function recover_bucket_to_state(recovered_buckets, bucket_id, state)
if state == BACTIVE then
Collaborator:

This one is a good commit (with the refactoring of the recovery), I'll take it into my patchset in order to simplify the merging of our fixes in the future, since I'm changing these lines too. And I'll refactor it a little bit to better fit there.

Collaborator:

Please, let's take the commit from #633

Comment thread vshard/replicaset.lua Outdated
-- where the error happened.
--
-local function replicaset_map_call(replicaset, func, args, opts)
+local function replicas_map_send_and_wait(replicas, func, args, opts)
Collaborator:

I agree with Vlad here. replicas_map_call is the better name for the function. Maybe even just map_call. You still can't reuse it or cluster_map_call in map_callrw on router?

Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua
Comment thread vshard/storage/init.lua Outdated
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from c894811 to 5b5f664 on April 28, 2026 19:55
mrForza added 11 commits on April 29, 2026 16:13
This patch fixes the flakiness of the `log_verbosity_2_2_test`. In Vshard
some logs have a composite structure, for example "Error during recovery:
<ERR>", where ERR can change over time and have a different type. In
rare situations this can lead to bugs when we use the
`wait_log_exactly_once` function, because it ignores the volatile part of
the log.

To correctly determine whether a log with a composite structure was
duplicated, we need to compare not only its common (unchanging) part, but
also its volatile part. For this, in `wait_log_exactly_once` we compare
not only the common substrings of the logs, but also their suffixes,
which in some scenarios can change. The suffixes are taken into account
only when the log pattern ends with ".*"; otherwise the behaviour is
unchanged.

Needed for tarantool#214

NO_DOC=test
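A rough illustration of the suffix-aware duplicate check described above (a simplified, hypothetical helper; the real logic lives in `wait_log_exactly_once` in the luatest helpers):

```lua
-- Two log lines count as the same message only if the whole matched
-- span is identical, including the volatile tail captured by '.*'.
local function same_log_message(line_a, line_b, pattern)
    local a = string.match(line_a, '(' .. pattern .. ')')
    local b = string.match(line_b, '(' .. pattern .. ')')
    return a ~= nil and a == b
end

-- Both lines match the common part, but the suffixes differ, so they
-- are not duplicates of each other:
assert(not same_log_message(
    'Error during recovery of bucket: MISSING_MASTER',
    'Error during recovery of bucket: TimedOut',
    'Error during recovery of bucket.*'))
```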
Before this patch we woke up the GC service instead of the recovery
service in `bucket_recovery_pause`. This could lead to longer test
execution times. Fix it by changing `garbage_collector_wakeup` to
`recovery_wakeup`.

Needed for tarantool#214

NO_DOC=test
Before this patch the `bucket_move` and `bucket_wait_transfer` helper
functions were used only in `storage_1_1_1_test`. However, these helpers
are also applicable in future patches (e.g. in tarantool#214).

This patch moves `bucket_move` and `bucket_wait_transfer` into the
`vtest` module so that we can use them in other tests.

Needed for tarantool#214

NO_TEST=test
NO_DOC=test
Before this patch we compared vclocks only in the `wait_lsn` function in
the storage module. However, future patches (e.g. tarantool#214) will
need to do this in tests as well. Also, tarantool#214 will use very
similar vclock-waiting logic but with the opposite sign (all vclock
components of the current storage should be "greater or equal" to the
components of the replicas' vclocks instead of "less or equal").

To avoid code duplication we unify the vclock comparison and transform
`vclock_lesseq` into a more general `vclock_compare` function, which
allows different vclock comparisons via a comparator. We move this
function into the vshard `util` module.

Needed for tarantool#214

NO_DOC=internal
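A sketch of what the generalized function could look like (the exact signature in `vshard.util` may differ; skipping component 0 follows the review discussion below, since it is the local, non-replicated component):

```lua
-- Component-wise vclock comparison driven by a caller-supplied comparator.
local function vclock_compare(vclock1, vclock2, comparator)
    for id, lsn in pairs(vclock1) do
        -- Component 0 is local and not replicated - skip it.
        if id ~= 0 and not comparator(lsn, vclock2[id]) then
            return false
        end
    end
    return true
end

-- The old behaviour as a thin wrapper: every component of vc1 must be
-- reached by vc2.
local function vclock_lesseq(vc1, vc2)
    return vclock_compare(vc1, vc2,
                          function(c1, c2) return c1 <= (c2 or 0) end)
end
```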
Before this patch the recovery service applied the functions which
determine whether a bucket can be recovered one by one in
`recovery_step_by_type`. Since further patches will increase the number
of bucket statuses and make the recovery logic more complex, this would
degrade the readability of the codebase. In order to fix it we decided
to:
1) Use for-loop based checking of buckets' recoverability instead of
   checking them one by one.
2) Join the logic of saving recovered bucket ids and updating the
   `_bucket` space into one separate function, `recover_bucket_to_state`
   (sketched below).

Needed for tarantool#214

NO_TEST=refactoring
NO_DOC=refactoring
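A simplified sketch of point (2); the status field number and the shape of `recovered_buckets` are assumptions for illustration, not the actual implementation:

```lua
-- One place that both updates the _bucket space and records the result
-- of the recovery step.
local function recover_bucket_to_state(recovered_buckets, bucket_id, state)
    -- Assumption: field 2 of _bucket holds the bucket status.
    box.space._bucket:update({bucket_id}, {{'=', 2, state}})
    table.insert(recovered_buckets, {id = bucket_id, state = state})
end
```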
In this patch we extract the block of code responsible for sending
asynchronous requests and waiting for the corresponding responses from
`replicaset_map_call` into a separate function, `replicas_map_call`.

This is done because the next patch, tarantool#214, needs a function
which performs a map-reduce across the masters of the cluster. The
`replicas_map_call` can be reused there.

Needed for tarantool#214

NO_TEST=refactoring
NO_DOC=refactoring
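The shape of the split, roughly (a skeleton under assumptions: the request/response plumbing is elided and `replicaset.replicas` is taken to be the replica table):

```lua
-- Reusable part: fan out an async call to a plain table of replicas
-- and wait for all of the responses.
local function replicas_map_call(replicas, func, args, opts)
    -- 1. Send an async request to every replica in the table.
    -- 2. Wait for each response within opts.timeout.
    -- 3. Collect the results keyed the same way as `replicas`.
end

-- The old entry point becomes a thin wrapper over the new helper.
local function replicaset_map_call(replicaset, func, args, opts)
    return replicas_map_call(replicaset.replicas, func, args, opts)
end
```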
In this patch we modify the API of the `wait_masters_connect` function by
changing the `timeout` parameter into an `opts` parameter with a
`timeout` key. This makes it easier to pass extra parameters to this
function (e.g. `except` in tarantool#214).

Needed for tarantool#214

NO_TEST=refactoring
NO_DOC=refactoring
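An illustration of the call-site change; `this_replicaset_id` is an illustrative value, while `timeout` and `except` are the keys named above:

```lua
-- Before: a single positional parameter.
wait_masters_connect(timeout)

-- After: an options table, easy to extend with new keys,
-- e.g. `except` for gh-214.
wait_masters_connect({timeout = timeout, except = this_replicaset_id})
```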
Before this patch the `rebalancer` and `recovery` services could start
right after a master switch (by `auto` master detection or manual
reconfiguration), before the master had time to sync its vclock with
the other replicas in the replicaset. The newly elected master may not
have received the `_bucket` space updates yet, and when the recovery
service starts on another node it can mistakenly think that there is no
such bucket in this replicaset. This issue, together with others fixed
by further patches, leads to doubled buckets in the cluster.

To fix it we introduce a new storage service, `on_master_enable`. When
the master changes in a replicaset, this service is triggered and waits
until the newly elected master syncs its vclock with the other
replicas. The other storage services, `rebalancer` and `recovery`,
can't start until `on_master_enable` sets `M.is_bucket_in_sync`.

We also change `storage/storage.test`, `storage/recovery.test`,
`storage-luatest/log_verbosity_2_2_test` and `router/router.test` so
that they don't fail: the `rebalancer` and `recovery` services no
longer start immediately after a master switch, which can make some
tests flaky.

Part of tarantool#214

NO_TEST=bugfix
NO_DOC=bugfix
Before this patch the recovery service decided whether a bucket should
be recovered based only on its status on the sender and receiver nodes
and on the `rebalancer_transfering_buckets` table. This approach could
lead to doubled buckets when the recovery service didn't find the
remote bucket on the destination node and recovered the local bucket
into the "active" state.

To partially fix this issue we extend the `_bucket` system space with a
new field, `opts`, of map type. We make it nullable for backward
compatibility with old vshard versions and add one table key,
`generation`. When a bucket is transferred from one node to another,
`bucket_send` increments the bucket's generation by 1, and
`bucket_recv` persists the received generation.

Part of tarantool#214

NO_TEST=internal
NO_DOC=internal
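A sketch of the generation bookkeeping (helper names and the update form are illustrative; old-format tuples without `opts` are treated as generation 0, matching the backward-compatibility note above):

```lua
-- Read the generation, tolerating old-format tuples with no opts map.
local function bucket_generation_of(bucket)
    return bucket.opts ~= nil and bucket.opts.generation or 0
end

-- On the sender, bucket_send would bump the generation before the
-- transfer; bucket_recv then persists the received value.
local function bucket_generation_bump(bucket_id)
    local bucket = box.space._bucket:get{bucket_id}
    local gen = bucket_generation_of(bucket) + 1
    box.space._bucket:update({bucket_id}, {{'=', 'opts', {generation = gen}}})
    return gen
end
```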
In the tarantool#214 patch the storage needs to perform a cluster
map-reduce across the masters of the replicasets. To make this possible
we introduce a new function in the `replicaset` module,
`cluster_map_master_call`.

This function waits until all the given masters are connected, then
creates a table of them (except the one master which shouldn't
participate in the map-reduce) which is passed to
`replicas_map_send_and_wait`. It is crucial to support the `except`
field because some storage services need to perform the map-reduce
excluding their own storage instance.

Part of tarantool#214

NO_TEST=internal
NO_DOC=internal
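An outline matching the description above (hedged: option handling is simplified, and the helpers are the ones introduced earlier in this series):

```lua
-- Map-reduce over the cluster's masters, optionally skipping one
-- replicaset (opts.except), e.g. the caller's own one.
local function cluster_map_master_call(replicasets, func, args, opts)
    local masters = {}
    for id, rs in pairs(replicasets) do
        if id ~= opts.except then
            masters[id] = rs.master
        end
    end
    -- Make sure every selected master is connected before fanning out.
    wait_masters_connect({timeout = opts.timeout, except = opts.except})
    return replicas_map_send_and_wait(masters, func, args, opts)
end
```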
In this main part of the tarantool#214 patch we modify the
`recovery_step_by_type` function so that it can notice a missing bucket
(a bucket which is not found on the destination node).

Now the recovery service works as follows (condensed in the sketch
below):
1) If the remote bucket is not present on the destination node, the
   recovery service searches for it across all the masters except its
   own storage. If some storage responds with an error during the
   search, the recovery service moves on to the next iteration. If the
   bucket is found on another master with a greater generation, the
   local one can be restored to the `garbage` state. Otherwise, if the
   bucket is not found on any other master, it is restored to the
   `active` state.
2) If the remote bucket is present on the destination node and has a
   greater generation, the local bucket is restored to the `garbage`
   state.
3) In other cases we use the old logic of the recovery service.

Closes tarantool#214

NO_DOC=bugfix
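The decision logic from the list above, condensed into a hypothetical helper; `search_bucket_on_masters` and `bucket_generation_of` are illustrative names, and returning `nil` means "retry on the next iteration or fall back to the old status-based logic":

```lua
local function choose_recovery_state(local_bucket, remote_bucket)
    if remote_bucket == nil then
        -- (1) Not on the destination: look for it on the other masters.
        local found, err = search_bucket_on_masters(local_bucket.id)
        if err ~= nil then
            return nil -- some master failed to answer: retry later
        end
        if found == nil then
            return 'active' -- nobody else has it: safe to activate
        end
        if bucket_generation_of(found) >
           bucket_generation_of(local_bucket) then
            return 'garbage' -- a newer copy lives elsewhere
        end
        return nil
    end
    -- (2) The destination has it with a greater generation: drop ours.
    if bucket_generation_of(remote_bucket) >
       bucket_generation_of(local_bucket) then
        return 'garbage'
    end
    return nil -- (3) fall back to the old logic
end
```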
@mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch from 5b5f664 to 5d3bd46 on April 29, 2026 13:15
@mrForza requested a review from Serpentian on April 29, 2026 13:47
@mrForza assigned Serpentian and Gerold103 and unassigned Gerold103 and mrForza on Apr 29, 2026
@Serpentian (Collaborator) left a comment

I didn't review the last 3 commits yet. A little bit later)

Comment thread vshard/storage/init.lua
--------------------------------------------------------------------------------

-- Vclock comparing function
local function vclock_lesseq(vc1, vc2)
Collaborator:

Commit: util: introduce vclock_compare

Nit: I'd consider keeping the vclock_lesseq function, which from now on would use util.vclock_compare. It'll be way easier to read in the functions where it's used (e.g. wait_lsn); comparators are hard to parse, IMHO.

Comment thread vshard/storage/init.lua Outdated
-- superior to replicas' vclock components.
--
local function storage_wait_vclock_replicated(timeout, interval)
local comparator = function(c1, c2) return c1 <= (c2 or 0) end
Collaborator:

We actually go through vclock1 with a loop, so c1 cannot be nil; it doesn't depend on the local or remote vclock (moreover, we skip the 0 component in vclock_compare). But I don't mind adding the nil check for `c1` just for consistency, even though it's excessive.

Comment thread vshard/storage/init.lua
-- there was found a bug, and reload fixes it.
while module_version == M.module_version do
service:next_iter()
if storage_check_bucket_is_synced(service, limiter) then
@Serpentian (Collaborator) commented Apr 30, 2026:

Let's place the check for sync in the recovery service below the error injection. Right now it's impossible to pause the recovery if the buckets are not in sync.

Comment thread vshard/storage/init.lua
@@ -1215,6 +1248,21 @@ local function sync(timeout)
return wait_lsn(timeout or M.sync_timeout, 0.001)
end
Collaborator:

Commit: storage: introduce on_master_enable service

Let's also block the gc service until the bucket is synchronized. The motivation: the old leader could already have deleted the bucket, the new one will do the same, and we'll get a replication conflict.

Comment thread vshard/storage/init.lua
local function master_on_disable()
log.info("Stepping down from the master role")
M.is_master = false
M.is_bucket_in_sync = false
Collaborator:

Commit: storage: introduce on_master_enable service

Let's place the is_bucket_in_sync in the master_role_update, where the service is started.

And we have a problem now:

  1. Service starts, calls the function on replicas
  2. Node stops being a master
  3. Node becomes master once again
  4. The service still works and waits for old vclocks

Comment thread vshard/storage/init.lua
'vshard.storage._call',
{'storage_wait_vclock_persisted', wait_interval}, opts)
if err then
log.warn('Error during on_master_enable: %s ' ..
Collaborator:

Nit: log limiter, it may spam a lot

@Serpentian (Collaborator) left a comment

We're moving fast! Great work! The patchset looks way better now, we just need a little bit more polishing


test_group.after_all(function(g)
g.cluster:drop()
g.cluster:stop()
Collaborator:

-- Version 0.1.40.0
--------------------------------------------------------------------------------
local version_0_1_40_0 = table.deepcopy(version_0_1_16_0)
version_0_1_40_0.version = '0.1.40.0'
Collaborator:

Should be 41, the 40th release doesn't need these changes

Comment thread vshard/storage/schema.lua
local _bucket = box.space._bucket
local bucket_space_id = _bucket.id
local bucket_space_format = box.space._space:get(bucket_space_id).format
require('vshard.util').table_extend(bucket_space_format,
Collaborator:

Nit: let's not inline the requires. Moreover, we already have lutil in this file.

Comment thread vshard/storage/init.lua
--
local function storage_wait_vclock_persisted(wait_interval)
-- A guarantee that no new transactions will be written to this storage.
while M.is_master or rebalancing_is_in_progress() do
Collaborator:

Let's also wait for the recovery and gc services to stop, to be sure that none of the internal services can change the bucket space.

Comment thread vshard/storage/init.lua
res_bucket_info.id, res_id)
return res_id, res_bucket_info
end
return M.this_replicaset.id, nil
Collaborator:

Firstly, no need to return something and nil at the end. Secondly, why do we need to return that M.this_replicaset? It's obvious that the current one has it, since we're trying to recover it. Let's just return nil; returning M.this_replicaset complicates the code without any reason, IMHO.

Comment thread vshard/storage/init.lua
M.replicasets, 'vshard.storage._call',
{'recovery_bucket_stat', bucket_id}, opts)
if err then
limiter:log_error(err, 'Error %s during scanning the master: %s',
Collaborator:

This may spam a lot. We have very sophisticated logic for logging in recovery to avoid message spamming. Let's just return the err and log the message iff is_step_empty

Comment thread vshard/storage/init.lua
err = lerror.make(string.format(
'Cluster is invalid. bucket %s can\'t have generation %s',
remote_bucket.id, remote_bucket.generation))
limiter:log_error(err, err.message)
Collaborator:

Maybe just an assertion instead of the log message?

@Serpentian assigned mrForza and unassigned Serpentian on Apr 30, 2026
Successfully merging this pull request may close these issues:

Stray TCP message with big delay may duplicate a bucket (#214)

3 participants