storage: introduce on_master_enable service #646
mrForza wants to merge 12 commits into tarantool:master
Conversation
force-pushed from be90d04 to 2f96b14
force-pushed from 83458b3 to 396cc20
force-pushed from 396cc20 to 78cf3e9
force-pushed from 78cf3e9 to 4ba14a1
force-pushed from c2ca05c to 2740f3a
force-pushed from 2fd0348 to f94aad3
Luatest 1.4.2 has broken compatibility with Tarantool 1.10. See tarantool/luatest#453 for details. Let's use 1.4.1 for now. NO_DOC=ci NO_TEST=ci
force-pushed from f94aad3 to 45fdc34
Serpentian left a comment
Thank you for the patch! I'm impressed with the speed at which you produce the patches. I've left some comments, the most critical ones:
- We lose data during recovery (last commit with recovery changes)
- We don't have an upgrade process (the commit with generation; I wonder how the upgrade tests pass)
- We must figure out what to do in the #646 (comment), but for that we need an opinion from @Gerold103
    end, {bids, uuid})

-   local msg = "Error during recovery of bucket"
+   local msg = "Error during recovery of bucket.*MISSING_MASTER"
Looks like this doesn't help with the flakiness fix; the catch_flaky and 1.10 workflows fail, let's try to figure this out. And why did you decide to fix it in the first place? AFAICS, it's not changed in your patchset.
It is fixed now, 100%
That solution with suffixes looks very complicated. Why doesn't hardcoding the error.code help? For me the test is 100% stable with the following patch. Is it for you?
Details
diff --git a/test/storage-luatest/log_verbosity_2_2_test.lua b/test/storage-luatest/log_verbosity_2_2_test.lua
index 846d307..1e8a97a 100644
--- a/test/storage-luatest/log_verbosity_2_2_test.lua
+++ b/test/storage-luatest/log_verbosity_2_2_test.lua
@@ -59,7 +59,7 @@ test_group.test_recovery_do_not_spam_same_errors = function(g)
         box.commit()
     end, {bids, uuid})
-    local msg = "Error during recovery of bucket"
+    local msg = "Error during recovery of bucket.*MISSING_MASTER"
     g.replica_1_a:wait_log_exactly_once(msg, {timeout = 0.1,
                                               on_yield = function()
         ivshard.storage.recovery_wakeup()
@@ -118,7 +118,7 @@ test_group.test_rebalancer_do_not_spam_same_errors = function(g)
             error('TimedOut')
         end
     end)
-    local msg = "Error during downloading rebalancer states"
+    local msg = "Error during downloading rebalancer states.*TimedOut"
     g.replica_1_a:wait_log_exactly_once(msg, {timeout = 0.1,
         on_yield = function() ivshard.storage.rebalancer_wakeup() end})
     g.replica_2_a:exec(function()

    local function bucket_recovery_continue()
        vshard.storage.internal.errinj.ERRINJ_RECOVERY_PAUSE = false
-       vshard.storage.garbage_collector_wakeup()
+       vshard.storage.recovery_wakeup()
And it's not refactoring, as stated in the commit message. We're fixing the test; refactoring should not change the behavior.
}

local function recover_bucket_to_state(recovered_buckets, bucket_id, state)
    if state == BACTIVE then
This one is a good commit (with refactoring of the refactoring), I'll take it into my patchset in order to simplify the merging of our fixes in the future, since I'm changing these lines too. And I'll refactor it a little bit to better fit there.
 -- where the error happened.
 --
-local function replicaset_map_call(replicaset, func, args, opts)
+local function replicas_map_send_and_wait(replicas, func, args, opts)
I agree with Vlad here. replicas_map_call is the better name for the function. Maybe even just map_call. You still can't reuse it or cluster_map_call in map_callrw on router?
force-pushed from c894811 to 5b5f664
This patch fixes the flakiness of the `log_verbosity_2_2_test`. In vshard some logs have a composite structure, for example "Error during recovery: <ERR>", where ERR can change over time and have a different type. In rare situations this can lead to bugs when we use the `wait_log_exactly_once` function, because it ignores the volatile part of the log. To correctly determine whether a log with a composite structure was duplicated, we need to compare not only its common (unchanging) part, but also its volatile part. For this, in `wait_log_exactly_once` we compare not only the common substrings of logs, but also their suffixes, which in some scenarios can change. The suffix comparison applies only when the function is used with a ".*" pattern at the end of the log string; otherwise its behaviour is unchanged. Needed for tarantool#214 NO_DOC=test
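For illustration, this is how a test opts into the suffix comparison (taken from the test diff shown above; only the trailing ".*<suffix>" part changes the helper's behaviour):

    local msg = "Error during recovery of bucket.*MISSING_MASTER"
    g.replica_1_a:wait_log_exactly_once(msg, {timeout = 0.1, on_yield = function()
        ivshard.storage.recovery_wakeup()
    end})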
Before this patch we woke up the GC service instead of the recovery service in `bucket_recovery_pause`. It could lead to longer test execution times. Now we fix it by changing `garbage_collector_wakeup` to `recovery_wakeup`. Needed for tarantool#214 NO_DOC=test
Before this patch the `bucket_move` and `bucket_wait_transfer` helper functions were used only in `storage_1_1_1_test`. However, these helpers can also be useful in future patches (e.g. in tarantoolgh-214). This patch moves `bucket_move` and `bucket_wait_transfer` into the `vtest` module so that we can use them in other tests. Needed for tarantool#214 NO_TEST=test NO_DOC=test
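A hedged sketch of how a test could call the moved helpers after this change (the require path and exact signatures are assumptions, not taken from the patch):

    -- Hypothetical path to the shared test helpers module.
    local vtest = require('test.luatest_helpers.vtest')

    -- Move a bucket and wait until the transfer settles; argument
    -- names here are illustrative only.
    vtest.bucket_move(g.replica_1_a, bucket_id, dest_replicaset_id)
    vtest.bucket_wait_transfer(g.replica_1_a, bucket_id)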
Before this patch we compared vclocks only in the `wait_lsn` function in the storage module. However, in future patches (e.g. tarantoolgh-214) we will need to do this in tests as well. Also, in tarantoolgh-214 we will use very similar logic for waiting on vclocks but with the opposite sign (all vclock components of the current storage should be "greater or equal" to the components of the replicas' vclocks, instead of "less or equal"). To avoid code duplication we unify the vclock comparison process and transform `vclock_lesseq` into a more general `vclock_compare` function, which allows us to compare vclocks with different comparators. We move this function into the `util` vshard module. Needed for tarantool#214 NO_DOC=internal
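A minimal sketch of the comparator-based helper described above (the real code in vshard.util may differ in details; as noted in the review below, the 0th, instance-local component is skipped):

    -- Returns true iff comparator(c1, c2) holds for every replicated
    -- vclock component of vclock1 against vclock2.
    local function vclock_compare(vclock1, vclock2, comparator)
        for id, lsn in pairs(vclock1) do
            -- Component 0 is the local LSN, which is not replicated.
            if id ~= 0 and not comparator(lsn, vclock2[id]) then
                return false
            end
        end
        return true
    end

    -- vclock_lesseq then becomes a thin wrapper, which the reviewer
    -- suggests keeping for readability:
    local function vclock_lesseq(vc1, vc2)
        return vclock_compare(vc1, vc2,
                              function(c1, c2) return c1 <= (c2 or 0) end)
    end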
Before this patch the recovery service called the functions which determine whether a bucket can be recovered one by one in `recovery_step_by_type`. Since in further patches the number of bucket statuses will grow and the logic of the recovery service will become more complex, this could degrade codebase readability. In order to fix it we decided to: 1) Use for-loop based checking of buckets' recoverability instead of checking them one by one. 2) Join the logic of saving recovered bucket ids and changing the `_bucket` space into one separate function - `recover_bucket_to_state`. Needed for tarantool#214 NO_TEST=refactoring NO_DOC=refactoring
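A hedged sketch of what joining those two responsibilities could look like (the name and signature come from the review snippets; the body is an assumption, and field 2 holding the status follows the classic three-field `_bucket` format):

    local function recover_bucket_to_state(recovered_buckets, bucket_id, state)
        -- Persist the new status in the _bucket space...
        box.space._bucket:update({bucket_id}, {{'=', 2, state}})
        -- ...and remember the id so the caller can report all recovered
        -- buckets in one batch.
        table.insert(recovered_buckets, bucket_id)
    end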
In this patch we extract the block of code responsible for sending asynchronous requests and waiting for the corresponding responses from `replicaset_map_call` into a separate function - `replicas_map_call`. This is done because in the next patch - tarantoolgh-214 - we will need a function which performs map-reduce across the masters of the cluster, and `replicas_map_call` can be reused in it. Needed for tarantool#214 NO_TEST=refactoring NO_DOC=refactoring
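A rough sketch of the extracted send-and-wait step, assuming plain net.box async calls (the real vshard code adds error wrapping and per-replica bookkeeping):

    local function replicas_map_call(replicas, func, args, opts)
        local futures = {}
        -- Fan out asynchronous requests to every replica...
        for id, replica in pairs(replicas) do
            futures[id] = replica.conn:call(func, args, {is_async = true})
        end
        -- ...then wait for and collect the responses.
        local results = {}
        for id, future in pairs(futures) do
            local res, err = future:wait_result(opts and opts.timeout)
            if err then
                return nil, err, id
            end
            results[id] = res
        end
        return results
    end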
In this patch we modify the API of the `wait_masters_connect` function by changing the `timeout` parameter into an `opts` parameter with a `timeout` key. This makes it easier to pass extra parameters to this function (e.g. `except` in tarantoolgh-214). Needed for tarantool#214 NO_TEST=refactoring NO_DOC=refactoring
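For illustration, the call-site change could look like this (the `except` key is the planned gh-214 extra parameter; treat both signatures as assumptions):

    -- Before: a bare timeout argument.
    wait_masters_connect(replicasets, timeout)
    -- After: an options table leaves room for extra keys.
    wait_masters_connect(replicasets, {timeout = timeout, except = except_id})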
Before this patch the `rebalancer` and `recovery` services could start right after a master switch (by `auto` master detection or manual reconfiguration), before the master had time to sync its vclock with the other replicas in the replicaset. The newly elected master may not have received the `_bucket` space updates yet, and when the recovery service starts on another node it can mistakenly think that there is no bucket in this replicaset. This issue, along with others that will be fixed by further patches, leads to doubled buckets in the cluster. To fix it we introduce a new storage service - the `on_master_enable` service. If the master is changed in a replicaset, this service is triggered and waits until the newly elected master syncs its vclock with the other replicas. The other storage services - `rebalancer` and `recovery` - can't start until `on_master_enable` sets `M.is_bucket_in_sync`. We also change `storage/storage.test`, `storage/recovery.test`, `storage-luatest/log_verbosity_2_2_test` and `router/router.test` so that they don't fail: now the `rebalancer` and `recovery` services don't start immediately after a master switch, which can shake some tests. Part of tarantool#214 NO_TEST=bugfix NO_DOC=bugfix
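A condensed, hypothetical sketch of the gating idea (the real service loops in vshard.storage also handle reload and reconfiguration; `storage_sync_with_replicas` is a made-up stand-in for the actual sync check):

    local fiber = require('fiber')

    -- on_master_enable: runs once the node becomes master and flips the
    -- flag only after all replicas' _bucket updates have been applied.
    local function on_master_enable_f()
        while not M.is_bucket_in_sync do
            if storage_sync_with_replicas() then -- hypothetical helper
                M.is_bucket_in_sync = true
            else
                fiber.sleep(0.1)
            end
        end
    end

    -- recovery/rebalancer iterations then begin with a guard like:
    -- if not M.is_bucket_in_sync then return end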
Before this patch the recovery service decided whether a bucket should be recovered based only on its status on the sender and receiver nodes and the `rebalancer_transfering_buckets` system table. This approach could lead to doubled buckets when the recovery service didn't find a remote bucket on the destination node and recovered the local bucket into the "active" state. In order to partially fix this issue we extend the `_bucket` system space by adding a new field - `opts` of map type. We make it nullable for backward compatibility with old vshard versions and add one table key - `generation`. When a bucket is transferred from one node to another, `bucket_send` increments the generation of the bucket by 1, and `bucket_recv` persists it. Part of tarantool#214 NO_TEST=internal NO_DOC=internal
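A sketch of how the extended format could look (the field layout is an assumption based on the commit message and the classic three-field `_bucket` format):

    box.space._bucket:format({
        {name = 'id', type = 'unsigned'},
        {name = 'status', type = 'string'},
        {name = 'destination', type = 'string', is_nullable = true},
        -- New nullable field; old vshard versions simply never set it.
        {name = 'opts', type = 'map', is_nullable = true},
    })
    -- bucket_send would then bump the counter on transfer, e.g. writing
    -- {generation = (old_generation or 0) + 1} into opts.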
In the patch tarantoolgh-214 the storage needs to perform a cluster map-reduce across the masters of the replicasets. To make this possible we introduce a new function in the `replicaset` module - `cluster_map_master_call`. This function waits until all the given masters are connected, then creates a table of them (except one master which shouldn't be used in the map-reduce) which is passed to `replicas_map_send_and_wait`. It is crucial to maintain the `except` field because some storage-related services need to perform a map-reduce excluding their own storage instance. Part of tarantool#214 NO_TEST=internal NO_DOC=internal
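A rough sketch assembled from the names in this thread (`wait_masters_connect` with an opts table, `replicas_map_send_and_wait`); the exact control flow is an assumption:

    local function cluster_map_master_call(replicasets, func, args, opts)
        -- Wait for all the relevant masters to be connected first.
        wait_masters_connect(replicasets, {timeout = opts and opts.timeout,
                                           except = opts and opts.except})
        -- Build the table of masters, skipping the caller's own
        -- replicaset when `except` is given.
        local masters = {}
        for id, rs in pairs(replicasets) do
            if not (opts and opts.except == id) then
                masters[id] = rs.master
            end
        end
        return replicas_map_send_and_wait(masters, func, args, opts)
    end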
In this main part of the tarantoolgh-214 patch we modify the `recovery_step_by_type` function to make it able to notice a missed bucket (a bucket which is not found on the destination node). Now the recovery service works as follows: 1) If the remote bucket is not present on the destination node, the recovery service searches for it across all the masters except its own storage. If during the search some storage responds with an error, the recovery service goes to the next iteration. If the bucket is found and has a greater generation, the local bucket can be restored to the `garbage` state. Otherwise, if the bucket is not found on any of the other masters, it is restored to the `active` state. 2) If the remote bucket is present on the destination node and has a greater generation, the local bucket is restored to the `garbage` state. 3) In other cases we use the old logic of the recovery service. Closes tarantool#214 NO_DOC=bugfix
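A simplified decision sketch of the three cases above (`search_on_other_masters` and `old_recovery_step` are hypothetical stand-ins for the real helpers; the actual code is woven into `recovery_step_by_type`):

    local function recovery_decide(local_bucket, remote_bucket)
        if remote_bucket == nil then
            -- 1) Not on the destination node: search the other masters.
            local found, err = search_on_other_masters(local_bucket.id)
            if err then
                return nil -- some master failed: retry next iteration
            end
            if found then
                if found.generation > local_bucket.generation then
                    return 'garbage'
                end
                return nil -- found, but not newer: left to the real code
            end
            return 'active' -- nowhere else in the cluster: safe to keep
        elseif remote_bucket.generation > local_bucket.generation then
            -- 2) The destination copy is newer than the local one.
            return 'garbage'
        end
        -- 3) Otherwise the old status-based recovery logic applies.
        return old_recovery_step(local_bucket, remote_bucket)
    end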
force-pushed from 5b5f664 to 5d3bd46
Serpentian left a comment
I didn't review the last 3 commits yet. A little bit later)
--------------------------------------------------------------------------------

-- Vclock comparing function
local function vclock_lesseq(vc1, vc2)
Commit: util: introduce vclock_compare
Nit: I'd consider keeping the function vclock_lesseq, which from now on will use util.vclock_compare. It'll be way easier to read in the functions where it's used (e.g. wait_lsn); comparators are hard to parse, IMHO.
-- superior to replicas' vclock components.
--
local function storage_wait_vclock_replicated(timeout, interval)
    local comparator = function(c1, c2) return c1 <= (c2 or 0) end
We actually go through vclock1 with a loop, so c1 cannot be nil; it doesn't depend on the local or remote vclock (moreover, we skip the 0 component in vclock_compare). But I don't mind adding the check for nil for `c1` just for consistency, even though it's excessive.
-- there was found a bug, and reload fixes it.
while module_version == M.module_version do
    service:next_iter()
    if storage_check_bucket_is_synced(service, limiter) then
Let's place the check for sync in the recovery service below the error injection. It's impossible to pause the recovery now if the buckets are not in sync.
@@ -1215,6 +1248,21 @@ local function sync(timeout)
     return wait_lsn(timeout or M.sync_timeout, 0.001)
 end
Commit: storage: introduce on_master_enable service
Let's also block the gc service until the bucket is synchronized. The motivation: the old leader could have already deleted the bucket, the new one will do the same, and we'll get a replication conflict.
local function master_on_disable()
    log.info("Stepping down from the master role")
    M.is_master = false
    M.is_bucket_in_sync = false
Commit: storage: introduce on_master_enable service
Let's place the is_bucket_in_sync in the master_role_update, where the service is started.
And we have a problem now:
- Service starts, calls the function on replicas
- Node stops being a master
- Node becomes master once again
- The service still works and waits for old vclocks
    'vshard.storage._call',
    {'storage_wait_vclock_persisted', wait_interval}, opts)
if err then
    log.warn('Error during on_master_enable: %s ' ..
Nit: use a log limiter, it may spam a lot
Serpentian left a comment
We're moving fast! Great work! The patchset looks way better now, we just need a little bit more polishing
test_group.after_all(function(g)
    g.cluster:drop()
    g.cluster:stop()
-- Version 0.1.40.0
--------------------------------------------------------------------------------
local version_0_1_40_0 = table.deepcopy(version_0_1_16_0)
version_0_1_40_0.version = '0.1.40.0'
Should be 41, the 40th release doesn't need these changes
local _bucket = box.space._bucket
local bucket_space_id = _bucket.id
local bucket_space_format = box.space._space:get(bucket_space_id).format
require('vshard.util').table_extend(bucket_space_format,
Nit: Let's not inline the requires. Moreover, we already have lutil in this file.
--
local function storage_wait_vclock_persisted(wait_interval)
    -- A guarantee that no new transactions will be written to this storage.
    while M.is_master or rebalancing_is_in_progress() do
Let's also wait for the recovery and gc services to stop, to be sure that none of the internal services can change the bucket space.
        res_bucket_info.id, res_id)
    return res_id, res_bucket_info
end
return M.this_replicaset.id, nil
Firstly, no need to return something and nil at the end. Secondly, why do we need to return that M.this_replicaset? It's obvious that the current one has it, since we're trying to recover it. Let's just return nil; returning M.this_replicaset complicates the code without any reason, IMHO.
    M.replicasets, 'vshard.storage._call',
    {'recovery_bucket_stat', bucket_id}, opts)
if err then
    limiter:log_error(err, 'Error %s during scanning the master: %s',
This may spam a lot. We have very sophisticated logic for logging in recovery to avoid message spamming. Let's just return the err and log the message iff is_step_empty
err = lerror.make(string.format(
    'Cluster is invalid. bucket %s can\'t have generation %s',
    remote_bucket.id, remote_bucket.generation))
limiter:log_error(err, err.message)
Maybe just an assertion instead of the log message?
Before this patch the `rebalancer` and `recovery` services could start right after a master switch (by `auto` master detection or manual reconfiguration), before the master had time to sync its vclock with the other replicas in the replicaset. It could lead to doubled buckets according to the "Doubled buckets RFC".
To fix it we introduce a new storage service - `on_master_enable`. If the master is changed in a replicaset, this service is triggered and waits until the newly elected master syncs its vclock with the other replicas. The other storage services - `rebalancer` and `recovery` - can't start until `on_master_enable` sets `M.buckets_are_in_sync`.
Closes #214
NO_TEST=bugfix
NO_DOC=bugfix