Topology change failure with concurrent index drop when repairing the dropped index #18452
@kbr-scylla - please take a look - I don't think it's raft/topology though? (please reassign of course if it isn't)
@aleksbykov @gleb-cloudius isn't this the issue that you discussed on the #raft Slack channel? I think you resolved it and declared it a non-issue?
Not sure. Here it looks like bootstrapping failed because streaming failed. And streaming failed because it was aborted.
@kbr-scylla, it is another issue. Here the decommission was expected to be aborted, but the bootstrap of the new node was expected to be successful, yet it failed.
So what happened is that decommission succeeded but bootstrap was aborted instead? Why do you think it is not a test bug?
@gleb-cloudius I checked the test log, and no actions were performed that could abort the bootstrap. It happened while waiting for the node to be bootstrapped.
This log message:
Says that the failure reason is
I rechecked the logs. I didn't find any actions in the tests which could abort the bootstrap. I can create a reproducer job with the 2 nemeses and try to reproduce the issue.
Yes please |
@aleksbykov any luck with the above? |
Similar failure: #18436
Also with concurrent index drop! Perhaps the index drop triggers abort somewhere somehow? |
Hypothesis: repair gets stuck on a Raft read barrier. Why does the barrier time out? Perhaps the schema fails to sync, and somehow the index drop caused the schema mismatch. EDIT: nope, that's not it, see below.
This may be a reincarnation of #15598 except we're getting abort_requested_exception instead of no_such_column_family for some reason |
Possible source of the abort_requested_exception:

```cpp
future<table_dropped> table_sync_and_check(replica::database& db, service::migration_manager& mm, const table_id& uuid) {
    if (mm.use_raft()) {
        abort_on_expiry aoe(lowres_clock::now() + std::chrono::seconds{10});
        auto& as = aoe.abort_source();
        auto sub = mm.get_abort_source().subscribe([&as] () noexcept {
            if (!as.abort_requested()) {
                as.request_abort();
            }
        });
        // Trigger read barrier to synchronize schema.
        co_await mm.get_group0_barrier().trigger(as);
    }
    co_return !db.column_family_exists(uuid);
}
```
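The snippet above subscribes to the migration manager's global abort source so that its abort is forwarded into a local, timeout-bounded source passed to the barrier; either cause then interrupts the wait. A minimal plain-C++ mock of that forwarding pattern (hypothetical `abort_source_mock` type, not the actual Seastar API):

```cpp
#include <functional>
#include <vector>

// Toy stand-in for an abort source: a flag plus a list of subscribers
// that are notified exactly once, when the abort is first requested.
struct abort_source_mock {
    bool aborted = false;
    std::vector<std::function<void()>> subs;

    void request_abort() {
        if (aborted) {
            return; // already aborted; subscribers already ran
        }
        aborted = true;
        for (auto& f : subs) {
            f();
        }
    }

    void subscribe(std::function<void()> f) {
        subs.push_back(std::move(f));
    }
};

// Wire parent -> child, mirroring the subscribe() call in the snippet:
// when the parent aborts, the child aborts too (unless it already has).
void link(abort_source_mock& parent, abort_source_mock& child) {
    parent.subscribe([&child] {
        if (!child.aborted) {
            child.request_abort();
        }
    });
}
```

With this wiring, aborting the parent (the global source) also fires the child that the barrier waits on, which is how the global shutdown/abort reaches the read barrier in the real code.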
Removed the 6.0 milestone; it's not critical and most likely a reincarnation of an existing issue which has existed for more than a year already (#15598 (comment)) -- apparently the fix simply changed one exception type to another.

In the meantime, I think we can learn something from this -- we could probably avoid some wasted time if we tested our bugfixes more thoroughly. Given that the issue was detected by SCT longevity tests, the fix should be verified by the same tests, unless we have a good consistent test.py test which can reproduce the conditions obtained during longevity.

I checked the PR (#17231), and it only has a test.py test running with RBNO disabled (while the previous reports I saw and the new ones used RBNO -- which is the default). And the reproducer is very artificial too: it doesn't really drop a table during decommission/repair, it injects a specific exception that might fly during a table drop. Of course it won't catch this new issue, because while the issue still happens due to a table drop during repair, the exception flying is different...

So suppose we have a longevity test which reproduces an issue in 60% of its runs.
Another report but during removenode: #18436 |
Instead of building a list of tables up-front, all streaming code should use a stable iterator over the table names. We need an ordered map in the schema cache to implement that - then we can use std::upper_bound with the previous table name to get to the next table. The table that is being streamed/repaired should be referenced/pinned while streaming is in progress. Then we won't have to deal with these exceptions, whatever they are.
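The ordered-map idea can be sketched like this (hypothetical `next_table` helper, a standalone illustration rather than Scylla code): instead of caching an iterator, which a concurrent DROP would invalidate, remember the last table name processed and look up the next strictly greater key each round:

```cpp
#include <map>
#include <optional>
#include <string>

// The int value stands in for per-table state in the schema cache.
using table_map = std::map<std::string, int>;

// Return the name of the next table to stream after `last`, or nullopt
// when iteration is done. Because we re-resolve by key via upper_bound
// rather than holding an iterator, tables may be dropped (erased) between
// calls without invalidating anything we keep.
std::optional<std::string> next_table(const table_map& tables,
                                      const std::optional<std::string>& last) {
    auto it = last ? tables.upper_bound(*last) : tables.begin();
    if (it == tables.end()) {
        return std::nullopt;
    }
    return it->first;
}
```

A dropped table is simply skipped: if `"b"` is erased after `"a"` was streamed, the next lookup from `"a"` lands on `"c"`.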
It doesn't work. We do want to allow table drop during repair/streaming, and it should abort the repair, without failing it. The people dropping tables and the people repairing are not necessarily the same (think of the cloud) so we cannot use mutual exclusion. |
What exactly doesn't work? Whom do you mean by "we" when writing "we do want"? Do you want to break an ongoing repair by dropping the table because it may get stuck due to data corruption or a bug? I find that a rather unorthodox way to solve the problem, if that's the case; maybe 1) fix the code so that we have no corruption, 2) introduce a separate admin command to stop an ongoing repair. Anyway, since you provided no explanation for your reasons, I am only left to guess.
AFAIK the product team.
There is no corruption. The worst that happens today is that repair fails because it steps on
This command has existed for ages, but it is of no use. Your
Hm, I think there could be a misunderstanding here. In @kostja's proposal there is no mention of mutual exclusion, @denesb. Pinning a table doesn't necessarily mean we would block
This is what we used to do (kind of); then we stopped doing it because doing work on behalf of a dropped table is wasted work. Also, "pinning" a table in such a way is potentially another can of worms (but maybe not), and it doesn't resolve races.
Indeed, @denesb , the original direction you spent so much time on is a dead end. REPAIR needs to coordinate with other DDL, and it's not just about DROP, it's also about CREATE - if CREATE starts and finishes in the middle of repair we also need to do something about it. Now that all topology and schema operations go through group 0 we can provide such coordination. But regardless of the coordination on the raft leader, there should be coordination on the nodes, which I write about. |
The first reproducer passed and the issue was not reproduced with repeated parallel operations CreateDropSI and decommission. It is possible that not all stages of the topology operation of the decommissioning node were randomly chosen. Created a new reproducer which goes through each topology stage (decommission: become non-voter, start streaming, leaving token ring; bootstrapping a new node).
There is no such coordination currently. For tablets, we plan to implement this, but not for vnodes. So as long as we support vnodes, we need to tolerate tables being yanked from under the feet of repair. |
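The tolerance described here could be sketched as follows (hypothetical names; a sketch of the policy, not Scylla's actual repair code): a table yanked out mid-repair counts as a benign skip, while any other error still fails the operation:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical marker exception thrown when a table is dropped while
// repair/streaming is working on it.
struct table_dropped_error : std::runtime_error {
    table_dropped_error() : std::runtime_error("table dropped") {}
};

// Repair a single table. Returns true when the table's repair is considered
// complete: either it was actually repaired, or it was dropped under our
// feet (wasted work at most, not a failure). Any other exception propagates
// and fails the overall repair.
template <typename RepairFn>
bool repair_one_table(const std::string& name, RepairFn do_repair) {
    try {
        do_repair(name);
        return true;
    } catch (const table_dropped_error&) {
        // Table vanished mid-repair: skip it and keep the repair successful.
        return true;
    }
}
```

This is the behavior the comment argues for on the vnodes path: the drop aborts work on that table without turning the whole repair into a failure.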
@aleksbykov - did you look at the results of the job above? |
I looked at them. The issue was not reproduced. Maybe the problem was that a segmentation fault periodically happened on several nodes during decommission.
This is a known issue, already fixed by #18816. |
Last seen in SCT tier1: 2024-06-08
Packages
Scylla version:
5.5.0~dev-20240408.c01b19fcb3fa
with build-id 772b12edbaea142ea2b9374cacfc409da5d3a8a5
Kernel Version:
5.15.0-1056-aws
Issue description
The job ran parallel schema changes and topology operations with a multi-DC cluster configuration. The two latest nemeses were
CreateSecondaryIndex and DecommissionStreamingErr.
CreateSecondaryIndex creates and then drops a secondary index; DecommissionStreamingErr starts a decommission and then aborts it with a node reboot.
While the secondary index was being created and then deleted, a decommission started on node4 and was supposed to be aborted after the log message
left token ring
. Because of operation delays, the decommission finished successfully. After that, the new node17 started to bootstrap, but it failed with the following errors:
node17
IP of node 17
And node12, which was the coordinator, reported this error:
Impact
Node bootstrap failed
How frequently does it reproduce?
rarely
Installation details
Cluster size: 12 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-0d3f6dae1fec302f5 ami-0de462d2b7fb7af71
(aws: undefined_region)
Test:
longevity-multidc-schema-topology-changes-12h-with-raft-test
Test id:
7c79549b-8cea-4a31-beb1-7ca409e7b83f
Test name:
scylla-master/raft/longevity-multidc-schema-topology-changes-12h-with-raft-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 7c79549b-8cea-4a31-beb1-7ca409e7b83f
$ hydra investigate show-logs 7c79549b-8cea-4a31-beb1-7ca409e7b83f
Logs:
Jenkins job URL
Argus