coredump during creating instance on azure on single node #10494
Comments
The crash is from the azure snitch.
The regression was probably introduced by the following commits. @xemul Can you please check?
I cannot find logs that show how the system behaved right before the crash. All I can see from them is that the snitch initialized successfully. @aleksbykov, please point me to the correct logs.
The core dump happens even before we start collecting any significant log data.
But Scylla itself logs somewhere on start.
Actually, the azure snitch code looks weird from the very beginning (e44fa8d):

```cpp
future<> azure_snitch::load_config() {
    ...
    _my_rack = azure_zone;
    _my_dc = azure_region;
    co_return co_await _my_distributed->invoke_on_all([this] (snitch_ptr& local_s) {
        if (this_shard_id() != io_cpu_id()) {
            local_s->set_my_dc(_my_dc);
            local_s->set_my_rack(_my_rack);
        }
    });
}

future<> azure_snitch::start() {
    return load_config().then(...);
}
```

It sets ...
IOW
@xemul - are you sending a patch to fix this? If we run a machine and give you access to it, would that help?
@slivne, I can prepare a patch, yes. I don't think I need a machine; it's a race plus a use-after-free that would be hard to trigger intentionally, from my POV.
All snitch drivers are supposed to gather snitch info on some shard and replicate the dc/rack info across the others; all but the azure one really do so. The azure one gets dc/rack on all shards, which is excessive but not terrible, but when all shards start to replicate their data to all the others, this may lead to use-after-frees.

fixes: scylladb#10494
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
@xemul ping
All snitch drivers are supposed to gather snitch info on some shard and replicate the dc/rack info across the others; all but the azure one really do so. The azure one gets dc/rack on all shards, which is excessive but not terrible, but when all shards start to replicate their data to all the others, this may lead to use-after-frees.

fixes: #10494
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c6d0bc8)
Backported to 4.6, 5.0.
Strangely, there is no "Backport candidate" label to remove.
Installation details
Kernel Version: 5.13.0-1022-azure
Scylla version (or git commit hash):
5.1.dev-20220504.b26a3da584cc
with build-id ab2a33a30756c1513f4c516cd272291e75acec0e
Cluster size: 6 nodes (Standard_L8s_v2)
Scylla Nodes used in this run:
OS / Image:
/subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/ScyllaDB-5.1.dev-0.20220504.b26a3da584cc-1-build-148
(azure: eastus)
Test:
longevity-10gb-3h-azure-test
Test id:
835fbc85-2bdf-46aa-a87d-04348bbbc1f8
Test name:
scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):
Issue description
Coredump happened while creating node2, even before cluster configuration:
$ hydra investigate show-monitor 835fbc85-2bdf-46aa-a87d-04348bbbc1f8
$ hydra investigate show-logs 835fbc85-2bdf-46aa-a87d-04348bbbc1f8
Logs:
Jenkins job URL