Coredump during nodetool refresh
#14475
Comments
couldn't add to the issue description, so here goes the coredump on node-11:
Something to do with metrics and allocation...?
@xemul can you please check this?
From node-9 logs:
The |
As if it's "correct" value of |
The
It can be an allocate-after-allocate-after-use-after-free.
node-11 also crashes on shard-2.
@fgelcer, mostly out of curiosity: is it possible to re-run the same test on a debug build of Scylla?
node-11 crashes in the same place, but the corruption pattern differs a bit:
Sure it is. Giving the job any AMI you want will do the trick. If you want to start an AMI with specific code too, that is also possible; we will just need to build an image using one of the Releng jobs. @yaronkaikov may be able to assist here.
Candidate for |
Could be. See https://github.com/gcc-mirror/gcc/blob/104b09005229ef48a79a33511ea192bb3ec3c415/libstdc%2B%2B-v3/libsupc%2B%2B/hash_bytes.cc . But it massages the bits in a slightly different way.
Root cause was already found at #14618.
This reverts commit 2a58b4a, reversing changes made to dd63169. After patch 87c8d63, table_resharding_compaction_task_impl::run() performs the forbidden action of copying a lw_shared_ptr (_owned_ranges_ptr) on a remote shard, which is a data race that can cause a use-after-free, typically manifesting as allocator corruption. Note: before the bad patch, this was avoided by copying the _contents_ of the lw_shared_ptr into a new, local lw_shared_ptr. Fixes scylladb#14475 Fixes scylladb#14618
table_resharding_compaction_task_impl::run() performs the forbidden action of copying a lw_shared_ptr (_owned_ranges_ptr) on a remote shard, which is a data race that can cause a use-after-free, typically manifesting as allocator corruption. Content of _owned_ranges_ptr is copied to local lw_shared_ptrs. Fixes scylladb#14475 Fixes scylladb#14618
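For readers unfamiliar with the pattern the commit message describes, here is a minimal sketch of it. It does not use Seastar: `local_shared_ptr`, `ranges`, and `copy_contents` are hypothetical stand-ins, assuming only that `lw_shared_ptr` keeps a non-atomic reference count that is safe to modify on the owning shard alone. Copying the pointer itself bumps that shared count, while copying the pointee into a fresh pointer leaves the original count untouched.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for a shard-local shared pointer: the reference
// count is a plain (non-atomic) integer, so copying or destroying a pointer
// is only safe on the thread that owns the object.
template <typename T>
class local_shared_ptr {
    struct control {
        T value;
        long refs;
    };
    control* _c = nullptr;

public:
    local_shared_ptr() = default;
    explicit local_shared_ptr(T v) : _c(new control{std::move(v), 1}) {}

    // Copying shares the control block and does an unsynchronized increment.
    // Doing this from another thread would race with the owner.
    local_shared_ptr(const local_shared_ptr& o) : _c(o._c) {
        if (_c) {
            ++_c->refs;
        }
    }
    local_shared_ptr& operator=(const local_shared_ptr&) = delete;

    ~local_shared_ptr() {
        if (_c && --_c->refs == 0) {
            delete _c;
        }
    }

    const T& operator*() const { return _c->value; }
    long use_count() const { return _c ? _c->refs : 0; }
};

// Illustrative payload type; the real owned-ranges type is not shown here.
using ranges = std::vector<std::pair<int, int>>;

// Copy the contents, not the pointer: the result has its own control block,
// so the original reference count is never touched by the caller.
local_shared_ptr<ranges> copy_contents(const local_shared_ptr<ranges>& remote) {
    return local_shared_ptr<ranges>(*remote);
}
```

With an owner holding some ranges, `copy_contents(owner)` yields a pointer whose `use_count()` is 1 while the owner's count also stays at 1, whereas `local_shared_ptr<ranges> alias(owner)` raises the owner's count to 2. The sketch only reads the pointee from the other side, which assumes the owner does not modify or free it while the copy is in progress.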
Fixed by a revert, not backporting.
Issue description
During a nemesis called `NodetoolRefresh` we had a coredump on 2 nodes. On node-9:
Impact
Scylla dumped core
How frequently does it reproduce?
First time seeing it; not sure yet how reproducible it is.
Installation details
Kernel Version: 5.15.0-1037-gcp
Scylla version (or git commit hash): `5.4.0~dev-20230702.1ab2bb69b8a6` with build-id `748a09783d18c97d903f62dcf4d96a6d9db4b527`
Cluster size: 6 nodes (n1-highmem-16)
Scylla Nodes used in this run:
OS / Image: `` (gce: undefined_region)
Test: `longevity-10gb-3h-gce-test`
Test id: `b455fe7f-1f0b-4034-9075-f6e5de0e7d86`
Test name: `scylla-master/longevity/longevity-10gb-3h-gce-test`
Test config file(s):
Logs and commands
$ hydra investigate show-monitor b455fe7f-1f0b-4034-9075-f6e5de0e7d86
$ hydra investigate show-logs b455fe7f-1f0b-4034-9075-f6e5de0e7d86
Logs:
Jenkins job URL
Argus