heap-use-after-free hit in cross_shard_barrier_test #11303
Comments
@xemul I'm not sure if the bug is in the cross-shard barrier itself or in the test.
It looks like the test worker doesn't "close" the cross_shard_barrier before it's destroyed.
Yup, reproduced it with an artificial delay in abort's completion fiber.
Normally it doesn't happen, because the cross-shard barrier's completion invoke-on-all is queued before the worker's stop calls. But debug mode re-orders the queue.
When the cross-shard barrier is abort()-ed, it spawns a background fiber that wakes up other shards (if they are sleeping) with an exception. This fiber is implicitly waited on by the owning sharded service's .stop(), because barrier usage looks like this:

```cpp
sharded<service> s;
co_await s.invoke_on_all([] { ... barrier.abort(); });
...
co_await s.stop();
```

If abort happens, the invoke_on_all() only resolves _after_ it queues the waking lambdas into the smp queues, so the subsequent stop() queues its stopping lambdas after the barrier's. However, in debug mode the queue can be shuffled, so the owning service can suddenly be freed from under the barrier's feet, causing a use-after-free.

Fortunately, this can be easily fixed by capturing a shared pointer to the shared barrier instead of a regular pointer to the shard-local barrier.

fixes: scylladb#11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Not sure whether to backport or not. @xemul please backport or not (and explain why).
fixes: #11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11553
@xemul ^
Not needed:
Removing label per above. |
Seen in https://jenkins.scylladb.com/job/scylla-master/job/build/1132/artifact/testlog/aarch64/debug/cross_shard_barrier_test.2404.log
Scylla version 055340a