cross-shard-barrier: Capture shared barrier in complete #11553

Closed

Conversation

xemul (Contributor) commented Sep 15, 2022

When a cross-shard barrier is abort()-ed, it spawns a background fiber that wakes up the other shards (if they are sleeping) with an exception.

This fiber is implicitly waited for by the owning sharded service's .stop(), because the barrier is typically used like this:

    sharded<service> s;
    co_await s.invoke_on_all([] {
        ...
        barrier.abort();
    });
    ...
    co_await s.stop();

If an abort happens, invoke_on_all() resolves only after it has queued the waking lambdas into the smp queues, so the subsequent stop() queues its stopping lambdas after the barrier's ones.

However, in debug mode the queues can be shuffled, so the owning service can suddenly be freed from under the barrier's feet, causing a use-after-free. Fortunately, this is easily fixed by capturing a shared pointer to the shared barrier instead of a raw pointer to the shard-local barrier (see the sketch below).

fixes: #11303

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
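
A minimal standalone sketch of the lifetime issue and the fix, written in plain C++ instead of the Seastar/ScyllaDB primitives; all names here (shared_barrier_state, deferred_work, etc.) are hypothetical and do not reflect the actual cross-shard-barrier code. It illustrates the idea only: the deferred wake-up work captures the shared_ptr to the shared state, so that state stays alive even if the owning object is destroyed before the deferred work runs.

    #include <functional>
    #include <memory>
    #include <vector>

    // Hypothetical shared, cross-shard state of the barrier.
    struct shared_barrier_state {
        void complete_with_exception() {
            // In the real barrier this would wake the sleeping shards with an exception.
        }
    };

    // Stand-in for the smp queues: deferred work that may run after the owner is gone.
    std::vector<std::function<void()>> deferred_work;

    class barrier {
        std::shared_ptr<shared_barrier_state> _shared =
            std::make_shared<shared_barrier_state>();
    public:
        void abort() {
            // The fix: capture the shared_ptr itself, not `this` or a raw pointer
            // to the shard-local barrier, so the closure keeps the shared state
            // alive until it actually runs.
            deferred_work.push_back([shared = _shared] {
                shared->complete_with_exception();
            });
        }
    };

    int main() {
        {
            barrier b;
            b.abort();
        }   // the barrier (and the service that owned it) is already destroyed here
        for (auto& f : deferred_work) {
            f();    // still safe: the capture kept the shared state alive
        }
    }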

bhalevy (Member) left a comment:

lgtm

xemul added a commit that referenced this pull request Oct 3, 2022
Closes #11553
xemul deleted the br-cross-shard-barrier-debug-reshuffle-fix branch on April 5, 2024.