cross-shard-barrier: Capture shared barrier in complete #11553

Closed

Conversation

xemul (Contributor) commented Sep 15, 2022

When a cross-shard barrier is abort()-ed, it spawns a background fiber that wakes up the other shards (if they are sleeping) with an exception.

This fiber is implicitly waited for by the owning sharded service's .stop(), because the barrier is typically used like this:

    sharded<service> s;
    co_await s.invoke_on_all([] {
        ...
        barrier.abort();
    });
    ...
    co_await s.stop();

If an abort happens, invoke_on_all() resolves only after it has queued the waking lambdas into the smp queues, so the subsequent stop() queues its stopping lambdas after the barrier's ones.

However, in debug mode the queues can be shuffled, so the owning service can suddenly be freed from under the barrier's feet, causing a use-after-free. Fortunately, this is easily fixed by capturing a shared pointer to the shared barrier instead of a raw pointer to the shard-local barrier (see the sketch below).

fixes: #11303

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
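
A minimal standalone sketch of the lifetime issue and the fix, written in plain C++ instead of the Seastar/ScyllaDB primitives; all names here (shared_barrier_state, deferred_work, etc.) are hypothetical and do not reflect the actual cross-shard-barrier code. It illustrates the idea only: the deferred wake-up work captures the shared_ptr to the shared state, so that state stays alive even if the owning object is destroyed before the deferred work runs.

    #include <functional>
    #include <memory>
    #include <vector>

    // Hypothetical shared, cross-shard state of the barrier.
    struct shared_barrier_state {
        void complete_with_exception() {
            // In the real barrier this would wake the sleeping shards with an exception.
        }
    };

    // Stand-in for the smp queues: deferred work that may run after the owner is gone.
    std::vector<std::function<void()>> deferred_work;

    class barrier {
        std::shared_ptr<shared_barrier_state> _shared =
            std::make_shared<shared_barrier_state>();
    public:
        void abort() {
            // The fix: capture the shared_ptr itself, not `this` or a raw pointer
            // to the shard-local barrier, so the closure keeps the shared state
            // alive until it actually runs.
            deferred_work.push_back([shared = _shared] {
                shared->complete_with_exception();
            });
        }
    };

    int main() {
        {
            barrier b;
            b.abort();
        }   // the barrier (and the service that owned it) is already destroyed here
        for (auto& f : deferred_work) {
            f();    // still safe: the capture kept the shared state alive
        }
    }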

bhalevy (Member) left a comment:

lgtm

xemul added a commit that referenced this pull request Oct 3, 2022
Closes #11553
xemul deleted the br-cross-shard-barrier-debug-reshuffle-fix branch on April 5, 2024.