Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsynchronized cross-shard memory operations caused by incorrectly used updateable_value #7310

Closed
piodul opened this issue Sep 30, 2020 · 5 comments
Assignees

Comments

@piodul
Copy link
Contributor

piodul commented Sep 30, 2020

The bug was introduced in 4b856cf.

In transport/controller.cc, there is a function named controller::do_start_server() which creates a cql_server service. That function runs on shard 0, where it creates a cql_server_config object. This object is then used as an argument to start the cql_server service:

cserver->start(std::ref(cql3::get_query_processor()), std::ref(_auth_service), std::ref(_mnotifier), cql_server_config).get();

Calling distributed<cql_server>::start means that the configuration is copied to all shards and used as an argument for cql_server's constructor.

The commit mentioned at the beginning converted one of cql_server_config's parameters into an updateable_value. The updateable_value is an object which tracks the value of updateable_value_source (which can be a config parameter, for example) and is updated with source's value when it changes.

The problem is that the updateable_values sent to other shards are still associated with the source on shard 0. Because they are tracked through a std::vector of pointers which is a member of the source, it can cause unsynchronized cross-shard reads and writes which is unsafe and may lead to strange bugs.

The implementation of updateable_value is definitely not safe to have cross-shard references like that - unlike what the comment above updateable_value's definition would suggest:

// A T that can be updated at runtime; uses updateable_value_base to track
// the source as the object is moved or copied. Copying across shards is supported.
@piodul piodul self-assigned this Sep 30, 2020
piodul added a commit to piodul/scylla that referenced this issue Sep 30, 2020
Recently, the cql_server_config::max_concurrent_requests field was
changed to be an updateable_value, so that it is updated when the
corresponding option in Scylla's configuration is live-reloaded.
Unfortunately, due to how cql_server is constructed, this caused
cql_server instances on all shards to store an updateable_value which
pointed to an updateable_value_source on shard 0. Unsynchronized
cross-shard memory operations ensue.

The fix changes the cql_server_config so that it holds a function which
creates an updateable_value appropriate for the given shard. This
pattern is similar to another, already existing option in the config:
get_service_memory_limiter_semaphore.

This fix can be reverted if updateable_value becomes safe to use across
shards.

Tests: unit(dev)

Fixes: scylladb#7310
@piodul
Copy link
Contributor Author

piodul commented Sep 30, 2020

@avikivity About the comment above updateable_value... It certainly is misleading. Should we change the comment so that it doesn't suggest that it's safe to copy across shards, or rather change the implementation so that the comment is true?

@psarna
Copy link
Contributor

psarna commented Oct 1, 2020

There's also another series of mine, already queued in next, which adds the same mistake to alternator. I'll temporarily dequeue it.

@avikivity
Copy link
Member

@avikivity About the comment above updateable_value... It certainly is misleading. Should we change the comment so that it doesn't suggest that it's safe to copy across shards, or rather change the implementation so that the comment is true?

updatable_value was explicitly written to support this use case (I know the author personally). It should be fixed, so it is easy and cheap to pass live configuration values across shards.

piodul added a commit to piodul/scylla that referenced this issue Oct 1, 2020
Recently, the cql_server_config::max_concurrent_requests field was
changed to be an updateable_value, so that it is updated when the
corresponding option in Scylla's configuration is live-reloaded.
Unfortunately, due to how cql_server is constructed, this caused
cql_server instances on all shards to store an updateable_value which
pointed to an updateable_value_source on shard 0. Unsynchronized
cross-shard memory operations ensue.

The fix changes the cql_server_config so that it holds a function which
creates an updateable_value appropriate for the given shard. This
pattern is similar to another, already existing option in the config:
get_service_memory_limiter_semaphore.

This fix can be reverted if updateable_value becomes safe to use across
shards.

Tests: unit(dev)

Fixes: scylladb#7310
@piodul
Copy link
Contributor Author

piodul commented Oct 1, 2020

updatable_value was explicitly written to support this use case (I know the author personally). It should be fixed, so it is easy and cheap to pass live configuration values across shards.

@avikivity I made an issue: #7316

@avikivity
Copy link
Member

Not in any release branch, so no backports needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants