New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternator: concurrent TagResource from multiple nodes may race #6389
Comments
The "current compatibility with DynamoDB" section in alternator.md is where we should list very briefly our state of compatibility - it's not the right place to explain implementation details or track obscure bugs. I've significantly shortened the "Tags" section because, in brief, we do fully support tags and should say that we do. I moved the two bugs mentioned in the text into the bug tracker: Refs #6389 Refs #6391 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200507125022.22608-1-nyh@scylladb.com>
This problem will be solved automatically when we switch to using Raft for schema operations. |
At this point, most Alternator table operations like But this issue still exists because in our The last "writes them back" step is done in a function |
This patch adds an Alternator test reproducing issue scylladb#6389 - that concurrent TagResource and/or UntagResource operations was broken and some of the concurrent modifications were lost. The test has two threads, one loops adds and removes a tag A, the other adds and removes a tag B. After we add tag A, we expect tag A to be there - but due to issue scylladb#6389 this modification was sometimes lost when it raced with an operation on B. This test consistently failed before issue scylladb#6389 was fixed, and passes now after the issue was fixed by the previous patches. The bug reproduces by chance, so it requires a fairly long loop (a few seconds) to be sure it reproduces - so is marked a "veryslow" test and will not run in CI, but can be used to manually reproduce this issue with: test/alternator/run --runveryslow test_tag.py::test_concurrent_tag Refs scylladb#6389. Signed-off-by: Nadav Har'El <nyh@scylladb.com>
… from Nadav Har'El Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed. The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected. Fixes #6389. Closes #13150 * github.com:scylladb/scylladb: test/alternator: test concurrent TagResource / UntagResource db/tags: drop unsafe update_tags() utility function alternator: isolate concurrent modification to tags db/tags: add safe modify_tags() utility functions migration_manager: expose access to storage_proxy
Alternator modifies tags in three operations - TagResource, UntagResource and UpdateTimeToLive (the latter uses a tag to store the TTL configuration). All three operations were implemented by three separate steps: 1. Read the current tags. 2. Modify the tags according to the desired operation. 3. Write the modified tags back with update_tags(). This implementation was not safe for concurrent operations - some modifications may be be lost. We fix this in this patch by using the new modify_tags() function introduced in the previous patch, which performs all three steps under one lock so the tag operations are serialized and correctly isolated. Fixes #6389 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit c196bd7)
Backported to 5.2, 5.4 already has it. |
Backport to 5.2. dequeued, breaks the build. @nyh please open a backport PR. |
Doing the backport to 5.2 now. |
… from Nadav Har'El Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed. The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected. Fixes scylladb#6389. Closes scylladb#13150 * github.com:scylladb/scylladb: test/alternator: test concurrent TagResource / UntagResource db/tags: drop unsafe update_tags() utility function alternator: isolate concurrent modification to tags db/tags: add safe modify_tags() utility functions migration_manager: expose access to storage_proxy (cherry picked from commit dba1d36)
Created backport PR for 5.2 in #16453 |
… from Nadav Har'El Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed. The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected. Fixes #6389. Closes #13150 * github.com:scylladb/scylladb: test/alternator: test concurrent TagResource / UntagResource db/tags: drop unsafe update_tags() utility function alternator: isolate concurrent modification to tags db/tags: add safe modify_tags() utility functions migration_manager: expose access to storage_proxy (cherry picked from commit dba1d36) Closes #16453
This was queued, so I'm removing the backport labels. |
This patch adds an Alternator test reproducing issue scylladb#6389 - that concurrent TagResource and/or UntagResource operations was broken and some of the concurrent modifications were lost. The test has two threads, one loops adds and removes a tag A, the other adds and removes a tag B. After we add tag A, we expect tag A to be there - but due to issue scylladb#6389 this modification was sometimes lost when it raced with an operation on B. This test consistently failed before issue scylladb#6389 was fixed, and passes now after the issue was fixed by the previous patches. The bug reproduces by chance, so it requires a fairly long loop (a few seconds) to be sure it reproduces - so is marked a "veryslow" test and will not run in CI, but can be used to manually reproduce this issue with: test/alternator/run --runveryslow test_tag.py::test_concurrent_tag Refs scylladb#6389. Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Tags are stored in a schema table (
system_schema.tables.extensions['tags']
).This means means that concurrent adding of tags for a single table on more than a single node may result in a race, until Scylla schema agreement is reimplemented to avoid them, or we use some other mechanism to serialize the operation.
The text was updated successfully, but these errors were encountered: