-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage_service, tablets: Fix corruption of tablet metadata on migration concurrent with table drop #15069
Conversation
CI state |
d0ce86a
to
359bdfd
Compare
CI state |
// We rely on the fact that should_preempt_balancing() does not release the guard | ||
// so that tablet metadata reading and updates are atomic. | ||
on_internal_error(slogger, "should_preempt_balancing() retook the guard"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we not give the move guard to should_preempt_balancing then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that would be better. That would involve more refactoring because get_node_to_work_on_opt() want to own the guard. I hoped to avoid this refactoring now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can schedule the refactoring for later (but add a fixme)
service/storage_service.cc
Outdated
@@ -1455,6 +1481,9 @@ class topology_coordinator { | |||
co_return; | |||
} | |||
|
|||
if (!guard) { | |||
guard = co_await start_operation(); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if (!guard) guard = looks like a recipe for things working accidentally, no?
We should have clearer transaction boundaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we always start a new guard here, things may work accidentally in the absence of concurrency, so I don't see that as a reason to avoid the optimization.
This if (!guard)
line marks the start of transaction. Reuse of the guard is an optimization in case it wasn't consumed by the execution of updates to avoid needless start_operation(), which is relatively expensive as it involves a raft read barrier. In this case, taking the guard is just to humor the exec_global_command()
API, there is no actual transaction going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment, or encapsulate in a commented start_or_adopt_transaction()
… concurrent with table drop Tablet migration may execute a global token metadata barrier before executing updates of system.tablets. If table is dropped while the barrier is happening, the updates will bring back rows for migrated tablets in a table which is no longer there. This will cause tablet metadata loading to fail with error: missing_column (missing column: tablet_count) Like in this log line: storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)œ") The fix is to read and execute the updates in a single group0 guard scope, and move execution of the barrier later. We cannot now generate updates in the same handle_tablet_migration() step if barrier needs to be executed, so we resuse the mechanism for two-step stage transition which we already have for handling of streaming. The next pass will notice that the barrier is not needed for a given tablet and will generate the stage update. Fixes scylladb#15061
359bdfd
to
6c2da62
Compare
In V2:
|
CI state |
… concurrent with table drop Tablet migration may execute a global token metadata barrier before executing updates of system.tablets. If table is dropped while the barrier is happening, the updates will bring back rows for migrated tablets in a table which is no longer there. This will cause tablet metadata loading to fail with error: missing_column (missing column: tablet_count) Like in this log line: storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)œ") The fix is to read and execute the updates in a single group0 guard scope, and move execution of the barrier later. We cannot now generate updates in the same handle_tablet_migration() step if barrier needs to be executed, so we resuse the mechanism for two-step stage transition which we already have for handling of streaming. The next pass will notice that the barrier is not needed for a given tablet and will generate the stage update. Fixes scylladb#15061 Closes scylladb#15069
… concurrent with table drop Tablet migration may execute a global token metadata barrier before executing updates of system.tablets. If table is dropped while the barrier is happening, the updates will bring back rows for migrated tablets in a table which is no longer there. This will cause tablet metadata loading to fail with error: missing_column (missing column: tablet_count) Like in this log line: storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)œ") The fix is to read and execute the updates in a single group0 guard scope, and move execution of the barrier later. We cannot now generate updates in the same handle_tablet_migration() step if barrier needs to be executed, so we resuse the mechanism for two-step stage transition which we already have for handling of streaming. The next pass will notice that the barrier is not needed for a given tablet and will generate the stage update. Fixes scylladb#15061 Closes scylladb#15069
@@ -53,6 +53,8 @@ public: | |||
// Until the upgrade procedure finishes, we will perform operations such as schema changes using the old way, | |||
// but still pass the guard around to synchronize operations with the upgrade procedure. | |||
bool with_raft() const; | |||
|
|||
explicit operator bool() const { return bool(_impl); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh. Making guard an optional object makes reasoning about it much harder.
Now whenever reviewing code that takes a guard I need to think --- but what if it's disengaged? Can it be disengaged?
Why not use std::optional<group0_guard>
instead? This basically imposes the complexity of reasoning with optional
s on every code path that uses group0_guard
, even when it's non-optional!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a big mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _impl
was done for pimplification and only for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, having group0_guard
object in hand should give a type-level guarantee that it's safe to do linearizable group 0 operation and concurrent modifications will be detected.
Using a guard after a move... is a use-after-move, it's a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the guard is still non-optional, the operator bool() checks if it was moved-from. Using a moved-from guard is still incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But calling public interface on a moved-from object is basically using that moved-from object
Maybe we could find better tools to deal with such cases, something that would protect us on type level...
// In order to keep the cluster saturated, ask the load balancer for more transitions. | ||
// Unless there is a pending topology change operation. | ||
auto ts = guard.write_timestamp(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With guard
basically being optional object now, even innocuous write_timestamp()
call may be unsafe :(
Tablet migration may execute a global token metadata barrier before executing updates of system.tablets. If table is dropped while the barrier is happening, the updates will bring back rows for migrated tablets in a table which is no longer there. This will cause tablet metadata loading to fail with error:
missing_column (missing column: tablet_count)
Like in this log line:
storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exceptionraft::state_machine_error (State machine error at raft/server.cc:1206): std::_Nested_exceptionstd::runtime_error (Failed to read tablet metadata): missing_column (missing column: tablet_count)œ")
The fix is to read and execute the updates in a single group0 guard scope, and move execution of the barrier later. We cannot now generate updates in the same handle_tablet_migration() step if barrier needs to be executed, so we resuse the mechanism for two-step stage transition which we already have for handling of streaming. The next pass will notice that the barrier is not needed for a given tablet and will generate the stage update.
Fixes #15061