
storage_service, tablets: Fix corruption of tablet metadata on migration concurrent with table drop #15069

Closed

Conversation

tgrabiec (Contributor):

Tablet migration may execute a global token metadata barrier before executing updates of system.tablets. If the table is dropped while the barrier is in progress, the updates will bring back rows for the migrated tablets of a table which no longer exists. This will cause tablet metadata loading to fail with the error:

missing_column (missing column: tablet_count)

Like in this log line:

storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")

The fix is to read and execute the updates within a single group0 guard scope, and to move execution of the barrier later. We can no longer generate updates in the same handle_tablet_migration() step if a barrier needs to be executed, so we reuse the mechanism for two-step stage transition which we already have for handling of streaming. The next pass will notice that the barrier is not needed for a given tablet and will generate the stage update.

Fixes #15061
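
To make the race and the fix concrete, here is a minimal, self-contained model of the pattern described above. Everything in it (guard, coordinator, handle_tablet_migration, run_barrier) is an illustrative stand-in for the real Scylla machinery, not its actual API:

// A minimal model of the race and the fix (illustrative stand-ins, not Scylla code).
#include <cassert>
#include <vector>

struct guard {
    // Stands in for group0_guard: while it is held, no other group0 change
    // (such as a table drop) can be interleaved with our read and write.
    bool valid = true;
};

struct tablet_state {
    bool needs_barrier = false;   // a global token metadata barrier is still pending
    bool stage_advanced = false;  // the stage update was written to system.tablets
};

struct coordinator {
    std::vector<tablet_state> tablets;
    bool barrier_requested = false;

    // One pass of handle_tablet_migration(): the read of tablet metadata and
    // the generated updates stay under the SAME guard, so a concurrent table
    // drop is ordered either entirely before or entirely after this step.
    void handle_tablet_migration(guard& g) {
        assert(g.valid);
        for (auto& t : tablets) {
            if (t.needs_barrier) {
                // Do not write system.tablets now; only remember that a barrier
                // is required. The stage update is deferred to a later pass,
                // mirroring the two-step transition already used for streaming.
                barrier_requested = true;
            } else {
                t.stage_advanced = true;
            }
        }
    }

    // The barrier runs outside the read/update scope; it does not need to be
    // atomic with the read that requested it.
    void run_barrier() {
        for (auto& t : tablets) {
            t.needs_barrier = false;
        }
        barrier_requested = false;
    }
};

int main() {
    coordinator c;
    c.tablets = {{true, false}, {false, false}};

    guard g1;
    c.handle_tablet_migration(g1);   // pass 1: only the barrier-free tablet advances
    if (c.barrier_requested) {
        c.run_barrier();
    }

    guard g2;
    c.handle_tablet_migration(g2);   // pass 2: the remaining tablet advances its stage
    return 0;
}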

@tgrabiec tgrabiec changed the title storage_service, tablets: Fix corrupting tablet metadata on migration concurrent with table drop storage_service, tablets: Fix corruption of tablet metadata on migration concurrent with table drop Aug 16, 2023

// We rely on the fact that should_preempt_balancing() does not release the guard
// so that tablet metadata reading and updates are atomic.
on_internal_error(slogger, "should_preempt_balancing() retook the guard");
}
Member:

Shouldn't we avoid moving the guard into should_preempt_balancing() then?

tgrabiec (Contributor, Author):

Yes, I think that would be better. It would involve more refactoring, though, because get_node_to_work_on_opt() wants to own the guard. I was hoping to avoid that refactoring for now.

Member:

We can schedule the refactoring for later (but add a fixme)
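
For reference, a minimal sketch of the direction this suggests: pass the guard by reference so that should_preempt_balancing() cannot release or consume it, turning the "does not release the guard" guarantee into a property of the signature rather than a runtime check. The types and function shapes below are stand-ins, not the real declarations:

#include <cassert>
#include <utility>

struct group0_guard_model { bool held = true; };   // stand-in for group0_guard

// Current shape (simplified): the callee takes ownership of the guard and has
// to hand it back; the caller then verifies at runtime that it was not retaken.
group0_guard_model should_preempt_balancing_owning(group0_guard_model g) {
    // ... inspect topology state ...
    return g;
}

// Shape the suggestion points at: borrow the guard, so releasing or retaking it
// is impossible here and the guarantee lives in the signature.
bool should_preempt_balancing_borrowing(const group0_guard_model& g) {
    assert(g.held);
    // ... inspect topology state ...
    return false;
}

int main() {
    group0_guard_model g;
    bool preempt = should_preempt_balancing_borrowing(g);
    assert(g.held);                    // guard provably untouched by the call
    (void)preempt;
    g = should_preempt_balancing_owning(std::move(g));
    return 0;
}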

@@ -1455,6 +1481,9 @@ class topology_coordinator {
co_return;
}

if (!guard) {
guard = co_await start_operation();
}
Member:

This 'if (!guard) guard = ...' looks like a recipe for things working accidentally, no?

We should have clearer transaction boundaries.

tgrabiec (Contributor, Author):

Even if we always start a new guard here, things may work accidentally in the absence of concurrency, so I don't see that as a reason to avoid the optimization.

This if (!guard) line marks the start of a transaction. Reusing the guard is an optimization for the case where it wasn't consumed by the execution of updates, to avoid a needless start_operation(), which is relatively expensive as it involves a raft read barrier. In that case, taking the guard is just to humor the exec_global_command() API; there is no actual transaction going on.

Member:

Please add a comment, or encapsulate in a commented start_or_adopt_transaction()
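
One way such a start_or_adopt_transaction() helper could look, modeled here without Seastar coroutines. All names in this sketch are hypothetical (the real start_operation() would be co_await-ed); the point is only to give the "reuse the guard or start a fresh operation" boundary a name:

#include <cstdio>
#include <optional>
#include <utility>

struct group0_guard_model { int id = 0; };   // stand-in for group0_guard

static int next_id = 0;

group0_guard_model start_operation_model() {
    // In the real coordinator, start_operation() is relatively expensive:
    // it involves a raft read barrier.
    return group0_guard_model{++next_id};
}

// Reuse the guard if the previous step did not consume it; otherwise start a
// fresh operation. This names the transaction boundary instead of leaving an
// inline "if (!guard)" at the call site.
group0_guard_model start_or_adopt_transaction(std::optional<group0_guard_model>& guard) {
    if (!guard) {
        guard = start_operation_model();     // real code: co_await start_operation()
    }
    group0_guard_model g = std::move(*guard);
    guard.reset();
    return g;
}

int main() {
    std::optional<group0_guard_model> guard;          // consumed by an earlier update step
    auto g = start_or_adopt_transaction(guard);       // must start a new operation
    std::printf("using guard %d\n", g.id);
    return 0;
}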

tgrabiec (Contributor, Author):

In V2:

  • Added FIXME above should_preempt_balancing():
    // FIXME: Don't take the ownership of the guard to make the above guarantee explicit.
    
  • Documented guard re-taking before the barrier:
            // If has_updates is true then we have dropped the guard and need to re-obtain it.
            // It's fine to start an independent operation here. The barrier doesn't have to be executed
            // atomically with the read which set needs_barrier, because it's fine if the global barrier
            // works with a more recent set of nodes and it's fine if it propagates a more recent topology.
    
  • Dropped unnecessary guard re-taking before the final update which ends all tablet migration


raphaelsc pushed a commit to raphaelsc/scylla that referenced this pull request Aug 22, 2023
raphaelsc pushed a commit to raphaelsc/scylla that referenced this pull request Aug 29, 2023
@@ -53,6 +53,8 @@ public:
// Until the upgrade procedure finishes, we will perform operations such as schema changes using the old way,
// but still pass the guard around to synchronize operations with the upgrade procedure.
bool with_raft() const;

explicit operator bool() const { return bool(_impl); }
Contributor:

Ugh. Making the guard an optional object makes reasoning about it much harder.

Now whenever reviewing code that takes a guard I need to think: what if it's disengaged? Can it be disengaged?

Why not use std::optional<group0_guard> instead? This approach imposes the complexity of reasoning about optionals on every code path that uses group0_guard, even where it's non-optional!

Contributor:

I think this is a big mistake.

Contributor:

The _impl was added for pimplification, and only for that.

Contributor:

Ideally, having a group0_guard object in hand should give a type-level guarantee that it's safe to do a linearizable group 0 operation and that concurrent modifications will be detected.

Using a guard after a move is a use-after-move; it's a bug.

tgrabiec (Contributor, Author):

Well, the guard is still non-optional; operator bool() only checks whether it was moved-from. Using a moved-from guard is still incorrect.

Contributor:

But calling the public interface on a moved-from object is basically using that moved-from object.

Maybe we could find better tools to deal with such cases, something that would protect us on type level...
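
To make the trade-off in this thread concrete, here is a small sketch of the two shapes being compared. The types are stand-ins for group0_guard and its pimpl, not the real classes:

#include <memory>
#include <optional>

struct guard_impl {};   // stand-in for the pimpl'd state of group0_guard

// Shape added by this patch: the guard itself can report a moved-from state,
// so every holder of one may have to ask "is it engaged?".
struct group0_guard_a {
    std::unique_ptr<guard_impl> _impl = std::make_unique<guard_impl>();
    explicit operator bool() const { return bool(_impl); }
};

// Alternative raised in review: keep the guard unconditionally valid and push
// optionality to the few call sites that genuinely may not hold one.
struct group0_guard_b {
    std::unique_ptr<guard_impl> _impl = std::make_unique<guard_impl>();
    // no operator bool(): holding a group0_guard_b means holding the guard
};

int main() {
    group0_guard_a a;
    if (a) { /* every caller has to consider this, even when the answer is always yes */ }

    std::optional<group0_guard_b> maybe;   // optionality is explicit in the type
    if (!maybe) {
        maybe.emplace();
    }
    group0_guard_b& b = *maybe;            // past this point b is known to be valid
    (void)b;
    return 0;
}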

// In order to keep the cluster saturated, ask the load balancer for more transitions.
// Unless there is a pending topology change operation.
auto ts = guard.write_timestamp();
Contributor:

With the guard now basically being an optional object, even an innocuous write_timestamp() call may be unsafe :(

Successfully merging this pull request may close these issues.

test_table_drop_with_auto_snapshot fails with "missing column: tablet_count"