storage_service, tablets: Fix corrupting tablet metadata on migration concurrent with table drop

Tablet migration may execute a global token metadata barrier before
executing updates of system.tablets. If a table is dropped while the
barrier is in progress, the updates will bring back rows for the migrated
tablets of a table which no longer exists. This causes tablet metadata
loading to fail with the error:

 missing_column (missing column: tablet_count)

As in this log line:

storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")

The fix is to read and execute the updates in a single group0 guard
scope, and to move execution of the barrier later. We can no longer
generate updates in the same handle_tablet_migration() step when a
barrier needs to be executed, so we reuse the two-step stage-transition
mechanism which we already have for handling streaming. The next pass
will notice that the barrier is no longer needed for a given tablet and
will generate the stage update.

Fixes #15061

Closes #15069
tgrabiec authored and avikivity committed Aug 20, 2023
1 parent a4e7f9b commit 1552044
Showing 3 changed files with 49 additions and 11 deletions.
2 changes: 2 additions & 0 deletions service/raft/raft_group0_client.hh
@@ -53,6 +53,8 @@ public:
// Until the upgrade procedure finishes, we will perform operations such as schema changes using the old way,
// but still pass the guard around to synchronize operations with the upgrade procedure.
bool with_raft() const;

explicit operator bool() const { return bool(_impl); }
};

void release_guard(group0_guard guard);
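
A minimal, stand-alone illustration of the idiom this conversion enables (toy types, not the real group0_guard or raft API): once the guard has been moved out — as the storage_service.cc change below does when handing it to update_topology_state() — the caller can test whether it still holds one before starting a fresh operation.

#include <cstdio>
#include <memory>
#include <utility>

struct toy_guard {                       // stand-in for the pimpl shape of group0_guard
    std::shared_ptr<int> _impl;
    explicit operator bool() const { return bool(_impl); }
};

int main() {
    toy_guard guard{std::make_shared<int>(0)};
    toy_guard consumed = std::move(guard);   // e.g. handed to update_topology_state()
    if (!guard) {
        // In the real code this is where a fresh operation would be started
        // before executing the global tablet token metadata barrier.
        std::printf("guard was given up, re-obtain it before the barrier\n");
    }
    (void)consumed;
}
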
55 changes: 44 additions & 11 deletions service/storage_service.cc
@@ -1275,6 +1275,7 @@ class topology_coordinator {
// Next migration of the same tablet is guaranteed to use a different instance.
struct tablet_migration_state {
background_action_holder streaming;
std::unordered_map<locator::tablet_transition_stage, background_action_holder> barriers;
};

std::unordered_map<locator::global_tablet_id, tablet_migration_state> _tablets;
@@ -1358,6 +1359,13 @@ class topology_coordinator {
bool needs_barrier = false;
bool has_transitions = false;

shared_promise barrier;
auto fail_barrier = seastar::defer([&] {
if (needs_barrier) {
barrier.set_exception(seastar::broken_promise());
}
});

_tablets_ready = false;
co_await for_each_tablet_transition([&] (const locator::tablet_map& tmap,
schema_ptr s,
@@ -1377,8 +1385,12 @@
};

auto transition_to_with_barrier = [&] (locator::tablet_transition_stage stage) {
needs_barrier = true;
transition_to(stage);
if (advance_in_background(gid, tablet_state.barriers[stage], "barrier", [&] {
needs_barrier = true;
return barrier.get_shared_future();
})) {
transition_to(stage);
}
};

switch (trinfo.stage) {
@@ -1419,29 +1431,47 @@
}
});

if (needs_barrier) {
guard = co_await global_tablet_token_metadata_barrier(std::move(guard));
}

// In order to keep the cluster saturated, ask the load balancer for more transitions.
// Unless there is a pending topology change operation.
auto ts = guard.write_timestamp();
auto [preempt, new_guard] = should_preempt_balancing(std::move(guard));
guard = std::move(new_guard);
if (ts != guard.write_timestamp()) {
// We rely on the fact that should_preempt_balancing() does not release the guard
// so that tablet metadata reading and updates are atomic.
on_internal_error(slogger, "should_preempt_balancing() retook the guard");
}
if (!preempt) {
auto plan = co_await _tablet_allocator.balance_tablets(get_token_metadata_ptr());
co_await generate_migration_updates(updates, guard, plan);
}

// It's ok to execute planned updates after retaking the guard because as long
// as topology is in tablet_migration state only this coordinator has a right
// to advance the state machine of tablets.

if (!updates.empty()) {
// The updates have to be executed under the same guard which was used to read tablet metadata
// to ensure that we don't reinsert tablet rows which were concurrently deleted by schema change
// which happens outside the topology coordinator.
bool has_updates = !updates.empty();
if (has_updates) {
updates.emplace_back(
topology_mutation_builder(guard.write_timestamp())
.set_version(_topo_sm._topology.version + 1)
.build());
co_await update_topology_state(std::move(guard), std::move(updates), format("Tablet migration"));
}

if (needs_barrier) {
// If has_updates is true then we have dropped the guard and need to re-obtain it.
// It's fine to start an independent operation here. The barrier doesn't have to be executed
// atomically with the read which set needs_barrier, because it's fine if the global barrier
// works with a more recent set of nodes and it's fine if it propagates a more recent topology.
if (!guard) {
guard = co_await start_operation();
}
guard = co_await global_tablet_token_metadata_barrier(std::move(guard));
barrier.set_value();
fail_barrier.cancel();
}

if (has_updates) {
co_return;
}

@@ -1463,6 +1493,9 @@
co_await update_topology_state(std::move(guard), std::move(updates), "Finished tablet migration");
}

// This function must not release and reacquire the guard, callers rely
// on the fact that the block which calls this is atomic.
// FIXME: Don't take the ownership of the guard to make the above guarantee explicit.
std::pair<bool, group0_guard> should_preempt_balancing(group0_guard guard) {
auto node_or_guard = get_node_to_work_on_opt(std::move(guard));
if (auto* node = std::get_if<node_to_work_on>(&node_or_guard)) {
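
The shared_promise in the storage_service.cc change above acts as a one-promise-many-waiters barrier: every tablet whose stage needs the barrier parks a background action on barrier.get_shared_future(), and the coordinator either resolves the promise after the global barrier or breaks it via the deferred action so waiters do not hang. The standard-C++ sketch below models only that shape; the real code uses seastar::shared_promise, seastar futures and seastar::defer.

#include <cstdio>
#include <future>
#include <stdexcept>
#include <thread>
#include <vector>

int main() {
    std::promise<void> barrier;
    std::shared_future<void> done = barrier.get_future().share();

    // Several "tablets" wait on the same barrier before advancing their stage.
    std::vector<std::thread> waiters;
    for (int i = 0; i < 3; ++i) {
        waiters.emplace_back([done, i] {
            try {
                done.get();
                std::printf("tablet %d: barrier done, stage advances on the next pass\n", i);
            } catch (const std::exception& e) {
                std::printf("tablet %d: barrier failed: %s\n", i, e.what());
            }
        });
    }

    bool barrier_succeeded = true;   // flip to false to exercise the failure path
    if (barrier_succeeded) {
        barrier.set_value();         // mirrors barrier.set_value() after the global barrier
    } else {
        // mirrors the deferred action that breaks the promise if the pass fails early
        barrier.set_exception(std::make_exception_ptr(std::runtime_error("broken barrier")));
    }
    for (auto& t : waiters) {
        t.join();
    }
}
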
3 changes: 3 additions & 0 deletions test/topology_experimental_raft/test_tablets.py
@@ -127,6 +127,9 @@ async def test_table_drop_with_auto_snapshot(manager: ManagerClient):

cql = manager.get_cql()

# Increases the chance of tablet migration concurrent with schema change
await inject_error_on(manager, "tablet_allocator_shuffle", servers)

for i in range(3):
await cql.run_async("DROP KEYSPACE IF EXISTS test;")
await cql.run_async("CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1, 'initial_tablets': 8 };")