
storage_service, tablets: Fix corruption of tablet metadata on migration concurrent with table drop #15069

Closed

Conversation

tgrabiec (Contributor):

Tablet migration may execute a global token metadata barrier before executing updates of system.tablets. If the table is dropped while the barrier is in progress, the updates will bring back rows for the migrated tablets of a table which no longer exists. This will cause tablet metadata loading to fail with the error:

missing_column (missing column: tablet_count)

Like in this log line:

storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")

The fix is to read and execute the updates within a single group0 guard scope, and to move execution of the barrier later. We can no longer generate updates in the same handle_tablet_migration() step if a barrier needs to be executed, so we reuse the mechanism for two-step stage transition which we already have for handling of streaming. The next pass will notice that the barrier is not needed for a given tablet and will generate the stage update.

Fixes #15061
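
To make the race and the fix concrete, here is a minimal, self-contained model of the pattern described above. Everything in it (guard, coordinator, handle_tablet_migration, run_barrier) is an illustrative stand-in for the real Scylla machinery, not its actual API:

// A minimal model of the race and the fix (illustrative stand-ins, not Scylla code).
#include <cassert>
#include <vector>

struct guard {
    // Stands in for group0_guard: while it is held, no other group0 change
    // (such as a table drop) can be interleaved with our read and write.
    bool valid = true;
};

struct tablet_state {
    bool needs_barrier = false;   // a global token metadata barrier is still pending
    bool stage_advanced = false;  // the stage update was written to system.tablets
};

struct coordinator {
    std::vector<tablet_state> tablets;
    bool barrier_requested = false;

    // One pass of handle_tablet_migration(): the read of tablet metadata and
    // the generated updates stay under the SAME guard, so a concurrent table
    // drop is ordered either entirely before or entirely after this step.
    void handle_tablet_migration(guard& g) {
        assert(g.valid);
        for (auto& t : tablets) {
            if (t.needs_barrier) {
                // Do not write system.tablets now; only remember that a barrier
                // is required. The stage update is deferred to a later pass,
                // mirroring the two-step transition already used for streaming.
                barrier_requested = true;
            } else {
                t.stage_advanced = true;
            }
        }
    }

    // The barrier runs outside the read/update scope; it does not need to be
    // atomic with the read that requested it.
    void run_barrier() {
        for (auto& t : tablets) {
            t.needs_barrier = false;
        }
        barrier_requested = false;
    }
};

int main() {
    coordinator c;
    c.tablets = {{true, false}, {false, false}};

    guard g1;
    c.handle_tablet_migration(g1);   // pass 1: only the barrier-free tablet advances
    if (c.barrier_requested) {
        c.run_barrier();
    }

    guard g2;
    c.handle_tablet_migration(g2);   // pass 2: the remaining tablet advances its stage
    return 0;
}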

@tgrabiec tgrabiec changed the title storage_service, tablets: Fix corrupting tablet metadata on migration concurrent with table drop storage_service, tablets: Fix corruption of tablet metadata on migration concurrent with table drop Aug 16, 2023

// We rely on the fact that should_preempt_balancing() does not release the guard
// so that tablet metadata reading and updates are atomic.
on_internal_error(slogger, "should_preempt_balancing() retook the guard");
}
Member:

Shouldn't we avoid moving the guard into should_preempt_balancing() then?

tgrabiec (Contributor, Author):

Yes, I think that would be better. It would involve more refactoring, though, because get_node_to_work_on_opt() wants to own the guard. I was hoping to avoid that refactoring for now.

Member:

We can schedule the refactoring for later (but add a fixme)
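
For reference, a minimal sketch of the direction this suggests: pass the guard by reference so that should_preempt_balancing() cannot release or consume it, turning the "does not release the guard" guarantee into a property of the signature rather than a runtime check. The types and function shapes below are stand-ins, not the real declarations:

#include <cassert>
#include <utility>

struct group0_guard_model { bool held = true; };   // stand-in for group0_guard

// Current shape (simplified): the callee takes ownership of the guard and has
// to hand it back; the caller then verifies at runtime that it was not retaken.
group0_guard_model should_preempt_balancing_owning(group0_guard_model g) {
    // ... inspect topology state ...
    return g;
}

// Shape the suggestion points at: borrow the guard, so releasing or retaking it
// is impossible here and the guarantee lives in the signature.
bool should_preempt_balancing_borrowing(const group0_guard_model& g) {
    assert(g.held);
    // ... inspect topology state ...
    return false;
}

int main() {
    group0_guard_model g;
    bool preempt = should_preempt_balancing_borrowing(g);
    assert(g.held);                    // guard provably untouched by the call
    (void)preempt;
    g = should_preempt_balancing_owning(std::move(g));
    return 0;
}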

@@ -1455,6 +1481,9 @@ class topology_coordinator {
co_return;
}

if (!guard) {
guard = co_await start_operation();
}
Member:

This 'if (!guard) guard = ...' looks like a recipe for things working accidentally, no?

We should have clearer transaction boundaries.

tgrabiec (Contributor, Author):

Even if we always start a new guard here, things may work accidentally in the absence of concurrency, so I don't see that as a reason to avoid the optimization.

This if (!guard) line marks the start of a transaction. Reusing the guard is an optimization for the case where it wasn't consumed by the execution of updates, to avoid a needless start_operation(), which is relatively expensive as it involves a raft read barrier. In that case, taking the guard is just to humor the exec_global_command() API; there is no actual transaction going on.

Member:

Please add a comment, or encapsulate in a commented start_or_adopt_transaction()
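
One way such a start_or_adopt_transaction() helper could look, modeled here without Seastar coroutines. All names in this sketch are hypothetical (the real start_operation() would be co_await-ed); the point is only to give the "reuse the guard or start a fresh operation" boundary a name:

#include <cstdio>
#include <optional>
#include <utility>

struct group0_guard_model { int id = 0; };   // stand-in for group0_guard

static int next_id = 0;

group0_guard_model start_operation_model() {
    // In the real coordinator, start_operation() is relatively expensive:
    // it involves a raft read barrier.
    return group0_guard_model{++next_id};
}

// Reuse the guard if the previous step did not consume it; otherwise start a
// fresh operation. This names the transaction boundary instead of leaving an
// inline "if (!guard)" at the call site.
group0_guard_model start_or_adopt_transaction(std::optional<group0_guard_model>& guard) {
    if (!guard) {
        guard = start_operation_model();     // real code: co_await start_operation()
    }
    group0_guard_model g = std::move(*guard);
    guard.reset();
    return g;
}

int main() {
    std::optional<group0_guard_model> guard;          // consumed by an earlier update step
    auto g = start_or_adopt_transaction(guard);       // must start a new operation
    std::printf("using guard %d\n", g.id);
    return 0;
}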

tgrabiec (Contributor, Author):

In V2:

  • Added FIXME above should_preempt_balancing():
    // FIXME: Don't take the ownership of the guard to make the above guarantee explicit.
    
  • Documented guard re-taking before the barrier:
            // If has_updates is true then we have dropped the guard and need to re-obtain it.
            // It's fine to start an independent operation here. The barrier doesn't have to be executed
            // atomically with the read which set needs_barrier, because it's fine if the global barrier
            // works with a more recent set of nodes and it's fine if it propagates a more recent topology.
    
  • Dropped unnecessary guard re-taking before the final update which ends all tablet migration


raphaelsc pushed a commit to raphaelsc/scylla that referenced this pull request Aug 22, 2023
raphaelsc pushed a commit to raphaelsc/scylla that referenced this pull request Aug 29, 2023
@@ -53,6 +53,8 @@ public:
// Until the upgrade procedure finishes, we will perform operations such as schema changes using the old way,
// but still pass the guard around to synchronize operations with the upgrade procedure.
bool with_raft() const;

explicit operator bool() const { return bool(_impl); }
Contributor:

Ugh. Making the guard an optional object makes reasoning about it much harder.

Now whenever reviewing code that takes a guard I need to think: what if it's disengaged? Can it be disengaged?

Why not use std::optional<group0_guard> instead? This approach imposes the complexity of reasoning about optionals on every code path that uses group0_guard, even where it's non-optional!

Contributor:

I think this is a big mistake.

Contributor:

The _impl was added for pimplification, and only for that.

Contributor:

Ideally, having a group0_guard object in hand should give a type-level guarantee that it's safe to do a linearizable group 0 operation and that concurrent modifications will be detected.

Using a guard after a move is a use-after-move; it's a bug.

tgrabiec (Contributor, Author):

Well, the guard is still non-optional; operator bool() only checks whether it was moved-from. Using a moved-from guard is still incorrect.

Contributor:

But calling the public interface on a moved-from object is basically using that moved-from object.

Maybe we could find better tools to deal with such cases, something that would protect us on type level...
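
To make the trade-off in this thread concrete, here is a small sketch of the two shapes being compared. The types are stand-ins for group0_guard and its pimpl, not the real classes:

#include <memory>
#include <optional>

struct guard_impl {};   // stand-in for the pimpl'd state of group0_guard

// Shape added by this patch: the guard itself can report a moved-from state,
// so every holder of one may have to ask "is it engaged?".
struct group0_guard_a {
    std::unique_ptr<guard_impl> _impl = std::make_unique<guard_impl>();
    explicit operator bool() const { return bool(_impl); }
};

// Alternative raised in review: keep the guard unconditionally valid and push
// optionality to the few call sites that genuinely may not hold one.
struct group0_guard_b {
    std::unique_ptr<guard_impl> _impl = std::make_unique<guard_impl>();
    // no operator bool(): holding a group0_guard_b means holding the guard
};

int main() {
    group0_guard_a a;
    if (a) { /* every caller has to consider this, even when the answer is always yes */ }

    std::optional<group0_guard_b> maybe;   // optionality is explicit in the type
    if (!maybe) {
        maybe.emplace();
    }
    group0_guard_b& b = *maybe;            // past this point b is known to be valid
    (void)b;
    return 0;
}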

// In order to keep the cluster saturated, ask the load balancer for more transitions.
// Unless there is a pending topology change operation.
auto ts = guard.write_timestamp();
Contributor:

With the guard now basically being an optional object, even an innocuous write_timestamp() call may be unsafe :(

Successfully merging this pull request may close these issues.

test_table_drop_with_auto_snapshot fails with "missing column: tablet_count"