test_table_drop_with_auto_snapshot fails with "missing column: tablet_count" #15061

gleb-cloudius · 2023-08-16T06:28:35Z

Installation details
Scylla version (or git commit hash): d57a951

Seen in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3079/

ERROR 2023-08-15 17:57:49,724 [shard 0] raft - [32b70bf9-78d4-4290-9286-4994438f277e] applier fiber stopped because of the error: std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)
INFO  2023-08-15 17:57:49,735 [shard 0] storage_service - raft_state_monitor_fiber aborted with raft::stopped_error (Raft instance is stopped)
ERROR 2023-08-15 17:57:49,738 [shard 0] storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")

… concurrent with table drop

Tablet migration may execute a global token metadata barrier before executing updates of system.tablets. If the table is dropped while the barrier is in progress, the updates will bring back rows for migrated tablets in a table which no longer exists. This causes tablet metadata loading to fail with the error:

missing_column (missing column: tablet_count)

as in this log line:

storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")

The fix is to read and execute the updates in a single group0 guard scope, and to move execution of the barrier later. Updates can no longer be generated in the same handle_tablet_migration() step if the barrier needs to be executed, so we reuse the two-step stage transition mechanism that already exists for handling streaming. The next pass will notice that the barrier is not needed for a given tablet and will generate the stage update.

Fixes scylladb#15061
Closes scylladb#15069
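The race and the fix described in the commit message above can be illustrated with a toy model. This is a minimal sketch in Python, not ScyllaDB code: `Group0State`, `buggy_step`, `fixed_step`, the `"next_stage"` value, and the version-based guard are all hypothetical stand-ins chosen for illustration. The only ideas taken from the commit message are that updates generated before the barrier can resurrect rows of a dropped table, and that the fix reads and writes under one guard and defers the stage update to a later pass.

```python
from dataclasses import dataclass, field


@dataclass
class Group0State:
    """Toy stand-in for group0 state (hypothetical, not ScyllaDB's API)."""
    tablets: dict = field(default_factory=dict)  # table -> {tablet_id: stage}
    version: int = 0

    def take_guard(self):
        # In this model a guard is just the version seen when it was taken.
        return self.version

    def commit(self, guard, updates):
        # Reject writes that were prepared against an older state.
        if guard != self.version:
            raise RuntimeError("concurrent modification")
        for table, tablet_id, stage in updates:
            self.tablets.setdefault(table, {})[tablet_id] = stage
        self.version += 1

    def drop_table(self, table):
        self.tablets.pop(table, None)
        self.version += 1


def buggy_step(state, barrier):
    # Updates are generated from the state as it is now...
    updates = [(t, tid, "next_stage")
               for t, rows in state.tablets.items() for tid in rows]
    barrier()  # ...the table may be dropped while the barrier runs...
    # ...and the stale updates are written under a fresh guard.
    state.commit(state.take_guard(), updates)


def fixed_step(state, barrier, barrier_done):
    guard = state.take_guard()
    current = [(t, tid) for t, rows in state.tablets.items() for tid in rows]
    # Only tablets whose barrier already ran get a stage update in this pass.
    updates = [(t, tid, "next_stage")
               for t, tid in current if (t, tid) in barrier_done]
    pending = [key for key in current if key not in barrier_done]
    if updates:
        state.commit(guard, updates)  # read and write in one guard scope
    if pending:
        barrier()  # the barrier runs after the guarded work
        barrier_done.update(pending)


def run_buggy():
    state = Group0State(tablets={"ks.t": {0: "stage"}})
    buggy_step(state, barrier=lambda: state.drop_table("ks.t"))
    return state.tablets


def run_fixed():
    state = Group0State(tablets={"ks.t": {0: "stage"}})
    done = set()
    # First pass: barrier only; the table is dropped while it runs.
    fixed_step(state, lambda: state.drop_table("ks.t"), done)
    # Second pass: re-reads the state under a new guard and finds nothing.
    fixed_step(state, lambda: None, done)
    return state.tablets
```

In this model `run_buggy()` leaves a tablet row behind for the dropped table, while `run_fixed()` ends with no rows because the second pass re-reads the state under a new guard before generating any update.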

denesb · 2023-12-18T11:02:22Z

No branches affected, no backport needed.

gleb-cloudius assigned tgrabiec Aug 16, 2023

gleb-cloudius added the symptom/ci stability label (Issues that failed in ScyllaDB CI - tests and framework) Aug 16, 2023

tgrabiec mentioned this issue Aug 16, 2023

storage_service, tablets: Fix corruption of tablet metadata on migration concurrent with table drop #15069

Closed

scylladb-promoter closed this as completed in 1552044 Aug 22, 2023

scylladb-promoter added the Backport candidate label Aug 22, 2023

DoronArazii added this to the 5.4 milestone Aug 29, 2023

denesb removed the Backport candidate label Dec 18, 2023
