test_table_drop_with_auto_snapshot fails with "missing column: tablet_count" #15061

Closed
gleb-cloudius opened this issue Aug 16, 2023 · 1 comment

@gleb-cloudius
Contributor

Installation details
Scylla version (or git commit hash): d57a951

Seen in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3079/

ERROR 2023-08-15 17:57:49,724 [shard 0] raft - [32b70bf9-78d4-4290-9286-4994438f277e] applier fiber stopped because of the error: std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)
INFO  2023-08-15 17:57:49,735 [shard 0] storage_service - raft_state_monitor_fiber aborted with raft::stopped_error (Raft instance is stopped)
ERROR 2023-08-15 17:57:49,738 [shard 0] storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")
@gleb-cloudius gleb-cloudius added the symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework label Aug 16, 2023
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Aug 16, 2023
… concurrent with table drop

Tablet migration may execute a global token metadata barrier before
executing updates of system.tablets. If the table is dropped while the
barrier is in progress, the updates will bring back rows for migrated
tablets in a table which is no longer there. This will cause tablet
metadata loading to fail with the error:

 missing_column (missing column: tablet_count)

Like in this log line:

storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")

The fix is to read and execute the updates in a single group0 guard
scope, and to move execution of the barrier later. We can no longer
generate updates in the same handle_tablet_migration() step if the
barrier needs to be executed, so we reuse the mechanism for two-step
stage transition which we already have for handling of streaming. The
next pass will notice that the barrier is not needed for a given tablet
and will generate the stage update.

Fixes scylladb#15061
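
The ordering problem described in the commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions, not ScyllaDB code: `Group0State`, `build_updates`, `apply_updates`, `migrate_barrier_between`, `migrate_two_step` and the stage names `"before"`/`"after"` are all hypothetical, and the model assumes that a write to a dropped table's tablet rows recreates the partition without its `tablet_count` value.

```python
from dataclasses import dataclass, field


@dataclass
class Group0State:
    """Toy stand-in for group0 state; not ScyllaDB's real data model."""
    # table name -> modelled partition of system.tablets
    partitions: dict = field(default_factory=dict)

    def create_table(self, table, tablet_count):
        self.partitions[table] = {
            "tablet_count": tablet_count,
            "stages": {i: "before" for i in range(tablet_count)},
        }

    def drop_table(self, table):
        self.partitions.pop(table, None)

    def load_tablet_metadata(self):
        # Mimics the failure from the issue: rows exist, tablet_count does not.
        for part in self.partitions.values():
            if "tablet_count" not in part:
                raise RuntimeError("missing column: tablet_count")


def build_updates(state, table):
    """Read the current state and compute a stage update per tablet."""
    part = state.partitions.get(table)
    if part is None:
        return []  # table is gone, nothing to migrate
    return [(table, tablet, "after") for tablet in part["stages"]]


def apply_updates(state, updates):
    """Blind writes: writing to a missing partition recreates it
    without tablet_count (an assumption of this model)."""
    for table, tablet, stage in updates:
        part = state.partitions.setdefault(table, {"stages": {}})
        part["stages"][tablet] = stage


def migrate_barrier_between(state, table, during_barrier):
    """Problematic order: read, run the barrier, then apply stale updates."""
    updates = build_updates(state, table)
    during_barrier()  # the table may be dropped in this window
    apply_updates(state, updates)


def migrate_two_step(state, table, during_barrier):
    """Order after the fix: the barrier runs in its own pass, then the
    next pass reads and applies the updates with nothing in between."""
    during_barrier()                                    # pass 1
    apply_updates(state, build_updates(state, table))  # pass 2
```

In this model, calling `migrate_barrier_between(state, "t", lambda: state.drop_table("t"))` leaves rows behind for the dropped table, so a later `state.load_tablet_metadata()` raises the `missing column: tablet_count` error. With `migrate_two_step`, the second pass sees that the table is gone and generates no updates.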
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Aug 16, 2023
… concurrent with table drop

tgrabiec added a commit to tgrabiec/scylla that referenced this issue Aug 18, 2023
… concurrent with table drop

raphaelsc pushed a commit to raphaelsc/scylla that referenced this issue Aug 22, 2023
… concurrent with table drop

Closes scylladb#15069
raphaelsc pushed a commit to raphaelsc/scylla that referenced this issue Aug 29, 2023
… concurrent with table drop

@DoronArazii DoronArazii added this to the 5.4 milestone Aug 29, 2023
@denesb
Contributor

denesb commented Dec 18, 2023

No branches affected, no backport needed.

5 participants