repair should handle abort_requested_exception mode gracefully #15710

Closed
bhalevy opened this issue Oct 15, 2023 · 9 comments
Labels
P2 High Priority symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework

@bhalevy
Member

bhalevy commented Oct 15, 2023

As seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/258/testReport/bootstrap_test/TestBootstrap/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split004___test_cluster_become_unavailable_when_gracefully_kill_node_during_bootstrap/
The abort_requested_exception bypasses the handler in

scylladb/main.cc

Lines 1904 to 1909 in 055f061

        } catch (const abort_requested_exception&) {
            startlog.info("Startup interrupted");
            // This happens when scylla gets SIGINT in the middle of join_cluster(), so
            // just ignore it and exit normally
            _exit(0);
            return 0;

This handler is supposed to handle it gracefully, but the exception arrives wrapped in a std::runtime_error, apparently coming from repair (RBNO bootstrap), so the catch clause does not match.
See https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/258/artifact/logs-full.debug.004/1697351776674_bootstrap_test.py%3A%3ATestBootstrap%3A%3Atest_cluster_become_unavailable_when_gracefully_kill_node_during_bootstrap/node3.log

Scylla version 5.4.0~dev-0.20231013.055f0617064d with build-id 8c0ed99e7240eedb64eba4950ea5e3c03340b404 starting ...
...
WARN  2023-10-15 06:33:30,482 [shard 0:stre] repair - repair[40e36696-e80d-4d0c-9b77-68afd071b71a]: sync data for keyspace=keyspace, status=failed: std::runtime_error ({shard 0: seastar::abort_requested_exception (abort requested), shard 1: seastar::abort_requested_exception (abort requested)})
ERROR 2023-10-15 06:33:30,484 [shard 0:stre] node_ops - bootstrap[6cee15ba-90df-4e94-8d7e-ba81e649bfa1]: Operation failed, sync_nodes={127.0.13.1, 127.0.13.2, 127.0.13.3}: std::runtime_error ({shard 0: seastar::abort_requested_exception (abort requested), shard 1: seastar::abort_requested_exception (abort requested)})
...
ERROR 2023-10-15 06:33:39,271 [shard 0:main] init - Startup failed: std::runtime_error ({shard 0: seastar::abort_requested_exception (abort requested), shard 1: seastar::abort_requested_exception (abort requested)})
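
To illustrate why the typed handler in main.cc does not fire here, below is a minimal standalone sketch. It does not use Seastar: abort_requested_exception is a local stand-in for seastar::abort_requested_exception, and throw_wrapped_shard_failure, run_startup and startup_result are hypothetical names made up for the illustration. It only assumes what the log shows, namely that the per-shard exceptions are flattened into the message of a std::runtime_error.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Local stand-in for seastar::abort_requested_exception (assumption: only the
// type identity matters for this illustration).
struct abort_requested_exception : std::exception {
    const char* what() const noexcept override { return "abort requested"; }
};

// Hypothetical helper: flattens the original exception into the message of a
// std::runtime_error. The text survives, the exception type does not.
[[noreturn]] void throw_wrapped_shard_failure() {
    try {
        throw abort_requested_exception();
    } catch (const std::exception& e) {
        throw std::runtime_error(std::string("{shard 0: ") + e.what() + "}");
    }
}

enum class startup_result { ok, interrupted, failed };

// Same handler shape as the main.cc excerpt: a typed catch clause for the
// abort, followed by a generic one for every other failure.
template <typename Func>
startup_result run_startup(Func&& startup) {
    try {
        startup();
        return startup_result::ok;
    } catch (const abort_requested_exception&) {
        return startup_result::interrupted;  // "Startup interrupted", exit normally
    } catch (const std::exception&) {
        return startup_result::failed;       // "Startup failed: ..."
    }
}

// Exercises both paths; call it from main() to run the sketch.
void check_handler_behaviour() {
    // Thrown directly, the abort reaches the typed handler.
    assert(run_startup([] { throw abort_requested_exception(); }) == startup_result::interrupted);
    // Wrapped in std::runtime_error, it falls through to the generic handler.
    assert(run_startup([] { throw_wrapped_shard_failure(); }) == startup_result::failed);
}
```

The catch clause matches on the dynamic type of the thrown object, so an abort that is only mentioned in the what() string of a std::runtime_error looks like any other startup failure.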
@bhalevy
Member Author

bhalevy commented Oct 15, 2023

@denesb please assign

@bhalevy bhalevy added P2 High Priority symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework labels Oct 15, 2023
@bhalevy
Member Author

bhalevy commented Oct 16, 2023

Also, I'm not sure if it's the same issue, but the symptoms are related, so there's a good chance they could be fixed together:

see https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/258/artifact/logs-full.debug.048/1697354882267_alternator_stream_tests.py%3A%3ATestAlternatorStreams%3A%3Atest_updated_shards_during_add_decommission_node/node5.log

INFO  2023-10-15 07:04:57,867 [shard 0:main] init - Scylla version 5.4.0~dev-0.20231013.055f0617064d with build-id 8c0ed99e7240eedb64eba4950ea5e3c03340b404 starting ...
...
INFO  2023-10-15 07:27:01,199 [shard 0:main] compaction_manager - Asked to stop
...
WARN  2023-10-15 07:27:01,415 [shard 0:stre] repair - repair[08303fc2-86d1-4584-814b-5b0b62fae889]: sync data for keyspace=alternator_user_table, status=failed: std::runtime_error ({shard 0: std::runtime_error (repair[08303fc2-86d1-4584-814b-5b0b62fae889]: 0 out of 1540 ranges failed, keyspace=alternator_user_table, tables={user_table_scylla_cdc_log, user_table}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)})
ERROR 2023-10-15 07:27:01,417 [shard 0:stre] node_ops - bootstrap[2bc530ff-0dc7-4873-8601-12207b9f5371]: Operation failed, sync_nodes={127.0.14.3, 127.0.14.2, 127.0.14.1, 127.0.14.5}: std::runtime_error ({shard 0: std::runtime_error (repair[08303fc2-86d1-4584-814b-5b0b62fae889]: 0 out of 1540 ranges failed, keyspace=alternator_user_table, tables={user_table_scylla_cdc_log, user_table}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)})
...
ERROR 2023-10-15 07:27:21,013 [shard 0:main] init - Startup failed: std::runtime_error ({shard 0: std::runtime_error (repair[08303fc2-86d1-4584-814b-5b0b62fae889]: 0 out of 1540 ranges failed, keyspace=alternator_user_table, tables={user_table_scylla_cdc_log, user_table}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)})

Note aborted_by_user=true above.
In this case repair should abort gracefully.

@bhalevy
Member Author

bhalevy commented Oct 16, 2023

Same in https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/258/artifact/logs-full.debug.047/1697353534672_lwt_random_test.py%3A%3ATestRandomPaxos%3A%3Atest_topology_grow/node4.log

ERROR 2023-10-15 07:05:25,059 [shard 0:main] init - Startup failed: std::runtime_error ({shard 0: std::runtime_error (repair[ee05ad8a-412f-430d-a94e-1a2e31502d31]: 0 out of 769 ranges failed, keyspace=ks, tables={test}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A), shard 1: std::runtime_error (repair[ee05ad8a-412f-430d-a94e-1a2e31502d31]: 1 out of 769 ranges failed, keyspace=ks, tables={test}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)})

Deexie added a commit to Deexie/scylla that referenced this issue Oct 16, 2023
If abort is requested during bootstrap, the node should exit normally.
To achieve this, abort_requested_exception should be thrown, since main
handles it gracefully.

In data_sync_repair_task_impl::run, exceptions from all shards are
wrapped together into a std::runtime_error, so they aren't
handled as they are supposed to be.

Throw abort_requested_exception when shutdown was requested.
Throw abort_requested_exception also if repair::task_manager_module::is_aborted,
so that force_terminate_all_repair_sessions acts the same regardless
of the state of the repair.

To maintain consistency, do the same for user_requested_repair_task_impl.

Fixes: scylladb#15710.
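
The approach described in the commit message can be sketched roughly as follows. This is not the actual patch: it is a standalone illustration without Seastar, where abort_requested_exception is a local stand-in for seastar::abort_requested_exception, rethrow_shard_failures is a hypothetical helper, and the abort_requested flag stands in for the checks the message names (shutdown requested, or repair::task_manager_module::is_aborted).

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Local stand-in for seastar::abort_requested_exception.
struct abort_requested_exception : std::exception {
    const char* what() const noexcept override { return "abort requested"; }
};

// Hypothetical helper: decides what to throw once all shards have finished.
// shard_errors[i] is null if shard i succeeded.
void rethrow_shard_failures(const std::vector<std::exception_ptr>& shard_errors,
                            bool abort_requested) {
    // An abort takes precedence and keeps its type, so a typed
    // catch (const abort_requested_exception&) further up can match it.
    if (abort_requested) {
        throw abort_requested_exception();
    }
    // Otherwise keep the old behaviour: collect the per-shard messages
    // into a single std::runtime_error.
    std::string msg;
    for (std::size_t shard = 0; shard < shard_errors.size(); ++shard) {
        if (!shard_errors[shard]) {
            continue;
        }
        if (!msg.empty()) {
            msg += ", ";
        }
        try {
            std::rethrow_exception(shard_errors[shard]);
        } catch (const std::exception& e) {
            msg += "shard " + std::to_string(shard) + ": " + e.what();
        } catch (...) {
            msg += "shard " + std::to_string(shard) + ": unknown exception";
        }
    }
    if (!msg.empty()) {
        throw std::runtime_error("{" + msg + "}");
    }
}

// Exercises the abort path; call it from main() to run the sketch.
void check_abort_keeps_its_type() {
    std::vector<std::exception_ptr> errors{
        std::make_exception_ptr(std::runtime_error("range failed"))};
    bool got_abort = false;
    try {
        rethrow_shard_failures(errors, true);
    } catch (const abort_requested_exception&) {
        got_abort = true;
    }
    assert(got_abort);
}
```

The design point is only the ordering: checking for the abort before aggregating means the caller sees the abort under its own type, while genuine failures are still reported per shard.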
@mykaul mykaul added this to the 5.4 milestone Oct 19, 2023
@avikivity
Member

Looks like a cosmetic issue, not backporting.

@bhalevy
Member Author

bhalevy commented Nov 5, 2023

https://argus.scylladb.com/workspace?state=WyJkN2RkNDJmZS00ZTM3LTRiNWQtYmEwNS02ZDg2N2UzY2IyNzMiXQ
5.4.0~rc1-20231029.24efacf90d0b

2023-10-30 14:25:41.762 <2023-10-30 14:25:41.164>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=3b9c6e45-9e5c-4cce-b581-21b2a64f4dc3 during_nemesis=RemoveNodeThenAddNode: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=5944 node=longevity-10gb-3h-5-4-db-node-a6e3dbbb-7
2023-10-30T14:25:41.164+00:00 longevity-10gb-3h-5-4-db-node-a6e3dbbb-7      !ERR | scylla[5875]:  [shard 0:stre] node_ops - decommission[b4d763c9-698e-492e-8e2c-8e4c75609e5f]: Operation failed, sync_nodes={10.12.3.12, 10.12.1.189, 10.12.0.178, 10.12.3.217, 10.12.0.109, 10.12.0.226}: seastar::abort_requested_exception (abort requested)

@bhalevy
Member Author

bhalevy commented Nov 6, 2023

Looks like a cosmetic issue, not backporting.

@avikivity, we have tests failing on this in 5.4 SCT
Since it's low risk I'd rather we backport it please.

@mykaul mykaul added Backport candidate backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed labels Nov 9, 2023
@mykaul
Contributor

mykaul commented Nov 9, 2023

Looks like a cosmetic issue, not backporting.

@avikivity, we have tests failing on this in 5.4 SCT Since it's low risk I'd rather we backport it please.

Adding back the labels, so we don't lose track of this request.

@mykaul
Contributor

mykaul commented Dec 13, 2023

@scylladb/scylla-maint - please backport to 5.4.

@denesb
Contributor

denesb commented Dec 13, 2023

Already in 5.4.

@mykaul mykaul removed Backport candidate backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed labels Dec 13, 2023
6 participants