repair should handle abort_requested_exception mode gracefully #15710

bhalevy · 2023-10-15T07:57:32Z

As seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/258/testReport/bootstrap_test/TestBootstrap/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split004___test_cluster_become_unavailable_when_gracefully_kill_node_during_bootstrap/
The abort_requested_exception slips past the handler in

scylladb/main.cc

Lines 1904 to 1909 in 055f061

    
           } catch (const abort_requested_exception&) { 
        
             startlog.info("Startup interrupted"); 
        
             // This happens when scylla gets SIGINT in the middle of join_cluster(), so 
        
             // just ignore it and exit normally 
        
             _exit(0); 
        
             return 0;

which is supposed to handle it gracefully. The handler is not reached because the exception is wrapped in a std::runtime_error, apparently coming from repair (RBNO bootstrap).
See https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/258/artifact/logs-full.debug.004/1697351776674_bootstrap_test.py%3A%3ATestBootstrap%3A%3Atest_cluster_become_unavailable_when_gracefully_kill_node_during_bootstrap/node3.log

Scylla version 5.4.0~dev-0.20231013.055f0617064d with build-id 8c0ed99e7240eedb64eba4950ea5e3c03340b404 starting ...
...
WARN  2023-10-15 06:33:30,482 [shard 0:stre] repair - repair[40e36696-e80d-4d0c-9b77-68afd071b71a]: sync data for keyspace=keyspace, status=failed: std::runtime_error ({shard 0: seastar::abort_requested_exception (abort requested), shard 1: seastar::abort_requested_exception (abort requested)})
ERROR 2023-10-15 06:33:30,484 [shard 0:stre] node_ops - bootstrap[6cee15ba-90df-4e94-8d7e-ba81e649bfa1]: Operation failed, sync_nodes={127.0.13.1, 127.0.13.2, 127.0.13.3}: std::runtime_error ({shard 0: seastar::abort_requested_exception (abort requested), shard 1: seastar::abort_requested_exception (abort requested)})
...
ERROR 2023-10-15 06:33:39,271 [shard 0:main] init - Startup failed: std::runtime_error ({shard 0: seastar::abort_requested_exception (abort requested), shard 1: seastar::abort_requested_exception (abort requested)})
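
The log above shows the abort surfacing only as text inside a std::runtime_error, which is why the dedicated handler is skipped: C++ catch clauses match on the dynamic type of the thrown object, not on its message. A minimal, self-contained sketch of that effect follows. The exception type here is a hypothetical stand-in for seastar::abort_requested_exception, and run_startup, throw_wrapped_abort and check_handler_matching are illustrative names, not Scylla code. The catch-all branch is an assumption modelled on the "Startup failed" log line.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for seastar::abort_requested_exception.
struct abort_requested_exception : std::exception {
    const char* what() const noexcept override { return "abort requested"; }
};

enum class startup_result { ok, interrupted, failed };

// Simplified model of the handler chain around startup in main.cc.
template <typename Func>
startup_result run_startup(Func&& startup) {
    try {
        startup();
    } catch (const abort_requested_exception&) {
        return startup_result::interrupted;  // graceful path: "Startup interrupted"
    } catch (...) {
        return startup_result::failed;       // error path: "Startup failed: ..."
    }
    return startup_result::ok;
}

// Models repair folding a shard's exception into a std::runtime_error message.
// Only the text survives; the original exception type is lost.
inline void throw_wrapped_abort() {
    try {
        throw abort_requested_exception();
    } catch (const std::exception& e) {
        throw std::runtime_error(std::string("{shard 0: ") + e.what() + "}");
    }
}

// Usage: a direct throw reaches the graceful handler, the wrapped one does not.
inline void check_handler_matching() {
    assert(run_startup([] { throw abort_requested_exception(); }) == startup_result::interrupted);
    assert(run_startup(throw_wrapped_abort) == startup_result::failed);
}
```

In this model the wrapped exception can only be told apart from a real failure by parsing its message, so the type has to be preserved at the point where the per-shard results are combined.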


bhalevy · 2023-10-15T07:57:42Z

@denesb please assign

bhalevy · 2023-10-16T06:33:48Z

Also, I'm not sure if it's the same issue, but the symptoms are related, so there's a good chance they could be fixed together:

see https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/258/artifact/logs-full.debug.048/1697354882267_alternator_stream_tests.py%3A%3ATestAlternatorStreams%3A%3Atest_updated_shards_during_add_decommission_node/node5.log

INFO  2023-10-15 07:04:57,867 [shard 0:main] init - Scylla version 5.4.0~dev-0.20231013.055f0617064d with build-id 8c0ed99e7240eedb64eba4950ea5e3c03340b404 starting ...
...
INFO  2023-10-15 07:27:01,199 [shard 0:main] compaction_manager - Asked to stop
...
WARN  2023-10-15 07:27:01,415 [shard 0:stre] repair - repair[08303fc2-86d1-4584-814b-5b0b62fae889]: sync data for keyspace=alternator_user_table, status=failed: std::runtime_error ({shard 0: std::runtime_error (repair[08303fc2-86d1-4584-814b-5b0b62fae889]: 0 out of 1540 ranges failed, keyspace=alternator_user_table, tables={user_table_scylla_cdc_log, user_table}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)})
ERROR 2023-10-15 07:27:01,417 [shard 0:stre] node_ops - bootstrap[2bc530ff-0dc7-4873-8601-12207b9f5371]: Operation failed, sync_nodes={127.0.14.3, 127.0.14.2, 127.0.14.1, 127.0.14.5}: std::runtime_error ({shard 0: std::runtime_error (repair[08303fc2-86d1-4584-814b-5b0b62fae889]: 0 out of 1540 ranges failed, keyspace=alternator_user_table, tables={user_table_scylla_cdc_log, user_table}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)})
...
ERROR 2023-10-15 07:27:21,013 [shard 0:main] init - Startup failed: std::runtime_error ({shard 0: std::runtime_error (repair[08303fc2-86d1-4584-814b-5b0b62fae889]: 0 out of 1540 ranges failed, keyspace=alternator_user_table, tables={user_table_scylla_cdc_log, user_table}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)})

Note aborted_by_user=true above.
In this case repair should abort gracefully.

bhalevy · 2023-10-16T06:41:30Z

Same in https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/258/artifact/logs-full.debug.047/1697353534672_lwt_random_test.py%3A%3ATestRandomPaxos%3A%3Atest_topology_grow/node4.log

ERROR 2023-10-15 07:05:25,059 [shard 0:main] init - Startup failed: std::runtime_error ({shard 0: std::runtime_error (repair[ee05ad8a-412f-430d-a94e-1a2e31502d31]: 0 out of 769 ranges failed, keyspace=ks, tables={test}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A), shard 1: std::runtime_error (repair[ee05ad8a-412f-430d-a94e-1a2e31502d31]: 1 out of 769 ranges failed, keyspace=ks, tables={test}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)})

If abort is requested during bootstrap, the node should exit normally. To achieve that, abort_requested_exception should be thrown, since main handles it gracefully. In data_sync_repair_task_impl::run, exceptions from all shards are wrapped together into a std::runtime_error, so they aren't handled as they are supposed to be.

Throw abort_requested_exception when shutdown was requested. Also throw abort_requested_exception if repair::task_manager_module::is_aborted, so that force_terminate_all_repair_sessions acts the same regardless of the state of the repair. To maintain consistency, do the same for user_requested_repair_task_impl.

Fixes: scylladb#15710.
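
The approach described in this commit message could look roughly like the sketch below. It is not the actual patch: rethrow_shard_failures and the abort_requested flag are hypothetical names, the exception type is a stand-in for seastar::abort_requested_exception, and the flag is assumed to cover both a requested shutdown and repair::task_manager_module::is_aborted. The sketch also assumes that the abort is only reported when at least one shard failed.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for seastar::abort_requested_exception.
struct abort_requested_exception : std::exception {
    const char* what() const noexcept override { return "abort requested"; }
};

// Combines per-shard results (a null entry means the shard succeeded).
// If an abort was requested, rethrow it as a typed exception so the caller's
// dedicated handler can match it; otherwise wrap all failures into one
// std::runtime_error, as the original code did.
inline void rethrow_shard_failures(const std::vector<std::exception_ptr>& shard_errors,
                                   bool abort_requested) {
    std::string msg;
    bool any_failed = false;
    for (std::size_t shard = 0; shard < shard_errors.size(); ++shard) {
        if (!shard_errors[shard]) {
            continue;
        }
        any_failed = true;
        if (!msg.empty()) {
            msg += ", ";
        }
        msg += "shard " + std::to_string(shard) + ": ";
        try {
            std::rethrow_exception(shard_errors[shard]);
        } catch (const std::exception& e) {
            msg += e.what();
        } catch (...) {
            msg += "unknown exception";
        }
    }
    if (!any_failed) {
        return;
    }
    if (abort_requested) {
        throw abort_requested_exception();  // keeps the type visible to main
    }
    throw std::runtime_error("{" + msg + "}");
}
```

With the flag set, a caller that catches abort_requested_exception sees the abort directly. Without it, the combined std::runtime_error is still produced for genuine failures.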

avikivity · 2023-10-25T16:47:47Z

Looks like a cosmetic issue, not backporting.

bhalevy · 2023-11-05T09:53:41Z

https://argus.scylladb.com/workspace?state=WyJkN2RkNDJmZS00ZTM3LTRiNWQtYmEwNS02ZDg2N2UzY2IyNzMiXQ
5.4.0~rc1-20231029.24efacf90d0b

2023-10-30 14:25:41.762 <2023-10-30 14:25:41.164>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=3b9c6e45-9e5c-4cce-b581-21b2a64f4dc3 during_nemesis=RemoveNodeThenAddNode: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=5944 node=longevity-10gb-3h-5-4-db-node-a6e3dbbb-7
2023-10-30T14:25:41.164+00:00 longevity-10gb-3h-5-4-db-node-a6e3dbbb-7      !ERR | scylla[5875]:  [shard 0:stre] node_ops - decommission[b4d763c9-698e-492e-8e2c-8e4c75609e5f]: Operation failed, sync_nodes={10.12.3.12, 10.12.1.189, 10.12.0.178, 10.12.3.217, 10.12.0.109, 10.12.0.226}: seastar::abort_requested_exception (abort requested)

bhalevy · 2023-11-06T08:34:53Z

> Looks like a cosmetic issue, not backporting.

@avikivity, we have tests failing on this in 5.4 SCT.
Since it's low risk, I'd rather we backport it please.

mykaul · 2023-11-09T15:25:29Z

> > Looks like a cosmetic issue, not backporting.
>
> @avikivity, we have tests failing on this in 5.4 SCT. Since it's low risk I'd rather we backport it please.

Adding back the labels, so we don't lose track of this request.

mykaul · 2023-12-13T12:35:45Z

@scylladb/scylla-maint - please backport to 5.4.

denesb · 2023-12-13T14:52:17Z

Already in 5.4.

bhalevy added the P2 (High Priority) and symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework) labels Oct 15, 2023

denesb assigned Deexie Oct 16, 2023

Deexie mentioned this issue Oct 16, 2023

repair: throw abort_requested_exception when abort is requested #15722

Closed

scylladb-promoter closed this as completed in 0ce9db2 Oct 17, 2023

scylladb-promoter added the Backport candidate label Oct 17, 2023

mykaul added this to the 5.4 milestone Oct 19, 2023

avikivity removed the Backport candidate label Oct 25, 2023

mykaul added the Backport candidate and backport/5.4 (Issues that should be backported to 5.4 branch once they'll be fixed) labels Nov 9, 2023

mykaul removed the Backport candidate and backport/5.4 (Issues that should be backported to 5.4 branch once they'll be fixed) labels Dec 13, 2023
