
init - Startup failed: seastar::sleep_aborted (Sleep is aborted) #12898

Closed
1 of 2 tasks
juliayakovlev opened this issue Feb 16, 2023 · 12 comments
@juliayakovlev

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Problem
Startup failed on a new node that was created as a replacement for a terminated/decommissioned node:

Feb 09 18:18:25 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)

Description

This failure happened twice during this test.
The test runs with workload prioritization (I am not sure whether it is related to the issue).

  1. The first time it happened during the decommission nemesis. longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-1 was decommissioned and the new node longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 (3.250.19.195 | 10.4.0.18) was added and started:
Feb 09 18:18:06 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - Scylla version 2022.1.5-0.20230108.8c2c21866 with build-id 90676755bb7af26527b54cf1a5afb6498162afba starting ...

About 14 seconds later the system_auth.roles table was created, and the node started waiting for gossip:

Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] schema_tables - Creating system_auth.roles id=5bc52802-de25-35ed-aeab-188eecebb090 version=ea493725-02da-35db-a811-8c3d5602f317
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] schema_tables - Schema version changed to f337bf8c-8ae7-3fb1-8736-765babb83237
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - starting batchlog manager
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - starting load meter
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - starting cf cache hit rate calculator
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - starting view update backlog broker
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - Waiting for gossip to settle before accepting client requests...

Immediately after that, Scylla was stopped:

Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] compaction - [Compact system_schema.keyspaces 21a9ff60-a8a6-11ed-9d76-7a1eec6b1387] Compacted 7 sstables to [/var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/md-280-big-Data.db:level=0]. 93kB to 14kB (~15% of original) in 6ms = 2MB/s. ~896 total partitions merged to 6.
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: Stopping Scylla JMX...
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: scylla-jmx.service: Main process exited, code=exited, status=143/n/a
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: scylla-jmx.service: Failed with result 'exit-code'.
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: Stopped Scylla JMX.
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] compaction_manager - Asked to stop
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: Stopping Scylla Server...
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 6] compaction_manager - Stopped
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes=[]: seastar::sleep_aborted (Sleep is aborted)
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] gossip - failure_detector_loop: Finished main loop

And startup failed:

Feb 09 18:18:25 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)
  2. The second time it happened during the terminate-and-replace nemesis. longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-4 was terminated and the new node longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 (34.244.86.108 | 10.4.1.190) was starting:
Feb 09 19:43:13 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - Scylla version 2022.1.5-0.20230108.8c2c21866 with build-id 90676755bb7af26527b54cf1a5afb6498162afba starting ...

About 13 seconds later the system_auth.roles table was created, and the node started waiting for gossip:

Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] schema_tables - Creating system_auth.roles id=5bc52802-de25-35ed-aeab-188eecebb090 version=ad4e69b2-6508-37a7-a378-0d04d82bdaca
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] schema_tables - Schema version changed to 0ea3d465-4a91-30ab-9fef-1bb0a71e60e6
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - starting batchlog manager
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - starting load meter
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - starting cf cache hit rate calculator
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - starting view update backlog broker
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - Waiting for gossip to settle before accepting client requests...

Immediately after that, Scylla was stopped:

Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] compaction - [Compact system_schema.scylla_tables 05507f40-a8b2-11ed-8da4-635f0950790e] Compacted 4 sstables to [/var/lib/scylla/data/system_schema/scylla_tables-5d912ff1f7593665b2c88042ab5103dd/md-182-big-Data.db:level=0]. 54kB to 14kB (~26% of original) in 9ms = 1MB/s. ~512 total partitions merged to 6.
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Stopping Scylla JMX...
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: scylla-jmx.service: Main process exited, code=exited, status=143/n/a
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: scylla-jmx.service: Failed with result 'exit-code'.
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Stopped Scylla JMX.
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] compaction_manager - Asked to stop
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Stopping Scylla Server...
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes=[]: seastar::sleep_aborted (Sleep is aborted)
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] gossip - failure_detector_loop: Finished main loop

Startup failed:

Feb 09 19:43:31 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)
Feb 09 19:44:44 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: scylla-server.service: Main process exited, code=exited, status=1/FAILURE
Feb 09 19:44:44 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: scylla-server.service: Failed with result 'exit-code'.
Feb 09 19:44:44 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Stopped Scylla Server.
Feb 09 19:44:47 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Starting Scylla Server...

No errors, coredumps, aborts, or faults. Scylla started successfully after that.

Impact

Scylla started successfully after that.

How frequently does it reproduce?

Twice during the same test run.

Installation details

Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 2022.1.5-20230108.8c2c21866 with build-id 90676755bb7af26527b54cf1a5afb6498162afba
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.1/scylla-enterprise-x86_64-package-2022.1.5.0.20230108.8c2c21866.tar.gz

Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 (34.244.86.108 | 10.4.1.190) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 (3.250.19.195 | 10.4.0.18) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-6 (18.203.111.207 | 10.4.0.81) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-5 (52.18.185.36 | 10.4.0.76) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-4 (34.241.207.150 | 10.4.0.35) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-3 (34.245.22.37 | 10.4.0.113) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-2 (34.245.97.254 | 10.4.2.60) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-1 (3.252.194.109 | 10.4.3.61) (shards: 14)

OS / Image: ami-0287067067652acb9 (aws: eu-west-1)

Test: longevity-sla-100gb-4h-test
Test id: 910e5c8f-b2de-44e8-bf21-de82c605135d
Test name: enterprise-2022.1/SCT_Enterprise_Features/Workload_Prioritization/longevity-sla-100gb-4h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 910e5c8f-b2de-44e8-bf21-de82c605135d
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 910e5c8f-b2de-44e8-bf21-de82c605135d

Logs:

Jenkins job URL

@DoronArazii DoronArazii added this to the 5.x milestone Feb 20, 2023
@fgelcer fgelcer added the triage/master Looking for assignee label Apr 23, 2023
@yaronkaikov yaronkaikov added the P1 Urgent label Apr 23, 2023
@fruch
Contributor

fruch commented Apr 23, 2023

@fgelcer @yaronkaikov
the issue you are facing isn't related to this one; it's an SCT issue
I'm fixing it in scylladb/scylla-cluster-tests#6062

@fgelcer

fgelcer commented Apr 23, 2023

in this case, I just deleted my comment so as not to mess up this one

@yaronkaikov
Contributor

in this case, I just deleted my comment so as not to mess up this one

Me too.

@DoronArazii DoronArazii removed the P1 Urgent label Apr 27, 2023
@DoronArazii DoronArazii added the P2 High Priority label Apr 27, 2023
@DoronArazii DoronArazii modified the milestones: 5.x, 5.3 Apr 27, 2023
@bhalevy
Member

bhalevy commented Apr 27, 2023

What exactly is the issue?

For example, for the first (decommission) case, I see:

2023-02-09T18:18:23+00:00 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7   !NOTICE | sudo[9769]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service

Why is the service stopped?
Is it part of the test?

Is the issue that scylla is stopped, or that it doesn't stop cleanly, and it prints this error?

2023-02-09T18:18:26+00:00 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7      !ERR | scylla[9336]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)

@DoronArazii DoronArazii added the status/missing information Some details are missing to handle the case label May 1, 2023
@DoronArazii

@juliayakovlev ^^

@juliayakovlev
Author

@bhalevy

What exactly is the issue?

For example, for the first (decommission) case, I see:

2023-02-09T18:18:23+00:00 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7   !NOTICE | sudo[9769]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service

Why is the service stopped? Is it part of the test?

Is the issue that scylla is stopped, or that it doesn't stop cleanly, and it prints this error?

I re-checked; we really do restart Scylla. So the issue is that Scylla was stopped with a "Startup failed" error:

2023-02-09T18:18:26+00:00 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7      !ERR | scylla[9336]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)

@bhalevy
Member

bhalevy commented May 1, 2023

So this is very minor.
We just need to attenuate the sleep_aborted exception in

scylladb/main.cc

Line 1765 in 1cefb66

startlog.error("Startup failed: {}", std::current_exception());

and exit cleanly in this case.
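
A minimal sketch of what that could look like, assuming a top-level try/catch around the startup sequence (only the startlog.error() line above is the actual Scylla code; the surrounding structure here is an illustration):

```cpp
// Hypothetical shape of the suggested change: treat an aborted startup
// sleep as a requested shutdown rather than a startup failure.
try {
    // ... startup sequence that may sleep, e.g. waiting for gossip to settle ...
} catch (const seastar::sleep_aborted&) {
    // Shutdown was requested while startup was still sleeping;
    // this is not an error, so log quietly and exit cleanly.
    startlog.info("Startup interrupted by shutdown request");
    return 0;
} catch (...) {
    startlog.error("Startup failed: {}", std::current_exception());
    return 1;
}
```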

@bhalevy
Member

bhalevy commented May 1, 2023

@xemul ^^

@juliayakovlev
Author

So this is very minor. We just need to attenuate the sleep_aborted exception in

scylladb/main.cc

Line 1765 in 1cefb66

startlog.error("Startup failed: {}", std::current_exception());

and exit cleanly in this case.

@bhalevy When do you plan to fix it? Right now a test can fail with this error even though it is minor.
@roydahan should we decrease the error severity / ignore this error?

@fruch
Contributor

fruch commented May 1, 2023

So this is very minor. We just need to attenuate the sleep_aborted exception in

scylladb/main.cc

Line 1765 in 1cefb66

startlog.error("Startup failed: {}", std::current_exception());

and exit cleanly in this case.

@bhalevy When do you plan to fix it? Right now a test can fail with this error even though it is minor.
@roydahan should we decrease the error severity / ignore this error?

We can't ignore such an error; if Scylla fails to boot, it's not something we can easily ignore.

@juliayakovlev
Author

juliayakovlev commented May 2, 2023

So this is very minor. We just need to attenuate the sleep_aborted exception in

scylladb/main.cc

Line 1765 in 1cefb66

startlog.error("Startup failed: {}", std::current_exception());

and exit cleanly in this case.

@bhalevy When do you plan to fix it? Right now a test can fail with this error even though it is minor.
@roydahan should we decrease the error severity / ignore this error?

We can't ignore such an error; if Scylla fails to boot, it's not something we can easily ignore.

@fruch
Startup did not fail. We configure a new node and restart Scylla. This error appears when we (from SCT) stop Scylla. It is a wrong error reported while stopping the service.

xemul added a commit to xemul/scylla that referenced this issue May 25, 2023
When scylla starts it may go to sleep along the way before the "serving"
message appears. If SIGINT is sent at that time the whole thing unrolls
and the main code ends up catching the sleep_aborted exception, printing
the error in logs and exiting with non-zero code. However, that's not an
error, just the start was interrupted earlier than it was expected by
the stop_signal thing.

fixes: scylladb#12898

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
denesb pushed a commit that referenced this issue May 29, 2023
When scylla starts it may go to sleep along the way before the "serving"
message appears. If SIGINT is sent at that time the whole thing unrolls
and the main code ends up catching the sleep_aborted exception, printing
the error in logs and exiting with non-zero code. However, that's not an
error, just the start was interrupted earlier than it was expected by
the stop_signal thing.

fixes: #12898

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14034
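
For reference, here is a minimal standalone Seastar program illustrating the mechanism the commit message describes: an abortable sleep that is interrupted (as the stop signal does during startup) completes with seastar::sleep_aborted, which the caller can treat as a clean shutdown instead of an error. This is a sketch of the general pattern, not Scylla's actual startup code:

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/abort_source.hh>
#include <seastar/core/shared_ptr.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        auto as = seastar::make_lw_shared<seastar::abort_source>();
        // Simulate a shutdown request arriving 100ms into a long "startup" sleep.
        (void)seastar::sleep(std::chrono::milliseconds(100)).then([as] {
            as->request_abort();
        });
        return seastar::sleep_abortable(std::chrono::seconds(10), *as)
            .handle_exception_type([] (const seastar::sleep_aborted&) {
                // Not a failure: the sleep was cut short by the abort request,
                // so treat it as a clean, requested shutdown.
                std::cout << "sleep aborted; exiting cleanly\n";
            })
            .finally([as] {}); // keep the abort_source alive until the sleep resolves
    });
}
```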
@DoronArazii DoronArazii removed triage/master Looking for assignee status/missing information Some details are missing to handle the case labels May 30, 2023
@DoronArazii DoronArazii removed this from the 5.3 milestone May 30, 2023
@DoronArazii DoronArazii added this to the 5.4 milestone May 30, 2023
@denesb
Contributor

denesb commented Dec 18, 2023

This issue is not relevant to production deployments, not backporting.
