
init - Startup failed: seastar::sleep_aborted (Sleep is aborted) #12898

Closed
1 of 2 tasks
juliayakovlev opened this issue Feb 16, 2023 · 12 comments
@juliayakovlev

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Problem
Startup failed on a new node that was created as a replacement for a terminated/decommissioned node:

Feb 09 18:18:25 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)

Description

This failure happened twice during this test.
The test runs with workload prioritization (I am not sure whether it is related to the issue).

  1. The first time it happened during the decommission nemesis. longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-1 was decommissioned and the new node longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 (3.250.19.195 | 10.4.0.18) was added and started:
Feb 09 18:18:06 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - Scylla version 2022.1.5-0.20230108.8c2c21866 with build-id 90676755bb7af26527b54cf1a5afb6498162afba starting ...

About 14 seconds later the system_auth.roles table was created, and the node started waiting for gossip:

Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] schema_tables - Creating system_auth.roles id=5bc52802-de25-35ed-aeab-188eecebb090 version=ea493725-02da-35db-a811-8c3d5602f317
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] schema_tables - Schema version changed to f337bf8c-8ae7-3fb1-8736-765babb83237
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - starting batchlog manager
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - starting load meter
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - starting cf cache hit rate calculator
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - starting view update backlog broker
Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - Waiting for gossip to settle before accepting client requests...

Immediately after that, Scylla was stopped:

Feb 09 18:18:20 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] compaction - [Compact system_schema.keyspaces 21a9ff60-a8a6-11ed-9d76-7a1eec6b1387] Compacted 7 sstables to [/var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/md-280-big-Data.db:level=0]. 93kB to 14kB (~15% of original) in 6ms = 2MB/s. ~896 total partitions merged to 6.
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: Stopping Scylla JMX...
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: scylla-jmx.service: Main process exited, code=exited, status=143/n/a
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: scylla-jmx.service: Failed with result 'exit-code'.
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: Stopped Scylla JMX.
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] compaction_manager - Asked to stop
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 systemd[1]: Stopping Scylla Server...
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 6] compaction_manager - Stopped
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes=[]: seastar::sleep_aborted (Sleep is aborted)
Feb 09 18:18:23 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] gossip - failure_detector_loop: Finished main loop

And startup failed:

Feb 09 18:18:25 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 scylla[9336]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)
  2. The second time it happened during the terminate-and-replace nemesis. longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-4 was terminated and the new node longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 (34.244.86.108 | 10.4.1.190) was starting:
Feb 09 19:43:13 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - Scylla version 2022.1.5-0.20230108.8c2c21866 with build-id 90676755bb7af26527b54cf1a5afb6498162afba starting ...

About 13 seconds later the system_auth.roles table was created, and the node started waiting for gossip:

Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] schema_tables - Creating system_auth.roles id=5bc52802-de25-35ed-aeab-188eecebb090 version=ad4e69b2-6508-37a7-a378-0d04d82bdaca
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] schema_tables - Schema version changed to 0ea3d465-4a91-30ab-9fef-1bb0a71e60e6
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - starting batchlog manager
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - starting load meter
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - starting cf cache hit rate calculator
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - starting view update backlog broker
Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - Waiting for gossip to settle before accepting client requests...

Immediately after that, Scylla was stopped:

Feb 09 19:43:26 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] compaction - [Compact system_schema.scylla_tables 05507f40-a8b2-11ed-8da4-635f0950790e] Compacted 4 sstables to [/var/lib/scylla/data/system_schema/scylla_tables-5d912ff1f7593665b2c88042ab5103dd/md-182-big-Data.db:level=0]. 54kB to 14kB (~26% of original) in 9ms = 1MB/s. ~512 total partitions merged to 6.
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Stopping Scylla JMX...
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: scylla-jmx.service: Main process exited, code=exited, status=143/n/a
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: scylla-jmx.service: Failed with result 'exit-code'.
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Stopped Scylla JMX.
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] compaction_manager - Asked to stop
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Stopping Scylla Server...
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes=[]: seastar::sleep_aborted (Sleep is aborted)
Feb 09 19:43:29 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] gossip - failure_detector_loop: Finished main loop

Startup failed:

Feb 09 19:43:31 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 scylla[9279]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)
Feb 09 19:44:44 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: scylla-server.service: Main process exited, code=exited, status=1/FAILURE
Feb 09 19:44:44 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: scylla-server.service: Failed with result 'exit-code'.
Feb 09 19:44:44 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Stopped Scylla Server.
Feb 09 19:44:47 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 systemd[1]: Starting Scylla Server...

No errors, coredumps, aborts, or faults. Scylla started successfully after that.

Impact

Scylla started successfully after that.

How frequently does it reproduce?

Twice during the same test run.

Installation details

Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 2022.1.5-20230108.8c2c21866 with build-id 90676755bb7af26527b54cf1a5afb6498162afba
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.1/scylla-enterprise-x86_64-package-2022.1.5.0.20230108.8c2c21866.tar.gz

Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-8 (34.244.86.108 | 10.4.1.190) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7 (3.250.19.195 | 10.4.0.18) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-6 (18.203.111.207 | 10.4.0.81) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-5 (52.18.185.36 | 10.4.0.76) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-4 (34.241.207.150 | 10.4.0.35) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-3 (34.245.22.37 | 10.4.0.113) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-2 (34.245.97.254 | 10.4.2.60) (shards: 14)
  • longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-1 (3.252.194.109 | 10.4.3.61) (shards: 14)

OS / Image: ami-0287067067652acb9 (aws: eu-west-1)

Test: longevity-sla-100gb-4h-test
Test id: 910e5c8f-b2de-44e8-bf21-de82c605135d
Test name: enterprise-2022.1/SCT_Enterprise_Features/Workload_Prioritization/longevity-sla-100gb-4h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 910e5c8f-b2de-44e8-bf21-de82c605135d
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 910e5c8f-b2de-44e8-bf21-de82c605135d

Logs:

Jenkins job URL

@DoronArazii DoronArazii added this to the 5.x milestone Feb 20, 2023
@fgelcer fgelcer added the triage/master Looking for assignee label Apr 23, 2023
@yaronkaikov yaronkaikov added the P1 Urgent label Apr 23, 2023
@fruch
Contributor

fruch commented Apr 23, 2023

@fgelcer @yaronkaikov
the issue you are facing isn't related to this one; it's an SCT issue
I'm fixing it in scylladb/scylla-cluster-tests#6062

@fgelcer

fgelcer commented Apr 23, 2023

in this case, I just deleted my comment so as not to mess up this one

@yaronkaikov
Contributor

in this case, I just deleted my comment so as not to mess up this one

Me too.

@DoronArazii DoronArazii removed the P1 Urgent label Apr 27, 2023
@DoronArazii DoronArazii added the P2 High Priority label Apr 27, 2023
@DoronArazii DoronArazii modified the milestones: 5.x, 5.3 Apr 27, 2023
@bhalevy
Member

bhalevy commented Apr 27, 2023

What exactly is the issue?

For example, for the first (decommission) case, I see:

2023-02-09T18:18:23+00:00 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7   !NOTICE | sudo[9769]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service

Why is the service stopped?
Is it part of the test?

Is the issue that scylla is stopped, or that it doesn't stop cleanly, and it prints this error?

2023-02-09T18:18:26+00:00 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7      !ERR | scylla[9336]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)

@DoronArazii DoronArazii added the status/missing information Some details are missing to handle the case label May 1, 2023
@DoronArazii

@juliayakovlev ^^

@juliayakovlev
Author

@bhalevy

What exactly is the issue?

For example, for the first (decommission) case, I see:

2023-02-09T18:18:23+00:00 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7   !NOTICE | sudo[9769]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service

Why is the service stopped? Is it part of the test?

Is the issue that scylla is stopped, or that it doesn't stop cleanly, and it prints this error?

I re-checked; we really do restart Scylla. So the issue is that Scylla was stopped with a "Startup failed" error:

2023-02-09T18:18:26+00:00 longevity-sla-100gb-4h-2023-1-db-node-910e5c8f-7      !ERR | scylla[9336]:  [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)

@bhalevy
Member

bhalevy commented May 1, 2023

So this is very minor.
We just need to attenuate the sleep_aborted exception in

scylladb/main.cc

Line 1765 in 1cefb66

startlog.error("Startup failed: {}", std::current_exception());

and exit cleanly in this case.
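
A minimal sketch of what that could look like, assuming a top-level try/catch around the startup sequence (only the startlog.error() line above is the actual Scylla code; the surrounding structure here is an illustration):

```cpp
// Hypothetical shape of the suggested change: treat an aborted startup
// sleep as a requested shutdown rather than a startup failure.
try {
    // ... startup sequence that may sleep, e.g. waiting for gossip to settle ...
} catch (const seastar::sleep_aborted&) {
    // Shutdown was requested while startup was still sleeping;
    // this is not an error, so log quietly and exit cleanly.
    startlog.info("Startup interrupted by shutdown request");
    return 0;
} catch (...) {
    startlog.error("Startup failed: {}", std::current_exception());
    return 1;
}
```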

@bhalevy
Member

bhalevy commented May 1, 2023

@xemul ^^

@juliayakovlev
Author

So this is very minor. We just need to attenuate the sleep_aborted exception in

scylladb/main.cc

Line 1765 in 1cefb66

startlog.error("Startup failed: {}", std::current_exception());

and exit cleanly in this case.

@bhalevy When do you plan to fix it? Right now a test can fail with this error even though it is minor.
@roydahan should we decrease the error severity / ignore this error?

@fruch
Contributor

fruch commented May 1, 2023

So this is very minor. We just need to attenuate the sleep_aborted exception in

scylladb/main.cc

Line 1765 in 1cefb66

startlog.error("Startup failed: {}", std::current_exception());

and exit cleanly in this case.

@bhalevy When do you plan to fix it? Right now a test can fail with this error even though it is minor.
@roydahan should we decrease the error severity / ignore this error?

We can't ignore such an error; if Scylla fails to boot, it's not something we can easily ignore.

@juliayakovlev
Author

juliayakovlev commented May 2, 2023

So this is very minor. We just need to attenuate the sleep_aborted exception in

scylladb/main.cc

Line 1765 in 1cefb66

startlog.error("Startup failed: {}", std::current_exception());

and exit cleanly in this case.

@bhalevy When do you plan to fix it? Right now a test can fail with this error even though it is minor.
@roydahan should we decrease the error severity / ignore this error?

We can't ignore such an error; if Scylla fails to boot, it's not something we can easily ignore.

@fruch
Startup did not fail. We configure a new node and restart Scylla. This error appears when we (from SCT) stop Scylla. It is a wrong error reported while stopping the service.

xemul added a commit to xemul/scylla that referenced this issue May 25, 2023
When scylla starts it may go to sleep along the way before the "serving"
message appears. If SIGINT is sent at that time the whole thing unrolls
and the main code ends up catching the sleep_aborted exception, printing
the error in logs and exiting with non-zero code. However, that's not an
error, just the start was interrupted earlier than it was expected by
the stop_signal thing.

fixes: scylladb#12898

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
denesb pushed a commit that referenced this issue May 29, 2023
When scylla starts it may go to sleep along the way before the "serving"
message appears. If SIGINT is sent at that time the whole thing unrolls
and the main code ends up catching the sleep_aborted exception, printing
the error in logs and exiting with non-zero code. However, that's not an
error, just the start was interrupted earlier than it was expected by
the stop_signal thing.

fixes: #12898

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14034
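
For reference, here is a minimal standalone Seastar program illustrating the mechanism the commit message describes: an abortable sleep that is interrupted (as the stop signal does during startup) completes with seastar::sleep_aborted, which the caller can treat as a clean shutdown instead of an error. This is a sketch of the general pattern, not Scylla's actual startup code:

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/abort_source.hh>
#include <seastar/core/shared_ptr.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        auto as = seastar::make_lw_shared<seastar::abort_source>();
        // Simulate a shutdown request arriving 100ms into a long "startup" sleep.
        (void)seastar::sleep(std::chrono::milliseconds(100)).then([as] {
            as->request_abort();
        });
        return seastar::sleep_abortable(std::chrono::seconds(10), *as)
            .handle_exception_type([] (const seastar::sleep_aborted&) {
                // Not a failure: the sleep was cut short by the abort request,
                // so treat it as a clean, requested shutdown.
                std::cout << "sleep aborted; exiting cleanly\n";
            })
            .finally([as] {}); // keep the abort_source alive until the sleep resolves
    });
}
```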
@DoronArazii DoronArazii removed triage/master Looking for assignee status/missing information Some details are missing to handle the case labels May 30, 2023
@DoronArazii DoronArazii removed this from the 5.3 milestone May 30, 2023
@DoronArazii DoronArazii added this to the 5.4 milestone May 30, 2023
@denesb
Contributor

denesb commented Dec 18, 2023

This issue is not relevant to production deployments, not backporting.
