Scylla stuck during `systemctl stop scylla-server.service` (while running TerminateAndRemoveNodeMonkey) #8191

fruch · 2021-03-01T10:20:27Z

Installation details
Scylla version (or git commit hash): 4.5.dev-0.20210216.2f3b265da with build-id 6a7412abfaf73ef877a8e6ae22759227e89cac13 (ami-0208ab84477ace351)
Cluster size: 6 (i3.4xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0208ab84477ace351

Summary

During TerminateAndRemoveNodeMonkey, as part of the weekly longevity-50gb-3days run on master:
While calling systemctl stop scylla-server.service (on longevity-tls-50gb-3d-master-db-node-147194be-4), Scylla got stuck, and the last log seen is from the hints_manager being stopped.

2021-02-26T14:04:15+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !NOTICE  | sudo: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/bin/systemctl stop scylla-server.service
...
2021-02-26T14:04:15+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | systemd: Stopping Scylla Server...
2021-02-26T14:04:15+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 0] compaction_manager - Asked to stop
...
2021-02-26T14:04:26+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 13] hints_manager - Stopped
2021-02-26T14:04:26+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 6] hints_manager - Stopped
2021-02-26T14:04:26+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 9] hints_manager - Stopped
2021-02-26T14:04:27+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 4] hints_manager - Stopped

Logs

fruch · 2021-03-01T10:34:07Z

It seems this also happened during the weekly longevity-200gb-48h#104 run,
but there, sudo systemctl restart scylla-server.service timed out after 10 min.

Logs

db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/ad480eeb-4248-4276-b93a-c88ef24cf6f9/20210228_051940/db-cluster-ad480eeb.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/ad480eeb-4248-4276-b93a-c88ef24cf6f9/20210228_051940/loader-set-ad480eeb.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/ad480eeb-4248-4276-b93a-c88ef24cf6f9/20210228_051940/monitor-set-ad480eeb.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/ad480eeb-4248-4276-b93a-c88ef24cf6f9/20210228_051940/sct-runner-ad480eeb.zip
Restore Monitor Stack command: $ hydra investigate show-monitor ad480eeb-4248-4276-b93a-c88ef24cf6f9
Show all stored logs command: $ hydra investigate show-logs ad480eeb-4248-4276-b93a-c88ef24cf6f9

Looks like it got stuck in the same place:

Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 3] hints_manager - Asked to stop
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 5] hints_manager - Asked to stop
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 5] hints_manager - Stopped
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 5] hints_manager - Asked to stop

And after 15min, it seems systemd timed out, and scylla-server started again:

Feb 27 02:28:53 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 systemd[1]: Stopped Scylla Server.
Feb 27 02:28:53 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 systemd[1]: Unit scylla-server.service entered failed state.
Feb 27 02:28:53 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 systemd[1]: scylla-server.service failed.
Feb 27 02:28:53 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 systemd[1]: Starting Scylla Server...
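To narrow down which shards never finished stopping, the hints_manager messages can be paired up per shard. The following is a minimal sketch, not part of the original report: the function name `pending_hints_stops` and the regular expression are assumptions based only on the log format quoted above, and the sample is just the four-line excerpt, so the result describes the excerpt rather than the full log.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
#   "... scylla[117391]:  [shard 5] hints_manager - Asked to stop"
HINTS_RE = re.compile(
    r"\[shard\s+(\d+)\]\s+hints_manager - (Asked to stop|Stopped)\s*$"
)


def pending_hints_stops(lines):
    """Return {shard: n} where n 'Asked to stop' messages have no matching 'Stopped'."""
    pending = Counter()
    for line in lines:
        match = HINTS_RE.search(line)
        if match is None:
            continue  # not a hints_manager stop message
        shard, event = int(match.group(1)), match.group(2)
        pending[shard] += 1 if event == "Asked to stop" else -1
    return {shard: n for shard, n in sorted(pending.items()) if n > 0}


sample = """\
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 3] hints_manager - Asked to stop
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 5] hints_manager - Asked to stop
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 5] hints_manager - Stopped
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 5] hints_manager - Asked to stop
""".splitlines()

print(pending_hints_stops(sample))  # {3: 1, 5: 1}
```

Running the same function over the full journal of the stuck node would list every shard that still has an outstanding stop request at the point where the log goes quiet.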

slivne · 2021-03-01T15:47:38Z

can we get a stuck node ?

fruch · 2021-03-01T16:15:10Z

can we get a stuck node ?

one, systemd seems to time out after 15min, regardless (i.e. in the sudo systemctl restart scylla-server.service example at least)

we might be able to run the 48h case with only StopStartMonkey, but it might depend on other nemeses that ran before.

fruch · 2021-03-01T16:20:42Z

can we get a stuck node ?

one, systemd seems to time out after 15min, regardless (i.e. in the sudo systemctl restart scylla-server.service example at least)

we might be able to run the 48h case with only StopStartMonkey, but it might depend on other nemeses that ran before.

trying it here:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/reproducers/job/8191_repoducer/1/

fruch · 2021-03-02T07:11:22Z

@slivne @bhalevy
It reproduced after ~3h:
longevity-200gb-48h-verify-limited--db-node-bef564a2-4 [13.49.69.246 | 10.0.1.181]

the cluster is all yours.

but again, systemd is wiping the evidence and forcefully killing the process:

Mar 01 19:21:06 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: Stopping Scylla Server...
...
Mar 01 19:36:06 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: scylla-server.service stop-sigterm timed out. Killing.
Mar 01 19:37:04 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: scylla-server.service: main process exited, code=killed, status=9/KILL
Mar 01 19:37:04 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: Stopped Scylla Server.
Mar 01 19:37:04 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: Unit scylla-server.service entered failed state.
Mar 01 19:37:04 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: scylla-server.service failed.
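The 15-minute window before systemd gives up can be read straight off the journal lines above. This is a small sketch under stated assumptions: the helper name `stop_duration` is hypothetical, the year is passed in because the journal timestamps omit it, and `%b` parsing assumes an English/C locale.

```python
from datetime import datetime


def stop_duration(lines, year=2021):
    """Time between 'Stopping Scylla Server' and the 'stop-sigterm timed out' line."""
    start = end = None
    for line in lines:
        if "Stopping Scylla Server" in line:
            marker = "start"
        elif "stop-sigterm timed out" in line:
            marker = "end"
        else:
            continue  # skip elided ("...") and unrelated lines
        # The first 15 characters are the timestamp, e.g. "Mar 01 19:21:06".
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        if marker == "start":
            start = stamp
        else:
            end = stamp
    if start is None or end is None:
        return None
    return end - start


sample = """\
Mar 01 19:21:06 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: Stopping Scylla Server...
...
Mar 01 19:36:06 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: scylla-server.service stop-sigterm timed out. Killing.
""".splitlines()

print(stop_duration(sample))  # 0:15:00
```

Anyone who wants to inspect the hung process has to do so inside this window, before the SIGKILL that follows the timeout.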

bhalevy · 2021-03-02T07:50:38Z

@xemul please look into this

xemul · 2021-03-02T07:54:28Z

I believe this is a duplicate of #8079 -- the suspect be8c359 is there, and the node gets stuck shutting down the hints manager.

slivne · 2021-03-02T10:20:27Z

@xemul strange, the version referenced above (2f3b265) does include the patch be8c359 you pointed to

xemul · 2021-03-02T10:25:40Z

@slivne , yes, the patch be8c359 is the one that seems to cause hints manager stop to hang and it's present in scyllas from both bugs, this and that

slivne · 2021-03-02T11:00:12Z

Hm ... when we shut down we flush memtables - this is not supposed to increase commitlog usage - it's actually supposed to clear the commitlogs.

fruch · 2021-03-02T12:00:25Z

@elcallio so the cluster with the reproducer isn't needed ? (I'm going to terminate it)

elcallio · 2021-03-09T12:17:21Z

If this is a hanging hints manager, it should be addressed by added53

roydahan · 2021-03-12T02:36:15Z

I have this node where it reproduced if someone needs it: 13.49.78.133.

slivne · 2021-03-16T12:19:53Z

@elcallio can you please check if this is the issue

roydahan · 2021-03-17T00:06:35Z

It seems that I hit this issue in 4.4.rc3-0.20210304.c2d924757 as part of longevity-counters-multidc-test.

During an ENOSPC nemesis, after node-4 was filled and hit the expected "storage I/O error", the nemesis cleared the space and restarted the service with: "sudo systemctl restart scylla-server.service".
However, in this case the node failed to stop in the expected time, with the same behaviour as described above:

2021-03-10 22:38:28  Running command "sudo systemctl restart scylla-server.service"...

2021-03-10T22:38:28+00:00  systemd: Stopping Scylla Server...

2021-03-10T22:38:31+00:00  scylla: [shard 5] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 4] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 7] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 6] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 0] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 1] hints_manager - Stopped

This is the last message from scylla until:

2021-03-10T22:54:26+00:00 systemd: scylla-server.service: main process exited, code=killed, status=9/KILL
2021-03-10T22:54:26+00:00 systemd: Stopped Scylla Server.
2021-03-10T22:54:26+00:00 systemd: Unit scylla-server.service entered failed state.
2021-03-10T22:54:26+00:00 systemd: scylla-server.service failed.

Logs available in:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/6be1383c-4391-4e14-ad9f-48dc17120baa/20210310_230053/db-cluster-6be1383c.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/6be1383c-4391-4e14-ad9f-48dc17120baa/20210310_230053/loader-set-6be1383c.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/6be1383c-4391-4e14-ad9f-48dc17120baa/20210310_230053/monitor-set-6be1383c.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/6be1383c-4391-4e14-ad9f-48dc17120baa/20210310_230053/sct-runner-6be1383c.zip

bhalevy · 2021-03-17T08:13:39Z

Looks like #8079
@elcallio / @xemul is 4.4 exposed?

bhalevy · 2021-03-21T16:12:41Z

Duplicate of #8079

fruch assigned fruch and unassigned fruch Mar 1, 2021

slivne assigned bhalevy Mar 1, 2021

slivne added the bug label Mar 1, 2021

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Mar 1, 2021

Trying to reproducer for scylladb/scylladb#8191

1878f52

bhalevy assigned xemul Mar 2, 2021

slivne assigned elcallio and unassigned xemul and bhalevy Mar 2, 2021

bhalevy marked this as a duplicate of #8079 Mar 21, 2021

bhalevy closed this as completed Mar 21, 2021
