Scylla stuck during systemctl stop scylla-server.service (while running TerminateAndRemoveNodeMonkey) #8191

Closed
fruch opened this issue Mar 1, 2021 · 17 comments

@fruch
Contributor

fruch commented Mar 1, 2021

Installation details
Scylla version (or git commit hash): 4.5.dev-0.20210216.2f3b265da with build-id 6a7412abfaf73ef877a8e6ae22759227e89cac13 (ami-0208ab84477ace351)
Cluster size: 6 (i3.4xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0208ab84477ace351

Summary

During TerminateAndRemoveNodeMonkey, as part of the weekly longevity-50gb-3days run on master:
While calling systemctl stop scylla-server.service (on longevity-tls-50gb-3d-master-db-node-147194be-4), Scylla gets stuck, and the last log line seen is from the hints_manager being stopped.

2021-02-26T14:04:15+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !NOTICE  | sudo: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/bin/systemctl stop scylla-server.service
...
2021-02-26T14:04:15+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | systemd: Stopping Scylla Server...
2021-02-26T14:04:15+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 0] compaction_manager - Asked to stop
...
2021-02-26T14:04:26+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 13] hints_manager - Stopped
2021-02-26T14:04:26+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 6] hints_manager - Stopped
2021-02-26T14:04:26+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 9] hints_manager - Stopped
2021-02-26T14:04:27+00:00  longevity-tls-50gb-3d-master-db-node-147194be-4 !INFO    | scylla: [shard 4] hints_manager - Stopped

Logs

@fruch
Contributor Author

fruch commented Mar 1, 2021

Seems like it also happened during the weekly longevity-200gb-48h#104,
but there it was while running sudo systemctl restart scylla-server.service, which timed out after 10 minutes.

Logs

Looks like it got stuck in the same place:

Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 3] hints_manager - Asked to stop
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 5] hints_manager - Asked to stop
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 5] hints_manager - Stopped
Feb 27 02:12:58 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 scylla[117391]:  [shard 5] hints_manager - Asked to stop
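
A minimal sketch for spotting which shards still have an unanswered stop request in an excerpt like the one above. It assumes that every "Asked to stop" line is eventually matched by one "Stopped" line on the same shard; the function name pending_hints_stops and the regular expression are illustrative and only cover the two message forms shown in this thread.

```python
import re
from collections import Counter

# Matches lines such as "... [shard 5] hints_manager - Asked to stop".
LINE_RE = re.compile(r"\[shard (\d+)\] hints_manager - (Asked to stop|Stopped)")


def pending_hints_stops(lines):
    """Return {shard: count of 'Asked to stop' lines not yet matched by 'Stopped'}."""
    pending = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a hints_manager stop message
        shard, event = int(match.group(1)), match.group(2)
        pending[shard] += 1 if event == "Asked to stop" else -1
    # Keep only shards that were asked to stop more often than they reported stopping.
    return {shard: count for shard, count in sorted(pending.items()) if count > 0}


# The four lines quoted above (host prefix shortened for readability).
excerpt = [
    "Feb 27 02:12:58 node-4 scylla[117391]:  [shard 3] hints_manager - Asked to stop",
    "Feb 27 02:12:58 node-4 scylla[117391]:  [shard 5] hints_manager - Asked to stop",
    "Feb 27 02:12:58 node-4 scylla[117391]:  [shard 5] hints_manager - Stopped",
    "Feb 27 02:12:58 node-4 scylla[117391]:  [shard 5] hints_manager - Asked to stop",
]

print(pending_hints_stops(excerpt))  # {3: 1, 5: 1}
```

The excerpt is only a fragment of the full journal, so the result is indicative: it shows where to look, not which shard is actually hung.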

And after about 15 min, it seems systemd timed out and scylla-server was started again:

Feb 27 02:28:53 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 systemd[1]: Stopped Scylla Server.
Feb 27 02:28:53 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 systemd[1]: Unit scylla-server.service entered failed state.
Feb 27 02:28:53 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 systemd[1]: scylla-server.service failed.
Feb 27 02:28:53 longevity-200gb-48h-verify-limited--db-node-ad480eeb-4 systemd[1]: Starting Scylla Server...

@fruch fruch assigned fruch and unassigned fruch Mar 1, 2021
@slivne slivne added the bug label Mar 1, 2021
@slivne
Contributor

slivne commented Mar 1, 2021

can we get a stuck node ?

@fruch
Contributor Author

fruch commented Mar 1, 2021

can we get a stuck node ?

one, systemd seems to time out after 15 min regardless (at least in the sudo systemctl restart scylla-server.service example)

we might be able to run the 48h case with only StopStartMonkey, but it might depend on other nemeses that ran before.

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Mar 1, 2021
@fruch
Contributor Author

fruch commented Mar 1, 2021

can we get a stuck node ?

one, systemd seems to time out after 15 min regardless (at least in the sudo systemctl restart scylla-server.service example)

we might be able to run the 48h case with only StopStartMonkey, but it might depend on other nemeses that ran before.

trying it here:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/reproducers/job/8191_repoducer/1/

@fruch
Contributor Author

fruch commented Mar 2, 2021

@slivne @bhalevy
It reproduced after ~3h:
longevity-200gb-48h-verify-limited--db-node-bef564a2-4 [13.49.69.246 | 10.0.1.181]

the cluster is all yours.

but again, systemd is wiping the evidence and forcefully killing the process:

Mar 01 19:21:06 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: Stopping Scylla Server...
...
Mar 01 19:36:06 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: scylla-server.service stop-sigterm timed out. Killing.
Mar 01 19:37:04 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: scylla-server.service: main process exited, code=killed, status=9/KILL
Mar 01 19:37:04 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: Stopped Scylla Server.
Mar 01 19:37:04 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: Unit scylla-server.service entered failed state.
Mar 01 19:37:04 longevity-200gb-48h-verify-limited--db-node-bef564a2-4 systemd[1]: scylla-server.service failed.
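
A small sketch that measures how long systemd waited before killing the process, using the journal lines above. The helper names parse_journal_time and seconds_between are illustrative; the journal's short timestamp carries no year, so 2021 is assumed, and month abbreviations are assumed to be in English.

```python
from datetime import datetime


def parse_journal_time(line, year=2021):
    """Parse the leading 'Mar 01 19:21:06' stamp of a journal line (year is assumed)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first containing end_marker."""
    start = next(parse_journal_time(l) for l in lines if start_marker in l)
    end = next(parse_journal_time(l) for l in lines if end_marker in l)
    return (end - start).total_seconds()


# Lines taken from the excerpt above (host prefix shortened for readability).
journal = [
    "Mar 01 19:21:06 node-4 systemd[1]: Stopping Scylla Server...",
    "Mar 01 19:36:06 node-4 systemd[1]: scylla-server.service stop-sigterm timed out. Killing.",
    "Mar 01 19:37:04 node-4 systemd[1]: scylla-server.service: main process exited, code=killed, status=9/KILL",
]

print(seconds_between(journal, "Stopping Scylla Server", "stop-sigterm timed out"))  # 900.0
print(seconds_between(journal, "Stopping Scylla Server", "main process exited"))     # 958.0
```

The 900 seconds between the stop request and the "timed out" message matches the 15 min mentioned earlier in this thread; the process itself exited about a minute later.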

@bhalevy
Member

bhalevy commented Mar 2, 2021

@xemul please look into this

@xemul
Contributor

xemul commented Mar 2, 2021

I believe this is a duplicate of #8079 -- the suspect be8c359 is there, and the node gets stuck shutting down the hints manager.

@slivne
Contributor

slivne commented Mar 2, 2021

@xemul strange, the version referenced above (2f3b265) does include the patch be8c359 you pointed to

@xemul
Contributor

xemul commented Mar 2, 2021

@slivne, yes, the patch be8c359 is the one that seems to cause the hints manager stop to hang, and it's present in the Scylla builds from both bugs, this one and that one

@slivne
Contributor

slivne commented Mar 2, 2021

Hm ... when we shut down we flush memtables - this is not supposed to increase commitlog usage - it's actually supposed to clear the commitlogs.

@slivne slivne assigned elcallio and unassigned xemul and bhalevy Mar 2, 2021
@fruch
Contributor Author

fruch commented Mar 2, 2021

@elcallio so the cluster with the reproducer isn't needed ? (I'm going to terminate it)

@elcallio
Contributor

elcallio commented Mar 9, 2021

If this is a hanging hints manager, it should be addressed by added53

@roydahan

I have this node where it reproduced if someone needs it: 13.49.78.133.

@slivne
Contributor

slivne commented Mar 16, 2021

@elcallio can you please check if this is the issue

@roydahan

It seems that I hit this issue in 4.4.rc3-0.20210304.c2d924757 as part of longevity-counters-multidc-test.

During an ENOSPC nemesis, after node-4 was filled and hit the expected "storage I/O error", the nemesis cleared the space and restarted the service with "sudo systemctl restart scylla-server.service".
However, in this case the node failed to stop in the expected time, with the same behaviour as described above:

2021-03-10 22:38:28  Running command "sudo systemctl restart scylla-server.service"...

2021-03-10T22:38:28+00:00  systemd: Stopping Scylla Server...

2021-03-10T22:38:31+00:00  scylla: [shard 5] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 4] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 7] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 6] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 0] hints_manager - Stopped
2021-03-10T22:38:31+00:00  scylla: [shard 1] hints_manager - Stopped

This is the last message from scylla until:

2021-03-10T22:54:26+00:00  systemd: scylla-server.service: main process exited, code=killed, status=9/KILL
2021-03-10T22:54:26+00:00 systemd: Stopped Scylla Server.
2021-03-10T22:54:26+00:00 systemd: Unit scylla-server.service entered failed state.
2021-03-10T22:54:26+00:00 systemd: scylla-server.service failed.

Logs available in:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/6be1383c-4391-4e14-ad9f-48dc17120baa/20210310_230053/db-cluster-6be1383c.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/6be1383c-4391-4e14-ad9f-48dc17120baa/20210310_230053/loader-set-6be1383c.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/6be1383c-4391-4e14-ad9f-48dc17120baa/20210310_230053/monitor-set-6be1383c.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/6be1383c-4391-4e14-ad9f-48dc17120baa/20210310_230053/sct-runner-6be1383c.zip

@bhalevy
Member

bhalevy commented Mar 17, 2021

Looks like #8079
@elcallio / @xemul is 4.4 exposed?

@bhalevy
Member

bhalevy commented Mar 21, 2021

Duplicate of #8079

@bhalevy bhalevy marked this as a duplicate of #8079 Mar 21, 2021
@bhalevy bhalevy closed this as completed Mar 21, 2021