Delay coredump.service shutdown after scylla.service shutdown #436

soyacz · 2023-02-15T09:32:39Z

Issue description

This issue is a regression.
It is unknown if this issue is a regression.

Recently, in a test that executed a soft reboot of a node, scylla hit an 'aborting on shard' error. It did not create a coredump due to:

!ERR | systemd-coredump[8354]: Failed to connect to coredump service: Connection refused

It looks like the reboot triggers shutdown of the coredump service without waiting for the scylla service to stop. Can we make it stop only after scylla is down?

Impact

No coredump, which makes investigating issues harder.

Installation details

Kernel Version: 5.15.0-1028-aws
Scylla version (or git commit hash): 5.2.0~rc1-20230207.8ff4717fd010 with build-id 78fbb2c25e9244a62f57988313388a0260084528

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

longevity-5gb-1h-SoftRebootNodeMonk-db-node-249f30ed-3 (3.252.166.103 | 10.4.3.176) (shards: 2)
longevity-5gb-1h-SoftRebootNodeMonk-db-node-249f30ed-2 (34.245.61.44 | 10.4.1.56) (shards: 2)
longevity-5gb-1h-SoftRebootNodeMonk-db-node-249f30ed-1 (34.240.130.155 | 10.4.1.211) (shards: 2)

OS / Image: ami-05e1d6aa4f71f3f25 (aws: eu-west-1)

Test: longevity-5gb-1h-SoftRebootNodeMonkey-aws-test
Test id: 249f30ed-7007-4b8e-a320-1207ebca5e5d
Test name: scylla-5.2/nemesis/longevity-5gb-1h-SoftRebootNodeMonkey-aws-test
Test config file(s):

longevity-5gb-1h-SoftRebootNodeMonkey.yaml

Logs and commands

Restore Monitor Stack command: $ hydra investigate show-monitor 249f30ed-7007-4b8e-a320-1207ebca5e5d
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 249f30ed-7007-4b8e-a320-1207ebca5e5d

Logs:

db-cluster-249f30ed.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/249f30ed-7007-4b8e-a320-1207ebca5e5d/20230214_215652/db-cluster-249f30ed.tar.gz
sct-runner-249f30ed.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/249f30ed-7007-4b8e-a320-1207ebca5e5d/20230214_215652/sct-runner-249f30ed.tar.gz
monitor-set-249f30ed.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/249f30ed-7007-4b8e-a320-1207ebca5e5d/20230214_215652/monitor-set-249f30ed.tar.gz
loader-set-249f30ed.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/249f30ed-7007-4b8e-a320-1207ebca5e5d/20230214_215652/loader-set-249f30ed.tar.gz
parallel-timelines-report-249f30ed.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/249f30ed-7007-4b8e-a320-1207ebca5e5d/20230214_215652/parallel-timelines-report-249f30ed.tar.gz

Jenkins job URL


mykaul · 2023-02-15T14:13:27Z

Is this different than https://github.com/scylladb/scylla-enterprise/issues/2648 which is solved via scylladb/scylladb#12757 ?

fruch · 2023-02-15T14:18:05Z

Is this different than https://github.com/scylladb/scylla-enterprise/issues/2648 which is solved via scylladb/scylladb#12757 ?

No, this case is a coredump happening during shutdown of a node after systemd-coredump.socket is closed

Luckily for us this coredump happened in more cases.

syuu1228 · 2023-07-04T19:26:35Z

I was finally able to reproduce this after patching scylla to delay shutdown and raise SIGSEGV:

diff --git a/main.cc b/main.cc
index 35c4c25caa..1d167e3152 100644
--- a/main.cc
+++ b/main.cc
@@ -103,6 +103,8 @@
 
 #include <boost/algorithm/string/join.hpp>
 
+#include <signal.h>
+
 namespace fs = std::filesystem;
 
 seastar::metrics::metric_groups app_metrics;
@@ -471,6 +473,8 @@ static auto defer_verbose_shutdown(const char* what, Func&& func) {
         startlog.info("Shutting down {}", what);
         try {
             func();
+            seastar::sleep(std::chrono::minutes(1)).get();
+            raise(SIGSEGV);
             startlog.info("Shutting down {} was successful", what);
         } catch (...) {
             auto ex = std::current_exception();

So I tried to make coredump.service shut down after scylla-server.service, using the following drop-in conf:

$ cat /etc/systemd/system/scylla-server.service.d/dependencies.conf 
[Unit]
After=local-fs.target network-online.target systemd-coredump.socket var-lib-systemd-coredump.mount
Requires=local-fs.target network-online.target systemd-coredump.socket var-lib-systemd-coredump.mount
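systemd stops units in the reverse of their start order, so an After= dependency on systemd-coredump.socket should keep the socket up until scylla-server.service has stopped. A minimal sketch for checking that the drop-in was actually picked up, assuming the output of `systemctl show -p After` is a single `After=...` line; the helper name `ordered_after` is hypothetical:

```shell
# Reads the output of `systemctl show -p After <unit>` on stdin
# (assumed to be one line of the form "After=unit1 unit2 ...") and
# succeeds only if every unit given as an argument appears in that list.
ordered_after() {
    line=$(cat)
    list=" ${line#After=} "
    for unit in "$@"; do
        case "$list" in
            *" $unit "*) ;;
            *) echo "missing ordering on $unit" >&2; return 1 ;;
        esac
    done
    return 0
}

# Possible use on a node (needs systemd, not run here):
#   systemctl daemon-reload
#   systemctl show -p After scylla-server.service | \
#       ordered_after systemd-coredump.socket var-lib-systemd-coredump.mount
```

This only confirms that the ordering is configured; as the rest of this comment shows, correct ordering alone did not make the coredump succeed.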

Also, I found a GH issue which says systemd-coredump@.service may get terminated during shutdown (systemd/systemd#7176), so I also added a workaround for this:

$ cat /etc/systemd/system/systemd-coredump@.service.d/killsignal.conf 
[Service]
KillSignal=SIGCONT

I thought we could now capture the coredump correctly with systemd-coredump, but we cannot.
systemd-coredump still fails, even though systemd-coredump.socket is shut down after scylla-server.service:

Jul 01 00:27:41 ubuntu-jammy scylla[1054]: Segmentation fault on shard 0.
Jul 01 00:27:41 ubuntu-jammy scylla[1054]: Backtrace:
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   0x56b8e88
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   0x56ed416
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   /opt/scylladb/libreloc/libc.so.6+0x3cb1f
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   /opt/scylladb/libreloc/libc.so.6+0x8ce5b
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   /opt/scylladb/libreloc/libc.so.6+0x3ca75
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   0x13c1bb0
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   0x13c21b3
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   0x127f256
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   0x1272375
Jul 01 00:27:41 ubuntu-jammy scylla[1054]:   0x59440f6
Jul 01 00:27:41 ubuntu-jammy systemd-coredump[2271]: Failed to send coredump fd: Broken pipe
Jul 01 00:28:07 ubuntu-jammy systemd[1]: scylla-server.service: Main process exited, code=dumped, status=11/SEGV
Jul 01 00:28:07 ubuntu-jammy systemd[1]: scylla-server.service: Failed with result 'core-dump'.
Jul 01 00:28:07 ubuntu-jammy systemd[1]: Stopped Scylla Server.
Jul 01 00:28:07 ubuntu-jammy systemd[1]: scylla-server.service: Consumed 14min 45.846s CPU time.
...
Jul 01 00:28:07 ubuntu-jammy systemd[1]: systemd-coredump.socket: Deactivated successfully.
Jul 01 00:28:07 ubuntu-jammy systemd[1]: Closed Process Core Dump Socket.
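To spot this failure in a longer journal, one option is to filter for the lines where systemd-coredump itself reports an error. A small sketch, assuming the message format seen in the logs in this issue; the helper name `coredump_failures` is hypothetical:

```shell
# Reads journal text on stdin and prints only the lines in which
# systemd-coredump reported a failure, for example
# "Failed to send coredump fd: Broken pipe" or
# "Failed to connect to coredump service: Connection refused".
coredump_failures() {
    grep -E 'systemd-coredump\[[0-9]+\]: Failed to '
}

# Possible use after a reboot (reads the previous boot's journal, not run here):
#   journalctl -b -1 --no-pager | coredump_failures
```

The filter matches both failure messages quoted in this thread; other systemd-coredump error strings may exist and would need their own patterns.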

I tried again and again with slightly different configurations, but systemd-coredump never worked during shutdown.
So I decided the only possible workaround is to stop using the systemd-coredump handler in kernel.core_pattern and set a file path in kernel.core_pattern directly.
I will send a workaround for that.
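As a rough illustration of that idea (not the actual patch), the kernel can be told to write cores straight to a file by setting kernel.core_pattern to a path template instead of a pipe to systemd-coredump. The sketch below assumes the standard core(5) specifiers %e (executable name), %p (PID) and %t (dump time); the helper name `core_pattern_for` and the target directory are hypothetical:

```shell
# Prints a core_pattern value that makes the kernel write the core file
# directly into the given directory, bypassing the systemd-coredump pipe.
# %e, %p and %t are expanded by the kernel at dump time.
core_pattern_for() {
    dir="$1"
    printf '%s/core.%%e.%%p.%%t\n' "$dir"
}

# Possible use just before reboot or shutdown (requires root, not run here):
#   sysctl -w "kernel.core_pattern=$(core_pattern_for /var/tmp)"
```

With a plain file path the kernel writes the dump itself, so nothing depends on a userspace service still being alive during shutdown; the trade-off is that the compression and journal metadata provided by systemd-coredump are lost.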

…down We found that systemd-coredump does not correctly capturing coredump during system reboot or shutdown. As a workaround of this issue, set coredump file path to kernel.core_pattern during system reboot or shutdown. It will save core to /var/tmp/core.scylla.$PID.$TIMESTAMP. Fixes scylladb/scylla-machine-image#436

syuu1228 · 2023-07-08T14:57:37Z

@avikivity Do you have any idea with this issue?

syuu1228 · 2023-07-10T11:56:28Z

Opened an issue on the systemd GH: systemd/systemd#28338

syuu1228 · 2023-07-20T14:55:36Z

@avikivity ping, do you have any idea?

…down We found that systemd-coredump does not correctly capturing coredump during system reboot or shutdown. As a workaround of this issue, set coredump file path to kernel.core_pattern during system reboot or shutdown. It will save core to /var/lib/scylla/shutdown-coredump/. Fixes scylladb/scylla-machine-image#436

avikivity · 2023-10-24T16:38:53Z

Sorry for missing the issue. I often skip over scylla-machine-image because I don't maintain it. I'll look over it now.

avikivity · 2023-10-24T16:42:43Z

I guess the problem is that, even with the dependency, systemd thinks the process is done (not sure why, since the PID still exists while dumping core), so it stops systemd-coredump while the core dump is in progress. Very funky.

syuu1228 · 2023-10-27T23:08:17Z

@avikivity since we decided not to merge the workaround, what else can we do about this?
Maybe we should document it?

yaronkaikov assigned syuu1228 Feb 15, 2023

syuu1228 mentioned this issue Jul 4, 2023

dist: add workaround to capture coredump during system reboot or shutdown scylladb/scylladb#14510

Closed

syuu1228 mentioned this issue Jul 10, 2023

systemd-coredump does not work during shutdown systemd/systemd#28338

Open
