fix(run_nodetool): fix coredump on timeout #5311

soyacz · 2022-09-23T10:17:48Z

There are two reasons a coredump is not created on a timeout error when running nodetool (e.g. in the upgrade test, when we run `node.run_nodetool("drain", timeout=3600, coredump_on_timeout=True)`):

1. A kind of race condition: `SSHReaderThread` got the end event (due to the timeout), but the reader had not yet acknowledged that it occurred. This sometimes even froze the whole ssh command until CriticalError killed the ssh connection.
2. Wrong exception handling when running nodetool.

This commit fixes both problems.
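
The race described in the first point can be illustrated with a minimal sketch. This is not the SCT code: `ReaderThread`, `run_with_timeout`, `stop_requested`, `stopped` and `ack_timeout` are hypothetical names chosen for illustration, and the real `SSHReaderThread` may be structured differently. The idea is that the caller signals the end event and then waits, with a bound, for the reader to acknowledge it before continuing.

```python
import queue
import threading
import time


class ReaderThread(threading.Thread):
    """Hypothetical stand-in for a reader thread (not the SCT SSHReaderThread)."""

    def __init__(self, source: queue.Queue):
        super().__init__(daemon=True)
        self.source = source
        self.stop_requested = threading.Event()  # set by the caller, e.g. on timeout
        self.stopped = threading.Event()         # set by the reader once it has exited
        self.lines = []

    def run(self):
        try:
            while not self.stop_requested.is_set():
                try:
                    # Short poll so the end event is noticed promptly.
                    self.lines.append(self.source.get(timeout=0.05))
                except queue.Empty:
                    continue
        finally:
            # Acknowledge the end event explicitly.
            self.stopped.set()


def run_with_timeout(source: queue.Queue, timeout: float, ack_timeout: float = 1.0):
    reader = ReaderThread(source)
    reader.start()
    time.sleep(timeout)  # stand-in for waiting on the remote command
    reader.stop_requested.set()
    # Wait for the acknowledgement, but only for a bounded time, so a stuck
    # reader cannot freeze the caller indefinitely.
    acknowledged = reader.stopped.wait(ack_timeout)
    return reader.lines, acknowledged
```

With this shape the caller never proceeds on the assumption that the reader has stopped, and a reader that fails to acknowledge shows up as `acknowledged == False` instead of a hang.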

PR pre-checks (self review)

I followed KISS principle and best practices
I didn't leave commented-out/debugging code
I added the relevant backport labels
New configuration options are added and documented (in sdcm/sct_config.py)
I have added tests to cover my changes (Infrastructure only - under unit-test/ folder)
All new and existing unit tests passed (CI)
I have updated the Readme/doc folder accordingly (if needed)

soyacz · 2022-09-23T10:19:24Z

@roydahan this is a fix for the missing coredump when nodetool drain freezes: scylladb/scylladb#11468

Btw, is there a reason we wait 1h before timing out on node drain?

fgelcer

LGTM

roydahan · 2022-09-25T10:11:42Z

> @roydahan this is a fix for the missing coredump when nodetool drain freezes: scylladb/scylladb#11468
>
> Btw, is there a reason we wait 1h before timing out on node drain?

I don't think so; it sounds like way too much to me.
Please check whether any previous commit message explains it, and if not, lower it to something more reasonable (15 mins?).

roydahan · 2022-11-15T11:29:43Z

This PR introduced issue #5418.
The `coredump_on_timeout` parameter is now completely ignored.
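
For reference, the behaviour one would expect from such a flag can be sketched as follows. This is a hypothetical illustration, not the `run_nodetool` implementation: `CommandTimedOut`, `run_command`, `runner` and `generate_coredump` are invented names, and the actual fix for #5418 may look different. The sketch only shows a coredump being triggered on timeout when, and only when, the flag is set, with the timeout error still propagating.

```python
class CommandTimedOut(Exception):
    """Hypothetical timeout error, used only in this sketch."""


def run_command(command, timeout, runner, coredump_on_timeout=False,
                generate_coredump=None):
    """Run `command` through `runner`; on timeout, optionally trigger a coredump.

    `runner` and `generate_coredump` are caller-supplied callables (assumptions
    made for this example).
    """
    try:
        return runner(command, timeout)
    except CommandTimedOut:
        # Honor the flag: no coredump unless the caller asked for one.
        if coredump_on_timeout and generate_coredump is not None:
            generate_coredump()
        # Re-raise so the caller still sees the timeout.
        raise
```

A unit test along these lines, calling the function once with the flag on and once with it off, would catch the flag being ignored.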

soyacz requested review from enaydanov, fruch, roydahan and fgelcer September 23, 2022 10:17

github-actions bot assigned soyacz Sep 23, 2022

soyacz added the backport/5.1 (Need backport to 5.1) and backport/2022.2 (Need to backport to 2022.2) labels Sep 23, 2022

fgelcer approved these changes Sep 25, 2022

roydahan approved these changes Sep 25, 2022

soyacz merged commit 37a53dc into scylladb:master Sep 26, 2022

soyacz added the backport/2022.2-done (Commit backported to 2022.2) and backport/5.1-done (Commit backported to 5.1) labels Sep 26, 2022

soyacz added the backport/2022.1-done (Commit backported to 2022.1) and 5.0-backported labels Oct 10, 2022

soyacz mentioned this pull request Oct 17, 2022

[longevity-tls-50gb-3d]: test timeout event that stress finishes in time #5345

Open
