Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(run_nodetool): fix coredump on timeout #5311

Merged
merged 1 commit into from
Sep 26, 2022

Conversation

soyacz
Copy link
Contributor

@soyacz soyacz commented Sep 23, 2022

There are 2 reasons of coredump not creating on timeout error when running nodetool (e.g. in upgrade test when we run: node.run_nodetool("drain", timeout=3600, coredump_on_timeout=True):

  1. kind of race condition when SSHReaderThread got end event (due timeout) but reader didn't acknowledge that it occured yet. This even lead to sometimes freeze of whole ssh command until CriticalError killed ssh connection.
  2. wrong exception handling when running nodetool

This commit fixes both problems.

PR pre-checks (self review)

  • I followed KISS principle and best practices
  • I didn't leave commented-out/debugging code
  • I added the relevant backport labels
  • New configuration option are added and documented (in sdcm/sct_config.py)
  • I have added tests to cover my changes (Infrastructure only - under unit-test/ folder)
  • All new and existing unit tests passed (CI)
  • I have updated the Readme/doc folder accordingly (if needed)

There are 2 reasons of coredump not creating on timeout error when
running nodetool (e.g. in upgrade test when we run:
`node.run_nodetool("drain", timeout=3600, coredump_on_timeout=True)`:
1. kind of race condition when `SSHReaderThread` got end event (due
timeout) but reader didn't acknowledge that it occured yet. This even
lead to sometimes freeze of whole ssh command until
CriticalError killed ssh connection.
2. wrong exception handling when running nodetool

This commit fixes both problems.
@soyacz soyacz added backport/5.1 Need backport to 5.1 backport/2022.2 Need to backport to 2022.2 labels Sep 23, 2022
@soyacz
Copy link
Contributor Author

soyacz commented Sep 23, 2022

@roydahan this is fix for missing coredump when nodetool drain freezes: scylladb/scylladb#11468

btw. Is there a reason we wait 1h before timing out on node drain?

Copy link
Contributor

@fgelcer fgelcer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@roydahan
Copy link
Contributor

@roydahan this is fix for missing coredump when nodetool drain freezes: scylladb/scylladb#11468

btw. Is there a reason we wait 1h before timing out on node drain?

I don't think so, it sounds way too much to me.
Please check if any previous commit message explains it, and if not lower it to something more reasonable. (15 mins?)

@soyacz soyacz merged commit 37a53dc into scylladb:master Sep 26, 2022
@soyacz soyacz added backport/2022.2-done Commit backported to 2022.2 backport/5.1-done Commit backported to 5.1 labels Sep 26, 2022
@soyacz soyacz added backport/2022.1-done Commit backported to 2022.1 5.0-backported labels Oct 10, 2022
@roydahan
Copy link
Contributor

This PR introduced issue #5418.
The parameter of "coredump_on_timeout" is now completely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/5.1-done Commit backported to 5.1 backport/5.1 Need backport to 5.1 backport/2022.1-done Commit backported to 2022.1 backport/2022.2-done Commit backported to 2022.2 backport/2022.2 Need to backport to 2022.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants