Skip to content

Eliminate signal monitoring shutdown lag #2807

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

tobim
Copy link
Member

@tobim tobim commented Dec 18, 2022

This replaces the signal monitor with an adapter approach that translates POSIX signals into CAF messages.
The feature is intentionally limited to SIGINT and SIGTERM and should always lead to termination. Other signals are not supported in order to avert the temptation of using signals for poor mans IPC.

To the reviewer:

  • verify that client commands exit normally upon completion.
  • verify that client commands can be interrupted with a single CTRL-C.
  • verify that commands that start a node attempt a graceful shutdown with CTRL-C but can be terminated immediately with a second CTRL-C.
  • verify that client commands with the -N option exit regularly upon completion.

@tobim tobim added the performance Improvements or regressions of performance label Dec 18, 2022
@tobim tobim requested a review from a team December 18, 2022 11:31
Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current approach does not exit gracefully for me.

@tobim tobim force-pushed the story/sc-39891/eliminate-signal-monitor-shutdown-lag branch from 9ce7c27 to b89cb1f Compare December 21, 2022 00:35
tobim added 13 commits December 21, 2022 06:29
This replaces the signal monitor with an adapter approach that
translates POSIX signals into CAF messages.
The feature is intentionally limited to SIGINT and SIGTERM and
should always lead to termination. Other signals are not supported
in order to avert the temptation of using signals for poor mans IPC.
Instead of sending a signal to the process we use the
`pthread_cancel` function. This works because the `sigwait` is
a cancelation point.
With the faster termination thanks to the new signal handling logic
the symptoms of the not quite working blocking import become much
more pronounced. This is especially problematic for the integration
tests which rely on blocking import a lot.

In order to preserve the blocking semantic the flag now implies a
flush to disk of in memory data. This ensures that active partitions
are written to disk and added catalog before the command returns.
@tobim tobim force-pushed the story/sc-39891/eliminate-signal-monitor-shutdown-lag branch from cdd7cd0 to ebda73a Compare December 21, 2022 06:44
@tobim
Copy link
Member Author

tobim commented Dec 21, 2022

macOS CI fails because it doesn't have jthread, but homebrew libc++ should have it. I'm not sure what's wrong here.
Could it be that homebrew clang uses apples libc++ by default?

@tobim tobim requested a review from dominiklohmann December 21, 2022 08:25
@dominiklohmann
Copy link
Member

macOS CI fails because it doesn't have jthread, but homebrew libc++ should have it. I'm not sure what's wrong here.
Could it be that homebrew clang uses apples libc++ by default?

Homebrew libc++ doesn't have it, so you'll have to use std::thread instead and join manually.

Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still sometimes see the 30s delay in the integration tests:

2022-12-21 10:38:28 DEBUG    stopping server fixture: [PosixPath('/Users/dominiklohmann/Desktop/vast/master/build/bin/vast'), '--bare-mode', '--config=/Users/dominiklohmann/Desktop/vast/master/plugins/web/integration/vast.yaml', '-e', ':42024', 'stop']
2022-12-21 10:38:59 INFO     ran all 5 steps successfully

For the remainder, I left some questions, but I think this is mostly ready to go. Works nicely locally.

Co-authored-by: Dominik Lohmann <mail@dominiklohmann.de>
@dominiklohmann
Copy link
Member

I still sometimes see the 30s delay in the integration tests:

2022-12-21 10:38:28 DEBUG    stopping server fixture: [PosixPath('/Users/dominiklohmann/Desktop/vast/master/build/bin/vast'), '--bare-mode', '--config=/Users/dominiklohmann/Desktop/vast/master/plugins/web/integration/vast.yaml', '-e', ':42024', 'stop']
2022-12-21 10:38:59 INFO     ran all 5 steps successfully

We identified this to be a separate bug specific to the web plugin, where the web server command does not shut down properly. This must not necessarily be fixed by this PR.

Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving modulo the ongoing open discussions.

@dominiklohmann
Copy link
Member

I enabled auto-merge here since this was really annoying over the past few weeks and shouldn't be blocking further work while most of everyone is on vacation.

@dominiklohmann dominiklohmann merged commit f889fe5 into master Dec 23, 2022
@dominiklohmann dominiklohmann deleted the story/sc-39891/eliminate-signal-monitor-shutdown-lag branch December 23, 2022 06:08
dominiklohmann added a commit that referenced this pull request Jan 17, 2023
This fixes a bug that was only surfaced by #2807, so it's not really
worth a separate changelog entry.
Dakostu pushed a commit that referenced this pull request Jan 26, 2023
This fixes a bug that was only surfaced by #2807, so it's not really
worth a separate changelog entry.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Improvements or regressions of performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants