Skip to content

Fix serve exiting prematurely #3562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 19, 2023
Merged

Conversation

dominiklohmann
Copy link
Member

Important

We discussed internally that we want to merge this fix only after the Tenzir v4.3 release, as we want to have some time to double-check whether there are any other fallouts of the change.

Pipelines ending with serve should wait until all events are fetched from the corresponding /serve endpoint. This was not always the case because of an incorrectly set scope in the operator.

Fixing that scope, however, led to the operator deadlocking the node under some circumstances. This revealed a bug at the very core of CAF's scheduling logic, which incorrectly blocked not just ordinary but also system messages from arriving when a request response was awaited. I got that bug fixed upstream and backported the fix to our bundled fork of CAF 0.18. That in turn revealead us relying on the broken behavior in our execution nodes accidentally. I've included a fix for that as well.

@dominiklohmann dominiklohmann added the bug Incorrect behavior label Oct 9, 2023
@dominiklohmann dominiklohmann force-pushed the topic/serve-premature-exit branch from 4eb3df9 to a4d6680 Compare October 9, 2023 15:10
@dominiklohmann dominiklohmann requested a review from lava October 9, 2023 15:11
Copy link
Member

@lava lava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me from the logical perspective, and we did a lot of manual testing to verify that this is indeed the source of the error.

@dominiklohmann dominiklohmann force-pushed the topic/serve-premature-exit branch from b4f7863 to a2af8bc Compare October 19, 2023 11:26
Pipelines ending with `serve` should wait until all events are fetched
from the corresponding `/serve` endpoint. This was not always the case
because of an incorrectly set scope in the operator.

Fixing that scope, however, led to the operator deadlocking the node
under some circumstances. This revealed a bug at the very core of CAF's
scheduling logic, which incorrectly blocked not just ordinary but also
system messages from arriving when a request response was awaited. I
got that bug fixed upstream and backported the fix to our bundled fork
of CAF 0.18. That in turn revealead us relying on the broken behavior in
our execution nodes accidentally. I've included a fix for that as well.
We will fix this in a follow-up in the near future; in the meantime, the
comment in the file explains why the tests are disabled.
@dominiklohmann dominiklohmann force-pushed the topic/serve-premature-exit branch from a2af8bc to 0b9c3d9 Compare October 19, 2023 12:02
@dominiklohmann dominiklohmann merged commit c32cf04 into main Oct 19, 2023
@dominiklohmann dominiklohmann deleted the topic/serve-premature-exit branch October 19, 2023 13:45
dominiklohmann added a commit that referenced this pull request Oct 26, 2023
This is five fixes in one PR:
1. Predicate pushdown did not work in v4.3; this supersedes my previous
attempt in #3590
2. The catalog expected expressions to be already normalized and
validated, which for non-normalized expressions caused all partitions to
be returned
3. We introduced a race in #3562 that caused operators to sometimes hang
on shutdown; this mostly occurred in release builds with high-volume
pipelines at the process boundary (no changelog entry for this one as we
never released with this bug)
4. The long option `--skip-empty` for `read lines` now works as expected
5. We now include the feature flags in `show build`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants