Add a timeout to the UDS metric sink #2635

Merged
merged 11 commits into from
Oct 21, 2022

Conversation

tobim
Member

@tobim tobim commented Oct 14, 2022

We now wait for up to one second while trying to write metrics, but only if the previous line was not dropped.
Additionally, the file descriptor used for sending is put into non-blocking mode to avoid endless blocking in case the listening socket hangs up but isn't cleaned up fully.

📝 Reviewer Checklist

Review this pull request by ensuring the following items:

  • All user-facing changes have changelog entries
  • User-facing changes are reflected on vast.io

tobim added 4 commits October 14, 2022 16:12
The accountant does IO, which can occupy a thread pool worker and
prevent it from doing something useful instead.
In some rare cases `sendto` could block when sending datagrams. We
now set the corresponding file descriptor to non-blocking mode and
discard the metrics event instead.
We now wait for up to one second while trying to write metrics,
but only if the previous line was not dropped.
@tobim tobim added the bug Incorrect behavior label Oct 14, 2022
@tobim tobim requested a review from mavam October 14, 2022 14:21
Member

@mavam mavam left a comment

This probably works, but I don't understand the reason behind polling the socket conditionally. Could you elaborate?

Why not poll unconditionally? It seems we're now at least dropping one line with higher probability.

libvast/include/vast/detail/posix.hpp Outdated Show resolved Hide resolved
libvast/src/detail/posix.cpp Outdated Show resolved Hide resolved
libvast/src/system/accountant.cpp Outdated Show resolved Hide resolved
libvast/src/system/accountant.cpp Outdated Show resolved Hide resolved
libvast/src/system/accountant.cpp Show resolved Hide resolved
@tobim
Member Author

tobim commented Oct 14, 2022

Why not poll unconditionally? It seems we're now at least dropping one line with higher probability.

If we wait every time and run into timeouts constantly, the message inbox of the accountant will grow as soon as the average time between messages is less than the timeout. Since the inbox is an unbounded buffer, it would run out of memory eventually, which is exactly the situation that this change is supposed to fix.
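The reasoning above can be illustrated with a small state machine that grants a wait budget only when the previous line was delivered. This is a sketch under assumptions: `metrics_sink`, `send_result`, and the `try_send` callback are hypothetical names, and the real accountant code may be structured differently.

```cpp
#include <chrono>

// Outcome of a single, hypothetical send attempt.
enum class send_result { ok, timed_out };

// Hypothetical sink wrapper: it waits only if the previous line was
// delivered, so a stalled receiver costs at most one timeout in a row.
struct metrics_sink {
  std::chrono::milliseconds timeout{1000};
  bool dropping = false; // true after the previous send timed out

  // `try_send` receives the time it may block for and reports the outcome.
  // A budget of zero means "do not wait at all".
  template <class TrySend>
  bool write(TrySend&& try_send) {
    auto budget = dropping ? std::chrono::milliseconds{0} : timeout;
    auto result = try_send(budget);
    dropping = (result == send_result::timed_out);
    return !dropping;
  }
};
```

With a permanently stalled receiver, only the first of any run of failed writes consumes the one-second budget; every later write returns immediately, so the inbox can drain at full speed.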

tobim and others added 2 commits October 14, 2022 17:12
Co-authored-by: Matthias Vallentin <matthias@vallentin.net>
* Explain why we don't wait if the previous send timed out
* Explain the 1 second timeout value
* Document the UDS send function
libvast/src/detail/posix.cpp Outdated Show resolved Hide resolved
@mavam
Member

mavam commented Oct 14, 2022

If we wait every time and run into timeouts constantly, the message inbox of the accountant will grow as soon as the average time between messages is less than the timeout. Since the inbox is an unbounded buffer, it would run out of memory eventually, which is exactly the situation that this change is supposed to fix.

My mental model is this:

  1. The common case is that VAST produces less than the other side can consume
  2. We want to be able to tolerate outages on the other end, for brief amounts of time
  3. If the socket becomes unavailable, we accept a small amount of wait time
  4. In the common case, wpoll is a no-op because the socket is always ready

By this logic, it doesn't hurt to poll unconditionally. It seems this scenario of intermittent outages is not reflected, rather, we're punishing directly by dropping a message on the floor immediately.

So only for the non-common case of continuous outage, we need to drop messages. Does that make sense?

@tobim
Member Author

tobim commented Oct 17, 2022

If we wait every time and run into timeouts constantly, the message inbox of the accountant will grow as soon as the average time between messages is less than the timeout. Since the inbox is an unbounded buffer, it would run out of memory eventually, which is exactly the situation that this change is supposed to fix.

My mental model is this:

  1. The common case is that VAST produces less than the other side can consume
  2. We want to be able to tolerate outages on the other end, for brief amounts of time
  3. If the socket becomes unavailable, we accept a small amount of wait time

That is what we do.

  4. In the common case, `wpoll` is a no-op because the socket is always ready

That doesn't make it a no-op. In fact, I'm going to change it to try to send first, and only wpoll if we get EAGAIN. No need to introduce an unforced pessimization here.

By this logic, it doesn't hurt to poll unconditionally. It seems this scenario of intermittent outages is not reflected, rather, we're punishing directly by dropping a message on the floor immediately.

Intermittent outages are covered. "Intermittent" in the sense that we can't send every so often, and it recovers relatively quickly. The logic to wait only in the good state has no effect in this scenario. What it does cover is the case of being unable to send for a longer time.

So only for the non-common case of continuous outage, we need to drop messages. Does that make sense?

Well, yes, that is exactly what this code does.
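The "send first, poll only on EAGAIN" approach mentioned in this comment could look roughly like the following sketch. It assumes plain POSIX sockets plus the widely available `MSG_DONTWAIT` flag; `send_with_timeout` and `send_status` are hypothetical names and not the functions from this pull request.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <cerrno>
#include <cstddef>

// Hypothetical three-way outcome of a send with a time budget.
enum class send_status { sent, timed_out, failed };

// Hypothetical helper: try to send right away and fall back to poll()
// only if the socket reports that it would block.
send_status send_with_timeout(int fd, const char* data, std::size_t size,
                              int timeout_ms) {
  // Optimistic first attempt: in the common case this succeeds and no
  // additional system call is needed.
  if (::send(fd, data, size, MSG_DONTWAIT) >= 0)
    return send_status::sent;
  if (errno != EAGAIN && errno != EWOULDBLOCK)
    return send_status::failed;
  // The socket is not ready: wait for writability within the budget.
  pollfd pfd{};
  pfd.fd = fd;
  pfd.events = POLLOUT;
  int rc = ::poll(&pfd, 1, timeout_ms);
  if (rc == 0)
    return send_status::timed_out;
  if (rc < 0)
    return send_status::failed;
  // Retry once after poll() reported activity on the descriptor.
  if (::send(fd, data, size, MSG_DONTWAIT) >= 0)
    return send_status::sent;
  return (errno == EAGAIN || errno == EWOULDBLOCK) ? send_status::timed_out
                                                   : send_status::failed;
}
```

Passing a timeout of zero turns the helper into a pure non-blocking attempt, which matches the "do not wait while dropping" state discussed earlier in the thread.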

Co-authored-by: Matthias Vallentin <matthias@vallentin.net>
@tobim tobim force-pushed the story/sc-37951/fix-blocking-uds-metrics branch from 7ca00b1 to 0ed47fa Compare October 17, 2022 09:01
Co-authored-by: Matthias Vallentin <matthias@vallentin.net>
Member

@mavam mavam left a comment

Unless @dominiklohmann has some mechanical feedback on this PR, I'm giving this the green light modulo pending discussion threads.

@dominiklohmann
Member

Unless @dominiklohmann has some mechanical feedback on this PR, I'm giving this greenlight modulo pending discussion threads.

I don't know as much about this mechanism as you do. My only concern is that expected<bool> is a really bad return type that we should never use anywhere, but I already commented on that.

@mavam
Member

mavam commented Oct 17, 2022

I don't know as much about this mechanism as you do. My only concern is that expected<bool> is a really bad return type that we should never use anywhere, but I already commented on that.

Yes, I also think we should revert this to caf::error with timeout being a specific error code.

@dominiklohmann
Member

Personal opinion: caf::expected<void> is a strictly better caf::error as a function return type, because it has bool-esque semantics, so I think that'd be even better.

@tobim
Member Author

tobim commented Oct 20, 2022

Here is what the call sites look like when using caf::error or caf::expected<void> respectively:

  • error
    if (auto err = dest.send(
          std::span<char>{reinterpret_cast<char*>(buf.data()), buf.size()},
          timeout_usec)) {
      if (err == ec::timeout)
        uds_datagram_sink_dropping = true;
      else {
        VAST_WARN("{} failed to write metrics to UDS sink: {}", *self,
                  err);
        VAST_WARN("{} disables the UDS metrics sink", *self);
        uds_datagram_sink.reset();
        return;
      }
    }
  • expected<void>
    if (auto success = dest.send(
          std::span<char>{reinterpret_cast<char*>(buf.data()), buf.size()},
          timeout_usec);
        !success) {
      if (success.error() == ec::timeout)
        uds_datagram_sink_dropping = true;
      else {
        VAST_WARN("{} failed to write metrics to UDS sink: {}", *self,
                  success.error());
        VAST_WARN("{} disables the UDS metrics sink", *self);
        uds_datagram_sink.reset();
        return;
      }
    }

Both seem to be downgrades over caf::expected<bool>. The main issue I have with it is that I consider timeout handling for a function that takes a timeout as regular control flow, not error handling.

@mavam
Member

mavam commented Oct 20, 2022

The main issue I have with it is that I consider timeout handling for a function that takes a timeout as regular control flow, not error handling.

Why not switch-case on caf::error as return value? That's idiomatic and doesn't require shoehorning the logic into a binary if-else.
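A switch-based call site along the lines suggested here might look like the sketch below. It deliberately avoids the CAF types: `send_error`, `sink_state`, and `handle_send_result` are hypothetical stand-ins for the project's error enum and the accountant's state, and resetting the dropping flag on success is an assumption.

```cpp
#include <string>

// Hypothetical error codes; a stand-in for the project's own error enum.
enum class send_error { none, timeout, io_failure };

// Hypothetical sink state mirroring the flags used in the call sites above.
struct sink_state {
  bool enabled = true;   // false once the sink has been disabled
  bool dropping = false; // true after a send timed out
};

// Updates the state for one send outcome and returns the warning to log,
// or an empty string if there is nothing to report.
std::string handle_send_result(send_error err, sink_state& state) {
  switch (err) {
    case send_error::none:
      // Assumption: a successful send clears the dropping flag.
      state.dropping = false;
      return {};
    case send_error::timeout:
      // Regular control flow: remember the timeout, keep the sink.
      state.dropping = true;
      return {};
    case send_error::io_failure:
      // A real error: disable the sink and report it.
      state.enabled = false;
      return "failed to write metrics to UDS sink; disabling it";
  }
  return {}; // not reached for valid enumerators
}
```

The three outcomes get one branch each, so the timeout case sits next to the success case instead of inside an error branch.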

@tobim
Member Author

tobim commented Oct 21, 2022

Why not switch-case on caf::error as return value? That's idiomatic and doesn't require shoehorning the logic into a binary if-else.

I implemented that now so we can move on.
I'll still push another update to try to send immediately so we only have to go through wpoll if that didn't work.

We only need to poll when we can't send immediately, so we
do it that way.
@tobim tobim enabled auto-merge October 21, 2022 10:56
@tobim tobim merged commit e1eea55 into master Oct 21, 2022
@tobim tobim deleted the story/sc-37951/fix-blocking-uds-metrics branch October 21, 2022 11:27