Add a timeout to the UDS metric sink #2635
The accountant does IO, which can occupy a thread pool worker and prevent it from doing something useful instead.
In some rare cases `sendto` could block when sending datagrams. We now set the corresponding file descriptor to non-blocking mode and discard the metrics event instead.
We now wait for up to one second while trying to write metrics, but only if the previous line was not dropped.
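A minimal sketch of the non-blocking part of that description, assuming a connected POSIX datagram socket. The names `make_nonblocking`, `send_or_drop`, and `send_result` are hypothetical and not taken from the PR; the point is only that, once `O_NONBLOCK` is set, `sendto` reports `EAGAIN`/`EWOULDBLOCK` instead of blocking, and the caller can treat that as "drop this event".

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical result type for illustration; the PR reports errors differently.
enum class send_result { sent, dropped, failed };

// Puts `fd` into non-blocking mode. Returns false if fcntl fails.
bool make_nonblocking(int fd) {
  int flags = ::fcntl(fd, F_GETFL, 0);
  if (flags == -1)
    return false;
  return ::fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Sends one datagram on a connected socket that is in non-blocking mode.
// If the kernel cannot accept the datagram right now, the event is dropped
// instead of blocking the calling thread.
send_result send_or_drop(int fd, const char* data, std::size_t size) {
  assert(data != nullptr);
  // A null destination address is valid for a connected socket.
  ssize_t n = ::sendto(fd, data, size, 0, nullptr, 0);
  if (n >= 0)
    return send_result::sent;
  if (errno == EAGAIN || errno == EWOULDBLOCK)
    return send_result::dropped;
  return send_result::failed;
}
```

With a receiver that never reads, repeated calls to `send_or_drop` eventually return `send_result::dropped` rather than hanging.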
This probably works, but I don't understand the reason behind polling the socket conditionally. Could you elaborate?
Why not poll unconditionally? It seems we're now at least dropping one line with higher probability.
If we wait every time and constantly run into timeouts, the accountant's message inbox will grow as soon as the average time between messages is less than the timeout. Since the inbox is an unbounded buffer, it would eventually run out of memory, which is exactly the situation that this change is supposed to fix.
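To put rough numbers on that argument, here is a small back-of-the-envelope model. It assumes a strictly periodic arrival of messages and that every send blocks for the full timeout; both are simplifications, and `backlog_after` is a hypothetical helper, not something from the code base.

```cpp
#include <cstdint>

// Estimates how many messages pile up in the inbox after `elapsed_ms`
// when one message arrives every `interval_ms` and handling each message
// blocks for `timeout_ms` (the worst case of constant timeouts).
std::int64_t backlog_after(std::int64_t elapsed_ms, std::int64_t interval_ms,
                           std::int64_t timeout_ms) {
  std::int64_t arrived = elapsed_ms / interval_ms;
  std::int64_t handled = elapsed_ms / timeout_ms;
  // The actor cannot handle more messages than have arrived.
  return arrived > handled ? arrived - handled : 0;
}
```

Under these assumptions, a metrics event every 100 ms with a 1 second timeout leaves `backlog_after(60000, 100, 1000) == 540` messages queued after one minute, and the number keeps growing linearly. With one event every 2 seconds the backlog stays at zero.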
Co-authored-by: Matthias Vallentin <matthias@vallentin.net>
* Explain why we don't wait if the previous send timed out
* Explain the 1 second timeout value
* Document the UDS send function
My mental model is this:
By this logic, it doesn't hurt to poll unconditionally. It seems this scenario of intermittent outages is not reflected; rather, we're punishing directly by dropping a message on the floor immediately. So only for the uncommon case of a continuous outage do we need to drop messages. Does that make sense?
That is what we do.
That doesn't make it a no-op. In fact, I'm going to change it to try to send first, and only
Intermittent outages are covered. "Intermittent" in the sense that we can't send every so often, and it recovers relatively quickly. The logic to wait only in the good state has no effect in this scenario. What it does cover is the case of being unable to send for a longer time.
Well, yes, that is exactly what this code does.
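The "wait only in the good state" logic discussed in this thread can be sketched as a tiny state machine. This is an illustration under assumptions, not the PR's code: `sink_state`, `write_line`, `outcome`, and the injected `try_send` callback are all hypothetical names, and the callback stands in for the real UDS send.

```cpp
#include <chrono>
#include <functional>

using std::chrono::milliseconds;

// Hypothetical outcome of a single send attempt.
enum class outcome { sent, timed_out };

struct sink_state {
  bool dropping = false;       // true if the previous line timed out
  milliseconds time_lost{0};   // total wait budget spent on timed-out sends
};

// The 1 second budget mentioned in the PR description.
constexpr milliseconds wait_budget{1000};

// Writes one line. `try_send` receives the maximum time it may wait.
// In the good state we grant the full budget; after a timeout we grant
// none, so a long outage costs at most one budget in total.
outcome write_line(sink_state& st,
                   const std::function<outcome(milliseconds)>& try_send) {
  const milliseconds max_wait = st.dropping ? milliseconds{0} : wait_budget;
  const outcome result = try_send(max_wait);
  if (result == outcome::timed_out)
    st.time_lost += max_wait;
  // A successful send returns us to the good state.
  st.dropping = (result == outcome::timed_out);
  return result;
}
```

During a continuous outage only the first line pays the one second wait; every following line is dropped immediately until a send succeeds again. A short hiccup that resolves within the budget never sets `dropping` at all.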
Co-authored-by: Matthias Vallentin <matthias@vallentin.net>
7ca00b1 to 0ed47fa
Co-authored-by: Matthias Vallentin <matthias@vallentin.net>
Unless @dominiklohmann has some mechanical feedback on this PR, I'm giving this greenlight modulo pending discussion threads.
I don't know as much about this mechanism as you do; my only concern is that
Yes, I also think we should revert this to
Personal opinion:
Here is what the call sites look like when using
if (auto err = dest.send(
      std::span<char>{reinterpret_cast<char*>(buf.data()), buf.size()},
      timeout_usec)) {
  if (err == ec::timeout)
    uds_datagram_sink_dropping = true;
  else {
    // Log the error returned by send(); there is no `success` in this variant.
    VAST_WARN("{} failed to write metrics to UDS sink: {}", *self, err);
    VAST_WARN("{} disables the UDS metrics sink", *self);
    uds_datagram_sink.reset();
    return;
  }
}
if (auto success = dest.send(
      std::span<char>{reinterpret_cast<char*>(buf.data()), buf.size()},
      timeout_usec);
    !success) {
  if (success.error() == ec::timeout)
    uds_datagram_sink_dropping = true;
  else {
    VAST_WARN("{} failed to write metrics to UDS sink: {}", *self,
              success.error());
    VAST_WARN("{} disables the UDS metrics sink", *self);
    uds_datagram_sink.reset();
    return;
  }
}
Both seem to be downgrades over
Why not switch-case on
I implemented that now so we can move on.
We only need to poll when we can't send immediately, so we do it that way.
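One way to read "send first, poll only when needed" is sketched below with plain POSIX calls. This is an assumption-laden illustration rather than the function added in the PR: `send_with_timeout` and `send_status` are hypothetical names, and the sketch uses `MSG_DONTWAIT` per call instead of putting the descriptor into non-blocking mode so that it stands on its own.

```cpp
#include <cerrno>
#include <cstddef>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical status type for illustration.
enum class send_status { ok, timeout, error };

// Tries to send immediately. Only if the socket is not writable right now
// do we poll for up to `timeout_ms` and retry once.
send_status send_with_timeout(int fd, const char* data, std::size_t size,
                              int timeout_ms) {
  // Fast path: no poll() system call when the send succeeds right away.
  if (::send(fd, data, size, MSG_DONTWAIT) >= 0)
    return send_status::ok;
  if (errno != EAGAIN && errno != EWOULDBLOCK)
    return send_status::error;
  if (timeout_ms <= 0)
    return send_status::timeout;
  // Slow path: wait until the socket becomes writable or the timeout hits.
  pollfd pfd{};
  pfd.fd = fd;
  pfd.events = POLLOUT;
  int ready = ::poll(&pfd, 1, timeout_ms);
  if (ready < 0)
    return send_status::error; // includes EINTR; a real version might retry
  if (ready == 0)
    return send_status::timeout;
  // Retry once; if the socket filled up again, report a timeout.
  if (::send(fd, data, size, MSG_DONTWAIT) >= 0)
    return send_status::ok;
  return (errno == EAGAIN || errno == EWOULDBLOCK) ? send_status::timeout
                                                   : send_status::error;
}
```

In the common case this costs a single system call per metrics line; the extra `poll` is only paid when the receiver is falling behind.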
We now wait for up to one second while trying to write metrics, but only if the previous line was not dropped.
Additionally, the file descriptor used for sending is put into non-blocking mode to avoid endless blocking in case the listening socket hangs up but isn't cleaned up fully.