enhancement(throttle transform): Allow throttling by bytes #14280
base: master
Conversation
Soak Test Results (Baseline: 2586b52)

Explanation: A soak test is an integrated performance test for vector in a repeatable rig, with varying configurations for vector. What follows is a statistical summary of a brief vector run for each configuration across the SHAs given above. The goal of these tests is to determine, quickly, whether vector performance is changed, and to what degree, by a pull request. Where appropriate, units are scaled per core. The table below, if present, lists those experiments that have experienced a statistically significant change in throughput performance between the baseline and comparison SHAs with 90.0% confidence, or that have been detected as newly erratic. Negative values mean the baseline is faster; positive values mean the comparison is. Results that do not exhibit more than a ±8.87% change in mean throughput are discarded. An experiment is erratic if its coefficient of variation is greater than 0.3. The abbreviated table is omitted if no interesting changes are observed.

No interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%. Fine details of change detection per experiment.

Soak Test Results (Baseline: c745d25): No interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%.
In addition to my comments on the unit/mode enums below, this needs to update the documentation in `website/cue/reference/components/transforms/throttle.cue`, along with a big warning about the increased CPU requirements the extra encoding step will cause.
src/transforms/throttle.rs (outdated):

```rust
#[serde(default = "default_unit")]
unit: ThrottleUnit,
```
I think `mode` would describe this setting better, as it is changing between different modes of operation, with the "bytes" mode turning on encoding.
This is how I originally named it, but I decided against it because I can see "mode" becoming overloaded. There is another ticket about throttling having different modes for when capacities are hit (dropping data vs. applying backpressure). I moved to "unit" because it was a bit more precise in terms of what `threshold` ends up being a measure of.

I'm totally okay with making the change to `mode` if you still prefer that, but I wanted to offer the counterargument.
src/transforms/throttle.rs (outdated):

```rust
/// The throttling unit to use.
#[serde(default = "default_unit")]
unit: ThrottleUnit,

/// The encoding to use if throttling by bytes
encoding: Option<EncodingConfig>,
```
I think this could be done with an enum using serde's internally tagged mode, which would make it impossible to configure the "bytes" mode without an encoding:

```rust
enum ThrottleMode {
    Events,
    Bytes {
        encoding: EncodingConfig,
    },
}
```
From what I understand, this is possible, with a caveat. If we want these fields at the top level, then I believe we need to use serde's `flatten` attribute. However, this cannot be used in conjunction with `deny_unknown_fields`. Removing `deny_unknown_fields` seems to spoil the certainty that this change is supposed to introduce. It is very possible I am missing something, though.

Alternatively, we can avoid flattening. Nothing wrong with this, but the naming gets a little tricky because we may end up with configuration that looks like this. I can try to experiment with options, but any further suggestions are appreciated. I am learning as I go.

```yaml
mode:
  mode: events
```
Yes, serde enum encoding is tricky and, yes, dropping `deny_unknown_fields` is undesirable. I don't have a good suggestion here other than a custom deserializer, so probably leaving the encoding as a separate parameter is the most reasonable path forward.

Edit: Since you have already made the change and we have other components with `deny_unknown_fields`, I'd say leave it here and we'll see what others think.
I happened to come across a similar scenario with the syslog source. It has a `mode` field which is an enum with a lot of configuration options. It uses flattening and internal tagging while accepting unknown fields, just as this PR's current state does. I think this is a sign that this is the preferred approach.

https://github.com/vectordotdev/vector/blob/master/src/sources/syslog.rs#L36-L37
Soak Test Results (Baseline: 2c58589): No interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%.
I believe this PR is now ready for a more proper review. I have updated the cue docs and swapped out the naming from
Hi @jutley,

We discussed this contribution during our team check-in. We are excited for this transform to receive this feature. One concern we have with this PR is that the requirement to serialize the data in order to count the bytes can introduce a significant and unexpected overhead to this transform. While our concern is based on assumptions, those assumptions are based on our experience with, and past benchmarking of, event serialization in the hot path. One potential solution we could see is using work that is currently in flight (and should land this week) to estimate the JSON-encoded size of an event with minimal overhead. There are two downsides to this:

We were wondering if either of these two downsides would be an issue for your use case? If not, we would probably prefer that direction for this PR, while keeping the possibility open (thus baking it into the config API) to add exact byte-size counting in the future, if enough users need it and accept the trade-off of reduced throughput. Additionally, we could introduce a new soak test in this PR that takes an existing test and adds the throttle transform in the middle, using byte-size throttling, to measure the impact on overall throughput.
@JeanMertz These ideas sound totally reasonable. Using a true encoder has always felt simultaneously elegant and incredibly clumsy to me. I'll wait for this encoding estimation work to land and then continue from there. I think the direction you are suggesting will work.

That said, there is a bit of flexibility that the encoding provides that I was planning on exploring. In particular, I think it will be useful to be able to choose the fields to use in the encoding, or to choose the

If this feature doesn't offer much configuration, I may look to implement different methods for getting a byte count for an event. If nothing else, I think it'll be useful (at least for me) to support the full JSON and just the message.
@JeanMertz I managed to find a little time to get back to this PR. I rewrote it to use the estimated JSON size, and am also supporting the bytes of the
@JeanMertz Is there anything I can do to help move this along?
Regression Test Results (Run ID: 9c2dc1af-ed9f-4eff-b0cf-ef77391eb92f)

Explanation: A regression test is an integrated performance test for vector in a repeatable rig. The table below, if present, lists those experiments that have experienced a statistically significant change in throughput performance between the baseline and comparison SHAs.

No interesting changes in throughput were observed. Fine details of change detection per experiment.
Hi @jutley!

Thanks for this contribution! Picking this up since @JeanMertz is out on leave.

I think this is a great feature, but I think we could model this a bit better by allowing specification of all three limits at once rather than a mode switch. This would let users, for example, throttle by both the number of events and their JSON size, which otherwise wouldn't be possible. I also think it makes the `threshold` configuration field a little less confusing to understand.
My suggested configuration UX would be something like:

```toml
[transforms.foo]
type = "throttle"
key_field = "service"
window_secs = 60
threshold.events = 1000           # default to unlimited
threshold.json_bytes = 100000     # default to unlimited
threshold.message_bytes = 1000000 # default to unlimited
```

Where the first threshold to be hit would start throttling the incoming messages. This would mean maintaining three quotas rather than just one.

What do you think of this UX? I realize you are new to Rust, so adding this functionality may be difficult, but we can help guide you.
@jszwedko That's a great idea. Seems like an obvious choice now that it's been said! I think I can figure out the implementation by referencing what already exists. I'll reach out if I need help!
@jszwedko I started working on this (slowly), and I am running into a limitation with the RateLimiters. If one RateLimiter allows a message but the next RateLimiter does not, then the message should be dropped. However, in making that check, the message is still counted against the first RateLimiter. Looking at the RateLimiter docs, I'm not seeing any kind of dry-run functionality. Any suggestions?
Aha, yes, you appear to be right. There isn't a way to check the quota without consuming it, or to replace tokens outside of the refresh interval 🤔 Given that, I'm not seeing an easy way to maintain multiple quotas for the same input stream. I opened boinkor-net/governor#167 on the upstream crate to see if there is something we are missing. Absent that support, I think the best we can do is something like the current implementation, where you can only choose one of the characteristics to throttle on.

However! I was also thinking of a potentially more flexible interface for this, which would be to allow providing a VRL expression to express how many tokens are used by a given incoming event. The interface would be something like:

```toml
[transforms.foo]
type = "throttle"
key_field = "service"
window_secs = 60
threshold = 1000
tokens = "1" # the default, take one token for each event
# or
tokens = "len(.message)" # take N tokens based on the length of the message
# or
tokens = "len(encode_json(.))" # take N tokens based on the JSON-encoded length of the event
# or
tokens = """
if .status == "500" {
  0
} else {
  1
}
""" # for 500s take a different number of tokens
```

The JSON one would be rather expensive since it involves encoding, and I know that was the motivation for using the estimated JSON encoding in this PR. I think we could resolve that by adding a new function to VRL. This also again has the advantage of having
That approach makes sense. I suspect that if someone wants the "multi-mode" approach, they can chain throttle transforms together. That should allow us to guarantee something like "no more than x events and y bytes per second per key". There are probably some subtle mathematical differences between the chained approach and the multi-mode approach, but I suspect they won't be particularly important for most real-world scenarios. We're dropping data intentionally either way.

It'll take me a little longer to get around to exploring this approach. I think the
Yeah, agreed, we can split that bit off into its own PR since it'll involve an addition to VRL.
Looks like there is a governor feature request that I missed which discusses our exact issue: boinkor-net/governor#156
Just leaving a note to say that I took a look at the recommended approach, and while I do believe it should be possible, I am just not proficient enough in Rust to handle it in a timely manner. Small things take me a long time, and I just don't have the time for this. If anyone else is interested in picking this up, feel free!
Implements #11854.
This is my first time programming in Rust, so some things may be strange. Feedback is greatly appreciated.

I started looking into updating the .cue files for documentation. I saw there are reusable components here, such as the `encoding` configuration I would like to use. However, this is currently only scoped to sinks; using it here in a transform is totally new. I'm not sure whether this is a sign that I am doing something in a very strange way, or if I should just update the necessary pieces to support the `encoding` feature within transforms. If someone can help me answer this, it would be a big help.