Vector drops large # of UDP packets in statsd source without warning #15583
Hi @derekhuizhang! Thanks for this report. This seems likely to be because Vector cannot process the incoming UDP packets fast enough. Unfortunately, Vector also can't know that, since the packets are dropped by the OS before Vector sees them. We have a couple of issues open to improve the performance of UDP sources:
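On Linux, the OS does count these drops even though Vector can't see them in-band: a full socket receive buffer increments the `RcvbufErrors` counter in `/proc/net/snmp`. A minimal sketch of pulling that counter out, shown on a canned sample since the real file only exists on a host (field layout varies by kernel version, so the script matches columns by name):

```shell
# /proc/net/snmp contains a "Udp:" header line followed by a "Udp:" value line.
# Map header field names to column numbers, then print RcvbufErrors
# (datagrams dropped because the socket receive buffer was full).
sample='Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 505 0 120 300 120 0'
out=$(printf '%s\n' "$sample" | awk '
  $1 == "Udp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
  $1 == "Udp:"          { print "RcvbufErrors:", $col["RcvbufErrors"] }')
echo "$out"
# On a real host, run the same awk program against /proc/net/snmp instead.
```

This is the same information `netstat -u -s` reports as "receive buffer errors", just read from the raw counter file.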
Also relevant: there was some discussion around this in Discord. I think this issue is covered by the others, so I'll close it as a duplicate, but feel free to subscribe and add additional details to those others!
Actually, I see you weren't able to observe dropped packets. I'll re-open this for investigation.
Are there any available workarounds?
Why can't Vector know how many packets were dropped by the OS?
I revise my previous statement 😅 Apologies, I've been scattered this morning. Reading your issue again, it sounds like you are observing packet drops with high volume; I misread that part. The simplest workaround is to horizontally scale Vector, either by running additional instances or by running multiple sources.
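Beyond horizontal scaling, a commonly tried mitigation for OS-level UDP drops is enlarging the kernel's socket receive buffer so bursts can be absorbed. This is an assumption on my part, not something confirmed in this thread: Vector's socket-based sources expose a `receive_buffer_bytes` option (check the docs for your version), and the size a process can request is capped by the kernel's `net.core.rmem_max`. A sketch of the knobs involved:

```shell
# Assumption: the drops come from the kernel's per-socket receive buffer
# overflowing under burst load. The ceiling a process may request is
# net.core.rmem_max; these commands are illustrative (run on the node, as root):
#   sysctl net.core.rmem_max               # read the current ceiling (bytes)
#   sysctl -w net.core.rmem_max=8388608    # raise it, e.g. to 8 MiB
rmem=$((8 * 1024 * 1024))   # 8 MiB expressed in bytes
echo "$rmem"
```

A larger buffer only buys headroom; if Vector consistently processes slower than the sender produces, the buffer will still eventually fill.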
That's a fair point. I meant the source couldn't know as part of normal processing, since the packets are dropped before Vector has seen them. Vector could certainly read the counter from the OS.
You can scale up the number of replicas. You will want to put a UDP load balancer in front. The Helm chart has an HAProxy image included, but we generally recommend bringing your own load balancer (e.g., an NLB if you are in AWS). The HAProxy configuration is provided as a sort of "quick start".
Right, you could have multiple
Running a sidecar is definitely an option. Vector's TCP handling is generally more performant, since it can balance across multiple incoming TCP connections. I would recommend having the sidecar establish multiple TCP connections to Vector (maybe scaling that up or down automatically depending on the queue of messages to be sent?). Using a completely different medium, e.g. RabbitMQ or NATS, is definitely another option.
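The simplest form of such a sidecar can be sketched with `socat`. Everything here is an assumption for illustration: the `vector` hostname, the ports, and statsd-over-TCP on 8125 are not pinned down in the thread, and a real sidecar would open several TCP connections rather than the single one shown:

```shell
# Sketch of a UDP -> TCP statsd relay sidecar. Host and ports are hypothetical.
VECTOR_HOST=vector   # assumed Vector service name
VECTOR_PORT=8125     # assumed statsd-over-TCP port on Vector
CMD="socat -u UDP-RECV:8126 TCP:${VECTOR_HOST}:${VECTOR_PORT}"
echo "$CMD"          # run this command inside the sidecar container
```

`-u` makes socat unidirectional (UDP in, TCP out), and `UDP-RECV` binds a receive-only UDP socket on port 8126.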
Interesting. I'm curious whether any other sources have been load tested at very high throughputs (100k+ metrics/sec).
We have performance tests covering some cases. You can see them here: https://github.com/vectordotdev/vector/tree/master/regression/cases. We measure bytes rather than events, but you should be able to extrapolate. We don't have any UDP-based ones just yet (#12215).
Problem
For testing, I'm running Vector in the `sandbox` namespace with the stateless-aggregator mode Helm chart as a Deployment, config shown below.

I also run a firehose Service (`kubectl apply -f <file_name>` with this config saved as a file).

Then I created a new pod (`kubectl run firehose -it --image=debian -n sandbox`), copied a statsd-firehose binary onto it (https://github.com/derekhuizhang/statsd-firehose), and ran it to generate sample metrics:

```
./statsd-firehose -countcount 10 -distcount 10 -gaugecount 10 -statsd firehose:8126
```

(this sends 10 counts, 10 distributions, and 10 gauges every second to the service).

Then I exec'ed into the Vector pod and found that if we increase the counts/dists/gauges past a certain level, we get a very high number of UDP packet drops:

- With `./statsd-firehose -countcount 10 -distcount 10 -gaugecount 10 -statsd firehose:8126`, the number of drops stays at 0.
- With `./statsd-firehose -countcount 20000 -distcount 10 -gaugecount 10 -statsd firehose:8126`, the number of drops increases by ~100 every second.
- With `./statsd-firehose -countcount 100000 -distcount 10 -gaugecount 10 -statsd firehose:8126`, the number of drops increases rapidly every second.

The high number of drops indicates that lots of metrics are being dropped. We ran this test because we noticed a large number of metrics being dropped in our production environments, so we're pretty sure these metrics just aren't being processed by Vector. We've tried this with different sinks and transforms, so this isn't a sink/transform issue.

There is plenty of memory and CPU allocation available to the pod, so we don't think any back-pressure is at play.

These dropped UDP packets are not surfaced in Vector's internal metrics, so there's no way to tell this is happening except by exec-ing into the Vector pod.

We also noticed that if we run `nc -ulp 8126 > /dev/null 2>&1` and then run `./statsd-firehose -countcount 100000 -distcount 10 -gaugecount 10 -statsd firehose:8126`, we don't see the massive number of UDP packet drops in `netstat -u -s` (which we do see when running with Vector), so we don't think it's an inherent OS limitation, but happy to be proven wrong.

Is there anything we can do to stop Vector from dropping these UDP packets?
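The exec-and-watch step described above can be scripted: extract the UDP "receive buffer errors" line from `netstat -u -s` output and diff it between two samples. A sketch on canned output, since the real numbers come from inside the pod:

```shell
# Pull the "receive buffer errors" count out of netstat -su style output and
# report how much it grew between two samples. The canned printf lines stand
# in for two real invocations of: netstat -u -s | udp_buf_errors
udp_buf_errors() { awk '/receive buffer errors/ {print $1}'; }
before=$(printf '    100 receive buffer errors\n' | udp_buf_errors)
after=$(printf '    220 receive buffer errors\n' | udp_buf_errors)
echo "drops in interval: $((after - before))"
```

Run in a loop with `sleep 1` between samples, this reproduces the "increases by ~100 every second" observation as a number per interval.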
Configuration
Version
0.26.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response