udp stats gone missing? #318

Closed
gingerlime opened this issue Jul 30, 2013 · 5 comments

@gingerlime

I'm playing around with statsd for a presentation I'm planning to give on statsd/graphite/ruby later this week, and I've bumped into a strange issue.

I noticed that when I send a large number of statsd requests, some appear to 'go missing'.

I used tcpdump to see whether the actual UDP datagrams arrive, and I can see all of the messages reaching the statsd host, but in the debug log I get only around 1/3 of them. I tested a few times sending 10,000 stats; tcpdump shows all of them, but the log file shows only around 3,000-4,000 messages.

The machine is not super-fast, but not particularly busy either. Any suggestions on how to ensure statsd can cope with the volume of messages it receives? Or are there rules of thumb about the capacity of each statsd daemon in terms of throughput/latency?

@gingerlime (Author)

Quick update: if I fire 1,000 messages at a time I observe similar behaviour (around 300-400 messages received out of 1,000), but when sending a 'batch' of 100 at a time, it seems like all messages are received... maybe some internal buffer fills up and starts discarding messages?
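For reference, a rough sketch of the kind of sender used for these tests (the host, port, batch size and pause are placeholders; the actual tests used the Ruby and Python statsd clients):

```python
import socket
import time

STATSD_HOST = "statsd-host"   # placeholder; point this at the statsd box
STATSD_PORT = 8125            # statsd's default UDP port
TOTAL = 1000                  # total counters to fire
BATCH = 100                   # messages per burst; set to TOTAL for one big burst
PAUSE = 0.1                   # seconds to sleep between bursts

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

sent = 0
while sent < TOTAL:
    # fire one burst back-to-back, then pause briefly
    for _ in range(min(BATCH, TOTAL - sent)):
        sock.sendto(b"test.counter:1|c", (STATSD_HOST, STATSD_PORT))
        sent += 1
    time.sleep(PAUSE)
```

With BATCH set to TOTAL (one uninterrupted burst) the drops show up; with small batches and a pause in between they mostly don't.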

@kppullin

There are a couple of things I've found effective in lessening the number of dropped packets (well... at least they seem to help):

  • First run netstat -s --udp to see if there are any receive errors. Lost stats correlate with receive errors in my test cases.
  • Bump the values of net.ipv4.udp_rmem_min and net.ipv4.udp_wmem_min. A value of 131072 helps, but be sure to experiment yourself.

What's interesting is that tcpdump sees the incoming packets. I'm not too familiar with where tcpdump hooks into the network stack; it's possible that tcpdump sees the packets before they're dropped later on due to full OS buffers.

@gingerlime (Author)

Thanks for the detailed info @kppullin! It looks like this is definitely the right direction, but still no joy.

  • I tried updating those settings using sysctl -w net.ipv4.udp_rmem_min=... and also various other recommended UDP tuning, e.g. from here. One setting that I was hopeful about was sysctl -w net.core.netdev_max_backlog=20000, but unfortunately it didn't make a difference either. (I assume that using sysctl -w takes immediate effect and does not require a reboot... right?)
  • tcpdump still appears to receive all packets pretty much (perhaps a very small fraction goes missing)
  • the debug log shows only around 30-40% of the messages...
  • another interesting thing is that netstat -s --udp does show packet receive errors, and running it before and after firing the statsd messages, the number of errors correlates more or less with the number of messages that go missing (a rough sketch of this check is just below)...
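For what it's worth, that before/after check with netstat -s --udp can also be scripted; a rough Linux-only sketch that diffs the kernel's UDP counters around a test run (field names as they appear in /proc/net/snmp; RcvbufErrors may be missing on older kernels):

```python
def udp_counters():
    """Parse the two 'Udp:' lines of /proc/net/snmp into a dict (Linux only)."""
    with open("/proc/net/snmp") as f:
        lines = [line.split() for line in f if line.startswith("Udp:")]
    header, values = lines[0][1:], lines[1][1:]
    return dict(zip(header, (int(v) for v in values)))

before = udp_counters()
input("Fire the statsd messages now, then press Enter...")
after = udp_counters()

for key in ("InDatagrams", "InErrors", "RcvbufErrors"):
    # missing keys default to 0 on kernels that don't expose them
    print(key, after.get(key, 0) - before.get(key, 0))
```

If InErrors/RcvbufErrors grow by roughly the number of missing messages, the drops are happening in the kernel's socket buffer rather than on the wire.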

I still wonder how come tcpdump does receive all of those then? Or does it sit 'above' the OS UDP handler... Very strange.

It does look like an OS thing rather than specific to statsd... would appreciate any help or suggestions, but would otherwise close this. Thanks for the help so far!

@gingerlime (Author)

another update with more things I've tried:

  • running statsd on different hosts - similar results. The first server was a DigitalOcean virtual server with 2GB RAM and 2 CPUs, with a receive rate of around 30%. The second was an AWS EC2 m1.small (1.7GB RAM, 1 CPU), with a receive rate of around 20%. The third was an AWS EC2 c1.medium (1.7GB RAM, 2 CPUs) - the receive rate increased somewhat to around 45-50%, but still with many lost messages.
  • using a different statsd client on the remote host - running the python client seems to increase the drop rate compared to ruby. I received only around 20-25% of messages. I guess the python statsd client sends stats slightly faster than ruby.
  • I then wrote a very basic python UDP listener/logger, running on the same port as statsd on the server. It received all messages without any packet drops.

So I suspect that it might be specific to node.js / statsd after all - maybe it's purely a question of how fast the server can process every single datagram, and then move on to processing the next one?
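For reference, the throwaway Python listener mentioned above was nothing fancy; roughly along these lines (the port and logging are placeholders for whatever the test needs):

```python
import socket

PORT = 8125  # same UDP port that statsd listens on

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", PORT))

received = 0
while True:
    data, addr = sock.recvfrom(4096)  # statsd datagrams are small
    received += 1
    # log every datagram so the count can be compared with the sender
    print(received, addr[0], data.decode("utf-8", "replace"))
```

The tight recvfrom loop drains the socket very quickly, which may be why it saw every message while statsd didn't.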

@gingerlime (Author)

Looks like I've found a solution/workaround!

Following this page, I increased the receive buffers and noticed some improvement. The recommended change was to 1MB, but that wasn't enough. Changing from the default of 224KB to 20MB seems to avoid any drops :)

root@statsd-host:~# sysctl net.core.rmem_default
net.core.rmem_default = 229376
root@statsd-host:~# sysctl -w net.core.rmem_default=20971520
net.core.rmem_default = 20971520

Not sure what potential side-effects this might have, or what the "sweet-spot" value is, but this seems to improve things considerably for this particular case. YMMV.
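One way to sanity-check that the new default actually took effect is to ask a freshly created UDP socket what receive buffer it got; on Linux a socket that never sets SO_RCVBUF itself should inherit net.core.rmem_default:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# On Linux this should print net.core.rmem_default (e.g. 20971520 after the
# change above) for a socket that doesn't override SO_RCVBUF itself.
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```

Also worth noting that sysctl -w doesn't survive a reboot; the value has to go into /etc/sysctl.conf (or a file under /etc/sysctl.d/) to stick.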

Thanks again for pointing me in the right direction @kppullin and hope this will be useful for others.
