
Does Veneur retry failed datadog flushes? #560

Closed
volfco opened this issue Oct 10, 2018 · 6 comments

volfco commented Oct 10, 2018

I'm not that good at Go, so I couldn't find the answer in the code myself.

Does Veneur retry failed Datadog flushes? I've seen a fair number of flushes fail due to HTTP errors (timeouts), and I'm wondering whether those are just warnings or actual data loss.
...: time="2018-10-10T10:22:44-05:00" level=warning msg="Could not execute request" action=flush error="net/http: request canceled (Client.Timeout exceeded while awaiting headers)" host=app.datadoghq.com path=/api/v1/series

Are these requests retried? And if so, how many retries before the segment is lost?

@ChimeraCoder (Contributor)

Veneur itself doesn't retry flushes to Datadog (though you could use an HTTP proxy for that, if you wanted). The entire pipeline is assumed to be mildly lossy, given that metrics are themselves received over UDP, which provides no delivery guarantees. Sporadic, occasional metric failures are tolerated.
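To make the lossiness concrete, here's a minimal sketch (not Veneur's own code) of a client emitting a DogStatsD-format counter over UDP; the address and port are assumptions for illustration, and note that nothing ever acknowledges delivery:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Assumed local Veneur UDP listener; adjust the address to your config.
	conn, err := net.Dial("udp", "127.0.0.1:8126")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// DogStatsD wire format: name:value|type|#tags
	// UDP sends are fire-and-forget, so a dropped packet is silently lost.
	if _, err := fmt.Fprintf(conn, "myapp.requests:1|c|#env:prod\n"); err != nil {
		panic(err)
	}
}
```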

That said, if you're seeing a lot of timeouts, something is probably up. We ourselves don't see many timeouts running Veneur at scale, so I'm curious what's going on here. Is your outbound network connection spotty? Are you sending a particularly large payload with each flush (a lot of metrics, or a long flush cycle)?


volfco commented Oct 10, 2018

We're seeing a small but sustained number of errors. I've got flush_max_per_body set to 25000, which is the default in the example config. I don't know if this is in line with what you're seeing, but it's 10 to 15 errors every 15 minutes across the various DCs I've deployed Veneur to.

[image: graph of flush errors, broken down by DC]

These are servers from all over the world going to AWS us-east-1, so I'm expecting some errors, just not sure how many.


volfco commented Oct 10, 2018

Digging into the native Datadog agent, it does look like it has some retry logic here: https://github.com/DataDog/datadog-agent/blob/d3e74927d78a5982d9978ed8540bd6b2c61ab437/pkg/forwarder/transaction.go#L144, under certain failure cases, namely request errors such as timeouts.
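For illustration, a hand-rolled version of that idea might look like the sketch below: retry the flush POST only when the request itself fails (timeouts, connection errors), and hand HTTP error statuses back to the caller untouched. This isn't Veneur's or the datadog-agent's actual code; the function name, attempt count, and backoff are placeholders.

```go
package flush

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithRetry retries a flush only on transport-level failures such as
// timeouts; a response with an error status is returned to the caller as-is.
func postWithRetry(client *http.Client, url string, body []byte, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Post(url, "application/json", bytes.NewReader(body))
		if err == nil {
			return resp, nil
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * time.Second) // crude linear backoff
	}
	return nil, fmt.Errorf("flush failed after %d attempts: %w", attempts, lastErr)
}
```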


ChimeraCoder commented Oct 10, 2018

Yeah, that's definitely not in line with what we've experienced. We're not using Datadog ourselves at the moment, so I can't compare against current data, but timeouts in Veneur are quite rare - less than one per day - except during a Datadog outage (and their status page is green right now).

Just to clarify: when you say this is from servers all around the world going to AWS us-east-1, is that based on tracing where app.datadoghq.com resolves to (us-east-1)?

We do use haproxy for external egress from our network, and haproxy has built-in retries. So it's possible we wouldn't have noticed connection timeouts within Veneur if haproxy was retrying and the success rate of the retried requests was high enough. As a quick test, I'd recommend routing requests through a proxy like haproxy and seeing if that fixes the issue.
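For what it's worth, a rough Go sketch of that quick test: point the flushing client's http.Client at the egress proxy, so the proxy (haproxy or anything similar), rather than Veneur, owns the retries. The proxy address below is a placeholder, not a recommendation.

```go
package main

import (
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Placeholder address for a local haproxy (or other HTTP proxy) instance.
	proxyURL, err := url.Parse("http://127.0.0.1:3128")
	if err != nil {
		panic(err)
	}

	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			// Outbound requests go through the proxy, which can apply its own
			// retry policy before the client ever sees a failure.
			Proxy: http.ProxyURL(proxyURL),
		},
	}

	resp, err := client.Get("https://app.datadoghq.com")
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}
```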


volfco commented Oct 10, 2018

Yep. Every DC we have resolves to us-east-1 ELBs. We're talking directly to Datadog without a proxy.

It seems the path forward is for me to add some basic retry logic to the Datadog requests. We're not moving away from Datadog anytime soon, so retry logic is desired. I've put a fair amount of effort into getting our dogstatsd pipeline reliable, so I'm not stopping now.


volfco commented Oct 10, 2018

Opened #561 with an I'm-new-to-Go-and-I-think-this-is-a-valid-fix fix.
