
add RetryAttempts to AccessLog in JSON format #1793

Merged (1 commit into traefik:master on Aug 28, 2017)

Conversation

@mjeri (Contributor) commented Jun 27, 2017

This PR adds the number of retries that happened to the access log. This information is very useful for us when debugging requests.

For now this PR only adds the retry attempts to the JSON format. I am happy to extend the Common Log Format as well, but I am not an expert in CLF, so I am not sure how to add this information meaningfully without breaking backwards compatibility for parsers.
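
For illustration, a JSON access log entry carrying the new field could look roughly like the following; all field names besides RetryAttempts are indicative of the JSON access log format rather than an exact excerpt:

{
	"DownstreamStatus": 200,
	"Duration": 182000000,
	"RequestMethod": "GET",
	"RequestPath": "/some/path",
	"RetryAttempts": 2,
	"StartUTC": "2017-06-27T10:15:04Z"
}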

@ldez added the area/logs and kind/enhancement (a new or improved feature) labels on Jun 27, 2017
@traefik deleted a comment from mjeri on Jun 27, 2017
@ldez (Member) commented Jun 27, 2017

@marco-jantke please rebase to fix CI.

@mjeri (Contributor Author) commented Jun 28, 2017

@ldez still some error :/

@ldez (Member) commented Jun 28, 2017

You missed your rebase, try again 😉

@mjeri force-pushed the retries-in-access-log branch 2 times, most recently from 2d804b2 to 663b3cb on June 28, 2017 08:16
@mjeri (Contributor Author) commented Jun 28, 2017

Sorry for that one. I was just amending because I thought I only had to trigger the build again.

BTW now I have to run the make targets like DOCKER_VERSION=17.03.1 make validate to get them working locally. Do you want to improve the local workflow here again or add some documentation?

@ldez (Member) commented Jun 28, 2017

You don't need to add DOCKER_VERSION=17.03.1: the Dockerfile already contains a DOCKER_VERSION with the right version.

@mjeri (Contributor Author) commented Jul 12, 2017

Any feedback for this one?

@emilevauge (Member) commented:

@marco-jantke honestly, I'm a bit concerned with this one. It seems it solves a relatively rare use case and makes server.go more and more complex. @containous/traefik WDYT?

@timoreimann (Contributor) commented:

@emilevauge disclaimer: I'm obviously somewhat biased as I work with Marco. Still, I hope I can give some motivation stemming from our real-world experience with a large-ish proxy setup.

We are currently running large parts of our micro-services infrastructure off of HAProxy, where support for retry indicators at the access-log level has turned out to be very helpful for us. When you run hundreds of HTTP-driven services, connectivity can fail for numerous reasons: services could be slow, fail intermittently, call broken endpoints/paths, fail to terminate or bootstrap in an orderly way, or just have bugs, all potentially leading to retries. On the other end, the infrastructure could have problems: proxies being overloaded, the network being slow, net-splits happening, and a myriad of other potential things. We can't afford to expose ephemeral failures to the client right away, so we have retries enabled by default. At the same time, we strive to minimize the number of retries we see, because of the latency they impose and because a higher-than-usual number is a signal of some underlying issue.

We use monitoring (namely, Prometheus) to detect retry anomalies. More often than not, however, it's impossible to understand why retries took place without analyzing on the access log level. For instance, we are regularly able to correlate retry counters with request latency at the various stages of a request to see whether a request, say, failed on the initial connection attempt, or it took the server too long to serve the first byte. Another example that we have seen is that some service endpoints caused errors while others did not, leading to retries in one case but not the other. With HAProxy, we can put the necessary pieces together since we have all relevant information presented in the access log.

Moving such information into a monitoring system is usually not feasible: You cannot track highly dynamic aspects of your requests in something like Prometheus because the number of time series (and with it, your monitoring system memory) would quickly explode. We happen to have first-hand experience with such scenarios (both intentional and by accident).

I hope that provides some insights. Let me know what you think.

@emilevauge (Member) commented Jul 20, 2017

@timoreimann @marco-jantke I'm not saying this feature is not useful; I'm saying that the implementation is a bit too complex and may need more work, specifically this piece of code:

if globalConfiguration.Retry != nil {
	metricsRetryListener := middlewares.NewMetricsRetryListener(metrics)

	if server.accessLoggerMiddleware != nil {
		saveRetries := &accesslog.SaveRetries{}
		retryListeners := middlewares.RetryListeners{metricsRetryListener, saveRetries}
		lb = registerRetryMiddleware(lb, globalConfiguration, configuration, frontend.Backend, retryListeners)
		lb = saveRetries.AsHandler(lb)
	} else {
		lb = registerRetryMiddleware(lb, globalConfiguration, configuration, frontend.Backend, metricsRetryListener)
	}
}

@marco-jantke do you think we could refactor this part?

@mjeri (Contributor Author) commented Jul 20, 2017

I will give it a try :)

@mjeri (Contributor Author) commented Jul 20, 2017

@emilevauge it was worth the reinvestigation. I realised that the functionality was actually not working properly with the former implementation: during benchmarking with concurrent requests, it was mixing up the RetryAttempts I was writing into the logs between requests and often reset the counter to 0 even though the request actually had retries. To get this working, I changed the RetryListener method Retried to also receive the current request. This way we can ensure that the proper RetryAttempts information is always written for each request's access log entry. Additionally, it made the setup in server.go substantially clearer and more straightforward.
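
For context, here is a minimal sketch of the changed listener signature, assuming the fan-out happens through the RetryListeners slice shown later in the review; the logData/logDataKey types and the body of SaveRetries are illustrative stand-ins, not an exact excerpt from the PR:

package middlewares

import "net/http"

// RetryListener is notified about every retry attempt of a request. Passing
// the request lets a listener attribute the attempt count to the correct
// request, even when many requests are retried concurrently.
type RetryListener interface {
	Retried(req *http.Request, attempt int)
}

// RetryListeners holds a list of RetryListener and informs each of them
// about a retry attempt.
type RetryListeners []RetryListener

// Retried exists to implement the RetryListener interface. It calls Retried
// on each of its slice entries.
func (l RetryListeners) Retried(req *http.Request, attempt int) {
	for _, listener := range l {
		listener.Retried(req, attempt)
	}
}

// logData and logDataKey are illustrative stand-ins for the per-request
// access-log data that the access logger places into the request context
// before the retry middleware runs.
type logData struct {
	RetryAttempts int
}

type logDataKey struct{}

// SaveRetries records the latest retry attempt count in the request-scoped
// log data so the access logger can later emit it as RetryAttempts.
type SaveRetries struct{}

// Retried implements RetryListener for the access log.
func (s *SaveRetries) Retried(req *http.Request, attempt int) {
	if data, ok := req.Context().Value(logDataKey{}).(*logData); ok {
		data.RetryAttempts = attempt
	}
}

Because the request now travels with every notification, each listener can attach the attempt count to request-scoped data instead of sharing a single mutable counter across concurrent requests.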

@emilevauge @timoreimann In my last commit eff6596 I dropped the unit test for the registerRetryMiddleware function: I simplified it, the current layout of the method does not allow for easy testing, and the previous test basically verified only one thing, namely that the number of attempts on the retry middleware is configured properly. WDYT about this step? I would dare to leave it as is, as the code and functionality are rather trivial.

@mjeri (Contributor Author) commented Aug 14, 2017

Can I do something to bring this forward?

@timoreimann (Contributor) commented:

@marco-jantke @emilevauge is this still in the WIP state? If not, we should open it up for review.

@mjeri (Contributor Author) commented Aug 14, 2017

The latest state is that, after Emile requested some refactoring following his design review, I implemented some changes to improve the design of the code. So now it should get another round of design review, and then we can move on AFAICS.

@timoreimann (Contributor) commented:

Seemingly your turn again, @emilevauge. :-)

@ldez added the size/M label on Aug 16, 2017
@emilevauge (Member) commented:

I think it's a lot better :) Thanks @marco-jantke !
Design LGTM

saveRetries := &SaveRetries{}

logDataTable := &LogData{Core: make(CoreLogData)}
req := httptest.NewRequest("GET", "/some/path", nil)

Review comment (Contributor):

s/"GET"/http.MethodGet/

By now I believe you're challenging me :)

Reply (Contributor Author):

Haha, I would never dare! Consider that I added this code a long time ago.

-retryListener.Retried(1)
-retryListener.Retried(2)
+retryListener.Retried(req, 1)
+retryListener.Retried(req, 2)

Review comment (Contributor):

Shouldn't we also somehow expand our test expectation / assertion logic when we extend the system-under-test?

Reply (Contributor Author):

In general yes, but in this case we did not expand the SUT. It only gets the request as a parameter now, but it doesn't do anything with it. The request was introduced for the access log implementation of the RetryListener interface.

// each of them about a retry attempt.
type RetryListeners []RetryListener

// Retried is there to implement the RetryListener interface. It calls Retried on each of its slice entries.

Review comment (Contributor):

nit-pick: I'd suggest saying exists instead of is there.

Reply (Contributor Author):

Thanks, I am always happy about formulation improvements :)

@@ -451,74 +451,6 @@ func TestNewMetrics(t *testing.T) {
}
}

func TestRegisterRetryMiddleware(t *testing.T) {

Review comment (Contributor):

I suppose this test is now superseded by your new tests?

Reply (Contributor Author):

Regarding the test I left a comment in the second paragraph of my comment above #1793 (comment). Please take a look.

@mjeri (Contributor Author) commented Aug 22, 2017

@timoreimann thanks for review round no. 1. I implemented the feedback and also did a rebase/squash now, because the branch was quite outdated and squashing made the rebase easier.

@timoreimann (Contributor) left a comment:

LGTM, thanks!

@emilevauge (Member) commented Aug 25, 2017

@timoreimann As there are a lot of conflicts, I think a LGTM today is a bit optimistic ;)

[screenshot from 2017-08-25 17-30-55 showing the merge conflicts]

@timoreimann (Contributor) commented:

@emilevauge I trust in @marco-jantke to ask for a revalidating LGTM if the rebase turns out to be more complicated. :)

(Besides, we have never bothered giving LGTM in the past despite potential conflicts.)

@mjeri (Contributor Author) commented Aug 28, 2017

Rebased this PR now onto the latest master. The conflicts are all resolved and were expected, but some merging was required.

@emilevauge (Member) left a comment:

Great job @marco-jantke !
LGTM

@ldez (Member) left a comment:

LGTM

@traefiker merged commit dae7e7a into traefik:master on Aug 28, 2017
@mjeri deleted the retries-in-access-log branch on August 28, 2017 11:59

Labels: area/logs, kind/enhancement (a new or improved feature), size/M
Projects: none yet
5 participants