
add RetryAttempts to AccessLog in JSON format #1793

Merged (1 commit into traefik:master on Aug 28, 2017)

Conversation

@mjeri (Contributor) commented Jun 27, 2017

This PR adds the number of retries that happened to the access log. This information is very useful for us when debugging requests.

For now this PR only adds the retry attempts to the JSON format. I am happy to extend the Common Log Format as well, but I am not an expert in CLF, so I am not sure how to add this information meaningfully without breaking backwards compatibility for parsers.
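
For illustration, a JSON access log entry carrying the new field could look roughly like the following; all field names besides RetryAttempts are indicative of the JSON access log format rather than an exact excerpt:

{
	"DownstreamStatus": 200,
	"Duration": 182000000,
	"RequestMethod": "GET",
	"RequestPath": "/some/path",
	"RetryAttempts": 2,
	"StartUTC": "2017-06-27T10:15:04Z"
}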

@ldez added the area/logs and kind/enhancement (a new or improved feature) labels on Jun 27, 2017
@traefik deleted a comment from mjeri on Jun 27, 2017
@ldez (Member) commented Jun 27, 2017

@marco-jantke please rebase to fix CI.

@mjeri (Contributor Author) commented Jun 28, 2017

@ldez still some error :/

@ldez (Member) commented Jun 28, 2017

You missed your rebase, try again 😉

@mjeri force-pushed the retries-in-access-log branch 2 times, most recently from 2d804b2 to 663b3cb on June 28, 2017 08:16
@mjeri (Contributor Author) commented Jun 28, 2017

Sorry for that one. I was just amending because I thought I only had to trigger the build again.

BTW now I have to run the make targets like DOCKER_VERSION=17.03.1 make validate to get them working locally. Do you want to improve the local workflow here again or add some documentation?

@ldez (Member) commented Jun 28, 2017

You don't need to add DOCKER_VERSION=17.03.1: the Dockerfile already contains a DOCKER_VERSION with the right version.

@mjeri (Contributor Author) commented Jul 12, 2017

Any feedback for this one?

@emilevauge (Member) commented:

@marco-jantke honestly, I'm a bit concerned with this one. It seems it solves a relatively rare use case and makes server.go more and more complex. @containous/traefik WDYT?

@timoreimann (Contributor) commented:

@emilevauge disclaimer: I'm obviously somewhat biased as I work with Marco. Still, I hope I can give some motivation stemming from our real-world experience with a large-ish proxy setup.

We are currently running large parts of our micro-services infrastructure off of HAProxy, where support for retry indicators at the access-log level has turned out to be very helpful for us. When you run hundreds of HTTP-driven services, connectivity can fail for numerous reasons: services could be slow, fail intermittently, call broken endpoints/paths, fail to terminate or bootstrap in an orderly way, or just have bugs, all potentially leading to retries. On the other end, the infrastructure could have problems: proxies being overloaded, the network being slow, net-splits happening, and a myriad of other potential things. We can't afford to expose ephemeral failures to the client right away, so we have retries enabled by default. At the same time, we strive to minimize the number of retries we see, because of the latency they impose and because a higher-than-usual number is a signal of some underlying issue.

We use monitoring (namely, Prometheus) to detect retry anomalies. More often than not, however, it's impossible to understand why retries took place without analyzing on the access log level. For instance, we are regularly able to correlate retry counters with request latency at the various stages of a request to see whether a request, say, failed on the initial connection attempt, or it took the server too long to serve the first byte. Another example that we have seen is that some service endpoints caused errors while others did not, leading to retries in one case but not the other. With HAProxy, we can put the necessary pieces together since we have all relevant information presented in the access log.

Moving such information into a monitoring system is usually not feasible: You cannot track highly dynamic aspects of your requests in something like Prometheus because the number of time series (and with it, your monitoring system memory) would quickly explode. We happen to have first-hand experience with such scenarios (both intentional and by accident).

I hope that provides some insights. Let me know what you think.

@emilevauge (Member) commented Jul 20, 2017

@timoreimann @marco-jantke I'm not saying this feature is not useful; I'm saying that the implementation is a bit too complex and may need more work, specifically this piece of code:

if globalConfiguration.Retry != nil {
	metricsRetryListener := middlewares.NewMetricsRetryListener(metrics)

	if server.accessLoggerMiddleware != nil {
		saveRetries := &accesslog.SaveRetries{}
		retryListeners := middlewares.RetryListeners{metricsRetryListener, saveRetries}
		lb = registerRetryMiddleware(lb, globalConfiguration, configuration, frontend.Backend, retryListeners)
		lb = saveRetries.AsHandler(lb)
	} else {
		lb = registerRetryMiddleware(lb, globalConfiguration, configuration, frontend.Backend, metricsRetryListener)
	}
}

@marco-jantke do you think we could refactor this part?

@mjeri (Contributor Author) commented Jul 20, 2017

I will give it a try :)

@mjeri (Contributor Author) commented Jul 20, 2017

@emilevauge it was worth the reinvestigation. I realised that the functionality was actually not working properly with the former implementation: during benchmarking with concurrent requests, it was mixing up the RetryAttempts I was writing into the logs between requests and often reset the counter to 0 even though the request actually had retries. To get this working, I changed the RetryListener method Retried to also receive the current request. This way we can ensure that the proper RetryAttempts information is always written for each request's access log entry. Additionally, it made the setup in server.go substantially clearer and more straightforward.
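
For context, here is a minimal sketch of the changed listener signature, assuming the fan-out happens through the RetryListeners slice shown later in the review; the logData/logDataKey types and the body of SaveRetries are illustrative stand-ins, not an exact excerpt from the PR:

package middlewares

import "net/http"

// RetryListener is notified about every retry attempt of a request. Passing
// the request lets a listener attribute the attempt count to the correct
// request, even when many requests are retried concurrently.
type RetryListener interface {
	Retried(req *http.Request, attempt int)
}

// RetryListeners holds a list of RetryListener and informs each of them
// about a retry attempt.
type RetryListeners []RetryListener

// Retried exists to implement the RetryListener interface. It calls Retried
// on each of its slice entries.
func (l RetryListeners) Retried(req *http.Request, attempt int) {
	for _, listener := range l {
		listener.Retried(req, attempt)
	}
}

// logData and logDataKey are illustrative stand-ins for the per-request
// access-log data that the access logger places into the request context
// before the retry middleware runs.
type logData struct {
	RetryAttempts int
}

type logDataKey struct{}

// SaveRetries records the latest retry attempt count in the request-scoped
// log data so the access logger can later emit it as RetryAttempts.
type SaveRetries struct{}

// Retried implements RetryListener for the access log.
func (s *SaveRetries) Retried(req *http.Request, attempt int) {
	if data, ok := req.Context().Value(logDataKey{}).(*logData); ok {
		data.RetryAttempts = attempt
	}
}

Because the request now travels with every notification, each listener can attach the attempt count to request-scoped data instead of sharing a single mutable counter across concurrent requests.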

@emilevauge @timoreimann In my last commit eff6596 I dropped the unit test for the registerRetryMiddleware function: I simplified it, the current layout of the method does not allow for easy testing, and the previous test basically verified only one thing, namely that the number of attempts on the retry middleware is configured properly. WDYT about this step? I would dare to leave it as is, as the code and functionality are rather trivial.

@mjeri (Contributor Author) commented Aug 14, 2017

Can I do something to bring this forward?

@timoreimann (Contributor) commented:

@marco-jantke @emilevauge is this still in the WIP state? If not, we should open it up for review.

@mjeri (Contributor Author) commented Aug 14, 2017

The latest state is that, after Emile requested some refactoring following his design review, I implemented some changes to improve the design of the code. So now it should get another round of design review, and then we can move on AFAICS.

@timoreimann (Contributor) commented:

Seemingly your turn again, @emilevauge. :-)

@ldez added the size/M label on Aug 16, 2017
@emilevauge (Member) commented:

I think it's a lot better :) Thanks @marco-jantke !
Design LGTM

saveRetries := &SaveRetries{}

logDataTable := &LogData{Core: make(CoreLogData)}
req := httptest.NewRequest("GET", "/some/path", nil)

Review comment (Contributor):

s/"GET"/http.MethodGet/

By now I believe you're challenging me :)

Reply (Contributor Author):

Haha, I would never dare! Consider that I added this code a long time ago.

-retryListener.Retried(1)
-retryListener.Retried(2)
+retryListener.Retried(req, 1)
+retryListener.Retried(req, 2)

Review comment (Contributor):

Shouldn't we also somehow expand our test expectation / assertion logic when we extend the system-under-test?

Reply (Contributor Author):

In general yes, but in this case we did not expand the SUT. It only gets the request as a parameter now, but it doesn't do anything with it. The request was introduced for the access log implementation of the RetryListener interface.

// each of them about a retry attempt.
type RetryListeners []RetryListener

// Retried is there to implement the RetryListener interface. It calls Retried on each of its slice entries.

Review comment (Contributor):

nit-pick: I'd suggest saying exists instead of is there.

Reply (Contributor Author):

Thanks, I am always happy about formulation improvements :)

@@ -451,74 +451,6 @@ func TestNewMetrics(t *testing.T) {
}
}

func TestRegisterRetryMiddleware(t *testing.T) {

Review comment (Contributor):

I suppose this test is now superseded by your new tests?

Reply (Contributor Author):

Regarding the test I left a comment in the second paragraph of my comment above #1793 (comment). Please take a look.

@mjeri (Contributor Author) commented Aug 22, 2017

@timoreimann thanks for review round no. 1. I implemented the feedback and also did a rebase/squash now, because the branch was quite outdated and squashing made the rebase easier.

@timoreimann (Contributor) left a comment:

LGTM, thanks!

@emilevauge (Member) commented Aug 25, 2017

@timoreimann As there are a lot of conflicts, I think a LGTM today is a bit optimistic ;)

[screenshot from 2017-08-25 17-30-55 showing the merge conflicts]

@timoreimann (Contributor) commented:

@emilevauge I trust in @marco-jantke to ask for a revalidating LGTM if the rebase turns out to be more complicated. :)

(Besides, we have never bothered giving LGTM in the past despite potential conflicts.)

@mjeri (Contributor Author) commented Aug 28, 2017

Rebased this PR now onto the latest master. The conflicts are all resolved and were expected, but some merging was required.

@emilevauge (Member) left a comment:

Great job @marco-jantke !
LGTM

@ldez (Member) left a comment:

LGTM

@traefiker merged commit dae7e7a into traefik:master on Aug 28, 2017
@mjeri deleted the retries-in-access-log branch on August 28, 2017 11:59

Labels: area/logs, kind/enhancement (a new or improved feature), size/M
Projects: none yet
5 participants