Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent duplicate HTTP server observations for cancelled exchanges #31417

Closed
bclozel opened this issue Oct 12, 2023 · 0 comments
Closed

Prevent duplicate HTTP server observations for cancelled exchanges #31417

bclozel opened this issue Oct 12, 2023 · 0 comments
Assignees
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) theme: observability An issue related to observability and tracing type: bug A general bug
Milestone

Comments

@bclozel
Copy link
Member

bclozel commented Oct 12, 2023

In some specific cases, HTTP clients can eagerly close connections with the server right after the response has been received, but before the HTTP exchange is considered complete. This can result in duplicate observations being recorded, one with the "SUCCESSFUL" outcome and another one with the "UNKNOWN" outcome (as the exchange was cancelled).

This can be reproduced with the following:

  • a web handler that introduces some delay and schedules the handling on a different thread
  • a client that closes connections eagerly (for example, this has been reproduced locally with apache bench with a 1/10K ratio)
  • a low latency setup

"Duplicate metrics" can be explained by:

  • both "COMPLETE" and "CANCEL" reactive streams signals racing
  • the reactive streams spec not preventing this case, so doOnCancel can be called after doOnTerminate
  • the Micrometer Observation API not preventing multiple Observation#stop() calls; this means observation handlers are called multiple times

We cannot ignore CANCEL signals in our instrumentation, as this would leak started observations and would not count all valid cases of cancellations. We should instead refine our instrumentation with the Mono#tap operator and locally guard against this case.

This needs to be applied on the reactive ServerHttpObservationFilter and the HttpWebHandlerAdapter instrumentation that replaces it in Spring Framework 6.1.

@bclozel bclozel added in: web Issues in web modules (web, webmvc, webflux, websocket) type: bug A general bug theme: observability An issue related to observability and tracing labels Oct 12, 2023
@bclozel bclozel added this to the 6.0.14 milestone Oct 12, 2023
@bclozel bclozel self-assigned this Oct 12, 2023
bclozel added a commit that referenced this issue Oct 13, 2023
Since the Spring WebFlux HTTP server instrumentation has been moved from
the `WebFilter` to the `HttpWebHandlerAdapter`, we need to apply similar
changes there.

See gh-31417
bclozel added a commit that referenced this issue Nov 29, 2023
Prior to this commit, regressions were introduced with gh-31417:

1. the observation keyvalues would be inconsistent with the HTTP response
2. the observation scope would not cover all controller handlers, causing
  traceIds to be missing

The first issue is caused by the fact that in case of error signals, the
observation was stopped before the response was fully committed, which
means further processing could happen and update the response status.
This commit delays the stop event until the response is committed in
case of errors.

The second problem is caused by the change from a `contextWrite`
operator to using the `tap` operator with a `SignalListener`. The
observation was started in the `doOnSubscription` callback, which is too
late in some cases. If the WebFlux controller handler is synchronous
non-blocking, the execution of the handler is performed before the
subscription happens. This means that for those handlers, the
observation was not started, even if the current observation was
present in the reactor context. This commit changes the
`doOnSubscription` to `doFirst` to ensure that the observation is
started at the right time.

Fixes gh-31703
Fixes gh-31706
bclozel added a commit that referenced this issue Nov 29, 2023
Prior to this commit, regressions were introduced with gh-31417:

1. the observation keyvalues would be inconsistent with the HTTP response
2. the observation scope would not cover all controller handlers, causing
  traceIds to be missing

The first issue is caused by the fact that in case of error signals, the
observation was stopped before the response was fully committed, which
means further processing could happen and update the response status.
This commit delays the stop event until the response is committed in
case of errors.

The second problem is caused by the change from a `contextWrite`
operator to using the `tap` operator with a `SignalListener`. The
observation was started in the `doOnSubscription` callback, which is too
late in some cases. If the WebFlux controller handler is synchronous
non-blocking, the execution of the handler is performed before the
subscription happens. This means that for those handlers, the
observation was not started, even if the current observation was
present in the reactor context. This commit changes the
`doOnSubscription` to `doFirst` to ensure that the observation is
started at the right time.

Fixes gh-31715
Fixes gh-31716
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) theme: observability An issue related to observability and tracing type: bug A general bug
Projects
None yet
Development

No branches or pull requests

1 participant