Skip to content

observability for endpoints#1151

Merged
adamw merged 27 commits intomasterfrom
observability-for-endpoints
Apr 29, 2021
Merged

observability for endpoints#1151
adamw merged 27 commits intomasterfrom
observability-for-endpoints

Conversation

@kubinio123
Copy link
Copy Markdown
Contributor

metrics interceptor
prometheus metrics

@kubinio123 kubinio123 requested a review from adamw April 12, 2021 07:37
Comment thread build.sbt

// metrics

lazy val prometheusMetrics: ProjectMatrix = (projectMatrix in file ("metrics/prometheus-metrics"))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also needs to be added to aggregates in the beginning

def queryParameters: QueryParams

lazy val acceptsContentTypes: Either[String, Seq[ContentTypeRange]] = Accepts.parse(headers)
val timestamp: Deadline = Deadline.now
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that the right type to capture a timestamp?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe let's call it simply requestStart? It's not clear what this timestamp is

Copy link
Copy Markdown
Contributor Author

@kubinio123 kubinio123 Apr 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It states that it main purpose is for repeated attempts, but on the other hand it's has now which captures current nanoseconds time, and +/- methods which is handy for getting time diff.
I think most appropriate type here would be plain Long from System.nanoTime but it carries no time context, so maybe Duration ?

_ <- collectMetrics { case Metric(m, Some(onRequest), _, _) => onRequest(request, m) }
response <- next(None)
_ <- collectMetrics { case Metric(m, _, Some(onResponse), _) => onResponse(request, response, m) }
} yield response
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also probably somehow notify the metrics if the effect fails, that is, there's an exception that's being thrown?

that also depends, if the metrics interceptor should wrap the exception-handling one, or the other way round

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe MetricsInterceptor should be always called after ExceptionInterceptor?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, well ... unless there's an exception during metrics collection ;-) but maybe that should be handled in the server.

If metrics should be called first/last (around the exc interceptor), they have to be added to the front of the interceptors list

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another problem: the response is only really "done" (well not even then, but it's more "done" than currently) when the whole response is sent. In terms of akka-http/http4s: when the response stream is completed. I think we won't run away from needing some way of interpreter-specific integration, which would allow us to "plug into" the response stream, getting a callback once it is complete.

In other words - a ServerResponse can be created quickly, but with a stream that takes a long time to evaluate

Comment thread doc/server/options.md Outdated
`.appendInterceptor`.


## Observability
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move this to a separate top-level file

onResponse: Option[(ServerRequest, ServerResponse[_], M) => F[Unit]] = None,
onDecodeFailure: Option[(ServerRequest, M) => F[Unit]] = None
) {
def onRequest(f: (ServerRequest, M) => F[Unit]): Metric[F, M] = copy(onRequest = Some(f))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should pass in here an implicit MonadError here, as in EndpointInterceptor - that way users don't have to provide one, when creating e.g. PrometheusMetrics. This is supposed to be mainly an internal thing (the MonadError)

.help("Unserved HTTP requests")
.labelNames("path", "method")
.register(registry)
).onDecodeFailure { (req, counter) => monad.unit(counter.labels(path(req), method(req)).inc()) }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decode failure is not a request failure. A single request can cause mulitple decode failures,until one endpoint matches and gives a decode success. So a failed request is rather one where the status code is non-2xx/3xx

monad.unit(histogram.labels(path(req), method(req), status(res)).observe((Deadline.now - req.timestamp).toMillis.toDouble / 1000.0))
}

private def path(request: ServerRequest): String = request.pathSegments.mkString("/")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't use the full request's path, but rather the endpoint's path template. That way, any dynamic url segments are replaced with path captures

Metric[F, Counter](
Counter
.build()
.namespace("tapir")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be somehow configurable - e.g. for most apps this namespace doesn't make sense. It would be best if each metric would coudl be configured taking in the request/endpoint, and yielding a set of labels & label values to apply. With sane defaults, of course, so that adding default metrics is just a couple lines of code

}
}

implicit val monad: FutureMonad = new FutureMonad()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our goal should be to be able to remove this from the example - that shouldn't be necessary for users to use the metrics (if possible - maybe not?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@kubinio123 kubinio123 linked an issue Apr 13, 2021 that may be closed by this pull request
@kubinio123 kubinio123 requested a review from adamw April 19, 2021 13:31
def queryParameters: QueryParams

lazy val acceptsContentTypes: Either[String, Seq[ContentTypeRange]] = Accepts.parse(headers)
val requestStart: Deadline = Deadline.now
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of adding this here, we could use a RequestInterceptor, capture the starting timestamp there, and create an EndpointInterceptor with it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should work, if this RequestInterceptor would be head of interceptors list, then EndpointInteceptor created by it would be called last which is what we need for metrics.
But requestStart would have to be added as next argument to Metrics callback.

@kubinio123 kubinio123 requested a review from adamw April 22, 2021 12:55
package sttp.tapir.server.interpreter

trait BodyListener[F[_], B] {
def onComplete(body: B)(cb: => F[Unit]): F[B]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it would be useful to also have an onFailure callback - so that we can also capture situations, where producing the body fails (which results in a broken connection, as the headers are already sent out)

this could be a separate callback, or a single one taking a Try[Unit] parameter. Maybe see the Future/Try API for an inspiration?


case class Metric[M](
metric: M,
onRequest: Option[(Endpoint[_, _, _, _], ServerRequest, M) => Unit] = None,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea: maybe a metric could be initialised per-request (in the request interceptor) using onRequest: (ServerRequest, M) => P, where P is the payload type. This payload would then be passed into onEndpointRequest: (Endpoint, ServerRequest, M, P) => F[Unit] & similarly to onEndpointResponse.

This would allow us to generalise over the quite specific Deadline parameter.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And another question: shouldn't the metric function return F[Unit], not Unit? Capturing a metric could be a side-effecting operation (it probably is ;) )

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe simpler, we could go back to the idea of "initialising" a metric on request start, which would produce a "proper" metric (a constant instance in most cases, but for capturing timings this would include the initial timestamp). So here we'd only have a function onRequest: (ServerRequest, M) => EndpointMetric

case class Metric[M](
metric: M,
onRequest: Option[(Endpoint[_, _, _, _], ServerRequest, M) => Unit] = None,
onResponse: Option[(Endpoint[_, _, _, _], ServerRequest, ServerResponse[_], Deadline, M) => Unit] = None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't Instant be a simpler choice over Deadline?

# Conflicts:
#	core/src/main/scala/sttp/tapir/server/interpreter/ServerInterpreter.scala
#	core/src/test/scala/sttp/tapir/server/interpreter/ServerInterpreterTest.scala
#	server/akka-http-server/src/main/scala/sttp/tapir/server/akkahttp/AkkaHttpServerOptions.scala
#	server/finatra-server/src/main/scala/sttp/tapir/server/finatra/FinatraServerInterpreter.scala
#	server/finatra-server/src/main/scala/sttp/tapir/server/finatra/FinatraServerOptions.scala
#	server/http4s-server/src/main/scala/sttp/tapir/server/http4s/Http4sServerOptions.scala
#	server/vertx/src/main/scala/sttp/tapir/server/vertx/VertxCatsServerOptions.scala
#	server/vertx/src/main/scala/sttp/tapir/server/vertx/VertxFutureServerOptions.scala
#	server/vertx/src/main/scala/sttp/tapir/server/vertx/VertxZioServerOptions.scala
}
} yield withMetrics

responseWithMetrics.handleError { case e: Exception =>
Copy link
Copy Markdown
Contributor Author

@kubinio123 kubinio123 Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onException metrics are collected here, so in withBodyOnComplete case Failure is left empty, since it could collect onException twice i think. If body is a stream which fails before the response is returned, then it could be collected in body listener and here in handleError if the exception is thrown?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both should be handled, as these are two different scenarios:

  1. the effect creating the ServerResponse completes successfully, yielding an instance of ServerResponse with a streaming body. However, the streaming body might fail when streamed to the client.
  2. the above effect fails, so we never even get a ServerResponse

Note that a request can be considered "complete" only after 1. successfully finishes (the stream is done). But it can fail either with 1. or 2., but not with both.

}

private def collectMetrics(pf: PartialFunction[EndpointMetric[F], F[Unit]])(implicit monad: MonadError[F]): F[Unit] = {
def collect(metrics: List[EndpointMetric[F]]): F[Unit] = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's usually called sequence (a function List[F[T]] => F[List[T]] - which we kind of have here)

)(implicit monad: MonadError[F], bodyListener: BodyListener[F, B]): F[Option[ServerResponse[B]]] = {
if (ignoreEndpoints.contains(ctx.endpoint)) endpointHandler.onDecodeFailure(ctx)
else {
ctx.failure match {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should attach run the metrics not based on the nature of the decode failure, but basing on its handling (we don't really know if a mismatch will be handled differently than other errors). So if downstream returns a response, then we need to run the on request & on response callbacks

def onComplete(body: B)(cb: Try[Unit] => F[Unit]): F[B]
}

object BodyListenerSyntax {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you add the implicit in the companion object (BodyListener), then you won't have to import the syntax, it will be automatically searched

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did, and it still requires both

import sttp.tapir.server.interpreter.BodyListener
import sttp.tapir.server.interpreter.BodyListener._

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I thought that this is in the implicit scope. But apparently not :)

is match {
case Nil => (request: ServerRequest) => firstNotNone(request, ses, eisAcc.reverse)
case Nil =>
new RequestHandler[F, B] {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it would make sense to add a factory method to RequestHandler which would accept a plain Request => F[Response] function?

val collectorRegistry = CollectorRegistry.defaultRegistry

// Metric for counting responses labeled by path, method and status code
val responsesTotal = Metric[Future, Counter](
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the example we should show how to use default metrics, and move the custom metric creation to the docs

m.unit {
EndpointMetric()
.onRequest { ep => m.eval(gauge.labels(labels.forRequest(ep, req): _*).inc()) }
.onResponse { (ep, _) => m.eval(gauge.labels(labels.forRequest(ep, req): _*).dec()) }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should decrease on exception as well

Comment thread doc/server/observability.md Outdated
`Metric` wraps an aggregation object (like counter or gauge), and requires a function
returning `EndpointMetric` with metric collection logic. There are three callbacks in `EndpointMetric`:

1. `onRequest` - called after successful request decoding
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to describe the two-stage onRequest (with metric initialisation)

Comment thread doc/server/observability.md Outdated
})
```

Labels for default metrics can be customized, any attribute from `Endpoint`, `ServerRequest` and `ServerResponse` could
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to mention what are the default labels (with a path template, status code etc.)

Copy link
Copy Markdown
Contributor Author

@kubinio123 kubinio123 Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned that couple lines above, before prometheus example, but I'll move it here

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to give an example of what the labels will be

onRequest = { (req, counter, m) =>
m.unit {
EndpointMetric()
.onResponse { (ep, res) => m.eval(counter.labels(labels.forRequest(ep, req) ++ labels.forResponse(res): _*).inc()) }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I guess onException should be counted as a 500?


import scala.util.{Failure, Success, Try}

class Http4SBodyListener[F[_], G[_]](gToF: G ~> F)(implicit m: MonadError[G], a: Applicative[F])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] name: Http4sBodyListener (small s :) )

.body("""{"invalid":"body",}""")
.send(backend)
.map { _ =>
wait(200) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have to wait for the counters to be updated?

Copy link
Copy Markdown
Contributor Author

@kubinio123 kubinio123 Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without that it sometimes fail for akka server, maybe transformDataBytes where the callback is executed on Future[Done] completion is called after response is returned, but it shouldn't right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, onComplete can run asynchronously, after the request completes and is sent. Hm... maybe we can use eventuallyhere? I wouldn't want to always sleep 200ms, this can be a cause of flaky tests

}

object ServerMetricsTest {
class Counter(var value: Int = 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should be an AtomicInteger as it can be updated from multiple threads

m.unit {
{ (ctx: RoutingContext) =>
body {
ctx.addBodyEndHandler(_ => cb(Success(())))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no way to intercept an exception?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addBodyEndHandler callback will not be executed on exception, there's addEndHandler which will be executed but will be called when the response is disposed or an exception has been encountered to allow consistent cleanup, so it's not the moment we want to capture here right, since it has to be a body failure?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we use both, one for success, another for error?

val in_input_stream_out_input_stream: Endpoint[InputStream, Unit, InputStream, Any] =
endpoint.post.in("api" / "echo").in(inputStreamBody).out(inputStreamBody).name("echo input stream")

val in_empty_out_empty: Endpoint[Unit, Unit, Unit, Any] = endpoint.post.in("api" / "empty").in(emptyInput).out(emptyOutput)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this just endpoint.post.in("api" / "empty")? Also, maybe we can simply use e.g. in_root_path

@adamw adamw merged commit 5f456a2 into master Apr 29, 2021
@mergify mergify Bot deleted the observability-for-endpoints branch April 29, 2021 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Observability for endpoints

2 participants