Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(observability): micrometer telemetry provider #1089

Merged

Conversation

monosoul
Copy link
Contributor

@monosoul monosoul commented May 2, 2024

This is a PoC telemetry provider implementation using micrometer

Issue #

Fixes #1087

Description of changes

Introduces telemetry provider using micrometer as a backend

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@monosoul monosoul requested a review from a team as a code owner May 2, 2024 23:09
Copy link
Contributor

@ianbotsf ianbotsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great start. I've left several comments/questions on the diff itself but I also note there's no new tests for verifying correctness. Please add additional tests to cover the new provider.

Comment on lines 12 to 20
@ExperimentalApi
public class MicrometerTelemetryProvider(
meterRegistry: MeterRegistry,
override val loggerProvider: LoggerProvider = GlobalTelemetryProvider.instance.loggerProvider,
) : TelemetryProvider {
override val meterProvider: MeterProvider = MicrometerMeterProvider(meterRegistry)
override val tracerProvider: TracerProvider = TracerProvider.None
override val contextManager: ContextManager = ContextManager.None
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Micrometer supports more than just metrics, it also supports tracing (either directly or via bridges to other providers like OTel and Brave) and context propagation—which seem like natural fits for new TracingProvider and ContextManager implementations. Have you looked yet at what it would take to add support for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At my org we use another tool for tracing, so atm I'm only interested in metrics


@ExperimentalApi
public class MicrometerTelemetryProvider(
meterRegistry: MeterRegistry,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Would it make sense for meterRegistry to default to Metrics.globalRegistry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, why not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal class MicrometerMeterProvider(meterRegistry: MeterRegistry) : MeterProvider {
private val meter = MicrometerMeter(meterRegistry)

override fun getOrCreateMeter(scope: String): Meter = meter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: The scope parameter is intended to delineate which component of the SDK is under observation by specific meters. For instance, meters related to individual calls are given the scope aws.sdk.kotlin.<service-name> whereas meters related to the default HTTP engine are given the scope aws.smithy.kotlin.runtime.http.engine.okhttp. Could the scope be used here, perhaps as tags attached to newly-created meters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +98 to +109
private class MicrometerUpDownCounter(
private val meterMetadata: MeterMetadata,
private val meterRegistry: MeterRegistry,
) : UpDownCounter {
override fun add(value: Long, attributes: Attributes, context: Context?) {
meterMetadata
.counter()
.tags(attributes.toTags())
.register(meterRegistry)
.increment(value.toDouble())
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness: The UpDownCounter type accepts both increments and decrements (e.g., something like queue length which could rise/fall over time). Micrometers' Counter seems to accept only positive increments. What will happen if MicrometerUpDownCounter.add gets called with a negative value? If the underlying Micrometer implementation would throw an exception or otherwise misbehave then it's better to make UpDownCounter a no-op implementation (i.e., createUpDownCounter returns a dummy object that does nothing when add is called).

Copy link
Contributor Author

@monosoul monosoul Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would work fine, you can use this snippet to check it yourself with a Kotlin scratch file in IntelliJ:

import io.micrometer.core.instrument.simple.SimpleMeterRegistry

val registry = SimpleMeterRegistry()
val counter = registry.counter("test.counter")

counter.increment(3.0)
println(counter.count())

counter.increment(-2.0)
println(counter.count())

Will produce:

3.0
1.0

@monosoul monosoul requested a review from ianbotsf June 20, 2024 13:28
Copy link
Contributor

@ianbotsf ianbotsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with this a bit today and ran some sample tests using the new Micrometer telemetry provider with the Cloudwatch backend. I'm seeing metrics named with suffixes that aren't specified in the code. For instance, during a simple S3 ListBuckets call, I saw the following metrics emitted into Cloudwatch:

  • smithy.client.call.attempt_duration.count
  • smithy.client.call.attempt_duration.sum
  • smithy.client.call.attempt_overhead_duration.count
  • smithy.client.call.attempt_overhead_duration.sum
  • smithy.client.call.attempts.count
  • smithy.client.call.auth.resolve_identity_duration.count
  • smithy.client.call.auth.resolve_identity_duration.sum
  • smithy.client.call.auth.signing_duration.count
  • smithy.client.call.auth.signing_duration.sum
  • smithy.client.call.deserialization_duration.count
  • smithy.client.call.deserialization_duration.sum
  • smithy.client.call.duration.count
  • smithy.client.call.duration.sum
  • smithy.client.call.resolve_endpoint_duration.count
  • smithy.client.call.resolve_endpoint_duration.sum
  • smithy.client.call.serialization_duration.count
  • smithy.client.call.serialization_duration.sum

Note that every metric name is suffixed with either .count or .sum but the metrics emitted by the SDK into telemetry providers don't include those.

Furthermore, none of those emitted metrics showed any non-zero values except for smithy.client.call.attempts.count. See image below and/or this CSV file.

micrometer-metrics-test-output

I'm not familiar enough with Micrometer to understand what's going wrong but I'm hoping you have some insight @monosoul!

My test code
val cwConfig = object : CloudWatchConfig {
    override fun get(s: String) = null
    override fun namespace() = "ian.micrometer.test"
}
val registry = CloudWatchMeterRegistry(cwConfig, Clock.SYSTEM, CloudWatchAsyncClient.create())
val telProvider = MicrometerTelemetryProvider(registry)
S3Client.fromEnvironment {
    telemetryProvider = telProvider
}.use { s3 ->
    s3.listBuckets()
}
registry.close()

@ianbotsf
Copy link
Contributor

@monosoul have you had a chance to dig deeper into the metric naming issues?

@monosoul
Copy link
Contributor Author

@ianbotsf sorry, I was away for a few weeks.

Metrics with count suffix represent the number of times the counter got incremented, while the ones with sum suffix represent the actual value of the counter.

@monosoul
Copy link
Contributor Author

monosoul commented Jul 24, 2024

@ianbotsf should there be any values for metrics other than client.call.attempts though? 🤔

I used this snippet to try it locally (uses testcontainers to spin up localstack instance):

import aws.sdk.kotlin.runtime.auth.credentials.StaticCredentialsProvider
import aws.sdk.kotlin.services.s3.S3Client
import aws.sdk.kotlin.services.s3.createBucket
import aws.smithy.kotlin.runtime.ExperimentalApi
import aws.smithy.kotlin.runtime.net.url.Url
import aws.smithy.kotlin.runtime.telemetry.micrometer.MicrometerTelemetryProvider
import io.micrometer.core.instrument.Clock
import io.micrometer.core.instrument.logging.LoggingMeterRegistry
import io.micrometer.core.instrument.logging.LoggingRegistryConfig
import kotlinx.coroutines.runBlocking
import org.testcontainers.containers.localstack.LocalStackContainer
import org.testcontainers.utility.DockerImageName
import java.time.Duration

fun main(): Unit = runBlocking {

    val localstack = LocalStackContainer(DockerImageName.parse("localstack/localstack:3.5.0"))
        .withServices(LocalStackContainer.Service.S3)
        .also { it.start() }

    val config = object : LoggingRegistryConfig {
        override fun get(p0: String): String? = null
        override fun step(): Duration = Duration.ofSeconds(1)
    }
    val registry = LoggingMeterRegistry(config, Clock.SYSTEM)
    val telProvider = MicrometerTelemetryProvider(registry)

    S3Client.fromEnvironment {
        endpointUrl = Url.parse(localstack.getEndpointOverride(LocalStackContainer.Service.S3).toString())
        credentialsProvider = StaticCredentialsProvider {
            accessKeyId = localstack.accessKey
            secretAccessKey = localstack.secretKey
        }
        region = localstack.region
        telemetryProvider = telProvider
        httpClient {
            telemetryProvider = telProvider
        }
    }.use { s3 ->
        s3.createBucket {
            bucket = "mybucket"
        }
        s3.listBuckets()
    }

    registry.stop()
    Thread.sleep(2000)
    registry.close()
}

Here's what I got as the output:

14:51:56.110 [logging-metrics-publisher] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.http.connections.limit{scope=aws.smithy.kotlin.runtime.http.engine.okhttp} value=0 {connection} | MDC: []
14:51:56.110 [logging-metrics-publisher] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.http.connections.usage{scope=aws.smithy.kotlin.runtime.http.engine.okhttp,state=acquired} value=1 {connection} | MDC: []
14:51:56.110 [logging-metrics-publisher] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.http.requests.limit{scope=aws.smithy.kotlin.runtime.http.engine.okhttp} value=0 {request} | MDC: []
14:51:56.110 [logging-metrics-publisher] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.http.requests.usage{scope=aws.smithy.kotlin.runtime.http.engine.okhttp,state=queued} value=0 {request} | MDC: []
14:51:57.108 [logging-metrics-publisher] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.http.connections.limit{scope=aws.smithy.kotlin.runtime.http.engine.okhttp} value=0 {connection} | MDC: []
14:51:57.108 [logging-metrics-publisher] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.http.connections.usage{scope=aws.smithy.kotlin.runtime.http.engine.okhttp,state=acquired} value=1 {connection} | MDC: []
14:51:57.108 [logging-metrics-publisher] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.http.requests.limit{scope=aws.smithy.kotlin.runtime.http.engine.okhttp} value=0 {request} | MDC: []
14:51:57.108 [logging-metrics-publisher] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.http.requests.usage{scope=aws.smithy.kotlin.runtime.http.engine.okhttp,state=queued} value=0 {request} | MDC: []
14:51:59.172 [main] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.call.attempts{rpc.method=CreateBucket,rpc.service=S3,scope=aws.sdk.kotlin.services.s3} throughput=1 {attempt}/s | MDC: []
14:51:59.172 [main] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.call.attempts{rpc.method=ListBuckets,rpc.service=S3,scope=aws.sdk.kotlin.services.s3} throughput=1 {attempt}/s | MDC: []
14:51:59.172 [main] INFO  i.m.c.i.logging.LoggingMeterRegistry - smithy.client.http.bytes_received{scope=aws.smithy.kotlin.runtime.http.engine.okhttp,server.address=127.0.0.1:32783} throughput=369 By/s | MDC: []

So http client metrics get written just fine, not sure why we only get smithy.client.call.attempts though. Could it be an issue where those other metrics are written? 🤔

UPD: nvm, there was a missing call: aba4715

@monosoul
Copy link
Contributor Author

@ianbotsf I can add tests as well if you're fine with the current implementation

@ianbotsf
Copy link
Contributor

UPD: nvm, there was a missing call: aba4715

Looks like that was it. With the latest commits I now see the full suite of expected metrics now from my test code!

I can add tests as well if you're fine with the current implementation

Yes, that sounds great. Thank you!

@monosoul monosoul requested a review from ianbotsf August 8, 2024 12:51
@monosoul
Copy link
Contributor Author

monosoul commented Aug 8, 2024

@ianbotsf I've added tests and resolved conflicts with main branch, should be good to merge now

Copy link
Contributor

@ianbotsf ianbotsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I'll let @lauzadis chime in as well and then we can get this shipped,

Copy link
Contributor

@lauzadis lauzadis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for the contribution!

@aws-sdk-kotlin-ci aws-sdk-kotlin-ci merged commit 6ad0352 into smithy-lang:main Aug 8, 2024
13 of 14 checks passed
@ianbotsf
Copy link
Contributor

ianbotsf commented Aug 8, 2024

OK this has been merged to main and should be available in the next release, tentatively scheduled for tomorrow. Thanks again for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Telemetry provider implementation for micrometer
4 participants