Skip to content

Implement span started and live health metrics #7430

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

JonasKunz
Copy link
Contributor

@JonasKunz JonasKunz commented Jun 18, 2025

Implements the semantic conventions SDK health metrics otel.sdk.span.live and otel.sdk.span.endedotel.sdk.span.started metrics.

Follows the approach of #7265 that these metrics are currently disabled by default, but can be enabled via the SdkTracerProviderBuilder.

Some examples of insights gained from these metrics:

  • The effective span sampling rate
  • Detection of span-leaks due to missing end() calls (via a steadily increasing otel.sdk.span.live count)
  • Mis-configured / Mis-instrumentation detection indication via excessive span counts

Copy link

codecov bot commented Jun 18, 2025

Codecov Report

Attention: Patch coverage is 89.42308% with 11 lines in your changes missing coverage. Please review.

Project coverage is 89.84%. Comparing base (317c002) to head (24a7a0a).

Files with missing lines Patch % Lines
...sdk/trace/internal/SemConvSpanInstrumentation.java 89.55% 4 Missing and 3 partials ⚠️
...ava/io/opentelemetry/sdk/trace/SdkSpanBuilder.java 50.00% 1 Missing and 1 partial ⚠️
...emetry/sdk/trace/internal/SpanInstrumentation.java 50.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #7430      +/-   ##
============================================
+ Coverage     89.82%   89.84%   +0.02%     
- Complexity     6998     7026      +28     
============================================
  Files           798      801       +3     
  Lines         21199    21297      +98     
  Branches       2055     2062       +7     
============================================
+ Hits          19041    19135      +94     
- Misses         1496     1498       +2     
- Partials        662      664       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


@Test
void doNotCrash() {
Span span = new NonRecordingSpan(DUMMY_CONTEXT, mockRecording);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is pretty much a copy of PropagatedSpanTest.doNotCrash. I don't think it's very useful, but it keeps the CodeCov bot happy

@JonasKunz JonasKunz marked this pull request as ready for review June 18, 2025 12:19
@JonasKunz JonasKunz requested a review from a team as a code owner June 18, 2025 12:19
Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid stuff. One recommendation about renaming to *Instrumentation - since there's going to be more and more internal telemetry, we should try to standardize naming / patterns where we can.

Thanks!

* Sets the {@link MeterProvider} supplier used to collect self-monitoring metrics. If not set,
* uses {@link GlobalOpenTelemetry#getMeterProvider()}.
*/
public SdkTracerProviderBuilder setMeterProvider(Supplier<MeterProvider> meterProviderSupplier) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is the contract of other components, but I wonder if we may want to think bigger, and start accepting Supplier<OpenTelemetry> for internal monitoring. That would position us well if there are semantic conventions that involve emitting traces or logs or events for internal monitoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 74fdbae. I changed it just at the API-level, the implementation still directly uses Supplier<MeterProvider> for now, because this is the only part used currently..

I went for setInternalTelemetryOpenTelemetry as name, because I feel setOpenTelemetry would be confusing: It could make it look like the counterpart to SdkOpenTelemetryBuilder.setTracerProvider.

@@ -74,6 +81,19 @@ public static SdkTracerProviderBuilder builder() {
this.tracerConfigurator = tracerConfigurator;
}

private static SpanMetrics createSpanMetrics(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's follow the pattern of ExporterInstrumentation:

  • Let's have a class called SpanInstrumentation (or maybe TracerProviderInstrumentation)
  • Class should have a constructor which accepts InternalTelemetryVersion and Supplier<MeterProvider> and initializes accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 6f4b0af.

However, I kept SpanInstrumentation as an interface instead of class with a static create method for now: It doesn't contain any logic except for create, which is why I think making it a class and adding delegation like ExporterInstrumentation would be unnecessary boilerplate. It however does allow us to change to this model later if we need it without touching the consumers of SpanInstrumentation.


@Override
public synchronized void recordSpanEnd() {
if (endAlreadyReported) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mark endAlreadyReported concurrent safe? If a span is ended multiple times on different threads, then the state of these metrics could become corrupted. On the other hand, SdkSpan#endInternal is already protected by a lock, so the only thing that could cause is is concurrent calls to end a span + NonRecordSpan, which isn't protected by a lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a @GuardedBy("this"), is this what you meant? It was already concurrent safe because recordSpanEnd() is synchronized.
Also this is in fact necessary, because for NonRecordingSpan end() isn't synchronized.

@JonasKunz JonasKunz requested a review from jack-berg June 30, 2025 14:00
@JonasKunz
Copy link
Contributor Author

I'm moving this back to draft, because this implementation actally violates the spec. See this comment.

We'll fix the definition of the metrics first on that Sem-Conv PR, and then update this one after reaching consensus there.

@JonasKunz JonasKunz marked this pull request as draft June 30, 2025 14:34
@JonasKunz JonasKunz changed the title Implement span ended and live health metrics Implement span started and live health metrics Jul 3, 2025
@JonasKunz JonasKunz marked this pull request as ready for review July 3, 2025 09:38
@JonasKunz
Copy link
Contributor Author

Ready for review again, as the spec changes from this merged semconv PR have now been applied:

  • The otel.sdk.span.ended metric was replaced with otel.sdk.span.started
  • otel.sdk.span.live was changed to not be collected for non-recording spans
  • Added a new otel.span.parent.origin to otel.sdk.span.started. This allows us to derive many cool insights, such as:
    • average number of spans per trace
    • effective trace sampling rate (versus the effective span-sampling rate)
    • effective trace sampling rate for traces starting in the current service
    • effective trace sampling rate for traces with a remote parent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants