Skip to content

SOLR-17793: Create dedicated SdkMeterProviders for Solr core metrics #3402

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

mlbiscoc
Copy link
Contributor

https://issues.apache.org/jira/browse/SOLR-17793

OTEL metrics are immutable once created for a meter provider. This is problematic for creation and deletion of Solr Cores keeping stale metrics. To do this we do a few things:

  1. Reuse Dropwizards concept of registries and tie it to a dedicated SdkMeterProvider for each Solr Core with a registered Prometheus Metric Reader.
  2. With the use of dedicated SdkmeterProviders, we move away from using GlobalOpenTelemetry singleton instance for metrics
  3. When deleting a core, close the corresponding SdkMeterProvider and its metric readers with it

Comment on lines -81 to -95
private static void configureOpenTelemetrySdk(
SdkMeterProvider sdkMeterProvider, SdkTracerProvider sdkTracerProvider) {
private static void configureOpenTelemetrySdk() {
if (loaded) return;

OpenTelemetrySdkBuilder builder = OpenTelemetrySdk.builder();
if (TRACE_ID_GEN_ENABLED) {
log.info("OpenTelemetry tracer enabled with simple propagation only.");
ExecutorUtil.addThreadLocalProvider(new ContextThreadLocalProvider());
builder.setPropagators(ContextPropagators.create(SimplePropagator.getInstance()));
}

if (sdkMeterProvider != null) builder.setMeterProvider(sdkMeterProvider);
if (sdkTracerProvider != null) builder.setTracerProvider(sdkTracerProvider);

GlobalOpenTelemetry.set(builder.build());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We stop using GlobalOpenTelemetry instance as you are fixed to 1 MeterProvider for the whole process and can't be modified. So reverted OTELs initialization of its SimplePropogator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You say that but the code here continues to set GlobalOpenTelemetry 🤷

FWIW I think we should continue to use GOT but only for the container instance, if that makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have been more clear here. We are not using GlobalOpenTelemetry for metrics here anymore but traces still use it for the SimplePropogator now which is why I am still setting the GlobalOpenTelemetry.

Not sure what you mean using GlobalOpenTelemetry only for the container instance? If we continue with our manage dediated SDKMeterProviders and the GlobalOpenTelemetry for metrics things might get messy. Different metric readers and exporters managed by different things. Autoconfiguration would work for only the container instance but not our instrumented sdk providers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see; nevermind my suggestion then..

@@ -115,6 +115,8 @@ public class SolrMetricManager {

private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());

public static final String OTEL_SCOPE_NAME = "org.apache.solr";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All metrics in Solr will have this as OTEL scope name

Comment on lines +152 to +153
private final ConcurrentMap<String, MeterProviderAndReaders> meterProviderAndReaders =
new ConcurrentHashMap<>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wrap a Meter Provider and its corresponding Metric Readers together. Currently the only metric reader is Prometheus but additional metric reader could be a periodic reader to push with OTLP. This is essentially our concept of dropwizard's registry

* resulting in duplicate metric names across cores which is an illegal format if not under the
* same prometheus grouping.
*/
private MetricSnapshots mergeSnapshots(List<MetricSnapshot> snapshots) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit annoying that it had to be done. All Solr Cores have the same metric names with just different labels, but when using dedicated MeterProviders they are considered duplicated names running into this bug I fixed a while ago because they aren't under the same grouping. So we need to merge the duplicate metrics into a single prometheus grouping so we are compliant with Prometheus exposition format.

import org.junit.BeforeClass;
import org.junit.ClassRule;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestPrometheusResponseWriter extends SolrTestCaseJ4 {
public class TestPrometheusResponseWriter extends SolrCloudTestCase {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved to SolrCloudTestCase since all of this was being tested in standalone mode. All of these should test in cloud mode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it matter? In general, tests should be lean, only testing at the highest level that's pertinent to the feature. Metrics don't have to do with SolrCloud.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solr defaults to Cloud mode, so felt it make sense to move to cloud mode. Lets you also get specific labels from cloud. For example, all cloud and standalone core metrics have the label coreName but collection or shard is cloud specific. Lets us test for that with CloudTestCase but I get where you're coming from.

I can revert it back if you prefer. I was going to make a separate test suite like you recommended below anyways and move testCollectionDeletePrometheusOutput that I made.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solr defaults to Cloud mode

That's not relevant. Would you then suggest every test in Solr run SolrCloud? It's been difficult to keep our test times down (we used to be at ~40min); I shudder to think if every test had to run SolrCloud.

all cloud and standalone core metrics have the label coreName but collection or shard is cloud specific

Okay then I don't have a strong preference. If you do wind up separating the higher level integration test, maybe it could have check at least one core metric has the collection & shard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats understandable. I went with meeting in the middle and wrote a test to just check a single metric that the cloud labels exist and moved testCollectionDeletePrometheusOutput into that test suite. Reverted everything else back.

}

@Test
public void testCollectionDeletePrometheusOutput() throws Exception {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this should be here as its a full high level test. Maybe a lower level test for this in MetricsHandlerTest or just a unit test in TestSolrMetricManager is suffice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest a separate test suite so that TestPrometheusResponseWriter doesn't need SolrCloud.

@mlbiscoc
Copy link
Contributor Author

With dedicated SDK Meter Providers, this will not automatically work with the Java Agent as that relies on a linking with a GlobalOpenTelemetry which has a single SdkMeterProvider and so we lose auto-configuration that tracing gets and OTLP. So for OTLP push we need to manually set the SDK up. I have another change that I did not add to this PR to keep it in scope, but essentially to MetricManager we add a OtlpExporter and register a PeriodicMetricReader with the MeterProvider to push tjese metrics with OTLP.

@dsmiley dsmiley self-requested a review June 25, 2025 13:33
@@ -849,6 +847,7 @@ private void loadInternal() {
new SolrMetricsContext(
metricManager, SolrMetricManager.getRegistryName(SolrInfoBean.Group.node), metricTag);

// NOCOMMIT: Do we need this for OTEL or is this specific for reporters?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not specific to otel; it's generic. At the moment, I see it's only used for async core loading, which feels needless but anyway another topic. I may put up a trivial PR.

Comment on lines -81 to -95
private static void configureOpenTelemetrySdk(
SdkMeterProvider sdkMeterProvider, SdkTracerProvider sdkTracerProvider) {
private static void configureOpenTelemetrySdk() {
if (loaded) return;

OpenTelemetrySdkBuilder builder = OpenTelemetrySdk.builder();
if (TRACE_ID_GEN_ENABLED) {
log.info("OpenTelemetry tracer enabled with simple propagation only.");
ExecutorUtil.addThreadLocalProvider(new ContextThreadLocalProvider());
builder.setPropagators(ContextPropagators.create(SimplePropagator.getInstance()));
}

if (sdkMeterProvider != null) builder.setMeterProvider(sdkMeterProvider);
if (sdkTracerProvider != null) builder.setTracerProvider(sdkTracerProvider);

GlobalOpenTelemetry.set(builder.build());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You say that but the code here continues to set GlobalOpenTelemetry 🤷

FWIW I think we should continue to use GOT but only for the container instance, if that makes sense.

Comment on lines 1682 to 1683
if (!meterProviderAndReaders.containsKey(name)) return null;
return meterProviderAndReaders.get(name).prometheusMetricReader();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling contains and then conditionally get is a minor anti-pattern since it's a double-lookup.

? base.unit(counterSnapshot.getMetadata().getUnit())
: base;
});
counterSnapshot.getDataPoints().forEach(counterSnapshotMap.get(metricName)::dataPoint);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not clear what these lines actually accomplish. You loop stuff but appear to do do nothing with any results. disclaimer: I'm merely looking via GitHub UI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly maybe my java docs weren't clear enough. Essentially we need to return a single Snapshots object (Not to confuse with Snapshot object) that has all of our metrics. All these lines combine equivalent metrics names together. With multiple metric readers because of multiple meter providers, Prometheus structures the data like so between 2 cores from 2 Snapshot

Counter snapshot: core_1

  • metric name “requests_total”
    Datapoints
  • {core: “core_1”, path: “/select”}, value = 10

Counter snapshot: core_2

  • metric name “requests_total”
    Datapoints
  • {core: “core_2”, path: “/select”}, value = 20

We then need to merge these into a single Counter snapshot otherwise we output 2 Snapshots with duplicate metric name. So we take the builder for that specific metric name and merge them into a single snapshot.

Counter snapshot: merged

  • metric name “requests_total”
    Datapoints
  • {core: “core_1”, path: “/select”}, value = 10
  • {core: “core_2”, path: “/select”}, value = 20

At the very end of this method, we build all the snapshots and create the final Snapshots object that the prometheus write can output without the duplicate TYPE and DESC information.

I tested this as well with an an internal OTLP exporting in solr to a OTEL collector and the collector seems to do some kind merging like this to avoid the duplicate information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! What I specifically was confused about was the line of code that had an appearance of being side-effect free, thus pointless. But the method reference dataPoint adds a data point. It looks ambiguous to my fresh eyes on this API.

Having to write all this here is a bit of a PITA

import org.junit.BeforeClass;
import org.junit.ClassRule;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestPrometheusResponseWriter extends SolrTestCaseJ4 {
public class TestPrometheusResponseWriter extends SolrCloudTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it matter? In general, tests should be lean, only testing at the highest level that's pertinent to the feature. Metrics don't have to do with SolrCloud.

}

@Test
public void testCollectionDeletePrometheusOutput() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest a separate test suite so that TestPrometheusResponseWriter doesn't need SolrCloud.

metricManager.removeRegistry(core.getCoreMetricManager().getRegistryName());
metricManager.closeMeterProvider(core.getCoreMetricManager().getRegistryName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could restore the previous name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean keep the name removeRegistry instead of a separate? If so, sure then I'll just move the closing of the meterProviders into removeRegistry until we completely move off. Made that change in latest commit

public Set<String> registryNames() {
Set<String> set = new HashSet<>();
set.addAll(registries.keySet());
set.addAll(SharedMetricRegistries.names());
return set;
}

public Set<String> providerNames() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed? We're calling these registries...

import org.junit.Test;

public class TestPrometheusResponseWriterCloud extends SolrCloudTestCase {
@ClassRule public static SolrJettyTestRule solrClientTestRule = new SolrJettyTestRule();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this doing here? It's a standalone Solr instance inside a SolrCloud test!

For your awareness, I have a plan to hopefully make a TestRule for SolrCloud, so that we need not extend SolrCloudTestCase just to test SolrCloud.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mistake leaving that in. Was the result of me copy and pasting some things and modifying...

line -> {
if (line.startsWith("solr_metrics_core_requests_total")
&& line.contains("handler=\"/select\""))
assertTrue(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check that at least one line reaches this point :-)
Maybe use the anyMatch technique that you use later.

promReq.setResponseParser(new InputStreamResponseParser("prometheus"));

NamedList<Object> prometheusResponse = solrClient.request(promReq);
assertNotNull("null response from server", prometheusResponse);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is never needed

NamedList<Object> prometheusResponse = solrClient.request(promReq);
assertNotNull("null response from server", prometheusResponse);

InputStream in = (InputStream) prometheusResponse.get("stream");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these streams need to be closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a tricky thing here right now is that SolrMetricManager is holding both DropWizard & OTEL metrics. Thus only one can have the method name we want (i.e. one with registry in the name). Seems you should make the leap to OTEL completely soon. Could go most of the way by retaining the DropWizard pervasive methods like counter, etc. and return NoopCounter and similar until it can completely go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its mostly the tests that blow the number of changes which is why I was trying to add otel in parallel then incrementally migrating a subset of dropwizard metrics and the corresponding tests using them. Since everything goes through metric manager, if I switch it to NoopCounter then every test relying on counter will fail. I'm fine just moving over and making one large PR with the migration by returning a NoopCounter if you prefer and then moving to OTEL completely.

But I also found a while ago there is something in solr that was parsing the dropwizard output and using those metrics. It may have been the UI or something but it was a failing a test and I found something about it making the fix not simple. Don't fully remember but I'll look for it again.

I have another somewhat small PR I wanted to put after this that add in a OTLP exporter that can be enabled. Although its debatable where that code should live which is what I wanted to bring up next.

@mlbiscoc mlbiscoc merged commit c7a5950 into apache:feature/SOLR-17458 Jun 30, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants