-
Notifications
You must be signed in to change notification settings - Fork 737
SOLR-17793: Create dedicated SdkMeterProviders for Solr core metrics #3402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17793: Create dedicated SdkMeterProviders for Solr core metrics #3402
Conversation
private static void configureOpenTelemetrySdk( | ||
SdkMeterProvider sdkMeterProvider, SdkTracerProvider sdkTracerProvider) { | ||
private static void configureOpenTelemetrySdk() { | ||
if (loaded) return; | ||
|
||
OpenTelemetrySdkBuilder builder = OpenTelemetrySdk.builder(); | ||
if (TRACE_ID_GEN_ENABLED) { | ||
log.info("OpenTelemetry tracer enabled with simple propagation only."); | ||
ExecutorUtil.addThreadLocalProvider(new ContextThreadLocalProvider()); | ||
builder.setPropagators(ContextPropagators.create(SimplePropagator.getInstance())); | ||
} | ||
|
||
if (sdkMeterProvider != null) builder.setMeterProvider(sdkMeterProvider); | ||
if (sdkTracerProvider != null) builder.setTracerProvider(sdkTracerProvider); | ||
|
||
GlobalOpenTelemetry.set(builder.build()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We stop using GlobalOpenTelemetry
instance as you are fixed to 1 MeterProvider for the whole process and can't be modified. So reverted OTELs initialization of its SimplePropogator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You say that but the code here continues to set GlobalOpenTelemetry 🤷
FWIW I think we should continue to use GOT but only for the container instance, if that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have been more clear here. We are not using GlobalOpenTelemetry
for metrics here anymore but traces still use it for the SimplePropogator
now which is why I am still setting the GlobalOpenTelemetry
.
Not sure what you mean using GlobalOpenTelemetry
only for the container instance? If we continue with our manage dediated SDKMeterProviders and the GlobalOpenTelemetry for metrics things might get messy. Different metric readers and exporters managed by different things. Autoconfiguration would work for only the container instance but not our instrumented sdk providers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see; nevermind my suggestion then..
@@ -115,6 +115,8 @@ public class SolrMetricManager { | |||
|
|||
private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass()); | |||
|
|||
public static final String OTEL_SCOPE_NAME = "org.apache.solr"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All metrics in Solr will have this as OTEL scope name
private final ConcurrentMap<String, MeterProviderAndReaders> meterProviderAndReaders = | ||
new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wrap a Meter Provider and its corresponding Metric Readers together. Currently the only metric reader is Prometheus but additional metric reader could be a periodic reader to push with OTLP. This is essentially our concept of dropwizard's registry
* resulting in duplicate metric names across cores which is an illegal format if not under the | ||
* same prometheus grouping. | ||
*/ | ||
private MetricSnapshots mergeSnapshots(List<MetricSnapshot> snapshots) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit annoying that it had to be done. All Solr Cores have the same metric names with just different labels, but when using dedicated MeterProviders they are considered duplicated names running into this bug I fixed a while ago because they aren't under the same grouping. So we need to merge the duplicate metrics into a single prometheus grouping so we are compliant with Prometheus exposition format.
import org.junit.BeforeClass; | ||
import org.junit.ClassRule; | ||
import org.junit.Test; | ||
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
|
||
public class TestPrometheusResponseWriter extends SolrTestCaseJ4 { | ||
public class TestPrometheusResponseWriter extends SolrCloudTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved to SolrCloudTestCase since all of this was being tested in standalone mode. All of these should test in cloud mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it matter? In general, tests should be lean, only testing at the highest level that's pertinent to the feature. Metrics don't have to do with SolrCloud.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solr defaults to Cloud mode, so felt it make sense to move to cloud mode. Lets you also get specific labels from cloud. For example, all cloud and standalone core metrics have the label coreName
but collection
or shard
is cloud specific. Lets us test for that with CloudTestCase but I get where you're coming from.
I can revert it back if you prefer. I was going to make a separate test suite like you recommended below anyways and move testCollectionDeletePrometheusOutput
that I made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solr defaults to Cloud mode
That's not relevant. Would you then suggest every test in Solr run SolrCloud? It's been difficult to keep our test times down (we used to be at ~40min); I shudder to think if every test had to run SolrCloud.
all cloud and standalone core metrics have the label coreName but collection or shard is cloud specific
Okay then I don't have a strong preference. If you do wind up separating the higher level integration test, maybe it could have check at least one core metric has the collection & shard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats understandable. I went with meeting in the middle and wrote a test to just check a single metric that the cloud labels exist and moved testCollectionDeletePrometheusOutput
into that test suite. Reverted everything else back.
} | ||
|
||
@Test | ||
public void testCollectionDeletePrometheusOutput() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this should be here as its a full high level test. Maybe a lower level test for this in MetricsHandlerTest
or just a unit test in TestSolrMetricManager
is suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest a separate test suite so that TestPrometheusResponseWriter doesn't need SolrCloud.
With dedicated SDK Meter Providers, this will not automatically work with the Java Agent as that relies on a linking with a |
@@ -849,6 +847,7 @@ private void loadInternal() { | |||
new SolrMetricsContext( | |||
metricManager, SolrMetricManager.getRegistryName(SolrInfoBean.Group.node), metricTag); | |||
|
|||
// NOCOMMIT: Do we need this for OTEL or is this specific for reporters? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not specific to otel; it's generic. At the moment, I see it's only used for async core loading, which feels needless but anyway another topic. I may put up a trivial PR.
private static void configureOpenTelemetrySdk( | ||
SdkMeterProvider sdkMeterProvider, SdkTracerProvider sdkTracerProvider) { | ||
private static void configureOpenTelemetrySdk() { | ||
if (loaded) return; | ||
|
||
OpenTelemetrySdkBuilder builder = OpenTelemetrySdk.builder(); | ||
if (TRACE_ID_GEN_ENABLED) { | ||
log.info("OpenTelemetry tracer enabled with simple propagation only."); | ||
ExecutorUtil.addThreadLocalProvider(new ContextThreadLocalProvider()); | ||
builder.setPropagators(ContextPropagators.create(SimplePropagator.getInstance())); | ||
} | ||
|
||
if (sdkMeterProvider != null) builder.setMeterProvider(sdkMeterProvider); | ||
if (sdkTracerProvider != null) builder.setTracerProvider(sdkTracerProvider); | ||
|
||
GlobalOpenTelemetry.set(builder.build()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You say that but the code here continues to set GlobalOpenTelemetry 🤷
FWIW I think we should continue to use GOT but only for the container instance, if that makes sense.
if (!meterProviderAndReaders.containsKey(name)) return null; | ||
return meterProviderAndReaders.get(name).prometheusMetricReader(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
calling contains
and then conditionally get
is a minor anti-pattern since it's a double-lookup.
solr/core/src/java/org/apache/solr/response/PrometheusResponseWriter.java
Show resolved
Hide resolved
? base.unit(counterSnapshot.getMetadata().getUnit()) | ||
: base; | ||
}); | ||
counterSnapshot.getDataPoints().forEach(counterSnapshotMap.get(metricName)::dataPoint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not clear what these lines actually accomplish. You loop stuff but appear to do do nothing with any results. disclaimer: I'm merely looking via GitHub UI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly maybe my java docs weren't clear enough. Essentially we need to return a single Snapshots
object (Not to confuse with Snapshot
object) that has all of our metrics. All these lines combine equivalent metrics names together. With multiple metric readers because of multiple meter providers, Prometheus structures the data like so between 2 cores from 2 Snapshot
Counter snapshot: core_1
- metric name “requests_total”
Datapoints - {core: “core_1”, path: “/select”}, value = 10
Counter snapshot: core_2
- metric name “requests_total”
Datapoints - {core: “core_2”, path: “/select”}, value = 20
We then need to merge these into a single Counter snapshot otherwise we output 2 Snapshots
with duplicate metric name. So we take the builder for that specific metric name and merge them into a single snapshot.
Counter snapshot: merged
- metric name “requests_total”
Datapoints - {core: “core_1”, path: “/select”}, value = 10
- {core: “core_2”, path: “/select”}, value = 20
At the very end of this method, we build all the snapshots and create the final Snapshots
object that the prometheus write can output without the duplicate TYPE and DESC information.
I tested this as well with an an internal OTLP exporting in solr to a OTEL collector and the collector seems to do some kind merging like this to avoid the duplicate information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! What I specifically was confused about was the line of code that had an appearance of being side-effect free, thus pointless. But the method reference dataPoint
adds a data point. It looks ambiguous to my fresh eyes on this API.
Having to write all this here is a bit of a PITA
import org.junit.BeforeClass; | ||
import org.junit.ClassRule; | ||
import org.junit.Test; | ||
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
|
||
public class TestPrometheusResponseWriter extends SolrTestCaseJ4 { | ||
public class TestPrometheusResponseWriter extends SolrCloudTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it matter? In general, tests should be lean, only testing at the highest level that's pertinent to the feature. Metrics don't have to do with SolrCloud.
} | ||
|
||
@Test | ||
public void testCollectionDeletePrometheusOutput() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest a separate test suite so that TestPrometheusResponseWriter doesn't need SolrCloud.
metricManager.removeRegistry(core.getCoreMetricManager().getRegistryName()); | ||
metricManager.closeMeterProvider(core.getCoreMetricManager().getRegistryName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could restore the previous name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean keep the name removeRegistry
instead of a separate? If so, sure then I'll just move the closing of the meterProviders into removeRegistry
until we completely move off. Made that change in latest commit
public Set<String> registryNames() { | ||
Set<String> set = new HashSet<>(); | ||
set.addAll(registries.keySet()); | ||
set.addAll(SharedMetricRegistries.names()); | ||
return set; | ||
} | ||
|
||
public Set<String> providerNames() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this needed? We're calling these registries...
import org.junit.Test; | ||
|
||
public class TestPrometheusResponseWriterCloud extends SolrCloudTestCase { | ||
@ClassRule public static SolrJettyTestRule solrClientTestRule = new SolrJettyTestRule(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this doing here? It's a standalone Solr instance inside a SolrCloud test!
For your awareness, I have a plan to hopefully make a TestRule for SolrCloud, so that we need not extend SolrCloudTestCase just to test SolrCloud.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mistake leaving that in. Was the result of me copy and pasting some things and modifying...
line -> { | ||
if (line.startsWith("solr_metrics_core_requests_total") | ||
&& line.contains("handler=\"/select\"")) | ||
assertTrue( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should check that at least one line reaches this point :-)
Maybe use the anyMatch
technique that you use later.
promReq.setResponseParser(new InputStreamResponseParser("prometheus")); | ||
|
||
NamedList<Object> prometheusResponse = solrClient.request(promReq); | ||
assertNotNull("null response from server", prometheusResponse); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is never needed
NamedList<Object> prometheusResponse = solrClient.request(promReq); | ||
assertNotNull("null response from server", prometheusResponse); | ||
|
||
InputStream in = (InputStream) prometheusResponse.get("stream"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these streams need to be closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a tricky thing here right now is that SolrMetricManager is holding both DropWizard & OTEL metrics. Thus only one can have the method name we want (i.e. one with registry in the name). Seems you should make the leap to OTEL completely soon. Could go most of the way by retaining the DropWizard pervasive methods like counter
, etc. and return NoopCounter
and similar until it can completely go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its mostly the tests that blow the number of changes which is why I was trying to add otel in parallel then incrementally migrating a subset of dropwizard metrics and the corresponding tests using them. Since everything goes through metric manager, if I switch it to NoopCounter then every test relying on counter will fail. I'm fine just moving over and making one large PR with the migration by returning a NoopCounter if you prefer and then moving to OTEL completely.
But I also found a while ago there is something in solr that was parsing the dropwizard output and using those metrics. It may have been the UI or something but it was a failing a test and I found something about it making the fix not simple. Don't fully remember but I'll look for it again.
I have another somewhat small PR I wanted to put after this that add in a OTLP exporter that can be enabled. Although its debatable where that code should live which is what I wanted to bring up next.
https://issues.apache.org/jira/browse/SOLR-17793
OTEL metrics are immutable once created for a meter provider. This is problematic for creation and deletion of Solr Cores keeping stale metrics. To do this we do a few things:
registries
and tie it to a dedicatedSdkMeterProvider
for each Solr Core with a registered Prometheus Metric Reader.GlobalOpenTelemetry
singleton instance for metricsSdkMeterProvider
and its metric readers with it