Skip to content

Add more metric reporting to MSQ #18121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
Jul 4, 2025
Merged

Conversation

adarshsanjeev
Copy link
Contributor

@adarshsanjeev adarshsanjeev commented Jun 11, 2025

Adds reporting of additional metrics and dimensions to MSQ task and Dart.

Summary
Currently, MSQ reports the a few metrics. (ingest/tombstones/count, ingest/segments/count and ingest/input/bytes).

This PR adds a few new metrics:

  • query/time: Reported by controller and worker at the end of the query.
  • query/cpu/time: Reported by each worker at the end of the query.

Additionally, adds some more dimensions to the metrics emitted. Dimensions added to metrics:

  • queryId
  • sqlQueryId
  • engine: Denotes the engine used for the query, msq-dart or msq-task.
  • dartQueryId: (Dart only)
  • type: Always msq
  • dataSource
  • interval
  • duration
  • success

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Documentation Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Jun 11, 2025
@adarshsanjeev adarshsanjeev marked this pull request as ready for review June 12, 2025 05:43
@adarshsanjeev adarshsanjeev requested a review from gianm June 24, 2025 09:23
@capistrant capistrant added this to the 34.0.0 milestone Jun 30, 2025
Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few line comments. Please also include tests that verify the controller and workers are emitting the correct metrics with the correct dimensions. One way you can do that is:

  • update MSQTestControllerContext and MSQTestWorkerContext to have emitMetric write emitted metrics into some list
  • provide a mechanism for test cases to get the emitted metrics

Comment on lines 575 to 578
context.emitMetric(
"query/time",
MSQMetricUtils.createQueryMetricDimensions(datasources, intervals, success),
startTime - System.currentTimeMillis()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still feel that the context#emitMetric method is odd; it could be something like:

    MsqServiceMetricEventBuilder metricBuilder = new MsqServiceMetricEventBuilder(querySpec);

    metricBuilder.setDimension(DruidMetrics.DATASOURCE, DefaultQueryMetrics.getTableNamesAsString(datasources));
    metricBuilder.setDimension(DruidMetrics.INTERVAL, DefaultQueryMetrics.getIntervalsAsStringArray(intervals));
    metricBuilder.setDimension(DruidMetrics.DURATION, BaseQuery.calculateDuration(intervals));
    metricBuilder.setDimension(DruidMetrics.SUCCESS, success);

    metricBuilder.setMetric("query/time", startTime - System.currentTimeMillis());
    context.getEmitter().emit(metricBuilder);

probably MsqServiceMetricEventBuilder could also provide a more natural home for those static methods which could become instance methods

..or alter the signature of context#emitrMetric to context#emit(metricBuilder) ? so that the context could still
add stuff if it wants - but the caller has the control.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about the above author :D it was me...I was just checking some github features ; which involved switching between 2 users ....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, that would work too. It's kind of equivalent (the metricBuilder has a metric name, value, and dimensions in it) but bundled together more nicely. @adarshsanjeev - could you try this approach? It looks like it could be cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgyrtkirk Made these changes. The function now takes a service emitter.

There's an awkward bit in TaskControllerContext, where we have to check if the dataSource has already been provided in the metricBuilder before setting the default value, for backward compatibility. I wonder if it is worth adding yet another function of getDefaultMetricBuilder() from the context instead, so that the caller can decide what all to override instead of this weird code.

Please let me know your thoughts.

Copy link
Member

@kgyrtkirk kgyrtkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the updates! looks good
left a few notes here and there - the IT failures are unrelated

Set<String> datasourceDim = (Set<String>) queryMetricDimensions.computeIfAbsent(DruidMetrics.DATASOURCE, s -> new HashSet<>());
Set<Interval> intervalDim = (Set<Interval>) queryMetricDimensions.computeIfAbsent(DruidMetrics.INTERVAL, s -> new HashSet<>());
for (StageDefinition stageDef : queryDef.getStageDefinitions()) {
datasourceDim.addAll(MSQMetricUtils.getDatasources(stageDef));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is this a one time need - or maybe there should be queryDef.getDatasources ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I believe this is a one time need. Now that I have changed getDatasources to a member of stageDefinition, maybe this isn't required.

@cryptoe cryptoe merged commit 65950fd into apache:master Jul 4, 2025
96 of 98 checks passed
kgyrtkirk added a commit that referenced this pull request Jul 15, 2025
* concurrency issue with StubServiceEmitter - #18121 have added a few new metrics which have increased the load on it and have caused  random appearances of an issue arising from the fact it used an `ArrayList` under the hood.
* added some catches to shut down queries properly in case some unexpected exceptions occur - this could give better exceptions and reduce time to fix in the future

this should reduce the probability that `QTest` splits remain hanging
capistrant pushed a commit to capistrant/incubator-druid that referenced this pull request Jul 15, 2025
* concurrency issue with StubServiceEmitter - apache#18121 have added a few new metrics which have increased the load on it and have caused  random appearances of an issue arising from the fact it used an `ArrayList` under the hood.
* added some catches to shut down queries properly in case some unexpected exceptions occur - this could give better exceptions and reduce time to fix in the future

this should reduce the probability that `QTest` splits remain hanging
capistrant added a commit that referenced this pull request Jul 15, 2025
* concurrency issue with StubServiceEmitter - #18121 have added a few new metrics which have increased the load on it and have caused  random appearances of an issue arising from the fact it used an `ArrayList` under the hood.
* added some catches to shut down queries properly in case some unexpected exceptions occur - this could give better exceptions and reduce time to fix in the future

this should reduce the probability that `QTest` splits remain hanging

Co-authored-by: Zoltan Haindrich <kirk@rxd.hu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants