
Rearrange metric enablement, so that model metric reporter can proceed properly #321

Open · wants to merge 1 commit into main from clif/fix_enablement_of_metric_labels

Conversation

@ClifHouck

Rearrange metric enablement, so that model metric reporter can proceed properly.

Addresses triton-inference-server/server#6815

The fix is to enable GPU metrics (assuming they're enabled at compile time and by the user at run time) prior to calling MetricModelReporter::Create. If GPU metrics are not enabled at that point, MetricModelReporter::GetMetricsLabels will not populate the relevant GPU labels.
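For illustration, here is a minimal sketch of the behavior described above, assuming a guard of roughly this shape in the label-population path (the names, signature, and structure below are illustrative stand-ins, not the actual Triton source):

```cpp
// Minimal, self-contained sketch: GPU labels are only added when GPU metrics
// are enabled, so building labels before GPU metrics are turned on yields a
// label set without GPU information (the symptom in server#6815).
#include <iostream>
#include <map>
#include <string>

// Hypothetical stand-in for the runtime GPU-metrics flag.
static bool gpu_metrics_enabled = false;

void GetMetricsLabels(
    std::map<std::string, std::string>* labels, const std::string& model_name,
    const std::string& gpu_uuid)
{
  (*labels)["model"] = model_name;
  // If GPU metrics are disabled at this point, the GPU label is never set.
  if (gpu_metrics_enabled) {
    (*labels)["gpu_uuid"] = gpu_uuid;
  }
}

int main()
{
  std::map<std::string, std::string> labels;
  GetMetricsLabels(&labels, "resnet50", "GPU-1234");
  std::cout << "labels before enabling GPU metrics: " << labels.size() << "\n";  // 1

  gpu_metrics_enabled = true;  // enable first, as the PR proposes
  labels.clear();
  GetMetricsLabels(&labels, "resnet50", "GPU-1234");
  std::cout << "labels after enabling GPU metrics: " << labels.size() << "\n";  // 2
  return 0;
}
```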

@ClifHouck (Author)

To elaborate on why this change fixes GPU metric labels: enabling GPU metrics before the server is initialized (around line 2396, tc::Status status = lserver->Init()) allows the metric labels to be populated with GPU information.
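As a rough sketch of the reordering (stand-in types only, not the actual main.cc or core code), the point is simply that the reporters created during Init() only see GPU metrics as enabled if that happened beforehand:

```cpp
// Simplified sketch of the ordering the PR proposes. The real code paths
// (tc::Metrics, InferenceServer::Init, MetricModelReporter::Create) are more
// involved; this only illustrates why the order matters.
#include <iostream>

struct Metrics {
  static void EnableGPUMetrics() { gpu_enabled = true; }
  static bool gpu_enabled;
};
bool Metrics::gpu_enabled = false;

struct InferenceServer {
  void Init()
  {
    // MetricModelReporter instances are created during Init(); they only
    // pick up GPU labels if GPU metrics are already enabled at this point.
    std::cout << "Init(): GPU labels "
              << (Metrics::gpu_enabled ? "populated" : "missing") << "\n";
  }
};

int main()
{
  InferenceServer lserver;

  // Before the change: lserver.Init(), then enable GPU metrics -> labels missing.
  // After the change:
  Metrics::EnableGPUMetrics();
  lserver.Init();  // prints "GPU labels populated"
  return 0;
}
```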

@ClifHouck force-pushed the clif/fix_enablement_of_metric_labels branch from f93cf3a to b0a970d on January 26, 2024 14:29
@dyastremsky (Contributor)

dyastremsky commented Feb 21, 2024

Thank you for this PR!

These changes look good to me. Adding @rmccorm4 as a reviewer as well, since he is more familiar with these files.

Once Ryan is good with these changes, we'll run them through CI and merge once everything passes.

@dyastremsky self-assigned this Feb 21, 2024
@rmccorm4 (Contributor)

rmccorm4 commented Feb 21, 2024

Hi @ClifHouck, thanks for this contribution!

While you have figured out a way to have the existing logic propagate the GPU labels to the generic per-model inference metrics, I wouldn't exactly say this is a bug at the moment.

Our per-model metrics are currently aggregated per model, even if technically under the hood they are tracked per model instance. Introducing these GPU labels for metrics other than the GPU memory/utilization metrics would start to expose the notion of per-model-instance metrics for KIND_GPU models with multiple model instances.

I think there is some drawback to adding this support as-is, because it will introduce inconsistency in how our metrics are reported and aggregated. With this change, KIND_GPU models will have per-model-instance metrics, but KIND_CPU/KIND_MODEL models will not. Similarly, it raises the question for models using multiple GPUs (currently only supported via KIND_MODEL): why aren't the GPU labels showing the multiple GPUs being used by these model instances?
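As a hypothetical illustration of that inconsistency (the metric and label names below are illustrative, not necessarily exactly what Triton emits), the same per-model counter would fan out by GPU for KIND_GPU models while staying a single aggregated series for everything else:

```
# KIND_GPU model with two instances: the counter splits by gpu_uuid
nv_inference_request_success{model="model_a",version="1",gpu_uuid="GPU-aaaa"} 40
nv_inference_request_success{model="model_a",version="1",gpu_uuid="GPU-bbbb"} 60

# KIND_CPU / KIND_MODEL model: still one aggregated series per model
nv_inference_request_success{model="model_b",version="1"} 100
```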

We have a ticket in our backlog (DLIS-3959) to increase the breakdown to per-model-instance metrics (generically for all model instances, irrespective of device type), but it hasn't been prioritized over other tasks yet. Not exposing the GPU labels for these inference metrics allows the metrics to be aggregated for consistency across all cases.


Can you elaborate more on your use case and needs, and how your proposed changes or our future changes for per-gpu or per-model-instance inference metrics would directly impact you?

Thanks,
Ryan

@rmccorm4 (Contributor) left a comment

Blocking accidental merge while we discuss the above comments.

@ClifHouck (Author)

@rmccorm4 I have to disagree that this is not a bug. Given what you have said, there are at least two bugs here:

  1. It is not possible for certain metric information to be gathered or initialized during server initialization. Clearly MetricModelReporter expected metrics to be decisively enabled or disabled by the time that InferenceServer::Init is called. I think that's a reasonable thing to expect.
  2. If MetricModelReporter shouldn't apply GPU labels to its metrics, then that code should be changed or removed.

I can add a commit to this PR which removes the gathering and application of GPU UUID information for model metrics. That way we solve both issues outlined above.

@rmccorm4 (Contributor)

rmccorm4 commented Feb 22, 2024

(1) Clearly MetricModelReporter expected metrics to be decisively enabled or disabled by the time that InferenceServer::Init is called. I think that's a reasonable thing to expect.

lserver->Init() initializes most components of the server, several of which are the components that are queried to periodically update metrics. For example, tc::Metrics::StartPollingThreadSingleton(); starts a thread that polls metrics from the PinnedMemoryManager, which is initialized along with the server. So swapping these two operations does not currently make sense without greater refactoring.
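A simplified sketch of that ordering dependency (stand-in types, not the actual Triton code): the polling path queries a component that only exists after Init(), so simply moving metrics enablement and polling ahead of Init() would leave it with nothing to query.

```cpp
// Illustration only: the metrics polling thread queries components such as
// the pinned-memory manager, which the server creates during Init(). Starting
// the poller ahead of Init() without further refactoring would leave it with
// nothing to poll.
#include <cstddef>
#include <iostream>
#include <memory>

struct PinnedMemoryManager {
  std::size_t UsedBytes() const { return 0; }
};

static std::unique_ptr<PinnedMemoryManager> pinned_manager;  // created by Init()

void PollMetricsOnce()
{
  if (!pinned_manager) {
    std::cout << "polling before Init(): nothing to query\n";
    return;
  }
  std::cout << "pinned memory used: " << pinned_manager->UsedBytes() << "\n";
}

void ServerInit() { pinned_manager = std::make_unique<PinnedMemoryManager>(); }

int main()
{
  PollMetricsOnce();  // before Init(): manager missing
  ServerInit();
  PollMetricsOnce();  // after Init(): manager available
  return 0;
}
```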

I agree that this flow may be a bit unintuitive currently, since the MetricModelReporters are initialized along with the models and the model_repository_manager. In fact, if you were to use --model-control-mode explicit and dynamically load a model with KIND_GPU after the server has started up, then the GPU labels will actually get populated for these per-model metrics.

I agree this should be resolved one way or the other for consistency, but it's something we should take care in changing, and we need to balance it against our current list of priorities. If this behavior is having a significant impact on some workflow or use case, please do let us know. Otherwise, I think this is something for us to revisit when we have the bandwidth to do so.

(2) If MetricModelReporter shouldn't apply GPU labels to its metrics, then that code should be changed or removed.

I agree that this code should probably be commented out with a note that it could be re-applied if per-model-instance metrics are exposed.
