Health metrics (Part 2) #2796

Nadine-H · 2025-06-12T16:24:06Z

Part of #2736

Adding two custom metrics:

dstack_submit_to_provision_duration_seconds: Time from when a run is submitted to when its first job is provisioned
dstack_pending_runs_total: Total number of pending runs

We can add more metrics later, but for now I think these two are helpful for spotting runs stuck in the SUBMITTED or PENDING state, which could indicate an issue with dstack or the underlying infrastructure.
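For illustration, here is a minimal sketch of how the two metrics could be declared and updated with prometheus_client. Only the two metric names come from this PR; the RunMetrics method names, the bucket boundaries, and the use of a dedicated registry are assumptions made for the example, not the actual implementation.

```python
from datetime import datetime, timedelta, timezone

from prometheus_client import CollectorRegistry, Counter, Histogram


class RunMetrics:
    """Hypothetical wrapper around the two metrics (method names are assumed)."""

    def __init__(self, registry: CollectorRegistry) -> None:
        # Bucket boundaries (in seconds) are illustrative, not taken from the PR.
        self._submit_to_provision = Histogram(
            "dstack_submit_to_provision_duration_seconds",
            "Time from when a run is submitted to when its first job is provisioned",
            buckets=(15, 30, 60, 120, 300, 600, 1800, 3600),
            registry=registry,
        )
        self._pending_runs = Counter(
            "dstack_pending_runs_total",
            "Total number of pending runs",
            registry=registry,
        )

    def log_submit_to_provision_duration(
        self, submitted_at: datetime, provisioned_at: datetime
    ) -> None:
        # Record one observation, in seconds, for a run whose first job was provisioned.
        self._submit_to_provision.observe((provisioned_at - submitted_at).total_seconds())

    def increment_pending_runs(self) -> None:
        # Count one more run seen in the pending state.
        self._pending_runs.inc()


# Usage: a separate registry keeps the example isolated from the global default.
registry = CollectorRegistry()
run_metrics = RunMetrics(registry)
submitted_at = datetime(2025, 6, 12, 16, 0, 0, tzinfo=timezone.utc)
run_metrics.log_submit_to_provision_duration(submitted_at, submitted_at + timedelta(seconds=90))
run_metrics.increment_pending_runs()
```

A long tail in the histogram would point at runs lingering in SUBMITTED, while a steadily growing counter would point at runs repeatedly landing in PENDING.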

src/dstack/_internal/server/background/tasks/process_runs.py

src/tests/_internal/server/background/tasks/test_process_runs.py

un-def · 2025-06-17T16:21:53Z

src/dstack/_internal/server/background/metrics.py

@@ -0,0 +1,52 @@
+from prometheus_client import Counter, Histogram


I don't think the background package is the right place for this module. I'm not sure what the best place is; maybe server.services.prometheus? We can convert it to a package with submodules to separate the existing functions from the new RunMetrics class/instance.

cc @r4victor

Makes sense. I'm not sure what naming would best distinguish the existing prometheus.py module from the new metrics.py. The main difference between them is how the metrics are collected: the former takes a pull approach, where the metrics are fetched when the /metrics endpoint is invoked, while the latter takes a push approach, where the metrics are written to the Prometheus client as the runs are being processed. So maybe something like pull_metrics and push_metrics?
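To make the pull/push distinction concrete, here is a library-agnostic sketch in plain Python. All names in it (PushedCounter, MetricsEndpoint, the example_* metric names) are hypothetical and exist only for this illustration; it does not mirror the code in either module.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PushedCounter:
    """Push style: the code that processes an event updates the value right away."""

    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount


@dataclass
class MetricsEndpoint:
    """Stand-in for a /metrics handler that serves both kinds of metrics."""

    pushed: Dict[str, PushedCounter] = field(default_factory=dict)
    pulled: Dict[str, Callable[[], float]] = field(default_factory=dict)

    def render(self) -> str:
        lines: List[str] = []
        # Pushed metrics were already updated elsewhere; just read the stored value.
        for name, counter in self.pushed.items():
            lines.append(f"{name} {counter.value}")
        # Pulled metrics are computed only now, when the endpoint is invoked.
        for name, collect in self.pulled.items():
            lines.append(f"{name} {collect()}")
        return "\n".join(lines)


# Usage: one metric of each kind.
queued_runs = ["run-1", "run-2"]
endpoint = MetricsEndpoint()
endpoint.pushed["example_processed_runs_total"] = PushedCounter()
endpoint.pulled["example_queued_runs"] = lambda: float(len(queued_runs))

# Push: called from the (hypothetical) run-processing loop at event time.
endpoint.pushed["example_processed_runs_total"].inc()

# Pull: the queue is inspected only when the endpoint is rendered.
print(endpoint.render())
```

The trade-off is roughly that pulled values always reflect the state at scrape time but cost a query per scrape, whereas pushed values are cheap to serve but only capture what the processing code chose to record.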

Nadine-H added 5 commits June 12, 2025 12:19

Add basic http metrics

6983f2f

Implement custom http metrics

97aac8f

Update docs

ecc2ad7

Add custom health metrics

36300d4

Add prometheus wrapper class

f197327

peterschmidt85 requested a review from un-def June 16, 2025 11:08

un-def reviewed Jun 17, 2025

Apply PR comments

7227265

Nadine-H force-pushed the nadine/2736_add-custom-health-metrics branch from 8941494 to 7227265 Compare June 18, 2025 18:22

Nadine-H requested a review from un-def June 18, 2025 18:23

Nadine-H Jun 18, 2025
