Health metrics (Part 2) #2796

Open
wants to merge 6 commits into
base: master

Nadine-H
Contributor

Part of #2736

Adding two custom metrics:

  • dstack_submit_to_provision_duration_seconds: Time between a run's submission and the provisioning of its first job
  • dstack_pending_runs_total: Total number of pending runs

We can add more metrics later, but I think these two are useful for now: they show whether runs are stuck in the SUBMITTED or PENDING state, which could indicate an issue with dstack or the underlying infrastructure.
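As a rough illustration of the two metrics listed above, here is a minimal sketch using `prometheus_client`. Only the metric names and the `Counter`/`Histogram` import come from this PR; the label names (`project_name`, `run_type`), the bucket boundaries, and the assignment of a histogram to the duration metric and a counter to the pending-runs metric are assumptions made for the example, not necessarily what the PR implements.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# A dedicated registry keeps the sketch isolated from the global default one.
registry = CollectorRegistry()

# Label names and buckets below are hypothetical, chosen for illustration.
submit_to_provision_duration = Histogram(
    "dstack_submit_to_provision_duration_seconds",
    "Time between run submission and first job provisioning",
    ["project_name", "run_type"],
    buckets=(15, 30, 60, 120, 300, 600, 1800, 3600),
    registry=registry,
)

pending_runs = Counter(
    "dstack_pending_runs_total",
    "Total number of pending runs",
    ["project_name", "run_type"],
    registry=registry,
)

# Record one run that waited 42 seconds, and one run that entered PENDING.
submit_to_provision_duration.labels(project_name="main", run_type="task").observe(42.0)
pending_runs.labels(project_name="main", run_type="task").inc()
```

With a histogram, an alert on runs stuck in SUBMITTED can be built from the bucket counts, while the counter's rate shows how often runs fall back to PENDING.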

@peterschmidt85 peterschmidt85 requested a review from un-def June 16, 2025 11:08
@@ -0,0 +1,52 @@
from prometheus_client import Counter, Histogram
Collaborator

I don't think the background package is the right place for this module. I'm not sure what the best place is; maybe server.services.prometheus? We can convert it to a package with submodules to separate the existing functions from the new RunMetrics class/instance.

cc @r4victor

Contributor Author

Makes sense. I'm not sure what naming would best distinguish the existing prometheus.py module from the new metrics.py. The main difference between them is how the metrics are collected: the former takes a pull approach, where the metrics are fetched when the /metrics endpoint is invoked, while the latter takes a push approach, where the metrics are written to the Prometheus client right when the runs are being processed. So maybe something like pull_metrics and push_metrics?
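To make the pull/push distinction concrete, here is a small sketch with `prometheus_client`. It is not the code from either module: the metric names (`example_active_runs`, `example_processed_runs_total`), the `run_counts` dictionary standing in for database state, and the `process_run` function are all hypothetical.

```python
from prometheus_client import CollectorRegistry, Counter
from prometheus_client.core import GaugeMetricFamily

# Stand-in for state that would normally be read from the database (hypothetical).
run_counts = {"main": 2}


class PullCollector:
    """Pull style: values are computed only when a scrape calls collect()."""

    def collect(self):
        family = GaugeMetricFamily(
            "example_active_runs", "Runs currently active", labels=["project_name"]
        )
        for project, count in run_counts.items():
            family.add_metric([project], count)
        yield family


pull_registry = CollectorRegistry()
pull_registry.register(PullCollector())

# Push style: the metric object is updated at the moment a run is processed.
push_registry = CollectorRegistry()
processed_runs = Counter(
    "example_processed_runs_total",
    "Runs processed",
    ["project_name"],
    registry=push_registry,
)


def process_run(project_name: str) -> None:
    processed_runs.labels(project_name=project_name).inc()


process_run("main")
process_run("main")
run_counts["main"] = 5  # state changes between scrapes

# Simulate a scrape of each registry.
pull_value = pull_registry.get_sample_value(
    "example_active_runs", {"project_name": "main"}
)
push_value = push_registry.get_sample_value(
    "example_processed_runs_total", {"project_name": "main"}
)
```

The pull collector reports whatever the state is at scrape time (5 here), while the counter reflects the events recorded during processing (2 here), regardless of when the scrape happens.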

@Nadine-H Nadine-H force-pushed the nadine/2736_add-custom-health-metrics branch from 8941494 to 7227265 Compare June 18, 2025 18:22
@Nadine-H Nadine-H requested a review from un-def June 18, 2025 18:23
2 participants