Health metrics (Part 2) #2796
src/tests/_internal/server/background/tasks/test_process_runs.py
@@ -0,0 +1,52 @@
from prometheus_client import Counter, Histogram
I don't think the `background` package is the right place for this module. I'm not sure what the best place is; maybe `server.services.prometheus`? We can convert it to a package with submodules to separate the existing functions from the new `RunMetrics` class/instance.
cc @r4victor
Makes sense. I'm not sure of the best naming to distinguish the existing `prometheus.py` module from the new `metrics.py`. The major difference between them is how the metrics are collected: the former is more of a pull approach, where the metrics are fetched when the `/metrics` endpoint is invoked, and the latter is more of a push approach, where the metrics are written to the Prometheus client right when the runs are being processed. So maybe something like `pull_metrics` and `push_metrics`?
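To make the distinction concrete, here is a minimal sketch of the two collection styles using `prometheus_client`. It is not the code from this PR: `RUN_STATUSES`, `PullStyleCollector`, `process_run`, and the `example_*` metric names are hypothetical stand-ins. Note that "push" here means updating in-process metric objects as events happen, not using the Prometheus Pushgateway; both styles are still exposed through the same scrape.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest
from prometheus_client.core import GaugeMetricFamily

# Hypothetical in-memory stand-in for the server's run storage.
RUN_STATUSES = ["pending", "running", "pending"]


class PullStyleCollector:
    """Pull style: values are computed only when the registry is scraped."""

    def collect(self):
        pending = sum(1 for status in RUN_STATUSES if status == "pending")
        yield GaugeMetricFamily(
            "example_pending_runs", "Runs currently pending", value=pending
        )


# A dedicated registry keeps the sketch isolated from the global default.
registry = CollectorRegistry()
registry.register(PullStyleCollector())

# Push style: the processing code updates the metric object as work happens.
processed_runs = Counter(
    "example_processed_runs",
    "Runs handled by the background task",
    registry=registry,
)


def process_run(status: str) -> None:
    # Real processing logic would go here; the sketch only records the event.
    processed_runs.inc()


for status in RUN_STATUSES:
    process_run(status)

# Serializing the registry is what a /metrics handler would do on each scrape.
exposition = generate_latest(registry).decode()
```

The pull-style value always reflects the state at scrape time, while the push-style counter only knows about events that the processing code explicitly recorded.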
8941494 to 7227265
Part of #2736
Adding two custom metrics:
- `dstack_submit_to_provision_duration_seconds`: Time from run submission to first job provisioning
- `dstack_pending_runs_total`: Total number of pending runs

We can add more metrics later, but I think for now these two are helpful to see whether runs are stuck in the SUBMITTED or PENDING states, which could be due to an issue with dstack or the underlying infrastructure.
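As a rough illustration of how the two metrics described above could be declared and updated with `prometheus_client`, here is a minimal sketch. Only the metric names and the `Counter`/`Histogram` import come from this PR; the label names, bucket boundaries, and the helpers `record_provisioning` and `record_pending` are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

from prometheus_client import CollectorRegistry, Counter, Histogram

# A dedicated registry keeps the sketch self-contained.
registry = CollectorRegistry()

submit_to_provision = Histogram(
    "dstack_submit_to_provision_duration_seconds",
    "Time from run submission to first job provisioning",
    ["project_name", "run_type"],  # assumed labels
    buckets=(15, 30, 60, 120, 300, 600, 1800, 3600),  # assumed boundaries
    registry=registry,
)

pending_runs = Counter(
    "dstack_pending_runs_total",
    "Total number of pending runs",
    ["project_name", "run_type"],  # assumed labels
    registry=registry,
)


def record_provisioning(
    project_name: str,
    run_type: str,
    submitted_at: datetime,
    provisioned_at: datetime,
) -> None:
    """Hypothetical helper: observe the submit-to-provision latency."""
    duration = (provisioned_at - submitted_at).total_seconds()
    submit_to_provision.labels(
        project_name=project_name, run_type=run_type
    ).observe(duration)


def record_pending(project_name: str, run_type: str) -> None:
    """Hypothetical helper: count a run that entered the pending state."""
    pending_runs.labels(project_name=project_name, run_type=run_type).inc()


# Usage: a run whose first job was provisioned 45 seconds after submission.
submitted = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
record_provisioning("main", "task", submitted, submitted + timedelta(seconds=45))
record_pending("main", "task")
```

With a histogram, a growing share of observations in the upper buckets would suggest that runs are spending longer in SUBMITTED before provisioning starts.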