Add Prometheus metrics for the number of jobs completed by each agent #477
Labels
C-new-feature
Category: a new feature to implement
E-mentor
Call for participation: this issue has instructions how to fix it
E-needs-help
Call for participation: we need help for this issue
Before distributed experiments were merged, it was possible to tell whether an agent was actually doing work by watching the number of completed jobs for the experiment it was working on and checking that it increased over time.
Now that distributed experiments are implemented, that no longer works: if two agents are working on the same distributed experiment and only one of them is actually completing jobs, we have no way of knowing.
The Rust infrastructure team is using Prometheus for our monitoring and alerting, and if we expose a counter for each agent we'll be able to gather enough data.
The metrics would be implemented on the server at a `/metrics` endpoint (the standard path for Prometheus) and expose a `crater_completed_jobs` metric with the labels `agent` and `experiment` (holding the agent name and the experiment name, respectively). The counter should only increase over time and doesn't need to be persisted (Prometheus handles counters resetting after a restart just fine).

By the way, on docs.rs I used the Prometheus library to implement the changes. It'd be nice to use the same library across services.
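To make the expected output concrete, here is a rough, std-only sketch of what the `/metrics` response body could look like. It is not the proposed implementation: the real change would presumably use the Prometheus library as on docs.rs, and the `CompletedJobs` type, its `record` and `render` methods, the `escape` helper, and the HELP text are all hypothetical names made up for illustration. Only the metric name and the two labels come from this issue.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Hypothetical in-memory stand-in for a labelled Prometheus counter.
/// Nothing is persisted: the counts start from zero on every server restart.
#[derive(Default)]
pub struct CompletedJobs {
    // (agent, experiment) -> jobs completed since the server started
    counts: BTreeMap<(String, String), u64>,
}

impl CompletedJobs {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record one completed job. The value only ever increases.
    pub fn record(&mut self, agent: &str, experiment: &str) {
        let key = (agent.to_owned(), experiment.to_owned());
        *self.counts.entry(key).or_insert(0) += 1;
    }

    /// Render every counter in the Prometheus text exposition format.
    pub fn render(&self) -> String {
        let mut out = String::new();
        out.push_str("# HELP crater_completed_jobs Jobs completed by each agent.\n");
        out.push_str("# TYPE crater_completed_jobs counter\n");
        for ((agent, experiment), count) in &self.counts {
            writeln!(
                out,
                "crater_completed_jobs{{agent=\"{}\",experiment=\"{}\"}} {}",
                escape(agent),
                escape(experiment),
                count
            )
            .expect("writing to a String cannot fail");
        }
        out
    }
}

/// Escape a label value: backslash, double quote and newline must be escaped.
fn escape(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}
```

With one time series per agent and experiment pair, an agent that has stopped completing jobs shows up as a series whose value stays flat while the others keep growing.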