fix: metrics bugs in coglet prediction server#2896
Conversation
There was a problem hiding this comment.
This PR fixes several metrics bugs in coglet: rejecting invalid metric keys, fixing integer increment precision loss, ensuring consistent metrics in state snapshots, and including user metrics in error/cancel responses.
I found one issue:
- Inconsistent metrics in
NotReadyerror response: ThePredictionError::NotReadybranch at line 498 doesn't include user metrics, whileFailedandCancelledbranches do. For consistency with other error responses, it should include metrics too.
| @@ -502,26 +507,32 @@ async fn create_prediction_with_id( | |||
| })), | |||
| ) | |||
| } | |||
There was a problem hiding this comment.
The NotReady error response is missing user metrics, which are included in the Failed and Cancelled branches. For API consistency, this should also include the metrics object.
|
Posted my review. I identified one inconsistency: the |
|
@markphelps Bonk workflow was cancelled. View workflow run · To retry, trigger Bonk again. |
- Reject empty keys and empty dot-path segments in set_metric to prevent panics from invalid metric names (e.g. "", "a.", ".b", "a..b") - Fix integer increment precision loss by extracting native i64/u64 values instead of converting through f64 first - Always include metrics object in state snapshots for consistent payloads, even when no user metrics have been set - Include user metrics in error and cancellation HTTP responses, not just predict_time, so callers always get the full metrics picture
b63f326 to
f9f9f4b
Compare
There was a problem hiding this comment.
This PR fixes metrics bugs in coglet: rejecting invalid metric keys, fixing integer increment precision loss, ensuring consistent metrics in state snapshots, and including user metrics in error/cancel responses.
I found one issue:
- Inconsistent metrics in
NotReadyerror response: ThePredictionError::NotReadybranch at line 498 doesn't include user metrics, whileFailedandCancelledbranches do. For consistency with other error responses, it should include metrics too.
| @@ -502,26 +507,32 @@ async fn create_prediction_with_id( | |||
| })), | |||
| ) | |||
| } | |||
There was a problem hiding this comment.
The NotReady error response is missing user metrics, which are included in the Failed and Cancelled branches. For API consistency, this should also include the metrics object.
| } | |
| Err(PredictionError::NotReady) => { | |
| let msg = PredictionError::NotReady.to_string(); | |
| let metrics = build_metrics(&user_metrics); | |
| ( | |
| StatusCode::SERVICE_UNAVAILABLE, | |
| Json(serde_json::json!({ | |
| "id": prediction_id, | |
| "error": msg, | |
| "logs": "", | |
| "status": "failed", | |
| "metrics": metrics | |
| })), | |
| ) | |
| } |
|
Posted my review. I identified one inconsistency: the |
There was a problem hiding this comment.
This PR fixes several metrics bugs in coglet: rejecting invalid metric keys, fixing integer increment precision loss, ensuring consistent metrics in state snapshots, and including user metrics in error/cancel responses.
I found one issue:
- Inconsistent metrics in
NotReadyerror response: ThePredictionError::NotReadybranch at line 498 doesn't include user metrics, whileFailedandCancelledbranches do. For consistency with other error responses, it should include metrics too.
| @@ -502,26 +507,32 @@ async fn create_prediction_with_id( | |||
| })), | |||
| ) | |||
| } | |||
There was a problem hiding this comment.
The NotReady error response is missing user metrics, which are included in the Failed and Cancelled branches. For API consistency, this should also include the metrics object.
| Err(PredictionError::NotReady) => { | |
| let msg = PredictionError::NotReady.to_string(); | |
| let metrics = build_metrics(&user_metrics); | |
| ( | |
| StatusCode::SERVICE_UNAVAILABLE, | |
| Json(serde_json::json!({ | |
| "id": prediction_id, | |
| "error": msg, | |
| "logs": "", | |
| "status": "failed", | |
| "metrics": metrics | |
| })), | |
| ) | |
| } |
|
I'm Bonk, and I've done a quick review of your PR. This PR fixes several metrics bugs in coglet: rejecting invalid metric keys, fixing integer increment precision loss, ensuring consistent metrics in state snapshots, and including user metrics in error/cancel responses. I found one issue:
|
I dont think we need metrics for |
…icate/cog into mphelps/push-phase-progress * 'mphelps/push-phase-progress' of https://github.com/replicate/cog: (95 commits) feat: add metric name validation (#2911) Rename `cog run` to `cog exec` (#2916) chore(deps): bump github.com/google/go-containerregistry (#2884) fix: replace removed libgl1-mesa-glx in tensorflow integration test (#2914) ci: enforce stub freshness in CI, fix existing stub drift (#2912) feat: add schema-compare command to test harness (#2891) chore(deps): bump uuid from 1.22.0 to 1.23.0 in /crates (#2887) chore(deps): bump github.com/hashicorp/go-version from 1.7.0 to 1.9.0 (#2909) chore(deps): bump insta from 1.46.3 to 1.47.2 in /crates (#2908) fix: support list[X] | None inputs + integration tests for PEP 604 union File/Path coercion (#2882) ci: exclude Dependabot PRs from auto-code review (#2910) chore(deps): bump actions/checkout from 4 to 6 (#2904) chore(deps): bump github.com/testcontainers/testcontainers-go/modules/registry (#2886) fix: metrics bugs in coglet prediction server (#2896) Bump version to 0.17.2 (#2903) fix(coglet): propagate metric scope to async event loop thread (#2902) chore: remove unnecessary nolint directive in test (#2803) feat(coglet): add Sentry error reporting for infrastructure errors (#2865) fix: homebrew cask postflight xattr references wrong binary name (#2899) fix: include custom metrics in cog predict --json output (#2897) ...
Summary
Fixes several bugs in the coglet metrics subsystem:
"","a.",".b","a..b") are now rejected with a warning log instead of causing panics or inserting garbage entries.i64/u64values directly instead of converting throughf64first, which lost precision for large integer values.metricsobject is now always included inbuild_state_snapshot()payloads, even when empty, so consumers don't need to handle its absence.predict_time) in failure and cancellation responses, matching the behavior of the success path.Changes
crates/coglet/src/prediction.rs— Input validation, increment logic fix, snapshot consistency, new testscrates/coglet/src/transport/http/routes.rs— Extract user metrics alongside predict_time; use them in all response branches