fix: metrics bugs in coglet prediction server by markphelps · Pull Request #2896 · replicate/cog

markphelps · 2026-03-31T18:39:13Z

Summary

Fixes several bugs in the coglet metrics subsystem:

Reject invalid metric keys: Empty keys and keys with empty dot-path segments (e.g. "", "a.", ".b", "a..b") are now rejected with a warning log instead of causing panics or inserting garbage entries.
Fix integer increment precision loss: Metric increments now extract native i64/u64 values directly instead of converting through f64 first, which lost precision for large integer values.
Consistent metrics in state snapshots: The metrics object is now always included in build_state_snapshot() payloads, even when empty, so consumers don't need to handle its absence.
Include user metrics in error/cancel responses: The HTTP routes now include user-defined metrics (not just predict_time) in failure and cancellation responses, matching the behavior of the success path.
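The key-validation rule in the first item can be sketched as follows. This is a minimal illustration using only the standard library, not the code from `prediction.rs`: the `is_valid_metric_key` helper, the `Vec`-backed store, and this `set_metric` signature are all assumptions made for the example.

```rust
/// Returns true when `key` is non-empty and every dot-separated
/// segment of it is non-empty. (Hypothetical helper for illustration.)
fn is_valid_metric_key(key: &str) -> bool {
    !key.is_empty() && key.split('.').all(|segment| !segment.is_empty())
}

/// Hypothetical setter: logs a warning and skips the write when the key
/// is invalid, instead of panicking or storing a garbage entry.
/// Returns whether the metric was stored.
fn set_metric(metrics: &mut Vec<(String, f64)>, key: &str, value: f64) -> bool {
    if !is_valid_metric_key(key) {
        eprintln!("warning: ignoring invalid metric key {key:?}");
        return false;
    }
    metrics.push((key.to_string(), value));
    true
}
```

With this rule, `"a.b"` is accepted while `""`, `"a."`, `".b"`, and `"a..b"` are all rejected, because each of the latter produces at least one empty segment when split on `.`.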

Changes

crates/coglet/src/prediction.rs — Input validation, increment logic fix, snapshot consistency, new tests
crates/coglet/src/transport/http/routes.rs — Extract user metrics alongside predict_time; use them in all response branches
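To make the precision issue behind the increment fix concrete, here is a small sketch. The `MetricValue` enum and both functions are hypothetical stand-ins rather than the types used in coglet, the example covers only `i64` (the PR also mentions `u64`), and the saturating addition is an assumption chosen for the example.

```rust
/// Hypothetical metric value type for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MetricValue {
    Int(i64),
    Float(f64),
}

impl MetricValue {
    fn as_f64(self) -> f64 {
        match self {
            MetricValue::Int(i) => i as f64,
            MetricValue::Float(f) => f,
        }
    }
}

/// The lossy approach: both operands are converted to f64 before adding.
/// An f64 has a 53-bit significand, so integers above 2^53 cannot all be
/// represented exactly.
fn increment_via_f64(current: i64, delta: i64) -> i64 {
    (current as f64 + delta as f64) as i64
}

/// The precision-preserving approach: integer + integer stays in i64;
/// floating-point arithmetic is used only when a float is involved.
fn increment(current: MetricValue, delta: MetricValue) -> MetricValue {
    match (current, delta) {
        (MetricValue::Int(a), MetricValue::Int(b)) => MetricValue::Int(a.saturating_add(b)),
        (a, b) => MetricValue::Float(a.as_f64() + b.as_f64()),
    }
}
```

For a counter holding 2^53, adding 1 through `increment_via_f64` leaves the value unchanged, because 2^53 + 1 is not representable as an `f64`. The native integer path returns the exact result.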

ask-bonk

This PR fixes several metrics bugs in coglet: rejecting invalid metric keys, fixing integer increment precision loss, ensuring consistent metrics in state snapshots, and including user metrics in error/cancel responses.

I found one issue:

Inconsistent metrics in NotReady error response: The PredictionError::NotReady branch at line 498 doesn't include user metrics, while Failed and Cancelled branches do. For consistency with other error responses, it should include metrics too.

ask-bonk · 2026-03-31T18:41:23Z

crates/coglet/src/transport/http/routes.rs

@@ -502,26 +507,32 @@ async fn create_prediction_with_id(
                })),
            )
        }


The NotReady error response is missing user metrics, which are included in the Failed and Cancelled branches. For API consistency, this should also include the metrics object.
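One way to keep response branches consistent is to build the metrics object in a single helper that every branch calls. The sketch below assumes a hypothetical `merge_metrics` function and a flat `BTreeMap<String, f64>` representation. The real `build_metrics` in `routes.rs` is not shown in this thread and may differ. Letting `predict_time` overwrite a user metric of the same name is also an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: combine user-defined metrics with the
/// system-provided predict_time, when one is available.
fn merge_metrics(
    user_metrics: &BTreeMap<String, f64>,
    predict_time: Option<f64>,
) -> BTreeMap<String, f64> {
    let mut out = user_metrics.clone();
    if let Some(seconds) = predict_time {
        // Assumption: the system value wins over a user key of the same name.
        out.insert("predict_time".to_string(), seconds);
    }
    out
}
```

A branch with no prediction timing and no user metrics yields an empty map, so the response still carries a `metrics` object of the same shape.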

ask-bonk · 2026-03-31T18:41:29Z

Posted my review. I identified one inconsistency: the NotReady error response at lines 498-509 doesn't include user metrics, while Failed and Cancelled branches do. For API consistency, all error responses should include metrics since user_metrics is already available in that scope.


ask-bonk · 2026-04-01T15:26:30Z

@markphelps Bonk workflow was cancelled.


- Reject empty keys and empty dot-path segments in set_metric to prevent panics from invalid metric names (e.g. "", "a.", ".b", "a..b")
- Fix integer increment precision loss by extracting native i64/u64 values instead of converting through f64 first
- Always include metrics object in state snapshots for consistent payloads, even when no user metrics have been set
- Include user metrics in error and cancellation HTTP responses, not just predict_time, so callers always get the full metrics picture

ask-bonk · 2026-04-01T16:36:09Z

crates/coglet/src/transport/http/routes.rs

@@ -502,26 +507,32 @@ async fn create_prediction_with_id(
                })),
            )
        }


The NotReady error response is missing user metrics, which are included in the Failed and Cancelled branches. For API consistency, this should also include the metrics object.

Suggested change

        }
        Err(PredictionError::NotReady) => {
            let msg = PredictionError::NotReady.to_string();
            let metrics = build_metrics(&user_metrics);
            (
                StatusCode::SERVICE_UNAVAILABLE,
                Json(serde_json::json!({
                    "id": prediction_id,
                    "error": msg,
                    "logs": "",
                    "status": "failed",
                    "metrics": metrics
                })),
            )
        }

markphelps · 2026-04-07T14:35:51Z

I'm Bonk, and I've done a quick review of your PR.

This PR fixes several metrics bugs in coglet: rejecting invalid metric keys, fixing integer increment precision loss, ensuring consistent metrics in state snapshots, and including user metrics in error/cancel responses.

I found one issue:

Inconsistent metrics in NotReady error response: The PredictionError::NotReady branch at line 498 doesn't include user metrics, while Failed and Cancelled branches do. For consistency with other error responses, it should include metrics too.

I don't think we need metrics for the NotReady state.

…icate/cog into mphelps/push-phase-progress

* 'mphelps/push-phase-progress' of https://github.com/replicate/cog: (95 commits)
  feat: add metric name validation (#2911)
  Rename `cog run` to `cog exec` (#2916)
  chore(deps): bump github.com/google/go-containerregistry (#2884)
  fix: replace removed libgl1-mesa-glx in tensorflow integration test (#2914)
  ci: enforce stub freshness in CI, fix existing stub drift (#2912)
  feat: add schema-compare command to test harness (#2891)
  chore(deps): bump uuid from 1.22.0 to 1.23.0 in /crates (#2887)
  chore(deps): bump github.com/hashicorp/go-version from 1.7.0 to 1.9.0 (#2909)
  chore(deps): bump insta from 1.46.3 to 1.47.2 in /crates (#2908)
  fix: support list[X] | None inputs + integration tests for PEP 604 union File/Path coercion (#2882)
  ci: exclude Dependabot PRs from auto-code review (#2910)
  chore(deps): bump actions/checkout from 4 to 6 (#2904)
  chore(deps): bump github.com/testcontainers/testcontainers-go/modules/registry (#2886)
  fix: metrics bugs in coglet prediction server (#2896)
  Bump version to 0.17.2 (#2903)
  fix(coglet): propagate metric scope to async event loop thread (#2902)
  chore: remove unnecessary nolint directive in test (#2803)
  feat(coglet): add Sentry error reporting for infrastructure errors (#2865)
  fix: homebrew cask postflight xattr references wrong binary name (#2899)
  fix: include custom metrics in cog predict --json output (#2897)
  ...

ask-bonk bot reviewed Mar 31, 2026


markphelps marked this pull request as ready for review April 1, 2026 14:56

markphelps requested a review from a team as a code owner April 1, 2026 14:56

markphelps requested a review from michaeldwan April 1, 2026 14:56

michaeldwan force-pushed the fix/metrics-bugs branch from b63f326 to f9f9f4b Compare April 1, 2026 16:34

ask-bonk bot reviewed Apr 1, 2026


Merge branch 'main' into fix/metrics-bugs

ab3053c

ask-bonk bot reviewed Apr 7, 2026


michaeldwan approved these changes Apr 7, 2026


michaeldwan added this pull request to the merge queue Apr 7, 2026

Merged via the queue into main with commit 4e048a8 Apr 7, 2026
66 of 68 checks passed

michaeldwan deleted the fix/metrics-bugs branch April 7, 2026 16:09

michaeldwan merged 2 commits into main from fix/metrics-bugs

2 participants

@@ -502,26 +507,32 @@ async fn create_prediction_with_id(
                })),
            )
        }

-        }
+        Err(PredictionError::NotReady) => {
+            let msg = PredictionError::NotReady.to_string();
+            let metrics = build_metrics(&user_metrics);
+            (
+                StatusCode::SERVICE_UNAVAILABLE,
+                Json(serde_json::json!({
+                    "id": prediction_id,
+                    "error": msg,
+                    "logs": "",
+                    "status": "failed",
+                    "metrics": metrics
+                })),
+            )
+        }
