Add user-defined healthcheck by NikhilSinha1 · Pull Request #2652 · replicate/cog

NikhilSinha1 · 2026-01-24T00:25:09Z

Summary

We want to add a user-defined healthcheck function so users can tell us whether their container is healthy, since they may know more than we do about what makes their system "healthy". This PR adds hooks for the user to supply a healthcheck function, which we run on the user's behalf whenever the /health-check endpoint is hit.

Test Plan

Unit tests added to verify this works when a user's healthcheck succeeds, fails, times out, or raises an error.

meatballhat-cf · 2026-01-24T02:53:22Z

python/cog/server/worker.py

+                        timeout=HEALTHCHECK_TIMEOUT,
+                    )
+
+                if result is False or result is None:


I find it a bit confusing that None counts as unhealthy, since it's the default value if the function has a bare return or no return statement at all. WDYT about only treating result is False as failure?

Can we require a non-None response instead? [I know, Python, so we can't require it, but we could set type hints to indicate not None.] I agree with dan here. I'd encourage the response to be True for healthy, and falsy or exception = unhealthy.

Yeah, I think False-only simplifies it. But I think we should always fail if it's not a bool? Or should we always succeed?

We decided to go with whatever bool(x) returns
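For reference, a minimal sketch of what that decision implies, assuming the worker simply coerces the healthcheck's return value with bool(). The helper name is_healthy is hypothetical and is not the function from this PR.

```python
def is_healthy(result: object) -> bool:
    """Hypothetical helper: judge a healthcheck result by its truthiness."""
    return bool(result)


# Truthy values count as healthy; falsy ones (including None) count as unhealthy.
for value in (True, 1, "ok", False, None, 0, "", []):
    status = "healthy" if is_healthy(value) else "unhealthy"
    print(f"{value!r:>6} -> {status}")
```

Under this rule a healthcheck with a bare return (or no return statement) still yields None and is therefore reported as unhealthy, so returning True explicitly remains the safe convention.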

meatballhat-cf · 2026-01-24T02:53:53Z

python/cog/server/worker.py

+        except asyncio.TimeoutError:
+            done.error = True
+            done.error_detail = f"Healthcheck failed: user-defined healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds"
+            print(f"Healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds")


Where is this output intended to go?

I think it goes to the user-visible logs, right? Since it's the user-defined one, I think it's good for them to see this. Guess it should be a log.warn instead, though.
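A rough sketch of the logging variant discussed here, assuming the standard-library logging module and asyncio.wait_for. The names run_healthcheck and the logger name are hypothetical, the timeout value is illustrative, and the sketch uses Logger.warning because Logger.warn is a deprecated alias in the standard library.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, Tuple

log = logging.getLogger("healthcheck_sketch")  # hypothetical logger name
HEALTHCHECK_TIMEOUT = 0.05  # seconds; illustrative value, not the PR's setting


async def run_healthcheck(
    check: Callable[[], Awaitable[object]],
    timeout: float = HEALTHCHECK_TIMEOUT,
) -> Tuple[bool, Optional[str]]:
    """Run a user-supplied async check and return (ok, error_detail)."""
    try:
        result = await asyncio.wait_for(check(), timeout=timeout)
    except asyncio.TimeoutError:
        detail = (
            "Healthcheck failed: user-defined healthcheck "
            f"timed out after {timeout} seconds"
        )
        log.warning(detail)  # logged as a warning instead of print()
        return False, detail
    except Exception as exc:  # any error raised by the user's check
        detail = f"Healthcheck failed: {exc}"
        log.warning(detail)
        return False, detail
    if not bool(result):
        return False, "Healthcheck failed: user-defined healthcheck returned a falsy value"
    return True, None


async def slow_check() -> bool:
    await asyncio.sleep(1)  # longer than the timeout, so wait_for cancels it
    return True


ok, detail = asyncio.run(run_healthcheck(slow_check, timeout=0.01))
print(ok, detail)
```

Whether the warning ends up in the user-visible logs depends on how the server's logging is configured, which this sketch does not model.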

meatballhat-cf · 2026-01-24T02:59:39Z

python/cog/server/http.py

+            custom_health_error = healthcheck_result.error_detail
+
+            if not custom_health_ok:
+                health = Health.SETUP_FAILED


You're sure we shouldn't introduce a new value like Health.UNHEALTHY?

Oh, whoops, I didn't even notice I did this
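To illustrate the suggestion, here is a small sketch of a dedicated status value. The enum below is an illustrative subset written for this example, not the real Health enum in cog, and resolve_health is a hypothetical helper.

```python
from enum import Enum, auto
from typing import Optional


class Health(Enum):
    # Illustrative subset only; the real enum in cog has its own members.
    READY = auto()
    SETUP_FAILED = auto()
    UNHEALTHY = auto()  # the new value suggested in this thread


def resolve_health(current: Health, custom_health_ok: Optional[bool]) -> Health:
    """Downgrade a READY container to UNHEALTHY when the user check fails.

    custom_health_ok is None when no user-defined healthcheck ran.
    """
    if current is Health.READY and custom_health_ok is False:
        return Health.UNHEALTHY
    return current


print(resolve_health(Health.READY, False))
print(resolve_health(Health.SETUP_FAILED, False))
```

Keeping a separate value means a failing user healthcheck is not confused with a failed setup(), which callers may want to handle differently.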

mfainberg-cf · 2026-01-26T17:23:54Z

We'll need to land this in Rust as well. Let's add a .txtar (integration test) that skips the Rust coglet to land this change, and then we can unskip once Rust hits parity.

mfainberg-cf · 2026-01-26T18:23:38Z

@NikhilSinha1 do you want me to push some .txtar tests for this (would it help) or is it better to let you work on it to understand the format for the ITs?

NikhilSinha1 · 2026-01-26T20:09:38Z

@mfainberg-cf happy to get some help from you here to understand what this should look like

mfainberg-cf · 2026-01-26T20:33:43Z

I'll get the txtar tests pushed up today, and I'll document a couple of approaches for the Rust side in a follow-up comment. We can then either add that to this PR or do it in a follow-up.

mfainberg-cf · 2026-01-26T21:47:21Z

I realized we didn't cover the cog-dataclass implementation. And for the Rust side, I think we need to do the following:

Have the worker subprocess poll the health check regularly and push the result over IPC, then have the coglet server reference the value via the IPC message. The IPC message would be handled by the orchestrator, which would set the health-check values and take any additional action. You could invert this and make it a purely active check, but that might be harder to do than having a task on the worker side. [Really, either approach is fine.]
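The passive approach described above can be sketched roughly as follows. This is a Python stand-in only: an asyncio.Queue plays the role of the IPC channel, and HealthMessage, worker_poll, and Orchestrator are hypothetical names, not types from the Rust coglet.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HealthMessage:
    """Hypothetical shape of the IPC message carrying a health result."""
    ok: bool
    detail: Optional[str] = None


async def worker_poll(
    check: Callable[[], object],
    channel: "asyncio.Queue[HealthMessage]",
    interval: float,
    rounds: int,
) -> None:
    """Worker side: run the check periodically and push each result."""
    for _ in range(rounds):
        try:
            ok, detail = bool(check()), None
        except Exception as exc:  # an erroring check counts as unhealthy
            ok, detail = False, str(exc)
        await channel.put(HealthMessage(ok, detail))
        await asyncio.sleep(interval)


class Orchestrator:
    """Server side: cache the latest pushed value for health requests to read."""

    def __init__(self) -> None:
        self.latest: Optional[HealthMessage] = None

    async def consume(self, channel: "asyncio.Queue[HealthMessage]", count: int) -> None:
        for _ in range(count):
            self.latest = await channel.get()


async def demo() -> Optional[HealthMessage]:
    channel: "asyncio.Queue[HealthMessage]" = asyncio.Queue()
    orchestrator = Orchestrator()
    await asyncio.gather(
        worker_poll(lambda: False, channel, interval=0.0, rounds=2),
        orchestrator.consume(channel, count=2),
    )
    return orchestrator.latest


print(asyncio.run(demo()))
```

With this shape the health endpoint never calls into user code directly; it only reads the cached value, which is what distinguishes it from the active-check alternative.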

I think both the cog-dataclass and Rust implementations should be follow-on PRs, so the delta is targeted and the .txtar skip is removed for each as it's implemented.

mfainberg-cf · 2026-01-26T21:52:04Z

My review doesn't count, as I was the last pusher, but I'm signaling that this can land once we have the review in place.

Nikhil Sinha added 2 commits January 23, 2026 15:27

Allow user to define healthcheck function

98fa399

Add integration tests to verify behavior

d99b958

NikhilSinha1 requested a review from a team as a code owner January 24, 2026 00:25

python 3.8 compatability fix

2aa67cf

meatballhat-cf reviewed Jan 24, 2026

replicate deleted a comment from tempusfrangit Jan 26, 2026

Merge branch 'main' of github.com:replicate/cog into nikhil/healthcheck

7b8c24b

Nikhil Sinha added 2 commits January 26, 2026 12:09

Accept suggestions from comments

7668e97

Fix tests

fffcf47

meatballhat-cf approved these changes Jan 26, 2026

test: cover healthcheck unhealthy statuses

9a9f48b

mfainberg-cf requested a review from meatballhat-cf January 26, 2026 21:51

mfainberg-cf approved these changes Jan 26, 2026

meatballhat-cf approved these changes Jan 26, 2026

mfainberg-cf merged commit 6ded6c4 into main Jan 26, 2026
32 checks passed

mfainberg-cf deleted the nikhil/healthcheck branch January 26, 2026 22:22


3 participants
