Skip to content

Add user-defined healthcheck#2652

Merged
mfainberg-cf merged 7 commits intomainfrom
nikhil/healthcheck
Jan 26, 2026
Merged

Add user-defined healthcheck#2652
mfainberg-cf merged 7 commits intomainfrom
nikhil/healthcheck

Conversation

@NikhilSinha1
Copy link
Copy Markdown
Contributor

Summary

We want to add a user-defined healthcheck function to allow users to tell us if their container is healthy or not, given they may have some more information about what determines a "healthy" system than we do. This PR adds hooks for the user to pass a healthcheck function to us and we can run it on behalf of the user whenever we hit the /health-check endpoint

Test Plan

Unit tests added to verify this works when a user's healthcheck succeeds, fails, times out and errors

@NikhilSinha1 NikhilSinha1 requested a review from a team as a code owner January 24, 2026 00:25
timeout=HEALTHCHECK_TIMEOUT,
)

if result is False or result is None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this a bit confusing that None counts as unhealthy since it's the default value if the function has a bare return or no return statement at all. WDYT about only treating result is False as failure?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we require a non-None response instead? [I know, python, so no not rtequire but set typdefs to indicate not none?] - I agree with dan here. I'd encourage the response to be True for healthy, and false-y or exception = unhealthy.

Copy link
Copy Markdown
Contributor Author

@NikhilSinha1 NikhilSinha1 Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think false only simplifies it. But I think we should fail always if it's not a bool? Or should we succeed always

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to go with whatever bool(x) returns

except asyncio.TimeoutError:
done.error = True
done.error_detail = f"Healthcheck failed: user-defined healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds"
print(f"Healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this output intended to go?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to the user visible logs right, since it's the user defined one I think it's good for them to see this. Guess it should be a log.warn instead though

custom_health_error = healthcheck_result.error_detail

if not custom_health_ok:
health = Health.SETUP_FAILED
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're sure we shouldn't introduce a new value like Health.UNHEALTHY?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, whoops, I didn't even notice I did this

@replicate replicate deleted a comment from tempusfrangit Jan 26, 2026
@mfainberg-cf
Copy link
Copy Markdown
Contributor

We'll need to land this in rust as well. Let's add a .txtar (integration test) that skips the rust coglet to land this change and then we can unskip once rust hits parity.

@mfainberg-cf
Copy link
Copy Markdown
Contributor

mfainberg-cf commented Jan 26, 2026

@NikhilSinha1 do you want me to push some .txtar tests for this (would it help) or is it better to let you work on it to understand the format for the ITs?

@NikhilSinha1
Copy link
Copy Markdown
Contributor Author

@NikhilSinha1 do you want me to push some .txtar tests for this (would it help) or is it better to let you work on it to understand the format for the ITs?

@mfainberg-cf happy to get some help from you here to understand what this should look like

@mfainberg-cf
Copy link
Copy Markdown
Contributor

@NikhilSinha1 do you want me to push some .txtar tests for this (would it help) or is it better to let you work on it to understand the format for the ITs?

@mfainberg-cf happy to get some help from you here to understand what this should look like

I'll get the txtar tests pushed up today and i'll document a couple approaches for the Rust side. In a followup comment, we can either add to this PR or in a followup then.

@mfainberg-cf
Copy link
Copy Markdown
Contributor

mfainberg-cf commented Jan 26, 2026

I realized we didn't cover the cog-dataclass implementation. and for the rust side i think we need to do the following:

Have the worker suprocess poll the health-check regularly and push the result over IPC and then have the coglet server reference the value via the IPC Message. The IPC Message would be handled by orchestrator and set the health-check values/take additional action. You could invert and make it a pure active check, but that might be harder to do than having a task on the workerside. [Really either approach is fine]

I think both the cog-dataclass and rust implementation should be a follow-on PR, so the delta is targeted and the .txtar skip is removed for each as it's implemented.

@mfainberg-cf
Copy link
Copy Markdown
Contributor

My review doesn't count as i was the last pusher. but signaling this can land once we have the review in place.

@mfainberg-cf mfainberg-cf merged commit 6ded6c4 into main Jan 26, 2026
32 checks passed
@mfainberg-cf mfainberg-cf deleted the nikhil/healthcheck branch January 26, 2026 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants