Conversation
python/cog/server/worker.py
Outdated
| timeout=HEALTHCHECK_TIMEOUT, | ||
| ) | ||
|
|
||
| if result is False or result is None: |
There was a problem hiding this comment.
I find this a bit confusing that None counts as unhealthy since it's the default value if the function has a bare return or no return statement at all. WDYT about only treating result is False as failure?
There was a problem hiding this comment.
Can we require a non-None response instead? [I know, python, so no not rtequire but set typdefs to indicate not none?] - I agree with dan here. I'd encourage the response to be True for healthy, and false-y or exception = unhealthy.
There was a problem hiding this comment.
Yeah I think false only simplifies it. But I think we should fail always if it's not a bool? Or should we succeed always
There was a problem hiding this comment.
We decided to go with whatever bool(x) returns
python/cog/server/worker.py
Outdated
| except asyncio.TimeoutError: | ||
| done.error = True | ||
| done.error_detail = f"Healthcheck failed: user-defined healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds" | ||
| print(f"Healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds") |
There was a problem hiding this comment.
Where is this output intended to go?
There was a problem hiding this comment.
I think to the user visible logs right, since it's the user defined one I think it's good for them to see this. Guess it should be a log.warn instead though
python/cog/server/http.py
Outdated
| custom_health_error = healthcheck_result.error_detail | ||
|
|
||
| if not custom_health_ok: | ||
| health = Health.SETUP_FAILED |
There was a problem hiding this comment.
You're sure we shouldn't introduce a new value like Health.UNHEALTHY?
There was a problem hiding this comment.
Oh, whoops, I didn't even notice I did this
|
We'll need to land this in rust as well. Let's add a .txtar (integration test) that skips the rust coglet to land this change and then we can unskip once rust hits parity. |
|
@NikhilSinha1 do you want me to push some .txtar tests for this (would it help) or is it better to let you work on it to understand the format for the ITs? |
@mfainberg-cf happy to get some help from you here to understand what this should look like |
I'll get the txtar tests pushed up today and i'll document a couple approaches for the Rust side. In a followup comment, we can either add to this PR or in a followup then. |
|
I realized we didn't cover the cog-dataclass implementation. and for the rust side i think we need to do the following: Have the worker suprocess poll the health-check regularly and push the result over IPC and then have the coglet server reference the value via the IPC Message. The IPC Message would be handled by orchestrator and set the health-check values/take additional action. You could invert and make it a pure active check, but that might be harder to do than having a task on the workerside. [Really either approach is fine] I think both the cog-dataclass and rust implementation should be a follow-on PR, so the delta is targeted and the .txtar skip is removed for each as it's implemented. |
|
My review doesn't count as i was the last pusher. but signaling this can land once we have the review in place. |
Summary
We want to add a user-defined healthcheck function to allow users to tell us if their container is healthy or not, given they may have some more information about what determines a "healthy" system than we do. This PR adds hooks for the user to pass a
healthcheckfunction to us and we can run it on behalf of the user whenever we hit the/health-checkendpointTest Plan
Unit tests added to verify this works when a user's healthcheck succeeds, fails, times out and errors