-
Notifications
You must be signed in to change notification settings - Fork 674
feat: Add healthcheck command in nerdctl #4302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a0bfd2c
to
b03c1f6
Compare
bebb32a
to
cced4f6
Compare
CI is failing on Windows |
pkg/healthcheck/executor.go
Outdated
// updateHealthStatus updates the health status based on the health check result | ||
func updateHealthStatus(ctx context.Context, container containerd.Container, hcConfig *Healthcheck, hcResult *HealthcheckResult) error { | ||
// Get current health status from health log | ||
currentHealth, err := readHealthLog(ctx, container) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reading logs to get current health seems not so robust. i think we should atleast check the possibility of having a db.
@AkihiroSuda thoughts? or having a db for it is over engineering the problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How could the DB improve the robustness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe ocihook
should run healthcheck commands periodically, and serve the health status via a Unix socket?
Then there does not need to be a static health log file nor a DB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
db would support atomicity of the writes.
in the case of oci hook is it stored in runc process memory? Not much familiar with it,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ocihook
is a process executed on events such as "createRuntime" and "postStop":
nerdctl/pkg/ocihook/ocihook.go
Lines 123 to 130 in b8c4b3d
switch event { | |
case "createRuntime": | |
return onCreateRuntime(opts) | |
case "postStop": | |
return onPostStop(opts) | |
default: | |
return fmt.Errorf("unexpected event %q", event) | |
} |
https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#lifecycle
ocihook is currently used for CNI, logging, etc.
Periodic health checker could be added here, and it could serve gRPC (or REST) over a Unix socket to provide the latest health status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
db would support atomicity of the writes.
There is an effort to consolidate atomic, concurrency-safe filesystem operations in internal/filesystem
. If needed, these can be used instead of filesystem access in almost all cases.
Generally IMHO we should avoid using the filesystem as an API and instead rely on a higher-level abstraction (eg: store.Store
for eg). Leaking implementation details into the consumer (filesystem or db) is bad API, and instead having a storage API will allow swapping out the fs implementation to something else, if need-be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I'll update the PR to store current health state and failing streak in the container labels, so we no longer need to read the log file during each health check to get current state.
Health check results are still written to a log file, and during inspect we fetch the last five entries to provide recent history. For health.json
operations, we’ll use internal/filesystem along with proper locking as recommended to ensure concurrency safety.
6b91b90
to
5fb121a
Compare
335753b
to
245c93e
Compare
LGTM. Lets fix the windows test. |
245c93e
to
de23c9f
Compare
de23c9f
to
5888af5
Compare
For the canary errors: compiler issue which the canary is running with go1.25rc. More details available in link |
5888af5
to
6dfe631
Compare
LGTM. Once the fixes are merged we can rebase and do a rebase and run the tests before merge. @AkihiroSuda can we get a +1 for this. Thank You |
6dfe631
to
7734a9c
Compare
select { | ||
case <-time.After(hc.Timeout): | ||
_ = process.Kill(ctx, syscall.SIGKILL) | ||
go func() { <-exitStatusC }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't do anything after waiting for the channel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, it doesn't do anything with the value it receives from the channel, it simply drains the channel to prevent a goroutine leak after the process is killed, allowing the goroutine that sends to the channel to complete.
@@ -41,6 +47,16 @@ func TestContainerFromNative(t *testing.T) { | |||
os.WriteFile(filepath.Join(tempStateDir, "resolv.conf"), []byte(""), 0644) | |||
defer os.RemoveAll(tempStateDir) | |||
|
|||
hc := &healthcheck.Healthcheck{ | |||
Test: []string{"CMD-SHELL", "curl -f http://localhost || exit 1"}, | |||
Interval: 30000000000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we use time.Second
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, these durations are stored as nanoseconds in container labels internally, but I'll express the flag parameters in seconds for improved readability.
Signed-off-by: Subash Kotha <subash.kotha2@gmail.com>
7734a9c
to
c5805c5
Compare
Hi @AkihiroSuda @Shubhranshu153 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
This PR adds support for Docker-compatible health checks in
nerdctl run
,nerdctl create
, and introduces a newnerdctl container healthcheck
command to manually trigger health checks.Key features:
nerdctl container healthcheck
command to manually run health checksnerdctl/healthcheck
nerdctl/healthstate
for quick accesshealth.json
file in the container’s runtime state directoryFuture work (WIP):
Related issue - #4157
cc: @Shubhranshu153 @AkihiroSuda