Skip to content

feat: Add healthcheck command in nerdctl #4302

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 20, 2025

Conversation

subashkotha
Copy link
Contributor

@subashkotha subashkotha commented May 29, 2025

This PR adds support for Docker-compatible health checks in nerdctl run, nerdctl create, and introduces a new nerdctl container healthcheck command to manually trigger health checks.

Key features:

  1. Health check configuration via CLI flags (--health-cmd, --health-interval, --health-timeout, etc.)
  2. Support for health checks defined in Docker images (HEALTHCHECK in the Dockerfile)
  3. Merging logic for health checks:
  • CLI flags override image-defined health checks
  • Fallback to image-defined health checks if CLI flags not set
  • No health check configured by default
  1. nerdctl container healthcheck command to manually run health checks
  2. Health check configuration is stored as an internal label nerdctl/healthcheck
  3. Health state (status and failing streak) is stored in internal label nerdctl/healthstate for quick access
  4. Health check results stored in a health.json file in the container’s runtime state directory
  5. Tests added for CLI flag parsing, merging behavior, execution of health checks, and health state inspection in nerdctl inspect

Future work (WIP):

  1. Automate periodic health checks using systemd.
  2. Generate systemd timer and service units based on container lifecycle events.
  3. Improve logic to store/fetch health check results.

Related issue - #4157

cc: @Shubhranshu153 @AkihiroSuda

@subashkotha subashkotha force-pushed the health_checks branch 2 times, most recently from a0bfd2c to b03c1f6 Compare May 29, 2025 17:52
@subashkotha subashkotha changed the title feat: Add healthcheck cmd, healthcheck related flags to create/run and include health config and status in inspect output feat: Add healthcheck support in nerdctl May 29, 2025
@subashkotha subashkotha force-pushed the health_checks branch 4 times, most recently from bebb32a to cced4f6 Compare May 29, 2025 21:31
@AkihiroSuda AkihiroSuda added this to the v2.1.3 milestone May 30, 2025
@AkihiroSuda
Copy link
Member

CI is failing on Windows

// updateHealthStatus updates the health status based on the health check result
func updateHealthStatus(ctx context.Context, container containerd.Container, hcConfig *Healthcheck, hcResult *HealthcheckResult) error {
// Get current health status from health log
currentHealth, err := readHealthLog(ctx, container)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reading logs to get current health seems not so robust. i think we should atleast check the possibility of having a db.
@AkihiroSuda thoughts? or having a db for it is over engineering the problem?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could the DB improve the robustness?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ocihook should run healthcheck commands periodically, and serve the health status via a Unix socket?

Then there does not need to be a static health log file nor a DB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

db would support atomicity of the writes.
in the case of oci hook is it stored in runc process memory? Not much familiar with it,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ocihook is a process executed on events such as "createRuntime" and "postStop":

switch event {
case "createRuntime":
return onCreateRuntime(opts)
case "postStop":
return onPostStop(opts)
default:
return fmt.Errorf("unexpected event %q", event)
}

https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#lifecycle

ocihook is currently used for CNI, logging, etc.
Periodic health checker could be added here, and it could serve gRPC (or REST) over a Unix socket to provide the latest health status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

db would support atomicity of the writes.

There is an effort to consolidate atomic, concurrency-safe filesystem operations in internal/filesystem. If needed, these can be used instead of filesystem access in almost all cases.

Generally IMHO we should avoid using the filesystem as an API and instead rely on a higher-level abstraction (eg: store.Store for eg). Leaking implementation details into the consumer (filesystem or db) is bad API, and instead having a storage API will allow swapping out the fs implementation to something else, if need-be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I'll update the PR to store current health state and failing streak in the container labels, so we no longer need to read the log file during each health check to get current state.

Health check results are still written to a log file, and during inspect we fetch the last five entries to provide recent history. For health.json operations, we’ll use internal/filesystem along with proper locking as recommended to ensure concurrency safety.

@subashkotha subashkotha force-pushed the health_checks branch 2 times, most recently from 6b91b90 to 5fb121a Compare June 6, 2025 22:29
@subashkotha subashkotha changed the title feat: Add healthcheck support in nerdctl feat: Add healthcheck command in nerdctl Jun 6, 2025
@subashkotha subashkotha force-pushed the health_checks branch 2 times, most recently from 335753b to 245c93e Compare June 9, 2025 17:41
@Shubhranshu153
Copy link
Contributor

LGTM. Lets fix the windows test.

@Shubhranshu153
Copy link
Contributor

For the canary errors: compiler issue which the canary is running with go1.25rc. More details available in link
google/go-licenses#312

@Shubhranshu153
Copy link
Contributor

Shubhranshu153 commented Jun 17, 2025

LGTM. Once the fixes are merged we can rebase and do a rebase and run the tests before merge.
Open PR for fixing the failing tests: #4318

@AkihiroSuda can we get a +1 for this. Thank You

select {
case <-time.After(hc.Timeout):
_ = process.Kill(ctx, syscall.SIGKILL)
go func() { <-exitStatusC }()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't do anything after waiting for the channel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it doesn't do anything with the value it receives from the channel, it simply drains the channel to prevent a goroutine leak after the process is killed, allowing the goroutine that sends to the channel to complete.

@@ -41,6 +47,16 @@ func TestContainerFromNative(t *testing.T) {
os.WriteFile(filepath.Join(tempStateDir, "resolv.conf"), []byte(""), 0644)
defer os.RemoveAll(tempStateDir)

hc := &healthcheck.Healthcheck{
Test: []string{"CMD-SHELL", "curl -f http://localhost || exit 1"},
Interval: 30000000000,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use time.Second here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, these durations are stored as nanoseconds in container labels internally, but I'll express the flag parameters in seconds for improved readability.

Signed-off-by: Subash Kotha <subash.kotha2@gmail.com>
@subashkotha
Copy link
Contributor Author

Hi @AkihiroSuda @Shubhranshu153
Can I get approval on this PR if there are no comments, thanks!

@AkihiroSuda AkihiroSuda mentioned this pull request Jun 20, 2025
Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@AkihiroSuda AkihiroSuda merged commit c46f1f4 into containerd:main Jun 20, 2025
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants