KEP-5307 Initial KEP for container restart exceptions #5308

Merged: 1 commit, Jun 18, 2025

Conversation

yuanwang04
Contributor

@yuanwang04 yuanwang04 commented May 16, 2025

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 16, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 16, 2025
@k8s-ci-robot
Contributor

Welcome @yuanwang04!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 16, 2025
@k8s-ci-robot
Contributor

Hi @yuanwang04. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 16, 2025
@yuanwang04 yuanwang04 force-pushed the container-restart-policy branch from 6cb2b80 to 0cbfc18 Compare May 16, 2025 22:25
@kannon92
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 18, 2025
@yuanwang04 yuanwang04 force-pushed the container-restart-policy branch from 0cbfc18 to 040ffef Compare May 19, 2025 21:45
@yuanwang04 yuanwang04 force-pushed the container-restart-policy branch 5 times, most recently from bee6cce to 67df630 Compare May 23, 2025 01:36
@mimowo
Contributor

mimowo commented May 28, 2025

@yuanwang04 thank you for the work. I like this proposal. AFAIK this approach is fully compatible with the Job's podFailurePolicy (unless I'm missing something): when the Pod has restartPolicy: Never, the Job's podFailurePolicy only analyzes pods that reach the "Failed" phase. Here, the pods avoid reaching the Failed phase. Once they do reach it, they are matched against podFailurePolicy, which may decide to recreate the entire pod.
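
To make the ordering described in this comment concrete, here is a minimal Go sketch. It assumes a single-container Pod with restartPolicy: Never, and every name in it (`Outcome`, `outcomeForExit`, the `restartable` set) is hypothetical, invented for illustration rather than taken from the KEP or from the Job controller.

```go
package main

import "fmt"

// Outcome is a hypothetical label for what happens after a container exits
// in a single-container Pod with restartPolicy: Never.
type Outcome string

const (
	RestartedInPlace Outcome = "container restarted in place; Pod never reaches Failed"
	PodSucceeded     Outcome = "Pod reaches Succeeded"
	PodFailed        Outcome = "Pod reaches Failed; Job podFailurePolicy is evaluated"
)

// outcomeForExit models the ordering from the comment: container-level restart
// rules are consulted first, and the Job's podFailurePolicy only sees the Pod
// if no rule matched and the Pod ended up in the Failed phase.
// restartable is an assumed set of exit codes that a rule maps to "restart".
func outcomeForExit(restartable map[int32]bool, exitCode int32) Outcome {
	if restartable[exitCode] {
		return RestartedInPlace
	}
	if exitCode == 0 {
		return PodSucceeded
	}
	return PodFailed
}

func main() {
	restartable := map[int32]bool{42: true} // hypothetical "retry me" exit code
	for _, code := range []int32{42, 0, 1} {
		fmt.Printf("exit %d: %s\n", code, outcomeForExit(restartable, code))
	}
}
```

The point of the sketch is only the ordering: a matched rule keeps the Pod out of the Failed phase, so the Job's podFailurePolicy has nothing to evaluate until a non-matching failure occurs.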

@yuanwang04 yuanwang04 force-pushed the container-restart-policy branch from 67df630 to 5bbe5d6 Compare May 29, 2025 22:01
@thockin
Member

thockin commented Jun 12, 2025

Capturing discussion:

What I WANT is a design that ties pod restartPolicy, container restartPolicy, and this new idea of rules together into a coherent story. If we end up in a place where restartPolicy=Never but the container restarts anyway, I am going to be grumpy.

We talked through options which add a new value to RestartPolicy, but given that RestartPolicy a) doesn't say "clients must handle unknown values" and b) overloads the meaning to include "this is a sidecar", I think we simply CAN'T add a new enum value. So I accept that I am going to be grumpy.

I think we CAN get to a place where container restartPolicy is less restricted, since any policy can be implemented in terms of rules. At least we can get rid of "this can only be set in initcontainers and only to Always".

Is it worth adding something to status to represent the result of these multiple fields, so we can encourage clients to infer less?

@yuanwang04 yuanwang04 force-pushed the container-restart-policy branch 3 times, most recently from cac3f70 to cbcb7b2 Compare June 12, 2025 22:48
Member

@thockin thockin left a comment

API LGTM overall - all my comments are minor and this doesn't have to be final API
/approve

This KEP introduces restart rules for a container so kubelet can apply
those rules on container failure. This will allow users to configure special
exit codes of the container to be treated as non-failure and restart the
container in-place even if the Pod has a restartPolicy=Never. This scenario is
Member

@haircommander we can lament our past mistakes, but we can't always fix them. I grilled these guys for hours this week and the conclusion is that none of us can find a way that is "safe enough". :(


The initial proposal supports only exit codes as the requirement for the rules.
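
As a rough illustration of what exit-code-based rules could look like when evaluated, here is a minimal Go sketch. It is not the KEP's API: the type and function names (`ExitCodeRule`, `RestartAction`, `matchRule`, `shouldRestart`) are hypothetical, and the fallback to the Pod-level policy when no rule matches is an assumption made for this example. Only the Pod-level values `Always`, `OnFailure`, and `Never` come from the existing Kubernetes API.

```go
package main

import "fmt"

// RestartAction is a hypothetical action name used only in this sketch.
type RestartAction string

const ActionRestart RestartAction = "Restart"

// ExitCodeRule is a hypothetical rule: if the container's exit code is one of
// Codes, Action applies.
type ExitCodeRule struct {
	Codes  []int32
	Action RestartAction
}

// matchRule returns the action of the first rule whose Codes contain exitCode.
func matchRule(rules []ExitCodeRule, exitCode int32) (RestartAction, bool) {
	for _, r := range rules {
		for _, c := range r.Codes {
			if c == exitCode {
				return r.Action, true
			}
		}
	}
	return "", false
}

// shouldRestart lets a matching rule decide; with no match it falls back to
// the Pod-level restartPolicy (an assumption of this sketch).
func shouldRestart(podPolicy string, rules []ExitCodeRule, exitCode int32) bool {
	if action, ok := matchRule(rules, exitCode); ok {
		return action == ActionRestart
	}
	switch podPolicy {
	case "Always":
		return true
	case "OnFailure":
		return exitCode != 0
	default: // "Never"
		return false
	}
}

func main() {
	rules := []ExitCodeRule{{Codes: []int32{42}, Action: ActionRestart}}
	fmt.Println(shouldRestart("Never", rules, 42)) // true: the rule matches
	fmt.Println(shouldRestart("Never", rules, 1))  // false: no rule, policy is Never
}
```

With these hypothetical rules, a container exiting with code 42 is restarted in place even though the Pod has restartPolicy=Never, which is the scenario the KEP summary describes.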

The proposed API is as follows:
Member

Should we consider adding something to status which is less layered?

@yuanwang04 yuanwang04 force-pushed the container-restart-policy branch from cbcb7b2 to 2c44788 Compare June 13, 2025 00:10
@wojtek-t wojtek-t self-assigned this Jun 16, 2025
@yuanwang04 yuanwang04 force-pushed the container-restart-policy branch from 2c44788 to 7aaf999 Compare June 18, 2025 09:53
@yuanwang04 yuanwang04 requested a review from wojtek-t June 18, 2025 09:54
@wojtek-t
Member

/approve PRR

@mrunalp
Contributor

mrunalp commented Jun 18, 2025

/approve
/hold
(if Dawn wants to make a pass)

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 18, 2025
@dchen1107
Member

/lgtm
/approve

I still have some concerns, but they shouldn't block us from moving forward with this alpha feature.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, mrunalp, thockin, wojtek-t, yuanwang04

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SergeyKanzhelev
Member

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 18, 2025
@k8s-ci-robot k8s-ci-robot merged commit 323c4de into kubernetes:master Jun 18, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 18, 2025
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.