[WIP] Wait for PLEG events along with requests to runtime #124953

Open
wants to merge 3 commits into
base: master
from

@hshiina hshiina commented May 19, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

xref: #124297

This fix makes pod workers wait for PLEG events to be received after
they request the runtime to start or stop containers.

Currently, a pod worker waits for the PLEG to update the cache in
response to a PLEG event by calling `GetNewerThan()` every time it
syncs a pod. However, waiting for a single cache update is not
sufficient unless the worker also confirms that the cache was updated
as expected. For example, after a pod worker starts a container in
`SyncPod()`, if the worker enters `SyncPod()` again before the
container-started event is delivered, it tries to create the container
again. Because a pod worker does not remember what it asked the
container runtime to do, it computes actions in `SyncPod()` solely from
the cached container statuses, in which the container has not yet
started. So, before the worker enters `SyncPod()` again, the PLEG needs
to have updated the cached statuses of all containers that were
requested to start. A single call to `GetNewerThan()` does not
guarantee that the PLEG has received all of those events, especially
with the Evented PLEG, where events are delivered independently.

This problem has actually occurred with the Evented PLEG and might also
occur with the Generic PLEG. To avoid an unexpected regression, the
behavior of pod workers is changed only when the Evented PLEG is
enabled.
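
To make the race described above concrete, here is a minimal, self-contained sketch. It is not the kubelet code: `statusCache`, `recordStarted`, `waitNewerThan`, `waitForStarted`, and `toStart` are hypothetical names, and a version counter stands in for the cache timestamp. It only illustrates why waiting for one cache update can leave a worker believing a container still needs to be started, while waiting for every requested container does not.

```go
package main

import (
	"fmt"
	"sync"
)

// statusCache is a toy stand-in for the pod status cache (hypothetical type).
type statusCache struct {
	mu      sync.Mutex
	cond    *sync.Cond
	version int             // bumped on every update; stands in for a timestamp
	running map[string]bool // container name -> observed as started
}

func newStatusCache() *statusCache {
	c := &statusCache{running: make(map[string]bool)}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// recordStarted plays the role of the PLEG delivering one "started" event.
func (c *statusCache) recordStarted(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.running[name] = true
	c.version++
	c.cond.Broadcast()
}

// waitNewerThan returns as soon as the cache has changed at least once
// since version v, regardless of which container changed.
func (c *statusCache) waitNewerThan(v int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.version <= v {
		c.cond.Wait()
	}
}

// waitForStarted returns only once every expected container is running.
func (c *statusCache) waitForStarted(expected []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for {
		missing := false
		for _, name := range expected {
			if !c.running[name] {
				missing = true
			}
		}
		if !missing {
			return
		}
		c.cond.Wait()
	}
}

// toStart mimics computing actions from cached statuses alone: the worker
// has no memory of what it already asked the runtime to do.
func (c *statusCache) toStart(spec []string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var out []string
	for _, name := range spec {
		if !c.running[name] {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	spec := []string{"app", "sidecar"}
	cache := newStatusCache()

	// Both containers were requested, but only one event has arrived so far.
	cache.recordStarted("app")
	cache.waitNewerThan(0)
	fmt.Println("after one update, would start again:", cache.toStart(spec)) // [sidecar]

	// The second event arrives independently; wait until all are observed.
	go cache.recordStarted("sidecar")
	cache.waitForStarted(spec)
	fmt.Println("after waiting for all, would start again:", cache.toStart(spec)) // []
}
```
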

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


`CreatedAt` timestamp of `ContainerEventResponse` should be passed as
nanoseconds to `time.Unix()`.

There are some cases where a pod worker is woken up without a cache
update by the PLEG, such as pod termination. The worker then gets
stuck in `cache.GetNewerThan()` until the PLEG updates the global cache
timestamp. To unblock the stuck worker as early as the Generic PLEG
would, this fix makes the Evented PLEG update the global cache
timestamp as frequently as the Generic PLEG does.
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels May 19, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hshiina
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 19, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 19, 2024
@k8s-ci-robot
Contributor

Hi @hshiina. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@pacoxu
Member

pacoxu commented May 20, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 20, 2024
@k8s-ci-robot
Contributor

@hshiina: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-verify | fb5202e | link | true | `/test pull-kubernetes-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@bart0sh bart0sh added this to WIP in SIG Node PR Triage May 20, 2024
@hshiina
Author

hshiina commented May 21, 2024

/test pull-kubernetes-e2e-kind-evented-pleg

Labels
area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

3 participants