[WIP] Wait for PLEG events along with requests to runtime #124953

hshiina · 2024-05-19T18:52:31Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This fix makes pod workers wait for PLEG events to be received after
they request the runtime to start or stop containers.

Currently, a pod worker waits for the PLEG to update the cache in
response to a PLEG event by calling GetNewerThan() on every pod sync.
However, waiting for a single cache update is not enough, because
nothing confirms that the cache was updated as expected. For example,
after a pod worker starts a container in SyncPod(), if the worker
enters syncPod() again before the container started event is
delivered, it tries to create the container again. A pod worker does
not remember what it asked the container runtime to do, so it computes
actions in SyncPod() solely from the cached container statuses, in
which the container has not started yet. Therefore, before SyncPod() is
entered again, the PLEG must have updated the cached status of every
container that was requested to start. A single call to GetNewerThan()
does not guarantee that the PLEG has received all of those events,
especially with the Evented PLEG, where events are delivered
independently.

This problem has been observed with the Evented PLEG and might also
happen with the Generic PLEG. To avoid an unexpected regression, the
behavior of pod workers is changed only when the Evented PLEG is
enabled.
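
To make the race described above concrete, here is a minimal sketch in Go. It is not the kubelet code: `cachedState`, `computeStarts`, and `allObserved` are hypothetical names invented for illustration, and the cache is reduced to a map of "has a started event been seen" flags. It only shows why a sync that looks at the cache alone can request a duplicate start, and what "wait until every requested container is observed" would check.

```go
package main

import "fmt"

// cachedState is a hypothetical stand-in for cached container statuses:
// container name -> whether the cache has observed it as started.
type cachedState map[string]bool

// computeStarts mimics a sync pass that decides what to start by looking
// only at cached statuses, with no memory of earlier requests.
func computeStarts(want []string, cache cachedState) []string {
	var starts []string
	for _, name := range want {
		if !cache[name] {
			starts = append(starts, name)
		}
	}
	return starts
}

// allObserved reports whether the cache reflects every container that was
// previously requested to start. A worker would wait until this is true.
func allObserved(requested []string, cache cachedState) bool {
	for _, name := range requested {
		if !cache[name] {
			return false
		}
	}
	return true
}

func main() {
	want := []string{"app", "sidecar"}
	cache := cachedState{}

	// First sync: nothing is cached, so both containers are started.
	requested := computeStarts(want, cache)
	fmt.Println("first sync starts:", requested)

	// Only one of the two started events has been delivered so far.
	cache["app"] = true

	// Syncing now would request "sidecar" a second time.
	fmt.Println("premature second sync starts:", computeStarts(want, cache))
	fmt.Println("safe to sync again:", allObserved(requested, cache))

	// The remaining event arrives; a new sync has nothing left to start.
	cache["sidecar"] = true
	fmt.Println("safe to sync again:", allObserved(requested, cache))
	fmt.Println("second sync starts:", computeStarts(want, cache))
}
```

In this sketch the worker keeps the list of containers it requested, which is exactly the memory the description says a pod worker currently lacks.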

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

`CreatedAt` timestamp of `ContainerEventResponse` should be passed as nanoseconds to `time.Unix()`.

In some cases, such as pod termination, a pod worker is woken up without the PLEG having updated the cache. The worker then blocks in `cache.GetNewerThan()` until the PLEG updates the global cache timestamp. To unblock such a worker as early as the Generic PLEG would, this fix makes the Evented PLEG update the global cache timestamp as frequently as the Generic PLEG does.
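
The blocking behavior can be sketched with a simplified cache. This is an assumption-laden model, not the kubelet implementation: `statusCache`, `Set`, and the field names are hypothetical, and only timestamps are tracked. It shows a caller of `GetNewerThan` that receives no per-pod update and is released only when the global timestamp advances.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// statusCache is a hypothetical, simplified stand-in for the pod cache.
// It tracks only timestamps, not container statuses.
type statusCache struct {
	mu       sync.Mutex
	cond     *sync.Cond
	modified map[string]time.Time // last update per pod
	global   time.Time            // every pod is at least this fresh
}

func newStatusCache() *statusCache {
	c := &statusCache{modified: map[string]time.Time{}}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// Set records an update for one pod and wakes any waiters.
func (c *statusCache) Set(pod string, ts time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.modified[pod] = ts
	c.cond.Broadcast()
}

// UpdateTime advances the global timestamp and wakes any waiters.
func (c *statusCache) UpdateTime(ts time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ts.After(c.global) {
		c.global = ts
	}
	c.cond.Broadcast()
}

// GetNewerThan blocks until the pod's entry or the global timestamp is
// newer than minTime, then returns the timestamp that satisfied the wait.
func (c *statusCache) GetNewerThan(pod string, minTime time.Time) time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	for {
		if ts, ok := c.modified[pod]; ok && ts.After(minTime) {
			return ts
		}
		if c.global.After(minTime) {
			return c.global
		}
		c.cond.Wait()
	}
}

func main() {
	c := newStatusCache()
	lastSync := time.Now()

	done := make(chan time.Time)
	go func() { done <- c.GetNewerThan("pod-a", lastSync) }()

	// No event ever arrives for pod-a; only the global bump releases it.
	c.UpdateTime(lastSync.Add(time.Second))
	fmt.Println("released by timestamp", (<-done).Sub(lastSync), "after last sync")
}
```

In this model, how long the worker stays blocked depends entirely on how often `UpdateTime` is called, which is why the update frequency matters.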


k8s-ci-robot · 2024-05-19T18:52:36Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hshiina
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/kubelet/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2024-05-19T18:52:39Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2024-05-19T18:52:40Z

Hi @hshiina. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

pacoxu · 2024-05-20T03:23:31Z

/ok-to-test

k8s-ci-robot · 2024-05-20T03:25:23Z

@hshiina: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-verify | `fb5202e` | link | true | `/test pull-kubernetes-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hshiina · 2024-05-21T18:10:21Z

/test pull-kubernetes-e2e-kind-evented-pleg

hshiina added 3 commits May 19, 2024 17:45

Pass event created timestamp correctly to cache

91b71b0

`CreatedAt` timestamp of `ContainerEventResponse` should be passed as nanoseconds to `time.Unix()`.

k8s-ci-robot added the area/kubelet label May 19, 2024

k8s-ci-robot requested review from mrunalp and mtaufen May 19, 2024 18:52

k8s-ci-robot added the cncf-cla: yes, sig/node, and needs-triage labels May 19, 2024

k8s-ci-robot added the needs-priority and needs-ok-to-test labels May 19, 2024

hshiina mentioned this pull request May 19, 2024

Pass event created timestamp correctly to cache #124297

Open

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label May 20, 2024

bart0sh added this to WIP in SIG Node PR Triage May 20, 2024
