[WIP] Wait for PLEG events along with requests to runtime #124953
Conversation
`CreatedAt` timestamp of `ContainerEventResponse` should be passed as nanoseconds to `time.Unix()`.
There are some cases where a pod worker is woken up without a cache update by the PLEG, such as a pod termination. The worker then gets stuck in `cache.GetNewerThan()` until the global cache timestamp is updated by the PLEG. To unblock the stuck worker as early as the Generic PLEG would, this fix makes the Evented PLEG update the global cache as frequently as the Generic PLEG does.
This fix makes pod workers wait for PLEG events to be received after they request the runtime to start or stop containers. Currently, a pod worker waits for the PLEG to update the cache along with a PLEG event by calling `GetNewerThan()` every time it syncs a pod. However, waiting for the cache to be updated once, without confirming that it was updated as expected, does not appear sufficient. For example, after a pod worker starts a container in `SyncPod()`, if the worker enters `syncPod()` again before the container-started event is delivered, it tries to create the container again. Because a pod worker does not remember what it asked the container runtime to do, it computes actions in `SyncPod()` solely from cached container statuses, in which the container has not yet started. So, before entering `SyncPod()` again, the cached statuses of all containers that were requested to start need to be updated by the PLEG. However, calling `GetNewerThan()` once does not guarantee that all events have been received by the PLEG, especially with the Evented PLEG, where events are delivered independently. This problem actually happened with the Evented PLEG and might happen with the Generic PLEG. To avoid an unexpected regression, the behavior of pod workers is changed only when the Evented PLEG is enabled.
/ok-to-test
/test pull-kubernetes-e2e-kind-evented-pleg
What type of PR is this?
/kind bug
What this PR does / why we need it:
xref: #124297
This fix makes pod workers wait for PLEG events to be received after they request the runtime to start or stop containers.

Currently, a pod worker waits for the PLEG to update the cache along with a PLEG event by calling `GetNewerThan()` every time it syncs a pod. However, waiting for the cache to be updated once, without confirming that it was updated as expected, does not appear sufficient. For example, after a pod worker starts a container in `SyncPod()`, if the worker enters `syncPod()` again before the container-started event is delivered, it tries to create the container again. Because a pod worker does not remember what it asked the container runtime to do, it computes actions in `SyncPod()` solely from cached container statuses, in which the container has not yet started. So, before entering `SyncPod()` again, the cached statuses of all containers that were requested to start need to be updated by the PLEG. However, calling `GetNewerThan()` once does not guarantee that all events have been received by the PLEG, especially with the Evented PLEG, where events are delivered independently.

This problem actually happened with the Evented PLEG and might happen with the Generic PLEG. To avoid an unexpected regression, the behavior of pod workers is changed only when the Evented PLEG is enabled.
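The intended behavior can be sketched as follows. This is a simplified illustration of the idea only, under assumed names (`pendingWaiter`, `RecordRequest`, `ObserveEvent`, `WaitAll` are hypothetical, not kubelet API): the worker records each container it asked the runtime to start and waits until a PLEG event has been observed for every one of them before computing sync actions again.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingWaiter tracks containers for which the worker has issued a
// runtime request but not yet seen a corresponding PLEG event.
type pendingWaiter struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending map[string]bool
}

func newPendingWaiter() *pendingWaiter {
	w := &pendingWaiter{pending: map[string]bool{}}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// RecordRequest marks a container as awaiting a PLEG event, called right
// after the worker asks the runtime to start (or stop) it.
func (w *pendingWaiter) RecordRequest(containerID string) {
	w.mu.Lock()
	w.pending[containerID] = true
	w.mu.Unlock()
}

// ObserveEvent clears a container's pending mark when the PLEG delivers
// an event for it, waking any waiting worker.
func (w *pendingWaiter) ObserveEvent(containerID string) {
	w.mu.Lock()
	delete(w.pending, containerID)
	w.mu.Unlock()
	w.cond.Broadcast()
}

// WaitAll blocks until every recorded request has seen an event, so the
// worker never recomputes actions from stale cached statuses.
func (w *pendingWaiter) WaitAll() {
	w.mu.Lock()
	defer w.mu.Unlock()
	for len(w.pending) > 0 {
		w.cond.Wait()
	}
}

func main() {
	w := newPendingWaiter()
	w.RecordRequest("c1")
	w.RecordRequest("c2")
	go func() {
		// Simulate the Evented PLEG delivering events independently.
		w.ObserveEvent("c1")
		w.ObserveEvent("c2")
	}()
	w.WaitAll() // returns only after both events are observed
	fmt.Println("all requested containers have matching PLEG events")
}
```

Waiting per requested container, rather than for a single cache-timestamp bump, is what closes the gap left by one `GetNewerThan()` call when events arrive independently.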
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: