-
Notifications
You must be signed in to change notification settings - Fork 436
OCPBUGS-57024: use channel to signal controller shutdown #5104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
OCPBUGS-57024: use channel to signal controller shutdown #5104
Conversation
@cheesesashimi: This pull request references Jira Issue OCPBUGS-57024, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
8dd30ae
to
11aa8da
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cheesesashimi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cheesesashimi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
LGTM |
de4e63c
to
e3954aa
Compare
/retest |
Under certain circumstances, the machine-os-builder controller needs to perform cleanup operations such as deleting any ephemeral objects when it shuts down. However, the shutdown process must also complete cleanly because otherwise, the leader lease will not be released. Instead of using a time delay, this uses a channel to block the shutdown function until the controller has finished the cleanup process. It should be noted that we cannot guarantee that the controller will always complete its shutdown process under all circumstances before the kubelet forcibly stops the pod. Consequently, we also cannot guarantee that the leader lease will be released under all circumstances either. However, this puts us in a much better place than before since we no longer need a hard-coded time delay.
e3954aa
to
d460172
Compare
The idea behind shutdowndelayhandler is to check for any orphaned ephemeral build objects before shutting down the build controller. An orphaned object in this situation means that the MachineOSConfig and the MachineOSBuild they are associated with are either deleted or are pending deletion and the child objects themselves are not pending deletion. In that situation, we should delay the shutdown process until the child objects are pending deletion.
d460172
to
ab3d5bc
Compare
We tried to verify it but we found some issues When the os-builder pod is restarted a new machineosbuild resource is generated, a new image is built and it is applied again causing several restarts on the nodes. It can be reproduced consistently. It can be manually reproduced. When we enable OCL, when the mosb is finished, the job is removed, and it starts applying the image to the nodes, we can manually delete the osbuilder pod and a new machineosbuild is created. We tried to reproduce it in 4.20 without this fix, but we were not able to reproduce it without this PR. |
Verified using IPI on AWS We can see the os-builder can take the lease without problems after being evicted while applying the new osImage
We can see the os-builder pod releasing the lease when deleted
We can see that the os-builder pod correctly releases the lease when we remove the MOSC resoruce
All OCL e2e test cases have passed. The issue described in previous comments is not related to this PR, is is reported here: https://issues.redhat.com/browse/OCPBUGS-58191 We still see some weird behaviour, we still need more testing |
We can see the same weird behaviour without this fix. We can approve it. /label qe-approved |
/retest-required |
@cheesesashimi: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
- What I did
Under certain circumstances, the machine-os-builder controller needs to
perform cleanup operations such as deleting any ephemeral objects when
it shuts down. However, the shutdown process must also complete cleanly
because otherwise, the leader lease will not be released.
Instead of using a time delay, this uses a channel to block the shutdown
function until the controller has finished the cleanup process. It
should be noted that we cannot guarantee that the controller will always
complete its shutdown process under all circumstances before the kubelet
forcibly stops the pod. Consequently, we also cannot guarantee that the
leader lease will be released under all circumstances either. However,
this puts us in a much better place than before since we no longer need
a hard-coded time delay.
Additionally, I added a shutdown delay handler to the build controller which, upon receipt of a shutdown signal, interrogates all of the objects that it created. The purpose is to ensure that when a MachineOSConfig is deleted, that the deletion cascades to all of the child objects including all MachineOSBuilds, Jobs, ConfigMaps, and Secrets. It will do this multiple times for up to 10 seconds, at which point it will exit. Specifically, it is looking to ensure that all child objects are either nonexistent or that they are pending deletion; that is, they have a deletion timestamp. However, it does not wait for the deletion to actually complete.
- How to verify it
$ oc get leases -n openshift-machine-config-operator
.- Description for the changelog
Use channel instead of time delay for machine-os-builder shutdown