[release-4.16] OCPBUGS-51017: Delete CNI configuration file when SDN pod is stopped #653
Conversation
@pperiyasamy: This pull request references Jira Issue OCPBUGS-51017, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/assign @martinkennelly @dougbtv
/jira refresh
@pperiyasamy: This pull request references Jira Issue OCPBUGS-51017, which is invalid.
/jira refresh
@pperiyasamy: This pull request references Jira Issue OCPBUGS-51017, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug. Requesting review from QA contact.
/label backport-risk-assessed
@pperiyasamy: Cannot set label backport-risk-assessed: Must be member in one of these teams: [openshift-staff-engineers]
@pperiyasamy: This pull request references Jira Issue OCPBUGS-51017, which is valid. 7 validation(s) were run on this bug. Requesting review from QA contact.
Really nice approach Peri!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: dougbtv, martinkennelly, pperiyasamy. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
For ovn-k we do this via a preStop hook in the DaemonSet (in cluster-network-operator). Should we do the same thing here?
```diff
@@ -121,6 +122,10 @@ func (sdn *openShiftSDN) run(c *cobra.Command, errout io.Writer, stopCh chan str
 	<-stopCh
+	time.Sleep(500 * time.Millisecond) // gracefully shut down
+	err = sdn.deleteConfigFile()
```
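The hunk calls sdn.deleteConfigFile(), whose body isn't shown in this excerpt. A minimal sketch of what such a helper could look like, assuming it only needs to remove the well-known config path from the commit message and tolerate the file already being gone (the names and error handling here are illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"os"
)

const cniConfigPath = "/etc/cni/net.d/80-openshift-network.conf"

// deleteConfigFile removes the CNI config file that tells multus the
// sdn plugin is ready. Treating "already gone" as success keeps the
// shutdown path idempotent (e.g. if the preStop hook removed it first).
func deleteConfigFile() error {
	if err := os.Remove(cniConfigPath); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("failed to remove %s: %v", cniConfigPath, err)
	}
	return nil
}

func main() {
	if err := deleteConfigFile(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```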
should we delete before the Sleep?
Yes, I moved deleting the file before the sleep; this may help improve the scenario.
(force-pushed from 2fedca3 to 3146b18)
The openshift-sdn CNI configuration file /etc/cni/net.d/80-openshift-network.conf is written on the node when the sdn pod is first deployed and remains there until the preStop container hook removes it. This makes multus assume sdn is in the ready state for a short window during sdn pod reboot scenarios. Because of this, for a pod delete request, the CNI DELETE request fails with a connection-refused error, and entries for the pod may not get cleaned up (for example, the IP address allocated for the pod by the host-local IPAM plugin). Hence this commit cleans up the CNI config file as soon as the SIGTERM signal is received, which makes multus wait until the sdn plugin is actually ready before invoking CNI requests. Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com>
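As a self-contained illustration of the flow the commit message describes — SIGTERM arrives, the config file is removed, then a short grace sleep — here is a minimal sketch. The channel plumbing follows the ordering the review settled on (delete before sleep); it is not the actual openshift-sdn code:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

const cniConfigPath = "/etc/cni/net.d/80-openshift-network.conf"

func main() {
	// Translate SIGTERM into a stop channel, as a daemon's run loop would.
	stopCh := make(chan os.Signal, 1)
	signal.Notify(stopCh, syscall.SIGTERM)

	<-stopCh

	// Delete the CNI config file first, so multus immediately sees the
	// plugin as not-ready and queues CNI requests instead of failing them.
	if err := os.Remove(cniConfigPath); err != nil && !os.IsNotExist(err) {
		fmt.Fprintf(os.Stderr, "failed to remove %s: %v\n", cniConfigPath, err)
	}

	// Then give in-flight CNI requests a brief window to finish.
	time.Sleep(500 * time.Millisecond) // gracefully shut down
}
```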
(force-pushed from 3146b18 to 119b1d4)
@danwinship Yes, I just noticed the sdn container preStop hook also removes the conf file (https://github.com/openshift/cluster-network-operator/blob/release-4.16/bindata/network/openshift-sdn/sdn.yaml#L321), so removing the file a bit earlier (which is what this PR now does) might improve the scenario.
@pperiyasamy: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.
Um... I think that's a bug in Multus. The only examples the spec gives of not returning errors are (a) don't return an error if you find that the pod no longer exists, and (b) DHCP should maybe not return an error if it's unable to release the DHCP lease. I actually think (b) is wrong. At any rate, if a plugin is unable to do cleanup at all, it absolutely should return an error, and Multus absolutely should propagate that.
@danwinship We already discussed this scenario with the multus team; not propagating the error to cri-o is done intentionally, to comply with the CNI spec: https://issues.redhat.com/browse/OCPBUGS-51017?focusedId=26631509&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26631509. WDYT?
I commented there, and I'm going to bring it up at the CNI meeting today...
The openshift-sdn CNI configuration file /etc/cni/net.d/80-openshift-network.conf is written on the node when the sdn pod is first deployed and remains there for the lifetime of the node. This makes multus assume sdn is in the ready state even while the sdn pod is still coming up during reboot scenarios. Because of this, for a pod delete request, the CNI DELETE request fails with a connection-refused error while the sdn CNI server socket is being created, and entries for the pod may not get cleaned up (for example, the IP address allocated for the pod by the host-local IPAM plugin).
Previously (4.13) this wasn't a problem because multus propagated the CNI DELETE error to cri-o/kubelet, so failed deletes were retried; that is no longer the case in 4.14 after this multus change: https://github.com/openshift/multus-cni/blob/release-4.14/pkg/server/api/shim.go#L68-L69.
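To make the behavioral difference concrete, here is a toy sketch contrasting the two DEL-handling strategies described above. It is illustrative only, not the actual multus shim code; delegateDel stands in for the delegated CNI DELETE:

```go
package main

import (
	"errors"
	"fmt"
)

// delegateDel stands in for delegating a CNI DELETE to the real plugin
// while its CNI server socket is not up yet.
func delegateDel() error {
	return errors.New("connection refused")
}

// 4.13-style behavior: the error reaches the runtime, so kubelet/cri-o
// retries the DEL later and cleanup eventually succeeds.
func cmdDelPropagate() error {
	return delegateDel()
}

// 4.14-style behavior (per the linked shim.go change): the error is
// logged and swallowed, so the runtime considers the DEL successful and
// never retries — leaking e.g. host-local IPAM allocations.
func cmdDelSwallow() error {
	if err := delegateDel(); err != nil {
		fmt.Println("delete error ignored:", err)
	}
	return nil
}

func main() {
	fmt.Println("propagate:", cmdDelPropagate())
	fmt.Println("swallow:  ", cmdDelSwallow())
}
```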
This PR cleans up the CNI config file when the sdn pod is stopped. When multus then receives a CNI DELETE request, it waits up to 45s (https://github.com/openshift/multus-cni/blob/release-4.14/pkg/multus/multus.go#L841-L845) until the sdn plugin is actually ready and then delegates the request to the sdn plugin.
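That readiness wait can be pictured as polling for the very file this PR deletes on shutdown. A rough sketch of the idea, assuming a 45s budget as in the linked multus code; the polling loop and names are invented for illustration, not multus's actual implementation:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	readinessFile = "/etc/cni/net.d/80-openshift-network.conf"
	readyTimeout  = 45 * time.Second
	pollInterval  = time.Second
)

// waitForReadiness polls for the CNI config file. While the file is
// absent, incoming CNI requests are held instead of being delegated to
// an sdn plugin that isn't listening yet.
func waitForReadiness() error {
	deadline := time.Now().Add(readyTimeout)
	for time.Now().Before(deadline) {
		if _, err := os.Stat(readinessFile); err == nil {
			return nil // sdn wrote its config file again: safe to delegate
		}
		time.Sleep(pollInterval)
	}
	return fmt.Errorf("%s did not appear within %s", readinessFile, readyTimeout)
}

func main() {
	if err := waitForReadiness(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```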