[release-4.16] OCPBUGS-51017: Delete CNI configuration file when SDN pod is stopped #653
Conversation
@pperiyasamy: This pull request references Jira Issue OCPBUGS-51017, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/assign @martinkennelly @dougbtv
/jira refresh
@pperiyasamy: This pull request references Jira Issue OCPBUGS-51017, which is invalid.
/jira refresh
@pperiyasamy: This pull request references Jira Issue OCPBUGS-51017, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug. Requesting review from QA contact.
/label backport-risk-assessed
@pperiyasamy: Cannot set label backport-risk-assessed: Must be member in one of these teams: [openshift-staff-engineers]
@pperiyasamy: This pull request references Jira Issue OCPBUGS-51017, which is valid. 7 validation(s) were run on this bug. Requesting review from QA contact.
Really nice approach Peri!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: dougbtv, martinkennelly, pperiyasamy. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
For ovn-k we do this via a preStop hook in the DaemonSet (in cluster-network-operator). Should we do the same thing here?
```diff
@@ -121,6 +122,10 @@ func (sdn *openShiftSDN) run(c *cobra.Command, errout io.Writer, stopCh chan str
 	<-stopCh
+	time.Sleep(500 * time.Millisecond) // gracefully shut down
+	err = sdn.deleteConfigFile()
```
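The hunk calls sdn.deleteConfigFile(), whose body isn't shown in this excerpt. A minimal sketch of what such a helper could look like, assuming it only needs to remove the well-known config path from the commit message and tolerate the file already being gone (the names and error handling here are illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"os"
)

const cniConfigPath = "/etc/cni/net.d/80-openshift-network.conf"

// deleteConfigFile removes the CNI config file that tells multus the
// sdn plugin is ready. Treating "already gone" as success keeps the
// shutdown path idempotent (e.g. if the preStop hook removed it first).
func deleteConfigFile() error {
	if err := os.Remove(cniConfigPath); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("failed to remove %s: %v", cniConfigPath, err)
	}
	return nil
}

func main() {
	if err := deleteConfigFile(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```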
should we delete before the Sleep?
Yes, I moved deleting the file before the sleep; this may help improve the scenario.
(force-pushed from 2fedca3 to 3146b18)
The openshift-sdn CNI configuration file /etc/cni/net.d/80-openshift-network.conf is written on the node when the sdn pod is first deployed and remains there until the preStop container hook removes it. This makes multus assume sdn is in the ready state for a short window during sdn pod reboot scenarios. Because of this, for a pod delete request, the CNI DELETE request fails with a connection-refused error, and entries for the pod may not get cleaned up (for example, the IP address allocated for the pod by the host-local IPAM plugin). Hence this commit cleans up the CNI config file as soon as the SIGTERM signal is received, which makes multus wait until the sdn plugin is actually ready before invoking CNI requests. Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com>
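As a self-contained illustration of the flow the commit message describes — SIGTERM arrives, the config file is removed, then a short grace sleep — here is a minimal sketch. The channel plumbing follows the ordering the review settled on (delete before sleep); it is not the actual openshift-sdn code:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

const cniConfigPath = "/etc/cni/net.d/80-openshift-network.conf"

func main() {
	// Translate SIGTERM into a stop channel, as a daemon's run loop would.
	stopCh := make(chan os.Signal, 1)
	signal.Notify(stopCh, syscall.SIGTERM)

	<-stopCh

	// Delete the CNI config file first, so multus immediately sees the
	// plugin as not-ready and queues CNI requests instead of failing them.
	if err := os.Remove(cniConfigPath); err != nil && !os.IsNotExist(err) {
		fmt.Fprintf(os.Stderr, "failed to remove %s: %v\n", cniConfigPath, err)
	}

	// Then give in-flight CNI requests a brief window to finish.
	time.Sleep(500 * time.Millisecond) // gracefully shut down
}
```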
(force-pushed from 3146b18 to 119b1d4)
@danwinship Yes, I just noticed the sdn container preStop hook also removes the conf file (https://github.com/openshift/cluster-network-operator/blob/release-4.16/bindata/network/openshift-sdn/sdn.yaml#L321), so removing the file a bit earlier (which is what this PR now does) might improve the scenario.
@pperiyasamy: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.
Um... I think that's a bug in Multus. The only examples the spec gives of not returning errors are (a) don't return an error if you find that the pod no longer exists, and (b) DHCP should maybe not return an error if it's unable to release the DHCP lease. I actually think (b) is wrong. At any rate, if a plugin is unable to do cleanup at all, it absolutely should return an error, and Multus absolutely should propagate that.
@danwinship We already discussed this scenario with the multus team; not propagating the error to cri-o is done intentionally, to comply with the CNI spec: https://issues.redhat.com/browse/OCPBUGS-51017?focusedId=26631509&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26631509. WDYT?
I commented there, and I'm going to bring it up at the CNI meeting today...
The openshift-sdn CNI configuration file /etc/cni/net.d/80-openshift-network.conf is written on the node when the sdn pod is first deployed and remains there for the lifetime of the node. This makes multus assume sdn is in the ready state even while the sdn pod is still coming up during reboot scenarios. Because of this, for a pod delete request, the CNI DELETE request fails with a connection-refused error while the sdn CNI server socket is being created, and entries for the pod may not get cleaned up (for example, the IP address allocated for the pod by the host-local IPAM plugin).
Previously (4.13) this wasn't a problem because multus propagated the CNI DELETE error to cri-o/kubelet, so failed deletes were retried; that is no longer the case in 4.14 after this multus change: https://github.com/openshift/multus-cni/blob/release-4.14/pkg/server/api/shim.go#L68-L69.
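To make the behavioral difference concrete, here is a toy sketch contrasting the two DEL-handling strategies described above. It is illustrative only, not the actual multus shim code; delegateDel stands in for the delegated CNI DELETE:

```go
package main

import (
	"errors"
	"fmt"
)

// delegateDel stands in for delegating a CNI DELETE to the real plugin
// while its CNI server socket is not up yet.
func delegateDel() error {
	return errors.New("connection refused")
}

// 4.13-style behavior: the error reaches the runtime, so kubelet/cri-o
// retries the DEL later and cleanup eventually succeeds.
func cmdDelPropagate() error {
	return delegateDel()
}

// 4.14-style behavior (per the linked shim.go change): the error is
// logged and swallowed, so the runtime considers the DEL successful and
// never retries — leaking e.g. host-local IPAM allocations.
func cmdDelSwallow() error {
	if err := delegateDel(); err != nil {
		fmt.Println("delete error ignored:", err)
	}
	return nil
}

func main() {
	fmt.Println("propagate:", cmdDelPropagate())
	fmt.Println("swallow:  ", cmdDelSwallow())
}
```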
This PR cleans up the CNI config file when the sdn pod is stopped. When multus then receives a CNI DELETE request, it waits up to 45s (https://github.com/openshift/multus-cni/blob/release-4.14/pkg/multus/multus.go#L841-L845) until the sdn plugin is actually ready and then delegates the request to the sdn plugin.
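That readiness wait can be pictured as polling for the very file this PR deletes on shutdown. A rough sketch of the idea, assuming a 45s budget as in the linked multus code; the polling loop and names are invented for illustration, not multus's actual implementation:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	readinessFile = "/etc/cni/net.d/80-openshift-network.conf"
	readyTimeout  = 45 * time.Second
	pollInterval  = time.Second
)

// waitForReadiness polls for the CNI config file. While the file is
// absent, incoming CNI requests are held instead of being delegated to
// an sdn plugin that isn't listening yet.
func waitForReadiness() error {
	deadline := time.Now().Add(readyTimeout)
	for time.Now().Before(deadline) {
		if _, err := os.Stat(readinessFile); err == nil {
			return nil // sdn wrote its config file again: safe to delegate
		}
		time.Sleep(pollInterval)
	}
	return fmt.Errorf("%s did not appear within %s", readinessFile, readyTimeout)
}

func main() {
	if err := waitForReadiness(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```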