
Pods getting stuck in "Terminating" state on deletion after the cluster is upgraded to k8s v1.24 #38270

Closed
rishabhmsra opened this issue Jul 14, 2022 · 6 comments
Labels: kind/bug-qa · QA/XS · team/hostbusters
Milestone: v2.6.7

@rishabhmsra
Contributor

Rancher Server Setup

  • Rancher version: v2.6.6 -> v2.6-head(2c21373)
  • Installation option (Docker install/Helm Chart): Docker
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: v1.23.7-rancher1-1 -> v1.24.2-rancher1-1
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): EC2 custom cluster with 1 cp, 1 etcd, 2 worker nodes

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) Admin
    • If custom, define the set of permissions:

Describe the bug

  • After upgrading Rancher from v2.6.6 to v2.6-head (2c21373) and then upgrading the downstream cluster from v1.23 to v1.24, pods deployed before the Rancher upgrade get stuck in the Terminating state on deletion with the error below (a sketch for inspecting a stuck pod follows this list):
error killing pod: failed to "KillPodSandbox" for "3b4c7244-42ad-469d-960b-59ebb51a6563" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"ngnixtest-7769bd65df-7g7rn_default\" network: could not retrieve port mappings: key is not found"
  • Note: During the v1.24 KDM upgrade checks, this issue occurred on all downstream clusters regardless of network provider (canal, flannel, weave, calico).
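The state and error can be confirmed from kubectl and from the node itself; a minimal sketch (the pod name is the one from the error above and is only illustrative):

```bash
# Inspect a pod stuck in Terminating (pod name taken from the error above; adjust to your environment).
kubectl get pods -n default                                   # stuck pods show STATUS "Terminating"
kubectl describe pod ngnixtest-7769bd65df-7g7rn -n default    # the KillPodSandbox error shows up under Events
kubectl get events -n default --field-selector involvedObject.name=ngnixtest-7769bd65df-7g7rn
# RKE1 runs the kubelet as a Docker container named "kubelet", so the same CNI teardown
# error is also visible in its logs on the node hosting the pod:
docker logs kubelet 2>&1 | grep -i KillPodSandbox
```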

To Reproduce

  • Deploy a rancher server v2.6.6
  • Create a downstream custom cluster (1 cp, 1 etcd, 2 worker) using k8s v1.23.7-rancher1-1
  • Create a Deployment in the default namespace using nginx, expose it with a ClusterIP Service on port 80, and create an Ingress pointing to it.
  • Scale the nginx deployment to 2:

[screenshot: nginx-scale]

  • Upgrade the Rancher server from v2.6.6 to v2.6-head (2c21373)
  • Upgrade the downstream cluster from v1.23.7-rancher1-1 to v1.24.2-rancher1-1
  • Now scale the nginx deployment up, e.g. to 5 replicas.
  • From the nginx Deployment, delete the older pods, NOT the ones created after scaling it up.
  • The deleted pods will get stuck in the Terminating state (a kubectl sketch of these steps follows the screenshot below).

[screenshot: pods-stuck1]
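For reference, the Deployment/scale/delete steps above can also be driven from kubectl; below is a minimal sketch (resource names, the Ingress host and replica counts are illustrative, not taken from the original setup):

```bash
# nginx Deployment + ClusterIP Service on port 80 + Ingress (names/host are illustrative).
kubectl create deployment ngnixtest --image=nginx --replicas=1 -n default
kubectl expose deployment ngnixtest --port=80 --type=ClusterIP -n default
kubectl create ingress ngnixtest --rule="nginx.example.com/*=ngnixtest:80" -n default

# Scale to 2 replicas before the upgrades.
kubectl scale deployment ngnixtest --replicas=2 -n default

# ... upgrade Rancher and then the downstream cluster to v1.24 ...

# Scale up, then delete only the pods created before the upgrade (oldest AGE).
kubectl scale deployment ngnixtest --replicas=5 -n default
kubectl get pods -n default -l app=ngnixtest --sort-by=.metadata.creationTimestamp
kubectl delete pod <one-of-the-older-pods> -n default
kubectl get pods -n default    # the deleted pod stays in Terminating
```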

Result

  • The pods get stuck in the Terminating state with the error below in recent events:
error killing pod: failed to "KillPodSandbox" for "3b4c7244-42ad-469d-960b-59ebb51a6563" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"ngnixtest-7769bd65df-7g7rn_default\" network: could not retrieve port mappings: key is not found"

Expected Result

  • Pods should terminate successfully on deletion.

Additional context

@rishabhmsra rishabhmsra added kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support labels Jul 14, 2022
@rishabhmsra rishabhmsra added this to the v2.6.7 milestone Jul 14, 2022
@rishabhmsra rishabhmsra self-assigned this Jul 14, 2022
@sowmyav27
Contributor

Note: This also happens on v2.6-head (2c21373) with no Rancher upgrade, only with a k8s upgrade from 1.23 to 1.24.

@kinarashah
Member

Mirantis/cri-dockerd#52

@zube zube bot removed the [zube]: Next Up label Jul 15, 2022
@Sahota1225 Sahota1225 added the release-note Note this issue in the milestone's release notes label Jul 19, 2022
@Sahota1225
Contributor

We may have to release-note this issue, since it is an upstream issue and not a release blocker. The Terminating pods are not blocking anything.
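If a release note is the outcome, it may be worth mentioning the usual kubectl workaround for clearing the stuck pods, sketched below with an illustrative pod name; note that this only removes the pod object from the API and does not clean up any leftover sandbox or network resources on the node:

```bash
# Force-remove a pod stuck in Terminating (illustrative name); bypasses graceful deletion.
kubectl delete pod ngnixtest-7769bd65df-7g7rn -n default --grace-period=0 --force
```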

@markusewalker
Contributor

This was already discussed in depth, but I wanted to confirm that I also observed this behavior.

Reproduced this on v2.6-head 104e111.

ENVIRONMENT DETAILS

  • Rancher Install: Docker
  • Rancher version: v2.6-head 104e111
  • Browser: Chrome

TEST RESULT
REPRODUCED

REPRODUCTION STEPS

  1. Set up Rancher and navigated to the UI in a browser.
  2. Created a standard user and logged in to Rancher as that user.
  3. Provisioned a downstream RKE1 node driver cluster with v1.22.11-rancher1-1; verified that it came up as Active.
    • Used 3 etcd, 2 cp, 3 worker nodes
  4. Upgraded the cluster to v1.24.2-rancher1-1; verified the cluster, workloads and nodes remained Active.
  5. Navigated to Cluster Explorer > cluster > Workloads > Pods.
  6. Restarted the nginx-ingress pods, but they never came back up (see the check sketched below):
    [screenshot]
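The same stuck state can be confirmed from the CLI; a quick sketch, assuming the RKE1 default ingress-nginx namespace:

```bash
# Check the nginx-ingress pods after the restart (namespace assumes the RKE1 default).
kubectl get pods -n ingress-nginx -o wide
kubectl describe pod <stuck-nginx-ingress-pod> -n ingress-nginx   # same KillPodSandbox / CNI teardown error under Events
```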

@snasovich
Collaborator

This is getting fixed by bumping cri-dockerd as part of rancher/rke-tools#155.
Also, the severity of this issue was misjudged and it should have been considered a release blocker. Luckily, a new release of cri-dockerd with the fix has just become available, and we're good to include it in 2.6.7.
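Once the bumped rke-tools ships, one way to spot-check a node is sketched below; this assumes cri-dockerd is started inside the RKE1 kubelet container, which may differ depending on the setup:

```bash
# Spot-check a node after the fix (assumes cri-dockerd runs inside the RKE1 "kubelet" container).
docker exec kubelet ps aux | grep [c]ri-dockerd   # confirm cri-dockerd is running and note its version/flags
kubectl get nodes -o wide                         # CONTAINER-RUNTIME should still report docker://<version>
```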

@snasovich snasovich assigned Oats87 and unassigned kinarashah Aug 3, 2022
@markusewalker
Contributor

Verified that this is addressed on v2.6-head 6a73921.

ENVIRONMENT DETAILS

  • Rancher Install: Docker
  • Rancher version: v2.6-head 6a73921
  • Browser: Chrome

TEST RESULT
PASS

VERIFICATION STEPS

  1. Set up Rancher and navigated to the UI in a browser.
  2. Created a standard user and logged in to Rancher as that user.
  3. Provisioned a downstream RKE1 node driver cluster with v1.22.11-rancher1-1.
  4. Upgraded the cluster to v1.24.2-rancher1-1; verified the cluster and nodes came up as Active and correctly reflect v1.24.2.
  5. Navigated to Cluster Explorer > cluster > Workloads > Pods.
  6. Restarted the nginx-ingress pods; verified that they came back up as Active (CLI check sketched below):
    [screenshot]
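For completeness, the equivalent CLI check (a sketch; the namespace assumes the RKE1 default):

```bash
# Confirm the restarted ingress pods reach Running again and nothing is left in Terminating.
kubectl get pods -n ingress-nginx
kubectl get pods --all-namespaces | grep Terminating || echo "no pods stuck in Terminating"
```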
