
Pods getting stuck in "Terminating" state on deletion after the cluster is upgraded to k8s v1.24 #38270

Closed
rishabhmsra opened this issue Jul 14, 2022 · 6 comments
Labels: kind/bug-qa · QA/XS · team/hostbusters
Milestone: v2.6.7

@rishabhmsra
Contributor

Rancher Server Setup

  • Rancher version: v2.6.6 -> v2.6-head(2c21373)
  • Installation option (Docker install/Helm Chart): Docker
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: v1.23.7-rancher1-1 -> v1.24.2-rancher1-1
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): EC2 custom cluster with 1 cp, 1 etcd, 2 worker nodes

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) Admin
    • If custom, define the set of permissions:

Describe the bug

  • After upgrading Rancher from v2.6.6 to v2.6-head (2c21373) and then upgrading the downstream cluster from v1.23 to v1.24, pods deployed before the Rancher upgrade get stuck in the Terminating state on deletion with the error below (a sketch for inspecting a stuck pod follows this list):
error killing pod: failed to "KillPodSandbox" for "3b4c7244-42ad-469d-960b-59ebb51a6563" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"ngnixtest-7769bd65df-7g7rn_default\" network: could not retrieve port mappings: key is not found"
  • Note: During the v1.24 KDM upgrade checks, this issue occurred on all downstream clusters regardless of network provider (canal, flannel, weave, calico).
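The state and error can be confirmed from kubectl and from the node itself; a minimal sketch (the pod name is the one from the error above and is only illustrative):

```bash
# Inspect a pod stuck in Terminating (pod name taken from the error above; adjust to your environment).
kubectl get pods -n default                                   # stuck pods show STATUS "Terminating"
kubectl describe pod ngnixtest-7769bd65df-7g7rn -n default    # the KillPodSandbox error shows up under Events
kubectl get events -n default --field-selector involvedObject.name=ngnixtest-7769bd65df-7g7rn
# RKE1 runs the kubelet as a Docker container named "kubelet", so the same CNI teardown
# error is also visible in its logs on the node hosting the pod:
docker logs kubelet 2>&1 | grep -i KillPodSandbox
```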

To Reproduce

  • Deploy a rancher server v2.6.6
  • Create a downstream custom cluster (1 cp, 1 etcd, 2 worker) using k8s v1.23.7-rancher1-1
  • Create a Deployment in the default namespace using nginx, expose it with a ClusterIP Service on port 80, and create an Ingress pointing to it.
  • Scale the nginx deployment to 2:

[screenshot: nginx-scale]

  • Upgrade the Rancher server from v2.6.6 to v2.6-head (2c21373)
  • Upgrade the downstream cluster from v1.23.7-rancher1-1 to v1.24.2-rancher1-1
  • Now scale the nginx deployment up, e.g. to 5 replicas.
  • From the nginx Deployment, delete the older pods, NOT the ones created after scaling it up.
  • The deleted pods will get stuck in the Terminating state (a kubectl sketch of these steps follows the screenshot below).

[screenshot: pods-stuck1]
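For reference, the Deployment/scale/delete steps above can also be driven from kubectl; below is a minimal sketch (resource names, the Ingress host and replica counts are illustrative, not taken from the original setup):

```bash
# nginx Deployment + ClusterIP Service on port 80 + Ingress (names/host are illustrative).
kubectl create deployment ngnixtest --image=nginx --replicas=1 -n default
kubectl expose deployment ngnixtest --port=80 --type=ClusterIP -n default
kubectl create ingress ngnixtest --rule="nginx.example.com/*=ngnixtest:80" -n default

# Scale to 2 replicas before the upgrades.
kubectl scale deployment ngnixtest --replicas=2 -n default

# ... upgrade Rancher and then the downstream cluster to v1.24 ...

# Scale up, then delete only the pods created before the upgrade (oldest AGE).
kubectl scale deployment ngnixtest --replicas=5 -n default
kubectl get pods -n default -l app=ngnixtest --sort-by=.metadata.creationTimestamp
kubectl delete pod <one-of-the-older-pods> -n default
kubectl get pods -n default    # the deleted pod stays in Terminating
```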

Result

  • The pods get stuck in the Terminating state with the error below in recent events:
error killing pod: failed to "KillPodSandbox" for "3b4c7244-42ad-469d-960b-59ebb51a6563" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"ngnixtest-7769bd65df-7g7rn_default\" network: could not retrieve port mappings: key is not found"

Expected Result

  • Pods should terminate successfully on deletion.

Additional context

@rishabhmsra rishabhmsra added kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support labels Jul 14, 2022
@rishabhmsra rishabhmsra added this to the v2.6.7 milestone Jul 14, 2022
@rishabhmsra rishabhmsra self-assigned this Jul 14, 2022
@sowmyav27
Contributor

Note: This also happens on v2.6-head (2c21373) with no Rancher upgrade, only with a k8s upgrade from 1.23 to 1.24.

@kinarashah
Member

Mirantis/cri-dockerd#52

@zube zube bot removed the [zube]: Next Up label Jul 15, 2022
@Sahota1225 Sahota1225 added the release-note Note this issue in the milestone's release notes label Jul 19, 2022
@Sahota1225
Contributor

We may have to release-note this issue, since it is an upstream issue and not a release blocker. The Terminating pods are not blocking anything.
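If a release note is the outcome, it may be worth mentioning the usual kubectl workaround for clearing the stuck pods, sketched below with an illustrative pod name; note that this only removes the pod object from the API and does not clean up any leftover sandbox or network resources on the node:

```bash
# Force-remove a pod stuck in Terminating (illustrative name); bypasses graceful deletion.
kubectl delete pod ngnixtest-7769bd65df-7g7rn -n default --grace-period=0 --force
```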

@markusewalker
Contributor

This was already discussed in depth, but I wanted to confirm that I also observed this behavior.

Reproduced this on v2.6-head 104e111.

ENVIRONMENT DETAILS

  • Rancher Install: Docker
  • Rancher version: v2.6-head 104e111
  • Browser: Chrome

TEST RESULT
REPRODUCED

REPRODUCTION STEPS

  1. Set up Rancher and navigated to the UI in a browser.
  2. Created a standard user and logged in to Rancher as that user.
  3. Provisioned a downstream RKE1 node driver cluster with v1.22.11-rancher1-1; verified that it came up as Active.
    • Used 3 etcd, 2 cp, 3 worker nodes
  4. Upgraded the cluster to v1.24.2-rancher1-1; verified the cluster, workloads and nodes remained Active.
  5. Navigated to Cluster Explorer > cluster > Workloads > Pods.
  6. Restarted the nginx-ingress pods, but they never came back up (see the check sketched below):
    [screenshot]
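The same stuck state can be confirmed from the CLI; a quick sketch, assuming the RKE1 default ingress-nginx namespace:

```bash
# Check the nginx-ingress pods after the restart (namespace assumes the RKE1 default).
kubectl get pods -n ingress-nginx -o wide
kubectl describe pod <stuck-nginx-ingress-pod> -n ingress-nginx   # same KillPodSandbox / CNI teardown error under Events
```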

@snasovich
Collaborator

This is getting fixed by bumping cri-dockerd as part of rancher/rke-tools#155.
Also, the severity of this issue was misjudged and it should have been considered a release blocker. Luckily, a new release of cri-dockerd with the fix has just become available, and we're good to include it in 2.6.7.
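Once the bumped rke-tools ships, one way to spot-check a node is sketched below; this assumes cri-dockerd is started inside the RKE1 kubelet container, which may differ depending on the setup:

```bash
# Spot-check a node after the fix (assumes cri-dockerd runs inside the RKE1 "kubelet" container).
docker exec kubelet ps aux | grep [c]ri-dockerd   # confirm cri-dockerd is running and note its version/flags
kubectl get nodes -o wide                         # CONTAINER-RUNTIME should still report docker://<version>
```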

@snasovich snasovich assigned Oats87 and unassigned kinarashah Aug 3, 2022
@markusewalker
Contributor

Verified that this is addressed on v2.6-head 6a73921.

ENVIRONMENT DETAILS

  • Rancher Install: Docker
  • Rancher version: v2.6-head 6a73921
  • Browser: Chrome

TEST RESULT
PASS

VERIFICATION STEPS

  1. Set up Rancher and navigated to the UI in a browser.
  2. Created a standard user and logged in to Rancher as that user.
  3. Provisioned a downstream RKE1 node driver cluster with v1.22.11-rancher1-1.
  4. Upgraded the cluster to v1.24.2-rancher1-1; verified the cluster and nodes came up as Active and correctly reflect v1.24.2.
  5. Navigated to Cluster Explorer > cluster > Workloads > Pods.
  6. Restarted the nginx-ingress pods; verified that they came back up as Active (CLI check sketched below):
    [screenshot]
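For completeness, the equivalent CLI check (a sketch; the namespace assumes the RKE1 default):

```bash
# Confirm the restarted ingress pods reach Running again and nothing is left in Terminating.
kubectl get pods -n ingress-nginx
kubectl get pods --all-namespaces | grep Terminating || echo "no pods stuck in Terminating"
```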
