
fix(collector): solving pod stuck in terminating with CNI issue #1196

Merged 5 commits into main on Jun 15, 2023

Conversation

@DexterYan DexterYan commented May 31, 2023

Description, Motivation and Context

  • This PR targets two collectors: runPod and copyFromHost.
  • Ensure the run pod and daemonset have been deleted before troubleshoot.sh finishes running.
  • Poll every second to check whether the pod has been deleted; the maximum grace period is 1 minute (see the sketch after this list).
  • If the 1 minute is reached, force delete all pods created by those two collectors.
  • Add klog.V(2) logging to report progress.
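
For illustration, the poll-then-force-delete flow above might look like the following client-go sketch. This is a minimal sketch, not the PR's actual code: deletePodAndWait and maxWaitForPodDeletion are illustrative names (the PR's constant is MAX_TIME_TO_WAIT_FOR_POD_DELETION).

// A minimal sketch of the poll-then-force-delete pattern, assuming a
// client-go clientset. Names here are illustrative, not the PR's code.
package collect

import (
    "context"
    "time"

    kuberrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
    "k8s.io/klog/v2"
)

const maxWaitForPodDeletion = time.Minute // stand-in for MAX_TIME_TO_WAIT_FOR_POD_DELETION

func deletePodAndWait(ctx context.Context, client kubernetes.Interface, ns, name string) {
    // Schedule a normal (graceful) deletion first.
    if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
        klog.Errorf("Failed to delete pod %s: %v", name, err)
        return
    }

    // Poll every second until the pod is gone or the one-minute budget is spent.
    err := wait.PollImmediate(time.Second, maxWaitForPodDeletion, func() (bool, error) {
        _, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
        if kuberrors.IsNotFound(err) {
            return true, nil // fully deleted
        }
        return false, nil // still terminating (e.g. stuck on a broken CNI); keep polling
    })
    if err == nil {
        klog.V(2).Infof("Pod %s in %s namespace has been deleted", name, ns)
        return
    }

    // The budget elapsed: force delete with a zero grace period.
    zero := int64(0)
    if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
        GracePeriodSeconds: &zero,
    }); err != nil {
        klog.Errorf("Failed to force delete pod %s: %v", name, err)
    }
}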

Fixes:
#1188
#1186

Previous PRs:
#1172
#1194

Checklist

  • New and existing tests pass locally with introduced changes.
  • Tests for the changes have been added (for bug fixes / features)
  • The commit message(s) are informative and highlight any breaking changes
  • Any documentation required has been added/updated. For changes to https://troubleshoot.sh/ create a PR here

Does this PR introduce a breaking change?

  • Yes
  • No

@DexterYan DexterYan requested a review from a team as a code owner May 31, 2023 06:15
@DexterYan DexterYan marked this pull request as draft May 31, 2023 06:15
@DexterYan DexterYan added the type::bug (Something isn't working) and bug::regression labels May 31, 2023
DexterYan commented Jun 1, 2023

Reproduce the issue

  1. Run curl https://kurl.sh/ad42372 | sudo bash
  2. Remove the Weave daemonset

Default Spec Example

./bin/support-bundle https://raw.githubusercontent.com/replicatedhq/troubleshoot-specs/main/in-cluster/default.yaml -v2

Output

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
copy-from-host/kurl-host-preflights (S)      : 31,520ms
copy-from-host/copy apiserver audit logs (S) : 31,413ms
copy-from-host/copy kURL logs (S)            : 31,396ms
collectd/collectd (S)                        : 31,317ms
cluster-resources (S)                        : 6,101ms
run-pod (F)                                  : 1,309ms
logs/kurl-control-plane (S)                  : 601ms
http/replicated.app-health-check (S)         : 488ms
logs/kotsadm (S)                             : 135ms
logs/ekc-operator (S)                        : 127ms
configmap/kurl-current-config (S)            : 118ms
logs/rqlite-logs (S)                         : 115ms
exec/kotsadm-rqlite-db (S)                   : 108ms
ceph (S)                                     : 78ms
logs/weave-net (S)                           : 51ms
logs/rook-ceph-logs (S)                      : 46ms
exec/goldpinger-statistics (S)               : 46ms
logs/kotsadm-api (S)                         : 43ms
logs/velero-logs (S)                         : 43ms
logs/kotsadm-operator (S)                    : 42ms
exec/kotsadm-postgres-db-dump (S)            : 42ms
logs/kotsadm-postgres-db (S)                 : 41ms
logs/kotsadm-dex (S)                         : 40ms
configmap/ekco-config (S)                    : 39ms
logs/kotsadm-s3-ops (S)                      : 39ms
configmap/kubelet-config (S)                 : 39ms
exec/kotsadm-operator-goroutines (S)         : 39ms
logs/registry (S)                            : 39ms
logs/kotsadm-fs-minio (S)                    : 39ms
logs/minio (S)                               : 39ms
configmap/kubeadm-config (S)                 : 38ms
configmap/coredns (S)                        : 38ms
configmap/kurl-last-config (S)               : 38ms
configmap/kurl-config (S)                    : 38ms
configmap/weave-net (S)                      : 38ms
logs/kurl-proxy-kotsadm (S)                  : 38ms
exec/weave-status (S)                        : 37ms
logs/projectcontour-logs (S)                 : 37ms
exec/kotsadm-goroutines (S)                  : 37ms
configmap/kube-proxy (S)                     : 37ms
exec/weave-report (S)                        : 37ms
longhorn (S)                                 : 37ms
cluster-info (S)                             : 37ms
secret/kotsadm-replicated-registry (S)       : 1ms

============ Redactors summary =============
In-cluster collectors : 614ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
Cluster Pod Status (S)                              : 1ms
Deployment Status (S)                               : 0ms
Statefulset Status (S)                              : 0ms
Check EKCO is operational (S)                       : 0ms
Job Status (S)                                      : 0ms
Container Runtime (S)                               : 0ms
contour pods unhealthy (S)                          : 0ms
Node status check (S)                               : 0ms
Minio disk full (S)                                 : 0ms
https://replicated.app host health check (S)        : 0ms
Longhorn analyzer (S)                               : 0ms
Ceph Status (S)                                     : 0ms
Check installed EKCO version for critical fixes (S) : 0ms
Inter-pod Networking (S)                            : 0ms
ReplicaSet Status (S)                               : 0ms
Weave CNI (S)                                       : 0ms
Rook rbd filesystem consistency (S)                 : 0ms
longhorn multipath conflict (S)                     : 0ms
Weave Status (S)                                    : 0ms
Known issue with Rook < 1.4 (S)                     : 0ms
Weave Report (S)                                    : 0ms
Weave IP Allocation (S)                             : 0ms

Duration: 141,128ms

DexterYan commented:
Hey @banjoh, do we actually need to wait for the copyFromHost pod to be deleted? Or could we force delete it right away to reduce waiting time, since it is not a critical pod?

DexterYan commented:
Since we set MAX_TIME_TO_WAIT_FOR_POD_DELETION to 1 minute, you can use this simple spec for testing:

Simple Spec

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: default
spec:
  collectors:
    - runPod:
        name: ekco-resources
        namespace: kurl
        podSpec:
          containers:
            - name: inspect-ekco-pods
              image: adamancini/netshoot
              command: ["sh", "-c", "--"]
              args:
                [
                  "kubectl get pod -n kurl --selector app=ekc-operator --field-selector status.phase=Running -o json | jq -r .items[]",
                ]
          restartPolicy: Never
          dnsPolicy: ClusterFirst
          serviceAccount: ekco
    - copyFromHost:
        collectorName: "copy apiserver audit logs"
        image: alpine
        hostPath: "/var/log/apiserver/"
        name: "logs"
        extractArchive: true

Output

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
copy-from-host/copy apiserver audit logs (S) : 90,472ms
run-pod (F)                                  : 60,375ms
cluster-resources (S)                        : 6,814ms
cluster-info (S)                             : 107ms

============ Redactors summary =============
In-cluster collectors : 490ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 158,685ms

@DexterYan DexterYan marked this pull request as ready for review June 1, 2023 04:42
@@ -58,6 +60,33 @@ func (c *CollectRunPod) Collect(progressChan chan<- interface{}) (CollectorResul
if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{}); err != nil {
klog.Errorf("Failed to delete pod %s: %v", pod.Name, err)

Member commented:

I think we need to return from the function here

@@ -58,6 +60,33 @@ func (c *CollectRunPod) Collect(progressChan chan<- interface{}) (CollectorResul
if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{}); err != nil {

Member commented:

Nitpick: this function is big enough to be extracted into its own method on the CollectRunPod struct. func (c *CollectRunPod) DeletePod() perhaps?
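
As a rough sketch of the suggested refactor (the method name and signature are the reviewer's proposal, not merged code; assumes the usual client-go imports plus corev1 "k8s.io/api/core/v1"):

// Sketch only: move the inline deletion logic into its own method.
func (c *CollectRunPod) DeletePod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) {
    if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
        klog.Errorf("Failed to delete pod %s: %v", pod.Name, err)
        return
    }
    // ...then wait for deletion and force delete on timeout, as in the diff above.
}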

}); err != nil {
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)
}
}

Member suggested adding a log line after the closing braces:
}
}
klog.V(2).Infof("Pod %s in %s namespace has been deleted", pod.Name, pod.Namespace)

if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{
GracePeriodSeconds: &zeroGracePeriod,
}); err != nil {
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)

Member suggested adding a return after the error log:

klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)
return

if err != nil {
return "", cleanup, errors.Wrap(err, "create daemonset")
}
cleanupFuncs = append(cleanupFuncs, func() {
klog.V(2).Infof("Daemonset %s has been scheduled for deletion", createdDS.Name)
if err := client.AppsV1().DaemonSets(namespace).Delete(context.Background(), createdDS.Name, metav1.DeleteOptions{}); err != nil {
klog.Errorf("Failed to delete daemonset %s: %v", createdDS.Name, err)

Member commented:

I think we need to return from the function here

if err != nil {
return "", cleanup, errors.Wrap(err, "create daemonset")
}
cleanupFuncs = append(cleanupFuncs, func() {
klog.V(2).Infof("Daemonset %s has been scheduled for deletion", createdDS.Name)

Member commented:

Nitpick: extract this into its own function; it would make the code easier to review.

GracePeriodSeconds: &zeroGracePeriod,
})
if err != nil {
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)

Member suggested adding a return after the error log:

klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)
return

if err != nil {
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)
}
}

Member suggested adding a log line after the closing braces:
}
}
klog.V(2).Infof("Daemonset pod %s in %s namespace has been deleted", pod.Name, pod.Namespace)

banjoh commented Jun 1, 2023

> Hey @banjoh, do we actually need to wait for the copyFromHost pod to be deleted? Or could we force delete it right away to reduce waiting time, since it is not a critical pod?

I suggest we use a shorter grace period instead of force deletion, 3s perhaps? kubectl, for example, uses 1s to tell Kubernetes to delete the pod "immediately".
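
A minimal sketch of that alternative, assuming a client-go clientset (the 3-second value is the reviewer's suggestion, not the merged code; deleteWithShortGrace is an illustrative helper):

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deleteWithShortGrace asks Kubernetes for a short grace period instead of
// forcing deletion with a zero grace period. Illustrative only.
func deleteWithShortGrace(ctx context.Context, client kubernetes.Interface, ns, name string) error {
    gracePeriod := int64(3) // seconds; kubectl uses 1 for "immediate" deletion
    return client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
        GracePeriodSeconds: &gracePeriod,
    })
}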

@banjoh banjoh closed this Jun 1, 2023
@banjoh banjoh reopened this Jun 1, 2023
DexterYan commented:

@banjoh Thank you! I have updated the PR to reflect those changes.

I agree that a shorter grace period may be better for troubleshoot's performance. I will create a follow-up PR for that.

xavpaice commented Jun 5, 2023

The last test failed with error - failed to run collector: run-pod/static-hi: failed to run pod: failed to create pod: pods "static-hi" is forbidden: error looking up service account default/default: serviceaccount "default" not found - which looks like a different error. Is this something we should look at as part of the change here?

DexterYan commented Jun 7, 2023

> The last test failed with error - failed to run collector: run-pod/static-hi: failed to run pod: failed to create pod: pods "static-hi" is forbidden: error looking up service account default/default: serviceaccount "default" not found - which looks like a different error. Is this something we should look at as part of the change here?

I have checked our code; it should be related to https://github.com/replicatedhq/troubleshoot/blob/401dfe2c571cc9bc024861939f94836ab880c90d/pkg/collect/run.go#LL41C1-L41C1
We are using the default service account in the default namespace, without any checks, to create the run pod. I will create another PR to fix it; this should not be related to this change.
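
A hypothetical pre-flight check for that follow-up fix could look like the sketch below; serviceAccountExists is an illustrative helper, not a function from the troubleshoot codebase.

import (
    "context"

    kuberrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// serviceAccountExists reports whether the named service account exists, so
// the collector can fail early with a clear message instead of the
// "forbidden" pod-creation error seen above. Illustrative only.
func serviceAccountExists(ctx context.Context, client kubernetes.Interface, ns, name string) (bool, error) {
    _, err := client.CoreV1().ServiceAccounts(ns).Get(ctx, name, metav1.GetOptions{})
    if kuberrors.IsNotFound(err) {
        return false, nil
    }
    if err != nil {
        return false, err
    }
    return true, nil
}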

However, in Kubernetes a default service account is automatically created in every namespace, so this error is very strange:

error looking up service account default/default: serviceaccount "default" not found

I will check our replicatedhq/action-k3s to see if we can do anything to prevent it.
For now, I think we can merge this PR.

xavpaice commented:

This needs tests for the new functions. Unfortunately those files don't have any tests right now, so that's breaking new ground. Is there any e2e test we can add for this, or is it just a matter of making sure the current ones don't fail any more?

@xavpaice xavpaice merged commit 5b1e482 into main Jun 15, 2023
21 checks passed
@xavpaice xavpaice deleted the dx/fix-copy-host-retry branch June 15, 2023 02:00