
fix(collector): solving pod stuck in terminating with CNI issue #1196

Merged 5 commits into main on Jun 15, 2023

Conversation

@DexterYan DexterYan commented May 31, 2023

Description, Motivation and Context

  • This PR targets two collectors: runPod and copyFromHost.
  • Ensure the run pod and daemonset have been deleted before troubleshoot.sh finishes running.
  • Poll every second to check whether the pod has been deleted; the maximum grace period is 1 minute (see the sketch after this list).
  • If the 1 minute is reached, force delete all pods created by those two collectors.
  • Add klog.V(2) logging to report progress.
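
For illustration, the poll-then-force-delete flow above might look like the following client-go sketch. This is a minimal sketch, not the PR's actual code: deletePodAndWait and maxWaitForPodDeletion are illustrative names (the PR's constant is MAX_TIME_TO_WAIT_FOR_POD_DELETION).

// A minimal sketch of the poll-then-force-delete pattern, assuming a
// client-go clientset. Names here are illustrative, not the PR's code.
package collect

import (
    "context"
    "time"

    kuberrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
    "k8s.io/klog/v2"
)

const maxWaitForPodDeletion = time.Minute // stand-in for MAX_TIME_TO_WAIT_FOR_POD_DELETION

func deletePodAndWait(ctx context.Context, client kubernetes.Interface, ns, name string) {
    // Schedule a normal (graceful) deletion first.
    if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
        klog.Errorf("Failed to delete pod %s: %v", name, err)
        return
    }

    // Poll every second until the pod is gone or the one-minute budget is spent.
    err := wait.PollImmediate(time.Second, maxWaitForPodDeletion, func() (bool, error) {
        _, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
        if kuberrors.IsNotFound(err) {
            return true, nil // fully deleted
        }
        return false, nil // still terminating (e.g. stuck on a broken CNI); keep polling
    })
    if err == nil {
        klog.V(2).Infof("Pod %s in %s namespace has been deleted", name, ns)
        return
    }

    // The budget elapsed: force delete with a zero grace period.
    zero := int64(0)
    if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
        GracePeriodSeconds: &zero,
    }); err != nil {
        klog.Errorf("Failed to force delete pod %s: %v", name, err)
    }
}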

Fixes:
#1188
#1186

Previous PRs:
#1172
#1194

Checklist

  • New and existing tests pass locally with introduced changes.
  • Tests for the changes have been added (for bug fixes / features)
  • The commit message(s) are informative and highlight any breaking changes
  • Any documentation required has been added/updated. For changes to https://troubleshoot.sh/ create a PR here

Does this PR introduce a breaking change?

  • Yes
  • No

@DexterYan DexterYan requested a review from a team as a code owner May 31, 2023 06:15
@DexterYan DexterYan marked this pull request as draft May 31, 2023 06:15
@DexterYan DexterYan added the type::bug (Something isn't working) and bug::regression labels May 31, 2023
DexterYan commented Jun 1, 2023

Reproduce the issue

  1. Run curl https://kurl.sh/ad42372 | sudo bash
  2. Remove the Weave daemonset

Default Spec Example

./bin/support-bundle https://raw.githubusercontent.com/replicatedhq/troubleshoot-specs/main/in-cluster/default.yaml -v2

Output

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
copy-from-host/kurl-host-preflights (S)      : 31,520ms
copy-from-host/copy apiserver audit logs (S) : 31,413ms
copy-from-host/copy kURL logs (S)            : 31,396ms
collectd/collectd (S)                        : 31,317ms
cluster-resources (S)                        : 6,101ms
run-pod (F)                                  : 1,309ms
logs/kurl-control-plane (S)                  : 601ms
http/replicated.app-health-check (S)         : 488ms
logs/kotsadm (S)                             : 135ms
logs/ekc-operator (S)                        : 127ms
configmap/kurl-current-config (S)            : 118ms
logs/rqlite-logs (S)                         : 115ms
exec/kotsadm-rqlite-db (S)                   : 108ms
ceph (S)                                     : 78ms
logs/weave-net (S)                           : 51ms
logs/rook-ceph-logs (S)                      : 46ms
exec/goldpinger-statistics (S)               : 46ms
logs/kotsadm-api (S)                         : 43ms
logs/velero-logs (S)                         : 43ms
logs/kotsadm-operator (S)                    : 42ms
exec/kotsadm-postgres-db-dump (S)            : 42ms
logs/kotsadm-postgres-db (S)                 : 41ms
logs/kotsadm-dex (S)                         : 40ms
configmap/ekco-config (S)                    : 39ms
logs/kotsadm-s3-ops (S)                      : 39ms
configmap/kubelet-config (S)                 : 39ms
exec/kotsadm-operator-goroutines (S)         : 39ms
logs/registry (S)                            : 39ms
logs/kotsadm-fs-minio (S)                    : 39ms
logs/minio (S)                               : 39ms
configmap/kubeadm-config (S)                 : 38ms
configmap/coredns (S)                        : 38ms
configmap/kurl-last-config (S)               : 38ms
configmap/kurl-config (S)                    : 38ms
configmap/weave-net (S)                      : 38ms
logs/kurl-proxy-kotsadm (S)                  : 38ms
exec/weave-status (S)                        : 37ms
logs/projectcontour-logs (S)                 : 37ms
exec/kotsadm-goroutines (S)                  : 37ms
configmap/kube-proxy (S)                     : 37ms
exec/weave-report (S)                        : 37ms
longhorn (S)                                 : 37ms
cluster-info (S)                             : 37ms
secret/kotsadm-replicated-registry (S)       : 1ms

============ Redactors summary =============
In-cluster collectors : 614ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
Cluster Pod Status (S)                              : 1ms
Deployment Status (S)                               : 0ms
Statefulset Status (S)                              : 0ms
Check EKCO is operational (S)                       : 0ms
Job Status (S)                                      : 0ms
Container Runtime (S)                               : 0ms
contour pods unhealthy (S)                          : 0ms
Node status check (S)                               : 0ms
Minio disk full (S)                                 : 0ms
https://replicated.app host health check (S)        : 0ms
Longhorn analyzer (S)                               : 0ms
Ceph Status (S)                                     : 0ms
Check installed EKCO version for critical fixes (S) : 0ms
Inter-pod Networking (S)                            : 0ms
ReplicaSet Status (S)                               : 0ms
Weave CNI (S)                                       : 0ms
Rook rbd filesystem consistency (S)                 : 0ms
longhorn multipath conflict (S)                     : 0ms
Weave Status (S)                                    : 0ms
Known issue with Rook < 1.4 (S)                     : 0ms
Weave Report (S)                                    : 0ms
Weave IP Allocation (S)                             : 0ms

Duration: 141,128ms

DexterYan commented:
Hey @banjoh, do we actually need to wait for the copyFromHost pod to be deleted? Or could we force delete it right away to reduce waiting time, since it is not a critical pod?

DexterYan commented:
Since we set MAX_TIME_TO_WAIT_FOR_POD_DELETION to 1 minute, you can use this simple spec for testing:

Simple Spec

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: default
spec:
  collectors:
    - runPod:
        name: ekco-resources
        namespace: kurl
        podSpec:
          containers:
            - name: inspect-ekco-pods
              image: adamancini/netshoot
              command: ["sh", "-c", "--"]
              args:
                [
                  "kubectl get pod -n kurl --selector app=ekc-operator --field-selector status.phase=Running -o json | jq -r .items[]",
                ]
          restartPolicy: Never
          dnsPolicy: ClusterFirst
          serviceAccount: ekco
    - copyFromHost:
        collectorName: "copy apiserver audit logs"
        image: alpine
        hostPath: "/var/log/apiserver/"
        name: "logs"
        extractArchive: true

Output

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
copy-from-host/copy apiserver audit logs (S) : 90,472ms
run-pod (F)                                  : 60,375ms
cluster-resources (S)                        : 6,814ms
cluster-info (S)                             : 107ms

============ Redactors summary =============
In-cluster collectors : 490ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 158,685ms

@DexterYan DexterYan marked this pull request as ready for review June 1, 2023 04:42
@@ -58,6 +60,33 @@ func (c *CollectRunPod) Collect(progressChan chan<- interface{}) (CollectorResul
if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{}); err != nil {
klog.Errorf("Failed to delete pod %s: %v", pod.Name, err)

Member commented:

I think we need to return from the function here

@@ -58,6 +60,33 @@ func (c *CollectRunPod) Collect(progressChan chan<- interface{}) (CollectorResul
if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{}); err != nil {

Member commented:

Nitpick: this function is big enough to be extracted into its own method on the CollectRunPod struct. func (c *CollectRunPod) DeletePod() perhaps?
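
As a rough sketch of the suggested refactor (the method name and signature are the reviewer's proposal, not merged code; assumes the usual client-go imports plus corev1 "k8s.io/api/core/v1"):

// Sketch only: move the inline deletion logic into its own method.
func (c *CollectRunPod) DeletePod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) {
    if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
        klog.Errorf("Failed to delete pod %s: %v", pod.Name, err)
        return
    }
    // ...then wait for deletion and force delete on timeout, as in the diff above.
}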

}); err != nil {
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)
}
}

Member suggested adding a log line after the closing braces:
}
}
klog.V(2).Infof("Pod %s in %s namespace has been deleted", pod.Name, pod.Namespace)

if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{
GracePeriodSeconds: &zeroGracePeriod,
}); err != nil {
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)

Member suggested adding a return after the error log:

klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)
return

if err != nil {
return "", cleanup, errors.Wrap(err, "create daemonset")
}
cleanupFuncs = append(cleanupFuncs, func() {
klog.V(2).Infof("Daemonset %s has been scheduled for deletion", createdDS.Name)
if err := client.AppsV1().DaemonSets(namespace).Delete(context.Background(), createdDS.Name, metav1.DeleteOptions{}); err != nil {
klog.Errorf("Failed to delete daemonset %s: %v", createdDS.Name, err)

Member commented:

I think we need to return from the function here

if err != nil {
return "", cleanup, errors.Wrap(err, "create daemonset")
}
cleanupFuncs = append(cleanupFuncs, func() {
klog.V(2).Infof("Daemonset %s has been scheduled for deletion", createdDS.Name)

Member commented:

Nitpick: extract this into its own function; it would make the code easier to review.

GracePeriodSeconds: &zeroGracePeriod,
})
if err != nil {
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)

Member suggested adding a return after the error log:

klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)
return

if err != nil {
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err)
}
}

Member suggested adding a log line after the closing braces:
}
}
klog.V(2).Infof("Daemonset pod %s in %s namespace has been deleted", pod.Name, pod.Namespace)

banjoh commented Jun 1, 2023

> Hey @banjoh, do we actually need to wait for the copyFromHost pod to be deleted? Or could we force delete it right away to reduce waiting time, since it is not a critical pod?

I suggest we use a shorter grace period instead of force deletion, 3s perhaps? kubectl, for example, uses 1s to tell Kubernetes to delete the pod "immediately".
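
A minimal sketch of that alternative, assuming a client-go clientset (the 3-second value is the reviewer's suggestion, not the merged code; deleteWithShortGrace is an illustrative helper):

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deleteWithShortGrace asks Kubernetes for a short grace period instead of
// forcing deletion with a zero grace period. Illustrative only.
func deleteWithShortGrace(ctx context.Context, client kubernetes.Interface, ns, name string) error {
    gracePeriod := int64(3) // seconds; kubectl uses 1 for "immediate" deletion
    return client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
        GracePeriodSeconds: &gracePeriod,
    })
}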

@banjoh banjoh closed this Jun 1, 2023
@banjoh banjoh reopened this Jun 1, 2023
DexterYan commented:

@banjoh Thank you! I have updated the PR to reflect those changes.

I agree that a shorter grace period may be better for troubleshoot's performance. I will create a follow-up PR for that.

xavpaice commented Jun 5, 2023

The last test failed with error - failed to run collector: run-pod/static-hi: failed to run pod: failed to create pod: pods "static-hi" is forbidden: error looking up service account default/default: serviceaccount "default" not found - which looks like a different error. Is this something we should look at as part of the change here?

DexterYan commented Jun 7, 2023

> The last test failed with error - failed to run collector: run-pod/static-hi: failed to run pod: failed to create pod: pods "static-hi" is forbidden: error looking up service account default/default: serviceaccount "default" not found - which looks like a different error. Is this something we should look at as part of the change here?

I have checked our code; it should be related to https://github.com/replicatedhq/troubleshoot/blob/401dfe2c571cc9bc024861939f94836ab880c90d/pkg/collect/run.go#LL41C1-L41C1
We are using the default service account in the default namespace, without any checks, to create the run pod. I will create another PR to fix it; this should not be related to this change.
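
A hypothetical pre-flight check for that follow-up fix could look like the sketch below; serviceAccountExists is an illustrative helper, not a function from the troubleshoot codebase.

import (
    "context"

    kuberrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// serviceAccountExists reports whether the named service account exists, so
// the collector can fail early with a clear message instead of the
// "forbidden" pod-creation error seen above. Illustrative only.
func serviceAccountExists(ctx context.Context, client kubernetes.Interface, ns, name string) (bool, error) {
    _, err := client.CoreV1().ServiceAccounts(ns).Get(ctx, name, metav1.GetOptions{})
    if kuberrors.IsNotFound(err) {
        return false, nil
    }
    if err != nil {
        return false, err
    }
    return true, nil
}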

However, in Kubernetes a default service account is automatically created in every namespace, so this error is very strange:

error looking up service account default/default: serviceaccount "default" not found

I will check our replicatedhq/action-k3s to see if we can do anything to prevent it.
For now, I think we can merge this PR.

xavpaice commented:

This needs tests for the new functions. Unfortunately those files don't have any tests right now, so that's breaking new ground. Is there any e2e test we can add for this, or is it just a matter of making sure the current ones don't fail any more?

@xavpaice xavpaice merged commit 5b1e482 into main Jun 15, 2023
21 checks passed
@xavpaice xavpaice deleted the dx/fix-copy-host-retry branch June 15, 2023 02:00