-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(collector): solving pod stuck in terminating with CNI issue #1196
Conversation
Reproduce the issue
Default Spec Example
Output
|
Hey @banjoh, do we actually need to wait for copyFromHost pod to be deleted? Or I think we can force delete it in the first place to reduce waiting time, since it is not a critical pod. |
Since we set MAX_TIME_TO_WAIT_FOR_POD_DELETION to be 1 minute, you can use this simple spec for testing Simple Spec
Output
|
609ecb7
to
ca24031
Compare
pkg/collect/run_pod.go
Outdated
@@ -58,6 +60,33 @@ func (c *CollectRunPod) Collect(progressChan chan<- interface{}) (CollectorResul | |||
if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{}); err != nil { | |||
klog.Errorf("Failed to delete pod %s: %v", pod.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to return from the function here
pkg/collect/run_pod.go
Outdated
@@ -58,6 +60,33 @@ func (c *CollectRunPod) Collect(progressChan chan<- interface{}) (CollectorResul | |||
if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{}); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit pick: This function is big enough to be extracted into its own method in CollectRunPod
class. func (c *CollectRunPod) DeletePod()
perhaps?
pkg/collect/run_pod.go
Outdated
}); err != nil { | ||
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
} | |
klog.V(2).Infof("Pod %s in %s namespace has been deleted", pod.Name, pod.Namespace) |
pkg/collect/run_pod.go
Outdated
if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{ | ||
GracePeriodSeconds: &zeroGracePeriod, | ||
}); err != nil { | ||
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err) | |
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err) | |
return |
pkg/collect/copy_from_host.go
Outdated
if err != nil { | ||
return "", cleanup, errors.Wrap(err, "create daemonset") | ||
} | ||
cleanupFuncs = append(cleanupFuncs, func() { | ||
klog.V(2).Infof("Daemonset %s has been scheduled for deletion", createdDS.Name) | ||
if err := client.AppsV1().DaemonSets(namespace).Delete(context.Background(), createdDS.Name, metav1.DeleteOptions{}); err != nil { | ||
klog.Errorf("Failed to delete daemonset %s: %v", createdDS.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to return from the function here
pkg/collect/copy_from_host.go
Outdated
if err != nil { | ||
return "", cleanup, errors.Wrap(err, "create daemonset") | ||
} | ||
cleanupFuncs = append(cleanupFuncs, func() { | ||
klog.V(2).Infof("Daemonset %s has been scheduled for deletion", createdDS.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit pick: Extract to own function. Easier to review code.
pkg/collect/copy_from_host.go
Outdated
GracePeriodSeconds: &zeroGracePeriod, | ||
}) | ||
if err != nil { | ||
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err) | |
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err) | |
return |
pkg/collect/copy_from_host.go
Outdated
if err != nil { | ||
klog.Errorf("Failed to wait for pod %s deletion: %v", pod.Name, err) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
} | |
klog.V(2).Infof("Daemonset pod %s in %s namespace has been deleted", pod.Name, pod.Namespace) |
I suggest we have a shorter grace period instead of force deletion. |
@banjoh Thank you! I have updated the PR to reflect those changes. I agree for a shorter grace period, maybe better for our troubleshoot performance. I will create another PR to follow that. |
The last test failed with |
I have checked our code, it should be related to https://github.com/replicatedhq/troubleshoot/blob/401dfe2c571cc9bc024861939f94836ab880c90d/pkg/collect/run.go#LL41C1-L41C1 However, in k8s, a default service account is automatically created for the default namespace.This error should be very strange.
I will check our |
25c38b7
to
7ce9593
Compare
This needs tests for the new functions - unfortunately those files don't have any tests right now so that's breaking new ground. Is there any e2e test we can do for this, or is it just a matter of making sure the current ones don't fail any more? |
Description, Motivation and Context
runPod
andcopyFromHost
Fixes:
#1188
#1186
Previous PR:
#1172
#1194
Checklist
Does this PR introduce a breaking change?