
Check pods status deep equal before update #824

Merged
merged 1 commit into virtual-kubelet:master on Apr 21, 2020

Conversation

cwdsuzhou
Contributor

Check pod status for deep equality before updating. If we do not check, we update the pod status too frequently and quickly reach the QPS limit.

node/pod.go Outdated
@@ -185,6 +186,9 @@ func (pc *PodController) updatePodStatus(ctx context.Context, podFromKubernetes
kPod.Lock()
podFromProvider := kPod.lastPodStatusReceivedFromProvider.DeepCopy()
kPod.Unlock()
if reflect.DeepEqual(podFromKubernetes.Status, podFromProvider.Status) {
Contributor

Any reason not to use cmp here?

Contributor Author

Thanks for the review.

Good idea, updated.
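For reference, a minimal sketch of the switch under discussion, assuming the github.com/google/go-cmp/cmp package (the statusUnchanged helper is illustrative, not part of the codebase):

```go
package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	corev1 "k8s.io/api/core/v1"
)

// statusUnchanged reports whether the status Kubernetes already has
// matches the last status received from the provider, in which case
// the update call can be skipped.
func statusUnchanged(fromKubernetes, fromProvider *corev1.Pod) bool {
	// cmp.Equal behaves like reflect.DeepEqual for plain structs,
	// and cmp.Diff can print a readable diff when they differ.
	return cmp.Equal(fromKubernetes.Status, fromProvider.Status)
}

func main() {
	a := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodRunning}}
	b := a.DeepCopy()
	fmt.Println(statusUnchanged(a, b)) // true: no status update needed
}
```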

Contributor

@cpuguy83 cpuguy83 left a comment

LGTM

@cwdsuzhou
Contributor Author

Ping @cpuguy83 @sargun, is this PR ready to be merged?

node/pod.go Outdated
@@ -185,6 +185,9 @@ func (pc *PodController) updatePodStatus(ctx context.Context, podFromKubernetes
kPod.Lock()
podFromProvider := kPod.lastPodStatusReceivedFromProvider.DeepCopy()
kPod.Unlock()
if cmp.Equal(podFromKubernetes.Status, podFromProvider.Status) {
Contributor

I think we keep track of the last pod we received from the provider. Can we do the deduping there? That is, we look at the last pod we received from the provider, and if it wasn't the same, then we dedupe? Otherwise, this mechanism can lead to "false updates" and "missed updates" due to lag.

Also, I think this behaviour should be behind a config, because right now, firing away an update is a way to force a pod status update from VK -- which a provider may want to do if something interferes with the write to pod status.

Contributor Author

IMO, lastPodStatusReceivedFromProvider does not seem like a good design. As you mentioned, it may cause false updates and missed updates. Why not change it to a map instead of a single variable?

Contributor Author

https://github.com/virtual-kubelet/virtual-kubelet/blob/master/node/pod.go#L180
We have checked here, so it would not directly lead to "false updates" or "missed updates". This commit just adds one more check before updating the status to the API server.

@cpuguy83 wdyt?

Contributor

I think that without some level of checking here, you can get into a situation where you can miss updates.

For example, suppose a pod starts, a container fails, and it's restarted. You could fail to send the successful start because you haven't received the update from the API server indicating it failed.

Contributor Author

If the container restarts, the pod running time would change, and other fields in status.conditions would also change. What's more, if a pod's status has not changed, why would we need to update the status to the API server?

As you said, for start, fail, restart, updating the status or not would not change the result.

Contributor Author

Do you want to add it to NotifyPods? And then we can hash this one out later?

Sure, done

Contributor

Do you want to do the pod update dedupe in a different PR, or drop this dedupe behaviour in this PR to get it through?

Contributor

In regards to:

IMO, lastPodStatusReceivedFromProvider does not seem like a good design. As you mentioned, it may cause false updates and missed updates. Why not change it to a map instead of a single variable?

I think that there are two reasons behind this design (and I will not be one to defend it as being perfect):

  1. We want the ability to perform critical code operations on the k8s pod and the provider pod under the same lock.
  2. From a performance perspective, having to hold a global lock while doing these (relatively) simple operations gets slow quickly.

Contributor Author

Do you want to do the pod update dedupe in a different PR, or drop this dedupe behaviour in this PR to get it through?

Thanks, I will move the pod update dedupe to a different PR.

Contributor Author

In regards to:

IMO, lastPodStatusReceivedFromProvider does not seem like a good design. As you mentioned, it may cause false updates and missed updates. Why not change it to a map instead of a single variable?

I think that there are two reasons behind this design (and I will not be one to defend it as being perfect):

  1. We want the ability to perform critical code operations on the k8s pod and the provider pod under the same lock.
  2. From a performance perspective, having to hold a global lock while doing these (relatively) simple operations gets slow quickly.

Do we have a plan to change this logic, e.g. to a map[pod.UID]pod, to make it possible to sync pod statuses in parallel?
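To illustrate the idea, a rough sketch of a per-UID map with per-entry locks (knownPodEntry and podTracker are hypothetical names, not from the codebase):

```go
package podtrack

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// knownPodEntry carries its own lock, so status syncs for different
// pods can run in parallel instead of contending on one variable.
type knownPodEntry struct {
	sync.Mutex
	lastPodStatusReceivedFromProvider *corev1.Pod
}

// podTracker maps pod UID to its entry; sync.Map avoids locking the
// map itself on the read-heavy path.
type podTracker struct {
	pods sync.Map // map[types.UID]*knownPodEntry
}

// record stores the latest pod received from the provider for a UID.
func (t *podTracker) record(uid types.UID, pod *corev1.Pod) {
	v, _ := t.pods.LoadOrStore(uid, &knownPodEntry{})
	entry := v.(*knownPodEntry)
	entry.Lock()
	entry.lastPodStatusReceivedFromProvider = pod
	entry.Unlock()
}
```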

node/pod.go Outdated
@@ -213,7 +216,9 @@ func (pc *PodController) enqueuePodStatusUpdate(ctx context.Context, q workqueue
if obj, ok := pc.knownPods.Load(key); ok {
kpod := obj.(*knownPod)
kpod.Lock()
kpod.lastPodStatusReceivedFromProvider = pod
if !cmp.Equal(kpod.lastPodStatusReceivedFromProvider, pod) {
Contributor

If cmp.Equal, then we don't need to call q.AddRateLimited below.

Contributor Author

Yes, thanks

Contributor Author

Done, PTAL


@cwdsuzhou cwdsuzhou force-pushed the March/check_pod_equal branch 2 times, most recently from 1dc1e02 to 081fbbd on April 20, 2020 03:10
@cwdsuzhou cwdsuzhou requested a review from sargun April 20, 2020 03:11
Contributor

@sargun sargun left a comment

LGTM. Two quick questions.

kpod.lastPodStatusReceivedFromProvider = pod
kpod.Unlock()
q.AddRateLimited(key)
Contributor

Dumb question, do you know if this function blocks?

Contributor Author

I checked the implementation. This q is based on a chan with size 1000, so it would not block unless the queue is full.

@@ -213,8 +213,11 @@ func (pc *PodController) enqueuePodStatusUpdate(ctx context.Context, q workqueue
if obj, ok := pc.knownPods.Load(key); ok {
kpod := obj.(*knownPod)
kpod.Lock()
defer kpod.Unlock()
if cmp.Equal(kpod.lastPodStatusReceivedFromProvider, pod) {
Contributor

Do we need to compare the entire pod, or just the pod status?

Contributor Author

I think it may be safer to compare the entire pod, which would not change the original behavior.
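For reference, the two comparisons under discussion, as a sketch (the helper names are illustrative):

```go
package podcmp

import (
	"github.com/google/go-cmp/cmp"
	corev1 "k8s.io/api/core/v1"
)

// statusEqual is the narrower option the reviewer raises: compare
// only the status field.
func statusEqual(last, cur *corev1.Pod) bool {
	return cmp.Equal(last.Status, cur.Status)
}

// wholePodEqual is what the PR keeps: compare the entire object,
// which is more conservative and preserves the original behavior.
func wholePodEqual(last, cur *corev1.Pod) bool {
	return cmp.Equal(last, cur)
}
```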

@sargun
Contributor

sargun commented Apr 20, 2020

Thank you for working through this. I appreciate the contribution.

@cwdsuzhou
Contributor Author

cwdsuzhou commented Apr 20, 2020

Thank you for working through this. I appreciate the contribution.

Thanks, I would like to open another PR to add the check in UpdatePodStatus.

@cwdsuzhou
Contributor Author

@sargun @cpuguy83 could you help merge this PR? Thanks.

@sargun
Contributor

sargun commented Apr 21, 2020

I’ll wait a day in case @cpuguy83 has any issues. Is there a way to write a test for this by any chance?

node/pod.go Outdated
@@ -213,8 +213,11 @@ func (pc *PodController) enqueuePodStatusUpdate(ctx context.Context, q workqueue
if obj, ok := pc.knownPods.Load(key); ok {
kpod := obj.(*knownPod)
kpod.Lock()
defer kpod.Unlock()
Contributor

This defer could keep kpod locked if AddRateLimited blocks.
I would prefer we play this safer and just unlock immediately after cmp.Equal.

Contributor Author

Seems safer, done!
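Putting the two review points together, a sketch of the resulting shape (assuming the knownPod and queue types from the diffs above):

```go
package podqueue

import (
	"sync"

	"github.com/google/go-cmp/cmp"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

type knownPod struct {
	sync.Mutex
	lastPodStatusReceivedFromProvider *corev1.Pod
}

// enqueueIfChanged dedupes first, then unlocks before touching the
// queue so the lock is never held across a call that could block.
func enqueueIfChanged(kpod *knownPod, pod *corev1.Pod, q workqueue.RateLimitingInterface, key string) {
	kpod.Lock()
	if cmp.Equal(kpod.lastPodStatusReceivedFromProvider, pod) {
		// Nothing changed since the last notification from the
		// provider; skip q.AddRateLimited entirely.
		kpod.Unlock()
		return
	}
	kpod.lastPodStatusReceivedFromProvider = pod
	kpod.Unlock()
	q.AddRateLimited(key)
}
```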

@cwdsuzhou
Contributor Author

I’ll wait a day in case @cpuguy83 has any issues. Is there a way to write a test for this by any chance?

I went through the test code; it seems hard for now. I would prefer to open another PR to add tests for this.

Contributor

@cpuguy83 cpuguy83 left a comment

LGTM

@cpuguy83 cpuguy83 merged commit d9193e2 into virtual-kubelet:master Apr 21, 2020
@cwdsuzhou
Contributor Author

@sargun @cpuguy83 can you take a look at this PR?
#830
