
Check pods status deep equal before update #824

Merged
merged 1 commit into virtual-kubelet:master on Apr 21, 2020

Conversation

cwdsuzhou
Contributor

Check pod status for deep equality before updating. If we do not check, we update the pod status too frequently and quickly reach the QPS limit.

node/pod.go Outdated
@@ -185,6 +186,9 @@ func (pc *PodController) updatePodStatus(ctx context.Context, podFromKubernetes
kPod.Lock()
podFromProvider := kPod.lastPodStatusReceivedFromProvider.DeepCopy()
kPod.Unlock()
if reflect.DeepEqual(podFromKubernetes.Status, podFromProvider.Status) {
Contributor

Any reason not to use cmp here?

Contributor Author

Thanks for the review.

Good idea, updated.
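For reference, a minimal sketch of the switch under discussion, assuming the github.com/google/go-cmp/cmp package (the statusUnchanged helper is illustrative, not part of the codebase):

```go
package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	corev1 "k8s.io/api/core/v1"
)

// statusUnchanged reports whether the status Kubernetes already has
// matches the last status received from the provider, in which case
// the update call can be skipped.
func statusUnchanged(fromKubernetes, fromProvider *corev1.Pod) bool {
	// cmp.Equal behaves like reflect.DeepEqual for plain structs,
	// and cmp.Diff can print a readable diff when they differ.
	return cmp.Equal(fromKubernetes.Status, fromProvider.Status)
}

func main() {
	a := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodRunning}}
	b := a.DeepCopy()
	fmt.Println(statusUnchanged(a, b)) // true: no status update needed
}
```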

Contributor

@cpuguy83 cpuguy83 left a comment

LGTM

@cwdsuzhou
Contributor Author

Ping @cpuguy83 @sargun, is this PR ready to be merged?

node/pod.go Outdated
@@ -185,6 +185,9 @@ func (pc *PodController) updatePodStatus(ctx context.Context, podFromKubernetes
kPod.Lock()
podFromProvider := kPod.lastPodStatusReceivedFromProvider.DeepCopy()
kPod.Unlock()
if cmp.Equal(podFromKubernetes.Status, podFromProvider.Status) {
Contributor

I think we keep track of the last pod we received from the provider. Can we do the deduping there? That is, we look at the last pod we received from the provider, and if it wasn't the same, then we dedupe? Otherwise, this mechanism can lead to "false updates" and "missed updates" due to lag.

Also, I think this behaviour should be behind a config, because right now, firing away an update is a way to force a pod status update from VK -- which a provider may want to do if something interferes with the write to pod status.

Contributor Author

IMO, lastPodStatusReceivedFromProvider does not seem like a good design. As you mentioned, it may cause false updates and missed updates. Why not change it to a map instead of a single variable?

Contributor Author

https://github.com/virtual-kubelet/virtual-kubelet/blob/master/node/pod.go#L180
We have checked here, so it would not directly lead to "false updates" or "missed updates". This commit just adds one more check before updating the status to the API server.

@cpuguy83 wdyt?

Contributor

I think that without some level of checking here, you can get into a situation where you can miss updates.

For example, suppose a pod starts, a container fails, and it's restarted. You could fail to send the successful start because you haven't received the update from the API server indicating it failed.

Contributor Author

If the container restarts, the pod running time would change, and other fields in status.conditions would also change. What's more, if a pod's status has not changed, why would we need to update the status to the API server?

As you said, for start, fail, restart, updating the status or not would not change the result.

Contributor Author

Do you want to add it to NotifyPods? And then we can hash this one out later?

Sure, done

Contributor

Do you want to do the pod update dedupe in a different PR, or drop this dedupe behaviour in this PR to get it through?

Contributor

In regards to:

IMO, lastPodStatusReceivedFromProvider does not seem like a good design. As you mentioned, it may cause false updates and missed updates. Why not change it to a map instead of a single variable?

I think that there are two reasons behind this design (and I will not be one to defend it as being perfect):

  1. We want the ability to perform critical code operations on the k8s pod and the provider pod under the same lock.
  2. From a performance perspective, having to hold a global lock while doing these (relatively) simple operations gets slow quickly.

Contributor Author

Do you want to do the pod update dedupe in a different PR, or drop this dedupe behaviour in this PR to get it through?

Thanks, I will move the pod update dedupe to a different PR.

Contributor Author

In regards to:

IMO, lastPodStatusReceivedFromProvider does not seem like a good design. As you mentioned, it may cause false updates and missed updates. Why not change it to a map instead of a single variable?

I think that there are two reasons behind this design (and I will not be one to defend it as being perfect):

  1. We want the ability to perform critical code operations on the k8s pod and the provider pod under the same lock.
  2. From a performance perspective, having to hold a global lock while doing these (relatively) simple operations gets slow quickly.

Do we have a plan to change this logic, e.g. to a map[pod.UID]pod, to make it possible to sync pod statuses in parallel?
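To illustrate the idea, a rough sketch of a per-UID map with per-entry locks (knownPodEntry and podTracker are hypothetical names, not from the codebase):

```go
package podtrack

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// knownPodEntry carries its own lock, so status syncs for different
// pods can run in parallel instead of contending on one variable.
type knownPodEntry struct {
	sync.Mutex
	lastPodStatusReceivedFromProvider *corev1.Pod
}

// podTracker maps pod UID to its entry; sync.Map avoids locking the
// map itself on the read-heavy path.
type podTracker struct {
	pods sync.Map // map[types.UID]*knownPodEntry
}

// record stores the latest pod received from the provider for a UID.
func (t *podTracker) record(uid types.UID, pod *corev1.Pod) {
	v, _ := t.pods.LoadOrStore(uid, &knownPodEntry{})
	entry := v.(*knownPodEntry)
	entry.Lock()
	entry.lastPodStatusReceivedFromProvider = pod
	entry.Unlock()
}
```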

node/pod.go Outdated
@@ -213,7 +216,9 @@ func (pc *PodController) enqueuePodStatusUpdate(ctx context.Context, q workqueue
if obj, ok := pc.knownPods.Load(key); ok {
kpod := obj.(*knownPod)
kpod.Lock()
kpod.lastPodStatusReceivedFromProvider = pod
if !cmp.Equal(kpod.lastPodStatusReceivedFromProvider, pod) {
Contributor

If cmp.Equal, then we don't need to call q.AddRateLimited below.

Contributor Author

Yes, thanks

Contributor Author

Done, PTAL


@cwdsuzhou cwdsuzhou force-pushed the March/check_pod_equal branch 2 times, most recently from 1dc1e02 to 081fbbd on April 20, 2020 03:10
@cwdsuzhou cwdsuzhou requested a review from sargun April 20, 2020 03:11
Contributor

@sargun sargun left a comment

LGTM. Two quick questions.

kpod.lastPodStatusReceivedFromProvider = pod
kpod.Unlock()
q.AddRateLimited(key)
Contributor

Dumb question, do you know if this function blocks?

Contributor Author

I checked the implementation. This q is based on a chan with size 1000, so it would not block unless the queue is full.

@@ -213,8 +213,11 @@ func (pc *PodController) enqueuePodStatusUpdate(ctx context.Context, q workqueue
if obj, ok := pc.knownPods.Load(key); ok {
kpod := obj.(*knownPod)
kpod.Lock()
defer kpod.Unlock()
if cmp.Equal(kpod.lastPodStatusReceivedFromProvider, pod) {
Contributor

Do we need to compare the entire pod, or just the pod status?

Contributor Author

I think it may be safer to compare the entire pod, which would not change the original behavior.
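For reference, the two comparisons under discussion, as a sketch (the helper names are illustrative):

```go
package podcmp

import (
	"github.com/google/go-cmp/cmp"
	corev1 "k8s.io/api/core/v1"
)

// statusEqual is the narrower option the reviewer raises: compare
// only the status field.
func statusEqual(last, cur *corev1.Pod) bool {
	return cmp.Equal(last.Status, cur.Status)
}

// wholePodEqual is what the PR keeps: compare the entire object,
// which is more conservative and preserves the original behavior.
func wholePodEqual(last, cur *corev1.Pod) bool {
	return cmp.Equal(last, cur)
}
```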

@sargun
Contributor

sargun commented Apr 20, 2020

Thank you for working through this. I appreciate the contribution.

@cwdsuzhou
Contributor Author

cwdsuzhou commented Apr 20, 2020

Thank you for working through this. I appreciate the contribution.

Thanks, I would like to open another PR to add the check in UpdatePodStatus.

@cwdsuzhou
Contributor Author

@sargun @cpuguy83 could you help merge this PR? Thanks.

@sargun
Contributor

sargun commented Apr 21, 2020

I’ll wait a day in case @cpuguy83 has any issues. Is there a way to write a test for this by any chance?

node/pod.go Outdated
@@ -213,8 +213,11 @@ func (pc *PodController) enqueuePodStatusUpdate(ctx context.Context, q workqueue
if obj, ok := pc.knownPods.Load(key); ok {
kpod := obj.(*knownPod)
kpod.Lock()
defer kpod.Unlock()
Contributor

This defer could keep kpod locked if AddRateLimited blocks.
I would prefer we play this safer and just unlock immediately after cmp.Equal.

Contributor Author

Seems safer, done!
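Putting the two review points together, a sketch of the resulting shape (assuming the knownPod and queue types from the diffs above):

```go
package podqueue

import (
	"sync"

	"github.com/google/go-cmp/cmp"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

type knownPod struct {
	sync.Mutex
	lastPodStatusReceivedFromProvider *corev1.Pod
}

// enqueueIfChanged dedupes first, then unlocks before touching the
// queue so the lock is never held across a call that could block.
func enqueueIfChanged(kpod *knownPod, pod *corev1.Pod, q workqueue.RateLimitingInterface, key string) {
	kpod.Lock()
	if cmp.Equal(kpod.lastPodStatusReceivedFromProvider, pod) {
		// Nothing changed since the last notification from the
		// provider; skip q.AddRateLimited entirely.
		kpod.Unlock()
		return
	}
	kpod.lastPodStatusReceivedFromProvider = pod
	kpod.Unlock()
	q.AddRateLimited(key)
}
```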

@cwdsuzhou
Contributor Author

I’ll wait a day in case @cpuguy83 has any issues. Is there a way to write a test for this by any chance?

I went through the test code; it seems hard for now. I would prefer to open another PR to add tests for this.

Contributor

@cpuguy83 cpuguy83 left a comment

LGTM

@cpuguy83 cpuguy83 merged commit d9193e2 into virtual-kubelet:master Apr 21, 2020
@cwdsuzhou
Contributor Author

@sargun @cpuguy83 can you take a look at this PR?
#830
