Check pods status deep equal before update #824
Conversation
node/pod.go (Outdated)
```
@@ -185,6 +186,9 @@ func (pc *PodController) updatePodStatus(ctx context.Context, podFromKubernetes
 	kPod.Lock()
 	podFromProvider := kPod.lastPodStatusReceivedFromProvider.DeepCopy()
 	kPod.Unlock()
+	if reflect.DeepEqual(podFromKubernetes.Status, podFromProvider.Status) {
```
Any reason not to use cmp here?
Thanks for the review. Good idea, updated.
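For reference, a minimal standalone sketch of the go-cmp comparison adopted here (illustrative only, not the virtual-kubelet code; the example statuses are made up):

```go
package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	v1 "k8s.io/api/core/v1"
)

func main() {
	// Two pod statuses that differ only in phase.
	a := v1.PodStatus{Phase: v1.PodRunning, Message: "ok"}
	b := v1.PodStatus{Phase: v1.PodPending, Message: "ok"}

	// cmp.Equal behaves like a configurable reflect.DeepEqual.
	fmt.Println("equal:", cmp.Equal(a, b)) // false

	// Unlike reflect.DeepEqual, cmp can also report *what* differs,
	// which helps when debugging spurious status updates.
	fmt.Println(cmp.Diff(a, b))
}
```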
Force-pushed 5140148 to dbc9b7a
LGTM
node/pod.go (Outdated)
```
@@ -185,6 +185,9 @@ func (pc *PodController) updatePodStatus(ctx context.Context, podFromKubernetes
 	kPod.Lock()
 	podFromProvider := kPod.lastPodStatusReceivedFromProvider.DeepCopy()
 	kPod.Unlock()
+	if cmp.Equal(podFromKubernetes.Status, podFromProvider.Status) {
```
I think we keep track of the last pod we received from the provider. Can we do the deduping there? That is, we look at the last pod we received from the provider, and if the new one is the same, we dedupe? Otherwise, this mechanism can lead to "false updates" and "missed updates" due to lag.

Also, I think this behaviour should be behind a config option, because right now, firing away an update is a way to force a pod status update from VK -- which a provider may want to do if something interferes with the write to pod status.
IMO, lastPodStatusReceivedFromProvider does not seem like a good design. As you mentioned, it may cause false updates and missed updates. Why don't we change it to a map instead of just a variable?
https://github.com/virtual-kubelet/virtual-kubelet/blob/master/node/pod.go#L180

We already check there, so it would not directly lead to "false updates" or "missed updates". This commit just adds one more check before pushing the status to the API server.

@cpuguy83 wdyt?
I think that without some level of checking here, you can get into a situation where you miss updates.

For example, suppose a pod starts, then a container fails and is restarted. You could fail to send the successful start because you haven't yet received the update from the API server indicating it failed.
If a container restarts, the pod's running time changes, and other fields in status.conditions change as well. Moreover, if the pod's status has not changed, why would we need to push it to the API server? In the start - fail - restart case you describe, sending the update or not would not change the result.
Do you want to add it to NotifyPods? And then we can hash this one out later?
Sure, done
Do you want to do the pod update dedupe in a different PR, or drop this dedupe behaviour in this PR to get it through?
In regards to:

> IMO, lastPodStatusReceivedFromProvider does not seem like a good design. As you mentioned, it may cause false updates and missed updates. Why don't we change it to a map instead of just a variable?

I think that there are two reasons behind this design (and I will not be the one to defend it as perfect):
- We want the ability to perform critical code operations on the k8s pod and the provider pod under the same lock.
- From a performance perspective, having to hold a global lock while doing these (relatively) simple operations gets slow quickly.
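A rough sketch of that design, assuming a simplified knownPod (the field name follows the diffs above; the real struct has more fields):

```go
package pod

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// knownPod carries its own mutex, so the Kubernetes view and the
// provider view of a single pod can be read and written under one
// per-pod lock, without serializing every pod behind a global lock.
type knownPod struct {
	sync.Mutex
	lastPodStatusReceivedFromProvider *v1.Pod
}
```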
> Do you want to do the pod update dedupe in a different PR, or drop this dedupe behaviour in this PR to get it through?

Thanks, I will move the pod update dedupe to a different PR.
> I think that there are two reasons behind this design (and I will not be the one to defend it as perfect): we want the ability to perform critical code operations on the k8s pod and the provider pod under the same lock, and holding a global lock while doing these (relatively) simple operations gets slow quickly.

Do we have a plan to change this logic, e.g. to a map[pod.UID]pod, to make it possible to sync pod statuses in parallel?
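A hypothetical sketch of what that suggestion could look like (not something this PR implements; all names here are made up for illustration):

```go
package pod

import (
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// providerStatuses keys the last status received from the provider by
// pod UID, so independent pods could be synced by separate workers.
type providerStatuses struct {
	mu   sync.RWMutex
	pods map[types.UID]*v1.Pod
}

func newProviderStatuses() *providerStatuses {
	return &providerStatuses{pods: make(map[types.UID]*v1.Pod)}
}

func (s *providerStatuses) set(p *v1.Pod) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pods[p.UID] = p
}

func (s *providerStatuses) get(uid types.UID) (*v1.Pod, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	p, ok := s.pods[uid]
	return p, ok
}
```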
Force-pushed 478d86e to f5f9337
node/pod.go (Outdated)
```
@@ -213,7 +216,9 @@ func (pc *PodController) enqueuePodStatusUpdate(ctx context.Context, q workqueue
 	if obj, ok := pc.knownPods.Load(key); ok {
 		kpod := obj.(*knownPod)
 		kpod.Lock()
 		kpod.lastPodStatusReceivedFromProvider = pod
+		if !cmp.Equal(kpod.lastPodStatusReceivedFromProvider, pod) {
```
If cmp.Equal returns true, then we don't need to call q.AddRateLimited below.
Yes, thanks
Done, PTAL
Force-pushed 1dc1e02 to 081fbbd
LGTM. Two quick questions.
```
 	kpod.lastPodStatusReceivedFromProvider = pod
 	kpod.Unlock()
 	q.AddRateLimited(key)
```
Dumb question, do you know if this function blocks?
I checked the implementation. This q is backed by a chan with size 1000, so it does not block unless the channel is full.
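A quick standalone way to sanity-check that claim (assuming client-go's standard rate-limiting workqueue; the item keys are made up):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()

	// Enqueue far more items than the internal buffer; each call
	// should return immediately rather than block the caller.
	for i := 0; i < 2000; i++ {
		q.AddRateLimited(fmt.Sprintf("pod-%d", i))
	}
	fmt.Println("enqueued 2000 keys without blocking; current length:", q.Len())
}
```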
```
@@ -213,8 +213,11 @@ func (pc *PodController) enqueuePodStatusUpdate(ctx context.Context, q workqueue
 	if obj, ok := pc.knownPods.Load(key); ok {
 		kpod := obj.(*knownPod)
 		kpod.Lock()
+		defer kpod.Unlock()
+		if cmp.Equal(kpod.lastPodStatusReceivedFromProvider, pod) {
```
Do we need to compare the entire pod, or just the pod status?
I think it may be safer to compare the entire pod, which would not change the original behavior.
Thank you for working through this. I appreciate the contribution.

Thanks, I would like to open another PR to add the check.

I'll wait a day in case @cpuguy83 has any issues. Is there a way to write a test for this, by any chance?
node/pod.go (Outdated)
```
@@ -213,8 +213,11 @@ func (pc *PodController) enqueuePodStatusUpdate(ctx context.Context, q workqueue
 	if obj, ok := pc.knownPods.Load(key); ok {
 		kpod := obj.(*knownPod)
 		kpod.Lock()
+		defer kpod.Unlock()
```
This defer could keep kpod locked if AddRateLimited blocks. I would prefer that we play this safer and just unlock immediately after cmp.Equal.
Seems safer, done!
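For clarity, a sketch of the resulting pattern as a hypothetical standalone helper (the real code lives in enqueuePodStatusUpdate; the knownPod here is the simplified version sketched earlier):

```go
package pod

import (
	"sync"

	"github.com/google/go-cmp/cmp"
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

type knownPod struct {
	sync.Mutex
	lastPodStatusReceivedFromProvider *v1.Pod
}

// enqueueIfChanged compares and records the new status under the per-pod
// lock, then releases the lock *before* touching the workqueue, so even a
// blocked AddRateLimited could never keep kpod locked.
func enqueueIfChanged(kpod *knownPod, q workqueue.RateLimitingInterface, key string, pod *v1.Pod) {
	kpod.Lock()
	if cmp.Equal(kpod.lastPodStatusReceivedFromProvider, pod) {
		kpod.Unlock()
		return // status unchanged: skip the update entirely
	}
	kpod.lastPodStatusReceivedFromProvider = pod
	kpod.Unlock() // explicit unlock instead of defer
	q.AddRateLimited(key)
}
```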
Force-pushed 081fbbd to 30e31c0
I went through the test code; it seems hard for now. I prefer to open another PR to add tests for this.
LGTM
Check pod status for deep equality before updating. If we do not check, we update pod status too frequently and quickly reach the client QPS limit.
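To illustrate the QPS concern (a sketch using client-go's documented defaults, which apply when a rest.Config leaves QPS and Burst unset; actual limits depend on how the client is configured):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
)

func main() {
	// Each client-go rest.Config carries a client-side rate limit.
	// Leaving QPS/Burst at zero means the documented defaults (5 qps,
	// burst 10) apply, so a stream of redundant status updates can
	// exhaust the budget and delay updates that actually matter.
	cfg := rest.Config{QPS: 5, Burst: 10}
	fmt.Printf("client budget: %.0f requests/sec (burst %d)\n", cfg.QPS, cfg.Burst)
}
```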