Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayService] Skip update events without change #811

Merged

Conversation

sihanwang41
Copy link
Contributor

@sihanwang41 sihanwang41 commented Dec 8, 2022

Signed-off-by: Sihan Wang sihanwang41@gmail.com

Why are these changes needed?

Ignore K8s reconcile when there is no config/label/annotation change

Related issue number

#28652

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Local testing, all reconciles are under 2 seconds interval

@sihanwang41 sihanwang41 changed the title [RayService] Add event filter [RayService][Testing] Add event filter Dec 8, 2022
@kevin85421
Copy link
Member

kevin85421 commented Dec 8, 2022

It looks like we should add a RateLimiter rather than an event filter.

https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/controller#Options

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@sihanwang41 sihanwang41 changed the title [RayService][Testing] Add event filter [RayService][Testing] Ignore state change from reconcil Dec 8, 2022
@sihanwang41 sihanwang41 changed the title [RayService][Testing] Ignore state change from reconcil [RayService][Testing] Ignore state change from reconcile Dec 8, 2022
@sihanwang41 sihanwang41 changed the title [RayService][Testing] Ignore state change from reconcile [RayService][Testing] Skip update events without change Dec 8, 2022
@DmitriGekhtman
Copy link
Collaborator

DmitriGekhtman commented Dec 8, 2022

Skipping metadata and status changes seems fine as well -- but are you sure you want to skip metadata changes?
See

WithEventFilter(predicate.Or(predicate.GenerationChangedPredicate{}, predicate.LabelChangedPredicate{}, predicate.AnnotationChangedPredicate{})).
for what the RayCluster controller currently does, which is check for spec, labels, or annotations changes.

@DmitriGekhtman DmitriGekhtman added this to the v0.4.0 release milestone Dec 8, 2022
@DmitriGekhtman
Copy link
Collaborator

It looks like we should add a RateLimiter rather than an event filter.

https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/controller#Options

For later, I think we should consider this too.

@brucez-anyscale
Copy link
Contributor

Skipping metadata and status changes seems fine as well -- but are you sure you want to skip metadata changes? See

WithEventFilter(predicate.Or(predicate.GenerationChangedPredicate{}, predicate.LabelChangedPredicate{}, predicate.AnnotationChangedPredicate{})).

for what the RayCluster controller currently does, which is check for spec, labels, or annotations changes.

RayService is a self driven reconciler. It reconciles every 2 seconds.

@DmitriGekhtman
Copy link
Collaborator

RayService is a self driven reconciler. It reconciles every 2 seconds.

Right, reconciliation happens periodically by default anyways, but reconcilers should try to respond ASAP to relevant changes. I think we do use annotations in the KubeRay operator, so it makes sense to respond to changes to those. We might use labels in the future.

@DmitriGekhtman
Copy link
Collaborator

DmitriGekhtman commented Dec 8, 2022

Anyways, my complaint is a nit for the reason that @brucez-anyscale points out -- reconciliation of metadata will happen either way, though it potentially could be delayed by a couple of seconds.
Also we don't expect users to update metadata very often if at all.
I'd support keeping the event filter as is -- up to you.

@kevin85421
Copy link
Member

If RayService controller will reconcile every two seconds, this solution makes sense. As mentioned above, the downside is the long latency. One quick question, where do we specify the 2 seconds period (SyncPeriod?)? cc @brucez-anyscale

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@sihanwang41 sihanwang41 changed the title [RayService][Testing] Skip update events without change [RayService] Skip update events without change Dec 9, 2022
@brucez-anyscale
Copy link
Contributor

If RayService controller will reconcile every two seconds, this solution makes sense. As mentioned above, the downside is the long latency. One quick question, where do we specify the 2 seconds period (SyncPeriod?)? cc @brucez-anyscale

@sihanwang41 updates the pr. So there should not be long latency issue.

2 second is here
ray-operator/controllers/ray/rayservice_controller.go:39
ServiceDefaultRequeueDuration = 2 * time.Second

@kevin85421
Copy link
Member

@sihanwang41 updates the pr. So there should not be long latency issue.

It looks like the latency is still the same, but 2 seconds is okay for me. Resource events with the same generation are necessary for KubeRay because KubeRay is not an ideal operator (stateless & idempotent).

2 second is here ray-operator/controllers/ray/rayservice_controller.go:39 ServiceDefaultRequeueDuration = 2 * time.Second

Thank you for your reply!

@DmitriGekhtman
Copy link
Collaborator

DmitriGekhtman commented Dec 9, 2022

Resource events with the same generation are necessary for KubeRay because KubeRay is not an ideal operator (stateless & idempotent).

I agree that the KubeRay operator's reconciliation is not 100% correct. We identified some of the gaps in the GitHub issues.

Why does that imply that that the reconciling multiple times with the same resource version is necessary?

@DmitriGekhtman
Copy link
Collaborator

I think we can merge this PR, since it solves the immediate issue in a reasonable-enough way.
We can revisit triggers for reconciliation, rate limiters, etc. later.

@kevin85421
Copy link
Member

Resource events with the same generation are necessary for KubeRay because KubeRay is not an ideal operator (stateless & idempotent).

Why does that imply that reconciling multiple times with the same resource version is necessary?

In some cases, KubeRay needs various reconciliation iterations with same CR spec to reach the goal state.

Example1: Take #704 as an example, the extra Pod will be killed in the next reconciliation. However, there is no generation update between these two reconciliations.

Example2: Take RayJob as an example, if the first iteration creates RayCluster successfully but fails in a job launch, the second reconciliation needs to use HTTP client to create the job again.

We can think about this question from a high-level perspective. Replicated update resource events can be considered the same event only when the reconciliation logic is stateless. In other words, the operator will help Kubernetes reach the same goal state no matter the previous cluster state.

For example, we cannot discard (2) and (3) although they are the same function. On the other hand, we can ignore (5) and (6) because they are idempotent. We can imagine that KubeRay is a big non-idempotent function.

event 1. x += 1
event 2. x += 1
event 3. x += 1

--- 
event 4. x = 1
event 5. x = 1
event 6. x = 1

@DmitriGekhtman
Copy link
Collaborator

Example1: Take #704 as an example, the extra Pod will be killed in the next reconciliation. However, there is no generation update between these two reconciliations.

The reconciler will still respond to pod creation and deletion -- the event filter old affects response to RayCluster resource changes.

@DmitriGekhtman
Copy link
Collaborator

DmitriGekhtman commented Dec 9, 2022

For the pod reconciliation issue, I think it's better if the operator reconciles less frequently since the problem is a stale cache. Reconciling less frequently would minimize the chances of read-after-write inconsistency.

Of course, the ideal solution is to actually fix the pod reconciliation logic.

@DmitriGekhtman DmitriGekhtman merged commit 0df4d8a into ray-project:master Dec 9, 2022
DmitriGekhtman pushed a commit to DmitriGekhtman/kuberay that referenced this pull request Dec 9, 2022
Skip RayService reconcile when there is no config/label/annotation change

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
DmitriGekhtman pushed a commit to DmitriGekhtman/kuberay that referenced this pull request Dec 9, 2022
Skip RayService reconcile when there is no config/label/annotation change

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
DmitriGekhtman added a commit that referenced this pull request Dec 9, 2022
Skip RayService reconcile when there is no config/label/annotation change

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Skip RayService reconcile when there is no config/label/annotation change

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants