Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Optimize indexer #996

Closed
wants to merge 1 commit into from

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Mar 29, 2023

Why are these changes needed?

This PR is used to improve #946. This PR description will discuss the reasons why I make the change and why I do not make the change. This may be more memory-efficient (I do not have a chance to benchmark it.)

Indexer

This PR is to optimize indexer (cache) based on kubebuilder's tutorial (tutorial, code). In #946, we create a local cache for all Pods, including Ray Pods and other Pods. Currently, we only index Ray Pods.

resource event handler (not update)

  • See here for more context.

Initially, I thought that we should implement all filter logic in the resource event handler (i.e. CreateFunc) and return true only when an event is issued by a Ray Pod. With this approach, we can make the function eventReconcile only handle event processing and not worry about event filtering. Therefore, the label common.RayNodeLabelKey is not necessary.

However, I have since realized that the owners of the unhealthy events are Pods, which may include Ray Pods and other Pods. Therefore, we cannot use GetControllerOf to get the owner and determine whether the event belongs to a Ray Pod or not by checking whether the owner's kind is "RayCluster" or not.

Hence, the implementation will be more complex. If we want to distinguish whether an event belongs to a Ray Pod or not, we need to

  • Get the UID of the event's InvolvedObject.
  • Use the UID to look up Pods from the local cache.
  • Use GetControllerOf(pod) to get the owner, and check whether the kind is RayCluster or not.

It will make the function eventReconcile easier to understand, but we need to have more I/O (read local cache) which may cause performance overhead. In summary, we should use common.RayNodeLabelKey to distinguish Ray Pods in eventReconcile until we understand the cost of the look-up operation.

Fake client limitation

The fake client included in controller-runtime does not support the use of client.MatchingFields with ListOptions. For more details, see kubernetes-sigs/controller-runtime#866. We need to figure out all limitations of the fake client in the future, and may be we should figure out a better solution.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Pass the test in #946.

@kevin85421 kevin85421 marked this pull request as ready for review March 29, 2023 05:13
@@ -46,15 +46,19 @@ var (
PrioritizeWorkersToDelete bool
ForcedClusterUpgrade bool
EnableBatchScheduler bool

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change PodUIDIndexField to podUIDIndexField

Copy link
Collaborator

@wilsonwang371 wilsonwang371 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update the variable and we can merge them after this.

@kevin85421
Copy link
Member Author

Thank you for the follow-up. I will update this PR after I finish the v0.5.0-rc.0 release process. In addition, this PR will not be included in v0.5.0 because I did not have a chance to benchmark the memory usage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants