Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: .status.availableWorkerReplicas #887

Merged
merged 2 commits into from
Feb 5, 2023
Merged

Conversation

davidxia
Copy link
Contributor

@davidxia davidxia commented Jan 28, 2023

to not count head Pods. closes #885

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

func CalculateAvailableReplicas(pods corev1.PodList) int32 {
count := int32(0)
for _, pod := range pods.Items {
if val, ok := pod.Annotations["ray.io/node-type"]; !ok || val != string(rayiov1alpha1.WorkerNode) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to use ray.io/node-type because importing common.RayNodeTypeLabelKey caused import cycle.

-: import cycle not allowed: import stack: [
  github.com/ray-project/kuberay/ray-operator
  github.com/ray-project/kuberay/ray-operator/controllers/ray
  github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler
  github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler/volcano
  github.com/ray-project/kuberay/ray-operator/controllers/ray/common
  github.com/ray-project/kuberay/ray-operator/controllers/ray/utils
  github.com/ray-project/kuberay/ray-operator/controllers/ray/common
]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Maybe we need to separate constant.go into a separate module in the future. cc @DmitriGekhtman @Jeffwan @architkulkarni

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we should track solving that.
It would be better if we could eliminate common and utils by refactoring the package structure -- https://dave.cheney.net/2019/01/08/avoid-package-names-like-base-util-or-common

@davidxia davidxia marked this pull request as ready for review January 28, 2023 17:45
@davidxia
Copy link
Contributor Author

@kevin85421

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution! Would you mind adding a screenshot of kubectl get raycluster?

@@ -312,3 +312,34 @@ func TestReconcile_CheckNeedRemoveOldPod(t *testing.T) {

assert.Equal(t, PodNotMatchingTemplate(pod, workerTemplate), true, "expect template & pod not matching")
}

func TestCalculateAvailableReplicas(t *testing.T) {
podList := corev1.PodList{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind adding comments for the definition of available replicas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment to controllers/ray/utils/util.go. Lmk if this it's good.

A worker is available if its Pod is running or pending.

Is it correct to count pending Pods or only running?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, the straightforward definition of "available replicas" is the number of worker Pods that are ready to serve actors and tasks. Also, we can refer to the definition of AvailableReplicas in ReplicaSet.

[1] Kubernetes source code: How to calculate AvailableReplicas?
[2] Understanding the Available condition of a Kubernetes deployment: I did not have a chance to read this article, but this may be helpful.

Do you have any thoughts about the definition of "available"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

straightforward definition of "available replicas" is the number of worker Pods that are ready to serve actors and tasks

This makes sense. We shouldn't count pending Pods here then, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. We shouldn't count pending Pods here then, right?

Yes. If referring to the definition of AvailableReplicas in ReplicaSet makes sense to you, we shouldn't count pending Pods.

Copy link
Contributor Author

@davidxia davidxia Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Updated. Ready for another review

ray-operator/controllers/ray/utils/util.go Outdated Show resolved Hide resolved
@kevin85421 kevin85421 assigned kevin85421 and unassigned kevin85421 Jan 29, 2023
@davidxia
Copy link
Contributor Author

tested manually

kubectl get rayclusters
NAME        DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
dxia-test   3                 3                   ready    6d17h

kubectl get rayclusters dxia-test -o yaml

...
status:
  availableWorkerReplicas: 3
  desiredWorkerReplicas: 3
...

@davidxia davidxia force-pushed the patch6 branch 2 times, most recently from b5eb1cd to ecb7901 Compare January 30, 2023 00:54
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Test this PR manually.

Screen Shot 2023-02-04 at 8 55 59 PM

@kevin85421 kevin85421 merged commit abd95d7 into ray-project:master Feb 5, 2023
@kevin85421
Copy link
Member

CI failure is due to #852.

Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/kuberay that referenced this pull request Feb 9, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Define availableWorkerReplicas.
@davidxia davidxia deleted the patch6 branch October 31, 2023 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] .status.availableWorkerReplicas incorrectly counts head Pods
3 participants