Add volcano taskSpec annotations to pod #1754

Merged Dec 20, 2023 (2 commits)


@Tongruizhe Tongruizhe commented Dec 15, 2023

Why are these changes needed?

Related issue number

Fixes #1752

Checks

  • [√] I've made sure the tests are passing.
  • Testing Strategy
    • [√] Unit tests
    • [√] Manual tests
    • [ ] This PR is not tested :(

Tongruizhe (Contributor Author) commented

cc @kevin85421

@kevin85421 kevin85421 self-requested a review December 15, 2023 17:05
@kevin85421 kevin85421 self-assigned this Dec 15, 2023
Member

@kevin85421 kevin85421 left a comment

Thank you for the detailed explanation in #1752! It makes sense to me. I will try to reproduce it manually before I merge this PR.

@kevin85421
Member

kevin85421 commented Dec 20, 2023

I can reproduce the issue:

# Step 1: Create a Kind cluster with one control-plane node and two workers.
cat > kind-config.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
EOF
kind create cluster --image=kindest/node:v1.26.0 --config kind-config.yaml

# Step 2: Install Volcano
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace

# Step 3: Add labels to the Kubernetes nodes
kubectl label nodes kind-control-plane type=kind-control-plane
kubectl label nodes kind-worker type=kind-worker
kubectl label nodes kind-worker2 type=kind-worker2

# Step 4: Install KubeRay operator (path: helm-chart/kuberay-operator)
helm install kuberay-operator . --set batchScheduler.enabled=true

# Step 5: Create a RayCluster with this gist https://gist.github.com/kevin85421/8904263d1861c0d8140c516244bb4382.
# The RayCluster has a head Pod without any NodeSelector, a worker group "small-group" with a NodeSelector that
# binds to "kind-worker", and a worker group "small-group-2" with a NodeSelector that binds to "kind-worker2".
# However, it will always be pending as shown in the screenshot.
[Screenshot: Screen Shot 2023-12-19 at 9 47 46 PM]

See here for Volcano logs.

  • test-cluster-0-worker-small-group-ntgf5 successfully binds to node kind-worker.
  • test-cluster-0-head-hg6nc doesn't have any affinity information, so it binds directly to node kind-worker.
  • test-cluster-0-worker-small-group-2-4ccnh is only checked against kind-worker, which fails. Consequently, Volcano discards the operations without ever checking kind-worker2.
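
To cross-check the observations above, one could list each Pod together with its Volcano taskSpec annotation and the node it was bound to. This is only a sketch: the helper name `list_task_specs` is hypothetical, and the annotation key `volcano.sh/task-spec` is an assumption, since this thread does not spell out the key that the PR sets.

```shell
# Hypothetical helper: print "<pod name> <task-spec annotation> <node>" per Pod.
# Assumption: the annotation key is "volcano.sh/task-spec"; adjust it if the
# operator uses a different key.
list_task_specs() {
  kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.volcano\.sh/task-spec}{"\t"}{.spec.nodeName}{"\n"}{end}'
}
```

Running `list_task_specs` against the cluster from the steps above should show an empty node column for any Pod that is still Pending, and an empty annotation column for Pods created without the annotation.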

Member

@kevin85421 kevin85421 left a comment

I tested this PR manually, and it worked as expected. LGTM. Thank you for the contribution!

@kevin85421 kevin85421 merged commit d950d59 into ray-project:master Dec 20, 2023
25 checks passed
Linked issue (merging this pull request may close it):

[Bug] Volcano batch scheduler marks pods unschedulable when the ray cluster has multiple workerGroups