[Bug][RayCluster] Fix RAY_REDIS_ADDRESS parsing with redis scheme and… #1556

Merged 1 commit on Oct 30, 2023

Conversation

@rueian (Contributor) commented on Oct 22, 2023

Why are these changes needed?

The current parsing of RAY_REDIS_ADDRESS in the Redis cleanup job is incorrect and may result in the error reported in this Slack thread: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1697556705340069

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: too many values to unpack (expected 2)

This is because the current parsing:

host, port = os.getenv('RAY_REDIS_ADDRESS').rsplit(':')

may result in more than two values in the following cases (see the reproduction after this list):

  1. RAY_REDIS_ADDRESS contains multiple addresses.
  2. RAY_REDIS_ADDRESS contains IPv6 addresses.
  3. RAY_REDIS_ADDRESS contains a username and a password.
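For instance, here is a minimal reproduction of case 1; the addresses are hypothetical:

import os

# Hypothetical value: RAY_REDIS_ADDRESS with two comma-separated addresses.
os.environ['RAY_REDIS_ADDRESS'] = 'redis-0:6379,redis-1:6379'

# rsplit(':') splits on every colon, so this yields three parts, not two:
host, port = os.getenv('RAY_REDIS_ADDRESS').rsplit(':')
# ValueError: too many values to unpack (expected 2)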

Related issues

#1557

A Potential Solution

We could use the private method get_address from upstream Ray, which actually parses a Redis address. This would keep us consistent with upstream. However, I don't think importing get_address is suitable right now, for the following reasons:

  1. The name get_address is too vague. I believe it will be changed to something like get_redis_address soon.
  2. get_address doesn't handle multiple addresses, which are allowed in RAY_REDIS_ADDRESS according to here.
  3. The get_address implementation doesn't seem quite right to me. It doesn't consider most properties allowed in the Redis URI spec, such as the password and database number. A valid Redis URI looks like redis://user:secret@localhost:6379/0, and it will not be parsed correctly by get_address (see the urlparse illustration after this list).
  4. get_address was introduced only 2 months ago, so I am afraid it doesn't exist in most Ray releases.
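For reference, the standard library's urlparse already decomposes such a URI into all of these properties. A quick illustration (not code from this PR):

from urllib.parse import urlparse

u = urlparse('redis://user:secret@localhost:6379/0')
u.scheme    # 'redis'
u.username  # 'user'
u.password  # 'secret'
u.hostname  # 'localhost'
u.port      # 6379
u.path      # '/0'  (the database number)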

Proposed Solution

I propose using the standard urlparse to handle the parsing. It lets us extract the scheme, hostname, port, and even the username and password correctly. I also recommend contributing this method back to upstream Ray.

In this PR, I did the following (see the sketch after the list):

  1. To handle multiple addresses, which are allowed according to here, I use .split(',') and pick only the first part.
  2. If the address doesn't have a scheme, I prepend a redis:// prefix, which urlparse needs in order to recognize the host and port.
  3. I use urlparse to extract the scheme, hostname, port, and password from the address and ignore other properties that aren't supported by cleanup_redis_storage. Note that I also enable the use_ssl flag if the scheme is rediss.
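Putting the three steps together, here is a rough sketch of the parsing. It is illustrative only, and parse_redis_address is not the actual function name in this PR:

from urllib.parse import urlparse

def parse_redis_address(value):
    # 1. RAY_REDIS_ADDRESS may contain multiple comma-separated addresses;
    #    pick the first one.
    address = value.split(',')[0]
    # 2. urlparse needs a scheme to recognize the host and port, so default
    #    to redis:// when none is given.
    if '://' not in address:
        address = 'redis://' + address
    # 3. Extract only the fields that cleanup_redis_storage supports, and
    #    enable use_ssl when the scheme is rediss.
    parsed = urlparse(address)
    return parsed.hostname, parsed.port, parsed.password, parsed.scheme == 'rediss'

# Example (hypothetical address):
# parse_redis_address('rediss://user:secret@my-redis:6380/0,backup-redis:6379')
# -> ('my-redis', 6380, 'secret', True)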

Testing

To test this change, I added a new test, duplicated from the original ray-operator/config/samples/ray-cluster.external-redis.yaml, with a redis:// scheme added to its RAY_REDIS_ADDRESS.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@rueian force-pushed the fix-redis-uri-in-cleanup-job branch from 7f1949d to a36a45c on October 28, 2023 05:24
@rueian (Contributor, Author) commented on Oct 28, 2023

Hi @kevin85421,

Here is the result of running python3 tests/test_sample_raycluster_yamls.py. All 14 tests, including the new ray-cluster.external-redis-uri.yaml, passed.

▶ RAY_IMAGE=rayproject/ray:2.7.0 OPERATOR_IMAGE=kuberay/operator:nightly python3 tests/test_sample_raycluster_yamls.py
2023-10-28:12:55:01,755 INFO     [test_sample_raycluster_yamls.py:62] {'ray-image': 'rayproject/ray:2.7.0', 'kuberay-operator-image': 'kuberay/operator:nightly'}
2023-10-28:12:55:01,755 INFO     [test_sample_raycluster_yamls.py:64] Build a test plan ...
2023-10-28:12:55:01,756 INFO     [test_sample_raycluster_yamls.py:68] [SKIP TEST 0] ray-cluster.autoscaler.large.yaml: Skip this test because it requires a lot of resources.
2023-10-28:12:55:01,756 INFO     [test_sample_raycluster_yamls.py:68] [SKIP TEST 1] ray-cluster.gke-bucket.yaml: Skip this test because it requires GKE and k8s service accounts.
2023-10-28:12:55:01,756 INFO     [test_sample_raycluster_yamls.py:70] [TEST 2]: ray-cluster.py-spy.yaml
2023-10-28:12:55:01,771 INFO     [test_sample_raycluster_yamls.py:70] [TEST 3]: ray-cluster.tls.yaml
2023-10-28:12:55:01,779 INFO     [test_sample_raycluster_yamls.py:68] [SKIP TEST 4] ray-cluster-tpu.yaml: Skip this test because it requires TPU resources.
2023-10-28:12:55:01,780 INFO     [test_sample_raycluster_yamls.py:68] [SKIP TEST 5] ray-cluster.complete.large.yaml: Skip this test because it requires a lot of resources.
2023-10-28:12:55:01,780 INFO     [test_sample_raycluster_yamls.py:70] [TEST 6]: ray-cluster.heterogeneous.yaml
2023-10-28:12:55:01,788 INFO     [test_sample_raycluster_yamls.py:70] [TEST 7]: ray-cluster.mini.yaml
2023-10-28:12:55:01,796 INFO     [test_sample_raycluster_yamls.py:70] [TEST 8]: ray-cluster.autoscaler.yaml
2023-10-28:12:55:01,804 INFO     [test_sample_raycluster_yamls.py:70] [TEST 9]: ray-cluster.external-redis-uri.yaml
2023-10-28:12:55:01,811 INFO     [test_sample_raycluster_yamls.py:70] [TEST 10]: ray-cluster.embed-grafana.yaml
2023-10-28:12:55:01,819 INFO     [test_sample_raycluster_yamls.py:70] [TEST 11]: ray-cluster.external-redis.yaml
2023-10-28:12:55:01,827 INFO     [test_sample_raycluster_yamls.py:70] [TEST 12]: ray-cluster.head-command.yaml
2023-10-28:12:55:01,835 INFO     [test_sample_raycluster_yamls.py:70] [TEST 13]: ray-cluster.volcano-scheduler-queue.yaml
2023-10-28:12:55:01,843 INFO     [test_sample_raycluster_yamls.py:70] [TEST 14]: ray-cluster.custom-head-service.yaml
2023-10-28:12:55:01,852 INFO     [test_sample_raycluster_yamls.py:70] [TEST 15]: ray-cluster.separate-ingress.yaml
2023-10-28:12:55:01,860 INFO     [test_sample_raycluster_yamls.py:70] [TEST 16]: ray-cluster.volcano-scheduler.yaml
2023-10-28:12:55:01,869 INFO     [test_sample_raycluster_yamls.py:70] [TEST 17]: ray-cluster.complete.yaml
2023-10-28:12:55:01,877 INFO     [utils.py:320] Execute command: kind delete cluster
Deleting cluster "kind" ...
Deleted nodes: ["kind-control-plane"]
2023-10-28:12:55:07,58 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
error: context "kind-kind" does not exist
2023-10-28:12:55:07,452 INFO     [utils.py:320] Execute command: kind create cluster --wait 900s --config /Users/ruian/Code/go/kuberay/tests/framework/config/kind-config.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Waiting ≤ 15m0s for control-plane = Ready ⏳
 • Ready after 16s 💚
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Thanks for using kind! 😊
2023-10-28:12:55:53,836 INFO     [utils.py:259] Download Docker images: {'ray-image': 'rayproject/ray:2.7.0', 'kuberay-operator-image': 'kuberay/operator:nightly'}
2023-10-28:12:55:53,836 INFO     [utils.py:320] Execute command: docker image inspect rayproject/ray:2.7.0 > /dev/null
2023-10-28:12:55:53,925 INFO     [utils.py:272] Image rayproject/ray:2.7.0 exists
2023-10-28:12:55:53,926 INFO     [utils.py:320] Execute command: docker image inspect kuberay/operator:nightly > /dev/null
2023-10-28:12:55:54,14 INFO     [utils.py:272] Image kuberay/operator:nightly exists
2023-10-28:12:55:54,15 INFO     [utils.py:275] Load images into KinD cluster
2023-10-28:12:55:54,15 INFO     [utils.py:320] Execute command: kind load docker-image rayproject/ray:2.7.0
Image: "rayproject/ray:2.7.0" with ID "sha256:c9419e64c139a08bc5afe5beb982addc9dcceaa92124ec4b58c1776911ebadb8" not yet present on node "kind-control-plane", loading...
2023-10-28:12:58:14,49 INFO     [utils.py:320] Execute command: kind load docker-image kuberay/operator:nightly
Image: "kuberay/operator:nightly" with ID "sha256:0059dd9fa648ecd04f9761a2d0c1ceae66f209caee75a9478354ebc9bdc36c4b" not yet present on node "kind-control-plane", loading...
2023-10-28:12:58:17,600 INFO     [utils.py:300] Install both CRD and KubeRay operator by kuberay-operator chart
2023-10-28:12:58:17,600 INFO     [utils.py:320] Execute command: helm install -n default -f /var/folders/83/kqwzb_5j7t729zvyjhbfhzjr0000gn/T/tmp3rsmdi3f_values.yaml kuberay-operator /Users/ruian/Code/go/kuberay/helm-chart/kuberay-operator/ --set image.repository=kuberay/operator,image.tag=nightly
NAME: kuberay-operator
LAST DEPLOYED: Sat Oct 28 12:58:23 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
2023-10-28:12:58:25,35 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.py-spy.yaml
raycluster.ray.io/raycluster-py-spy created
configmap/ray-example created
2023-10-28:12:59:56,619 INFO     [prototype.py:379] --- RayClusterAddCREvent 91.10193181037903 seconds ---
2023-10-28:12:59:56,640 INFO     [utils.py:320] Execute command: kubectl exec raycluster-py-spy-head-jjmfh -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 21:59:57,414	INFO worker.py:1335 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 21:59:57,414	INFO worker.py:1452 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
[2023-10-27 21:59:59,425 W 40 40] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-10-27 22:00:00,427 W 40 40] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-10-27 22:00:01,428 W 40 40] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2023-10-27 22:00:02,551	INFO worker.py:1636 -- Connected to Ray cluster.
{'memory': 2147483648.0, 'object_store_memory': 558419558.0, 'CPU': 1.0, 'node:10.244.0.6': 1.0}
2023-10-28:13:00:35,688 INFO     [prototype.py:409] --- Cleanup RayCluster 30.74703598022461 seconds ---
2023-10-28:13:00:35,688 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.py-spy.yaml --ignore-not-found=true
configmap "ray-example" deleted
.2023-10-28:13:00:35,849 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:00:35,993 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.tls.yaml
raycluster.ray.io/raycluster-tls created
secret/ca-tls created
configmap/tls created
2023-10-28:13:00:48,728 INFO     [prototype.py:379] --- RayClusterAddCREvent 12.374605655670166 seconds ---
2023-10-28:13:00:48,752 INFO     [utils.py:320] Execute command: kubectl exec raycluster-tls-head-dxqrf -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:00:49,439	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:00:49,439	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.7:6379...
2023-10-27 22:00:49,466	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.7:8265
{'node:10.244.0.7': 1.0, 'CPU': 1.0, 'object_store_memory': 491856691.0, 'memory': 2000000000.0, 'node:__internal_head__': 1.0}
2023-10-28:13:01:26,17 INFO     [prototype.py:409] --- Cleanup RayCluster 35.22021770477295 seconds ---
2023-10-28:13:01:26,18 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.tls.yaml --ignore-not-found=true
secret "ca-tls" deleted
configmap "tls" deleted
.2023-10-28:13:01:26,191 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:01:26,346 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.heterogeneous.yaml
configmap/ray-code created
raycluster.ray.io/raycluster-heterogeneous created
2023-10-28:13:01:39,252 INFO     [prototype.py:379] --- RayClusterAddCREvent 12.588768005371094 seconds ---
2023-10-28:13:01:39,278 INFO     [utils.py:320] Execute command: kubectl exec raycluster-heterogeneous-head-n6tbp -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:01:40,274	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:01:40,274	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.9:6379...
2023-10-27 22:01:40,307	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.9:8265
{'node:10.244.0.13': 1.0, 'CPU': 6.0, 'object_store_memory': 12330059365.0, 'memory': 27959726082.0, 'node:10.244.0.12': 1.0, 'node:10.244.0.9': 1.0, 'node:__internal_head__': 1.0, 'node:10.244.0.11': 1.0, 'node:10.244.0.10': 1.0}
2023-10-28:13:02:13,526 INFO     [prototype.py:409] --- Cleanup RayCluster 31.43476891517639 seconds ---
2023-10-28:13:02:13,526 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.heterogeneous.yaml --ignore-not-found=true
configmap "ray-code" deleted
.2023-10-28:13:02:13,682 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:02:13,824 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.mini.yaml
raycluster.ray.io/raycluster-mini created
2023-10-28:13:02:15,178 INFO     [prototype.py:379] --- RayClusterAddCREvent 1.032729148864746 seconds ---
2023-10-28:13:02:15,198 INFO     [utils.py:320] Execute command: kubectl exec raycluster-mini-head-msccf -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:02:16,337	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:02:16,338	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.14:6379...
[2023-10-27 22:02:19,409 W 13 13] global_state_accessor.cc:401: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-10-27 22:02:20,410 W 13 13] global_state_accessor.cc:401: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2023-10-27 22:02:21,417	INFO worker.py:1642 -- Connected to Ray cluster.
{'node:__internal_head__': 1.0, 'object_store_memory': 559873228.0, 'memory': 2147483648.0, 'node:10.244.0.14': 1.0, 'CPU': 1.0}
2023-10-28:13:02:55,583 INFO     [prototype.py:409] --- Cleanup RayCluster 31.684444904327393 seconds ---
2023-10-28:13:02:55,584 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.mini.yaml --ignore-not-found=true
.2023-10-28:13:02:55,724 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:02:55,860 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.autoscaler.yaml
raycluster.ray.io/raycluster-autoscaler created
configmap/ray-example created
2023-10-28:13:02:58,233 INFO     [prototype.py:379] --- RayClusterAddCREvent 2.045454978942871 seconds ---
2023-10-28:13:02:58,258 INFO     [utils.py:320] Execute command: kubectl exec raycluster-autoscaler-head-m942r -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
Defaulting container name to ray-head.
Use 'kubectl describe pod/raycluster-autoscaler-head-m942r -n default' to see all of the containers in this pod.
2023-10-27 22:02:58,929	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:02:58,929	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.15:6379...
[2023-10-27 22:03:00,942 W 19 19] global_state_accessor.cc:401: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2023-10-27 22:03:02,010	INFO worker.py:1642 -- Connected to Ray cluster.
{'node:10.244.0.15': 1.0, 'object_store_memory': 528251289.0, 'memory': 2000000000.0, 'node:__internal_head__': 1.0}
2023-10-28:13:03:38,850 INFO     [prototype.py:409] --- Cleanup RayCluster 35.88992714881897 seconds ---
2023-10-28:13:03:38,850 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.autoscaler.yaml --ignore-not-found=true
configmap "ray-example" deleted
.2023-10-28:13:03:39,1 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:03:39,137 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.external-redis-uri.yaml
configmap/redis-config created
service/redis created
deployment.apps/redis created
secret/redis-password-secret created
raycluster.ray.io/raycluster-external-redis-uri created
configmap/ray-example-uri created
2023-10-28:13:04:15,320 INFO     [prototype.py:379] --- RayClusterAddCREvent 35.82306885719299 seconds ---
2023-10-28:13:04:15,342 INFO     [utils.py:320] Execute command: kubectl exec raycluster-external-redis-uri-head-vx9vf -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:04:15,937	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:04:15,937	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.17:6379...
2023-10-27 22:04:15,951	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.17:8265
{'node:10.244.0.18': 1.0, 'CPU': 1.0, 'object_store_memory': 4904135884.0, 'memory': 10633172584.0, 'node:__internal_head__': 1.0, 'node:10.244.0.17': 1.0}
2023-10-28:13:04:53,819 INFO     [prototype.py:409] --- Cleanup RayCluster 37.06717300415039 seconds ---
2023-10-28:13:04:53,819 INFO     [utils.py:320] Execute command: test $(kubectl exec deploy/redis -- redis-cli --no-auth-warning -a $(kubectl get secret redis-password-secret -o jsonpath='{.data.password}' | base64 --decode) DBSIZE) = '0'
2023-10-28:13:04:54,186 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.external-redis-uri.yaml --ignore-not-found=true
configmap "redis-config" deleted
service "redis" deleted
deployment.apps "redis" deleted
secret "redis-password-secret" deleted
configmap "ray-example-uri" deleted
.2023-10-28:13:04:54,383 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:04:54,517 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.embed-grafana.yaml
raycluster.ray.io/raycluster-embed-grafana created
2023-10-28:13:05:06,94 INFO     [prototype.py:379] --- RayClusterAddCREvent 11.267714023590088 seconds ---
2023-10-28:13:05:06,114 INFO     [utils.py:320] Execute command: kubectl exec raycluster-embed-grafana-head-nv4lx -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:05:06,681	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:05:06,682	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.20:6379...
2023-10-27 22:05:06,696	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.20:8265
{'node:__internal_head__': 1.0, 'CPU': 2.0, 'node:10.244.0.20': 1.0, 'memory': 3000000000.0, 'object_store_memory': 799749733.0, 'node:10.244.0.21': 1.0}
2023-10-28:13:05:43,906 INFO     [prototype.py:409] --- Cleanup RayCluster 36.00530767440796 seconds ---
2023-10-28:13:05:43,907 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.embed-grafana.yaml --ignore-not-found=true
.2023-10-28:13:05:44,46 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:05:44,178 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.external-redis.yaml
configmap/redis-config created
service/redis created
deployment.apps/redis created
secret/redis-password-secret created
raycluster.ray.io/raycluster-external-redis created
configmap/ray-example created
2023-10-28:13:06:20,435 INFO     [prototype.py:379] --- RayClusterAddCREvent 35.870128870010376 seconds ---
2023-10-28:13:06:20,457 INFO     [utils.py:320] Execute command: kubectl exec raycluster-external-redis-head-cxtmq -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:06:21,116	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:06:21,116	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.23:6379...
2023-10-27 22:06:21,130	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.23:8265
{'node:__internal_head__': 1.0, 'memory': 10632816231.0, 'object_store_memory': 4903955251.0, 'node:10.244.0.23': 1.0, 'CPU': 1.0, 'node:10.244.0.24': 1.0}
2023-10-28:13:06:59,72 INFO     [prototype.py:409] --- Cleanup RayCluster 37.13770318031311 seconds ---
2023-10-28:13:06:59,72 INFO     [utils.py:320] Execute command: test $(kubectl exec deploy/redis -- redis-cli --no-auth-warning -a $(kubectl get secret redis-password-secret -o jsonpath='{.data.password}' | base64 --decode) DBSIZE) = '0'
2023-10-28:13:06:59,622 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.external-redis.yaml --ignore-not-found=true
configmap "redis-config" deleted
service "redis" deleted
deployment.apps "redis" deleted
secret "redis-password-secret" deleted
configmap "ray-example" deleted
.2023-10-28:13:06:59,832 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:06:59,973 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.head-command.yaml
raycluster.ray.io/raycluster-mini created
configmap/ray-example created
2023-10-28:13:07:14,578 INFO     [prototype.py:379] --- RayClusterAddCREvent 14.281078815460205 seconds ---
2023-10-28:13:07:14,622 INFO     [utils.py:320] Execute command: kubectl exec raycluster-mini-head-zcctn -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:07:15,213	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:07:15,213	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.26:6379...
2023-10-27 22:07:15,229	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.26:8265
{'object_store_memory': 557763379.0, 'memory': 2147483648.0, 'node:10.244.0.26': 1.0, 'node:__internal_head__': 1.0, 'CPU': 1.0}
2023-10-28:13:07:47,819 INFO     [prototype.py:409] --- Cleanup RayCluster 31.789110898971558 seconds ---
2023-10-28:13:07:47,820 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.head-command.yaml --ignore-not-found=true
configmap "ray-example" deleted
.2023-10-28:13:07:47,972 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:07:48,114 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.volcano-scheduler-queue.yaml
raycluster.ray.io/test-cluster-0 created
2023-10-28:13:08:00,791 INFO     [prototype.py:379] --- RayClusterAddCREvent 12.372737884521484 seconds ---
2023-10-28:13:08:00,811 INFO     [utils.py:320] Execute command: kubectl exec test-cluster-0-head-x4jdq -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:08:01,463	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:08:01,463	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.27:6379...
2023-10-27 22:08:01,477	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.27:8265
{'CPU': 3.0, 'node:10.244.0.28': 1.0, 'memory': 4294967296.0, 'object_store_memory': 1159838514.0, 'node:10.244.0.29': 1.0, 'node:__internal_head__': 1.0, 'node:10.244.0.27': 1.0}
2023-10-28:13:08:33,836 INFO     [prototype.py:409] --- Cleanup RayCluster 31.040663957595825 seconds ---
2023-10-28:13:08:33,836 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.volcano-scheduler-queue.yaml --ignore-not-found=true
.2023-10-28:13:08:35,302 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:08:35,446 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.custom-head-service.yaml
raycluster.ray.io/raycluster-custom-head-service.yaml created
2023-10-28:13:08:37,795 INFO     [prototype.py:379] --- RayClusterAddCREvent 2.0454230308532715 seconds ---
2023-10-28:13:08:37,814 INFO     [utils.py:320] Execute command: kubectl exec raycluster-custom-head-service.yaml-head-86zvl -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:08:38,392	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:08:38,393	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.30:6379...
[2023-10-27 22:08:40,434 W 47 47] global_state_accessor.cc:401: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2023-10-27 22:08:41,441	INFO worker.py:1642 -- Connected to Ray cluster.
{'node:10.244.0.30': 1.0, 'node:__internal_head__': 1.0, 'CPU': 1.0, 'memory': 2147483648.0, 'object_store_memory': 561701683.0}
2023-10-28:13:09:16,209 INFO     [prototype.py:409] --- Cleanup RayCluster 31.784589052200317 seconds ---
2023-10-28:13:09:16,209 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.custom-head-service.yaml --ignore-not-found=true
.2023-10-28:13:09:16,475 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:09:16,616 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.separate-ingress.yaml
raycluster.ray.io/raycluster-ingress created
ingress.networking.k8s.io/raycluster-ingress-head-ingress created
2023-10-28:13:09:18,998 INFO     [prototype.py:379] --- RayClusterAddCREvent 2.054396152496338 seconds ---
2023-10-28:13:09:19,17 INFO     [utils.py:320] Execute command: kubectl exec raycluster-ingress-head-jfv2x -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:09:19,627	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:09:19,628	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.31:6379...
[2023-10-27 22:09:21,641 W 47 47] global_state_accessor.cc:401: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-10-27 22:09:22,642 W 47 47] global_state_accessor.cc:401: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2023-10-27 22:09:23,650	INFO worker.py:1642 -- Connected to Ray cluster.
{'CPU': 1.0, 'node:10.244.0.31': 1.0, 'node:__internal_head__': 1.0, 'memory': 4839864731.0, 'object_store_memory': 2419932364.0}
2023-10-28:13:09:56,771 INFO     [prototype.py:409] --- Cleanup RayCluster 30.745362758636475 seconds ---
2023-10-28:13:09:56,771 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.separate-ingress.yaml --ignore-not-found=true
ingress.networking.k8s.io "raycluster-ingress-head-ingress" deleted
.2023-10-28:13:09:56,935 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:09:57,72 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.volcano-scheduler.yaml
raycluster.ray.io/test-cluster-0 created
2023-10-28:13:09:58,423 INFO     [prototype.py:379] --- RayClusterAddCREvent 1.029797077178955 seconds ---
2023-10-28:13:09:58,442 INFO     [utils.py:320] Execute command: kubectl exec test-cluster-0-head-vqtwx -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:09:59,657	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:09:59,657	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.32:6379...
[2023-10-27 22:10:02,734 W 13 13] global_state_accessor.cc:401: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-10-27 22:10:03,735 W 13 13] global_state_accessor.cc:401: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2023-10-27 22:10:04,744	INFO worker.py:1642 -- Connected to Ray cluster.
{'node:10.244.0.32': 1.0, 'node:__internal_head__': 1.0, 'CPU': 1.0, 'object_store_memory': 561404313.0, 'memory': 2147483648.0}
2023-10-28:13:10:37,754 INFO     [prototype.py:409] --- Cleanup RayCluster 30.706429958343506 seconds ---
2023-10-28:13:10:37,754 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.volcano-scheduler.yaml --ignore-not-found=true
.2023-10-28:13:10:37,987 INFO     [utils.py:320] Execute command: kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:63998
CoreDNS is running at https://127.0.0.1:63998/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2023-10-28:13:10:38,129 INFO     [utils.py:320] Execute command: kubectl apply -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.complete.yaml
raycluster.ray.io/raycluster-complete created
2023-10-28:13:10:49,784 INFO     [prototype.py:379] --- RayClusterAddCREvent 11.286155939102173 seconds ---
2023-10-28:13:10:49,804 INFO     [utils.py:320] Execute command: kubectl exec raycluster-complete-head-552hq -n default -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
2023-10-27 22:10:50,426	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-10-27 22:10:50,426	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.33:6379...
2023-10-27 22:10:50,442	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.33:8265
{'CPU': 2.0, 'object_store_memory': 799140249.0, 'memory': 3000000000.0, 'node:__internal_head__': 1.0, 'node:10.244.0.33': 1.0, 'node:10.244.0.34': 1.0}
2023-10-28:13:11:27,742 INFO     [prototype.py:409] --- Cleanup RayCluster 36.10106801986694 seconds ---
2023-10-28:13:11:27,743 INFO     [utils.py:320] Execute command: kubectl delete -n default -f /Users/ruian/Code/go/kuberay/ray-operator/config/samples/ray-cluster.complete.yaml --ignore-not-found=true
.
----------------------------------------------------------------------
Ran 14 tests in 986.095s

OK

@kevin85421 (Member) left a comment

LGTM. #1556 (comment) shows that ray-cluster.external-redis-uri.yaml passes sample YAML tests.

@kevin85421 merged commit 0e959cf into ray-project:master on Oct 30, 2023 (23 checks passed)
kevin85421 pushed a commit to kevin85421/kuberay that referenced this pull request Nov 2, 2023
kevin85421 added a commit that referenced this pull request Nov 2, 2023