[core] Ray version change breaks SkyPilot cluster #2722

Closed
romilbhardwaj opened this issue Oct 19, 2023 · 3 comments · Fixed by #3575
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@romilbhardwaj
Collaborator

romilbhardwaj commented Oct 19, 2023

If the user's setup installs a newer version of Ray than the one running on the SkyPilot remote cluster, SkyPilot gets stuck at streaming logs. Many packages (e.g., vllm==0.2.0) tend to install a newer version of Ray.

SkyPilot should:

  1. Protect the Ray installation, or use a custom env for core cluster operations.
  2. Raise a better error if the remote Ray cluster malfunctions.
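
For the second point, here is a minimal sketch of what a clearer error could look like. It assumes the caller already knows which Ray version the cluster was started with. `RayVersionMismatchError`, `installed_ray_version`, and `check_ray_version` are hypothetical names used for illustration, not SkyPilot or Ray APIs.

```python
from importlib import metadata
from typing import Optional


class RayVersionMismatchError(RuntimeError):
    """Hypothetical error type; not part of SkyPilot or Ray."""


def installed_ray_version() -> Optional[str]:
    """Return the Ray version visible to this interpreter, or None if absent."""
    try:
        return metadata.version("ray")
    except metadata.PackageNotFoundError:
        return None


def check_ray_version(cluster_version: str, local_version: Optional[str]) -> None:
    """Raise an actionable error if the local Ray differs from the cluster's."""
    if local_version is None:
        raise RayVersionMismatchError(
            f"Ray is not installed in this environment, but the cluster runs "
            f"Ray {cluster_version}."
        )
    if local_version != cluster_version:
        raise RayVersionMismatchError(
            f"The cluster runs Ray {cluster_version}, but this environment has "
            f"Ray {local_version}. The setup commands may have upgraded Ray; "
            f"install task dependencies in a separate environment."
        )
```

A caller could run `check_ray_version(expected, installed_ray_version())` before streaming logs, so that a mismatch fails fast instead of hanging.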

Minimal repro:

setup: |
  pip install "ray==2.7.1"
run: |
  echo hi

$ sky launch task.yaml

...

I 10-18 21:37:59 cloud_vm_ray_backend.py:3358] Job submitted with Job ID: 1
I 10-19 04:38:01 log_lib.py:431] Start streaming logs for job 1.
<Stuck>

Workaround

Install all dependencies in a new conda env.
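
The same isolation idea can be sketched in Python. Note that this swaps in the standard library's `venv` module in place of conda, and `create_isolated_env` is a hypothetical helper, not something SkyPilot provides.

```python
import os
import venv
from pathlib import Path


def create_isolated_env(env_dir: Path, with_pip: bool = True) -> Path:
    """Create a virtual environment and return the path to its interpreter."""
    builder = venv.EnvBuilder(with_pip=with_pip, symlinks=(os.name != "nt"))
    builder.create(str(env_dir))
    # The interpreter lives in Scripts\ on Windows and bin/ elsewhere.
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"
```

Task dependencies would then be installed with the returned interpreter (for example, by running it with `-m pip install ...`), which leaves the Ray version in the base environment untouched.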

Debug

SSH into the cluster and run ray status:

(base) gcpuser@ray-test-2ea4-head-60220eef-compute:~$ ray status
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2929, in ray._raylet.check_health
  File "python/ray/_raylet.pyx", line 457, in ray._raylet.check_status
ray.exceptions.RaySystemError: System error: Ray cluster at 10.128.0.7:6380 has version 2.4.0, but this processis running Ray version 2.7.1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/ray/scripts/scripts.py", line 1974, in status
    if not ray._raylet.check_health(address):
  File "python/ray/_raylet.pyx", line 2935, in ray._raylet.check_health
RuntimeError: System error: Ray cluster at 10.128.0.7:6380 has version 2.4.0, but this processis running Ray version 2.7.1.
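
One way to detect this condition automatically is to look for the mismatch message in the output of `ray status`. The sketch below is an assumption about how that could be done. The pattern is derived only from the error text above, and `parse_version_mismatch` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Derived from the error text above. "process ?is" tolerates the missing
# space that appears in that message.
_MISMATCH_RE = re.compile(
    r"Ray cluster at (?P<address>\S+) has version (?P<cluster>\d+(?:\.\w+)*), "
    r"but this process ?is running Ray version (?P<local>\d+(?:\.\w+)*)"
)


def parse_version_mismatch(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (address, cluster_version, local_version), or None if no match."""
    match = _MISMATCH_RE.search(text)
    if match is None:
        return None
    return match.group("address"), match.group("cluster"), match.group("local")
```

A non-`None` result could be turned into a user-facing error that names both versions instead of leaving log streaming stuck.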
@Michaelvll
Collaborator

Michaelvll commented Nov 11, 2023

We should install SkyPilot's remote dependencies in an environment other than base to avoid this kind of issue. Another user encountered a similar issue after installing a package in the base environment.

@Michaelvll Michaelvll added the good first issue Good for newcomers label Nov 11, 2023
@concretevitamin
Member

> We should install skypilot's remote dependency in the environment other than base to avoid this kind of issue. Another user encountered similar issue due to installing some package in the base environment.

To add a minimal repro:

git checkout 51a831c6b8c5
sky launch --cloud gcp -c dbg
ssh dbg

Inside dbg:

pip install 'pydantic>2'

This breaks the following dependency pin:

# Ray job has an issue with pydantic>2.0.0, due to API changes of pydantic. See
# https://github.com/ray-project/ray/issues/36990
# >=1.10.8 is needed for ray>=2.6. See
# https://github.com/ray-project/ray/issues/35661
'pydantic <2.0, >=1.10.8',

and results in job submission failures.
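
As an illustration of the pin above, here is a simplified check of whether a given pydantic version falls inside `<2.0, >=1.10.8`. It compares only the numeric release segment and ignores pre-release and other suffix rules, so treat it as a sketch, not a full version-specifier implementation. `satisfies_pydantic_pin` is a hypothetical helper.

```python
from typing import Tuple


def _release(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release, e.g. '1.10.8' -> (1, 10, 8)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # Stop at suffixes such as '0b1'.
            break
    # Pad so that '2' and '2.0' compare the same as '2.0.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def satisfies_pydantic_pin(version: str) -> bool:
    """Return True if version is >=1.10.8 and <2.0 (release segment only)."""
    return (1, 10, 8) <= _release(version) < (2, 0, 0)
```

Running such a check against the installed version (for example, the result of `importlib.metadata.version("pydantic")`) before job submission would surface the conflict before the submission fails.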

@concretevitamin concretevitamin added the help wanted Extra attention is needed label Nov 12, 2023
