
Distributed inference on multi machine (error Invalid peer device id) #2795

Closed
bieenr opened this issue Feb 7, 2024 · 9 comments · Fixed by #2760
Labels
bug Something isn't working

Comments


bieenr commented Feb 7, 2024

I'm a newbie, and I'm running an example at https://docs.vllm.ai/en/latest/serving/distributed_serving.html locally with 2 machines, each with an RTX 3090 GPU. I changed tensor_parallel_size to 2 and model to "vinai/PhoGPT-4B".
On the head node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --head.
On the other node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --address='10.0.0.1'.
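For reference, main.py is essentially the script from the distributed serving docs; a minimal sketch of what I am running (the prompt and sampling settings here are only illustrative) is:

from vllm import LLM, SamplingParams

# Two tensor-parallel ranks, one GPU per machine, placed by Ray.
llm = LLM(model="vinai/PhoGPT-4B", tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
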
Then, on the head node, when I run the example code: python main.py, I get the following error:

Traceback (most recent call last):
  File "/data2/bientd/vllm/test.py", line 25, in <module>
    llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2,download_dir='/data2/bientd/')#,pipeline_parallel_size=3 don't support
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 356, in from_engine_args
    engine = cls(*engine_configs,
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers_ray(placement_group)
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 271, in _init_workers_ray
    self._run_workers("init_model")
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/worker/worker.py", line 87, in init_model
    init_custom_ar()
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/model_executor/parallel_utils/custom_all_reduce.py", line 44, in init_custom_ar
    if not _can_p2p(rank, world_size):
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/model_executor/parallel_utils/custom_all_reduce.py", line 137, in _can_p2p
    if not torch.cuda.can_device_access_peer(rank, i):
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 464, in can_device_access_peer
    raise AssertionError("Invalid peer device id")
AssertionError: Invalid peer device id

umarbutler commented Feb 8, 2024

I am experiencing the same issue trying to load a model across two EC2 instances. I get the same traceback as well. I'm running the latest versions of both vLLM and Ray with Python 3.11.


Kaotic3 commented Feb 8, 2024

I too have encountered this issue. I would add that the Ray cluster is connected, and a "View the dashboard at ......" message appears.

But I then receive the 'AssertionError: Invalid peer device id'.

I did notice the failing line was "if not torch.cuda.can_device_access_peer(rank, i):"

So I spun up a Python interpreter and tried this out:

torch.cuda.can_device_access_peer(0, 0) - False
torch.cuda.can_device_access_peer(0, 1) - True
torch.cuda.can_device_access_peer(0, 2) - Invalid peer device id
torch.cuda.can_device_access_peer(1, 0) - True
torch.cuda.can_device_access_peer(1, 1) - False
torch.cuda.can_device_access_peer(1, 2) - Invalid peer device id

Hopefully that will help someone narrow down what the issue is.
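
For anyone who wants to reproduce that probe, this is roughly the loop I used (plain torch, no vLLM; I have two GPUs visible on this machine, so index 2 is out of range):

import torch

print(torch.cuda.device_count())  # 2 on this machine
for dev in range(3):
    for peer in range(3):
        try:
            print(dev, peer, torch.cuda.can_device_access_peer(dev, peer))
        except AssertionError as e:
            # Out-of-range indices raise "Invalid (peer) device id"
            print(dev, peer, e)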


umarbutler commented Feb 9, 2024

It looks like the custom all-reduce P2P check (the can_device_access_peer call in custom_all_reduce.py) was added two weeks ago (#2192). I have just downgraded to version 0.2.7 of vLLM and I can confirm that Ray now works.


Kaotic3 commented Feb 9, 2024

Thanks @umarbutler, I also downgraded and got past the device ID issue.

Now I am connecting to the cluster, but it hangs at "Initializing an LLM engine" for the longest time, and eventually (and I am talking 20 minutes later) I get this message:

(RayWorkerVllm pid=5003, ip=x.x.x.x) [E socket.cpp:922] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.1.1, 35005).

I then have to kill all the processes as it simply won't stop...

This is attempting to run two machines on the same network using ray start --head and ray start --address=x.x.x.x:6379, which I thought would be the simplest way to run this.

ray status shows the nodes, all connected. I can ping between the machines and the ports are open. I don't know why it is failing at this point.

@umarbutler

@Kaotic3 I didn't encounter that issue after downgrading. However, I did encounter an issue where my Ray nodes were not connecting to one another, I think because I also downgraded Ray, thinking the latest version would have compatibility issues with vLLM 0.2.7. As it turns out, upgrading back to the latest version of Ray solved that. So perhaps you could try that?

In terms of the commands you're running, those are the same commands I'm running, so I'm not sure what the issue could be.


bieenr commented Feb 10, 2024

Thanks @umarbutler, I also downgraded and it now runs successfully. Regarding @Kaotic3's issue: I run "NCCL_SOCKET_IFNAME=eth0 python test.py" instead of "python test.py". I hope that helps.

bieenr closed this as completed Feb 10, 2024
bieenr reopened this Feb 10, 2024
@umarbutler

@bieenr This issue should remain open; even if closing it was not a mistake, we still need it fixed in subsequent versions. Clearly, the fact that it works on the previous version indicates that a bug was introduced on the vLLM side.

@WoosukKwon
Collaborator

This problem is due to the custom all-reduce kernel, which is now disabled by default. In v0.3.1, this problem should no longer occur.
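
If upgrading is not immediately possible, the kernel can also be turned off explicitly through the engine arguments (a sketch, assuming your vLLM build already exposes disable_custom_all_reduce):

from vllm import LLM

llm = LLM(
    model="vinai/PhoGPT-4B",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,  # skip the custom all-reduce path and its P2P check
)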

WoosukKwon added the bug label Feb 14, 2024
@hanzhi713
Contributor

@WoosukKwon Not a bug with the kernel itself. It's just that the P2P check cannot run properly when CUDA_VISIBLE_DEVICES is set. It will be fixed by #2760.
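
To illustrate the mismatch (a sketch, not the vLLM code verbatim): each node is launched with CUDA_VISIBLE_DEVICES=0, so only one device is visible locally, but the check loops over global ranks and asks PyTorch about device indices that don't exist in this process:

import torch

# Run with CUDA_VISIBLE_DEVICES=0, as in the ray start commands above.
world_size = 2  # two tensor-parallel ranks across two machines
rank = 0        # this process is global rank 0

print(torch.cuda.device_count())  # 1 visible device
for i in range(world_size):
    try:
        print(i, torch.cuda.can_device_access_peer(rank, i))
    except AssertionError as e:
        # i == 1 has no local device behind it -> "Invalid peer device id"
        print(i, e)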
