
Use NCCL instead of ray for control-plane communication to remove serialization overhead #2221

Merged · 35 commits · Jan 3, 2024

Conversation

zhuohan123 (Collaborator) commented Dec 20, 2023

This PR modifies vLLM to use NCCL instead of ray for control-plane communication. The architectural change of vLLM with this PR can be summarized in the following figure:

[Figure: vLLM architecture before and after this PR]

Before this change, vLLM has one driver process (only on CPU) and N worker processes, each of which is a ray actor that manages one GPU. After this change, we move one worker into the driver process (the driver worker) and keep N-1 ray workers. All control messages are broadcast from the driver worker to the remaining workers with NCCL. This avoids the high serialization cost of ray communication.
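To illustrate the idea, here is a minimal sketch (not the PR's actual implementation, which broadcasts tensors directly over NCCL and keeps pickled metadata small; the function name is hypothetical, and torch.distributed is assumed to already be initialized with the NCCL backend across the driver worker and the N-1 ray workers):

import torch.distributed as dist

def broadcast_control_message(msg=None, src=0):
    """Rank `src` supplies `msg`; every rank returns the driver's message."""
    container = [msg]
    # broadcast_object_list serializes the object on the source rank and ships
    # it through the process group, replacing per-worker ray RPC round trips.
    dist.broadcast_object_list(container, src=src)
    return container[0]

The driver worker would call broadcast_control_message(msg) while every other worker calls broadcast_control_message() and receives the driver's message.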

For the throughput benchmark of LLaMA-70B on 8 A100-40G GPUs:

On the ShareGPT dataset:
  Before this PR: 3.01 reqs/s
  With this PR: 5.08 reqs/s
With batch size 512, input len 1031, output len 317:
  Before this PR: 2.48 reqs/s
  With this PR: 3.48 reqs/s

This PR has passed most of the tests and is ready for review.

Should be merged after #2270, #2273.

@zhuohan123 zhuohan123 changed the title Do not use ray for collective communication to remove serialization overhead [WIP] Do not use ray for collective communication to remove serialization overhead Dec 20, 2023
@zhuohan123 zhuohan123 changed the title [WIP] Do not use ray for collective communication to remove serialization overhead [WIP] Do not use ray for control-plane communication to remove serialization overhead Dec 21, 2023

def broadcast(input_, src=0):
    """Broadcast the input tensor."""
    world_size = torch.distributed.get_world_size
Contributor

This is a function. You need to call it

Collaborator Author

😅 great catch. Will fix

Contributor

Lucky catch. Browsed through the code in 10s.
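For reference, the fixed helper would look roughly like this (a sketch of the intended logic rather than the exact code that landed; the single-rank short-circuit is an assumption based on similar helpers):

import torch

def broadcast(input_, src=0):
    """Broadcast the input tensor from rank `src` to all ranks in place."""
    world_size = torch.distributed.get_world_size()  # note the call
    # Nothing to communicate when running on a single rank.
    if world_size == 1:
        return input_
    torch.distributed.broadcast(input_, src=src)
    return input_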


Hello, when will this branch be merged into the main branch, and will it bring significant performance improvements?

Collaborator Author

Hi @Lvjinhong, please find the performance numbers in the description of this PR. This PR is waiting for review and will be merged soon.

@zhuohan123 zhuohan123 changed the title [WIP] Do not use ray for control-plane communication to remove serialization overhead Use NCCL instead of ray for control-plane communication to remove serialization overhead Dec 26, 2023
esmeetu (Collaborator) commented Dec 27, 2023

Hi @zhuohan123, this PR doesn't work for me.
python -m vllm.entrypoints.openai.api_server --model /Llama-2-7b-chat-hf --tensor-parallel-size 4 --dtype half --enforce-eager


  File "/home/roy/vllm/vllm/engine/llm_engine.py", line 599, in _process_model_outputs
    for seq_group, outputs in zip(scheduled_seq_groups, output):
TypeError: 'NoneType' object is not iterable

zhuohan123 (Collaborator Author) replied:

> Hi @zhuohan123, this PR doesn't work for me. [...]

@esmeetu Sorry this is a bug! Can you test the latest commit again?

Lccwr commented Jan 3, 2024

Hello, I encountered some problems while testing your branch.
python benchmark_latency.py --model llama-70B-hf --tensor-parallel-size 4 --input-len 1536 --output-len 512 --batch-size 120 --num-iters 2 --enforce-eager
[error screenshot attached]

WoosukKwon (Collaborator) left a comment:

@zhuohan123 Awesome! Thanks for addressing my comments. Let's merge this asap after addressing the remaining minor comments from @njhill!

(Resolved review threads: vllm/model_executor/layers/sampler.py, vllm/worker/model_runner.py, vllm/engine/ray_utils.py)
Yard1 (Collaborator) commented Jan 3, 2024

@zhuohan123 Can you give me a few hours to look over this? Thanks!

Comment on lines 120 to 127
placement_group_specs = ([{
    "GPU": 1,
    "node:__internal_head__": 0.01
}] + [{
    "GPU": 1
}] * (parallel_config.world_size - 1))
current_placement_group = ray.util.placement_group(
    placement_group_specs)
Collaborator Author

@Yard1 In this PR, we assume the placement group will have one bundle that includes "node:__internal_head__". This is to reserve a GPU for the driver worker. This may conflict with some of the existing logic using placement groups.

Comment on lines 168 to 176
if (bundle.get("node:__internal_head__", 0) > 0
        and self.driver_dummy_worker is None):
    self.driver_dummy_worker = ray.remote(
        num_cpus=0,
        num_gpus=num_gpus,
        scheduling_strategy=scheduling_strategy,
        **ray_remote_kwargs,
    )(RayWorkerVllm).remote()
    continue
Collaborator Author
@Yard1 This is where we use the bundle with the node:__internal_head__ resource to hold the resource for the driver worker.

zhuohan123 (Collaborator Author) replied:

> @zhuohan123 Can you give me a few hours to look over this? Thanks!

Sure! I just highlighted several places where we changed the logic of how we use placement groups, which I think are important for you to take a look at.

(Resolved review threads: vllm/engine/llm_engine.py, vllm/engine/ray_utils.py)
WoosukKwon mentioned this pull request Jan 3, 2024
worker = ray.remote(
    num_cpus=0,
    num_gpus=num_gpus,
    scheduling_strategy=scheduling_strategy,
    **ray_remote_kwargs,
)(RayWorkerVllm).remote(self.model_config.trust_remote_code)
self.workers.append(worker)

worker_ip = ray.get(worker.get_node_ip.remote())
Collaborator

As a minor optimization, you could do this in another loop so that the workers can be initialized in a non-blocking fashion. But considering that nothing much happens in __init__, I think it's OK to leave it as is (though it is an anti-pattern).
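A sketch of the two-loop pattern being suggested (a hypothetical fragment: bundles, num_gpus, scheduling_strategy, model_config, and RayWorkerVllm are assumed from the surrounding ray_utils code, and this is not the PR's final code): create every actor first without blocking, then resolve all node IPs in a single ray.get so the actor constructors can overlap.

import ray

workers, ip_refs = [], []
for bundle in bundles:
    worker = ray.remote(
        num_cpus=0,
        num_gpus=num_gpus,
        scheduling_strategy=scheduling_strategy,
    )(RayWorkerVllm).remote(model_config.trust_remote_code)
    workers.append(worker)
    # .remote() returns an ObjectRef immediately, so this loop never blocks.
    ip_refs.append(worker.get_node_ip.remote())

# One blocking ray.get over all refs instead of one ray.get per worker,
# letting the actor constructors run concurrently.
worker_ips = ray.get(ip_refs)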

Collaborator Author

Yeah, and this only happens once, so it should not affect performance.

Yard1 (Collaborator) left a comment:

Thanks, looks good!

@zhuohan123 zhuohan123 merged commit fd4ea8e into main Jan 3, 2024
2 checks passed
jedibrillo pushed a commit to jedibrillo/vllm that referenced this pull request Jan 5, 2024
quanliu1991 added a commit to quanliu1991/vllm that referenced this pull request Jan 6, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
@zhuohan123 zhuohan123 deleted the remove-serialization-overhead branch February 22, 2024 18:47