
[Performance]: Empirical Measurement of how to broadcast python object in vLLM #4440

Closed
youkaichao opened this issue Apr 28, 2024 · 7 comments
Labels
performance Performance-related issues

Comments

@youkaichao
Sponsor Member

Proposal to improve performance

When we use tensor parallelism in vLLM, the driver worker needs to broadcast some metadata to all workers, such as the input, the LoRA requests, etc. This functionality is currently implemented in:

def broadcast_tensor_dict(

In essence, it uses torch.distributed.broadcast_object_list to broadcast a Python object. This function carries a lot of overhead. The overall procedure is:

[figure: diagram of the broadcast_object_list procedure]

There are three layers of overhead (a simplified sketch of the procedure follows the list below):

  1. Device memory movement: pickle only works on CPU memory, so the data has to be moved from CPU to device and back.
  2. The pickled data of multiple objects are concatenated, which costs an extra memory copy.
  3. Two broadcast operations are needed: one to broadcast the size of each pickled object, and another to broadcast the data itself.
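
For reference, here is a rough, simplified paraphrase of what broadcast_object_list does internally (the function name and details below are illustrative, not PyTorch's actual private helpers), showing where each of the three overheads comes from:

import pickle
import torch
import torch.distributed as dist

def broadcast_object_list_sketch(object_list, src=0, group=None):
    # Simplified paraphrase of torch.distributed.broadcast_object_list.
    rank = dist.get_rank()
    if rank == src:
        pickled = [pickle.dumps(obj) for obj in object_list]
        sizes = torch.tensor([len(p) for p in pickled], dtype=torch.long)
        # Overhead 2: all pickled objects are concatenated into one buffer.
        data = torch.frombuffer(bytearray(b"".join(pickled)), dtype=torch.uint8)
    else:
        sizes = torch.empty(len(object_list), dtype=torch.long)
    # Overhead 1: with the nccl backend these buffers are staged on the GPU
    # before broadcasting and copied back to the CPU afterwards (omitted here).
    # Overhead 3: two broadcasts, one for the sizes and one for the payload.
    dist.broadcast(sizes, src=src, group=group)
    if rank != src:
        data = torch.empty(int(sizes.sum().item()), dtype=torch.uint8)
    dist.broadcast(data, src=src, group=group)
    if rank != src:
        raw = data.numpy().tobytes()
        offset = 0
        for i, size in enumerate(sizes.tolist()):
            object_list[i] = pickle.loads(raw[offset:offset + size])
            offset += size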

The current vLLM implementation packs the data into a list of size one, so overhead 2 is eliminated:

torch.distributed.broadcast_object_list([metadata_list],
                                        src=src,
                                        group=group)

To remove overhead 1, we can use a CPU backend (gloo) to broadcast this kind of metadata.
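
A minimal sketch of this idea, assuming torch.distributed is already initialized (the gloo group and helper name below are illustrative):

import torch.distributed as dist

# A second process group over the same ranks, backed by gloo (CPU),
# used only for metadata, so nothing has to be staged on the GPU.
cpu_group = dist.new_group(backend="gloo")

def broadcast_metadata(metadata=None, src=0):
    # broadcast_object_list still issues two collectives (sizes + payload),
    # but both now operate on CPU tensors, so overhead 1 is gone.
    container = [metadata]
    dist.broadcast_object_list(container, src=src, group=cpu_group)
    return container[0]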

In addition, if we know the rough size of the pickled object, we can remove overhead 3 as well: only one broadcast is required, which is the optimal case for broadcasting a Python object.
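
A sketch of this single-broadcast scheme, assuming the pickled metadata always fits into an agreed-upon buffer size (MAX_BYTES and the helper name are illustrative, not vLLM code):

import pickle
import torch
import torch.distributed as dist

MAX_BYTES = 4096  # assumed upper bound on the pickled metadata size

def broadcast_metadata_once(metadata=None, src=0, group=None):
    buf = torch.empty(MAX_BYTES, dtype=torch.uint8)
    if dist.get_rank() == src:
        data = pickle.dumps(metadata)
        assert len(data) <= MAX_BYTES, "metadata larger than the agreed buffer"
        buf[:len(data)] = torch.frombuffer(bytearray(data), dtype=torch.uint8)
    # A single collective; no size broadcast is needed because the pickle
    # stream is self-terminating, so the trailing padding is simply ignored.
    dist.broadcast(buf, src=src, group=group)
    return pickle.loads(buf.numpy().tobytes())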

I have written some benchmark code at https://gist.github.com/youkaichao/b33fcd70286eb45a4a2d5a6dc32d096b and the results are at https://docs.google.com/spreadsheets/d/1c9xgR0fGvm6SROfk7vrjwOZdYnKQk9oOafWK4_KgOyo/edit?usp=sharing .

The short conclusion is:

  1. Using CPU (gloo) to broadcast the data indeed works better than NCCL (GPU). For small metadata, the broadcast time drops from about 400 µs to 300 µs.
  2. If we can estimate the rough size in advance, the broadcast time can be reduced to about 100 µs. That requires us to design the object to be broadcast accordingly.

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
@youkaichao youkaichao added the performance Performance-related issues label Apr 28, 2024
@youkaichao
Sponsor Member Author

Note: the memory alignment feature depends on the fact that the pickle format is self-terminating:

import pickle
import pickletools

s = [1] * 5
d = pickle.dumps(s)
d = d + b"whatever"  # append arbitrary trailing bytes
pickletools.dis(d)   # the disassembly still ends cleanly at the STOP opcode

Output:

    0: \x80 PROTO      4
    2: \x95 FRAME      15
   11: ]    EMPTY_LIST
   12: \x94 MEMOIZE    (as 0)
   13: (    MARK
   14: K        BININT1    1
   16: K        BININT1    1
   18: K        BININT1    1
   20: K        BININT1    1
   22: K        BININT1    1
   24: e        APPENDS    (MARK at 13)
   25: .    STOP
highest protocol among opcodes = 4

There is a STOP opcode at the end, so it is safe to pad/align the pickled data.

@cadedaniel
Collaborator

The optimization makes sense to me (nice writeup!)

@AllenDou
Contributor

> Note: the memory alignment feature depends on the fact that the pickle format is self-terminating: […]

import pickle
import pickletools

s = ['s'] * 5
d = pickle.dumps(s)
d = d + b"whatever"
pickletools.dis(d)

    0: \x80 PROTO      4
    2: \x95 FRAME      17
   11: ]    EMPTY_LIST
   12: \x94 MEMOIZE    (as 0)
   13: (    MARK
   14: \x8c     SHORT_BINUNICODE 's'
   17: \x94     MEMOIZE    (as 1)
   18: h        BINGET     1
   20: h        BINGET     1
   22: h        BINGET     1
   24: h        BINGET     1
   26: e        APPENDS    (MARK at 13)
   27: .    STOP
highest protocol among opcodes = 4

The result of pickle.dumps does not always seem to be aligned to 4 bytes.

@youkaichao
Sponsor Member Author

> The result of pickle.dumps does not always seem to be aligned to 4 bytes.

It does not matter though. The point is that the format is self-terminating, so we can pad it with arbitrary bytes; padding does not affect unpickling.
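
A quick illustrative check with the string example above, padding to an arbitrary boundary:

import pickle

s = ['s'] * 5
d = pickle.dumps(s)
padded = d + b"\x00" * (-len(d) % 16)  # pad to a 16-byte boundary
# pickle.loads stops at the STOP opcode, so the trailing padding is ignored.
assert pickle.loads(padded) == s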

@sfc-gh-zhwang
Contributor

Very cool!
Do we know how much latency broadcast_tensor_dict contributes to overall inference?

@youkaichao
Sponsor Member Author

There are two broadcast_tensor_dict calls in vLLM: one broadcasts the blocks to copy/swap, and the other broadcasts the input data (tokens, block tables, etc.). The former takes about 0.4 ms; the latter takes longer, but I don't have a detailed measurement yet.

@youkaichao
Sponsor Member Author

The performance of broadcasting Python objects is largely resolved by #5399, at least in the single-node case.
