Tensor Parallelism vs Data Parallelism #367

Closed · xcxhy opened this issue Jul 5, 2023 · 14 comments

xcxhy commented Jul 5, 2023

Hi, thanks! I used vLLM to run inference with the LLaMA-7B model on a single GPU, and with tensor parallelism on 2 GPUs and 4 GPUs. We found that on a single GPU it is 10 times faster than HF, but with tensor parallelism there is no significant increase in token throughput. We understand that data parallelism expands the available memory and increases the batch of samples that can be processed, while communication between the GPUs may reduce speed. Still, with 2 GPUs there should be roughly a 1.5x speedup, but the throughput has basically remained unchanged. Is it because our GPU KV cache usage is full, or are there other reasons? Looking forward to your reply!

xcxhy (Author) commented Jul 7, 2023

I also want to know whether the tensor parallelism you use is 1D, 2D or 3D?

PythonNut commented Jul 10, 2023

As an example of this, I ran the following script with different values of WORLD_SIZE:

import torch, time, tqdm
from vllm import LLM, SamplingParams

WORLD_SIZE = 1
BATCH_SIZE = 2048

llm = LLM(
    model="lmsys/vicuna-7b-v1.3",
    tokenizer="hf-internal-testing/llama-tokenizer",
    tensor_parallel_size=WORLD_SIZE,
    gpu_memory_utilization=0.85,
)

start = time.perf_counter()

# BATCH_SIZE random prompts of 120 tokens each
batch = torch.randint(32000, (BATCH_SIZE, 120))

out = llm.generate(
    prompt_token_ids=[tokens.tolist() for tokens in batch],
    use_tqdm=False,
    sampling_params=SamplingParams(
        max_tokens=20,
        ignore_eos=True,
    ),
)

print(time.perf_counter() - start)

My results were:

  • 1x A100 40G: 33.2 seconds
  • 2x A100 40G: 28.3 seconds
  • 4x A100 40G: 25.6 seconds
  • 8x A100 40G: 27.8 seconds
  • 16x A100 40G: 36.3 seconds

So there is very little gain from parallelism (and too many GPUs actually make things run slower!).
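
For scale, converting those timings into throughput (my own back-of-the-envelope numbers, counting only the 20 generated tokens per sequence and ignoring the 120 prompt tokens):

# Rough throughput implied by the timings above.
BATCH_SIZE, MAX_TOKENS = 2048, 20
total_tokens = BATCH_SIZE * MAX_TOKENS  # 40,960 generated tokens per run

for gpus, seconds in [(1, 33.2), (2, 28.3), (4, 25.6), (8, 27.8), (16, 36.3)]:
    throughput = total_tokens / seconds
    print(f"{gpus:2d}x A100: {throughput:5.0f} tok/s total, {throughput / gpus:5.0f} tok/s per GPU")

Aggregate throughput only moves from roughly 1,200 to 1,600 tok/s, so per-GPU throughput falls sharply as soon as more GPUs are added.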

irasin (Contributor) commented Jul 11, 2023

I also want to know whether the tensor parallelism you use is 1D, 2D or 3D?

I think vLLM only uses 1D tensor parallelism.

xcxhy (Author) commented Jul 11, 2023

@PythonNut I think your max_tokens is set too small, so you run into the compute-power bottleneck of the A100. vLLM uses dynamic batching for inference. If you hit a memory bottleneck, the speed can be improved by using two cards, because the compute is not fully utilized; if the compute hits a bottleneck, it will slow down due to the communication between multiple cards. This is what I guessed from my own testing and analysis.
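
One way to probe that hypothesis (a minimal sketch of my own, reusing the model and tokenizer from the script above and just sweeping max_tokens; run it once with tensor_parallel_size=1 and once with 2):

# Sketch: sweep max_tokens to see whether tensor parallelism starts to pay off
# once decoding (rather than prompt processing) dominates the runtime.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="lmsys/vicuna-7b-v1.3",
    tokenizer="hf-internal-testing/llama-tokenizer",
    tensor_parallel_size=2,  # compare against 1
)

prompts = ["Summarize the history of parallel computing."] * 256

for max_tokens in (20, 256, 1024):
    params = SamplingParams(max_tokens=max_tokens, ignore_eos=True)
    start = time.perf_counter()
    llm.generate(prompts, sampling_params=params, use_tqdm=False)
    elapsed = time.perf_counter() - start
    print(f"max_tokens={max_tokens}: {len(prompts) * max_tokens / elapsed:.0f} generated tok/s")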

xcxhy (Author) commented Jul 11, 2023

@irasin I think so; the communication between each layer causes the slower speed.

@PythonNut

@xcxhy

Right, I guess I'm showing a case where tensor parallelism is not the best form of parallelism. In particular, I am measuring throughput, not latency; I specifically chose a batch size that is too big to fit in GPU memory at once, so there are many iterations of inference happening inside generate. Since I am very happy with the throughput on a single GPU, it seems like data parallelism would scale better in this situation than tensor parallelism.

irasin (Contributor) commented Jul 12, 2023

@xcxhy

Right, I guess I'm showing a case where tensor parallelism is not the best form of parallelism. In particular, I am measuring throughput, not latency; I specifically chose a batch size that is too big to fit in GPU memory at once, so there are many iterations of inference happening inside generate. Since I am very happy with the throughput on a single GPU, it seems like data parallelism would scale better in this situation than tensor parallelism.

Hi @PythonNut, I wonder whether you have tested pipeline parallelism vs tensor parallelism in any other frameworks?

xcxhy (Author) commented Jul 12, 2023

@PythonNut I agree with your opinion. In my test data, each text is very long, so a single GPU's memory cannot hold a large enough batch; in that case, using tensor parallelism speeds things up to a certain extent. I think you can test it: assuming one GPU can only fit a batch of 10, set the batch to 10 on one GPU, then use 2 GPUs and set the batch to 20. That should verify whether tensor parallelism can provide acceleration.
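
A rough sketch of that experiment (my own, with placeholder numbers; the per-GPU batch size and prompt length would need to be tuned so that a single GPU is close to full):

# Sketch: scale the batch with the tensor-parallel degree so the per-GPU load
# stays constant, then compare wall-clock time per run.
import time
from vllm import LLM, SamplingParams

TENSOR_PARALLEL_SIZE = 2   # run once with 1, once with 2
BATCH_PER_GPU = 10         # assumed capacity of one GPU for these long prompts
batch_size = BATCH_PER_GPU * TENSOR_PARALLEL_SIZE

llm = LLM(
    model="lmsys/vicuna-7b-v1.3",
    tensor_parallel_size=TENSOR_PARALLEL_SIZE,
)
long_prompt = "This is a very long document about many different topics. " * 150
prompts = [long_prompt] * batch_size

params = SamplingParams(max_tokens=256, ignore_eos=True)
start = time.perf_counter()
llm.generate(prompts, sampling_params=params, use_tqdm=False)
print(f"TP={TENSOR_PARALLEL_SIZE}, batch={batch_size}: {time.perf_counter() - start:.1f} s")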

@zhuohan123 (Collaborator)

@xcxhy This is actually due to Python overhead, which bottlenecks tensor-parallel performance for smaller models. We are thinking about implementing the models in C++ to overcome this issue. You can also track the progress in #42.

Peter-Devine commented Nov 1, 2023

Just dropping this as a simple example of how to do data parallel inference in Python that I've found to be effective.

Obviously I'd appreciate the eventual implementation of proper data parallel processing in the actual package, but this works decently as a stop-gap just now.

import os
import multiprocessing
from vllm import LLM, SamplingParams

NUM_GPUS = 4

# Each worker pins itself to one GPU and loads its own full copy of the model
def run_inference_one_gpu(gpu_id, prompt_list, model_name, sampling_params):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    llm = LLM(model=model_name)
    return llm.generate(prompt_list, sampling_params)

# Splits a list into roughly equally sized pieces
# split_list(["a", "b", "c", "d", "e", "f", "g"], 3) -> [['a', 'b'], ['c', 'd'], ['e', 'f', 'g']]
split_list = lambda l, n: [l[i * len(l) // n: (i + 1) * len(l) // n] for i in range(n)]

def run_inference_multi_gpu(model_name, prompts, sampling_params):
    split_prompts = split_list(prompts, NUM_GPUS)
    inputs = [(i, p, model_name, sampling_params) for i, p in enumerate(split_prompts)]

    with multiprocessing.Pool(processes=NUM_GPUS) as pool:
        results = pool.starmap(run_inference_one_gpu, inputs)

    outputs = []
    for result in results:
        outputs.extend(result)

    return outputs

prompts = [f"Write me a story about why {i} is your favourite number.\n\n{i} is my favourite number because " for i in range(100)]
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
model_name = "mistralai/Mistral-7B-v0.1"
outputs = run_inference_multi_gpu(model_name, prompts, sampling_params)
print(outputs)

This requires the whole model to fit onto one GPU (as per data parallelism's usual implementation) and will doubtless have a higher RAM overhead (I haven't checked, but it shouldn't be massive depending on your text size), but it does seem to run at roughly N times the speed of running on one GPU (where N = number of GPUs), compared to <N times for the tensor parallel implementation.
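
One caveat worth flagging (my assumption, not something reported in this thread): if the parent process has already initialized CUDA before the pool is created, the default fork start method can make the workers fail to initialize CUDA. A spawn context avoids this, at the cost of putting the driver code under an if __name__ == "__main__": guard so the workers don't re-execute it on import:

# Drop-in variant of the pool creation above using the "spawn" start method.
# NUM_GPUS, run_inference_one_gpu and inputs are the same names as in the
# script above.
import multiprocessing

if __name__ == "__main__":
    spawn_ctx = multiprocessing.get_context("spawn")
    with spawn_ctx.Pool(processes=NUM_GPUS) as pool:
        results = pool.starmap(run_inference_one_gpu, inputs)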

I hope this is useful to someone!

@appliedml42

@xcxhy

I think I can confirm your intuition with data. I have 4 A6000 cards, which have much lower interconnect bandwidth than A100 cards, so this example is big enough to be I/O-bottlenecked rather than compute-bottlenecked. Here is the same example with vLLM 0.2.2 on the NVIDIA PyTorch container 23.10:

  • 1 GPU: 47.28260640997905 secs
  • 2 GPU: 38.49804261501413 secs
  • 4 GPU: 35.09966081101447 secs

Best,
Abhishek

@SunLemuria

I think fastchat supports this feature: fastchat scalability

yhyu13 commented Dec 20, 2023

I think fastchat supports this feature: fastchat scalability

How can this be done natively in the vLLM project rather than through fastchat?

hmellor (Collaborator) commented Mar 8, 2024

Closing this in favour of #689

hmellor closed this as completed Mar 8, 2024