How to achieve maximum performance? #15215

Unanswered

gaby asked this question in Q&A

gaby
Mar 20, 2025

I have a server with +10 GPU's, and over 1TB of RAM. I'm current running Mistral Small Instruct with vLLM v0.8.0 in Docker.

In v0.8.0 a new command for running benchmarks was added, when running these I get pretty much the same performance from running:

vLLM with 1 GPU (1 req/s faster than multi-gpu)
vLLM with all the gpus and --tensor-parallel-size set to 4 or 6

The only difference I see is that the logs say I have a way bigger KV cache size. Throughput wise I'm getting the same tokens generated and Req/s.

These are the flags i'm using:

gpu_memory_utilization 0.95
tensor_parallel_size 1/4/6 makes no difference other than KV cache size
max_num_batched_tokens 2048, 8183, 32000 all made minimal difference
enable_chunked_prefill=True

What am I missing? Why is the throughput not increasing, is this a vLLM issue or a vllm bench serve issue?

Is there any way of using some of that RAM to help vLLM? Right now the system is barely using 10GB of RAM.

Replies: 0 comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment