How to know how much real GPU memory is used? #3056

ikalista · 2024-02-27T12:42:25Z

I know vLLM reserves a fixed region of GPU memory for the KV cache, but much of that region is often not actually in use. Is there any way to measure the real GPU memory usage?

Answered by chenxu2048

juni3227 · 2024-03-13T08:31:01Z

If you are using an NVIDIA GPU, run nvidia-smi in another terminal and look for the vLLM process. For AMD, I do not know.
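For anyone who wants to script the nvidia-smi check, here is a minimal sketch. It assumes nvidia-smi is on the PATH; the helper names parse_compute_apps and query_gpu_processes are made up for illustration. Keep in mind that the figure reported is what the process has allocated, which for vLLM includes the whole pre-allocated KV cache pool, not only the part currently holding tokens.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows (memory in MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue
        pid, mem = parts[0], parts[-1]
        name = ",".join(parts[1:-1]).strip()
        # Skip rows where nvidia-smi could not report a number (e.g. "[N/A]").
        if not pid.isdigit() or not mem.isdigit():
            continue
        rows.append({"pid": int(pid), "name": name, "used_mib": int(mem)})
    return rows


def query_gpu_processes():
    """Ask nvidia-smi for per-process GPU memory of compute processes."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling query_gpu_processes() on the serving machine returns one entry per compute process; match the pid against the vLLM server process to find its allocation.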

0 replies

chenxu2048 · 2024-03-14T04:04:25Z

vLLM records KV cache usage, logs it, and exposes it via Prometheus. We can also recalculate the GPU memory usage from the number of GPU blocks and their usage percentage. However, vLLM does not currently report the GPU memory used by the KV cache directly.

INFO 03-14 11:50:43 llm_engine.py:338] # GPU blocks: 2236, # CPU blocks: 655
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [112564]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1758 (Press CTRL+C to quit)
INFO 03-14 11:50:58 metrics.py:205] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 03-14 11:51:08 metrics.py:205] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

The latest code also logs how much GPU memory was used to load the model weights.

INFO 03-11 13:19:13 model_runner.py:96] Loading model weights took 14.3919 GB

You can also roughly estimate the GPU memory actually occupied by the KV cache as (total_gpu_memory * gpu_memory_utilization - model_weight_usage) * gpu_kv_cache_usage.
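As a worked example of that estimate, here is a small sketch. The function name and the 80 GB card are assumptions for illustration; the weight size comes from the log line above, and the result ignores activation memory and other overhead, so treat it as a rough figure.

```python
def estimate_kv_cache_used_gb(
    total_gpu_memory_gb,
    gpu_memory_utilization,
    model_weight_usage_gb,
    gpu_kv_cache_usage,
):
    """Approximate GB of KV cache in use.

    gpu_kv_cache_usage is a fraction between 0 and 1, i.e. the logged
    'GPU KV cache usage' percentage divided by 100.
    """
    kv_pool_gb = total_gpu_memory_gb * gpu_memory_utilization - model_weight_usage_gb
    # Clamp at zero in case the inputs leave no room for a cache pool.
    return max(kv_pool_gb, 0.0) * gpu_kv_cache_usage


# Hypothetical 80 GB GPU, utilization 0.9, weights from the log, cache 25% full.
used_gb = estimate_kv_cache_used_gb(80.0, 0.9, 14.3919, 0.25)
print(f"Estimated KV cache in use: {used_gb:.2f} GB")
```

Adding model_weight_usage back to this figure gives a rough lower bound on the memory that is actively holding data, as opposed to the full reservation that nvidia-smi shows.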

0 replies
