Your current environment
The output of `python collect_env.py`
command:

```bash
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8088 \
    --max-model-len 65536 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --quantization moe_wna16 \
    --gpu-memory-utilization 0.97 \
    --kv-cache-dtype fp8_e5m2 \
    --calculate-kv-scales \
    --served-model-name deepseek-reasoner \
    --model cognitivecomputations/DeepSeek-R1-AWQ
```
GPU: A800 * 8
CUDA: 12.1
🐛 Describe the bug
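The 500 below is returned for an ordinary chat-completion request. As a minimal repro sketch (the prompt content is a placeholder I chose; the endpoint and served model name come from the launch command above):

```python
# Minimal repro sketch: send one chat completion to the server started above.
# Assumes the `openai` Python package is installed; the prompt is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8088/v1",  # --host 0.0.0.0 --port 8088
    api_key="EMPTY",  # vLLM's OpenAI-compatible server accepts any key by default
)

resp = client.chat.completions.create(
    model="deepseek-reasoner",  # --served-model-name deepseek-reasoner
    messages=[{"role": "user", "content": "Hello"}],  # placeholder prompt
)
print(resp.choices[0].message.content)
```

Server-side, the request is accepted and the engine then dies with a CUDA OOM: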
```text
truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 02-14 08:44:26 engine.py:275] Added request chatcmpl-4fb99a59-2bdd-455f-97b8-99b8908e83de.
INFO:     172.16.18.69:52036 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 02-14 08:44:33 launcher.py:101] MQLLMEngine is already dead, terminating server process
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop.
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242] Traceback (most recent call last):
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 236, in _run_worker_process
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 93, in start_worker_execution_loop
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 413, in execute_model
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1719, in execute_model
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 687, in forward
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=614) ERROR 02-14 08:44:33 multiproc_worker_utils.py:242]   File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
ERROR 02-14 08:44:33 engine.py:139] OutOfMemoryError('CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 75.25 MiB is free. Process 2193685 has 79.05 GiB memory in use. Of the allocated memory 69.80 GiB is allocated by PyTorch, with 76.00 MiB allocated in private pools (e.g., CUDA Graphs), and 63.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)')
ERROR 02-14 08:44:33 engine.py:139] Traceback (most recent call last):
ERROR 02-14 08:44:33 engine.py:139]   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 137, in start
ERROR 02-14 08:44:33 engine.py:139]     self.run_engine_loop()
ERROR 02-14 08:44:33 engine.py:139]   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 200, in run_engine_loop
ERROR 02-14 08:44:33 engine.py:139]     request_outputs = self.engine_step()
ERROR 02-14 08:44:33 engine.py:139]   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 218, in engine_step
ERROR 02-14 08:44:33 engine.py:139]     raise e
ERROR 02-14 08:44:33 engine.py:139]   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 209, in engine_step
ERROR 02-14 08:44:33 engine.py:139]     return self.engine.step()
ERROR 02-14 08:44:33 engine.py:139]   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1386, in step
ERROR 02-14 08:44:33 engine.py:139]     outputs = self.model_executor.execute_model(
ERROR 02-14 08:44:33 engine.py:139]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 275, in execute_model
ERROR 02-14 08:44:33 engine.py:139]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 02-14 08:44:33 engine.py:139]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
```
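The numbers in the OOM message show how tight the margin is, and the message itself suggests an allocator setting. A sketch of both below; the env var comes straight from the error text, while dropping --gpu-memory-utilization below 0.97 is my assumption, not a confirmed fix:

```python
# Sketch: headroom implied by the OOM message above, plus the allocator hint it
# recommends. Lowering --gpu-memory-utilization is an assumption, not a known fix.
import os
import subprocess

total_gib = 79.15   # "GPU 0 has a total capacity of 79.15 GiB" (from the error)
in_use_gib = 79.05  # "Process 2193685 has 79.05 GiB memory in use"
request_mib = 84.0  # "Tried to allocate 84.00 MiB"
print(f"headroom at failure: ~{(total_gib - in_use_gib) * 1024:.0f} MiB "
      f"for an {request_mib:.0f} MiB request")  # ~102 MiB nominal, 75.25 MiB actually free

# Relaunch with the allocator setting the error message recommends.
env = dict(os.environ, PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True")
subprocess.run(
    ["python", "-m", "vllm.entrypoints.openai.api_server",
     "--model", "cognitivecomputations/DeepSeek-R1-AWQ",
     "--gpu-memory-utilization", "0.95",  # assumption: more headroom than 0.97
     # ... remaining flags as in the original command ...
     ],
    env=env,
)
```

My guess (unverified) is that the extra runtime allocation comes from KV-scale calculation on the first request (--calculate-kv-scales), which the memory profiling pass at 0.97 utilization does not appear to account for.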
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.