
RuntimeError: The error occurred during profile run. This might be due to insufficient GPU memory. #19

Description

@guzhangbo1989

My platform is: Ubuntu 22.04 (host) + Ubuntu 22.04 (container) + Intel Arc A770 16 GB.

The container image was built by following the README in this project: docker build -f Dockerfile -t vllm-openvino-env .
The container was started with:
docker run -tid --network host --privileged --shm-size="16g" --memory="32G" --device /dev/dri --name vllm-ov-new --hostname vll-ov-new -v /ai-data/:/ai-data/ vllm-openvino-env:latest

I tried to run the command below to test GPU performance, but got the error "RuntimeError: The error occurred during profile run. This might be due to insufficient GPU memory. Consider decreasing max_model_len to limit the maximum simultaneously processed tokens".
I have already tried decreasing --max-model-len and --max-num-seqs to use less GPU memory, but the same error still appears (an example of such a variant is shown after the command below).

command:
$ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_KV_CACHE_PRECISION=i8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON python3 vllm/benchmarks/benchmark_throughput.py --model /ai-data/ai-models/Baichuan2-7B-Base --dataset /ai-data/ShareGPT_V3_unfiltered_cleaned_split.json --max-model-len 512 --max-num-batched-tokens 5120 --max-num-seqs 2 --trust_remote_code --enable-chunked-prefill --gpu-memory-utilization 0.9
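
For reference, this is the kind of reduced-memory variant I tried (the specific values here are illustrative, not a verbatim transcript); it fails with the same RuntimeError:
$ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_KV_CACHE_PRECISION=i8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON python3 vllm/benchmarks/benchmark_throughput.py --model /ai-data/ai-models/Baichuan2-7B-Base --dataset /ai-data/ShareGPT_V3_unfiltered_cleaned_split.json --max-model-len 256 --max-num-batched-tokens 5120 --max-num-seqs 1 --trust_remote_code --enable-chunked-prefill --gpu-memory-utilization 0.9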

log:
root@vll-ov-new:/workspace# VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_KV_CACHE_PRECISION=i8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON python3 vllm/benchmarks/benchmark_throughput.py --model /ai-data/ai-models/Baichuan2-7B-Base --dataset /ai-data/ShareGPT_V3_unfiltered_cleaned_split.json --max-model-len 512 --max-num-batched-tokens 5120 --max-num-seqs 2 --trust_remote_code --enable-chunked-prefill --gpu-memory-utilization 0.9
INFO 07-30 18:07:10 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 07-30 18:07:10 [__init__.py:32] name=openvino, value=vllm_openvino:register
INFO 07-30 18:07:10 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 07-30 18:07:10 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 07-30 18:07:10 [__init__.py:44] plugin openvino loaded.
INFO 07-30 18:07:10 [__init__.py:230] Platform plugin openvino is activated
/workspace/vllm/benchmarks/benchmark_throughput.py:637: UserWarning: The '--dataset' argument will be deprecated in the next release. Please use '--dataset-name' and '--dataset-path' instead.
validate_args(args)
Namespace(backend='vllm', dataset_name='sharegpt', dataset='/ai-data/ShareGPT_V3_unfiltered_cleaned_split.json', dataset_path='/ai-data/ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=None, n=1, num_prompts=1000, hf_max_batch_size=None, output_json=None, async_engine=False, disable_frontend_multiprocessing=False, disable_detokenize=False, lora_path=None, prefix_len=None, random_range_ratio=None, hf_subset=None, hf_split=None, model='/ai-data/ai-models/Baichuan2-7B-Base', task='auto', tokenizer='/ai-data/ai-models/Baichuan2-7B-Base', hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=512, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=5120, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=2, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False)
Token indices sequence length is longer than the specified maximum sequence length for this model (4772 > 4096). Running this sequence through the model will result in indexing errors
INFO 07-30 18:07:23 [config.py:689] This model supports multiple tasks: {'score', 'reward', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
WARNING 07-30 18:07:23 [arg_utils.py:1731] device type=cpu is not supported by the V1 Engine. Falling back to V0.
INFO 07-30 18:07:23 [config.py:1747] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 07-30 18:07:23 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=5120.
WARNING 07-30 18:07:23 [platform.py:84] CUDA graph is not supported on OpenVINO backend, fallback to the eager mode.
INFO 07-30 18:07:23 [platform.py:100] KV cache type is overridden to i8 via VLLM_OPENVINO_KV_CACHE_PRECISION env var.
INFO 07-30 18:07:23 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/ai-data/ai-models/Baichuan2-7B-Base', speculative_config=None, tokenizer='/ai-data/ai-models/Baichuan2-7B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=i8, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/ai-data/ai-models/Baichuan2-7B-Base, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[2,1],"max_capture_size":2}, use_cached_outputs=False,
WARNING 07-30 18:07:23 [tokenizer.py:248] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-30 18:07:24 [platform.py:41] Using OpenVINO Attention backend.
[W730 18:07:24.105615725 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 07-30 18:07:25 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
WARNING 07-30 18:07:25 [openvino.py:113] Provided model id /ai-data/ai-models/Baichuan2-7B-Base does not contain OpenVINO IR, the model will be converted to IR with default options. If you need to use specific options for model conversion, use optimum-cli export openvino with desired options.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.6.0+cpu)
Python 3.10.16 (you have 3.10.12)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 23.37it/s]
WARNING:root:Cannot apply model.to_bettertransformer because of the exception:
BetterTransformer requires transformers<4.49 but found 4.51.3. optimum.bettertransformer is deprecated and will be removed in optimum v2.0.. Usage model with stateful=True may be non-effective if model does not contain torch.functional.scaled_dot_product_attention
loss_type=None was set in the config but it is unrecognised.Using the default loss: ForCausalLMLoss.
/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Base/modeling_baichuan.py:348: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1:
/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Base/modeling_baichuan.py:62: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Base/modeling_baichuan.py:67: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Base/modeling_baichuan.py:122: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if seq_len > self.max_seq_len_cached:
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym               │ 100% (162 / 162)          │ 100% (162 / 162)                     │
┕━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━ 100% • 0:01:15 • 0:00:00
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/multimodal/registry.py", line 95, in getitem
[rank0]: return super().getitem(key)
[rank0]: File "/usr/lib/python3.10/collections/init.py", line 1106, in getitem
[rank0]: raise KeyError(key)
[rank0]: KeyError: <vllm.config.ModelConfig object at 0x79797bbe9ba0>

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm_openvino/worker/openvino_worker.py", line 274, in determine_num_available_blocks
[rank0]: kv_cache_size = self.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm_openvino/worker/openvino_worker.py", line 532, in profile_run
[rank0]: model_profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm_openvino/worker/openvino_worker.py", line 497, in model_profile_run
[rank0]: .dummy_data_for_profiling(model_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/inputs/registry.py", line 358, in dummy_data_for_profiling
[rank0]: mm_counts = mm_registry.get_mm_limits_per_prompt(model_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/multimodal/registry.py", line 386, in get_mm_limits_per_prompt
[rank0]: return self._limits_by_model[model_config]
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/multimodal/registry.py", line 99, in getitem
[rank0]: raise KeyError(msg) from exc
[rank0]: KeyError: 'Cannot find mm_limits for model=/ai-data/ai-models/Baichuan2-7B-Base. Did you forget to call init_mm_limits_per_prompt?'

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/vllm/benchmarks/benchmark_throughput.py", line 638, in
[rank0]: main(args)
[rank0]: File "/workspace/vllm/benchmarks/benchmark_throughput.py", line 372, in main
[rank0]: elapsed_time, request_outputs = run_vllm(
[rank0]: File "/workspace/vllm/benchmarks/benchmark_throughput.py", line 40, in run_vllm
[rank0]: llm = LLM(**dataclasses.asdict(engine_args))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1099, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 248, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 522, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 498, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 285, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
[rank0]: results = self.collective_rpc("determine_num_available_blocks")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2378, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm_openvino/worker/openvino_worker.py", line 276, in determine_num_available_blocks
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: The error occurred during profile run. This might be due to insufficient GPU memory. Consider decreasing max_model_len to limit the maximum simultaneously processed tokens.
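
Note that, judging from the chained tracebacks above, the direct cause does not look like GPU memory at all: the profile run fails with a KeyError from vLLM's multimodal registry ("Cannot find mm_limits for model=... Did you forget to call init_mm_limits_per_prompt?"), and the OpenVINO worker re-raises the profile-run failure as the generic "insufficient GPU memory" RuntimeError. Below is a minimal sketch of that wrapping pattern (simplified, hypothetical code for illustration only, not the actual vllm / vllm_openvino sources):

# Hypothetical, simplified illustration of the error-wrapping pattern seen above;
# these are not the real vLLM classes.
class MultiModalLimitsRegistry:
    def __init__(self):
        self._limits_by_model = {}  # filled by init_mm_limits_per_prompt()

    def init_mm_limits_per_prompt(self, model_id, limits):
        self._limits_by_model[model_id] = limits

    def get_mm_limits_per_prompt(self, model_id):
        try:
            return self._limits_by_model[model_id]
        except KeyError as exc:
            raise KeyError(
                f"Cannot find mm_limits for model={model_id}. "
                "Did you forget to call init_mm_limits_per_prompt?") from exc

def determine_num_available_blocks(registry, model_id):
    try:
        # the profile run looks up the mm limits while building dummy inputs
        registry.get_mm_limits_per_prompt(model_id)
    except Exception as err:
        # any failure here is reported as a GPU-memory problem,
        # which is why the message above is misleading
        raise RuntimeError(
            "The error occurred during profile run. This might be due to "
            "insufficient GPU memory.") from err

determine_num_available_blocks(MultiModalLimitsRegistry(),
                               "/ai-data/ai-models/Baichuan2-7B-Base")

So the error seems to come from the mm_limits registry never being initialized for this model during the OpenVINO profile run, rather than from the GPU actually running out of memory.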
