
Model loading becomes very slow after updating the vllm version #2959

Closed
jony0113 opened this issue Feb 21, 2024 · 6 comments

Comments

@jony0113

I am using vllm with Mixtral 8x7B; with vllm version 0.2.6 it works well.
I tried updating to the latest version, 0.3.1; however, loading the model weights becomes very slow: it takes almost 40 minutes vs. 5 minutes with the old version. I don't know why this happens, since I did not change any other environment settings or parameters.

I have checked the differences in model loading in mixtral.py and didn't find any clues.

I downloaded the model, and my parameters are:

Namespace(host='0.0.0.0', port=28711, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='mixtral', chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/data0/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
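
For reference, a minimal Python sketch of the same engine construction (the report uses the OpenAI API server; the path and settings below are copied from the Namespace above, everything else is an assumption):

# Minimal sketch: build the engine directly with the Python API and time how long
# weight loading takes. Path and key settings are taken from the Namespace above.
import time
from vllm import LLM

start = time.perf_counter()
llm = LLM(
    model="/data0/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    dtype="auto",
    gpu_memory_utilization=0.9,
    seed=0,
    enforce_eager=False,
)
print(f"Engine ready after {time.perf_counter() - start:.1f} s")

Per the report, this stage finishes in about 5 minutes on 0.2.6 but takes roughly 40 minutes on 0.3.1.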

The log hangs for about 40 minutes at:

Initializing an LLM engine with config: model='/data0/Mixtral-8x7B-Instruct-v0.1', tokenizer='/data0/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)

More information:
the model I downloaded is stored on HDFS and mounted into the k8s pod via a PVC, which has a bandwidth limit of 800M/s. I checked the monitoring, and the throughput did not exceed that limit while the model was loading.
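
One way to rule out storage as the bottleneck is to time a raw sequential read of the weight shards from the mounted path. A rough sketch (the *.safetensors glob is an assumption; adjust it to the actual shard filenames):

# Rough I/O check: read every weight shard once and report the effective throughput
# of the PVC mount. This only measures storage speed, not vLLM itself.
import glob
import time

paths = sorted(glob.glob("/data0/Mixtral-8x7B-Instruct-v0.1/*.safetensors"))
total_bytes = 0
start = time.perf_counter()
for path in paths:
    with open(path, "rb") as f:
        while chunk := f.read(64 * 1024 * 1024):  # 64 MiB chunks
            total_bytes += len(chunk)
elapsed = time.perf_counter() - start
print(f"read {total_bytes / 1e9:.1f} GB in {elapsed:.0f} s "
      f"({total_bytes / 1e9 / elapsed:.2f} GB/s)")

For scale: the bf16 checkpoint is roughly 90 GB, so if the 800M/s cap means megabytes per second, a full sequential read should take on the order of a couple of minutes per rank; raw bandwidth alone would not account for 40 minutes.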

Sorry to bother you, but do you have any idea about this, since mixtral.py has been changed since 0.2.6? @WoosukKwon @zhuohan123 @pcmoritz @tterrysun

@jony0113 (Author)

I tried 0.2.7 and it also works well, but with 0.3.0 it hangs for 40 minutes, so I'm sure some change introduced in 0.3.0 leads to this issue.

@zhourrr commented Feb 23, 2024

Can confirm, distributed inference works with 0.2.7.
I tested with 4 GPUs across 2 nodes (2 GPUs on each node) and loaded mistralai/Mixtral-8x7B-v0.1 onto them. This worked well with vllm==0.2.7, but with vllm==0.3.1 it just hangs (for at least 20 minutes).

Edit: distributed inference with GPUs on the same node works well; only when using GPUs across nodes did I run into this slow loading.

Thank you!
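
For reference, a minimal sketch of the cross-node timing test described above, assuming a Ray cluster already spans the two nodes (2 GPUs each) and that vLLM reuses it:

# Sketch of the 2-node / 4-GPU load test described above. Assumes a Ray cluster
# connecting both nodes was started beforehand (e.g. with ray start on each node).
import time
import ray
from vllm import LLM

ray.init(address="auto")  # attach to the existing Ray cluster

start = time.perf_counter()
llm = LLM(model="mistralai/Mixtral-8x7B-v0.1", tensor_parallel_size=4)
print(f"loaded across nodes in {time.perf_counter() - start:.1f} s")

Per the comment, this returns promptly on vllm==0.2.7 but stalls for 20+ minutes on 0.3.1 when the four GPUs are split across nodes.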

@esmeetu (Collaborator) commented Feb 29, 2024

@jony0113 @zhourrr Could you try the latest main branch and run with eager mode?
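
For context, eager mode disables CUDA graph capture; with the API server it corresponds to the --enforce-eager flag shown in the Namespace above, and in the Python API it looks roughly like this (path and tensor_parallel_size reused from the report):

# Eager mode via the Python API (equivalent to passing --enforce-eager to the server).
from vllm import LLM

llm = LLM(
    model="/data0/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    enforce_eager=True,  # skip CUDA graph capture
)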

@zhourrr commented Feb 29, 2024

@jony0113 @zhourrr Could you try the latest main branch and run with eager mode?

I changed the vllm library code as in #3037 and it works, thanks! I didn't test the main branch though, because my server environment doesn't support building from source.

@jony0113 (Author) commented Mar 3, 2024

@jony0113 @zhourrr Could you try the latest main branch and run with eager mode?

Sorry for the late reply. I updated to the latest version (0.3.3) today and started the OpenAI API server with --enforce-eager; unfortunately, the issue doesn't seem to be solved.

@DarkLight1337 (Collaborator)

We have added documentation for this situation in #5430. Please take a look.
