
Model loading becomes very slow after updating the vllm version #2959

Closed
jony0113 opened this issue Feb 21, 2024 · 6 comments

Comments

@jony0113

I am using vllm with Mixtral 8x7B; with vllm version 0.2.6 it works well.
I tried updating to the latest version, 0.3.1; however, loading the model weights becomes very slow: it takes almost 40 minutes vs. 5 minutes with the old version. I don't know why this happens, since I did not change any other environment settings or parameters.

I have checked the differences in model loading in mixtral.py and didn't find any clues.

I downloaded the model, and my parameters are:

Namespace(host='0.0.0.0', port=28711, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='mixtral', chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/data0/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
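
For reference, a minimal Python sketch of the same engine construction (the report uses the OpenAI API server; the path and settings below are copied from the Namespace above, everything else is an assumption):

# Minimal sketch: build the engine directly with the Python API and time how long
# weight loading takes. Path and key settings are taken from the Namespace above.
import time
from vllm import LLM

start = time.perf_counter()
llm = LLM(
    model="/data0/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    dtype="auto",
    gpu_memory_utilization=0.9,
    seed=0,
    enforce_eager=False,
)
print(f"Engine ready after {time.perf_counter() - start:.1f} s")

Per the report, this stage finishes in about 5 minutes on 0.2.6 but takes roughly 40 minutes on 0.3.1.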

The log hangs for about 40 minutes at:

Initializing an LLM engine with config: model='/data0/Mixtral-8x7B-Instruct-v0.1', tokenizer='/data0/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)

More information:
the model I downloaded is stored on HDFS and mounted into the k8s pod via a PVC, which has a bandwidth limit of 800M/s. I checked the monitoring, and the throughput did not exceed that limit while the model was loading.
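
One way to rule out storage as the bottleneck is to time a raw sequential read of the weight shards from the mounted path. A rough sketch (the *.safetensors glob is an assumption; adjust it to the actual shard filenames):

# Rough I/O check: read every weight shard once and report the effective throughput
# of the PVC mount. This only measures storage speed, not vLLM itself.
import glob
import time

paths = sorted(glob.glob("/data0/Mixtral-8x7B-Instruct-v0.1/*.safetensors"))
total_bytes = 0
start = time.perf_counter()
for path in paths:
    with open(path, "rb") as f:
        while chunk := f.read(64 * 1024 * 1024):  # 64 MiB chunks
            total_bytes += len(chunk)
elapsed = time.perf_counter() - start
print(f"read {total_bytes / 1e9:.1f} GB in {elapsed:.0f} s "
      f"({total_bytes / 1e9 / elapsed:.2f} GB/s)")

For scale: the bf16 checkpoint is roughly 90 GB, so if the 800M/s cap means megabytes per second, a full sequential read should take on the order of a couple of minutes per rank; raw bandwidth alone would not account for 40 minutes.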

Sorry to bother you, but do you have any idea about this, since mixtral.py has been changed since 0.2.6? @WoosukKwon @zhuohan123 @pcmoritz @tterrysun

@jony0113 (Author)

I tried 0.2.7 and it also works well, but with 0.3.0 it hangs for 40 minutes, so I'm sure some change introduced in 0.3.0 leads to this issue.

@zhourrr commented Feb 23, 2024

Can confirm, distributed inference works with 0.2.7.
I tested with 4 GPUs across 2 nodes (2 GPUs on each node) and loaded mistralai/Mixtral-8x7B-v0.1 onto them. This worked well with vllm==0.2.7, but with vllm==0.3.1 it just hangs (for at least 20 minutes).

Edit: distributed inference with GPUs on the same node works well; only when using GPUs across nodes did I run into this slow loading.

Thank you!
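
For reference, a minimal sketch of the cross-node timing test described above, assuming a Ray cluster already spans the two nodes (2 GPUs each) and that vLLM reuses it:

# Sketch of the 2-node / 4-GPU load test described above. Assumes a Ray cluster
# connecting both nodes was started beforehand (e.g. with ray start on each node).
import time
import ray
from vllm import LLM

ray.init(address="auto")  # attach to the existing Ray cluster

start = time.perf_counter()
llm = LLM(model="mistralai/Mixtral-8x7B-v0.1", tensor_parallel_size=4)
print(f"loaded across nodes in {time.perf_counter() - start:.1f} s")

Per the comment, this returns promptly on vllm==0.2.7 but stalls for 20+ minutes on 0.3.1 when the four GPUs are split across nodes.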

@esmeetu (Collaborator) commented Feb 29, 2024

@jony0113 @zhourrr Could you try the latest main branch and run with eager mode?
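
For context, eager mode disables CUDA graph capture; with the API server it corresponds to the --enforce-eager flag shown in the Namespace above, and in the Python API it looks roughly like this (path and tensor_parallel_size reused from the report):

# Eager mode via the Python API (equivalent to passing --enforce-eager to the server).
from vllm import LLM

llm = LLM(
    model="/data0/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    enforce_eager=True,  # skip CUDA graph capture
)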

@zhourrr commented Feb 29, 2024

@jony0113 @zhourrr Could you try the latest main branch and run with eager mode?

I changed the vllm library code as in #3037 and it works, thanks! I didn't test the main branch though, because my server environment doesn't support building from source.

@jony0113 (Author) commented Mar 3, 2024

@jony0113 @zhourrr Could you try the latest main branch and run with eager mode?

Sorry for the late reply. I updated to the latest version (0.3.3) today and started the OpenAI API server with --enforce-eager; unfortunately, the issue doesn't seem to be solved.

@DarkLight1337 (Collaborator)

We have added documentation for this situation in #5430. Please take a look.
