vLLM fails to load ChatGLM2-6B-32K with an NCCL error #1723

@ZEROYXY

Description


When loading the ChatGLM2-6B-32K model with vLLM 0.2.2 from a Python environment, the following NCCL-related error is raised:

```
llm = LLM(model="/home/cloud/LLM/THUDM/chatglm2-6b-32k", trust_remote_code=True)
INFO 11-20 16:58:30 llm_engine.py:72] Initializing an LLM engine with config: model='/home/cloud/LLM/THUDM/chatglm2-6b-32k', tokenizer='/home/cloud/LLM/THUDM/chatglm2-6b-32k', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0
WARNING 11-20 16:58:30 tokenizer.py:66] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 231, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self._init_workers(distributed_init_method)
  File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 142, in _init_workers
    self._run_workers(
  File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 700, in _run_workers
    output = executor(*args, **kwargs)
  File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/worker/worker.py", line 65, in init_model
    _init_distributed_environment(self.parallel_config, self.rank,
  File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/worker/worker.py", line 406, in _init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648
```
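One hint worth noting (my reading of the log, not stated in the original report): the rejected value `-2147483648` is exactly `INT_MIN`, which newer PyTorch builds use as the "unset" sentinel for fields of NCCL's `ncclConfig_t` (defined as `NCCL_CONFIG_UNDEF_INT` in `nccl.h`). NCCL 2.14.3 predates that sentinel and rejects it as an invalid `blocking` value. A minimal sanity check of this reading, plain Python with no torch/CUDA needed:

```python
# The NCCL error complains about the communicator config's `blocking`
# attribute holding -2147483648. That is the minimum 32-bit signed
# integer (INT_MIN), i.e. an "undefined/unset" sentinel rather than a
# real 0/1 blocking flag.
INT_MIN = -(2 ** 31)

print(INT_MIN)                      # -2147483648
print(INT_MIN == -2147483648)      # True: the value in the error is INT_MIN
```

If this diagnosis is right, the mismatch is between the PyTorch build (which emits the sentinel) and the older NCCL runtime being loaded; upgrading the NCCL library to a version that understands `ncclConfig_t` sentinels, or using a PyTorch build whose bundled NCCL matches, would be the likely workaround. Treat this as an assumption to verify, not a confirmed fix.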
