Description
Loading vLLM 0.2.2 in a Python environment and then loading the ChatGLM2-6B-32K model raises the following NCCL-related error. The error message is:
llm = LLM(model="/home/cloud/LLM/THUDM/chatglm2-6b-32k", trust_remote_code=True)
INFO 11-20 16:58:30 llm_engine.py:72] Initializing an LLM engine with config: model='/home/cloud/LLM/THUDM/chatglm2-6b-32k', tokenizer='/home/cloud/LLM/THUDM/chatglm2-6b-32k', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
WARNING 11-20 16:58:30 tokenizer.py:66] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Traceback (most recent call last):
File "", line 1, in
File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in init
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 231, in from_engine_args
engine = cls(*engine_configs,
File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in init
self._init_workers(distributed_init_method)
File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 142, in _init_workers
self._run_workers(
File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 700, in _run_workers
output = executor(*args, **kwargs)
File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/worker/worker.py", line 65, in init_model
_init_distributed_environment(self.parallel_config, self.rank,
File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/vllm/worker/worker.py", line 406, in _init_distributed_environment
torch.distributed.all_reduce(torch.zeros(1).cuda())
File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/home/cloud/anaconda3/envs/vllm_awq/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648
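As a data point for triage, below is a minimal standalone sketch (not part of the original report) that runs the same single-process NCCL all_reduce that vllm/worker/worker.py performs in _init_distributed_environment. It assumes a single GPU and tensor_parallel_size=1, matching the config above; the MASTER_ADDR/MASTER_PORT values are arbitrary placeholders.

```python
# Hypothetical NCCL sanity check, independent of vLLM: reproduce the exact
# torch.distributed.all_reduce(torch.zeros(1).cuda()) call from the traceback
# in a one-process, one-GPU "world".
import os

import torch
import torch.distributed as dist

if __name__ == "__main__":
    # Placeholder rendezvous settings for a single-process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=0, world_size=1)

    # This is the call that raises ncclInvalidArgument in the report.
    dist.all_reduce(torch.zeros(1).cuda())

    print("NCCL all_reduce OK; torch", torch.__version__,
          "NCCL", torch.cuda.nccl.version())
    dist.destroy_process_group()
```

If this script fails with the same ncclInvalidArgument outside vLLM, that would point at the PyTorch/NCCL installation in the vllm_awq environment rather than at vLLM itself; if it succeeds, the problem is more likely specific to how vLLM initializes the distributed environment.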