
[Hardware][Ascend] Add Ascend NPU backend #8054

Draft
wants to merge 29 commits into main from npu_support

Conversation

wangshuai09

@wangshuai09 wangshuai09 commented Aug 31, 2024

As mentioned in #7692, this PR makes the Ascend NPU backend available in vLLM.

RoadMap:

  • Ascend Executor
  • Ascend Worker
  • Ascend Model Runner
  • Ascend SingleOps Backend
    • custom_ops with native impl
    • padding for multi prompts
    • update vllm/attention/backends/ascend.py to the latest version.
    • model inference: opt, llama
    • multiproc
  • Platform for Ascend NPU
  • Server
  • Unit-test

Support Device

  • Atlas 800I A2 Inference Server
  • Atlas 800T A2 Training Server
  • Atlas 300T A2 Training Card

Install

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vllm.
  3. Test with python examples/offline_inference_npu.py (a minimal sketch of such a script is shown after this list).
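
For reference, here is a minimal offline-inference sketch along the lines of examples/offline_inference_npu.py; the model and prompt below are placeholders, and the actual script in this branch may differ.

from vllm import LLM, SamplingParams

# Placeholder prompt and model; the real examples/offline_inference_npu.py may differ.
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.0, max_tokens=20)

# With vllm installed via VLLM_TARGET_DEVICE=npu, this should run on the Ascend backend.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")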

Using Dockerfile.npu

  1. Clone the npu_support branch and step into vllm:
git clone -b npu_support https://github.com/wangshuai09/vllm.git
cd vllm
  2. Build the docker image:
docker build -t vllm-npu -f Dockerfile.npu .
  3. Run the docker container (modify --device /dev/davinci0 according to your device):
docker run -dit -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /etc/ascend_install.info:/etc/ascend_install.info --device /dev/davinci0 --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc --shm-size 16G --name vllm vllm-npu:latest bash
  4. Enter the container:
docker exec -it vllm bash

Collaborators

@MengqingCao @dgy516 @hi-liuyifeng @Lin-Qingyang-Alec @liujie92 @JiasenTian @weiwei567 @JuntongMa @xiangjie
@zhangxy1234 @ldh2020 @Eviannn @agoodnoob @rumoralot

This work is still at the WIP stage.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@zer0py2c

zer0py2c commented Sep 1, 2024

Is there any document on how to use it?

@wangshuai09
Author

Is there any document on how to use it?

This work is not ready yet. If you want to develop it together, follow these steps:

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vllm.
  3. Test with python examples/offline_inference_npu.py; only a single prompt is supported for now.

@zer0py2c

zer0py2c commented Sep 2, 2024

Is there any document on how to use it?

This work is not ready yet. If you want to develop it together, follow these steps:

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vllm.
  3. Test with python examples/offline_inference_npu.py; only a single prompt is supported for now.

Thank you very much, I'll try it.

@wyzanski

wyzanski commented Sep 2, 2024

[screenshot of error]
I followed the above steps and got the error shown in the screenshot. What could be the reason?

@wangshuai09
Author

@wyzanski There is a fatal error related to git; I think you may need to recheck your git config.

@Aiwenqiuyu

Looking forward to support for Chinese domestic hardware!

@jkl375

jkl375 commented Sep 11, 2024

Thanks for supporting Chinese domestic hardware!

@MengqingCao

MengqingCao commented Sep 11, 2024

TODO:

  • update vllm/attention/backends/ascend.py to the latest version.

@XYZliang

Thanks for supporting Chinese domestic hardware! Looking forward to how it performs on the Ascend series; an efficient inference engine is sorely needed.

@beardog6

Is online inference supported?

@wangshuai09
Author

wangshuai09 commented Sep 18, 2024

Is online inference supported?

Do you mean starting an OpenAI-compatible API server? The latest code already supports this, for example:

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 20,
    "temperature": 0
}'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}

@XYZliang

What Ascend NPU devices are currently supported?
The latest version of lmdeploy also supports Ascend NPU, but only 910B and 310P are supported, as other devices lack the operator support they require and will need to wait for CANN implementation. I encounter errors when testing with the 910A.
However, it seems that most users are using Ascend 910A. Is it possible to adapt it directly?

@WangxuP

WangxuP commented Sep 18, 2024

Is online inference supported?

Do you mean starting an OpenAI-compatible API server? The latest code already supports this (see the server and curl example above).

Is the Qwen series of LLMs supported?

@wangshuai09
Author

Hi @XYZliang, we don't have a device with this chip type; maybe you could test on your device with the latest code?

@wangshuai09
Author

@WangxuP We have not verified model correctness yet; here is a simple offline result:

INFO 09-18 10:03:24 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-18 10:03:24 selector.py:161] Using ASCEND_TORCH backend.
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
INFO 09-18 10:03:33 npu_model_runner.py:319] Starting to load model Qwen/Qwen2-7B-Instruct...
INFO 09-18 10:03:33 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-18 10:03:33 selector.py:161] Using ASCEND_TORCH backend.
INFO 09-18 10:03:34 weight_utils.py:235] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.90it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.43it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]

INFO 09-18 10:03:39 npu_model_runner.py:330] Loading model weights took 14.2487 GB
/workspace/cmq/ws-code/vllm/vllm/model_executor/layers/sampler.py:437: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:74.)
  top_p_mask[:, -1] = False
INFO 09-18 10:03:45 gpu_executor.py:122] # GPU blocks: 37996, # CPU blocks: 4681
Processed prompts: 100%|████████| 2/2 [00:04<00:00,  2.34s/it, est. speed input: 2.56 toks/s, output: 42.72 toks/s]
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president is the commander-in-chief of the armed forces, the head of the executive branch, and is responsible for enforcing federal laws, taking care that federal laws are faithfully executed, and serving as the commander in chief of the armed forces. The president is also the head of state and represents the nation to foreign governments and to the world at large. The president is the chief diplomat, the chief executive, and the chief legislator of'
Prompt: 'The future of AI is', Generated text: " here, and it's not just about robots and self-driving cars. AI is transforming every industry, from healthcare to finance, and it's changing the way we live and work. In this article, we'll explore the latest advancements in AI and how they're impacting our world.\nOne of the most exciting areas of AI research is natural language processing (NLP). NLP is the ability of machines to understand and interpret human language. This technology is being used to create chatbots, virtual assistants,"

@RogerWYQ

Should we install MindIE first?

@zhangzhiqiangcs

Is there a Dockerfile for NPU to build the image?

@wangshuai09
Author

Thanks, and looking forward to the remaining features. Also, with the current version on 910B, inference performance lags well behind MindIE: qwen1.5-7b-chat runs at 20 tokens/s here, while MindIE reaches 38 tokens/s. Could the reason be that the FA operator is not enabled? The startup log shows: INFO 09-26 09:35:03 selector.py:222] Cannot use _Backend.FLASH_ATTN backend on NPU.

Flash attention is used by the Ascend backend in attention/backends/ascend.py; this log only shows that the Ascend backend does not use the default FLASH_ATTN backend implemented in vLLM. There is still room for improving performance, such as fused ops in custom_op.

@verigle

verigle commented Sep 27, 2024

Are there plans to adapt qwen2-vl in Ascend vLLM?

@MengqingCao

Are there plans to adapt qwen2-vl in Ascend vLLM?

Support for VLMs is on our to-do list, including qwen2-vl.

@Yikun Yikun mentioned this pull request Sep 28, 2024
@XYZliang

XYZliang commented Oct 8, 2024

What Ascend NPU devices are currently supported? The latest version of lmdeploy also supports Ascend NPU, but only 910B and 310P are supported, as other devices lack the operator support they require and will need to wait for CANN implementation. I encounter errors when testing with the 910A. However, it seems that most users are using Ascend 910A. Is it possible to adapt it directly?

Bro, Ascend inference cards with the 310P chip don't really work — I tested with LMDeploy v0.6.0. :(

I asked the LMDeploy team; their adaptation seems to have been done on 310P machines.

@wangyuanxiong-hub

Are there any plans for the adaptation of 300I Duo?

@1737686924

Are there plans to adapt qwen2-vl in Ascend vLLM?

Support for VLMs is on our to-do list, including qwen2-vl.

Is qwen2-vl supported on NPU now? Is there a corresponding PR to reference?

@wangshuai09
Author

Are there any plans for the adaptation of 300I Duo?

Supporting the 300I Duo is on our to-do list, but it's not a high priority at the moment.

@MengqingCao

Are there plans to adapt qwen2-vl in Ascend vLLM?

Support for VLMs is on our to-do list, including qwen2-vl.

Is qwen2-vl supported on NPU now? Is there a corresponding PR to reference?

Not supported currently.

@ccly1996

How can I modify the model's runtime config? Currently, when running qwen1.5 7b on a 910B, I am told to adjust a parameter.
[screenshot]

@xymak

xymak commented Oct 11, 2024

[quotes the full PR description above]

@wangshuai09 @MengqingCao

Do you have contact information? Could you share an email address? I'd like to talk with you; our company would like to commit two people to this development. Or you can contact me at xiyuanmak@gmail.com.

@wangshuai09
Author

@ccly1996 You can refer to

max_model_len=2048,
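
A minimal sketch of where such a setting could be passed; the model name and values below are placeholders rather than the referenced code.

from vllm import LLM, SamplingParams

# Cap the context length so the KV cache fits into NPU memory
# (placeholder model and value; adjust to your setup).
llm = LLM(model="Qwen/Qwen1.5-7B-Chat", max_model_len=2048)

outputs = llm.generate(
    ["Tell me a short story"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)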

@ccly1996

@ccly1996 You can refer to

max_model_len=2048,

Thanks. Can a model deployed on NPU now be accessed through the OpenAI API? Also, is vLLM's FLASH_ATTN supported yet?

@MengqingCao

Thanks. Can a model deployed on NPU now be accessed through the OpenAI API?

Yes, you can use the OpenAI API server on Ascend NPU now.

Also, is vLLM's FLASH_ATTN supported yet?

Flash attention is supported through operators in torch_npu, instead of flash_attn.
Just ignore the warning Cannot use _Backend.FLASH_ATTN backend on NPU.

@ccly1996

Thanks. Can a model deployed on NPU now be accessed through the OpenAI API?

Yes, you can use the OpenAI API server on Ascend NPU now.

Also, is vLLM's FLASH_ATTN supported yet?

Flash attention is supported through operators in torch_npu, instead of flash_attn. Just ignore the warning Cannot use _Backend.FLASH_ATTN backend on NPU.

Is just running vllm serve model enough? Starting it this way currently fails with the error "cannot import name PoolingParams from vllm".

@MengqingCao

Is just running vllm serve model enough? Starting it this way currently fails with the error "cannot import name PoolingParams from vllm".

You can start a server by running a command like python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m ... (other configs), and then use a client script (example) to access it.

@forrestjgq

@wangshuai09 I got an error when running offline inference from the example; could you give me some advice?

root@k8s-master-78:/home/gqjiang/vllm# python3 examples/offline_inference_npu.py
INFO 10-16 07:49:16 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 3.21MB/s]
WARNING 10-16 07:49:20 config.py:380] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 10-16 07:49:20 llm_engine.py:234] Initializing an LLM engine (v0.1.dev2868+ga02b772) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 4.05MB/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:01<00:00, 728kB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.27MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 2.49MB/s]
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 853kB/s]
INFO 10-16 07:49:26 selector.py:221] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 10-16 07:49:26 selector.py:151] Using ASCEND backend.
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
INFO 10-16 07:49:40 model_runner.py:1024] Starting to load model facebook/opt-125m...
INFO 10-16 07:49:40 selector.py:221] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 10-16 07:49:40 selector.py:151] Using ASCEND backend.
INFO 10-16 07:49:41 weight_utils.py:242] Using model weights format ['*.bin']
pytorch_model.bin:  29%|████████████████████████████████████████████████▋                                                                                                                     | 73.4M/251M [00:12<00:26, 6.63MB/s]pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:47<00:00, 5.30MB/s]
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.09it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.09it/s]

INFO 10-16 07:50:32 model_runner.py:1035] Loading model weights took 0.2389 GB
INFO 10-16 07:50:32 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241016-075032.pkl...
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
INFO 10-16 07:51:01 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241016-075032.pkl.
Traceback (most recent call last):
  File "/home/gqjiang/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/worker/model_runner.py", line 1608, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/model_executor/models/opt.py", line 326, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/model_executor/models/opt.py", line 291, in forward
    return self.decoder(input_ids,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/model_executor/models/opt.py", line 260, in forward
    hidden_states = layer(hidden_states, kv_caches[i], attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/model_executor/models/opt.py", line 162, in forward
    hidden_states = self.self_attn(hidden_states=hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/model_executor/models/opt.py", line 105, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
  File "/home/gqjiang/vllm/vllm/attention/backends/ascend.py", line 473, in forward
    output = torch_npu.npu_prompt_flash_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 692, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: call aclnnPromptFlashAttentionV3 failed, detail:EZ1001: 2024-10-16-07:50:32.915.202 PromptFlashAttention LaunchAicore failed.
        TraceBack (most recent call last):
        attention mask must be NULL,when Qs,Kvs is unAlign or Qs is not equal to Kvs, Qs = 8, Kvs = 8[FUNC:RunBigKernelTilingWithParams][FILE:prompt_flash_attention_tiling.cpp][LINE:2081]
        Tiling failed
        Tiling Failed.
        Kernel GetWorkspace failed. opType: 4
        PromptFlashAttention LaunchAicore failed.

[ERROR] 2024-10-16-07:50:32 (PID:1428, Device:0, RankID:-1) ERR01005 OPS internal error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gqjiang/vllm/examples/offline_inference_npu.py", line 29, in <module>
    llm = LLM(model="facebook/opt-125m")
  File "/home/gqjiang/vllm/vllm/entrypoints/llm.py", line 214, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/gqjiang/vllm/vllm/engine/llm_engine.py", line 585, in from_engine_args
    engine = cls(
  File "/home/gqjiang/vllm/vllm/engine/llm_engine.py", line 349, in __init__
    self._initialize_kv_caches()
  File "/home/gqjiang/vllm/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/home/gqjiang/vllm/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/worker/npu_worker.py", line 148, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/worker/npu_model_runner.py", line 271, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/gqjiang/vllm/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241016-075032.pkl): call aclnnPromptFlashAttentionV3 failed, detail:EZ1001: 2024-10-16-07:50:32.915.202 PromptFlashAttention LaunchAicore failed.
        TraceBack (most recent call last):
        attention mask must be NULL,when Qs,Kvs is unAlign or Qs is not equal to Kvs, Qs = 8, Kvs = 8[FUNC:RunBigKernelTilingWithParams][FILE:prompt_flash_attention_tiling.cpp][LINE:2081]
        Tiling failed
        Tiling Failed.
        Kernel GetWorkspace failed. opType: 4
        PromptFlashAttention LaunchAicore failed.

[ERROR] 2024-10-16-07:50:32 (PID:1428, Device:0, RankID:-1) ERR01005 OPS internal error

@sushe2111

sushe2111 commented Oct 17, 2024

I got a warning; is this a problem?
/torch_npu/distributed/distributed_c10d.py:110: UserWarning: HCCL doesn't support gather at the moment. Implemented with allgather instead.
warnings.warn("HCCL doesn't support gather at the moment. Implemented with allgather instead.")

@WWCTF

WWCTF commented Oct 17, 2024

I am developing on 310P. Inference on a 300I DUO produces garbled output; compared with a successful run on 910B, there is one extra warning.
[screenshot]
Is this a hardware or a software issue, and how should I handle it?

@MengqingCao

UserWarning: HCCL doesn't support gather at the moment. Implemented with allgather instead.

This does not affect existing functionality; it only indicates that the allgather operator is used instead of gather for communication between NPU cards.
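
For readers unfamiliar with the two collectives, here is a rough, hypothetical illustration of the substitution the warning describes, written with plain torch.distributed and the gloo backend; it is not the actual torch_npu code.

import os
import torch
import torch.distributed as dist

def gather_via_allgather(tensor, dst=0):
    # Emulate gather with all_gather: every rank receives all shards,
    # but only the destination rank keeps the result.
    shards = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, tensor)  # every rank gets every shard
    return shards if dist.get_rank() == dst else None

if __name__ == "__main__":
    # Single-process demo; on Ascend NPU the backend would be "hccl".
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    print(gather_via_allgather(torch.arange(4)))
    dist.destroy_process_group()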

@MengqingCao

I am developing on 310P. Inference on a 300I DUO produces garbled output; compared with a successful run on 910B, there is one extra warning.

310P is not supported currently. The args passed into the FA operators are a little different on 310P, which may be causing the wrong inference results.
