[Core] Add support for Huawei Ascend NPU #41256
Conversation
Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
lg
@staticmethod
def get_resource_name() -> str:
    return "NPU"
Does NPU refer only to the Ascend NPU? NPU seems to be a generic term; is it possible that other vendors will build their own NPUs and cause a resource name conflict? Should the resource name be Ascend or AscendNPU?
NPU stands for Neural Processing Unit, and it is manufactured by various vendors, such as Huawei and Cambricon. However, currently in frameworks like PyTorch, HuggingFace, and FastChat, the term 'NPU' is used to represent the Huawei Ascend NPU, while the Cambricon NPU is represented as 'MLU'. Therefore, using 'NPU' does not seem to cause conflicts at the moment and aligns better with user conventions.
Ok, if the industry generally treats NPU and Ascend NPU the same, then we can keep the NPU resource name. In that case, the file name can just be npu.py and the class name can just be NPUAcceleratorManager; there is no need to have Ascend in them.
That makes sense. AscendNPU has been changed to NPU.
    return 0

@staticmethod
def get_current_node_accelerator_type() -> Optional[str]:
Can you add a comment and list some examples of possible values?
Sure. Comments have been added.
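For illustration, here is a hedged sketch of the kind of comment the reviewer asked for (this is not the PR's actual code; "Ascend910B" appears in the PR's own tests, while the other model names are assumptions based on Huawei's public Ascend product line):

```python
from typing import Optional


def get_current_node_accelerator_type() -> Optional[str]:
    """Return this node's NPU model name.

    Example possible values: "Ascend910B", "Ascend910", "Ascend310P".
    Returns None when no Ascend NPU (or its driver) is detected.
    """
    # The real manager queries the Ascend driver; this sketch simulates
    # a node without an NPU so the example stays self-contained.
    return None
```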
def validate_resource_request_quantity(
    quantity: float,
) -> Tuple[bool, Optional[str]]:
    return (True, None)
Does the NPU support fractional allocation, i.e. multiple processes sharing a single NPU?
Yes, the NPU supports fractional utilization, similar to GPUs. However, unlike GPUs, when using multiple NPUs (such as in model parallelism), users need to pay attention to the HCCL network configuration. To address this, we have added a warning when a task requests multiple NPUs.
@@ -0,0 +1,122 @@
import os
You need to add this test to the python/ray/tests/BUILD file in order for it to run in CI.
Thanks for the reminder. It has been added.
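For context, such an entry is a Bazel py_test target. A hedged sketch of what it might look like (the target name, tags, and deps below are assumptions modeled on Ray's usual test conventions, not the actual PR change):

```python
# Hypothetical python/ray/tests/BUILD fragment; names are illustrative.
py_test(
    name = "test_npu",
    size = "small",
    srcs = ["test_npu.py"],
    tags = ["exclusive", "team:core"],
    deps = ["//:ray_lib", ":conftest"],
)
```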
Test failed on Windows: https://buildkite.com/ray-project/premerge/builds/12396#018bf0cc-56e6-4af1-9716-e418006086dc
…ock in Win32. Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
try:
    npu_files = glob.glob("/dev/davinci?")
    return len(npu_files)
except FileNotFoundError as e:
Suggested change:
- except FileNotFoundError as e:
+ except Exception as e:
just to be safe
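Put together, the counting helper with the broader exception handling the reviewer suggested could look like this (a sketch, not the PR's exact code; the device_glob parameter is added here only to make the example testable):

```python
import glob


def get_current_node_num_accelerators(device_glob: str = "/dev/davinci?") -> int:
    """Count Ascend NPU device files under /dev.

    Ascend NPUs expose character devices named /dev/davinci0, /dev/davinci1,
    and so on, so the number of matching files is the number of NPUs.
    """
    try:
        return len(glob.glob(device_glob))
    except Exception:
        # Catch everything, "just to be safe": any failure to scan /dev is
        # treated as "no NPUs detected" rather than crashing node startup.
        return 0
```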
if quantity > 1:
    logger.warning(
        "The task is requesting multiple Ascend NPUs. "
        "If you need to build an HCCL network for NPU interconnection, "
        "please refer to the HCCL User Manual."
    )
I feel this should be in the documentation, not here. Also, does this apply when I start two actors, each with one NPU, and they want to talk to each other?
Communication between NPUs is similar to Nvidia's NVLink. If users want to achieve NPU-to-NPU communication, they need to use HCCL (similar to NCCL), whether it's within an actor or between actors. It would be appropriate to provide detailed information about this in the documentation. To which document should I add the instructions for using NPUs?
The doc doesn't exist yet. I'll let you know later and we don't need to do in this PR.
), patch.object(
    Accelerator, "get_current_node_accelerator_type", return_value="Ascend910B"
):
    os.environ["ASCEND_VISIBLE_DEVICES"] = "0,1,2"
You can use monkeypatch to change os.environ so that it will be auto-reverted when the test finishes.
Great suggestions, I've made the changes.
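A small sketch of the monkeypatch pattern the reviewer suggested; pytest.MonkeyPatch.context() is used here so the example runs outside a test function, while inside a test you would simply take the monkeypatch fixture as an argument:

```python
import os

import pytest

# Record the prior value so we can verify automatic restoration.
before = os.environ.get("ASCEND_VISIBLE_DEVICES")

with pytest.MonkeyPatch.context() as mp:
    # setenv replaces the variable for the duration of the context only.
    mp.setenv("ASCEND_VISIBLE_DEVICES", "0,1,2")
    assert os.environ["ASCEND_VISIBLE_DEVICES"] == "0,1,2"

# No manual `del os.environ[...]` needed; the change is reverted here.
assert os.environ.get("ASCEND_VISIBLE_DEVICES") == before
```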
assert manager.get_current_node_num_accelerators() == 4
assert manager.__name__ == "NPUAcceleratorManager"
assert ray.available_resources()["NPU"] == 3
del os.environ["ASCEND_VISIBLE_DEVICES"]
same here
Congrats on your first Ray PR merge!
Why are these changes needed?
Ascend NPU is a specialized hardware accelerator designed by Huawei. It is known for its powerful computing capabilities and strong support for deep learning frameworks such as TensorFlow and PyTorch.
This PR aims to add support for the Ascend NPU in Ray.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.