[Core] Add support for Huawei Ascend NPU #41256
Conversation
Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
lg
@staticmethod
def get_resource_name() -> str:
    return "NPU"
Does NPU refer only to the Ascend NPU? NPU seems to be a generic term; is it possible that other vendors will build their own NPUs and cause a resource name conflict? Should the resource name be Ascend or AscendNPU?
NPU stands for Neural Processing Unit, and it is manufactured by various vendors, such as Huawei and Cambricon. However, currently in frameworks like PyTorch, HuggingFace, and FastChat, the term 'NPU' is used to represent the Huawei Ascend NPU, while the Cambricon NPU is represented as 'MLU'. Therefore, using 'NPU' does not seem to cause conflicts at the moment and aligns better with user conventions.
Ok, if the industry generally treats NPU and Ascend NPU the same, then we can keep the NPU resource name. In that case, the file name can just be npu.py and the class name can just be NPUAcceleratorManager; there is no need to have Ascend in them.
That makes sense. AscendNPU has been changed to NPU.
    return 0

@staticmethod
def get_current_node_accelerator_type() -> Optional[str]:
Can you add a comment and list some examples of possible values?
Sure. Comments have been added.
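For illustration, here is a hedged sketch of the kind of comment the reviewer asked for (this is not the PR's actual code; "Ascend910B" appears in the PR's own tests, while the other model names are assumptions based on Huawei's public Ascend product line):

```python
from typing import Optional


def get_current_node_accelerator_type() -> Optional[str]:
    """Return this node's NPU model name.

    Example possible values: "Ascend910B", "Ascend910", "Ascend310P".
    Returns None when no Ascend NPU (or its driver) is detected.
    """
    # The real manager queries the Ascend driver; this sketch simulates
    # a node without an NPU so the example stays self-contained.
    return None
```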
def validate_resource_request_quantity(
    quantity: float,
) -> Tuple[bool, Optional[str]]:
    return (True, None)
Does the NPU support fractional allocation, i.e. multiple processes sharing a single NPU?
Yes, the NPU supports fractional utilization, similar to GPUs. However, unlike GPUs, when using multiple NPUs (such as in model parallelism), users need to pay attention to the HCCL network configuration. To address this, we have added a warning when a task requests multiple NPUs.
@@ -0,0 +1,122 @@
import os
You need to add this test to the python/ray/tests/BUILD file in order for it to run in CI.
Thanks for the reminder. It has been added.
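For context, such an entry is a Bazel py_test target. A hedged sketch of what it might look like (the target name, tags, and deps below are assumptions modeled on Ray's usual test conventions, not the actual PR change):

```python
# Hypothetical python/ray/tests/BUILD fragment; names are illustrative.
py_test(
    name = "test_npu",
    size = "small",
    srcs = ["test_npu.py"],
    tags = ["exclusive", "team:core"],
    deps = ["//:ray_lib", ":conftest"],
)
```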
Test failed on Windows: https://buildkite.com/ray-project/premerge/builds/12396#018bf0cc-56e6-4af1-9716-e418006086dc
…ock in Win32. Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
try:
    npu_files = glob.glob("/dev/davinci?")
    return len(npu_files)
except FileNotFoundError as e:
Suggested change:
- except FileNotFoundError as e:
+ except Exception as e:
just to be safe
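Put together, the counting helper with the broader exception handling the reviewer suggested could look like this (a sketch, not the PR's exact code; the device_glob parameter is added here only to make the example testable):

```python
import glob


def get_current_node_num_accelerators(device_glob: str = "/dev/davinci?") -> int:
    """Count Ascend NPU device files under /dev.

    Ascend NPUs expose character devices named /dev/davinci0, /dev/davinci1,
    and so on, so the number of matching files is the number of NPUs.
    """
    try:
        return len(glob.glob(device_glob))
    except Exception:
        # Catch everything, "just to be safe": any failure to scan /dev is
        # treated as "no NPUs detected" rather than crashing node startup.
        return 0
```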
if quantity > 1:
    logger.warning(
        "The task is requesting multiple Ascend NPUs. "
        "If you need to build an HCCL network for NPU interconnection, "
        "please refer to the HCCL User Manual."
    )
I feel this should be in the documentation, not here. Also, does this apply when I start two actors, each with one NPU, and they want to talk to each other?
Communication between NPUs is similar to Nvidia's NVLink. If users want to achieve NPU-to-NPU communication, they need to use HCCL (similar to NCCL), whether it's within an actor or between actors. It would be appropriate to provide detailed information about this in the documentation. To which document should I add the instructions for using NPUs?
The doc doesn't exist yet. I'll let you know later and we don't need to do in this PR.
), patch.object(
    Accelerator, "get_current_node_accelerator_type", return_value="Ascend910B"
):
    os.environ["ASCEND_VISIBLE_DEVICES"] = "0,1,2"
You can use monkeypatch to change os.environ so that it will be auto-reverted when the test finishes.
Great suggestions, I've made the changes.
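A small sketch of the monkeypatch pattern the reviewer suggested; pytest.MonkeyPatch.context() is used here so the example runs outside a test function, while inside a test you would simply take the monkeypatch fixture as an argument:

```python
import os

import pytest

# Record the prior value so we can verify automatic restoration.
before = os.environ.get("ASCEND_VISIBLE_DEVICES")

with pytest.MonkeyPatch.context() as mp:
    # setenv replaces the variable for the duration of the context only.
    mp.setenv("ASCEND_VISIBLE_DEVICES", "0,1,2")
    assert os.environ["ASCEND_VISIBLE_DEVICES"] == "0,1,2"

# No manual `del os.environ[...]` needed; the change is reverted here.
assert os.environ.get("ASCEND_VISIBLE_DEVICES") == before
```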
assert manager.get_current_node_num_accelerators() == 4
assert manager.__name__ == "NPUAcceleratorManager"
assert ray.available_resources()["NPU"] == 3
del os.environ["ASCEND_VISIBLE_DEVICES"]
same here
Congrats on your first Ray PR merge!
Why are these changes needed?
Ascend NPU is a specialized hardware accelerator designed by Huawei. It is known for its powerful computing capabilities and strong support for deep learning frameworks such as TensorFlow and PyTorch.
This PR aims to add support for the Ascend NPU in Ray.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.