
[Core] Add support for Huawei Ascend NPU #41256

Merged: 11 commits into ray-project:master on Nov 27, 2023

Conversation

@liuxsh9 (Contributor) commented Nov 18, 2023

Why are these changes needed?

Ascend NPU is a specialized hardware accelerator designed by Huawei. It is known for its powerful computing capabilities and strong support for deep learning frameworks such as TensorFlow and PyTorch.

This PR aims to add support for the Ascend NPU in Ray.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
@jjyao self-assigned this Nov 21, 2023
@jjyao (Contributor) left a comment:

lg


    @staticmethod
    def get_resource_name() -> str:
        return "NPU"
Contributor:

Does NPU only refer to the Ascend NPU (NPU seems to be a generic term)? Is it possible that other vendors will build their own NPUs and create a resource name conflict?

Should the resource name be Ascend or AscendNPU instead?

Contributor Author (@liuxsh9):

NPU stands for Neural Processing Unit; NPUs are manufactured by various vendors, such as Huawei and Cambricon. However, in frameworks like PyTorch, HuggingFace, and FastChat, the term 'NPU' currently refers to the Huawei Ascend NPU, while the Cambricon NPU is represented as 'MLU'. Therefore, using 'NPU' does not seem to cause conflicts at the moment, and it aligns better with user conventions.

Contributor:

Ok, if the industry generally treats NPU and Ascend NPU the same, then we can keep the NPU resource name. In that case, the file name can just be npu.py and the class name can just be NPUAcceleratorManager; there is no need to have Ascend in them.

Contributor Author (@liuxsh9):

That makes sense. AscendNPU has been changed to NPU.
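
For illustration, once the manager reports its resources under the name "NPU", a task requests them with Ray's ordinary resource syntax. A minimal sketch (the task body is a placeholder):

    import ray

    ray.init()  # on a node with Ascend NPUs, the detected count is
                # registered under the "NPU" resource name

    @ray.remote(resources={"NPU": 1})
    def train_step():
        # Placeholder body; Ray reserves one unit of the "NPU"
        # resource for the duration of the task.
        return "ran with one NPU reserved"

    print(ray.get(train_step.remote()))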

        return 0

    @staticmethod
    def get_current_node_accelerator_type() -> Optional[str]:
Contributor:

Can you add a comment and list some examples of possible values?

Contributor Author (@liuxsh9):

Sure. Comments have been added.
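
For readers following along, the requested comment might look like the sketch below. The detection call (pyACL's acl.get_soc_name()) is an assumption for illustration only; the example value "Ascend910B" matches the one used in the tests later in this thread.

    from typing import Optional

    class NPUAcceleratorManager:  # stub of the class under review
        @staticmethod
        def get_current_node_accelerator_type() -> Optional[str]:
            """Return this node's Ascend chip type.

            Example values: "Ascend910A", "Ascend910B"; None if the
            type cannot be determined.
            """
            try:
                import acl  # pyACL bindings; assumed installed in this sketch

                return acl.get_soc_name()  # e.g. "Ascend910B"
            except Exception:
                return None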

    def validate_resource_request_quantity(
        quantity: float,
    ) -> Tuple[bool, Optional[str]]:
        return (True, None)
Contributor:

Does the NPU support fractional allocation, i.e., multiple processes sharing a single NPU?

Contributor Author (@liuxsh9):

Yes, the NPU supports fractional utilization, similar to the GPU. However, unlike the GPU, when using multiple NPUs (such as in model parallelism), users need to pay attention to the HCCL network configuration. To address this, we have added a warning when a task requests multiple NPUs.
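
To make the fractional case concrete, here is a short sketch using standard Ray resource syntax (bodies are placeholders):

    import ray

    # Two such tasks can share one physical NPU.
    @ray.remote(resources={"NPU": 0.5})
    def shared_task():
        return "scheduled against half an NPU"

    # A quantity above 1 triggers the HCCL warning added in this PR.
    @ray.remote(resources={"NPU": 2})
    def parallel_task():
        return "needs an HCCL network across NPUs"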

@@ -0,0 +1,122 @@
import os
Contributor:

You need to add this test to the python/ray/tests/BUILD file in order for it to run in CI.

Contributor Author (@liuxsh9):

Thanks for the reminder. It has been added.
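
For reference, such a registration in python/ray/tests/BUILD is a Bazel test target; the target name, file name, and attributes below are illustrative, not the PR's actual entry:

    py_test(
        name = "test_npu",       # hypothetical target name
        size = "small",
        srcs = ["test_npu.py"],  # hypothetical file name
        deps = ["//:ray_lib"],
        tags = ["team:core"],
    )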

Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
@jjyao (Contributor) commented Nov 21, 2023

…ock in Win32.

Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
        try:
            npu_files = glob.glob("/dev/davinci?")
            return len(npu_files)
        except FileNotFoundError as e:
Contributor:

Suggested change:

    -        except FileNotFoundError as e:
    +        except Exception as e:

just to be safe
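
Putting the diff and the suggestion together, the detection helper would read roughly as below (shown as a standalone function; the logging in the handler is an assumption):

    import glob
    import logging

    logger = logging.getLogger(__name__)

    def get_current_node_num_accelerators() -> int:
        # Ascend devices are exposed as /dev/davinci0, /dev/davinci1, ...
        try:
            npu_files = glob.glob("/dev/davinci?")
            return len(npu_files)
        except Exception as e:  # broad catch, per the review suggestion
            logger.debug(f"Failed to detect NPUs: {e}")
            return 0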

Comment on lines 86 to 90:

    if quantity > 1:
        logger.warning(
            "The task is requesting multiple Ascend NPUs. "
            "If you need to build an HCCL network for NPU interconnection, "
            "please refer to the HCCL User Manual."
Contributor:

I feel this should be in the documentation, not here. Also, does this apply when I start two actors, each with one NPU, that want to talk to each other?

Contributor Author (@liuxsh9):

Communication between NPUs is similar to Nvidia's NVLink. If users want to achieve NPU-to-NPU communication, they need to use HCCL (similar to NCCL), whether it's within an actor or between actors. It would be appropriate to provide detailed information about this in the documentation. To which document should I add the instructions for using the NPU?

Contributor:

The doc doesn't exist yet. I'll let you know later; we don't need to do it in this PR.
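
To make the actor-to-actor case concrete, a sketch of what users would do today follows. It assumes torch and torch_npu are installed (torch_npu registers an "hccl" backend with torch.distributed); none of this is part of the PR itself:

    import ray
    import torch.distributed as dist

    @ray.remote(resources={"NPU": 1})
    class NPUWorker:
        def setup(self, rank: int, world_size: int) -> None:
            import torch_npu  # noqa: F401  (registers the "hccl" backend)

            # Join a common HCCL group; NPU-to-NPU traffic then flows
            # over HCCL whether the peers live in one actor or several.
            dist.init_process_group(
                backend="hccl",
                init_method="tcp://127.0.0.1:29500",  # placeholder address
                rank=rank,
                world_size=world_size,
            )

    workers = [NPUWorker.remote() for _ in range(2)]
    ray.get([w.setup.remote(i, 2) for i, w in enumerate(workers)])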

    ), patch.object(
        Accelerator, "get_current_node_accelerator_type", return_value="Ascend910B"
    ):
        os.environ["ASCEND_VISIBLE_DEVICES"] = "0,1,2"
Contributor:

You can use monkeypatch to change os.environ so that it is auto-reverted when the test finishes.

Contributor Author (@liuxsh9):

Great suggestions, I've made the changes.
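
For reference, the suggested pattern looks like this; pytest's monkeypatch fixture undoes setenv automatically when the test finishes:

    def test_npu_visible_devices(monkeypatch):
        monkeypatch.setenv("ASCEND_VISIBLE_DEVICES", "0,1,2")
        # ... run the assertions; no manual cleanup of os.environ needed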

assert manager.get_current_node_num_accelerators() == 4
assert manager.__name__ == "NPUAcceleratorManager"
assert ray.available_resources()["NPU"] == 3
del os.environ["ASCEND_VISIBLE_DEVICES"]
Contributor:

same here

Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
@jjyao merged commit 68a1adc into ray-project:master on Nov 27, 2023
17 checks passed
@dmatrix (Contributor) commented Nov 27, 2023

Congrats on your first Ray PR merge!

ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
Ascend NPU is a specialized hardware accelerator designed by Huawei. It is known for its powerful computing capabilities and strong support for deep learning frameworks such as TensorFlow and PyTorch.

This PR aims to add support for the Ascend NPU in Ray.

Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
@liuxsh9 deleted the support-npu branch on December 7, 2023