Use non-CPU device type and id for host accessible memory #25043


Open · wants to merge 11 commits into main

Conversation

skottmckay (Contributor)

Description

Use the non-CPU device type and id for host accessible memory to make the link between CPU and the non-CPU device explicit.

Update the data transfer implementations to check vendor id.

Motivation and Context

… the link between CPU and the non-CPU device explicit.

Update the data transfer implementations to check vendor id.

@github-actions github-actions bot left a comment


You can commit the suggested changes from lintrunner.

bytes,
ACL_MEMCPY_HOST_TO_DEVICE,
static_cast<aclrtStream>(stream.GetHandle())));
} else if (src_device.Type() == OrtDevice::NPU) {
skottmckay (Contributor, Author)


Swapped the order to match the non-async method above, i.e. check NPU -> NPU first, then NPU -> CPU/host accessible.
The else if in the original code is misleading, as there are only 3 src options (NPU, host accessible and CPU), so it was really an else.

This change was applied in a couple of other places as well, given the data transfer implementations were copied from existing EPs when new ones were created.

}
} else if (src_device.Type() == OrtDevice::NPU) {
if (dst_device.Type() == OrtDevice::CPU) {
@skottmckay (Contributor, Author) · Jun 13, 2025


This if was unnecessary: CPU in the old code covered both CPU and host accessible memory, so there are no other destination options once NPU -> NPU has been handled earlier.

If it were necessary, data transfer would have been broken, as there's no else to handle the other cases.


@github-actions github-actions bot left a comment


You can commit the suggested changes from lintrunner.

@@ -53,11 +53,11 @@ class CUDAExternalAllocator : public CUDAAllocator {
 // TODO: add a default constructor
 class CUDAPinnedAllocator : public IAllocator {
  public:
-  CUDAPinnedAllocator(const char* name)
+  CUDAPinnedAllocator(OrtDevice::DeviceId device_id, const char* name)
Contributor


do we need to associate the pinned memory allocator with a specific device? I didn't see where this device_id is used by the allocator implementation.

Contributor Author


I'm not an expert in CUDA, but I believe the device is selected outside of cudaMallocHost via cudaSetDevice. I would have expected the device to have been set in general for the EP, given an instance only uses one device. So the allocator gets the device id now, but there's probably nothing that needs to be done with it in the current setup.

Contributor


As far as I can see, cudaSetDevice is called for CUDAAllocator::Alloc/Free but not for CUDAPinnedAllocator::Alloc/Free. Is a CUDAPinnedAllocator instance intended to be used with only one device?

Contributor


CUDA pinned memory is host memory.

// so that would satisfy the alignment requirement of any other CPU consumers.
// If one device is not on CPU, we default on the one that is CPU.
auto determine_device = [](const OrtDevice& output_device, const OrtDevice& suggested_device) -> OrtDevice {
if (output_device.Type() == OrtDevice::CPU && suggested_device.Type() == OrtDevice::CPU) {
if (output_device.MemType() == OrtDevice::MemType::DEFAULT &&
suggested_device.MemType() == OrtDevice::MemType::DEFAULT) {
Contributor


The above comment mentions "default mem type". Is that still relevant or should it be updated?

@@ -399,8 +399,7 @@ void DumpTensor(
// check tensor is on CPU before dumping it
auto& tensor_location = tensor.Location();
if (tensor_location.device.Type() == OrtDevice::CPU ||
Contributor


tensor_location.device.UsesCpuMemory()

: IAllocator(
OrtMemoryInfo(name, OrtAllocatorType::OrtDeviceAllocator,
OrtDevice(OrtDevice::CPU, OrtDevice::MemType::HOST_ACCESSIBLE, OrtDevice::VendorIds::NVIDIA,
0 /*CPU device always with id 0*/),
OrtDevice(OrtDevice::GPU, OrtDevice::MemType::HOST_ACCESSIBLE, OrtDevice::VendorIds::NVIDIA,
Contributor


GPU

should still be CPU since it's host memory.

OrtDevice(OrtDevice::CPU, OrtDevice::MemType::HOST_ACCESSIBLE, OrtDevice::VendorIds::NVIDIA,
0 /*CPU device always with id 0*/),
OrtDevice(OrtDevice::GPU, OrtDevice::MemType::HOST_ACCESSIBLE, OrtDevice::VendorIds::NVIDIA,
device_id),
Contributor


device_id

should set device_id to 0 for host memory.

return OrtDevice(OrtDevice::CPU, OrtDevice::MemType::HOST_ACCESSIBLE, OrtDevice::VendorIds::NVIDIA,
0 /*CPU device id always be 0*/);
return OrtDevice(OrtDevice::GPU, OrtDevice::MemType::HOST_ACCESSIBLE, OrtDevice::VendorIds::NVIDIA,
default_device_.Id());
Contributor


need to revert the change
