
[DTensor] Allow DTensor support cuda-like device #102468

Closed
wants to merge 3 commits

Conversation

shaoyf42
Contributor

@shaoyf42 shaoyf42 commented May 29, 2023

Allow DTensor to support cuda-like devices; fixes #102442

Currently, DTensor supports cuda and cpu. There are other efforts to make DTensor support third-party devices, for example #101914 and #101911, but those only cover a subset of third-party devices and do not properly support third-party cuda-like devices. Therefore, we would like to extend DTensor to support cuda-like devices; after all, cuda is very popular! This PR makes two changes:

  1. Similar to what is done here, we need to initialize the communication backend for the device set by DeviceMesh, so _default_backend_for_device is added to Backend. Note that when we register a new backend for a device other than cpu and cuda, we also need to register a default backend for that device.
  2. _device_handle is added to DeviceMesh for cuda-like devices, similar to what FSDP does. When _device_handle is not None, the device behaves like cuda, and calls such as torch.cuda.device_count() become device_mesh._device_handle.device_count(). A rough sketch follows the list.
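
The following is a minimal, hypothetical sketch of the two pieces; the names and structure are simplified for illustration and do not reproduce the exact diff:

```python
import torch

# 1. A reverse map from device type to its default communication backend.
#    Registering a backend for a new device type would also record its
#    default here (hypothetical structure, not the exact c10d change).
_default_backend_for_device = {"cpu": "gloo", "cuda": "nccl"}

def get_default_backend_for_device(device: str) -> str:
    try:
        return _default_backend_for_device[device]
    except KeyError:
        raise ValueError(f"no default backend registered for device '{device}'")

# 2. A handle that mirrors the torch.cuda module API for cuda-like devices
#    registered via torch._register_device_module(); None means the cpu path.
def _get_device_handle(device_type: str):
    return getattr(torch, device_type, None) if device_type != "cpu" else None

# With a handle, torch.cuda.device_count() becomes handle.device_count(), etc.
```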


pytorch-bot bot commented May 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102468

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 0afebac:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented May 29, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@shaoyf42
Contributor Author

@wanchaol Could you take a look?

Contributor

@wanchaol wanchaol left a comment


Thanks for contributing! It seems interesting that we want a cuda-like device but not the nccl communicator. I left a few comments about the integration piece; I would love to learn more about the detailed use case :)

I am also trying to use c10d's dispatchable backend to make backend integration easier (#102336). I'll land that today, and then we can see whether there are additional gaps to close to make your case work in this PR.

@@ -190,6 +195,17 @@ def __new__(cls, name: str):
value = name.lower()
return value

@classmethod
def get_default_backend_for_device(cls, device: str):
Contributor


I am working on a PR that allows easier integration of custom backends (#102336), so I am not sure we would still need to change the c10d backend to set the default backend for a device. I assume that if you register the custom backend, it should automatically override the backend config if you initialize the world_pg first?

I think maybe we could allow the device_type passed to device_mesh to be something like cuda:non-nccl and pass that through to the init_process_group call.

elif ":" in backend.lower():

Contributor


Oh, actually, if we don't specify anything in init_process_group, it would not pick up the custom registered backend and initialize it... it seems like we do need this map from device_type to backend. cc @H-Huang, does it make sense to have this reverse map for custom backends?

Contributor Author


> I assume that if you register the custom backend, it should automatically override the backend config if you initialize the world_pg first?

> I think maybe we could allow the device_type passed to device_mesh to be something like cuda:non-nccl and pass that through to the init_process_group call.

> Oh, actually, if we don't specify anything in init_process_group, it would not pick up the custom registered backend and initialize it.

> it seems like we do need this map from device_type to backend

As mentioned here, after registering a custom backend, there are two ways to initialize it (including the custom backend):

  1. The user calls init_process_group before using it, or passes the backend explicitly to init_process_group.
  2. Maintain a map from device_type to backend, updated when the custom backend is registered. When nothing is specified in init_process_group, it checks whether each device/backend pair in the map is available and then uses the available pairs to initialize the process group. A rough sketch follows the list.
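
A rough sketch of option 2, assuming a module-level map that backend registration keeps up to date; the names below are illustrative and do not reflect the actual c10d code:

```python
import torch

# Illustrative map from device type to its default backend; registering a
# third-party backend would also record an entry here (hypothetical API).
_device_to_default_backend = {"cpu": "gloo", "cuda": "nccl"}

def register_custom_backend(name: str, device_type: str) -> None:
    # Alongside the usual Backend.register_backend(...) call, remember which
    # device this backend serves so it can be found without an explicit name.
    _device_to_default_backend[device_type] = name

def infer_backends() -> dict:
    # Called when init_process_group() receives no explicit backend: keep only
    # the device/backend pairs that are usable in the current process.
    usable = {}
    for device, backend in _device_to_default_backend.items():
        if device == "cpu" or getattr(torch, device, None) is not None:
            usable[device] = backend
    return usable
```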

Contributor Author


By the way, a unified function could be used to determine whether a backend is available, such as is_backend_available, instead of a special case for each backend. is_backend_available would support a unified check for both built-in and third-party registered backends. I have an implementation in #101945.
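
A rough sketch of what such a unified check could look like; this is only an illustration, not the actual #101945 implementation:

```python
import torch.distributed as dist

def is_backend_available(backend: str) -> bool:
    # Built-in backends expose is_<backend>_available() helpers (gloo, nccl, mpi, ...).
    builtin_check = getattr(dist, f"is_{backend.lower()}_available", None)
    if callable(builtin_check):
        return builtin_check()
    # A third-party backend counts as available once it has been registered.
    return backend.lower() in dist.Backend.backend_list
```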

@@ -107,6 +107,9 @@ def __init__(
_init_process_groups: bool = True,
) -> None:
self.device_type = device_type
self._device_handle = (
Contributor


I don't quite like the device_handle thing; it feels like a hack specialized to cuda. If you are inventing a new backend that conforms with cuda, shouldn't it also have an identical call to torch.cuda.set_device? Or maybe we could use CUDA_VISIBLE_DEVICES if it's hard to hook into torch.cuda.

Contributor Author


> if there are additional gaps to make your case work in this PR?

We hope that PyTorch's distributed stack can support third-party devices as well as third-party backends, not just third-party backends.

In fact, our device has the same semantics as cuda, so it is also consistent with the cuda interface, which makes this a natural approach. For example, xpu uses torch._register_device_module("xpu", current_module) to register its extension as torch.xpu, after which torch.xpu.device_count() and torch.xpu.get_rng_state() provide the same functionality as their cuda counterparts. Our implementation is similar to xpu, so for a device registered this way (torch.custom_device) we want to use _device_handle instead of torch.cuda directly, to support cuda-like devices the way FSDP does. A rough sketch follows.
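
For illustration, a rough sketch of this pattern, using a made-up stand-in module for the real extension:

```python
import types
import torch

# Stand-in for a real third-party extension that implements a cuda-like API
# on top of its own runtime (the lambdas are placeholders).
current_module = types.ModuleType("custom_device_module")
current_module.device_count = lambda: 1
current_module.get_rng_state = lambda: torch.empty(0, dtype=torch.uint8)

# Register it so it becomes reachable as torch.xpu (guarded in case this
# build already ships a real torch.xpu module).
if not hasattr(torch, "xpu"):
    torch._register_device_module("xpu", current_module)

# DeviceMesh/DTensor code can then resolve the handle by name instead of
# hard-coding torch.cuda:
device_handle = getattr(torch, "xpu", None)
device_handle.device_count()   # plays the role of torch.cuda.device_count()
device_handle.get_rng_state()  # plays the role of torch.cuda.get_rng_state()
```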

Contributor


I see, this makes sense then. Could you rebase the PR and fix the merge conflict?

Contributor Author


I have rebased the PR. In addition, the issue that init_process_group does not initialize a custom backend by default can be discussed and resolved in a follow-up PR.

@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 31, 2023
@shaoyf42 shaoyf42 requested a review from wanchaol June 3, 2023 15:20
Contributor

@wanchaol wanchaol left a comment


Looks good! I have one more suggestion before we merge it

@@ -101,6 +101,9 @@ def __init__(
_init_process_groups: bool = True,
) -> None:
self.device_type = device_type
self._device_handle = (
Contributor


One thing I would suggest is that we not attach device_handle to self. We need to be a bit careful when adding additional attributes to DeviceMesh: one thing on the radar is that we want to make sure DTensor is picklable, so an additional attribute there is not ideal.

We can simply create _device_handle and use it in __init__ without saving it to the device mesh (see the sketch below).
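
Roughly, as an illustration of the suggestion only (simplified signature, not the real DeviceMesh):

```python
import torch

class DeviceMesh:
    def __init__(self, device_type: str, mesh, _init_process_groups: bool = True) -> None:
        self.device_type = device_type
        # Resolve the handle locally; nothing extra is stored on self, so the
        # mesh (and DTensor) stays easy to pickle.
        device_handle = (
            getattr(torch, device_type, None) if device_type != "cpu" else None
        )
        if device_handle is not None:
            # cuda-like setup would use the handle here, for example:
            num_devices_per_host = device_handle.device_count()
```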

if device_mesh.device_type == "cuda":
torch.cuda.set_rng_state(new_state)

if device_mesh._device_handle:
Contributor


Similarly here, we can create device_handle again so that we don't need to save it on device_mesh itself (see the sketch below).
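
For example, something along these lines (illustrative helper, not the exact diff):

```python
import torch

def _set_device_rng_state(device_mesh, new_state: torch.Tensor) -> None:
    # Resolve the handle on the spot instead of reading it off device_mesh.
    device_handle = (
        getattr(torch, device_mesh.device_type, None)
        if device_mesh.device_type != "cpu"
        else None
    )
    if device_handle is not None:
        device_handle.set_rng_state(new_state)  # e.g. torch.cuda.set_rng_state
```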

@shaoyf42 shaoyf42 requested a review from fduwjj as a code owner June 7, 2023 03:55
@shaoyf42 shaoyf42 requested a review from wanchaol June 7, 2023 09:41
Contributor

@wanchaol wanchaol left a comment


lgtm, thanks for addressing the comments!

Contributor

wanchaol commented Jun 7, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 7, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels: ciflow/trunk, Merged, open source, release notes: distributed (c10d), triaged
Successfully merging this pull request may close these issues: [DTensor] Allow DTensor support third-party device.