
[DTensor] Allow DTensor support third-party device #102442

Closed
shaoyf42 opened this issue May 27, 2023 · 0 comments
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue

Comments


shaoyf42 commented May 27, 2023

🚀 The feature, motivation and pitch

Currently, DTensor supports cuda and cpu. There are other efforts to make DTensor support third-party devices, for example #101914 and #101911. However, those efforts only cover a subset of third-party devices and provide no good support for third-party cuda-like devices. Therefore, we would like to extend DTensor to support cuda-like devices; after all, cuda is so popular!

Alternatives

  1. Init process group
    The first step is to initialize the backend for the device_type in DeviceMesh, as done in Allow ORT backend for DTensor #101914, where default_backend_for_device records the default backend for each device type. However, rather than adding it directly as a global variable, I would prefer to add it to the Backend class, since Backend already has a similar structure, backend_capability. In addition, when a user registers a third-party backend for a third-party device, this mapping also needs to be updated synchronously (see the usage sketch after the code below).
from typing import Dict, List, Optional, Union

class Backend:
    # Maps each device type to the default communication backend for it.
    _default_backend_for_device: Dict[str, str] = {
        "cpu": GLOO,
        "cuda": NCCL,
    }

    @classmethod
    def get_default_backend_for_device(cls, device: str) -> str:
        if device not in Backend._default_backend_for_device:
            raise RuntimeError(
                f"Default backend not set for device type {device}, "
                "please set a default using set_default_backend_for_device"
            )
        return Backend._default_backend_for_device[device]

    @classmethod
    def register_backend(cls, name, func, extended_api=False, devices: Optional[Union[str, List[str]]] = None):
        ...
        # When a third-party backend is registered, also record it as the
        # default backend for every device it supports, unless one is set.
        for device in Backend.backend_capability[name.lower()]:
            if device not in Backend._default_backend_for_device:
                Backend._default_backend_for_device[device] = name.lower()
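As a usage sketch of how this would tie together (the backend name "foo_ccl", the device type "foo", and the creator function are hypothetical, and get_default_backend_for_device is the method proposed above, not an existing API):

import torch.distributed as dist

def _foo_ccl_creator(store, group_rank, group_size, timeout):
    # Placeholder ProcessGroup constructor for the hypothetical vendor backend.
    raise NotImplementedError

# Registering the third-party backend for the cuda-like device type "foo"
# would also record it as the default backend for that device.
dist.Backend.register_backend("foo_ccl", _foo_ccl_creator, devices=["foo"])

# DeviceMesh could then resolve the backend from the mesh's device type
# instead of hard-coding gloo/nccl:
backend = dist.Backend.get_default_backend_for_device("foo")  # -> "foo_ccl"
# dist.init_process_group(backend=backend, rank=..., world_size=...)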
  2. Replace torch.cuda.* calls
    In DTensor, calls to torch.cuda.xxx need to be replaced with device-agnostic equivalents. For example, torch.cuda.device_count would be replaced with:
# Look up the device module (e.g. torch.cuda) from the mesh's device type.
device_model = getattr(torch, device_type, None)
if device_model:
    num_gpus_per_host = getattr(device_model, "device_count")()

Other functions can also be replaced similarly to achieve support for cuda-like devices.
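A minimal sketch of how this pattern could be factored into a helper (the name _get_device_module is hypothetical, not an existing DTensor API):

import torch

def _get_device_module(device_type: str):
    # Hypothetical helper: resolve the device module registered on the torch
    # namespace, e.g. torch.cuda for "cuda" or a vendor module for a
    # third-party cuda-like device.
    device_module = getattr(torch, device_type, None)
    if device_module is None:
        raise RuntimeError(f"Device module torch.{device_type} is not available")
    return device_module

# Usage, replacing direct torch.cuda.* calls:
# device_module = _get_device_module(device_type)
# num_devices_per_host = device_module.device_count()
# device_module.set_device(global_rank % num_devices_per_host)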

Of course, maybe, like FSDP in #99024, maintaining a _device_handle in DeviceMesh would be a good approach?
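A minimal sketch of that alternative, assuming a DeviceMesh-like class (the _device_handle attribute follows the FSDP discussion and is not an existing DeviceMesh API):

import torch

class DeviceMesh:
    def __init__(self, device_type: str, mesh):
        self.device_type = device_type
        self.mesh = mesh
        # Cache the device module once; cpu has no module, so keep None.
        self._device_handle = (
            getattr(torch, device_type, None) if device_type != "cpu" else None
        )

# DTensor internals would then call the handle instead of torch.cuda, e.g.:
# if mesh._device_handle is not None:
#     num_devices = mesh._device_handle.device_count()
#     mesh._device_handle.set_device(global_rank % num_devices)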

Additional context

No response

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

@awgu awgu added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 30, 2023