
[DTensor] Allow DTensor support third-party device #102442

Closed
shaoyf42 opened this issue May 27, 2023 · 0 comments
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue

Comments


shaoyf42 commented May 27, 2023

🚀 The feature, motivation and pitch

Currently, DTensor supports cuda and cpu. There are other efforts to make DTensor support third-party devices, for example #101914 and #101911. However, those efforts only cover a subset of third-party devices and provide no good support for third-party cuda-like devices. Therefore, we would like to extend DTensor to support cuda-like devices; after all, cuda is so popular!

Alternatives

  1. Init process group
    The first step is to initialize the backend for the device_type in DeviceMesh, as done in Allow ORT backend for DTensor #101914, where default_backend_for_device records the default backend for each device type. However, rather than adding it directly as a global variable, I would prefer to add it to the Backend class, since Backend already has a similar structure, backend_capability. In addition, when a user registers a third-party backend for a third-party device, this mapping also needs to be updated synchronously (see the usage sketch after the code below).
from typing import Dict, List, Optional, Union

class Backend:
    # Maps each device type to the default communication backend for it.
    _default_backend_for_device: Dict[str, str] = {
        "cpu": GLOO,
        "cuda": NCCL,
    }

    @classmethod
    def get_default_backend_for_device(cls, device: str) -> str:
        if device not in Backend._default_backend_for_device:
            raise RuntimeError(
                f"Default backend not set for device type {device}, "
                "please set a default using set_default_backend_for_device"
            )
        return Backend._default_backend_for_device[device]

    @classmethod
    def register_backend(cls, name, func, extended_api=False, devices: Optional[Union[str, List[str]]] = None):
        ...
        # When a third-party backend is registered, also record it as the
        # default backend for every device it supports, unless one is set.
        for device in Backend.backend_capability[name.lower()]:
            if device not in Backend._default_backend_for_device:
                Backend._default_backend_for_device[device] = name.lower()
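As a usage sketch of how this would tie together (the backend name "foo_ccl", the device type "foo", and the creator function are hypothetical, and get_default_backend_for_device is the method proposed above, not an existing API):

import torch.distributed as dist

def _foo_ccl_creator(store, group_rank, group_size, timeout):
    # Placeholder ProcessGroup constructor for the hypothetical vendor backend.
    raise NotImplementedError

# Registering the third-party backend for the cuda-like device type "foo"
# would also record it as the default backend for that device.
dist.Backend.register_backend("foo_ccl", _foo_ccl_creator, devices=["foo"])

# DeviceMesh could then resolve the backend from the mesh's device type
# instead of hard-coding gloo/nccl:
backend = dist.Backend.get_default_backend_for_device("foo")  # -> "foo_ccl"
# dist.init_process_group(backend=backend, rank=..., world_size=...)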
  2. Replace torch.cuda.* calls
    In DTensor, calls to torch.cuda.xxx need to be replaced with device-agnostic equivalents. For example, torch.cuda.device_count would be replaced with:
# Look up the device module (e.g. torch.cuda) from the mesh's device type.
device_model = getattr(torch, device_type, None)
if device_model:
    num_gpus_per_host = getattr(device_model, "device_count")()

Other functions can also be replaced similarly to achieve support for cuda-like devices.
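A minimal sketch of how this pattern could be factored into a helper (the name _get_device_module is hypothetical, not an existing DTensor API):

import torch

def _get_device_module(device_type: str):
    # Hypothetical helper: resolve the device module registered on the torch
    # namespace, e.g. torch.cuda for "cuda" or a vendor module for a
    # third-party cuda-like device.
    device_module = getattr(torch, device_type, None)
    if device_module is None:
        raise RuntimeError(f"Device module torch.{device_type} is not available")
    return device_module

# Usage, replacing direct torch.cuda.* calls:
# device_module = _get_device_module(device_type)
# num_devices_per_host = device_module.device_count()
# device_module.set_device(global_rank % num_devices_per_host)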

Of course, maybe, like FSDP in #99024, maintaining a _device_handle in DeviceMesh would be a good approach?
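A minimal sketch of that alternative, assuming a DeviceMesh-like class (the _device_handle attribute follows the FSDP discussion and is not an existing DeviceMesh API):

import torch

class DeviceMesh:
    def __init__(self, device_type: str, mesh):
        self.device_type = device_type
        self.mesh = mesh
        # Cache the device module once; cpu has no module, so keep None.
        self._device_handle = (
            getattr(torch, device_type, None) if device_type != "cpu" else None
        )

# DTensor internals would then call the handle instead of torch.cuda, e.g.:
# if mesh._device_handle is not None:
#     num_devices = mesh._device_handle.device_count()
#     mesh._device_handle.set_device(global_rank % num_devices)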

Additional context

No response

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

@awgu awgu added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 30, 2023