🚀 The feature, motivation and pitch
Currently, DTensor supports cuda and cpu. There are other efforts to make DTensor support third-party devices, for example #101914 and #101911. However, those efforts only cover a subset of third-party devices and provide no good support for third-party cuda-like devices. Therefore, we would like to extend DTensor to support cuda-like devices; after all, cuda is very popular!
Alternatives
Init process group
The first step is to initialize the backend for the device_type in DeviceMesh, as done in "Allow ORT backend for DTensor" (#101914), where default_backend_for_device saves the default backend used by each device. However, rather than adding it as a global variable, I prefer to put it in Backend, because Backend already has a similar structure, backend_capability. In addition, when a user registers a third-party backend for a third-party device, the map also needs to be updated in sync.
```python
class Backend:
    _default_backend_for_device: Dict[str, str] = {
        "cpu": GLOO,
        "cuda": NCCL,
    }

    @classmethod
    def get_default_backend_for_device(cls, device: str):
        if device not in Backend._default_backend_for_device:
            raise RuntimeError(
                f"Default backend not set for device type {device}, "
                "please set a default using set_default_backend_for_device"
            )
        return Backend._default_backend_for_device[device]

    @classmethod
    def register_backend(cls, name, func, extended_api=False, devices: Optional[Union[str, List[str]]] = None):
        ...
        for device in Backend.backend_capability[name.lower()]:
            if device not in Backend._default_backend_for_device:
                Backend._default_backend_for_device[device] = name.lower()
```
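To make the "updated in sync" behavior concrete, here is a runnable, pure-Python sketch of the proposed registry (torch itself is not required to run it; the backend name `hccl` and device name `npu` are hypothetical placeholders for a vendor's cuda-like device):

```python
from typing import Dict, List, Optional, Union

GLOO, NCCL = "gloo", "nccl"

class Backend:
    # Which devices each backend can run on (mirrors the existing
    # backend_capability structure mentioned above).
    backend_capability: Dict[str, List[str]] = {
        GLOO: ["cpu", "cuda"],
        NCCL: ["cuda"],
    }
    _default_backend_for_device: Dict[str, str] = {"cpu": GLOO, "cuda": NCCL}

    @classmethod
    def get_default_backend_for_device(cls, device: str) -> str:
        if device not in cls._default_backend_for_device:
            raise RuntimeError(f"Default backend not set for device type {device}")
        return cls._default_backend_for_device[device]

    @classmethod
    def register_backend(cls, name: str, func=None, extended_api: bool = False,
                         devices: Optional[Union[str, List[str]]] = None) -> None:
        devs = [devices] if isinstance(devices, str) else list(devices or [])
        cls.backend_capability[name.lower()] = devs
        # Keep the default map in sync: the newly registered backend becomes
        # the default for any of its devices that has no default yet.
        for device in devs:
            cls._default_backend_for_device.setdefault(device, name.lower())

# A vendor registers a hypothetical "hccl" backend for a cuda-like "npu" device:
Backend.register_backend("hccl", devices="npu")
print(Backend.get_default_backend_for_device("npu"))   # hccl
print(Backend.get_default_backend_for_device("cuda"))  # nccl
```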
Replace torch.cuda.* functions
In DTensor, calls to torch.cuda.xxx need to be replaced with device-agnostic equivalents; for example, torch.cuda.device_count will be replaced with:
Other functions can also be replaced similarly to achieve support for cuda-like devices.
Of course, like FSDP in #99024, maintaining a _device_handle in DeviceMesh may also be a good way.
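A possible sketch of that alternative, assuming the handle is simply the per-device-type module cached on the mesh (again with `torch` stubbed so the snippet is self-contained; the real DeviceMesh would look the handle up on the actual torch package):

```python
from types import SimpleNamespace

# Stub torch module so the sketch runs without a GPU build.
torch = SimpleNamespace(
    cuda=SimpleNamespace(device_count=lambda: 8, current_device=lambda: 0),
)

class DeviceMesh:
    """Sketch: cache a per-device-type handle instead of calling torch.cuda.* directly."""

    def __init__(self, device_type: str):
        self.device_type = device_type
        # None for device types (e.g. "cpu") with no dedicated module here.
        self._device_handle = getattr(torch, device_type, None)

mesh = DeviceMesh("cuda")
print(mesh._device_handle.device_count())    # 8
print(mesh._device_handle.current_device())  # 0
```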
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu