🐛 Describe the bug
Hi @hliuca
On NVIDIA, PyTorch supports torch._C._set_sm_carveout_experimental for better compute-comms overlap. This is useful during the backward pass of DDP and the forward/backward passes of FSDP, to ensure there are enough available SMs/CUs for the RCCL comm kernels so they are not blocked by compute kernels that use up all the SMs/CUs.
Furthermore, it is useful for benchmarking the real-world GEMMs that occur in the backward pass, where a GEMM cannot take up all the available SMs/CUs because RCCL comm kernels occupy some of them.
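For reference, usage on the NVIDIA side looks roughly like the sketch below. The exact argument semantics are my assumption, inferred from the cuBLASLt snippet further down: the value passed is the number of SMs/CUs to reserve, which PyTorch subtracts from multiProcessorCount.

```python
import torch

# Assumed semantics: reserve 16 SMs for comm kernels; cuBLASLt GEMMs
# then target (multiProcessorCount - 16) SMs (see snippet below).
torch._C._set_sm_carveout_experimental(16)

a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")
c = a @ b  # compute GEMM, leaving headroom for NCCL/RCCL kernels
```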
Related to #147966
I was looking into implementing this myself, but it seems it isn't as simple as calling hipblasLtMatmulDescSetAttribute: it requires changes to hipblaslt itself, because unlike the cublasLt side, HIPBLASLT_MATMUL_DESC_CU_COUNT_TARGET is not among the hipblasLtMatmulDescAttributes_t enum values accepted by hipblasLtMatmulDescSetAttribute, at least according to the AMD docs. For comparison, the CUDA path in PyTorch sets the target like this:
```cpp
// CUDA path in PyTorch (the enclosing if-guard is restored here for context):
if (at::globalContext()._SMCarveout_EXPERIMENTAL().has_value()) {
  computeDesc.setAttribute<int32_t>(
      CUBLASLT_MATMUL_DESC_SM_COUNT_TARGET,
      at::cuda::getCurrentDeviceProperties()->multiProcessorCount -
          at::globalContext()._SMCarveout_EXPERIMENTAL().value());
}
```
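A quick way to check whether the carveout is actually honored on ROCm would be to time a large GEMM with and without it; on current ROCm builds I'd expect no measurable difference, since there is no CU-count-target attribute for PyTorch to set through hipBLASLt. A rough sketch (the binding name is taken from the NVIDIA path above; the harness itself is hypothetical):

```python
import torch

def time_gemm(n=8192, iters=20):
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per GEMM

baseline = time_gemm()
torch._C._set_sm_carveout_experimental(16)  # assumed: reserve 16 SMs/CUs
carved = time_gemm()
print(f"baseline {baseline:.2f} ms, with carveout {carved:.2f} ms")
```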
Versions
Any ROCm torch version.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd