[ROCm] enable faster_load_save for Fused_SGD (#125456)
Reopened due to a rebase error. Fixes #117599

The reported hanging test, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, passes with this PR.
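
For reference, one way to run that test locally from the PyTorch repo root (assuming pytest is available in the environment):

```
pytest test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers
```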

The HSA async copy / host wait on the completion signal in MultiTensorApply.cuh is resolved; the hang showed up in the trace as:

```
:4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
```
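
For context, below is a minimal, self-contained sketch of the pattern that `use_faster_load_store` gates (simplified for illustration; `AlignedVec`, `load_store`, `copy_chunk`, and the `kILP` value here are stand-ins, not the actual FusedSgdKernel.cu / MultiTensorApply.cuh code). When the element count, chunk size, and pointer alignment all allow it, each thread moves `kILP` elements per iteration with a single vectorized load/store instead of scalar accesses; otherwise a scalar fallback runs.

```
// Simplified sketch of the ILP-vectorized fast path (illustrative only).
#include <cuda_runtime.h>

constexpr int kILP = 4;  // elements moved per vector transaction (assumed value)

// Vector wrapper sized/aligned so the compiler can emit one wide load/store.
template <typename T>
struct alignas(sizeof(T) * kILP) AlignedVec {
  T val[kILP];
};

// Copy kILP contiguous elements in one vectorized transaction.
template <typename T>
__device__ void load_store(T* dst, const T* src, int dst_idx, int src_idx) {
  using Vec = AlignedVec<T>;
  reinterpret_cast<Vec*>(dst)[dst_idx] =
      reinterpret_cast<const Vec*>(src)[src_idx];
}

// Hypothetical chunk copy showing the two paths the flag selects between.
template <typename T>
__global__ void copy_chunk(T* dst, const T* src, int n, int chunk_size, bool all_aligned) {
  const bool use_faster_load_store =
      (n % kILP == 0) && (chunk_size % kILP == 0) && all_aligned;
  if (use_faster_load_store) {
    // Fast path: each thread handles kILP elements per iteration.
    for (int i_start = threadIdx.x;
         i_start * kILP < n && i_start * kILP < chunk_size;
         i_start += blockDim.x) {
      load_store(dst, src, i_start, i_start);
    }
  } else {
    // Scalar fallback: one element per thread per iteration.
    for (int i = threadIdx.x; i < n && i < chunk_size; i += blockDim.x) {
      dst[i] = src[i];
    }
  }
}
```

With this change, ROCm builds take the same vectorized path as CUDA builds instead of always falling back to the scalar branch.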

Pull Request resolved: #125456
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99
petrex authored and ZelboK committed May 19, 2024
1 parent c81bf77 commit 6372770
Showing 1 changed file with 0 additions and 4 deletions.
aten/src/ATen/native/cuda/FusedSgdKernel.cu

```
@@ -86,12 +86,8 @@ struct FusedSgdMathFunctor {
       init_args<depth>(args, tl, chunk_idx, chunk_size, tensor_loc)};
   const auto n = tl.numel_for_tensor[tensor_loc] - chunk_idx * chunk_size;

-#ifndef USE_ROCM
   const auto use_faster_load_store =
       (n % kILP == 0) && (chunk_size % kILP == 0) && all_aligned;
-#else
-  const auto use_faster_load_store{false};
-#endif
   if (use_faster_load_store) {
     for (auto i_start = threadIdx.x;
          i_start * kILP < n && i_start * kILP < chunk_size;
```
