
[rocm] Enable faster load of fused-sgd #117599

Closed · crcrpar opened this issue Jan 17, 2024 · 1 comment
Labels
module: rocm (AMD GPU support for PyTorch) · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

crcrpar (Collaborator) commented Jan 17, 2024

Currently it is turned off to avoid the hang in test_cuda:

I don't have an explanation yet, but since the kernel had two potential paths (`use_faster_load_store`), I used a preprocessor `#ifdef` for ROCm to force the other path. This worked, so there was something wrong with using the faster load/store path. This is my diff that worked for ROCm:
```diff
diff --git a/aten/src/ATen/native/cuda/FusedSgdKernel.cu b/aten/src/ATen/native/cuda/FusedSgdKernel.cu
index a00b47ff79..0c52ee5783 100644
--- a/aten/src/ATen/native/cuda/FusedSgdKernel.cu
+++ b/aten/src/ATen/native/cuda/FusedSgdKernel.cu
@@ -87,6 +87,7 @@ struct FusedSgdMathFunctor {
         init_args<depth>(args, tl, chunk_idx, chunk_size, tensor_loc)};
     n -= chunk_idx * chunk_size;

+#ifndef USE_ROCM
     const auto use_faster_load_store =
         (n % kILP == 0) && (chunk_size % kILP == 0) && all_aligned;
     if (use_faster_load_store) {
@@ -116,7 +117,9 @@ struct FusedSgdMathFunctor {
           load_store(args[2], r_args[2], i_start, 0);
         }
       }
-    } else {
+    } else
+#endif
+    {
       for (auto i_start = 0; i_start < n && i_start < chunk_size;
            i_start += blockDim.x * kILP) {
         load_args<depth>(r_args, args, i_start, chunk_size, n);
```
Originally posted by @jeffdaily in #116585 (comment)
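Below is a minimal, hedged sketch of the gating logic that the diff above routes around on ROCm. It is not the PyTorch source: `copy_chunk` and the `float4` copy are simplified stand-ins for the real `FusedSgdMathFunctor` / `load_store` / `load_args` helpers, and `kILP = 4` is an assumption about the ILP width used by MultiTensorApply.cuh.

```cuda
#include <cstdint>

// Hedged sketch of the two code paths, not the actual FusedSgdMathFunctor.
constexpr int kILP = 4;  // elements handled per vectorized load/store (assumed)

template <typename T>
__device__ bool is_aligned(const T* p) {
  // The fast path needs pointers aligned to a full kILP-wide vector
  // (16 bytes for float) so a single vectorized transaction can be issued.
  return reinterpret_cast<uintptr_t>(p) % (kILP * sizeof(T)) == 0;
}

__device__ void copy_chunk(float* dst, const float* src, int n, int chunk_size) {
#ifndef USE_ROCM
  // Fast path: only taken when every pointer is aligned and both the
  // remaining element count and the chunk size are multiples of kILP.
  const bool all_aligned = is_aligned(dst) && is_aligned(src);
  const bool use_faster_load_store =
      (n % kILP == 0) && (chunk_size % kILP == 0) && all_aligned;
  if (use_faster_load_store) {
    for (int i = threadIdx.x * kILP; i < n && i < chunk_size;
         i += blockDim.x * kILP) {
      // One float4-sized transaction moves kILP contiguous elements.
      *reinterpret_cast<float4*>(dst + i) =
          *reinterpret_cast<const float4*>(src + i);
    }
  } else
#endif
  {
    // Fallback path: plain element-wise loop, always correct but issuing
    // kILP times as many memory transactions.  The #ifndef above is what
    // the diff uses to force ROCm onto this path.
    for (int i = threadIdx.x; i < n && i < chunk_size; i += blockDim.x) {
      dst[i] = src[i];
    }
  }
}
```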

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

pytorch-bot added the module: rocm label on Jan 17, 2024
crcrpar changed the title from "test_cuda's hang with fused sgd on rocm observed in 116585" to "rocm disables faster load of fused-sgd to resolve test_cuda hang" on Jan 17, 2024
mikaylagawarecki added the triaged label on Jan 17, 2024
crcrpar changed the title from "rocm disables faster load of fused-sgd to resolve test_cuda hang" to "[rocm] Enable faster load of fused-sgd" on Jan 20, 2024
petrex (Contributor) commented Apr 2, 2024

@crcrpar Any particular unit tests we can use to validate the feature?

ZelboK pushed a commit to ZelboK/pytorch that referenced this issue on May 19, 2024:

Reopened due to a rebase error. Fixes pytorch#117599.

The reported hanging test, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, is passing with this PR.

The HSA async copy / host wait on completion signal is resolved in MultiTensorApply.cuh:

```
:4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
```
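To interpret those trace lines, here is a hedged, simplified illustration at the HIP API level, not the PyTorch code: an asynchronous copy is enqueued with a completion signal attached (the "HSA Async copy"), and the host then blocks on that signal; the "-1 ns" timeout means it waits indefinitely, so if the preceding fused-SGD kernel never finishes, the wait never returns. The function and buffer names below are hypothetical.

```cpp
#include <hip/hip_runtime.h>

// Hedged illustration only: roughly what a "Host wait on completion_signal"
// trace corresponds to at the API level.  Names are made up for this sketch.
void drain_results(float* host_dst, const float* dev_src, size_t nbytes,
                   hipStream_t stream) {
  // Enqueue an async device-to-host copy; the runtime attaches a
  // completion signal to the command.
  hipMemcpyAsync(host_dst, dev_src, nbytes, hipMemcpyDeviceToHost, stream);
  // The host now actively waits on that signal.  If earlier work on the
  // stream (here, the fused SGD kernel) never completes, this call never
  // returns -- which is the hang captured in the trace above.
  hipStreamSynchronize(stream);
}
```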

Pull Request resolved: pytorch#125456
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99