
[rocm] Enable faster load of fused-sgd #117599

Closed · crcrpar opened this issue Jan 17, 2024 · 1 comment
Labels
module: rocm (AMD GPU support for PyTorch) · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

crcrpar (Collaborator) commented Jan 17, 2024

Currently it is turned off to avoid the hang in test_cuda:

I don't have an explanation yet, but since the kernel had two potential paths (`use_faster_load_store`), I used a preprocessor `#ifdef` for ROCm to force the other path. This worked, so there was something wrong with using the faster load/store path. This is my diff that worked for ROCm:
```diff
diff --git a/aten/src/ATen/native/cuda/FusedSgdKernel.cu b/aten/src/ATen/native/cuda/FusedSgdKernel.cu
index a00b47ff79..0c52ee5783 100644
--- a/aten/src/ATen/native/cuda/FusedSgdKernel.cu
+++ b/aten/src/ATen/native/cuda/FusedSgdKernel.cu
@@ -87,6 +87,7 @@ struct FusedSgdMathFunctor {
         init_args<depth>(args, tl, chunk_idx, chunk_size, tensor_loc)};
     n -= chunk_idx * chunk_size;

+#ifndef USE_ROCM
     const auto use_faster_load_store =
         (n % kILP == 0) && (chunk_size % kILP == 0) && all_aligned;
     if (use_faster_load_store) {
@@ -116,7 +117,9 @@ struct FusedSgdMathFunctor {
           load_store(args[2], r_args[2], i_start, 0);
         }
       }
-    } else {
+    } else
+#endif
+    {
       for (auto i_start = 0; i_start < n && i_start < chunk_size;
            i_start += blockDim.x * kILP) {
         load_args<depth>(r_args, args, i_start, chunk_size, n);
```
Originally posted by @jeffdaily in #116585 (comment)
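Below is a minimal, hedged sketch of the gating logic that the diff above routes around on ROCm. It is not the PyTorch source: `copy_chunk` and the `float4` copy are simplified stand-ins for the real `FusedSgdMathFunctor` / `load_store` / `load_args` helpers, and `kILP = 4` is an assumption about the ILP width used by MultiTensorApply.cuh.

```cuda
#include <cstdint>

// Hedged sketch of the two code paths, not the actual FusedSgdMathFunctor.
constexpr int kILP = 4;  // elements handled per vectorized load/store (assumed)

template <typename T>
__device__ bool is_aligned(const T* p) {
  // The fast path needs pointers aligned to a full kILP-wide vector
  // (16 bytes for float) so a single vectorized transaction can be issued.
  return reinterpret_cast<uintptr_t>(p) % (kILP * sizeof(T)) == 0;
}

__device__ void copy_chunk(float* dst, const float* src, int n, int chunk_size) {
#ifndef USE_ROCM
  // Fast path: only taken when every pointer is aligned and both the
  // remaining element count and the chunk size are multiples of kILP.
  const bool all_aligned = is_aligned(dst) && is_aligned(src);
  const bool use_faster_load_store =
      (n % kILP == 0) && (chunk_size % kILP == 0) && all_aligned;
  if (use_faster_load_store) {
    for (int i = threadIdx.x * kILP; i < n && i < chunk_size;
         i += blockDim.x * kILP) {
      // One float4-sized transaction moves kILP contiguous elements.
      *reinterpret_cast<float4*>(dst + i) =
          *reinterpret_cast<const float4*>(src + i);
    }
  } else
#endif
  {
    // Fallback path: plain element-wise loop, always correct but issuing
    // kILP times as many memory transactions.  The #ifndef above is what
    // the diff uses to force ROCm onto this path.
    for (int i = threadIdx.x; i < n && i < chunk_size; i += blockDim.x) {
      dst[i] = src[i];
    }
  }
}
```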

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

pytorch-bot added the module: rocm label on Jan 17, 2024
crcrpar changed the title from "test_cuda's hang with fused sgd on rocm observed in 116585" to "rocm disables faster load of fused-sgd to resolve test_cuda hang" on Jan 17, 2024
mikaylagawarecki added the triaged label on Jan 17, 2024
crcrpar changed the title from "rocm disables faster load of fused-sgd to resolve test_cuda hang" to "[rocm] Enable faster load of fused-sgd" on Jan 20, 2024
petrex (Contributor) commented Apr 2, 2024

@crcrpar Any particular unit tests we can use to validate the feature?

ZelboK pushed a commit to ZelboK/pytorch that referenced this issue on May 19, 2024:

Reopened due to a rebase error. Fixes pytorch#117599.

The reported hanging test, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, is passing with this PR.

The HSA async copy / host wait on completion signal is resolved in MultiTensorApply.cuh:

```
:4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
```
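To interpret those trace lines, here is a hedged, simplified illustration at the HIP API level, not the PyTorch code: an asynchronous copy is enqueued with a completion signal attached (the "HSA Async copy"), and the host then blocks on that signal; the "-1 ns" timeout means it waits indefinitely, so if the preceding fused-SGD kernel never finishes, the wait never returns. The function and buffer names below are hypothetical.

```cpp
#include <hip/hip_runtime.h>

// Hedged illustration only: roughly what a "Host wait on completion_signal"
// trace corresponds to at the API level.  Names are made up for this sketch.
void drain_results(float* host_dst, const float* dev_src, size_t nbytes,
                   hipStream_t stream) {
  // Enqueue an async device-to-host copy; the runtime attaches a
  // completion signal to the command.
  hipMemcpyAsync(host_dst, dev_src, nbytes, hipMemcpyDeviceToHost, stream);
  // The host now actively waits on that signal.  If earlier work on the
  // stream (here, the fused SGD kernel) never completes, this call never
  // returns -- which is the hang captured in the trace above.
  hipStreamSynchronize(stream);
}
```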

Pull Request resolved: pytorch#125456
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99