[ROCm] enable faster_load_save for Fused_SGD (#125456)
Reopened due to a rebase error. Fixes #117599

The reported hanging test, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, passes with this PR.
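
For reference, one way to run that test locally from the PyTorch repo root (assuming pytest is available in the environment):

```
pytest test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers
```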

The HSA async copy / host wait on the completion signal in MultiTensorApply.cuh is resolved; the hang showed up in the trace as:

```
:4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
```
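
For context, below is a minimal, self-contained sketch of the pattern that `use_faster_load_store` gates (simplified for illustration; `AlignedVec`, `load_store`, `copy_chunk`, and the `kILP` value here are stand-ins, not the actual FusedSgdKernel.cu / MultiTensorApply.cuh code). When the element count, chunk size, and pointer alignment all allow it, each thread moves `kILP` elements per iteration with a single vectorized load/store instead of scalar accesses; otherwise a scalar fallback runs.

```
// Simplified sketch of the ILP-vectorized fast path (illustrative only).
#include <cuda_runtime.h>

constexpr int kILP = 4;  // elements moved per vector transaction (assumed value)

// Vector wrapper sized/aligned so the compiler can emit one wide load/store.
template <typename T>
struct alignas(sizeof(T) * kILP) AlignedVec {
  T val[kILP];
};

// Copy kILP contiguous elements in one vectorized transaction.
template <typename T>
__device__ void load_store(T* dst, const T* src, int dst_idx, int src_idx) {
  using Vec = AlignedVec<T>;
  reinterpret_cast<Vec*>(dst)[dst_idx] =
      reinterpret_cast<const Vec*>(src)[src_idx];
}

// Hypothetical chunk copy showing the two paths the flag selects between.
template <typename T>
__global__ void copy_chunk(T* dst, const T* src, int n, int chunk_size, bool all_aligned) {
  const bool use_faster_load_store =
      (n % kILP == 0) && (chunk_size % kILP == 0) && all_aligned;
  if (use_faster_load_store) {
    // Fast path: each thread handles kILP elements per iteration.
    for (int i_start = threadIdx.x;
         i_start * kILP < n && i_start * kILP < chunk_size;
         i_start += blockDim.x) {
      load_store(dst, src, i_start, i_start);
    }
  } else {
    // Scalar fallback: one element per thread per iteration.
    for (int i = threadIdx.x; i < n && i < chunk_size; i += blockDim.x) {
      dst[i] = src[i];
    }
  }
}
```

With this change, ROCm builds take the same vectorized path as CUDA builds instead of always falling back to the scalar branch.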

Pull Request resolved: #125456
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99
petrex authored and ZelboK committed May 19, 2024
1 parent c81bf77 commit 6372770
Showing 1 changed file with 0 additions and 4 deletions.
aten/src/ATen/native/cuda/FusedSgdKernel.cu

```
@@ -86,12 +86,8 @@ struct FusedSgdMathFunctor {
       init_args<depth>(args, tl, chunk_idx, chunk_size, tensor_loc)};
   const auto n = tl.numel_for_tensor[tensor_loc] - chunk_idx * chunk_size;

-#ifndef USE_ROCM
   const auto use_faster_load_store =
       (n % kILP == 0) && (chunk_size % kILP == 0) && all_aligned;
-#else
-  const auto use_faster_load_store{false};
-#endif
   if (use_faster_load_store) {
     for (auto i_start = threadIdx.x;
          i_start * kILP < n && i_start * kILP < chunk_size;
```
