[ROCm] enable faster_load_save for Fused_SGD #125456

Closed
wants to merge 1 commit into from

@petrex
Contributor

petrex commented May 3, 2024

Reopened due to a rebase error. Fixes #117599

The test reported as hanging, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, passes with this PR.

The HSA async copy / host wait on the completion signal, shown in the log below, is resolved in `MultiTensorApply.cuh`.

```
:4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
```
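For anyone triaging a similar hang, the following is a minimal Python sketch that picks the host-wait entries out of a log shaped like the three lines above. It is not part of this PR. The field layout (level, source file, source line, timestamp in microseconds, pid, tid, message) is inferred only from those sample lines, and the helper names `parse_line` and `negative_timeout_waits` are hypothetical. Reading a negative timeout as "wait with no timeout" is also an assumption.

```python
import re

# Layout inferred from the three sample lines in the PR description:
# :<level>:<file> :<line> : <timestamp> us: [pid:<pid> tid:<tid>] <message>
LOG_LINE = re.compile(
    r"^:(?P<level>\d+):(?P<file>\S+)\s*:(?P<line>\d+)\s*:\s*(?P<ts_us>\d+) us: "
    r"\[pid:(?P<pid>\d+)\s+tid:(?P<tid>0x[0-9a-fA-F]+)\]\s*(?P<msg>.*)$"
)

# Message shape taken from the third sample line.
ACTIVE_WAIT = re.compile(
    r"Host active wait for Signal = \((?P<signal>0x[0-9a-fA-F]+)\) for (?P<ns>-?\d+) ns"
)

SAMPLE = """\
:4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
"""


def parse_line(line):
    """Split one log line into fields; return None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    for key in ("level", "line", "ts_us", "pid"):
        rec[key] = int(rec[key])
    return rec


def negative_timeout_waits(lines):
    """Return the active-wait entries whose timeout is negative.

    Assumption: a negative timeout (such as -1 ns) means the host waits
    with no timeout, which is what a hang would look like in the log.
    """
    found = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = ACTIVE_WAIT.search(rec["msg"])
        if m and int(m.group("ns")) < 0:
            found.append(
                {"signal": m.group("signal"), "tid": rec["tid"], "ts_us": rec["ts_us"]}
            )
    return found


if __name__ == "__main__":
    for wait in negative_timeout_waits(SAMPLE.splitlines()):
        print(wait)
```

Run against a captured log, this only narrows down which signal and thread the host was blocked on; it says nothing about why the signal was never completed.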

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

pytorch-bot bot commented May 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125456

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c4eff75 with merge base b08072f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@petrex
Contributor Author

petrex commented May 3, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/125456/head returned non-zero exit code 1

```
Rebasing (1/5)
Rebasing (2/5)
Auto-merging test/test_cuda.py
CONFLICT (content): Merge conflict in test/test_cuda.py
error: could not apply cf814876534... enable test_grah_warn...() for rocm
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply cf814876534... enable test_grah_warn...() for rocm
```

Raised by https://github.com/pytorch/pytorch/actions/runs/8934168228

@janeyx99
Contributor

janeyx99 commented May 3, 2024

FYI, as of yesterday `test_grad_scaling_autocast_fused_optimizers` has moved to `test/test_optim.py`.

@janeyx99
Contributor

janeyx99 commented May 3, 2024

@crcrpar could you take a look at this PR?

@janeyx99 added the triaged label May 3, 2024
@jeffdaily changed the title from "Enable faster_load_save for Fused_SGD" to "[ROCm] enable faster_load_save for Fused_SGD" May 10, 2024
@pytorch-bot added the ciflow/rocm and module: rocm labels May 10, 2024
@jeffdaily
Collaborator

@petrex my approval is conditional on the CI fully passing. Looks like you'll need to manually rebase.

@petrex requested a review from eqy as a code owner May 11, 2024 02:10
@pytorch-bot added the release notes: cuda label May 11, 2024
@petrex
Contributor Author

petrex commented May 11, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fused_sgd_rocm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fused_sgd_rocm && git pull --rebase)

@pruthvistony
Collaborator

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label May 17, 2024
@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)
Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@janeyx99
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
Reopen due to rebase error. Fixes pytorch#117599

Pull Request resolved: pytorch#125456
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99
Labels
ciflow/rocm, ciflow/trunk, Merged, module: rocm, open source, release notes: cuda, triaged
Development

Successfully merging this pull request may close these issues.

[rocm] Enable faster load of fused-sgd
7 participants