Add capturable single-tensor RAdam, Adamax #118697

MarouaneMaatouk · 2024-01-31T00:17:17Z

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

linux-foundation-easycla · 2024-01-31T00:17:21Z

The committers listed above are authorized under a signed CLA.

✅ login: MarouaneMaatouk / name: Marouane (3ac081f, cc20f63, 9530527, 75074aa, 622731b, 87b4aa0, d465b22, fc0aef1, a8dacbe, ec1c622)

pytorch-bot · 2024-01-31T00:17:21Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118697

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures, 1 Unrelated Failure

As of commit ec1c622 with merge base 1adedc3:

NEW FAILURES - The following jobs have failed:

inductor / cuda12.1-py3.10-gcc9-sm80 / build (gh)
##[error]The operation was canceled.
inductor / cuda12.1-py3.10-gcc9-sm86 / build (gh)
##[error]The operation was canceled.
inductor / rocm5.7-py3.8-inductor / test (inductor, 1, 1, linux.rocm.gpu.2) (gh)
##[error]The operation was canceled.
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks (gh)
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / build (gh)
##[error]The operation was canceled.
Lint / lintrunner / linux-job (gh)
>>> Lint for torch/testing/_internal/common_optimizers.py:
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 2, 5, linux.4xlarge.nvidia.gpu) (gh)
test_optim.py::TestOptimRenewedCUDA::test_can_load_older_state_dict_RAdam_cuda_float32
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh)
inductor/test_compiled_optimizers.py::CompiledOptimizerTests::test_radam_capturable_cuda
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu) (gh)
test_optim.py::TestOptimRenewedCUDA::test_can_load_older_state_dict_RAdam_cuda_float32

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu) (gh)
inductor/test_memory_planning.py::TestMemoryPlanning::test_abi_compatible

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mlazos · 2024-01-31T01:16:17Z

@MarouaneMaatouk I think you did Adamax actually (which was also needed!) but can you add tests here: https://github.com/pytorch/pytorch/pull/117912/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450 for the cudagraphs support?

cc @janeyx99 for the status on capturable testing with optiminfos, I think we might need to wait for #118326 to go in to make sure we're testing the capturable for Adamax properly.

janeyx99 · 2024-01-31T01:27:01Z

No need to wait for me, but do make the Adamax-related changes in common_optimizers.py in the linked PR. I don’t want my change to block anything!

MarouaneMaatouk · 2024-02-01T20:18:02Z

@mlazos The Adamax tests pass, but the RAdam test fails, and I wasn't able to make sense of the error. Maybe I am missing something.
From logs:

[2024-02-01 21:02:21,116] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION_KW 14 [SkipFilesVariable(), ListVariable(), ListVariable(), ListVariable(), ListVariable(), ListVariable(), ConstantVariable(float), ConstantVariable(float), ConstantVariable(float), ConstantVariable(float), ConstantVariable(float), ConstantVariable(bool), ConstantVariable(bool), ConstantVariable(bool), ConstantVariable(bool), TupleVariable()]
[2024-02-01 21:02:21,117] [0/0] torch._dynamo.symbolic_convert: [DEBUG] empty checkpoint
[2024-02-01 21:02:21,117] [0/0] torch._dynamo.symbolic_convert: [DEBUG] FAILED INLINING <code object radam at 0x7fc5c08e82f0, file "/<PATH>/pytorch/torch/optim/radam.py", line 223>
[2024-02-01 21:02:21,118] [0/0] torch._dynamo.symbolic_convert: [DEBUG] empty checkpoint
[2024-02-01 21:02:21,119] [0/0] torch._dynamo.symbolic_convert: [DEBUG] FAILED INLINING <code object step at 0x7fc5c08d7e10, file "/<PATH>/pytorch/torch/optim/radam.py", line 106>
[2024-02-01 21:02:21,119] [0/0] torch._dynamo.symbolic_convert: [DEBUG] empty checkpoint

File "<PATH>/pytorch/torch/_dynamo/symbolic_convert.py", line 1249, in CALL_FUNCTION_KW
  self.call_function(fn, args, kwargs)
File "<PATH>/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
  self.push(fn.call_function(self, args, kwargs))
File "<PATH>/pytorch/torch/_dynamo/variables/misc.py", line 685, in call_function
  unimplemented(f"call torch._dynamo.disable() wrapped function {self.value}")
File "<PATH>/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
  raise Unsupported(msg)
torch._dynamo.exc.Unsupported: call torch._dynamo.disable() wrapped function <function _single_tensor_radam at 0x7f7d62edecb0>

torch/optim/adamax.py

mlazos · 2024-02-01T20:20:50Z

Remove this code: https://github.com/pytorch/pytorch/blob/923a7c757205a327e8f8a6d4f9dbda036ae531d3/torch/_dynamo/eval_frame.py#L1571C4-L1572C10

pytorch-bot · 2024-02-01T21:18:02Z

Please seek CI approval before scheduling CIFlow labels

MarouaneMaatouk · 2024-02-01T21:20:01Z

Thanks, done in the last commit. However, I am still having some issues due to this condition (https://github.com/pytorch/pytorch/pull/118697/files#diff-4e7620901810b83e6a28709cbb678170338937eae1e3949b1c1e295d803cca68R368); using a plain if statement directly wasn't working.
I'll look into this.

mlazos · 2024-02-02T07:38:13Z

Oh, if you look at the multi-tensor version at the bottom of the same file, I use torch.where, which is a better fit for this use case: it selects elements from a left or right tensor based on the elements of a bool tensor. cond is a little more general and experimental (it can't be lowered to GPU).
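As a rough illustration of the torch.where suggestion above, here is a minimal sketch. It is not the code from this PR: the helper name select_update, its arguments, and the threshold of 5 are assumptions made for the example. The point is that the comparison stays a tensor, so no Python-level if has to read a value back from the device.

```python
import torch


def select_update(rho_t, rectified, plain, threshold=5.0):
    """Pick `rectified` where rho_t > threshold, otherwise `plain`.

    Hypothetical helper for illustration only; all names are assumptions.
    """
    # `rho_t > threshold` is a bool tensor, so there is no data-dependent
    # Python branch; torch.where broadcasts it over both candidates.
    return torch.where(rho_t > threshold, rectified, plain)


rectified = torch.tensor([0.1, 0.2])
plain = torch.tensor([1.0, 2.0])

late = select_update(torch.tensor(6.0), rectified, plain)   # condition is True
early = select_update(torch.tensor(4.0), rectified, plain)  # condition is False
```

Note that both candidate tensors are computed before the selection, so this trades a little extra arithmetic for the absence of control flow.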

pytorch-bot · 2024-02-03T09:59:56Z

Please seek CI approval before scheduling CIFlow labels

janeyx99

Will look more closely sometime after my meetings today/tomorrow but noticed two things from a brief glance

janeyx99 · 2024-02-05T20:34:49Z

torch/testing/_internal/common_optimizers.py

        optim_error_inputs_func=optim_error_inputs_func_adamax,
        supported_impls=("foreach", "differentiable"),
-        only_supports_capturable_on_foreach=True,  # Remove this line when #117836 is done!
+        only_supports_capturable_on_foreach=False,


Please delete the whole line!

And also remove the skipped tests for Adamax as well

pytorch-bot · 2024-02-06T22:20:35Z

Please seek CI approval before scheduling CIFlow labels

janeyx99

@MarouaneMaatouk It looks like there is failing CI currently. I am also noticing that the CUDA graph tests (the ones that test the point of capturable) are missing. Add in the single tensor variants here: https://github.com/pytorch/pytorch/blob/main/test/test_cuda.py#L2695-L2766

Let us know if you need any help!
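For readers unfamiliar with what such a CUDA graph test exercises, the following is a minimal sketch of capturing and replaying an optimizer step. It is not the test from test_cuda.py: the function name and tensor sizes are made up, and it uses torch.optim.Adam with foreach=False as a stand-in, on the assumption that whether Adamax and RAdam accept capturable=True on the single-tensor path depends on the installed PyTorch version. It returns None when no CUDA device is present.

```python
import torch


def replay_captured_step(replays: int = 3):
    """Capture one single-tensor optimizer step in a CUDA graph and replay it."""
    if not torch.cuda.is_available():
        return None  # graph capture needs a CUDA device

    param = torch.ones(4, device="cuda", requires_grad=True)
    param.grad = torch.full_like(param, 0.5)  # static gradient reused on replay
    # Adam is a stand-in here; foreach=False selects the single-tensor path.
    opt = torch.optim.Adam([param], lr=0.1, capturable=True, foreach=False)

    # Warm up on a side stream so optimizer state exists before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        opt.step()
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        opt.step()

    for _ in range(replays):
        graph.replay()
    torch.cuda.synchronize()
    return param.detach().cpu()


result = replay_captured_step()
```

A real test would additionally compare the replayed parameters against an eager run of the same optimizer.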

Finishes the work started in #118697. Thanks MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

Implementation thanks to MarouaneMaatouk in #118697. Added tests and the cudagraph health check. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

Finishes the work started in #118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

Pull Request resolved: #121183
Approved by: https://github.com/albanD
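To make "capturable single tensor" concrete, here is a minimal sketch of one Adamax update written so that the step counter stays a tensor and is never read back with .item(). This is an illustration under assumptions, not the implementation merged in #121183: the function name, argument names, and the omission of weight decay and maximize are all choices made for the example.

```python
import torch


@torch.no_grad()
def adamax_step_sketch(param, grad, exp_avg, exp_inf, step_t,
                       lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update using tensor-only arithmetic (hypothetical sketch)."""
    step_t += 1                                   # in-place, stays a tensor
    exp_avg.lerp_(grad, 1 - beta1)                # first-moment estimate
    # Infinity-norm estimate: max(beta2 * exp_inf, |grad| + eps)
    exp_inf.copy_(torch.maximum(exp_inf * beta2, grad.abs() + eps))
    bias_correction = 1 - beta1 ** step_t         # tensor, no host sync
    step_size = lr / bias_correction
    param.sub_(exp_avg / exp_inf * step_size)


param = torch.tensor([1.0])
grad = torch.tensor([0.5])
exp_avg = torch.zeros(1)
exp_inf = torch.zeros(1)
step_t = torch.tensor(0.0)

adamax_step_sketch(param, grad, exp_avg, exp_inf, step_t)
```

Because the bias correction is a tensor, the final update uses plain tensor arithmetic rather than a scalar value argument, which is what allows the same sequence of kernels to be recorded once and replayed.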

Implementation thanks to @MarouaneMaatouk in #118697, though I've since cleaned it up a lot to save perf on the rect < 5 eager case. It also just looks better now :) Added tests and the cudagraph health check. Pull Request resolved: #121260 Approved by: https://github.com/mlazos

github-actions · 2024-04-13T20:33:41Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

janeyx99 · 2024-04-13T23:02:15Z

this has been done, closing

MarouaneMaatouk requested review from albanD and janeyx99 as code owners January 31, 2024 00:17

pytorch-bot bot added the release notes: optim label Jan 31, 2024

MarouaneMaatouk mentioned this pull request Jan 31, 2024

Add capturable single-tensor RAdam #118230

Closed

pytorchbot added the open source label Jan 31, 2024

MarouaneMaatouk marked this pull request as draft January 31, 2024 00:35

mlazos self-requested a review January 31, 2024 01:00

github-actions bot added the module: inductor label Feb 1, 2024

MarouaneMaatouk commented Feb 1, 2024

github-actions bot added module: dynamo ciflow/inductor labels Feb 1, 2024

pytorch-bot bot removed the ciflow/inductor label Feb 1, 2024

github-actions bot added the ciflow/inductor label Feb 3, 2024

pytorch-bot bot removed the ciflow/inductor label Feb 3, 2024

MarouaneMaatouk added 6 commits February 3, 2024 11:10

Add capturable single-tensor RAdam

3ac081f

update adamax

cc20f63

update radam single_tensor

9530527

Enable tests

75074aa

Fix adamax

622731b

update radam

87b4aa0

pytorch-bot bot removed the ciflow/inductor label Feb 6, 2024

janeyx99 reviewed Feb 6, 2024

Update tests

ec1c622

github-actions bot added the ciflow/inductor label Feb 6, 2024

pytorch-bot bot removed the ciflow/inductor label Feb 6, 2024

zou3519 added the triaged label Feb 8, 2024

janeyx99 changed the title ~~Add capturable single-tensor RAdam~~ Add capturable single-tensor RAdam, Adamax Feb 13, 2024

janeyx99 reviewed Feb 13, 2024

janeyx99 mentioned this pull request Mar 5, 2024

Add capturable single tensor Adamax #121183

Closed

janeyx99 mentioned this pull request Mar 5, 2024

Add RAdam capturable API for forloop #121260

Closed

github-actions bot added the Stale label Apr 13, 2024

pytorch-bot bot added the ciflow/inductor label Apr 13, 2024

janeyx99 closed this Apr 13, 2024
