Make Adam optimizer differentiable #82205
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit db97708. Looks good so far; there are no failures yet. (Automated comment.)
Force-pushed from df14687 to 9af1df4.
Nice!
torch/optim/adam.py (Outdated)
@@ -278,7 +284,10 @@ def _single_tensor_adam(params: List[Tensor],
    if amsgrad:
        # Maintains the maximum of all 2nd moment running avg. till now
        torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
        if differentiable:
            max_exp_avg_sqs[i][:] = torch.maximum(max_exp_avg_sqs[i].clone(), exp_avg_sq)
nit: max_exp_avg_sqs[i].copy_(torch.maximum(max_exp_avg_sqs[i], exp_avg_sq))
Also the perf hit compared to the out= call below is really minor. I think we can unconditionally use this new version for the sake of simplicity (same below). cc @jbschlosser what do you think?
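For illustration, a quick sanity check (not from the PR) that the suggested copy_ form matches the existing out= form numerically; buf_out, buf_copy and new are made-up stand-ins for max_exp_avg_sqs[i] and exp_avg_sq:

    import torch

    # Hypothetical stand-ins for max_exp_avg_sqs[i] and exp_avg_sq.
    buf_out = torch.rand(5)
    buf_copy = buf_out.clone()
    new = torch.rand(5)

    # Existing form: write the elementwise max back through out=.
    torch.maximum(buf_out, new, out=buf_out)
    # Suggested form: compute out of place, then copy_ back into the buffer.
    buf_copy.copy_(torch.maximum(buf_copy, new))

    assert torch.equal(buf_out, buf_copy)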
One thing here is that we need to clone max_exp_avg_sqs for this to work... Wondering if it's OK to keep this.
Ho right maximum saves both its inputs.
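A minimal repro (not from the PR) of why the clone is needed: maximum() saves both of its inputs for backward, so an in-place write into one of them bumps its version counter and backward hits the same kind of error quoted further down this thread:

    import torch

    # Non-leaf buffer with grad history, like the cloned optimizer state in the test.
    buf = torch.rand(3, dtype=torch.float64, requires_grad=True).clone()
    other = torch.rand(3, dtype=torch.float64, requires_grad=True)

    out = torch.maximum(buf, other)  # saves both buf and other for backward
    buf.copy_(out)                   # in-place update of a saved input
    try:
        out.sum().backward()
    except RuntimeError as e:
        print(e)  # "... has been modified by an inplace operation ..."

Cloning first, e.g. buf.copy_(torch.maximum(buf.clone(), other)), gives maximum() a saved input that is never written to afterwards, so backward goes through.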
@@ -2693,9 +2693,9 @@ def _diff_fn(p, grad, opt_differentiable_state, opt_class, kwargs, *ignored):
    p.grad = grad
    opt_differentiable_state = {k: v.clone() for k, v in opt_differentiable_state.items()}
    opt = opt_class([p], **kwargs)
    opt.state.update(opt_differentiable_state)
    opt.state[p].update(opt_differentiable_state)
This was my mistake in the SGD PR; state is a dict of dicts where the key is the param.
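For illustration (not part of the PR), the layout being described: Optimizer.state maps each parameter tensor to its own per-parameter dict, so extra entries have to be merged into opt.state[p] rather than opt.state:

    import torch

    p = torch.rand(10, requires_grad=True)
    opt = torch.optim.Adam([p])
    p.sum().backward()
    opt.step()

    # state is keyed by the parameter; the inner dict holds the buffers.
    print(list(opt.state[p].keys()))  # ['step', 'exp_avg', 'exp_avg_sq']
    # So the test has to update the inner dict, as in the diff above:
    opt.state[p].update({'exp_avg': torch.zeros_like(p)})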
Force-pushed from b8a6f60 to 37c23c1.
    opt.step()
    return (p,) + tuple(opt_differentiable_state.values())
    return (p,) + tuple(v for v in opt_differentiable_state.values() if v.requires_grad)
Is this actually needed? Gradcheck will filter out things that don't require gradients already
Actually it didn't, lol; step was returned and gradcheck detected errors with it.
    state = {}
    p = torch.rand(10, requires_grad=True, dtype=torch.float64)
    grad = torch.rand(10, requires_grad=True, dtype=torch.float64)
    state['step'] = torch.tensor(10., requires_grad=False, dtype=torch.float64)
not requires grad?
This causes the gradcheck to fail if it's enabled...
Ho this is the step. Why is this float and not long?
https://github.com/pytorch/pytorch/blob/master/torch/optim/adam.py#L140-L141
It was defined as float in the optimizer itself
OK. Can you leave a comment here that mentions that step is not a continuous variable (even though we define it as a float) and so it shouldn't require gradients?
Btw, can we have a nice assert in the adam() function to ensure that step never requires grad?
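A minimal sketch of such a guard, assuming the functional adam() receives the step counters as a state_steps list of tensors as in torch/optim/adam.py; the helper name is made up:

    from typing import List
    import torch

    def _assert_steps_do_not_require_grad(state_steps: List[torch.Tensor]) -> None:
        # step is a discrete iteration counter (stored as a float tensor),
        # not a continuous variable, so it must never require gradients.
        if any(step.requires_grad for step in state_steps):
            raise RuntimeError("`state_steps` must not require gradients")

    _assert_steps_do_not_require_grad([torch.tensor(10.0)])  # passes: step does not require grad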
torch/optim/adam.py (Outdated)
@@ -278,7 +280,7 @@ def _single_tensor_adam(params: List[Tensor],
    if amsgrad:
        # Maintains the maximum of all 2nd moment running avg. till now
        torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
        max_exp_avg_sqs[i][:] = torch.maximum(max_exp_avg_sqs[i].clone(), exp_avg_sq)
This is a very personal preference, but I'm not very confident with advanced-indexing assignment in general; can this be done with copy_() to make sure of the behavior we get?
But we should also avoid the clone in general.
Suggested change:
    # from:
    max_exp_avg_sqs[i][:] = torch.maximum(max_exp_avg_sqs[i].clone(), exp_avg_sq)
    # to:
    prev_exp_avg = max_exp_avg_sqs[i].clone() if differentiable else max_exp_avg_sqs[i]
    max_exp_avg_sqs[i].copy_(torch.maximum(prev_exp_avg, exp_avg_sq))
The problem here is that if we pass the differentiable argument, the JIT and other pieces of code will complain, and we will make this non-compatible, since torchscript does not allow arguments with default values.
I wonder if we can store the flag in thread-local storage, set by the differentiable context manager, for this check?
Error if we don't clone:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [10]], which is output 0 of torch::autograd::CopyBackwards, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
It is only kwarg-only arguments with default values, IIRC?
You can make it positional as done in Lines 172 to 175 in 1dfcad8:
    # kwonly args with defaults are not supported by functions compiled with torchscript issue #70627
    # setting this as kwarg for now as functional API is compiled by torch/distributed/optim
    has_sparse_grad: bool = None,
    foreach: bool = None,
Force-pushed from 37c23c1 to 36f09c9.
Nit on a clone, but SGTM otherwise. The CI needs fixing though, no?
torch/optim/adam.py (Outdated)
@@ -305,7 +315,7 @@ def _single_tensor_adam(params: List[Tensor],
    if amsgrad:
        # Maintains the maximum of all 2nd moment running avg. till now
        torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
        torch.maximum(max_exp_avg_sqs[i].clone(), exp_avg_sq, out=max_exp_avg_sqs[i])
Only if differentiable?
This was not needed; the differentiable case is now handled in the outer if. :)
@@ -328,7 +338,7 @@ def _multi_tensor_adam(params: List[Tensor],
                       weight_decay: float,
                       eps: float,
                       maximize: bool,
                       capturable: bool):
                       differentiable: bool):
The foreach ops don't support autograd. So we should have an assert here that differentiable is False (same for sgd btw).
This can be done in a separate PR though.
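A minimal sketch of the requested assert, assuming differentiable is threaded into _multi_tensor_adam alongside the existing flags (the signature here is approximate and the foreach body is elided):

    def _multi_tensor_adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs,
                           state_steps, *, amsgrad, beta1, beta2, lr, weight_decay,
                           eps, maximize, capturable, differentiable):
        # The torch._foreach_* kernels used in this path do not support autograd,
        # so differentiable mode must stay on the _single_tensor_adam path.
        assert not differentiable, "_foreach ops don't support autograd"
        ...  # foreach implementation unchanged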
Added the assert to this one, will fix SGD too in the next PR :)
Force-pushed from 36f09c9 to b647df0.
SGTM! Just a small nit about the error message.
torch/optim/adam.py (Outdated)
@@ -152,7 +154,8 @@ def step(self, closure=None):
    if group['amsgrad']:
        max_exp_avg_sqs.append(state['max_exp_avg_sq'])

    if group['differentiable']:
        assert not state['step'].requires_grad
Nit: make this a RuntimeError and add a nice error message.
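For illustration, one way the bare assert in the hunk above could become a RuntimeError with a readable message (a drop-in for those two lines; the exact wording is up to the PR):

    if group['differentiable'] and state['step'].requires_grad:
        raise RuntimeError('`requires_grad` is not supported for `step` '
                           'in differentiable mode')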
done!
Force-pushed from b647df0 to db97708.
@pytorchbot merge
@pytorchbot successfully started a merge job. Check the current status here.
Hey @emcastillo. |
Summary: Continues #80938. Pull Request resolved: #82205. Approved by: https://github.com/albanD
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/5aab57e112d244f0cf3bbab30db640e52a0c2c44
Reviewed By: seemethere
Differential Revision: D38788436
fbshipit-source-id: e35677b92267d068e044693acb9a7fcc96ed59c5
Blocked by #82205. Pull Request resolved: #83578. Approved by: https://github.com/albanD
Summary: Blocked by #82205. Pull Request resolved: #83578. Approved by: https://github.com/albanD
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/f0eb841d209f251d6a735827d4b903962d0d31b8
Reviewed By: seemethere
Differential Revision: D38911145
Pulled By: seemethere
fbshipit-source-id: 10bff92beba31fed5adacdf453b680c5acd8b19c
Continues #80938