Conversation

emcastillo
Collaborator

Continues #80938

@facebook-github-bot
Contributor

facebook-github-bot commented Jul 26, 2022

✅ No Failures (0 Pending)

As of commit db97708 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@emcastillo emcastillo requested a review from jbschlosser July 26, 2022 03:33
@emcastillo emcastillo added the module: optimizer (Related to torch.optim) label Jul 26, 2022
@emcastillo emcastillo force-pushed the differentiable-adam branch 2 times, most recently from df14687 to 9af1df4 Compare July 26, 2022 04:27
Collaborator

@albanD albanD left a comment

Nice!

@@ -278,7 +284,10 @@ def _single_tensor_adam(params: List[Tensor],

 if amsgrad:
     # Maintains the maximum of all 2nd moment running avg. till now
-    torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
+    if differentiable:
+        max_exp_avg_sqs[i][:] = torch.maximum(max_exp_avg_sqs[i].clone(), exp_avg_sq)
Collaborator

nit: max_exp_avg_sqs[i].copy_(torch.maximum(max_exp_avg_sqs[i], exp_avg_sq))

Also the perf hit compared to the out= call below is really minor. I think we can unconditionally use this new version for the sake of simplicity (same below). cc @jbschlosser what do you think?

Collaborator Author

One thing here is that we need to clone max_exp_avg_sqs for this to work ...
Wonder if it's OK to keep this.

Collaborator

Oh right, maximum saves both its inputs.
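
For context, a minimal runnable sketch (not code from the PR) of why the clone matters: torch.maximum saves both of its inputs for backward, so writing the result back into one of those inputs in place invalidates the saved tensor and backward fails with the in-place modification error quoted later in this thread.

```python
import torch

# Sketch of the failure mode discussed above; variable names are illustrative.
max_exp_avg_sq = torch.rand(10, dtype=torch.float64, requires_grad=True).clone()
exp_avg_sq = torch.rand(10, dtype=torch.float64, requires_grad=True)

out = torch.maximum(max_exp_avg_sq, exp_avg_sq)  # saves both inputs for backward
max_exp_avg_sq.copy_(out)                        # in-place write into a saved input
try:
    max_exp_avg_sq.sum().backward()
except RuntimeError as err:
    print(err)  # "... modified by an inplace operation ..."

# Cloning first means the clone, not the mutated buffer, is what gets saved:
max_exp_avg_sq = torch.rand(10, dtype=torch.float64, requires_grad=True).clone()
out = torch.maximum(max_exp_avg_sq.clone(), exp_avg_sq)
max_exp_avg_sq.copy_(out)
max_exp_avg_sq.sum().backward()                  # succeeds
```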

@H-Huang H-Huang added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jul 26, 2022
@@ -2693,9 +2693,9 @@ def _diff_fn(p, grad, opt_differentiable_state, opt_class, kwargs, *ignored):
 p.grad = grad
 opt_differentiable_state = {k: v.clone() for k, v in opt_differentiable_state.items()}
 opt = opt_class([p], **kwargs)
-opt.state.update(opt_differentiable_state)
+opt.state[p].update(opt_differentiable_state)
Collaborator Author

This was my mistake in the SGD PR; state is a dict of dicts where the key is the param.
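
For illustration, a small sketch (not the test code) of the layout being described: optimizer state is a dict keyed by the parameter tensor itself, with each parameter's buffers one level down.

```python
import torch

p = torch.rand(10, requires_grad=True, dtype=torch.float64)
opt = torch.optim.Adam([p])
p.grad = torch.rand_like(p)
opt.step()

print(p in opt.state)        # True: the parameter itself is the key
print(sorted(opt.state[p]))  # ['exp_avg', 'exp_avg_sq', 'step']
# so pre-built state has to be merged into opt.state[p], not opt.state
```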

@emcastillo emcastillo force-pushed the differentiable-adam branch 2 times, most recently from b8a6f60 to 37c23c1 Compare July 27, 2022 06:30
 opt.step()
-return (p,) + tuple(opt_differentiable_state.values())
+return (p,) + tuple(v for v in opt_differentiable_state.values() if v.requires_grad)
Collaborator

Is this actually needed? Gradcheck will filter out things that don't require gradients already

Collaborator Author

Actually it didn't, lol; step was returned and gradcheck detected errors with it.
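
A hedged sketch of the kind of check being discussed (an approximation of the test, not the exact `_diff_fn`): run one differentiable Adam step and let gradcheck verify gradients through the returned tensors, with the non-differentiable `step` counter left out of the outputs.

```python
import torch
from torch.autograd import gradcheck

def one_adam_step(p, grad, exp_avg, exp_avg_sq, step):
    p = p.clone()
    p.grad = grad
    opt = torch.optim.Adam([p], differentiable=True)
    opt.state[p].update({
        "exp_avg": exp_avg.clone(),
        "exp_avg_sq": exp_avg_sq.clone(),
        "step": step,
    })
    opt.step()
    state = opt.state[p]
    # `step` is deliberately not returned: it does not require grad.
    return p, state["exp_avg"], state["exp_avg_sq"]

kw = dict(dtype=torch.float64, requires_grad=True)
inputs = (
    torch.rand(10, **kw),                     # p
    torch.rand(10, **kw),                     # grad
    torch.rand(10, **kw),                     # exp_avg
    torch.rand(10, **kw),                     # exp_avg_sq
    torch.tensor(10.0, dtype=torch.float64),  # step, requires_grad=False
)
print(gradcheck(one_adam_step, inputs))       # True
```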

state = {}
p = torch.rand(10, requires_grad=True, dtype=torch.float64)
grad = torch.rand(10, requires_grad=True, dtype=torch.float64)
state['step'] = torch.tensor(10., requires_grad=False, dtype=torch.float64)
Collaborator

not requires grad?

Collaborator Author

This causes the gradcheck to fail if it's enabled ...

Collaborator

Oh, this is the step. Why is this float and not long?

Collaborator Author

Collaborator

Ok,
Can you leave a comment here that mentions that step is not a continuous variable (even though we define it as a float) and so it shouldn't require gradients?
Btw, can we have a nice assert in the adam() function to ensure that step never requires grad?

@@ -278,7 +280,7 @@ def _single_tensor_adam(params: List[Tensor],

 if amsgrad:
     # Maintains the maximum of all 2nd moment running avg. till now
-    torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
+    max_exp_avg_sqs[i][:] = torch.maximum(max_exp_avg_sqs[i].clone(), exp_avg_sq)
Collaborator

This is a very personal preference, but I'm not very confident with advanced indexing assignment in general. Can this be done with copy_() to make sure of the behavior we get?

But we should also avoid the clone in general.

Suggested change
-max_exp_avg_sqs[i][:] = torch.maximum(max_exp_avg_sqs[i].clone(), exp_avg_sq)
+prev_exp_avg = max_exp_avg_sqs[i].clone() if differentiable else max_exp_avg_sqs[i]
+max_exp_avg_sqs[i].copy_(torch.maximum(prev_exp_avg, exp_avg_sq))

Collaborator Author

@emcastillo emcastillo Aug 2, 2022

The problem here is that if we pass the differentiable argument, the JIT and other pieces of code will complain, and we will break compatibility since torchscript does not allow arguments with default values.
I wonder if we can store the flag in thread-local storage, set by the differentiable context manager, for this check?

Error if we don't clone

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [10]], which is output 0 of torch::autograd::CopyBackwards, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
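
For illustration, a hedged sketch of the thread-local idea floated above; the names (`differentiable_optimizer`, `is_differentiable`) are hypothetical, not an existing torch API.

```python
import threading
from contextlib import contextmanager

_tls = threading.local()

@contextmanager
def differentiable_optimizer():
    # Toggle a per-thread flag instead of threading a new argument through
    # the torchscript-compiled functional API.
    prev = getattr(_tls, "differentiable", False)
    _tls.differentiable = True
    try:
        yield
    finally:
        _tls.differentiable = prev

def is_differentiable() -> bool:
    return getattr(_tls, "differentiable", False)

# Usage: the functional adam()/sgd() could consult is_differentiable()
# instead of taking a `differentiable` keyword with a default value.
with differentiable_optimizer():
    assert is_differentiable()
assert not is_differentiable()
```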

Collaborator

It is only kwarg-only arguments with default values, IIRC?
You can make it positional, as done in pytorch/torch/optim/sgd.py (lines 172 to 175 in 1dfcad8), to work around the torchscript bug:

# kwonly args with defaults are not supported by functions compiled with torchscript issue #70627
# setting this as kwarg for now as functional API is compiled by torch/distributed/optim
has_sparse_grad: bool = None,
foreach: bool = None,

@emcastillo emcastillo force-pushed the differentiable-adam branch from 37c23c1 to 36f09c9 Compare August 3, 2022 05:13
Collaborator

@albanD albanD left a comment

Nit on a clone but SGTM otherwise.
The CI needs fixing though, no?

@@ -305,7 +315,7 @@ def _single_tensor_adam(params: List[Tensor],

 if amsgrad:
     # Maintains the maximum of all 2nd moment running avg. till now
-    torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
+    torch.maximum(max_exp_avg_sqs[i].clone(), exp_avg_sq, out=max_exp_avg_sqs[i])
Collaborator

Only if differentiable?

Collaborator Author

This was not needed; the differentiable case is now handled in the outer if :)
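
A hedged sketch (reconstructed from the hunks and the suggestion in this thread, not the verbatim adam.py) of what the amsgrad update looks like with the differentiable case handled in the outer branch:

```python
import torch
from typing import List

def _update_max_exp_avg_sq(max_exp_avg_sqs: List[torch.Tensor], exp_avg_sq: torch.Tensor,
                           i: int, capturable: bool, differentiable: bool) -> None:
    if capturable or differentiable:
        # Tensor-only path: clone so torch.maximum's saved input is not the
        # buffer that is overwritten right afterwards.
        prev = max_exp_avg_sqs[i].clone() if differentiable else max_exp_avg_sqs[i]
        max_exp_avg_sqs[i].copy_(torch.maximum(prev, exp_avg_sq))
    else:
        # Plain eager path keeps the cheaper out= variant.
        torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])

# Quick check of the differentiable branch:
bufs = [torch.rand(10, dtype=torch.float64, requires_grad=True).clone()]
new_sq = torch.rand(10, dtype=torch.float64, requires_grad=True)
_update_max_exp_avg_sq(bufs, new_sq, 0, capturable=False, differentiable=True)
bufs[0].sum().backward()  # gradients flow through the max update
```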

@@ -328,7 +338,7 @@ def _multi_tensor_adam(params: List[Tensor],
weight_decay: float,
eps: float,
maximize: bool,
capturable: bool):
differentiable: bool):
Collaborator

The foreach ops don't support autograd, so we should have an assert here that differentiable is False (same for SGD, btw).
This can be done in a separate PR, though.
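
A hedged sketch of the guard being requested (not the verbatim adam.py): the torch._foreach_* kernels are not differentiable, so the multi-tensor path should refuse to run in differentiable mode.

```python
import torch
from typing import List

def _multi_tensor_update(params: List[torch.Tensor], grads: List[torch.Tensor],
                         *, lr: float, differentiable: bool) -> None:
    assert not differentiable, "_foreach ops don't support autograd"
    # Illustrative foreach-style update (not the full Adam math):
    torch._foreach_add_(params, grads, alpha=-lr)

params = [torch.rand(5), torch.rand(5)]
grads = [torch.rand(5), torch.rand(5)]
_multi_tensor_update(params, grads, lr=0.1, differentiable=False)   # fine
try:
    _multi_tensor_update(params, grads, lr=0.1, differentiable=True)
except AssertionError as err:
    print(err)
```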

Collaborator Author

Added the assert to this one, will fix SGD too in the next PR :)

@emcastillo emcastillo force-pushed the differentiable-adam branch from 36f09c9 to b647df0 Compare August 16, 2022 05:48
Collaborator

@albanD albanD left a comment

SGTM!
Just a small nit about the error message.

@@ -152,7 +154,8 @@ def step(self, closure=None):

 if group['amsgrad']:
     max_exp_avg_sqs.append(state['max_exp_avg_sq'])

+if group['differentiable']:
+    assert not state['step'].requires_grad
Collaborator

Nit: make this a RuntimeError and add a nice error message.

Collaborator Author

done!
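
For reference, a hedged sketch of what the resulting check might look like (not the verbatim adam.py): `step` counts iterations, so it is not a continuous variable and must never require gradients, even in differentiable mode.

```python
import torch

def _check_step(state: dict, differentiable: bool) -> None:
    if differentiable and state['step'].requires_grad:
        raise RuntimeError(
            "`requires_grad` is not supported for `step` in differentiable mode"
        )

_check_step({'step': torch.tensor(10.0)}, differentiable=True)  # ok
try:
    _check_step({'step': torch.tensor(10.0, requires_grad=True)}, differentiable=True)
except RuntimeError as err:
    print(err)
```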

@emcastillo
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions
Contributor

Hey @emcastillo.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@emcastillo emcastillo added the release notes: nn (release notes category) and topic: improvements (topic category) labels Aug 17, 2022
facebook-github-bot pushed a commit that referenced this pull request Aug 17, 2022
Summary:
Continues [80938](#80938)

Pull Request resolved: #82205
Approved by: https://github.com/albanD

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/5aab57e112d244f0cf3bbab30db640e52a0c2c44

Reviewed By: seemethere

Differential Revision: D38788436

fbshipit-source-id: e35677b92267d068e044693acb9a7fcc96ed59c5
pytorchmergebot pushed a commit that referenced this pull request Aug 22, 2022
facebook-github-bot pushed a commit that referenced this pull request Aug 23, 2022
Summary:
Blocked by #82205

Pull Request resolved: #83578
Approved by: https://github.com/albanD

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/f0eb841d209f251d6a735827d4b903962d0d31b8

Reviewed By: seemethere

Differential Revision: D38911145

Pulled By: seemethere

fbshipit-source-id: 10bff92beba31fed5adacdf453b680c5acd8b19c