[optim] prevent unintended aliasing in lr_scheduler; update type annotations/docs #163120
Conversation
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163120

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0ddc6b4 with merge base 6cfb080.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
It would also be great to have purely closed-form / functional scheduler formula functions (maybe extracted out of the currently implemented schedulers)... Then they could be used even outside of the ParamGroup context.
Sounds interesting @vadimkantorov. Does this come up often? I'd be open to adding something like this soon, but I want to expand lr_scheduler test coverage first (once this and #163122 land).
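For illustration, a standalone closed-form helper might look like this (a sketch; `cosine_annealing_lr` is a hypothetical name, though the formula matches the closed form `CosineAnnealingLR` already uses internally):

```python
import math

def cosine_annealing_lr(step: int, base_lr: float, T_max: int, eta_min: float = 0.0) -> float:
    """Pure closed-form cosine annealing: no optimizer or param_group needed."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / T_max)) / 2

# Because it's a plain function, it applies to more than LRs, e.g. weight decay:
wd = [cosine_annealing_lr(t, base_lr=1e-2, T_max=100) for t in range(100)]
```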
Thanks for clarifying what's going on! One high-level thing I would ask this PR to clarify (and document) is whether users should expect `get_lr` or `get_last_lr` to return aliases vs. copies of the Tensor LRs. We should at the very least document what is happening now (e.g., after this change, `get_lr` would return copies instead of aliases). And since that behavior is changing from previous releases, it is BC-breaking. Though, for this case, I think it is fair to consider this a bug fix: `get_lr` used to return immutable floats, so we should follow that semantic and return un-aliased Tensors.
What @vadimkantorov is talking about is out of scope for this PR, and is what our get_closed_form functions attempt to help solve. (Vadim, correct me if I'm misunderstanding!) At one time there was an effort to deprecate all the non-closed forms, but it was unsuccessful because there were still use cases we didn't cover, and the person who maintained LRScheduler stopped doing so. So if you want a good can of worms, this situation may fall under that category 😛
Yes, my comment is out of scope. I meant to say that it's quite stunning how many small edge cases / bugs / unexpected complexities keep being found around the existing stateful/imperative schedulers. And I think that if closed-form, purely functional formulas were also available as standalone functions (even if not all use cases were covered), users could use them directly and experiment with other scheduler APIs (applicable not only to LR but also, e.g., to weight decay).
Yeah, great points @janeyx99 - the PR was underbaked. Should be better now! Here's an overview:

### Current behavior

There are several rough edges caused by patterns like this:

```python
self._last_lr: list[float] = [group["lr"] for group in self.optimizer.param_groups]
```
```python
import torch
from torch import optim, nn

lr = torch.tensor(0.1)
opt = optim.SGD([nn.Parameter(torch.tensor(1.0))], lr=lr)
scheduler = optim.lr_scheduler.ConstantLR(opt, total_iters=1)

x = scheduler.get_last_lr()[0]
print(x)  # tensor(0.0333)
opt.step()
scheduler.step()  # mutates x
print(x)  # tensor(0.1000)
x /= 2  # corrupts param_groups[0]["lr"]
```

This causes unexpected behavior in common workflows:

```python
history = []
for step in range(total_iters):
    opt.step()
    scheduler.step()
    history.append(scheduler.get_last_lr()[0])
print(history)  # All values == history[-1]
```
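Until then, the workaround is to snapshot an independent value each iteration (a sketch, continuing the loop above with the tensor LR from the repro):

```python
history.append(float(scheduler.get_last_lr()[0]))    # plain Python float
# or, staying in tensor land:
history.append(scheduler.get_last_lr()[0].clone())   # independent copy
```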
The aliasing comes from `get_lr` implementations like `StepLR`'s, which return aliases on some paths and fresh tensors on others:

```python
@override
def get_lr(self) -> list[float | Tensor]:
    """Compute the learning rate of each parameter group."""
    _warn_get_lr_called_within_step(self)
    if (self.last_epoch == 0) or (self.last_epoch % self.step_size != 0):
        return [group["lr"] for group in self.optimizer.param_groups]  # Aliased
    return [group["lr"] * self.gamma for group in self.optimizer.param_groups]  # New tensors
```

This isn't a huge issue since `get_lr` is mostly consumed internally by `step()`.

Aside: it's not clear to me why `get_lr` is public at all.
### New behavior

This PR makes the existing type behavior explicit, but without the alias footguns. Note that this PR doesn't add tests to enforce matching types.
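Rerunning the repro above under this PR should behave like this (a sketch of the intended post-fix semantics):

```python
x = scheduler.get_last_lr()[0]  # now a clone, not an alias of group["lr"]
opt.step()
scheduler.step()
print(x)  # unchanged by the step
x /= 2    # safe: only the local copy is mutated
```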
### Alternatives

```diff
- self._last_lr = [group["lr"] for group in self.optimizer.param_groups]
+ self._last_lr = [float(group["lr"]) for group in self.optimizer.param_groups]
```

Now that we use…
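For context, coercing to `float` unconditionally would be unattractive with tensor LRs (a sketch; the CUDA device is only an example):

```python
import torch

lr = torch.tensor(0.1, device="cuda")  # e.g. a tensor LR kept on-device
as_float = float(lr)  # forces a host sync and drops the Tensor type,
                      # defeating tensor-LR use cases such as capturable optimizers
```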
### Aside

What's the process for particularly marginal fixes? For example, I noticed:

```python
if factor > 1.0 or factor < 0:
    raise ValueError(
        "Constant multiplicative factor expected to be between 0 and 1."
    )
```

Which allows…
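For instance, one marginal value that check admits (both comparisons are `False` for NaN):

```python
factor = float("nan")
print(factor > 1.0 or factor < 0)  # False, so no ValueError is raised
print(not (0.0 <= factor <= 1.0))  # True: a containment check would reject it
```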
Overall looks great! Thank you for looking into it. I agree, let's land this now + build more on top later.

With `get_lr`... yeah, deprecating + privatizing it sounds good given the context you've mentioned. Can you check through GH/some issues to ensure that no one is relying on `get_lr` in a way we don't expect, though? (This is for a future PR; the current one is good as is.)

Also, is there a reason `get_lr` is not returning copies like `get_last_lr`? Does our internal usage depend on mutating the lr? (I would guess that we'd also want `get_lr` to return copies.)
```diff
 @override
 def get_lr(self) -> list[float | Tensor]:
     """Compute the learning rate of each parameter group."""
     _warn_get_lr_called_within_step(self)
     if (self.last_epoch == 0) or (self.last_epoch % self.step_size != 0):
-        return [group["lr"] for group in self.optimizer.param_groups]  # Returned group["lr"]s
+        return _param_groups_val_list(self.optimizer, "lr")  # Uses .clone()
     return [group["lr"] * self.gamma for group in self.optimizer.param_groups]  # New tensors
```
@janeyx99 Seeing that there are some issues with the docs builds: do you have any advice on debugging this? I've been trying to build the docs locally and I'm running into lots of errors. Is there an environment where you find they work reliably?
@pytorchbot merge
**Merge started**

Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team
**Merge failed**

Reason: 1 job has failed; the first few are: trunk / win-vs2022-cpu-py3 / test (default, 3, 3, lf.windows.4xlarge.nonephemeral)

Details for Dev Infra team: raised by workflow job
CI caught an interesting issue! The first argument to `cast` is evaluated at runtime (unlike the annotations themselves), and on the failing job's Python it attempts to bitwise-OR the types. The fix:

```diff
- values = cast(list[float | Tensor], self._get_closed_form_lr())
+ values = cast(list[Union[float, Tensor]], self._get_closed_form_lr())
```

Sorry for missing this @janeyx99 - just pushed a fix.
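A minimal reproduction of the failure mode, independent of PyTorch (a sketch; on Python < 3.10 the `X | Y` union syntax doesn't exist at runtime):

```python
from __future__ import annotations  # annotations become lazy strings

from typing import Union, cast

def annotated() -> list[float | int]:  # fine on any version: never evaluated
    return []

# cast()'s first argument is an ordinary runtime expression, so on Python < 3.10
# `float | int` raises TypeError: unsupported operand type(s) for |.
values = cast("list[float | int]", [])      # option 1: string literal type
values = cast(list[Union[float, int]], [])  # option 2: typing.Union, as in the fix
```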
@pytorchbot merge

Let’s go! c:
**Merge started**

Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team
[optim] prevent unintended aliasing in lr_scheduler; update type annotations/docs (#163120)

1. Prevents unintended aliasing of `self._last_lr`/`get_last_lr(...)` with `group["lr"]` when `group["lr"]` is a tensor.
2. Prevents unintended aliasing of `LRScheduler.base_lrs` with the `group["initial_lr"]`s.
3. Updates `test/optim/test_lrscheduler.py` to test tensor LRs.
4. Changes type annotations for `_last_lr`, `get_last_lr()`, `base_lrs`, `get_lr()`, and `_get_closed_form_lr()` from `list[float]` to `list[float | Tensor]`; adds documentation.

Fixes #163103

LR schedulers can behave in unexpected ways when using a tensor LR due to patterns like this:

```python
self._last_lr: list[float] = [group["lr"] for group in self.optimizer.param_groups]
```

This PR adds a helper to address this:

```python
def _param_groups_val_list(optimizer: Optimizer, key: str) -> list[Any]:
    """Create a list containing group[key] for each optimizer param_group.

    Prevents aliasing when group[key] could be a Tensor.
    Raises a KeyError when group[key] does not exist.
    """
    return [
        group[key].clone() if isinstance(group[key], Tensor) else group[key]
        for group in optimizer.param_groups
    ]
```

Pull Request resolved: #163120
Approved by: https://github.com/janeyx99
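A quick usage sketch of the helper (assuming the `_param_groups_val_list` definition above):

```python
import torch
from torch import nn, optim

opt = optim.SGD([nn.Parameter(torch.tensor(1.0))], lr=torch.tensor(0.1))
lrs = _param_groups_val_list(opt, "lr")  # clones the tensor LR
lrs[0] /= 2                              # mutates only the clone
print(opt.param_groups[0]["lr"])         # tensor(0.1000), unchanged
```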