
Issue with LBFGS optimizer: optimizer_.zero_grad no longer called inside of train_step_single #636

Closed
jwdink opened this issue May 13, 2020 · 8 comments


@jwdink commented May 13, 2020

It looks like in the following commit, the call to `optimizer_.zero_grad` was removed from `train_step_single` and moved into `train_step`:

a62e419#diff-5d1f8e13ab8b822b72e9b295cf7301c5L596

As far as I can tell, this isn't appropriate if you're using the LBFGS optimizer -- for that optimizer you need to zero the gradient on every call to the closure.
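For context, this is the canonical closure pattern for LBFGS in plain PyTorch -- a minimal sketch with synthetic data, showing that `zero_grad` has to happen inside the closure because `LBFGS.step` may evaluate it several times:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.LBFGS(model.parameters())
loss_fn = torch.nn.MSELoss()
X, y = torch.randn(32, 10), torch.randn(32, 1)

def closure():
    # LBFGS may call this several times per .step(); gradients must be
    # reset on every call, not just once per batch, or they accumulate
    # across line-search evaluations and the loss blows up.
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    return loss

optimizer.step(closure)
```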

I ran into this because nets that had previously worked on skorch 0.6.0 now quickly hit NaN loss on skorch 0.8.0. I reverted versions and hunted down this commit as the cause (see here for an example).

@BenjaminBossan (Collaborator) commented

@jwdink Thank you very much for reporting this error.

You are right, the commit you identified is the reason for the trouble.

This is actually a difficult problem to solve. The idea behind moving `zero_grad` out of each individual call to `optimizer.step` was to give the user more control over it, e.g. to implement gradient accumulation (as sketched below). This conflicts with optimizers that call the closure multiple times and need `zero_grad` after each call.
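To illustrate the gradient accumulation use case in plain PyTorch (a minimal sketch with synthetic data):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

accumulate_steps = 4
optimizer.zero_grad()
for i, (Xi, yi) in enumerate(batches):
    # scale the loss so the accumulated gradient matches one big batch
    loss = loss_fn(model(Xi), yi) / accumulate_steps
    loss.backward()
    if (i + 1) % accumulate_steps == 0:
        # only now take a step and reset the gradients; calling
        # zero_grad after every backward would defeat the accumulation
        optimizer.step()
        optimizer.zero_grad()
```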

As a quick fix, I would propose either restoring the previous state or moving the call to `zero_grad` inside the `step_fn` closure here:

`skorch/net.py`, lines 655 to 660 at 1f6b542:

```python
def step_fn():
    step = self.train_step_single(Xi, yi, **fit_params)
    step_accumulator.store_step(step)
    return step['loss']
self.optimizer_.step(step_fn)
self.optimizer_.zero_grad()
```

I believe it should fix your issue.
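As a subclass, the second option could look roughly like this -- a sketch only, based on the snippet above (internal names may differ between skorch versions):

```python
from skorch import NeuralNet

class LBFGSCompatibleNet(NeuralNet):
    def train_step(self, Xi, yi, **fit_params):
        step_accumulator = self.get_train_step_accumulator()

        def step_fn():
            # moved here from after optimizer_.step: LBFGS evaluates
            # this closure several times, and each evaluation must
            # start from clean gradients
            self.optimizer_.zero_grad()
            step = self.train_step_single(Xi, yi, **fit_params)
            step_accumulator.store_step(step)
            return step['loss']

        self.optimizer_.step(step_fn)
        return step_accumulator.get_step()
```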

Now what could be done in the long run? Hard to tell. Reverting the change removes the mentioned benefit. We could, hypothetically, add a parameter to `NeuralNet` that allows switching between the two behaviors. But that still requires users of LBFGS or similar optimizers to actively remember it, so it's not a very nice solution.

If we could somehow know which optimizers actually need `zero_grad` inside the closure, we could control the behavior dynamically. But I don't think we can know that, so this is not a solution either.
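To illustrate why: we could special-case the optimizers we know about, but an `isinstance` check like this (a hypothetical sketch) would miss any third-party optimizer with the same closure behavior:

```python
import torch

def needs_zero_grad_in_closure(optimizer):
    # LBFGS is the only built-in PyTorch optimizer whose step() may
    # re-evaluate the closure; third-party optimizers that do the same
    # would silently slip through this check.
    return isinstance(optimizer, torch.optim.LBFGS)
```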

I never use optimizers that require this, so I have almost no experience with them. Do you have any good ideas?

@jwdink (Author) commented May 13, 2020

> Now what could be done in the long run? Hard to tell. Reverting the change removes the mentioned benefit. We could, hypothetically, add a parameter to `NeuralNet` that allows switching between the two behaviors. But that still requires users of LBFGS or similar optimizers to actively remember it, so it's not a very nice solution.

Since the change was made to accommodate additional control, would it make sense to add that argument to `NeuralNet`, with a default value that restores the old behavior (i.e. calling `zero_grad` inside the closure)? That way, only users looking for additional control (e.g. accumulating gradients) have to override the default. For everyone else, all the optimizers from the `torch.optim` module would work with `NeuralNet` out of the box.

@BenjaminBossan (Collaborator) commented

> with a default value that restores the old behavior

Unfortunately, the changed behavior is part of our latest release. Therefore, if we changed it back to the old behavior, we would introduce yet another breaking change. But there is an argument to be made that, if possible, skorch should work with the optimizers provided by PyTorch, and those include LBFGS.

So we could introduce a parameter like `perform_zero_grad_each_step=True` (not a very nice name), and a user who wants to do custom gradient handling would need to turn it off.
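Hypothetically, with such a switch (the parameter name is only a placeholder from this discussion, not an actual skorch argument), usage might look like:

```python
import torch
from skorch import NeuralNetRegressor

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(10, 1)

    def forward(self, X):
        return self.lin(X)

# default: gradients are zeroed on every train step, so LBFGS works
net = NeuralNetRegressor(MyModule, optimizer=torch.optim.LBFGS)

# users doing custom gradient handling (e.g. accumulation) opt out
net_custom = NeuralNetRegressor(
    MyModule,
    perform_zero_grad_each_step=False,  # hypothetical parameter
)
```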

@ottonemo do you have a better idea?

@jwdink (Author) commented May 15, 2020

> So we could introduce a parameter like `perform_zero_grad_each_step=True` (not a very nice name), and a user who wants to do custom gradient handling would need to turn it off.

Yes, sorry if I was being unclear -- this is exactly what I was proposing.

@BenjaminBossan (Collaborator) commented

No, you were not unclear, I was just elaborating on what you said :)

BenjaminBossan pushed a commit that referenced this issue May 16, 2020
This fixes a bug introduced by moving the `optimizer.zero_grad()` call
outside of the train step function, making it incompatible with LBFGS
and other optimizers that call the train step several times per
batch and expect the gradient to be reset after each call (#636).
@BenjaminBossan (Collaborator) commented

@jwdink I created a PR (#639) to fix this issue. Could you please check whether your issue is solved when you use skorch from that branch?

@jwdink (Author) commented May 18, 2020

Yep, that fixes it. Thanks for the speedy response!

@BenjaminBossan (Collaborator) commented

Fixed by #639

BenjaminBossan added a commit that referenced this issue Aug 30, 2020
This release of skorch contains a few minor improvements and some nice additions. As always, we fixed a few bugs and improved the documentation. Our [learning rate scheduler](https://skorch.readthedocs.io/en/latest/callbacks.html#skorch.callbacks.LRScheduler) now optionally logs learning rate changes to the history; moreover, it now allows the user to choose whether an update step should be made after each batch or each epoch.

If you always longed for a metric that would just use whatever is defined by your criterion, look no further than [`loss_scoring`](https://skorch.readthedocs.io/en/latest/scoring.html#skorch.scoring.loss_scoring). Also, skorch now allows you to easily change the kind of nonlinearity to apply to the module's output when `predict` and `predict_proba` are called, by passing the `predict_nonlinearity` argument.
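A minimal sketch of both features with synthetic data (the custom nonlinearity here just mirrors what `'auto'` would infer for `CrossEntropyLoss`):

```python
import numpy as np
import torch
from skorch import NeuralNetClassifier
from skorch.scoring import loss_scoring

class MyClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(20, 2)

    def forward(self, X):
        return self.lin(X)

X = np.random.randn(100, 20).astype('float32')
y = np.random.randint(0, 2, size=100).astype('int64')

net = NeuralNetClassifier(
    MyClassifier,
    criterion=torch.nn.CrossEntropyLoss,
    max_epochs=3,
    # nonlinearity applied to the module output in predict/predict_proba;
    # the default 'auto' infers it from the criterion
    predict_nonlinearity=lambda z: torch.softmax(z, dim=-1),
)
net.fit(X, y)

# score the net with whatever its criterion computes
print(loss_scoring(net, X, y))
```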

Besides these changes, we improved the customization potential of skorch. First of all, the `criterion` is now set to `train` or `valid`, depending on the phase -- this is useful if the criterion should act differently during training and validation. Next we made it easier to add custom modules, optimizers, and criteria to your neural net; this should facilitate implementing architectures like GANs. Consult the [docs](https://skorch.readthedocs.io/en/latest/user/neuralnet.html#subclassing-neuralnet) for more on this. Conveniently, [`net.save_params`](https://skorch.readthedocs.io/en/latest/net.html#skorch.net.NeuralNet.save_params) can now persist arbitrary attributes, including those custom modules.
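Continuing the sketch above, persisting the criterion alongside the other components might look like this (the `f_<name>` keys follow the convention `save_params` uses for registered attributes):

```python
# persist module weights, criterion, optimizer, and history; custom
# modules registered on the net can be saved via the same f_<name> pattern
net.save_params(
    f_params='model.pt',
    f_criterion='criterion.pt',
    f_optimizer='optimizer.pt',
    f_history='history.json',
)
```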
As always, these improvements wouldn't have been possible without the community. Please keep asking questions, raising issues, and proposing new features. We are especially grateful to those community members, old and new, who contributed via PRs:

```
Aaron Berk
guybuk
kqf
Michał Słapek
Scott Sievert
Yann Dubois
Zhao Meng
```

Here is the full list of all changes:

### Added

- Added the `event_name` argument for `LRScheduler` for optional recording of LR changes inside `net.history`. NOTE: supported only in PyTorch >= 1.4
- Made it easier to add custom modules or optimizers to a neural net class by automatically registering them where necessary and by making them available to `set_params`
- Added the `step_every` argument for `LRScheduler` to set whether the scheduler step should be taken on every epoch or on every batch.
- Added the `scoring` module with `loss_scoring` function, which computes the net's loss (using `get_loss`) on provided input data.
- Added a parameter `predict_nonlinearity` to `NeuralNet` which allows users to control the nonlinearity to be applied to the module output when calling `predict` and `predict_proba` (#637, #661)
- Added the possibility to save the criterion with `save_params` and with checkpoint callbacks
- Added the possibility to save custom modules with `save_params` and with checkpoint callbacks

### Changed

- Removed support for schedulers with a `batch_step()` method in `LRScheduler`.
- Raise `FutureWarning` in `CVSplit` when `random_state` is not used. Will raise an exception in a future release (#620)
- The behavior of method `net.get_params` changed to make it more consistent with sklearn: it will no longer return "learned" attributes like `module_`; therefore, functions like `sklearn.base.clone`, when called with a fitted net, will no longer return a fitted net but instead an uninitialized net; if you want a copy of a fitted net, use `copy.deepcopy` instead;`net.get_params` is used under the hood by many sklearn functions and classes, such as `GridSearchCV`, whose behavior may thus be affected by the change. (#521, #527)
- Raise `FutureWarning` when using `CyclicLR` scheduler, because the default behavior has changed from taking a step every batch to taking a step every epoch. (#626)
- Set train/validation on criterion if it's a PyTorch module (#621)
- Don't pass `y=None` to `NeuralNet.train_split` to enable the direct use of split functions without positional `y` in their signatures. This is useful when working with unsupervised data (#605).
- `to_numpy` is now able to unpack dicts and lists/tuples (#657, #658)
- When using `CrossEntropyLoss`, softmax is now automatically applied to the output when calling `predict` or `predict_proba`

### Fixed

- Fixed a bug where `CyclicLR` scheduler would update during both training and validation rather than just during training.
- Fixed a bug introduced by moving the `optimizer.zero_grad()` call outside of the train step function, making it incompatible with LBFGS and other optimizers that call the train step several times per batch (#636)
- Fixed pickling of the `ProgressBar` callback (#656)