
Tensorboard logging #512

Merged (17 commits) on Sep 12, 2019

Conversation

@BenjaminBossan (Collaborator) commented Aug 18, 2019

Fixes #510

This includes the changes as proposed by @taketwo, namely to automatically log everything except keys that were opted out, the same as PrintLog.

I would like to update one of the existing notebooks to showcase the use. My candidate would be MNIST-torchvision, or do you think this would be too much for this notebook, @ottonemo? Of course, it's difficult to show the tensorboard inside a Jupyter notebook; I would probably add a screenshot instead.

.. that it will share with TensorBoard.

In addition, improve the docstring and move the handling of keys_ignored
being a string into the initialize method (so that init params are not
modified).
There will be an update to one of the existing notebooks to use this
callback and show how to extend it.
Skip TensorBoard tests if tensorboard is not installed.
@taketwo (Contributor) left a comment

As it is now, unlike PrintLog, TensorBoard does not ignore batches. Is this desired? To be honest I don't recall what's stored there, but in my own implementation of TensorBoard callback I ignored this key.

Also, I ignored dur by default because TensorBoard stores timestamps for all events, and thus for the interested user it's easy to check the timing. Though maybe there are use cases where it's important to visualize the plot of epoch durations.

Review comment on skorch/callbacks/logging.py (outdated, resolved)
Co-Authored-By: Sergey Alexandrov <alexandrov88@gmail.com>
@BenjaminBossan (Collaborator, Author) commented Aug 19, 2019

@taketwo Tensorboard also ignores batches, they are added here.

When it comes to the completeness, I didn't add every single detail, only those that I thought were most relevant to the users. But if you want to, I can change this.

Regarding duration, if there is an easy way to show time per epoch in tensorboard, I can remove the duration. Otherwise I think it's better to log too much than too little.

@taketwo (Contributor) commented Aug 20, 2019

> @taketwo Tensorboard also ignores batches, they are added here.

Sorry, overlooked this. So both callbacks that use the filter_log_keys() utility function ignore the batches key. Then the question is: why is it not ignored "globally" in filter_log_keys() (like e.g. epoch)?

> When it comes to the completeness, I didn't add every single detail, only those that I thought were most relevant to the users. But if you want to, I can change this.

You are right, it's probably not necessary to list every single detail. But for me the sentence

> Note that keys starting with 'event_' or ending on '_best' are ignored by default.

reads like it's an exhaustive list of default ignores, which it is not. So perhaps it would be helpful to note that there are more default ignores. Also from the docstring (without checking code) it's not clear what will happen if user supplies keys_ignored argument. Will it replace the defaults or will they be merged?
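For illustration, a merging behavior could look like the following sketch (not skorch's actual implementation; the default set and the helper's exact signature are assumptions):

```python
# Illustrative sketch: user-supplied keys_ignored are merged with the
# defaults rather than replacing them (names and defaults are assumed).
DEFAULT_IGNORES = {'epoch', 'batches'}

def filter_log_keys(keys, keys_ignored=None):
    """Yield only the history keys that should be logged."""
    keys_ignored = set(keys_ignored or ())
    for key in keys:
        if (
                key not in DEFAULT_IGNORES
                and key not in keys_ignored
                and not key.startswith('event_')
                and not key.endswith('_best')
        ):
            yield key
```

Under this sketch, `filter_log_keys(['epoch', 'train_loss', 'valid_loss_best', 'event_cp', 'dur'], keys_ignored=['dur'])` would yield only `train_loss`, since the user's ignores are added on top of the defaults.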

> Regarding duration, if there is an easy way to show time per epoch in tensorboard, I can remove the duration. Otherwise I think it's better to log too much than too little.

No, I don't think it is possible to have a time per epoch plot, so the user will need to hover data points and compute time differences herself. Agree, too much is better than too little.

Also demonstrates how to subclass TensorBoard callback.
As suggested by taketwo
…korch into tensorboard-logging
@BenjaminBossan (Collaborator, Author) commented Aug 22, 2019

I added TensorBoard to the torchvision notebook, including an example that shows how to subclass it. Also clarified the docstrings, as suggested by @taketwo.

@BenjaminBossan marked this pull request as ready for review Aug 22, 2019
@BenjaminBossan requested review from thomasjpfan and ottonemo Aug 22, 2019
... bias = extract_bias(net.module_)
... epoch = net.history[-1, 'epoch']
... self.writer.add_histogram('bias', bias, global_step=epoch)
... super().on_epoch_end(net, **kwargs) # call super last

@thomasjpfan (Member) commented Aug 23, 2019:
This does not have to be last?

@BenjaminBossan (Collaborator, Author) commented Aug 24, 2019:

True, it doesn't have to. The reason why I say this is that sometimes, calling super first can lead to subtle errors. E.g. during on_batch_end, first_batch_ is set to false. If a user were to override this and call super first, then implement something that depends on first_batch_ being true, it would never trigger.
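The pitfall can be sketched with a toy callback hierarchy (illustrative class names, not skorch's actual classes):

```python
# Toy illustration of why calling super last is safer: the base class
# flips first_batch_ in on_batch_end, so a subclass that calls super
# first never sees the flag set to True.
class BaseCallback:
    def __init__(self):
        self.first_batch_ = True

    def on_batch_end(self, net, **kwargs):
        self.first_batch_ = False  # base class flips the flag

class SuperFirst(BaseCallback):
    def on_batch_end(self, net, **kwargs):
        super().on_batch_end(net, **kwargs)  # flag already flipped here
        if self.first_batch_:
            self.saw_first = True  # never triggers

class SuperLast(BaseCallback):
    def on_batch_end(self, net, **kwargs):
        if self.first_batch_:
            self.saw_first = True  # triggers on the first batch
        super().on_batch_end(net, **kwargs)  # call super last
```

With `SuperFirst`, the `saw_first` branch is unreachable; with `SuperLast`, it fires on the first batch as intended.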

@ottonemo (Collaborator) commented Aug 25, 2019

I'm not done testing but the first run of the notebook with PyTorch 1.2 failed at

net.fit(mnist_train, y=y_train);

with

Expected object of backend CUDA but got backend CPU for argument #4 'mat1'
Error occurs, No graph saved

You can find the whole traceback here.

I haven't changed anything in the notebook and DEVICE is set to 'cuda'.

@BenjaminBossan (Collaborator, Author) commented Aug 25, 2019

@ottonemo Not sure why you get this error and I don't, but you surely uncovered a more fundamental problem. Right now, TensorBoard just passes the module and X to SummaryWriter.add_graph, which does the calling. However, this circumvents the logic currently contained in net.infer, for instance the device handling.

The good news is that we could move that specific logic to TensorBoard.add_graph. The bad news is that we cannot move the logic of how forward is called to TensorBoard, since this is controlled on the pytorch side. This is a problem when X is a dict, because then we want to call forward(**X_dict), but pytorch will just call forward(*X_dict) (meaning we only receive the keys).

I don't see an elegant solution to this. Does any of you have an idea? I opened a topic on the PyTorch forums to ask for help there as well: https://discuss.pytorch.org/t/use-module-with-kwargs-in-summarywriter-add-graph/54263
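The dict problem can be illustrated in plain Python (a toy forward function, not the actual module):

```python
# Toy example of the unpacking issue: iterating a dict yields its keys,
# so positional unpacking passes key names instead of the tensors.
def forward(x, mask):
    return x, mask

X_dict = {'x': [1.0, 2.0], 'mask': [1, 0]}

# What net.infer effectively wants to do for dict inputs:
wanted = forward(**X_dict)   # forward(x=[1.0, 2.0], mask=[1, 0])

# What positional unpacking on the pytorch side would do:
broken = forward(*X_dict)    # forward('x', 'mask') -- only the keys
```

The second call "succeeds" syntactically but hands the module the strings `'x'` and `'mask'` rather than the data, which is exactly why the graph tracing cannot simply be delegated.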

@BenjaminBossan (Collaborator, Author) commented Aug 25, 2019

I added a test that shows the discussed failing behavior with a dict input.

@ottonemo (Collaborator) commented Aug 26, 2019

> @ottonemo Not sure why you get this error and I don't, but you surely uncovered a more fundamental problem. Right now, TensorBoard just passes the module and X to SummaryWriter.add_graph, which does the calling. However, this circumvents the logic currently contained in net.infer, for instance the device handling.
>
> The good news is that we could move that specific logic to TensorBoard.add_graph. The bad news is that we cannot move the logic of how forward is called to TensorBoard, since this is controlled on the pytorch side. This is a problem when X is a dict, because then we want to call forward(**X_dict), but pytorch will just call forward(*X_dict) (meaning we only receive the keys).
>
> I don't see an elegant solution to this. Does anyone of you have an idea? I opened a topic on the pytorch forums to ask for help there as well: https://discuss.pytorch.org/t/use-module-with-kwargs-in-summarywriter-add-graph/54263

Ideally we would have a single point of control for how the data is converted and passed to the model.
Currently, this is net.infer and, if possible, it would be nice to have the graph generated the same way.

I tried passing net.infer directly to torch.jit.trace which then fails with:

RuntimeError: Cannot insert a Tensor that requires grad as a constant. 
Consider making it a parameter or input, or detaching the gradient

There is some background information in pytorch/pytorch#20101; basically, it seems that the trace is not able to gather the module's parameters this way.

@thomasjpfan (Member) commented Aug 27, 2019

Tensorboard logging is already pretty useful with only the metrics. We can have this PR include just the metrics, and have another PR for add_graph?

It works most of the time but not all of the time. Therefore, we
remove this functionality for now, so that we have at least the scalar
logging (which is more useful anyway).

Still show in the notebook how to add graphs.
@BenjaminBossan (Collaborator, Author) commented Aug 31, 2019

> Tensorboard logging is already pretty useful with only the metrics. We can have this PR include just the metrics, and have another PR for add_graph?

That's a good idea, I removed the graph functionality for now. I left the tests there (skipping them for now) and also show in the notebook how to add the graph yourself.

Please review again.

@ottonemo (Collaborator) left a comment

I think it is a good base for users to build upon. A few notes after testing more thoroughly:

  1. there's a note missing in the notebook on how to start tensorboard, e.g.

    To use tensorboard, run `tensorboard --logdir runs` in the directory you are running this notebook in.

  2. there's explicitly no experiment management; running net.fit(mnist_train, y_train) more than once will write into the same log dir

I don't think we should address (2) here but we should definitely think about making it as easy as possible to manage experiments and possibly have the increase of experiment runs handled by skorch.

@BenjaminBossan (Collaborator, Author) commented Sep 11, 2019

> there's explicitly no experiment management; running net.fit(mnist_train, y_train) more than once will write into the same log dir

> we should definitely think about making it as easy as possible to manage experiments and possibly have the increase of experiment runs handled by skorch

Since we leave the creation of the SummaryWriter to the user, I would consider this the user's job as well. Otherwise, we would need to create it ourselves, with all the trouble and dependencies that come with it. Maybe there is an easy way to achieve what you suggested, but I'm not aware of it.
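One way a user could handle this on their side is to create a fresh log directory per run (a sketch; the commented-out writer and net lines assume the SummaryWriter-based API discussed in this PR):

```python
import os
from datetime import datetime

def make_run_dir(root='runs', prefix='mnist'):
    """Create a unique, timestamped log directory for each training run."""
    run_name = '{}_{}'.format(prefix, datetime.now().strftime('%Y%m%d_%H%M%S'))
    run_dir = os.path.join(root, run_name)
    os.makedirs(run_dir, exist_ok=True)
    return run_dir

# Assumed usage with the callback from this PR:
# writer = SummaryWriter(log_dir=make_run_dir())
# net = NeuralNetClassifier(..., callbacks=[TensorBoard(writer)])
```

Since a new writer is created before each fit, successive runs land in separate directories and show up as separate runs in the tensorboard UI.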

Addressing reviewer suggestion.
return

global_step = global_step if global_step is not None else hist['epoch']
try:

@thomasjpfan (Member) commented Sep 11, 2019:
from contextlib import suppress

with suppress(NotImplementedError):
    self.writer...

tensorboard_installed = False
try:
    import tensorboard

@thomasjpfan (Member) commented Sep 11, 2019:

Just to keep pylint happy in the future:

# pylint: disable=unused-import
import tensorboard
Addresses reviewer comment
Addresses reviewer comment
@ottonemo merged commit b8cb53f into master Sep 12, 2019
4 checks passed:
Travis CI - Branch Build Passed
Travis CI - Pull Request Build Passed
continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed
BenjaminBossan added a commit that referenced this pull request Sep 15, 2019
The new behavior of get_params will be to not return any "learned"
attributes such as "module_".

This PR implements the new behavior but doesn't switch to it yet to
give users time to adjust their code. This is a breaking change but it
is necessary since it is the "correct" behavior; the old one could
introduce subtle bugs in rare situations (e.g. `GridSearchCV` with a
net that has `warm_start=True`).

The PR also includes tests that are currently failing but that are
passing under the new behavior. When switching to the new behavior,
all tests, including these new ones, should pass (they currently
xfail).

Relates to #512
@BenjaminBossan BenjaminBossan deleted the tensorboard-logging branch Sep 22, 2019