Add basic tensorboard support #985
Conversation
@jysohn23 how would I add …
test/test_train_imagenet.py (Outdated)
    return correct / total_samples

    accuracy = 0.0
    writer = SummaryWriter(log_dir='/tmp/imagenet_tensorboard')
This should definitely be the default, and the log_dir must be a command line argument.
Defaulted to FLAGS.logdir
I'll send you links to where we build our GCE images and we can add them there.
test/test_train_imagenet.py (Outdated)
    return correct / total_samples

    accuracy = 0.0
    writer = SummaryWriter(log_dir='/tmp/imagenet_tensorboard')
Could we add FLAGS.logdir? What do you think?
Defaulted to FLAGS.logdir
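The flag-defaulted writer discussed above might look roughly like this sketch (argparse is shown only for illustration; the test script defines its own FLAGS, and the lazy import and None guard are assumptions made so the example stands alone):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag; the actual test uses its own FLAGS parsing.
parser.add_argument('--logdir', type=str, default=None,
                    help='Where to write TensorBoard summaries; disabled if unset.')
FLAGS = parser.parse_args([])  # empty argv here, purely for illustration

# Create a writer only when the user asked for one.
writer = None
if FLAGS.logdir is not None:
    # Import lazily so tensorboard is only required when summaries are enabled.
    from torch.utils.tensorboard import SummaryWriter
    writer = SummaryWriter(log_dir=FLAGS.logdir)
```

With no `--logdir` given, `FLAGS.logdir` stays `None` and no writer (and no `runs/` directory) is created.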
test/test_train_imagenet.py (Outdated)
    accuracy = sum(accuracies) / len(accuracies)
    print("Epoch: {}".format(epoch))
    accuracy = mean(accuracies)
    writer.add_scalar('Accuracy/test', accuracy, epoch)
Can we add additional scalar metrics such as loss, examples/sec, etc. (where the x-axis is the global step number)? Similar to the TensorBoard metrics we see with TF Estimator. We can maybe do this in a follow-up PR?
I'll be adding more in some follow-up PRs. Still deciding on a clean way to do the others, since we'll want to average over all devices during the training loop. Or we could have N different lines on the loss curves, where N = num_cores.
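If we go with a single averaged curve rather than N per-core lines, the reduction could be as simple as the following sketch (the multi-device plumbing that gathers per-core values is not shown and would be torch_xla-specific):

```python
def average_across_devices(per_device_values):
    # Collapse one metric reported by each core into a single scalar,
    # so TensorBoard shows one curve instead of num_cores curves.
    return sum(per_device_values) / len(per_device_values)

# e.g. losses gathered from 8 cores at the same global step
step_loss = average_across_devices([0.41, 0.39, 0.40, 0.42, 0.40, 0.38, 0.41, 0.39])
```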
Can we clean up the unintentionally added runs/Sep03_23-44-20_zcain-highmem-96-build-from-head/events.out.tfevents.1567554260.zcain-highmem-96-build-from-head.54803.0 etc. output from tb? Thanks.
Yup yup, removed.
test/test_train_imagenet.py (Outdated)
    return correct / total_samples

    accuracy = 0.0
    writer = SummaryWriter(log_dir=FLAGS.logdir)
This needs to be created only if logdir is not None ... unless SummaryWriter is a noop if logdir is None.
I tried running this without specifying --logdir and verified that None was being passed into SummaryWriter. When this happens, SummaryWriter works without complaining and defaults to creating a new directory: "runs/"
No, we do not want that 😉
So let's create a writer only if the user wants one (logdir not None).
If writer is None when logdir is None, we will need a check for if writer: at every spot where we call writer.add_scalar. Not a huge deal, but it adds clutter.
That's OK. Dumping stuff into a random location is worse 😉
Added a helper method so it'll be a 1-line call whenever we write a summary data point. Verified that nothing gets written with --logdir=None and that valid summaries get written with --logdir=non-None
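The one-line helper described here might look roughly like the sketch below (the recording class is a stand-in for torch.utils.tensorboard.SummaryWriter, included only so the example is self-contained):

```python
def add_scalar_to_summary(summary_writer, metric_name, scalar_value, global_step):
    # One-line call site: silently skip when --logdir was not given,
    # so callers never need their own "if writer:" check.
    if summary_writer is None:
        return
    summary_writer.add_scalar(metric_name, scalar_value, global_step)

class _RecordingWriter:
    # Minimal stand-in for SummaryWriter, for illustration only.
    def __init__(self):
        self.points = []

    def add_scalar(self, tag, value, step):
        self.points.append((tag, value, step))

recorder = _RecordingWriter()
add_scalar_to_summary(recorder, 'Accuracy/test', 71.5, 1)
add_scalar_to_summary(None, 'Accuracy/test', 71.5, 1)  # summaries disabled: no-op
```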
test/test_train_imagenet.py (Outdated)
    return MODEL_PROPERTIES.get(FLAGS.model, MODEL_PROPERTIES['DEFAULT'])[key]

    def add_scalar_to_summary(summary_writer, metric_name, y_value, x_value):
Move this to test_utils.py.
Also, what are those x,y values?
Moved and renamed x,y to more closely match the documentation: https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_scalar
In my mind, each scalar is an (x, y) coordinate on the TensorBoard graph. But hopefully the new naming is clearer.
We may want to change MNIST and CIFAR10 as well...
Verified that my CL on the google3 side has propagated and that the pytorch-nightly conda env now has tensorboard installed.
This change works on pytorch-nightly, but on pytorch-0.1 I needed to run …
Hmm, pytorch-0.1 should not be running newer code.
Ok, sounds good. I will leave it alone for now and then we can add …
Tested on a fresh red VM that I built from head using build_torch_wheels.sh.