Add basic tensorboard support #985
Conversation
@jysohn23 how would I add …
test/test_train_imagenet.py (Outdated)
    return correct / total_samples

    accuracy = 0.0
    writer = SummaryWriter(log_dir='/tmp/imagenet_tensorboard')
This should definitely be the default, and the log_dir must be a command line argument.
Defaulted to FLAGS.logdir
I'll send you links to where we build our GCE images and we can add them there.
test/test_train_imagenet.py (Outdated)
    return correct / total_samples

    accuracy = 0.0
    writer = SummaryWriter(log_dir='/tmp/imagenet_tensorboard')
Could we add FLAGS.logdir? What do you think?
Defaulted to FLAGS.logdir
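The flag-defaulted writer discussed above might look roughly like this sketch (argparse is shown only for illustration; the test script defines its own FLAGS, and the lazy import and None guard are assumptions made so the example stands alone):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag; the actual test uses its own FLAGS parsing.
parser.add_argument('--logdir', type=str, default=None,
                    help='Where to write TensorBoard summaries; disabled if unset.')
FLAGS = parser.parse_args([])  # empty argv here, purely for illustration

# Create a writer only when the user asked for one.
writer = None
if FLAGS.logdir is not None:
    # Import lazily so tensorboard is only required when summaries are enabled.
    from torch.utils.tensorboard import SummaryWriter
    writer = SummaryWriter(log_dir=FLAGS.logdir)
```

With no `--logdir` given, `FLAGS.logdir` stays `None` and no writer (and no `runs/` directory) is created.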
test/test_train_imagenet.py (Outdated)
    accuracy = sum(accuracies) / len(accuracies)
    print("Epoch: {}".format(epoch))
    accuracy = mean(accuracies)
    writer.add_scalar('Accuracy/test', accuracy, epoch)
Can we add additional scalar metrics such as loss, examples/sec, etc. (where the x-axis is the global step number)? Similar to the TensorBoard metrics we see with TF Estimator. We can maybe do this in a follow-up PR?
I'll be adding more in some follow-up PRs. Still deciding on a clean way to do the others, since we'll want to average over all devices during the training loop. Or we could have N different lines on the loss curves, where N = num_cores.
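If we go with a single averaged curve rather than N per-core lines, the reduction could be as simple as the following sketch (the multi-device plumbing that gathers per-core values is not shown and would be torch_xla-specific):

```python
def average_across_devices(per_device_values):
    # Collapse one metric reported by each core into a single scalar,
    # so TensorBoard shows one curve instead of num_cores curves.
    return sum(per_device_values) / len(per_device_values)

# e.g. losses gathered from 8 cores at the same global step
step_loss = average_across_devices([0.41, 0.39, 0.40, 0.42, 0.40, 0.38, 0.41, 0.39])
```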
Can we clean up the unintentionally added runs/Sep03_23-44-20_zcain-highmem-96-build-from-head/events.out.tfevents.1567554260.zcain-highmem-96-build-from-head.54803.0 etc. output from tb? Thanks.
Yup yup, removed.
test/test_train_imagenet.py (Outdated)
    return correct / total_samples

    accuracy = 0.0
    writer = SummaryWriter(log_dir=FLAGS.logdir)
This needs to be created only if logdir is not None ... unless SummaryWriter is a noop if logdir is None.
I tried running this without specifying --logdir and verified that None was being passed into SummaryWriter. When this happens, SummaryWriter works without complaining and defaults to creating a new directory: "runs/"
No, we do not want that 😉
So let's create a writer only if the user wants one (logdir not None).
If writer is None when logdir is None, we will need a check for if writer: at every spot where we call writer.add_scalar. Not a huge deal, but it adds clutter.
That's OK. Dumping stuff into a random location is worse 😉
Added a helper method so it'll be a 1-line call whenever we write a summary data point. Verified that nothing gets written with --logdir=None and that valid summaries get written with --logdir=non-None
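The one-line helper described here might look roughly like the sketch below (the recording class is a stand-in for torch.utils.tensorboard.SummaryWriter, included only so the example is self-contained):

```python
def add_scalar_to_summary(summary_writer, metric_name, scalar_value, global_step):
    # One-line call site: silently skip when --logdir was not given,
    # so callers never need their own "if writer:" check.
    if summary_writer is None:
        return
    summary_writer.add_scalar(metric_name, scalar_value, global_step)

class _RecordingWriter:
    # Minimal stand-in for SummaryWriter, for illustration only.
    def __init__(self):
        self.points = []

    def add_scalar(self, tag, value, step):
        self.points.append((tag, value, step))

recorder = _RecordingWriter()
add_scalar_to_summary(recorder, 'Accuracy/test', 71.5, 1)
add_scalar_to_summary(None, 'Accuracy/test', 71.5, 1)  # summaries disabled: no-op
```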
test/test_train_imagenet.py (Outdated)
    return MODEL_PROPERTIES.get(FLAGS.model, MODEL_PROPERTIES['DEFAULT'])[key]

    def add_scalar_to_summary(summary_writer, metric_name, y_value, x_value):
Move this to test_utils.py.
Also, what are those x,y values?
Moved and renamed x,y to more closely match the documentation: https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_scalar
In my mind, each scalar is an (x, y) coordinate on the TensorBoard graph. But hopefully the new naming is clearer.
We may want to change MNIST and CIFAR10 as well...
Verified that my CL on the google3 side has propagated and that the pytorch-nightly conda env now has tensorboard installed.
This change works on pytorch-nightly, but on pytorch-0.1 I needed to run …
Hmm, pytorch-0.1 should not be running newer code.
Ok, sounds good. I will leave it alone for now and then we can add …
Tested on a fresh red VM that I built from head using build_torch_wheels.sh.