
Conversation

@zcain117 (Collaborator) commented Sep 3, 2019

Tested on a fresh red VM that I built from head using build_torch_wheels.sh

@zcain117 requested a review from jysohn23, September 3, 2019 23:22
@zcain117 (Collaborator, Author) commented Sep 3, 2019

@jysohn23 how would I add tensorboard to the pytorch-nightly and pytorch-0.1 conda environments that come with the pytorch image, and also to the Docker images?

return correct / total_samples

accuracy = 0.0
writer = SummaryWriter(log_dir='/tmp/imagenet_tensorboard')
Collaborator:

This should definitely be the default, and the log_dir must be a command line argument.

Author:

Defaulted to FLAGS.logdir

@jysohn23 left a comment:

I'll send you links to where we build our GCE images and we can add them there.

return correct / total_samples

accuracy = 0.0
writer = SummaryWriter(log_dir='/tmp/imagenet_tensorboard')
Collaborator:

Could we add FLAGS.logdir? What do you think?

Author:

Defaulted to FLAGS.logdir

accuracy = sum(accuracies) / len(accuracies)
print("Epoch: {}".format(epoch))
accuracy = mean(accuracies)
writer.add_scalar('Accuracy/test', accuracy, epoch)
Collaborator:

Can we add additional scalar metrics such as loss, examples/sec, etc. (where the x-axis is the global step number)? Similar to the tensorboard metrics we see with TF estimator. Maybe we can do this in a follow-up PR?

Author:

I'll be adding more in some follow-up PRs. I'm still deciding on a clean way to do the others, since we'll want to average over all devices during the training loop. Or we could have N different lines on the loss curves, where N = num_cores.
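The two options being weighed could look something like the sketch below. The tag scheme and the FakeWriter stand-in are hypothetical, for illustration only (the real script would use torch.utils.tensorboard.SummaryWriter):

```python
class FakeWriter:
    """Illustrative stand-in for torch.utils.tensorboard.SummaryWriter."""

    def __init__(self):
        self.points = []

    def add_scalar(self, tag, value, step):
        # Record each (tag, value, step) point instead of writing an event file.
        self.points.append((tag, value, step))


def log_loss_per_device(writer, device_losses, step):
    # Option A: N separate lines, one per core, tagged by device ordinal.
    for ordinal, loss in enumerate(device_losses):
        writer.add_scalar('Loss/train/core_{}'.format(ordinal), loss, step)
    # Option B: a single line averaged across all cores.
    mean_loss = sum(device_losses) / len(device_losses)
    writer.add_scalar('Loss/train/mean', mean_loss, step)
```

Option A makes per-core stragglers visible but clutters the dashboard as num_cores grows; Option B keeps one clean curve at the cost of hiding per-device variance.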

@jysohn23 left a comment:

Can we clean up the unintentionally added runs/Sep03_23-44-20_zcain-highmem-96-build-from-head/events.out.tfevents.1567554260.zcain-highmem-96-build-from-head.54803.0 etc. TensorBoard output? Thanks.

@zcain117 (Collaborator, Author) commented Sep 3, 2019

Can we clean up the unintentionally added runs/Sep03_23-44-20_zcain-highmem-96-build-from-head/events.out.tfevents.1567554260.zcain-highmem-96-build-from-head.54803.0 etc. TensorBoard output? Thanks.

Yup yup, removed

return correct / total_samples

accuracy = 0.0
writer = SummaryWriter(log_dir=FLAGS.logdir)
Collaborator:

This needs to be created only if logdir is not None ... unless SummaryWriter is a no-op when logdir is None.

Author:

I tried running this without specifying --logdir and verified that None was being passed into SummaryWriter. When this happens, SummaryWriter works without complaining and defaults to creating a new directory: "runs/"

Collaborator:

No, we do not want that 😉
So let's create a writer only if the user wants one (logdir not None).

Author:

If writer is None when logdir is None, we will need to check if writer: at every spot where we call writer.add_scalar. Not a huge deal, but it adds clutter.

Collaborator:

That's OK. Dumping stuff into a random location is worse 😉
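The guard being discussed can be very small. A sketch, with the writer construction behind a factory argument only so the idea can be shown without importing torch (in the actual script it would simply be writer = SummaryWriter(log_dir=FLAGS.logdir) if FLAGS.logdir else None):

```python
def maybe_create_writer(logdir, writer_factory):
    """Create a summary writer only when the user supplied --logdir.

    Returns None otherwise, so no stray 'runs/' directory is created
    as a side effect of constructing a default SummaryWriter.
    """
    if logdir is None:
        return None
    return writer_factory(log_dir=logdir)
```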

Author:

Added a helper method so it'll be a 1-line call whenever we write a summary data point. Verified that nothing gets written when --logdir is unset, and that valid summaries get written when --logdir is set.

return MODEL_PROPERTIES.get(FLAGS.model, MODEL_PROPERTIES['DEFAULT'])[key]


def add_scalar_to_summary(summary_writer, metric_name, y_value, x_value):
Collaborator:

Move this to test_utils.py.
Also, what are those x,y values?

Author:

Moved, and renamed x,y to more closely match the documentation: https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_scalar

In my mind, each scalar is an (x, y) coordinate in the Tensorboard graph, but hopefully the new naming is clearer.
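A minimal sketch of what the renamed helper might look like (the parameter names follow the add_scalar documentation linked above; the exact body in the PR may differ):

```python
def add_scalar_to_summary(summary_writer, metric_name, scalar_value, global_step):
    """Add one (global_step, scalar_value) point to the TensorBoard curve
    for `metric_name`.

    Does nothing when no writer was created (i.e. --logdir was unset),
    so call sites stay a single line with no `if writer:` checks.
    """
    if summary_writer is not None:
        summary_writer.add_scalar(metric_name, scalar_value, global_step)
```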

@dlibenzi (Collaborator) commented Sep 4, 2019

We may want to change MNIST and CIFAR10 as well...

@zcain117 (Collaborator, Author) commented Sep 4, 2019

Verified that my CL on the google3 side has propagated, and that the pytorch-nightly conda env now has tensorboard installed.

@dlibenzi dlibenzi merged commit 1640784 into master Sep 4, 2019
@zcain117 (Collaborator, Author) commented Sep 4, 2019

This change works on pytorch-nightly, but on pytorch-0.1 I needed to run conda install future. I can either add future to our conda setup file or just wait to see if it's needed once we cut a new stable build (which I think will be soon). @dlibenzi, what do you think?

@dlibenzi (Collaborator) commented Sep 4, 2019

Hmm, pytorch-0.1 should not be running newer code.
We are not forward compatible.

@zcain117 (Collaborator, Author) commented Sep 4, 2019

Hmm, pytorch-0.1 should not be running newer code.
We are not forward compatible.

OK, sounds good. I will leave it alone for now, and then we can add conda install future if necessary once we cut the new build.
