
Conversation

@qlzh727 (Member) commented Mar 15, 2018

Currently it only logs the metric, which we can use in any of the hooks.

@karmel (Contributor) left a comment:

Reading through this, I'm not clear on where this will be called from-- in the graph, or in the outer loop? The decision there will probably have implications as to how different input types and IO processes have to be handled and tested; maybe the best thing to do is to mock an incorporation of this into MNIST, and that will clarify what we need the logger to be able to handle.

import json
import os

from tensorflow.python.platform import gfile
Contributor:

You should be able to use the public API for this, which is preferable-- tf.gfile.GFile, etc. See tensorflow/python/platform/gfile.py for allowed symbols.
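
For reference, a minimal sketch of the public-API form (the file path is just a placeholder):

import tensorflow as tf

# Public alias, preferable to importing the internal module path:
with tf.gfile.GFile("/tmp/example_metric.log", "a") as f:
  f.write("hello\n")

# instead of:
#   from tensorflow.python.platform import gfile
#   with gfile.GFile(...) as f:
#     ...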

Member Author:

Ah, thanks for pointing this out. I will change it to use the public API.

@@ -0,0 +1,64 @@
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
Contributor:

Why have this here instead of in utils/logging? I'm not opposed, but we should have reasons, given the similarity of purpose.

Contributor:

Also, nit: since we have utils, and will probably eventually have models, and will certainly have multiple benchmarks, how do you feel about naming this package benchmarks instead of the singular benchmark?

Member Author:

My original intention is to put all benchmark-related code into official/benchmark, but not the benchmarks themselves. I will add other code, like uploading to bigstore and other libs, in a future change. I think they are more logically grouped together.

I could move this code to official/utils/benchmark if you prefer.

    if not gfile.IsDirectory(self._logging_dir):
      gfile.MakeDirs(self._logging_dir)

  def logMetric(self, name, value, unit=None, global_step=None, extras=None):
Contributor:

Shouldn't this be log_metric?

Member Author:

Oops, yes.


    Args:
      name: string, the name of the metric to log.
      value: number, the value of the metric.
Contributor:

We don't actually do type checking here-- should we? Or maybe relax the stated requirement, and just say this should be json-dumpable (which, IIRC, has all sorts of limitations, but float/str/int should be okay-- though that raises the question of how tensors would get handled here?)

Member Author:

For the moment, I think I will assume the inputs are simple values; even in the case of a tensor value at runtime, this logger should be passed into some run session hook, which can get the loggable value there.

I will put a type check here for now.

Contributor:

On a similar note, we should decide how to handle failures. I'm inclined to not have a logging hook bring down the whole run, but just silently bailing doesn't seem good either. Perhaps salvage what info can be dumped and write that along with an indication that it is not a clean record?

Member Author:

Good point; we could probably log a local warning for this case.

Member Author:

Done. Added a warning instead of throwing a ValueError here.
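
As a rough illustration of that combination of a type check and a warning (the helper name _convert_metric_value here is hypothetical, not the actual code in this PR):

import numbers

import tensorflow as tf


def _convert_metric_value(name, value):
  # Hypothetical helper: accept only plain numbers, and warn rather than raise
  # so that a bad metric value cannot bring down the whole run.
  if not isinstance(value, numbers.Number):
    tf.logging.warning("Metric value for %s is not a number: %s", name, value)
    return None
  return float(value)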

      name: string, the name of the metric to log.
      value: number, the value of the metric.
      unit: string, the unit of the metric, E.g "image per second".
      global_step: int, the global_step when the metric is logged.
Contributor:

If you just run tf.train.get_global_step, you get a Tensor. Not sure what will happen if we try to log that directly? Do we have to convert tensor output to raw? We probably want to have a stance and add tests accordingly. I guess it depends on where we are calling this from-- the main loop, or from within the Estimator and related tf code.

Member Author:

From https://github.com/tensorflow/tensorflow/blob/r1.6/tensorflow/python/training/training_util.py#L45, tf.train.global_step() will return an int. In the case that it is not in a TF runtime, tf.train.global_step() will throw an error.
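
For example, a minimal sketch (assuming a TF 1.x graph and session):

import tensorflow as tf

global_step_tensor = tf.train.get_or_create_global_step()
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  # tf.train.global_step evaluates the tensor and returns a plain Python int,
  # which json.dump can serialize directly.
  step = tf.train.global_step(sess, global_step_tensor)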

"timestamp": datetime.datetime.now().strftime(
_DATE_TIME_FORMAT_PATTERN),
"extras": extras}
json.dump(metric, f)
Contributor:

Does json.dump play nicely with the tf file writer? If this is actually happening deferred in a graph, what happens?

Member Author:

Chatted offline; I think we will stick with simple number values for the moment. When users try to use this in the context of a TF run session, they should be able to get the simple value from the run session.
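
A rough sketch of that pattern, where a hook pulls out concrete values and hands them to the logger (this MetricLoggingHook class is hypothetical and not part of this PR; it assumes the method is named log_metric, per the naming comment above):

import tensorflow as tf


class MetricLoggingHook(tf.train.SessionRunHook):
  """Hypothetical hook that fetches metric tensors and logs their values."""

  def __init__(self, benchmark_logger, tensors):
    self._logger = benchmark_logger
    self._tensors = tensors  # dict of metric name -> Tensor

  def before_run(self, run_context):
    # Ask the session to also evaluate the metric tensors on this step.
    return tf.train.SessionRunArgs(self._tensors)

  def after_run(self, run_context, run_values):
    # run_values.results holds concrete numpy values, not graph tensors,
    # so only simple numbers ever reach the logger and json.dump.
    for name, value in run_values.results.items():
      self._logger.log_metric(name, value)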


from official.benchmark import logger
import tensorflow as tf
from tensorflow.python.platform import gfile
Contributor:

Ditto on the note above about the public API. Also, I know pylint complains, but we have the main TF import first, with the local package imports after. (Note that this is also important for the import-to-google rules to work as they are now.)

Member Author:

Done.

  def tearDown(self):
    gfile.DeleteRecursively(self.get_temp_dir())

  def testCreateLoggingDir(self):
Contributor:

I'm less particular about test names, but I think the rest of our tests are underscored rather than camelCase-- probably best to be consistent.

Member Author:

Done.

  def testLogMetric(self):
    log_dir = tempfile.mkdtemp(dir=self.get_temp_dir())
    log = logger.BenchmarkLogger(log_dir)
    log.logMetric("accuracy", 0.999, global_step=1e4, extras={"name": "value"})
Contributor:

Stated above, but just to record in the proper place: we should take a stance on Tensor input, and make sure we test accordingly.

Member Author:

Addressed in the previous comment.

"""
with gfile.GFile(
os.path.join(self._logging_dir, _METRIC_LOG_FILE_NAME), "a") as f:
metric = {
Contributor:

Can we assign a unique ID to each class instance and include it? That way if a bunch of records get jammed together in the same file (i.e. the default path) it is easy to separate runs later.

Member Author:

I think the logging_dir is the flag that the caller should tweak. Similar to the log_dir flag for the summary writer, the caller should make sure to specify a different log dir for each run.

Member Author:

Chatted offline; we actually have a UUID field in the design for record keeping. Will add the ID field in a future change.
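
For illustration, a tiny sketch of what that future change might add (the run_id field name is an assumption):

import uuid

# One UUID per logger instance, stamped onto every record so that runs
# appended to the same file can be separated later.
run_id = str(uuid.uuid4())
metric = {"name": "accuracy", "value": 0.999, "run_id": run_id}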



@googlebot commented:

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

@karmel (Contributor) commented Mar 19, 2018

Poor Git. Try:

git merge origin master
git diff > ~/benchmark.patch # Open that patch. Should look like your set of changes and only your set.
git push # Not necessary, but just in case
# Grab the commit hash for current master HEAD
git reset --hard <commit hash for master HEAD>
git apply ~/benchmark.patch
git diff # Make sure this looks right
git push

Commits:
1. Update logger to convert the value type to float to please JSON.
2. Update logger test to call parent test tearDown.
@qlzh727 (Member Author) commented Mar 19, 2018

See qlzh727@6ad9e0b as a demo for the metric logger.

@qlzh727 (Member Author) commented Mar 19, 2018

Btw, the metric log from qlzh727@6ad9e0b looks like the following:

{"name": "train_accuracy", "timestamp": "2018-03-19T11:36:57.969879Z", "value": 0.9988235235214233, "extras": null, "unit": null, "global_step": 25604}
{"name": "loss", "timestamp": "2018-03-19T11:37:07.436055Z", "value": 1.953561877598986e-05, "extras": null, "unit": null, "global_step": 25704}
{"name": "train_accuracy", "timestamp": "2018-03-19T11:37:07.436379Z", "value": 0.9988889098167419, "extras": null, "unit": null, "global_step": 25704}
{"name": "loss", "timestamp": "2018-03-19T11:37:16.885552Z", "value": 0.01344460342079401, "extras": null, "unit": null, "global_step": 25804}
{"name": "train_accuracy", "timestamp": "2018-03-19T11:37:16.886348Z", "value": 0.9984210729598999, "extras": null, "unit": null, "global_step": 25804}
{"name": "loss", "timestamp": "2018-03-19T11:37:26.488251Z", "value": 0.00038322151522152126, "extras": null, "unit": null, "global_step": 25904}
{"name": "train_accuracy", "timestamp": "2018-03-19T11:37:26.488567Z", "value": 0.9984999895095825, "extras": null, "unit": null, "global_step": 25904}
{"name": "loss", "timestamp": "2018-03-19T11:37:36.103425Z", "value": 8.149698260240257e-05, "extras": null, "unit": null, "global_step": 26004}
{"name": "train_accuracy", "timestamp": "2018-03-19T11:37:36.103754Z", "value": 0.9985714554786682, "extras": null, "unit": null, "global_step": 26004}
{"name": "loss", "timestamp": "2018-03-19T11:37:45.991382Z", "value": 0.0013781175948679447, "extras": null, "unit": null, "global_step": 26104}
{"name": "train_accuracy", "timestamp": "2018-03-19T11:37:45.991713Z", "value": 0.9986363649368286, "extras": null, "unit": null, "global_step": 26104}
{"name": "loss", "timestamp": "2018-03-19T11:37:55.977899Z", "value": 0.0007561420788988471, "extras": null, "unit": null, "global_step": 26204}
{"name": "train_accuracy", "timestamp": "2018-03-19T11:37:55.978276Z", "value": 0.9986956715583801, "extras": null, "unit": null, "global_step": 26204}
{"name": "loss", "timestamp": "2018-03-19T11:38:05.737698Z", "value": 0.008610671386122704, "extras": null, "unit": null, "global_step": 26304}

@karmel (Contributor) commented Mar 19, 2018

That is great-- thanks. I do feel like the example makes the case stronger that the logger should live in utils, since one might want to use it to write arbitrary JSON logs separate from benchmarking. But very nice regardless.

@qlzh727 (Member Author) commented Mar 19, 2018

Done. Moved to official/utils/logging.

@karmel (Contributor) commented Mar 19, 2018

Thanks- LGTM, though we should still try to clean up the git record so that the CLA is clear.

@qlzh727 qlzh727 merged commit 875fcb3 into tensorflow:master Mar 19, 2018