Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Passing multiple logdirs doesn't work (and raises warning about "Conflict for name") #179

Open
colinmorris opened this issue Jul 4, 2017 · 14 comments

Comments

@colinmorris
Copy link

These work:

tensorboard --logdir logs/foo
tensorboard --logdir logs/bar
tensorboard --logdir a:logs/foo,b:logs/bar

But when I run this:

tensorboard --logdir logs/foo,logs/bar

I get a warning like this repeatedly printed:

WARNING:tensorflow:Conflict for name .: old path /home/colin/src/myproject/logs/foo, new path /home/colin/src/myproject/logs/bar

In tensorboard, only one run is shown.

I installed tensorflow-tensorboard using pip, and the version is 0.1.2

(I tried debugging, but the warning gets raised in a separate daemon thread, and I wasn't able to figure out how to get pdb to attach to it)

@teamdandelion
Copy link
Contributor

This is a design issue and not just a simple technical bug. Here's why:

If we have path/to/logdir/run1, path/to/logdir/run2, and you pass --logdir=path/to/logdir, you want the run names for run1 and run2 to be "run1" and "run2", naturally enough. But suppose there are events in path/to/logdir too. What should we call that run? Borrowing from unix conventions, we called it ".".

However, if you pass two unnamed logdirs, they both get auto-assigned "." and that creates the conflict.

Possibly we should fix this by requiring explicit names via the colon any time multiple logdirs are passed.

Going to mark this "contributiosn welcome". @colinmorris if you come up with a solution you like, feel free to send a pull request.

@wchargin
Copy link
Contributor

wchargin commented Jul 6, 2017

@colinmorris you can attach pdb as follows:

diff --git a/tensorboard/backend/event_processing/event_multiplexer.py b/tensorboard/backend/event_processing/event_multiplexer.py
index 8a71f82..66e576a 100644
--- a/tensorboard/backend/event_processing/event_multiplexer.py
+++ b/tensorboard/backend/event_processing/event_multiplexer.py
@@ -124,6 +124,7 @@ class EventMultiplexer(object):
         if name in self._paths and self._paths[name] != path:
           # TODO(danmane) - Make it impossible to overwrite an old path with
           # a new path (just give the new path a distinct name)
+          import pdb; pdb.set_trace()  # XXX BREAKPOINT
           tf.logging.warning('Conflict for name %s: old path %s, new path %s',
                              name, self._paths[name], path)
         tf.logging.info('Constructing EventAccumulator for %s', path)

(You must invoke with bazel-bin/tensorboard/tensorboard --logdir /tmp/a,/tmp/b rather than bazel run tensorboard -- --logdir /tmp/a,/tmp/b.)

@wchargin
Copy link
Contributor

wchargin commented Jul 6, 2017

For what it's worth, I used to run into this all the time, too: I would bazel run tensorboard -- --logdir /tmp/mnist,/tmp/histograms_demo,/tmp/scalars_demo,/tmp/text,/tmp/audio_demo for testing on comprehensive data sets, but this caused the collision as you describe. I worked around it by instead symlinking all those directories into a single parent directory. When you pass this as the unique logdir, everything works fine, because the root cause that @dandelionmane describes does not apply.

That said, I do recognize that this is just a workaround, and would be happy to see a clean solution to this problem. :-)

@lucasb-eyer
Copy link

I would love this flag to be more unix-y, such that I can run something like tensorboard --logdir experiment{1,2,3} super_new_run{5,8} and have them show up like that. (For all reasonable shells, that command translates to tensorboard --logdir experiment1 experiment2 experiment3 super_new_run5 super_new_run8)

The current syntax gets extremely tiring when I have a folder with many runs and need to start tensorboard with few individual runs to compare (which I need to do because tensorboard loads runs so slowly). The alternative with a folder with symlinks is just as tiring for my use-case.

@SPP3000
Copy link

SPP3000 commented Aug 6, 2019

Is there a new way of get around this issue in tensorflow2.0-beta?

@wchargin
Copy link
Contributor

wchargin commented Aug 6, 2019

I would love this flag to be more unix-y […]

I would love that, too.

Is there a new way of get around this issue in tensorflow2.0-beta?

Nothing new, no, though if you’re willing to add a shell function to
your .bashrc you can always automate the tiring symlink-directory
behavior:

# usage: multitb LOGDIR [LOGDIR...]
# runs TensorBoard with a comma-separated list of all provided logdirs
multitb() (
    set -eu
    if [ $# -eq 0 ]; then
        printf >&2 'fatal: provide at least one logdir\n'
        return 1
    fi
    tmpdir="$(mktemp -d)"
    for arg; do
        case "${arg}" in
            /*) ln -s "${arg}" "${tmpdir}/" ;;
            *) ln -s "${PWD}/${arg}" "${tmpdir}/" ;;
        esac
    done
    exit_code=0
    \command ls -l "${tmpdir}"
    printf 'tensorboard --logdir %s\n' "${tmpdir}"
    tensorboard --logdir "${tmpdir}" || exit_code=$?
    # This really should be 'find -H "${tmpdir}" -type l -delete` to
    # account for logdirs whose names start with `.`, which should
    # behave correctly on any POSIX system, but it is much more clear
    # that the version below cannot under any circumstances delete the
    # underlying data (e.g., with non-POSIX-compliant `find`(1)).
    rm "${tmpdir}"/*
    rmdir "${tmpdir}"
    return "${exit_code}"
)

Then running multitb experiment{1,2,3} super_new_run{5,8} actually
invokes TensorBoard with a fresh logdir that has the desired experiments
linked in.

This should work in any POSIX shell (e.g., bash, zsh).

@guicho271828
Copy link

I would love this flag to be more unix-y, such that I can run something like tensorboard --logdir experiment{1,2,3} super_new_run{5,8}

so far I do $(echo experiment{1,2,3} super_new_run{5,8} | sed s/ /,/g ) ... stupid but works.

If we have path/to/logdir/run1, path/to/logdir/run2, and you pass --logdir=path/to/logdir, you want the run names for run1 and run2 to be "run1" and "run2", naturally enough. But suppose there are events in path/to/logdir too. What should we call that run? Borrowing from unix conventions, we called it ".".

I don't understand ... why they are named based on the dirnames, not filenames, like event... file?

@wchargin
Copy link
Contributor

so far I do $(echo experiment{1,2,3} super_new_run{5,8} | sed s/ /,/g ) ... stupid but works.

Works, as long as your paths don’t have any funny characters in them.
This isn’t safe in general.

why they are named based on the dirnames, not filenames, like
event... file?

Two reasons:

  • Often, one logical run comprises many event files. This may happen
    if your job is interrupted (e.g., preempted and then restarted while
    running on a cluster, or auto-restarting after a crash) or in some
    distributed training topologies. In this case, the data from all
    event files in a directory is, roughly speaking, concatenated as if
    there were only one file.

  • At the API layer, the user always specifies a directory name, not a
    file name. Invoking tf.summary.create_file_writer("logs") creates
    a directory called logs, not a normal file.

@guicho271828
Copy link

Often, one logical run comprises many event files.

I see, that seems legit.

so, for the first example in this thread tensorboard --logdir logs/foo,logs/bar above, the working answer at the moment would be "use tensorboard --logdir logs", or putting the log files into one layer deeper, like logs/bar/log and logs/foo/log and use tensorboard --logdir logs/foo,logs/bar. I was accidentally using the second approach.

@wchargin
Copy link
Contributor

If you have directories logs/foo and logs/bar, then just point
TensorBoard at --logdir logs.

The names logs/foo and logs/bar cannot be the names of event files,
because event file names always contain the string tfevents. But if
you somehow had event files logs/foo.tfevents.123 and
logs/bar.tfevents.123, then yes, you would need to put them into
subdirectories (say, logs/foo and logs/bar) and again point
TensorBoard’s --logdir at a common ancestor directory (logs).

@guicho271828
Copy link

the point is, the original author may have millions of other directories under logs/ (e.g. logs/baz1 ... logs/baz1000), and want to specifically point to logs/foo and logs/bar. Assuming a specific usage and defaults to "include all" is generally not a nice idea for a widely used software as tensorboard.

I am not misunderstanding the event files structure. I am assuming the original author has files like logs/foo/xxx.tfevents....

@shwang
Copy link

shwang commented Jun 6, 2020

Here's a simple way to make this work that avoids the overlapping run names issue.

We could add another flag --log_dir_multi [log_dir1] [log_dir2] ....

It errors if the --log_dir option is also present. To process this flag, Tensorboard first makes a temporary directory $gather_dir. Then Tensorboard makes a symlinks all the log dirs in $gather_dir/[log_dir#]. Finally Tensorboard proceeds as if we ran tensorboard --log_dir $gather_dir.

The run names will look as if you ran tensorboard --log_dir . but only with the selected log dirs.

@shwang
Copy link

shwang commented Jun 6, 2020

I'd be happy to try implementing this if I get a go-ahead for this design from a maintainer.

@Toeplitz
Copy link

Toeplitz commented Nov 25, 2020

The following case should also be considered:

--logdir /local/folder,gs://bucket/path

A combination of local folders and training data in GCP buckets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants