Race condition in EventFileWriter #6974
Comments
I don't believe it is recommended to have two distinct writers pointing at the same log directory. Could you elaborate on how you run into this situation and why you need to have concurrent writers?
I agree that this can be handled more effectively by fixing user code, but it is still an issue that should be fixed, and it should cause at most a warning, not a crash. The reason this happens is that we are training in parallel using MPI, so we have many copies of identical MPI processes that run the same model (but communicate with each other to reduce gradients). Because the processes are identical, they each write a log of what they are doing to the same directory. This could be handled either by using a per-rank log directory or by having only a single rank write logs, and that is how we will handle it for the time being.

More importantly, when this happens in a context like this it is incredibly confusing and non-deterministic, and in general the situation is more appropriately dealt with by catching the error rather than by checking for the directory up front; there are a variety of other situations in which the directory does not initially exist but then gets created, and probably not all of them are user errors.
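As a rough illustration of the two workarounds just mentioned, here is a minimal sketch assuming mpi4py and the TF 1.x tf.summary.FileWriter API; the paths and layout are illustrative, not taken from the thread:

import os
import tensorflow as tf
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
base_logdir = "/tmp/train_logs"  # illustrative path
per_rank_logdir = os.path.join(base_logdir, "rank_%d" % rank)

# Workaround 1: only rank 0 writes event files, so a single process owns
# the log directory and there is nothing to race against.
writer = tf.summary.FileWriter(base_logdir) if rank == 0 else None

# Workaround 2: every rank writes to its own subdirectory instead, so no
# two processes ever try to create the same directory:
# writer = tf.summary.FileWriter(per_rank_logdir)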
We are also experimenting with MPI for training; it seems like catching/ignoring the "already exists" error would be good enough for our use case.
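For concreteness, a minimal sketch of that catch-and-ignore approach on the user side, assuming the TF 1.x tf.gfile API and an illustrative log path (the fix that eventually landed, discussed below, lives in the filesystem layer instead):

import tensorflow as tf

logdir = "/tmp/train_logs"  # illustrative path

if not tf.gfile.IsDirectory(logdir):
    try:
        tf.gfile.MakeDirs(logdir)
    except tf.errors.AlreadyExistsError:
        # Another process or thread won the race and created the
        # directory first; for event logging that is fine.
        pass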
It won't be as simple as that, since we'd want to ensure that the underlying C++ filesystem layer behaves consistently as well. But that said, it shouldn't be too hard to fix the whole chain. Contributions welcome!
These are somewhat orthogonal issues. I can trigger this issue without running multiple training jobs at once – for example, I can have a loop that runs a small TensorFlow script with logging over and over again, writing to the same directory each time. Do you think a fix for this issue could be merged without having to drag in coordination between multiple processes doing training? That may be our specific use case, but it has nothing to do with what's causing the bug...
In particular, the using-the-same-filename issue is already present: suppose the directory already exists in advance, and you run two jobs training and logging to the same directory. They might already write to the same file as things stand; changing the semantics here does not affect that bug. Since these are two separate bugs, they can be fixed in two separate issues and PRs.
I was thinking about things working end-to-end, but as you correctly point out, there is no strict dependency. That said, I'm not sure if the right fix is to catch and ignore the error.
So I'd like to understand why the error is being thrown. It doesn't happen in this simple test:

import tensorflow as tf
tf.gfile.MakeDirs('/tmp/foo')
tf.gfile.MakeDirs('/tmp/foo')

Let me dig around a bit as to why you're getting this error, or if you have any pointers, that would be helpful.
It seems there is some inconsistency between the implementations of recursive directory creation across the different filesystems. We'll look into making these behave consistently.
Thanks!
- Ensure that CreateDir returns error::ALREADY_EXISTS if the dirname exists.
- Ensure that RecursivelyCreateDirectory ignores error::ALREADY_EXISTS when creating directories and subdirectories.

Fixes tensorflow#6974
Change: 145144720
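The commit above changes the C++ filesystem layer; as a rough Python-level sketch of the semantics it describes (not the actual TensorFlow implementation), single-directory creation reports an existing directory as an error, while recursive creation treats that error as success for every path component:

import os

def create_dir(path):
    # Create a single directory; os.mkdir raises FileExistsError when the
    # directory is already there, roughly analogous to ALREADY_EXISTS.
    os.mkdir(path)

def recursively_create_dir(path):
    # Create the directory and any missing parents, ignoring "already
    # exists" so that concurrent callers cannot fail this way.
    current = os.sep
    for part in os.path.abspath(path).split(os.sep):
        if not part:
            continue
        current = os.path.join(current, part)
        try:
            create_dir(current)
        except FileExistsError:
            pass  # created by someone else, or already present: fine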
There is a race condition in EventFileWriter: it is possible for multiple concurrent threads and/or processes to reach this code simultaneously, both check IsDirectory, both get False, and then both try creating the directory, with one succeeding and one raising an error. This can be fixed by catching the AlreadyExistsError from MakeDirs and handling it gracefully.

(This is a real issue that we have encountered in daily usage and that causes us problems, not just a hypothetical.)
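To make the failure mode concrete, here is a hedged reproduction sketch, assuming the TF 1.x tf.summary.FileWriter API and a throwaway log path: several processes all point a writer at the same not-yet-existing directory, so with enough iterations one of them loses the check-then-create race:

import multiprocessing
import shutil

LOGDIR = "/tmp/event_writer_race"  # throwaway path

def make_writer(_):
    import tensorflow as tf  # import inside the worker process
    # EventFileWriter checks whether LOGDIR exists and, if not, calls
    # MakeDirs; two workers can both see "missing" and both try to create it.
    tf.summary.FileWriter(LOGDIR).close()

if __name__ == "__main__":
    for _ in range(20):
        shutil.rmtree(LOGDIR, ignore_errors=True)  # start from a missing dir
        with multiprocessing.Pool(processes=4) as pool:
            pool.map(make_writer, range(4))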