
Race condition in EventFileWriter #6974

Closed
gibiansky opened this issue Jan 20, 2017 · 9 comments
@gibiansky
Contributor

There is a race condition in EventFileWriter, here:

    if not gfile.IsDirectory(self._logdir):
      gfile.MakeDirs(self._logdir)

It is possible for multiple concurrent threads and/or processes to reach this code simultaneously: both check IsDirectory and get False, then both try to create the directory, and one succeeds while the other errors.

This can be fixed by catching the AlreadyExistsError from MakeDirs and handling it gracefully.

(This is a real problem we have encountered in daily usage, not just a hypothetical.)
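
Roughly the fix I have in mind (a sketch only, with an illustrative helper name; I'm assuming tf.errors.AlreadyExistsError is the exception that MakeDirs surfaces here):

    import tensorflow as tf
    from tensorflow.python.platform import gfile

    def _ensure_logdir(logdir):
      # Create unconditionally and treat "already exists" as success,
      # instead of the racy IsDirectory-then-MakeDirs check.
      try:
        gfile.MakeDirs(logdir)
      except tf.errors.AlreadyExistsError:
        # Another thread/process created it between our calls; that's fine.
        pass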

@asimshankar
Contributor

I don't believe it is recommended to have two distinct EventFileWriters writing to the same log directory, as that could lead to both of them writing to the same file simultaneously, which will certainly be incorrect.

Could you elaborate on how you run into this situation and why you need to have concurrent EventFileWriters in the same log directory? Can you have your concurrent writers writing to different directories?

asimshankar added the stat:awaiting response (Status - Awaiting response from author) label Jan 20, 2017
@gibiansky
Contributor Author

gibiansky commented Jan 20, 2017

I agree that this can be handled more effectively by fixing user code, but it is still an issue that should be fixed and should cause at most a warning (not a crash).

The reason this is happening is that we are training in parallel using MPI, so we have many identical MPI processes running the same model (but communicating with each other to reduce gradients). Since the processes are identical, they each write a log of what they are doing to the same log directory. This could be handled either with a per-rank log directory or by having only a single rank write logs, and that is how we will handle it for the time being.
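
For concreteness, the workaround looks roughly like this (a sketch; mpi4py and the paths are just illustrative, not our exact setup):

    from mpi4py import MPI  # illustrative; any way of getting the rank works
    import tensorflow as tf

    rank = MPI.COMM_WORLD.Get_rank()

    # Either give every rank its own log directory...
    writer = tf.summary.FileWriter('/tmp/logdir/rank-%d' % rank)

    # ...or have only rank 0 write summaries at all:
    # writer = tf.summary.FileWriter('/tmp/logdir') if rank == 0 else None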

More importantly, when this does happen it is incredibly confusing and non-deterministic, and in general this situation is more appropriately dealt with by catching the error than by checking up front whether the directory exists; there are a variety of other situations in which the directory does not initially exist but then gets created, and probably not all of them are user errors.

@yaroslavvb
Contributor

yaroslavvb commented Jan 20, 2017

We are also experimenting with MPI for training; catching/ignoring AlreadyExistsError seems like a useful fix.

@asimshankar
Contributor

It won't be as simple as that, since we'd want to ensure that the underlying C++ EventsWriter doesn't end up with the same filename and that we are not writing concurrently to the same underlying file.

But that said, it shouldn't be too hard to fix the whole chain. Contributions welcome!

asimshankar added the stat:contribution welcome (Status - Contributions welcome) label and removed the stat:awaiting response (Status - Awaiting response from author) label Jan 20, 2017
@gibiansky
Contributor Author

These are somewhat orthogonal issues. I can trigger this issue without running multiple training jobs at once – for example, with a loop that runs a small TensorFlow script with logging over and over, writing to a directory log-dir, while another process constantly loops doing rm -rf log-dir. Eventually you'll hit this race condition.
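
A rough repro sketch (the path is hypothetical, and a thread stands in for the separate rm -rf process, but the effect is the same):

    import shutil
    import threading
    import time

    import tensorflow as tf

    LOGDIR = '/tmp/race-logdir'  # hypothetical path

    def deleter_loop():
      # Stands in for the concurrent `rm -rf log-dir` process.
      while True:
        shutil.rmtree(LOGDIR, ignore_errors=True)
        time.sleep(0.001)

    t = threading.Thread(target=deleter_loop)
    t.daemon = True
    t.start()

    while True:
      # FileWriter goes through EventFileWriter, whose constructor does the
      # IsDirectory/MakeDirs check; eventually it races with the delete and dies.
      writer = tf.summary.FileWriter(LOGDIR)
      writer.close()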

Do you think a fix for this issue could be merged without having to drag in coordination between multiple processes doing training? That may be our specific use case, but it has nothing to do with what's causing the bug...

@gibiansky
Contributor Author

In particular, the using-the-same-filename issue is already present: suppose the directory already exists in advance, and you run two jobs training and logging to the same directory. They might already write to the same file as things stand now – changing the semantics here does not affect that bug. Since these are two separate bugs, they can be fixed in two separate issues and PRs.

@asimshankar
Contributor

I was thinking about things working end-to-end, but as you correctly point out, there is no strict dependency.

That said, I'm not sure if the right fix is to catch and ignore the AlreadyExistsError or if there should be no error in the first place. For example, the underlying C++ operation should not be failing; its documented contract says:

    // * OK - successfully created the directory and sub directories, even if

So I'd like to understand why the error is being thrown. Doesn't happen in this simple test:

import tensorflow as tf
tf.gfile.MakeDirs('/tmp/foo')
tf.gfile.MakeDirs('/tmp/foo')

Let me dig around a bit as to why you're getting this error; if you have any pointers, that would help.
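
I suspect the difference is concurrency; a concurrent variant along these lines might reproduce it (untested sketch, and timing-dependent, so it may need many iterations to trip):

    import shutil
    import threading

    from tensorflow.python.platform import gfile

    def worker(path):
      # Mimics EventFileWriter: check for the directory, then create it.
      if not gfile.IsDirectory(path):
        gfile.MakeDirs(path)

    for i in range(1000):
      shutil.rmtree('/tmp/foo-race', ignore_errors=True)
      path = '/tmp/foo-race/%d/nested' % i
      threads = [threading.Thread(target=worker, args=(path,)) for _ in range(8)]
      for t in threads:
        t.start()
      for t in threads:
        t.join()
      # One of the workers would eventually hit AlreadyExistsError from MakeDirs.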

@asimshankar
Contributor

asimshankar commented Jan 20, 2017

It seems there is some inconsistency between implementations of FileSystem::CreateDir. For example, posix_file_system.cc:230 fails when the directory already exists, while the test filesystem does not.

We'll look into making these FileSystem implementations consistent, and having FileSystem::RecursivelyCreateDir (and consequently gfile.MakeDirs) not fail when the directory already exists.
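
In other words, after the change something like this should hold (a sketch of the intended semantics, not of the implementation):

    import tensorflow as tf
    from tensorflow.python.platform import gfile

    gfile.MakeDirs('/tmp/foo/bar')  # recursive create: OK
    gfile.MakeDirs('/tmp/foo/bar')  # already exists: still OK, no error

    try:
      gfile.MkDir('/tmp/foo/bar')   # single-level CreateDir of an existing dir
    except tf.errors.AlreadyExistsError:
      pass  # CreateDir still reports ALREADY_EXISTS, consistently across filesystems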

asimshankar removed the stat:contribution welcome (Status - Contributions welcome) label Jan 20, 2017
@gibiansky
Contributor Author

Thanks!

@gunan gunan closed this as completed in 723c404 Jan 24, 2017
benoitsteiner pushed a commit to benoitsteiner/tensorflow that referenced this issue Jan 27, 2017
- Ensure that CreateDir returns error::ALREADY_EXISTS if the dirname exists.
- Ensure that RecursivelyCreateDirectory ignores error::ALREADY_EXISTS when
  creating directories and subdirectories.

Fixes tensorflow#6974
Change: 145144720
gunan pushed a commit that referenced this issue Feb 7, 2017
- Ensure that CreateDir returns error::ALREADY_EXISTS if the dirname exists.
- Ensure that RecursivelyCreateDirectory ignores error::ALREADY_EXISTS when
  creating directories and subdirectories.

Fixes #6974
Change: 145144720