Race condition in EventFileWriter #6974
Comments
I don't believe it is recommended to have two distinct writers pointing at the same log directory. Could you elaborate on how you run into this situation and why you need to have concurrent writers?
I agree that this can be handled more effectively by fixing user code, but it is still an issue that should be fixed, and it should cause at most a warning, not a crash. The reason this happens is that we are training in parallel using MPI, so we have many copies of identical MPI processes that run the same model (but communicate with each other to reduce gradients). Because the processes are identical, they each write a log of what they are doing to the same directory. This could be handled either by using a per-rank log directory or by having only a single rank write logs, and that is how we will handle it for the time being.

More importantly, when this happens in a context like this it is incredibly confusing and non-deterministic, and in general the situation is more appropriately dealt with by catching the error rather than by checking for the directory up front; there are a variety of other situations in which the directory does not initially exist but then gets created, and probably not all of them are user errors.
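As a rough illustration of the two workarounds just mentioned, here is a minimal sketch assuming mpi4py and the TF 1.x tf.summary.FileWriter API; the paths and layout are illustrative, not taken from the thread:

import os
import tensorflow as tf
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
base_logdir = "/tmp/train_logs"  # illustrative path
per_rank_logdir = os.path.join(base_logdir, "rank_%d" % rank)

# Workaround 1: only rank 0 writes event files, so a single process owns
# the log directory and there is nothing to race against.
writer = tf.summary.FileWriter(base_logdir) if rank == 0 else None

# Workaround 2: every rank writes to its own subdirectory instead, so no
# two processes ever try to create the same directory:
# writer = tf.summary.FileWriter(per_rank_logdir)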
We are also experimenting with MPI for training; it seems like catching/ignoring the "already exists" error would be good enough for our use case.
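For concreteness, a minimal sketch of that catch-and-ignore approach on the user side, assuming the TF 1.x tf.gfile API and an illustrative log path (the fix that eventually landed, discussed below, lives in the filesystem layer instead):

import tensorflow as tf

logdir = "/tmp/train_logs"  # illustrative path

if not tf.gfile.IsDirectory(logdir):
    try:
        tf.gfile.MakeDirs(logdir)
    except tf.errors.AlreadyExistsError:
        # Another process or thread won the race and created the
        # directory first; for event logging that is fine.
        pass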
It won't be as simple as that, since we'd want to ensure that the underlying C++ filesystem layer behaves consistently as well. But that said, it shouldn't be too hard to fix the whole chain. Contributions welcome!
These are somewhat orthogonal issues. I can trigger this issue without running multiple training jobs at once – for example, I can have a loop that runs a small TensorFlow script with logging over and over again, writing to the same directory each time. Do you think a fix for this issue could be merged without having to drag in coordination between multiple processes doing training? That may be our specific use case, but it has nothing to do with what's causing the bug...
In particular, the using-the-same-filename issue is already present: suppose the directory already exists in advance, and you run two jobs training and logging to the same directory. They might already write to the same file as things stand; changing the semantics here does not affect that bug. Since these are two separate bugs, they can be fixed in two separate issues and PRs.
I was thinking about things working end-to-end, but as you correctly point out, there is no strict dependency. That said, I'm not sure if the right fix is to catch and ignore the error.
So I'd like to understand why the error is being thrown. It doesn't happen in this simple test:

import tensorflow as tf
tf.gfile.MakeDirs('/tmp/foo')
tf.gfile.MakeDirs('/tmp/foo')

Let me dig around a bit as to why you're getting this error, or if you have any pointers, that would be helpful.
It seems there is some inconsistency between the implementations of recursive directory creation across the different filesystems. We'll look into making these behave consistently.
Thanks!
- Ensure that CreateDir returns error::ALREADY_EXISTS if the dirname exists.
- Ensure that RecursivelyCreateDirectory ignores error::ALREADY_EXISTS when creating directories and subdirectories.

Fixes tensorflow#6974
Change: 145144720
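The commit above changes the C++ filesystem layer; as a rough Python-level sketch of the semantics it describes (not the actual TensorFlow implementation), single-directory creation reports an existing directory as an error, while recursive creation treats that error as success for every path component:

import os

def create_dir(path):
    # Create a single directory; os.mkdir raises FileExistsError when the
    # directory is already there, roughly analogous to ALREADY_EXISTS.
    os.mkdir(path)

def recursively_create_dir(path):
    # Create the directory and any missing parents, ignoring "already
    # exists" so that concurrent callers cannot fail this way.
    current = os.sep
    for part in os.path.abspath(path).split(os.sep):
        if not part:
            continue
        current = os.path.join(current, part)
        try:
            create_dir(current)
        except FileExistsError:
            pass  # created by someone else, or already present: fine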
There is a race condition in EventFileWriter: it is possible for multiple concurrent threads and/or processes to reach this code simultaneously, both check IsDirectory, both get False, and then both try creating the directory, with one succeeding and one raising an error. This can be fixed by catching the AlreadyExistsError from MakeDirs and handling it gracefully.

(This is a real issue that we have encountered in daily usage and that causes us problems, not just a hypothetical.)
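To make the failure mode concrete, here is a hedged reproduction sketch, assuming the TF 1.x tf.summary.FileWriter API and a throwaway log path: several processes all point a writer at the same not-yet-existing directory, so with enough iterations one of them loses the check-then-create race:

import multiprocessing
import shutil

LOGDIR = "/tmp/event_writer_race"  # throwaway path

def make_writer(_):
    import tensorflow as tf  # import inside the worker process
    # EventFileWriter checks whether LOGDIR exists and, if not, calls
    # MakeDirs; two workers can both see "missing" and both try to create it.
    tf.summary.FileWriter(LOGDIR).close()

if __name__ == "__main__":
    for _ in range(20):
        shutil.rmtree(LOGDIR, ignore_errors=True)  # start from a missing dir
        with multiprocessing.Pool(processes=4) as pool:
            pool.map(make_writer, range(4))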