test_rpc_spawn fails sporadically #41474
Comments
@Flamefire thanks for reporting! Does this error only relate to TensorPipe, or do you see similar failures with other RPC backends as well?
Those two are the only failing tests for me that have "Rpc" in the name.
Any chance you're able to run this with the child processes attached to a debugger and give us the stack traces of when the segfault occurs? That would greatly help, as otherwise we're looking for a needle in a haystack... Thanks!
Sure. Could you tell me how to do that? Those are started from the Python test, aren't they?
Yes, attaching a debugger to the child process will be a bit tricky. Here is one way to do it. You first need to modify the test code by adding two lines in pytorch/torch/testing/_internal/distributed/rpc/rpc_test.py (lines 3348 to 3351 at c86699d),
just at the beginning of the function, before the init_rpc call. These lines should be:
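Roughly like this (a sketch only; the exact attribute name `self.rank` is an assumption, and `os`/`time` come from the imports mentioned below):

```python
print(f"rank {self.rank} has pid {os.getpid()}", flush=True)  # identify this child process
time.sleep(5)  # leaves a window to attach a debugger before init_rpc runs
```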
(You may also have to add the corresponding imports at the top of the file.) If you now run the tests (you can pass command-line flags to the executable to filter only the test you want) they will print a few lines with their rank and PID and then pause for 5 seconds. In that pause you should run, from another terminal, `gdb -p <PID>` for each child process you want to attach to.
Ok, I finally managed to get a SIGABRT in gdb. Backtrace:
Rebuilding glog with debug info reveals this:
The failing code is glog's mutex locking, so it seems the mutex is already destroyed when it is locked. As I also sometimes see output like:
I assume the following happens:
1. At process exit, glog's global static mutexes are destroyed.
2. The TensorPipe agent, which is also a global static and still set, receives an EOF on one of its connections and tries to log it.
3. The log call locks one of the already-destroyed mutexes, pthread_mutex_lock returns EINVAL, and glog aborts with SIGABRT.
And finally I verified this by adding some debug logs to glog:
So PID 10857 destroys all of its mutexes (there are 4 for each of the other PIDs too), then tries to log the EOF event (I moved the logging before the MutexLock) and gets EINVAL from pthread_mutex_lock, i.e. an invalid mutex was passed. The agent even checks for EOF before logging:
However, judging from the output "EOF: end of file", the type of the error is not EOFError but UVError, so that check does not match. Proposed solution: translate the uv EOF error into an EOFError so that the agent treats it as an expected shutdown condition.
Wow, thank you so much for this thorough investigation, it really helps a lot! I looked at the stack trace and I think your diagnosis is right. I believe the problem is that the TensorPipe agent instance is a global static variable and thus, when the process exits, its destruction races with the destruction of the glog mutexes, so sometimes one of TensorPipe's event loops tries to log a message after the mutex was destroyed.

The reason this problem happens only in those two tests is probably simple: it seems we forgot to add an `rpc.shutdown()` call at the end of them. The shutdown call fixes the problem because it clears the global static RPC agent earlier, before the process exits. As for whether it is a requirement of the API that the user always calls `rpc.shutdown()`, that's a separate question.

As for the EOFError vs UVError approach, it could work in this specific instance but I'm afraid it could still be brittle. That is because there is no guarantee that we'll always get an EOF error when agents shut down; sometimes we could get an ECONNRESET, or possibly others. Figuring out which ones to map to EOFError and which ones to keep as UVErrors will not really scale. Also, in this specific case, the warning that gets logged is something we want to get rid of because it's somewhat expected and thus shouldn't pollute the logs, but we haven't found a clear way yet to detect it. See also #40094.
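For illustration, a minimal sketch of the shape of the fix (the names and fixture attributes such as `self.rank` and `self.world_size` are illustrative, not the actual test code):

```python
import torch.distributed.rpc as rpc

def test_init_rpc_then_pg(self):
    rpc.init_rpc(
        name=f"worker{self.rank}",
        rank=self.rank,
        world_size=self.world_size,
    )
    # ... body of the test ...
    # Clearing the global RPC agent before the process exits removes the race
    # between its destruction and the destruction of glog's global mutexes.
    rpc.shutdown()
```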
One way to fix this is to log a message to a dummy sink, which ignores the message, in the constructor of your global static. This forces initialization of the glog global statics prior to yours and hence schedules their destruction after yours. Obviously avoiding global static objects would be best, but I'm aware that this is not always possible.
So yes, there are two issues: one is the use of glog during termination, and the other is that the intended fix from #40094 misses UVError::EOF. As there is a dedicated subclass for EOF, introducing a function that creates an error out of a uv error code and translates the uv EOF into EOFError instead seems like the right approach for the second issue. I see no point in having two different EOF errors.
Sorry, I forgot that. I tried it now (changed the two tests to call `rpc.shutdown()`) and I can no longer reproduce the failure.
Awesome! I'll send out a PR to fix this shortly.
I think that's normal/expected, and probably happens for all tests (right?). That's because nowhere in PyTorch do we call InitGoogleLogging(). It's up to the end user to call it, so that it can be passed the command-line flags etc.
That seems to be something that you know better than me (I've never been able to find good docs on glog, if you know any please send them my way!). If there's a way to formalize the dependency of the global RPC agent on the glog mutex so that we reliably get the right destruction order, that would be great!

I see perhaps a couple of issues though. First, the global RPC agent is just a shared_ptr, which starts unset, so initializing glog in the constructor of the agent wouldn't necessarily create a dependency between glog and that shared_ptr. Second, PyTorch doesn't always use glog: it has its own fallback in case glog isn't available, and I believe this will limit the number of "advanced features" we can use.

In fact, I think PyTorch isn't built with glog by default, which means the problem you hit only occurs for people who build their own PyTorch. This is probably why we didn't catch it earlier...
They do provide different information. In the case of that specific backend, an EOF error is raised when a read returns 0, whereas a UVError is returned whenever there is an async error on a socket, so keeping them separate makes it easier to chase down problems. Also, different backends could choose to handle errors differently, and may not even have a "native" EOF error (think of backends doing RDMA). I'm not convinced this is the right approach. My view is that the RPC agent shouldn't inspect the error type at all, except for the PipeClosedError, which is very special. The fact that, at the moment, we also check for the EOFError is just because it was the most common error type being logged, and silencing it allowed us to avoid a lot of spurious log lines. However, once we have a proper solution for avoiding these lines, we won't have to special-case EOFError anymore.
Addresses this bug report: #41474. The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down, some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed, this causes a SIGABRT. Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)
Pull Request resolved: #41558. Fixes #41474. Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)
Actually not. For the tests discussed here I see it for 0-3 of the subprocesses. glog is initialized from InitCaffeLogging, which is called from GlobalInit, and I see that called from many different files. So I'd say that if it isn't called for this test, that is an oversight rather than intentional, as it is done in other test files.
That is just leveraging the C++ guarantee that destruction order is the inverse of the order in which construction finished. But if that static global is a shared_ptr then not much can be done here, except e.g. initializing the shared_ptr (to empty or whatever it currently is) from a function that forces the logging mutex to be constructed first. That's maybe something to ask the Google folks.
OK, then a question to evaluate: what information does a UVError with value "EOF" provide? I'd expect it to be pretty much the same, but I'll leave that to you as you seem to be more firm on that. It was just a suggestion that looked right to me.
Then either UVError::EOF could be added to that list with the same reasoning, or it could be converted to EOFError and handled the same way (see above).
A similar approach to the one suggested above for glog could be used here: have an instance that is destroyed before whatever causes the EOF/PipeClosed happens. The order can be ensured via construction order. Its destructor would then gracefully shut down the threads "somehow". Again, these are just ideas; I'm not familiar with the code base, so I can only outline what I think is feasible.
@lw Sorry for commenting on a closed issue, but I just wanted to verify something you said, namely that PyTorch isn't built with glog by default.
Could you verify this? I found the following lines, which seem to suggest that using glog, gflags and protobuf is required: Lines 756 to 761 in b85568a
I'm asking because I'm seeing an issue on POWER systems that produces unexpected NaN results or floating point exceptions (FPE) when using glog; those disappear when glog is not used.
I am by no means an expert. I said that because at one point I was curious about this, so I downloaded vanilla PyTorch from PyPI and built a C++ extension on top of it, which allowed me to invoke the snippet referenced here: Lines 79 to 85 in caea1ad
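One way to check which build options an installed wheel was compiled with is to print its build configuration from Python (assuming `USE_GLOG` appears among the reported settings):

```python
import torch

# Prints compiler, library versions and build settings of the installed wheel;
# look for USE_GLOG in the "Build settings" line of the output.
print(torch.__config__.show())
```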
🐛 Bug
Running the tests in distributed/rpc/tensorpipe/test_rpc_spawn.py fails randomly on one of our systems. Specifically I see it for TensorPipeAgentRpcTestWithSpawn.test_init_rpc_then_pg and TensorPipeAgentRpcTestWithSpawn.test_init_pg_then_rpc.
As the names are similar, I expect the issue to be the same.
To Reproduce
Steps to reproduce the behavior:
python distributed/rpc/tensorpipe/test_rpc_spawn.py TensorPipeAgentRpcTestWithSpawn.test_init_rpc_then_pg TensorPipeAgentRpcTestWithSpawn.test_init_pg_then_rpc
Sample output below; the process number in the "Expect process ..." message varies between 1 and 2.
Environment
PyTorch version: 1.6.0-rc3
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Red Hat Enterprise Linux Server release 7.8 (Maipo)
GCC version: (GCC) 8.3.0
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80
Nvidia driver version: 450.36.06
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.17.3
[pip3] torch==1.6.0
cc @ezyang @gchanan @zou3519 @osalpekar @jiayisuse @lw @beauby @pritamdamania87 @mrshenli @jjlilley @gqchen @rohan-varma