test_rpc_spawn fails sporadically #41474
Comments
@Flamefire thanks for reporting! Does this error only relate to TensorPipe, or do you see similar failures with other RPC backends as well?
Those two are the only failing tests for me that have "Rpc" in the name.
Any chance you're able to run this with the child processes attached to a debugger and give us the stack traces of when the segfault occurs? That would greatly help, as otherwise we're looking for a needle in a haystack... Thanks!
Sure. Could you tell me how to do that? Those are started from the Python test, aren't they?
Yes, attaching a debugger to the child process will be a bit tricky. Here is one way to do it. You first need to modify the test code by adding two lines in pytorch/torch/testing/_internal/distributed/rpc/rpc_test.py (lines 3348 to 3351 at c86699d),
just at the beginning of the function, before the init_rpc call. These lines should be:
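Roughly like this (a sketch only; the exact attribute name `self.rank` is an assumption, and `os`/`time` come from the imports mentioned below):

```python
print(f"rank {self.rank} has pid {os.getpid()}", flush=True)  # identify this child process
time.sleep(5)  # leaves a window to attach a debugger before init_rpc runs
```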
(You may also have to add the corresponding imports at the top of the file.) If you now run the tests (you can pass command-line flags to the executable to filter only the test you want) they will print a few lines with their rank and PID and then pause for 5 seconds. In that pause you should run, from another terminal, `gdb -p <PID>` for each child process you want to attach to.
Ok, I finally managed to get a SIGABRT in gdb. Backtrace:
Rebuilding glog with debug info reveals this:
The failing code is glog's mutex locking, so it seems the mutex is already destroyed when it is locked. As I also sometimes see output like:
I assume the following happens:
1. At process exit, glog's global static mutexes are destroyed.
2. The TensorPipe agent, which is also a global static and still set, receives an EOF on one of its connections and tries to log it.
3. The log call locks one of the already-destroyed mutexes, pthread_mutex_lock returns EINVAL, and glog aborts with SIGABRT.
And finally I verified this by adding some debug logs to glog:
So PID 10857 destroys all of its mutexes (there are 4 for each of the other PIDs too), then tries to log the EOF event (I moved the logging before the MutexLock) and gets EINVAL from pthread_mutex_lock, i.e. an invalid mutex was passed. The agent even checks for EOF before logging:
However, judging from the output "EOF: end of file", the type of the error is not EOFError but UVError, so that check does not match. Proposed solution: translate the uv EOF error into an EOFError so that the agent treats it as an expected shutdown condition.
Wow, thank you so much for this thorough investigation, it really helps a lot! I looked at the stack trace and I think your diagnosis is right. I believe the problem is that the TensorPipe agent instance is a global static variable and thus, when the process exits, its destruction races with the destruction of the glog mutexes, so sometimes one of TensorPipe's event loops tries to log a message after the mutex was destroyed.

The reason this problem happens only in those two tests is probably simple: it seems we forgot to add an `rpc.shutdown()` call at the end of them. The shutdown call fixes the problem because it clears the global static RPC agent earlier, before the process exits. As for whether it is a requirement of the API that the user always calls `rpc.shutdown()`, that's a separate question.

As for the EOFError vs UVError approach, it could work in this specific instance but I'm afraid it could still be brittle. That is because there is no guarantee that we'll always get an EOF error when agents shut down; sometimes we could get an ECONNRESET, or possibly others. Figuring out which ones to map to EOFError and which ones to keep as UVErrors will not really scale. Also, in this specific case, the warning that gets logged is something we want to get rid of because it's somewhat expected and thus shouldn't pollute the logs, but we haven't found a clear way yet to detect it. See also #40094.
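For illustration, a minimal sketch of the shape of the fix (the names and fixture attributes such as `self.rank` and `self.world_size` are illustrative, not the actual test code):

```python
import torch.distributed.rpc as rpc

def test_init_rpc_then_pg(self):
    rpc.init_rpc(
        name=f"worker{self.rank}",
        rank=self.rank,
        world_size=self.world_size,
    )
    # ... body of the test ...
    # Clearing the global RPC agent before the process exits removes the race
    # between its destruction and the destruction of glog's global mutexes.
    rpc.shutdown()
```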
One way to fix this is to log a message to a dummy sink, which ignores the message, in the constructor of your global static. This forces initialization of the glog global statics prior to yours and hence schedules their destruction after yours. Obviously avoiding global static objects would be best, but I'm aware that this is not always possible.
So yes, there are two issues: one is the use of glog during termination, and the other is that the intended fix from #40094 misses UVError::EOF. As there is a dedicated subclass for EOF, introducing a function that creates an error out of a uv error code and translates the uv EOF into EOFError instead seems like the right approach for the second issue. I see no point in having two different EOF errors.
Sorry, I forgot that. I tried it now (changed the two tests to call `rpc.shutdown()`) and I can no longer reproduce the failure.
Awesome! I'll send out a PR to fix this shortly.
I think that's normal/expected, and probably happens for all tests (right?). That's because nowhere in PyTorch do we call InitGoogleLogging(). It's up to the end user to call it, so that it can be passed the command-line flags etc.
That seems to be something that you know better than me (I've never been able to find good docs on glog, if you know any please send them my way!). If there's a way to formalize the dependency of the global RPC agent on the glog mutex so that we reliably get the right destruction order, that would be great!

I see perhaps a couple of issues though. First, the global RPC agent is just a shared_ptr, which starts unset, so initializing glog in the constructor of the agent wouldn't necessarily create a dependency between glog and that shared_ptr. Second, PyTorch doesn't always use glog: it has its own fallback in case glog isn't available, and I believe this will limit the number of "advanced features" we can use.

In fact, I think PyTorch isn't built with glog by default, which means the problem you hit only occurs for people who build their own PyTorch. This is probably why we didn't catch it earlier...
They do provide different information. In the case of that specific backend, an EOF error is raised when a read returns 0, whereas a UVError is returned whenever there is an async error on a socket, so keeping them separate makes it easier to chase down problems. Also, different backends could choose to handle errors differently, and may not even have a "native" EOF error (think of backends doing RDMA). I'm not convinced this is the right approach. My view is that the RPC agent shouldn't inspect the error type at all, except for the PipeClosedError, which is very special. The fact that, at the moment, we also check for the EOFError is just because it was the most common error type being logged, and silencing it allowed us to avoid a lot of spurious log lines. However, once we have a proper solution for avoiding these lines, we won't have to special-case EOFError anymore.
Addresses this bug report: #41474. The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down, some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed, this causes a SIGABRT. Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)
Pull Request resolved: #41558. Fixes #41474. Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)
Actually not. For the tests discussed here I see it for 0-3 of the subprocesses. glog is initialized from InitCaffeLogging, which is called from GlobalInit, and I see that called from many different files. So I'd say that if it isn't called for this test, that is an oversight rather than intentional, as it is done in other test files.
That is just leveraging the C++ guarantee that destruction order is the inverse of the order in which construction finished. But if that static global is a shared_ptr then not much can be done here, except e.g. initializing the shared_ptr (to empty or whatever it currently is) from a function that forces the logging mutex to be constructed first. That's maybe something to ask the Google folks.
OK, then a question to evaluate: what information does a UVError with value "EOF" provide? I'd expect it to be pretty much the same, but I'll leave that to you as you seem to be more firm on that. It was just a suggestion that looked right to me.
Then either UVError::EOF could be added to that list with the same reasoning, or it could be converted to EOFError and handled the same way (see above).
A similar approach to the one suggested above for glog could be used here: have an instance that is destroyed before whatever causes the EOF/PipeClosed happens. The order can be ensured via construction order. Its destructor would then gracefully shut down the threads "somehow". Again, these are just ideas; I'm not familiar with the code base, so I can only outline what I think is feasible.
@lw Sorry for commenting on a closed issue, but I just wanted to verify something you said, namely that PyTorch isn't built with glog by default.
Could you verify this? I found the following lines, which seem to suggest that using glog, gflags and protobuf is required: Lines 756 to 761 in b85568a
I'm asking because I'm seeing an issue on POWER systems that produces unexpected NaN results or floating point exceptions (FPE) when using glog; those disappear when glog is not used.
I am by no means an expert. I said that because at one point I was curious about this, so I downloaded vanilla PyTorch from PyPI and built a C++ extension on top of it, which allowed me to invoke the snippet referenced here: Lines 79 to 85 in caea1ad
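One way to check which build options an installed wheel was compiled with is to print its build configuration from Python (assuming `USE_GLOG` appears among the reported settings):

```python
import torch

# Prints compiler, library versions and build settings of the installed wheel;
# look for USE_GLOG in the "Build settings" line of the output.
print(torch.__config__.show())
```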
🐛 Bug
Running the tests in distributed/rpc/tensorpipe/test_rpc_spawn.py fails randomly on one of our systems. Specifically I see it for TensorPipeAgentRpcTestWithSpawn.test_init_rpc_then_pg and TensorPipeAgentRpcTestWithSpawn.test_init_pg_then_rpc.
As the names are similar, I expect the issue to be the same.
To Reproduce
Steps to reproduce the behavior:
python distributed/rpc/tensorpipe/test_rpc_spawn.py TensorPipeAgentRpcTestWithSpawn.test_init_rpc_then_pg TensorPipeAgentRpcTestWithSpawn.test_init_pg_then_rpc
Sample output below; the process number in the "Expect process ..." message varies between 1 and 2.
Environment
PyTorch version: 1.6.0-rc3
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Red Hat Enterprise Linux Server release 7.8 (Maipo)
GCC version: (GCC) 8.3.0
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80
Nvidia driver version: 450.36.06
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.17.3
[pip3] torch==1.6.0
cc @ezyang @gchanan @zou3519 @osalpekar @jiayisuse @lw @beauby @pritamdamania87 @mrshenli @jjlilley @gqchen @rohan-varma