test_rpc_spawn fails sporadically #41474

Closed
Flamefire opened this issue Jul 15, 2020 · 16 comments
Labels
  • high priority
  • module: flaky-tests (Problem is a flaky test in CI)
  • module: tensorpipe (Related to Tensorpipe RPC Agent)
  • triage review
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@Flamefire
Collaborator

Flamefire commented Jul 15, 2020

🐛 Bug

Running the tests in distributed/rpc/tensorpipe/test_rpc_spawn.py fails randomly on one of our systems. Specifically, I see it for:

  • test_init_rpc_then_pg
  • test_init_pg_then_rpc

As the names are similar, I expect the underlying issue to be the same.

To Reproduce

Steps to reproduce the behavior:

  1. python distributed/rpc/tensorpipe/test_rpc_spawn.py TensorPipeAgentRpcTestWithSpawn.test_init_rpc_then_pg TensorPipeAgentRpcTestWithSpawn.test_init_pg_then_rpc
  2. Repeat until at least one fails (a rerun loop is sketched below)
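For step 2, a simple rerun loop (an illustrative sketch, not part of the original report) can automate the repetition:

```python
import subprocess
import sys

cmd = [
    sys.executable,
    "distributed/rpc/tensorpipe/test_rpc_spawn.py",
    "TensorPipeAgentRpcTestWithSpawn.test_init_rpc_then_pg",
    "TensorPipeAgentRpcTestWithSpawn.test_init_pg_then_rpc",
]

attempt = 0
while True:
    attempt += 1
    # A non-zero return code means at least one of the two tests failed.
    if subprocess.run(cmd).returncode != 0:
        print(f"Reproduced a failure on attempt {attempt}")
        break
```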

Sample output is below. The process number in "Expect process 1" varies between 1 and 2.

======================================================================
FAIL: test_init_rpc_then_pg (__main__.TensorPipeAgentRpcTestWithSpawn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lustre/ssd/ws/s3248973-EasyBuild2/easybuild-haswell/software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
    self._join_processes(fn)
  File "/lustre/ssd/ws/s3248973-EasyBuild2/easybuild-haswell/software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/lustre/ssd/ws/s3248973-EasyBuild2/easybuild-haswell/software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 349, in _check_return_codes
    i, first_process.exitcode, p.exitcode
  File "/lustre/ssd/ws/s3248973-EasyBuild2/easybuild-haswell/software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1122, in assertEqual
    self.assertTrue(result, msg=msg)
AssertionError: False is not true : Expect process 1 exit code to match Process 0 exit code of 0, but got -6

Environment

PyTorch version: 1.6.0-rc3
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Red Hat Enterprise Linux Server release 7.8 (Maipo)
GCC version: (GCC) 8.3.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80

Nvidia driver version: 450.36.06
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.17.3
[pip3] torch==1.6.0

cc @ezyang @gchanan @zou3519 @osalpekar @jiayisuse @lw @beauby @pritamdamania87 @mrshenli @jjlilley @gqchen @rohan-varma

@mrshenli mrshenli added module: tensorpipe Related to Tensorpipe RPC Agent module: flaky-tests Problem is a flaky test in CI triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jul 15, 2020
@mrshenli
Contributor

@Flamefire thanks for reporting! Does this error only relate to TensorPipe, or do you see similar failures with RpcTestWithSpawn?

@Flamefire
Collaborator Author

Those two are the only failing tests with "Rpc" in their name.

@lw
Contributor

lw commented Jul 15, 2020

Any chance you're able to run this with the child processes attached to a debugger and give us the stack traces of when the segfault occurs? That would greatly help as otherwise we're looking for a needle in a haystack... Thanks!

@Flamefire
Collaborator Author

Sure. Could you tell me how to do that? Those are started from the python test, aren't they?

@lw
Contributor

lw commented Jul 15, 2020

Yes, attaching a debugger to the child process will be a bit tricky. Here is one way to do it. You need to first modify the test code, by adding two lines in here:

@dist_init(setup_rpc=False)
def test_init_rpc_then_pg(self):
    rpc.init_rpc(
        name=worker_name(self.rank),
just at the beginning of the function, before the init_rpc call. These lines should be:

print(f"Worker {self.rank}: PID {os.getpid()}")
time.sleep(5)

(you may also have to add the os and time imports at the top)

If you now run the tests (you can pass command line flags to the executable to filter only the tests you want), they will print a few lines with their rank and PID and then pause for 5 seconds. In that pause you should run, from another terminal, gdb --pid={PID}. Since you said that this typically happens for workers 1 or 2, you could connect to one of those only (or to all of them, if you prefer, from four separate terminals).

Once GDB has attached to the process and is showing you the prompt, you can resume the program (run the continue command) and wait. If you hit a segfault then GDB will catch it and let you act before the program crashes. At that point you can run backtrace in the GDB prompt, which should print the stack trace of where the program failed. If you don't catch a segfault then you'll have to re-run the whole thing again. Hopefully this error isn't too rare...
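For reference, the two added lines can also be wrapped in a small self-contained helper (the helper name and the rank parameter are illustrative; the version above simply inlines the print and the sleep at the top of the test method):

```python
import os
import time


def pause_for_debugger(rank, seconds=5):
    """Print this process's PID and pause so gdb can be attached externally."""
    print(f"Worker {rank}: PID {os.getpid()}", flush=True)
    time.sleep(seconds)
```

In the test this would be called as pause_for_debugger(self.rank) right before rpc.init_rpc(...), after which gdb --pid=<printed PID> can be run from another terminal.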

@Flamefire
Collaborator Author

Ok, I finally managed to get a SIGABRT in gdb. Backtrace:

#0  0x00007f76bbb67387 in raise () from /lib64/libc.so.6
#1  0x00007f76bbb68a78 in abort () from /lib64/libc.so.6
#2  0x00007f76532a0856 in google::LogMessage::Flush() [clone .cold.124] ()
   from /software/glog/0.4.0-GCCcore-8.3.0/lib64/libglog.so.0
#3  0x00007f76532a746c in google::LogMessage::~LogMessage() ()
   from /software/glog/0.4.0-GCCcore-8.3.0/lib64/libglog.so.0
#4  0x00007f7687e63115 in std::_Function_handler<void (tensorpipe::Error const&, torch::distributed::rpc::Message&&), torch::distributed::rpc::TensorPipeAgent::respond(std::shared_ptr<tensorpipe::Pipe>&)::{lambda(tensorpipe::Error const&, torch::distributed::rpc::Message&&)#1}>::_M_invoke(std::_Any_data const&, tensorpipe::Error const&, torch::distributed::rpc::Message&&) ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007f7687e5b6fe in std::_Function_handler<void (tensorpipe::Error const&, tensorpipe::Message), torch::distributed::rpc::TensorPipeAgent::pipeRead(std::shared_ptr<tensorpipe::Pipe> const&, std::function<void (tensorpipe::Error const&, torch::distributed::rpc::Message&&)>)::{lambda(tensorpipe::Error const&, tensorpipe::Message)#1}>::_M_invoke(std::_Any_data const&, tensorpipe::Error const&, tensorpipe::Message&&) ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6  0x00007f764aeea5fa in tensorpipe::Pipe::Impl::readDescriptorFromLoop_(std::function<void (tensorpipe::Error const&, tensorpipe::Message)>)::{lambda(tensorpipe::Error const&, tensorpipe::Message)#1}::operator()(tensorpipe::Error const&, tensorpipe::Message) const ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#7  0x00007f764aeea7ea in std::_Function_handler<void (tensorpipe::Error const&, tensorpipe::Message), tensorpipe::Pipe::Impl::readDescriptorFromLoop_(std::function<void (tensorpipe::Error const&, tensorpipe::Message)>)::{lambda(tensorpipe::Error const&, tensorpipe::Message)#1}>::_M_invoke(std::_Any_data const&, tensorpipe::Error const&, tensorpipe::Message&&) ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#8  0x00007f764aeea4ec in tensorpipe::Pipe::Impl::callReadDescriptorCallback_(tensorpipe::(anonymous namespace)::ReadOperation&) ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#9  0x00007f764aeeaea6 in tensorpipe::Pipe::Impl::advanceReadOperation_(tensorpipe::(anonymous namespace)::ReadOperation&) [clone .isra.561] ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#10 0x00007f764aeed7c0 in tensorpipe::Pipe::Impl::handleError_() ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#11 0x00007f764aef64f8 in std::_Function_handler<void (), void tensorpipe::LazyCallbackWrapper<tensorpipe::Pipe::Impl>::entryPoint_<tensorpipe::Pipe::Impl::readDescriptorOfMessage_(tensorpipe::(anonymous namespace)::ReadOperation&)::{lambda(tensorpipe::Pipe::Impl&)#1}>(tensorpipe::Pipe::Impl&, tensorpipe::Pipe::Impl::readDescriptorOfMessage_(tensorpipe::(anonymous namespace)::ReadOperation&)::{lambda(tensorpipe::Pipe::Impl&)#1}&&, tensorpipe::Error const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#12 0x00007f764aee76b8 in tensorpipe::OnDemandLoop::deferToLoop(std::function<void ()>) ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#13 0x00007f764aef42ed in _ZNSt17_Function_handlerIFvRKN10tensorpipe5ErrorEEZNS0_10runIfAliveINS0_4Pipe4ImplEZNS0_19LazyCallbackWrapperIS7_EclIZNS7_24readDescriptorOfMessage_ERNS0_12_GLOBAL__N_113ReadOperationEEUlRS7_E_EEDaOT_EUlSE_S3_DpOT_E_EEDaRSt23enable_shared_from_thisISG_EOT0_EUlSK_E_E9_M_invokeERKSt9_Any_dataS3_ ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#14 0x00007f764af3750c in tensorpipe::transport::uv::Connection::Impl::readFromLoop(std::function<void (tensorpipe::Error const&, void const*, unsigned long)>)::{lambda(tensorpipe::Error const&, void const*, unsigned long)#1}::operator()(tensorpipe::Error const&, void const*, unsigned long) const ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#15 0x00007f764af38689 in tensorpipe::transport::uv::Connection::Impl::handleError_() ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#16 0x00007f764af3b4e3 in tensorpipe::transport::uv::Connection::Impl::readCallbackFromLoop_(long, uv_buf_t const*) ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#17 0x00007f764af3e78a in tensorpipe::transport::uv::StreamHandle<tensorpipe::transport::uv::TCPHandle, uv_tcp_s>::uv__read_cb(uv_stream_s*, long, uv_buf_t const*) ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#18 0x00007f764af7d377 in uv__read ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#19 0x00007f764af7d818 in uv__stream_io ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#20 0x00007f764af82dba in uv__io_poll ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#21 0x00007f764af779eb in uv_run ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#22 0x00007f764af54525 in tensorpipe::transport::uv::Loop::loop() ()
   from /software/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/lib/libtensorpipe.so
#23 0x00007f768a2110ff in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#24 0x00007f76bc109ea5 in start_thread () from /lib64/libpthread.so.0
#25 0x00007f76bbc2f8dd in clone () from /lib64/libc.so.6

@Flamefire
Collaborator Author

Flamefire commented Jul 16, 2020

Rebuilding glog with debug info reveals this:

#0  0x00007fa0d1fda387 in raise () from /lib64/libc.so.6
#1  0x00007fa0d1fdba78 in abort () from /lib64/libc.so.6
#2  0x00007fa0697159af in Lock (this=<optimized out>)
    at /dev/shm/s3248973-EasyBuild/glog/0.4.0/GCCcore-8.3.0/glog-0.4.0/src/base/mutex.h:272
#3  MutexLock (mu=<optimized out>, this=<optimized out>)
    at /dev/shm/s3248973-EasyBuild/glog/0.4.0/GCCcore-8.3.0/glog-0.4.0/src/base/mutex.h:290
#4  google::LogMessage::Flush (this=0x7fa00e88da10)
    at /dev/shm/s3248973-EasyBuild/glog/0.4.0/GCCcore-8.3.0/glog-0.4.0/src/logging.cc:1361

The code is `void Mutex::Lock() { SAFE_PTHREAD(pthread_mutex_lock); }` with `#define SAFE_PTHREAD(fncall) if (is_safe_ && fncall(&mutex_) != 0) abort()`. The mutex itself is defined as a global static: `static Mutex log_mutex;`

So it seems pthread_mutex_lock fails.

As I also sometimes see output like:

WARNING: Logging before InitGoogleLogging() is written to STDERR
W0716 10:52:25.812986 32276 tensorpipe_agent.cpp:491] RPC agent for worker1 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown)

I assume the following happens:

  • The client process is being shut down with a thread still running
  • Static variables are being deinitialized including that global mutex
  • An error event gets received by the tensorpipe worker thread
  • It tries to log that as a potential error; depending on whether this happens before or after the mutex was destroyed, the result is either the warning message or a crash

@Flamefire
Collaborator Author

Flamefire commented Jul 16, 2020

And finally I verified this by adding some debug logs to glog:

pthread_mutex_destroy(10857) --> 0
pthread_mutex_destroy(10857) --> 0
pthread_mutex_destroy(10857) --> 0
pthread_mutex_destroy(10857) --> 0
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0716 11:48:07.276598 10965 tensorpipe_agent.cpp:491] RPC agent for worker2 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown)
pthread_mutex_lock(10857) --> 22

So PID 10857 destroys all its mutexes (the other PIDs each destroy four as well), then tries to log the EOF event (I moved the logging before the MutexLock) and gets EINVAL from pthread_mutex_lock, i.e. an invalid mutex was passed.

The agent even checks for EOF before logging:

error.isOfType<tensorpipe::transport::EOFError>()) {

However, judging from the output "EOF: end of file", the type of the error is not tensorpipe::transport::EOFError but a UVError, because the `what()` of `EOFError` returns only `eof`.

Proposed solution:
Check for instances of `TP_CREATE_ERROR(UVError,` and replace them with code that creates an `EOFError` if the error code is libuv's EOF and a `UVError` otherwise.

@lw
Contributor

lw commented Jul 16, 2020

Wow, thank you so much for this thorough investigation, it really helps a lot!

I looked at the stack trace and I think your diagnosis is right. I believe the problem is that the TensorPipe agent instance is a global static variable, so when the process exits its destruction races with the destruction of the glog mutexes, and sometimes one of TensorPipe's event loops tries to log a message after the mutex has already been destroyed.

The reason this problem happens only in those two tests is probably simple: it seems we forgot to add an rpc.shutdown() call at the end of their test methods. (In most tests this is done automatically by the @dist_init decorator, but that does not happen when we pass setup_rpc=False.) Could you try adding that call at the end of these tests and check if they are still flaky?

The shutdown call fixes the problem because it clears the global static RPC agent earlier, before the process exits.
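As an illustration of this pairing, here is a toy, single-process sketch (not the actual test code; the real tests pass more arguments and spawn several worker processes):

```python
import os

import torch.distributed.rpc as rpc

# Rendezvous settings for the default env:// init method (single process here).
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

rpc.init_rpc(name="worker0", rank=0, world_size=1)

# ... issue RPCs and/or initialize a process group here ...

# The call the flaky tests were missing: it tears down the global RPC agent
# before process exit, so agent callbacks cannot race with the destruction of
# other global statics such as glog's mutexes.
rpc.shutdown()
```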

As for whether it is a requirement of the API that the user always calls rpc.shutdown() or if the library is supposed to work correctly even if users don't do that, I am not sure. @mrshenli might have some opinion on that.

As for the EOFError vs UVError approach, it could work in this specific instance but I'm afraid it could still be brittle. That is because there is no guarantee that we'll always get an EOF error when agents shut down; sometimes we could get an ECONNRESET, or possibly others. Figuring out which ones to map to EOFError and which ones to keep as UVErrors will not really scale. Also, in this specific case, the warning that gets logged is something we want to get rid of because it's somewhat expected and thus shouldn't pollute the logs, but we haven't found a clear way yet to detect it. See also #40094.

@Flamefire
Collaborator Author

> As for whether it is a requirement of the API that the user always calls rpc.shutdown() or if the library is supposed to work correctly even if users don't do that, I am not sure. @mrshenli might have some opinion on that.

One way to fix this is to log a message to a dummy sink which ignores the message in the constructor of your global static. This forces initialization of the glog global statics prior to yours and hence schedules their destruction after yours. Obviously avoiding global static objects would be best, but I'm aware that this is not always possible.

> As for the EOFError vs UVError approach, [...] the warning that gets logged is something we want to get rid of because it's somewhat expected and thus shouldn't pollute the logs, but we haven't found a clear way yet to detect it. See also #40094.

So yes, there are two issues: one is the use of glog during termination, and the other is that the intended fix from #40094 does not cover a UVError carrying libuv's EOF code. As there is a dedicated subclass for EOF, introducing a function that creates an error from a libuv error code and translates libuv's EOF into an EOFError seems like the right approach to fix the second issue. I see no point in having two different EOF errors.

@Flamefire
Collaborator Author

> Could you try adding that call at the end of these tests and check if they are still flaky?

Sorry I forgot that. I tried that now (changed /lib/python3.7/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py) and reran. It seems to work. But there is also still the "WARNING: Logging before InitGoogleLogging() is written to STDERR" message.

@lw
Contributor

lw commented Jul 16, 2020

> It seems to work.

Awesome! I'll send out a PR to fix this shortly.

> But there is also still the "WARNING: Logging before InitGoogleLogging() is written to STDERR" message.

I think that's normal/expected, and probably happens for all tests (right?). That's because nowhere in PyTorch do we call InitGoogleLogging(). It's up to the end user to call it, so that it can be passed the command-line flags etc.

> One way to fix this is to log a message to a dummy sink which ignores the message in the constructor of your global static.

That seems to be something that you know better than me (I've never been able to find good docs on glog, if you know any please send them my way!). If there's a way to formalize the dependency of the global RPC agent on the glog mutex so that we reliably get the right destruction order, that would be great! I see perhaps a couple of issues though: the global RPC agent is just a shared_ptr, which starts unset, so initializing glog in the constructor of the agent wouldn't necessarily create a dependency between glog and that shared_ptr. Second, PyTorch doesn't always use glog, and it has its own fallback in case glog isn't available, and I believe this will limit the number of "advanced features" we can use.

In fact, I think PyTorch isn't built with glog by default, which means the problem you hit only occurs for people who build their own PyTorch. This is probably why we didn't catch it earlier...

> I see no point in having two different EOF errors.

They do provide different information. In case of that specific backend, an EOF error is raised when a read returns 0, whereas a UVError is returned whenever there is an async error on a socket. So keeping them separate makes it easier to chase down problems. Also, different backends could choose to handle errors differently, and may not even have a "native" EOF error (think of backends doing RDMA). I'm not convinced this is the right approach.

My view is that the RPC agent shouldn't even inspect the error type at all, except for the PipeClosedError which is very special. The fact that, at the moment, we also check for the EOFError is just because it was the most common error type being logged and thus silencing it allowed us to avoid a lot of spurious log lines. However, once we have a proper solution for avoiding these lines, we won't have to special-case EOFError anymore.

lw added a commit that referenced this issue Jul 16, 2020
Addresses this bug report: #41474

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)

[ghstack-poisoned]
lw added a commit that referenced this issue Jul 16, 2020
…ing down RPC"

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes #41474

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779)

[ghstack-poisoned]
@lw
Contributor

lw commented Jul 16, 2020

The fix for the tests is in #41558, and the discussion on what to do for the general problem is in #41561.

@Flamefire
Collaborator Author

> I think that's normal/expected, and probably happens for all tests (right?)

Actually, no. For the tests discussed here I see it for 0 to 3 of the subprocesses. glog is initialized from InitCaffeLogging, which is called from GlobalInit, and I see that being called from many different files. So if it isn't called for this test, I'd say that is an oversight rather than intentional, since it is done in other test files.

> That seems to be something that you know better than me

That is just leveraging the C++ guarantee that destruction order is the inverse of construction-completion order. But if that global static is a shared_ptr, then not much can be done here except, e.g., initializing the shared_ptr (to empty, or to whatever it currently is) from a function that forces the logging mutex to be constructed first. That's maybe something to ask the Google folks.

> They do provide different information. In case of that specific backend, an EOF error is raised when a read returns 0, whereas a UVError is returned whenever there is an async error on a socket

OK, then a question to evaluate: what information does a UVError with value "EOF" provide? I'd expect it to be pretty much the same, but I'll leave that to you as you seem to be firmer on it. It was just a suggestion that looked right to me.

> The fact that, at the moment, we also check for the EOFError is just because it was the most common error type being logged and thus silencing it allowed us to avoid a lot of spurious log lines.

Then either UVError with code EOF could be added to that list with the same reasoning, or it could be converted to an EOFError and handled the same way (see above).

> However, once we have a proper solution for avoiding these lines, we won't have to special-case EOFError anymore.

A similar approach to the one suggested above for glog can be used: have an instance that is destructed before whatever triggers the EOF/PipeClosed; the order can be ensured via construction order. Its destructor would then gracefully shut down the threads "somehow". Again, these are just ideas; I'm not familiar with the code base, so I can only outline what I think is feasible.

@Flamefire
Collaborator Author

@lw Sorry for commenting on a closed issue but I just wanted to verify something you said:

> In fact, I think PyTorch isn't built with glog by default,

Could you verify this? I found the following lines which seem to suggest that using glog, gflags and protobuf is required:

pytorch/CMakeLists.txt

Lines 756 to 761 in b85568a

if((NOT USE_GLOG) OR (NOT USE_GFLAGS) OR BUILD_CUSTOM_PROTOBUF)
  message(WARNING
      "Generated cmake files are only fully tested if one builds "
      "with system glog, gflags, and protobuf. Other settings may "
      "generate files that are not well tested.")
endif()

I'm asking because I'm seeing an issue on POWER systems producing unexpected NaN results or floating point exceptions (FPE) when using glog. Those disappear when not using glog.

@lw
Contributor

lw commented Sep 18, 2020

I am by no means an expert. I said that because at one point I was curious about this, so I downloaded vanilla PyTorch from PyPI, built a C++ extension on top of it that allowed me to invoke the IsUsingGoogleLogging function, and saw that it returned false. That may have changed since, though.

constexpr bool IsUsingGoogleLogging() {
#ifdef C10_USE_GLOG
  return true;
#else
  return false;
#endif
}
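For reference, a hedged sketch of such a probe: it uses torch.utils.cpp_extension.load_inline to check the C10_USE_GLOG define against the installed PyTorch headers (the glog_probe/built_with_glog names are illustrative, and a working C++ toolchain is assumed):

```python
from torch.utils.cpp_extension import load_inline

cpp_src = r"""
#include <torch/extension.h>

bool built_with_glog() {
#ifdef C10_USE_GLOG
  return true;
#else
  return false;
#endif
}
"""

# load_inline compiles the snippet against the installed PyTorch headers and
# auto-generates a Python binding for the listed function.
ext = load_inline(name="glog_probe", cpp_sources=cpp_src,
                  functions=["built_with_glog"])
print("Built with glog:", ext.built_with_glog())
```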
