ROCm CI is intermittently failing with std::out_of_range #49652

Closed

mruberry opened this issue Dec 20, 2020 · 9 comments
Labels
high priority, module: autograd (Related to torch.autograd, and the autograd engine in general), triage review

Comments

@mruberry
Collaborator

mruberry commented Dec 20, 2020

See https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm3.10-py3.6-test1/126//console and https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm3.10-py3.6-test1/110//console for examples.

As best I can tell, this error always happens around the same place in the test suite (although in these examples the actual tests where the error occurs are different).

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @albanD @gqchen @pearu @nikitaved @soulitzer @jeffdaily @sunway513

@mruberry added the "module: rocm" (AMD GPU support for Pytorch), "module: flaky-tests" (Problem is a flaky test in CI), and "triage review" labels on Dec 20, 2020
@mruberry
Collaborator Author

This issue may be happening on other builds. This just popped up, for example:

https://app.circleci.com/pipelines/github/pytorch/pytorch/253785/workflows/f0a0c844-62dc-455a-97d4-fd9c221c5a39/jobs/9777526

But the error is different.

@jeffdaily
Collaborator

We do have an engineer looking into this. We started seeing this error intermittently perhaps a month back when our CI was on ROCm 3.9. Right before we switched to ROCm 3.10, the error became more frequent. We've had difficulty reproducing it on our local systems. But we will continue our efforts. Good to know it might not be specific to ROCm, but it certainly seems to be more frequent there.

@jeffdaily
Collaborator

@mruberry our engineer @jaglinux was able to produce a stack trace.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff7803921 in __GI_abort () at abort.c:79
#2  0x00007ffff480f84a in __gnu_cxx::__verbose_terminate_handler () at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007ffff480df47 in __cxxabiv1::__terminate (handler=<optimized out>) at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00007ffff480df7d in std::terminate () at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00007ffff480e15a in __cxxabiv1::__cxa_throw (obj=obj@entry=0x55557c2680a0, tinfo=0x7ffff48c77a8 <typeinfo for std::out_of_range>, dest=0x7ffff481a1b4 <std::out_of_range::~out_of_range()>)
    at /home/nwani/m3/conda-bld/ne-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x00007ffff4828ad2 in std::__throw_out_of_range_fmt (__fmt=<optimized out>) at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/functexcept.cc:96
#7  0x00007fffe8148130 in torch::autograd::Engine::ready_queue_by_index(std::shared_ptr<torch::autograd::ReadyQueue>, int) () from /root/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007fffe8151dcb in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) () from /root/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007fffe81499e9 in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /root/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007fffed5c257a in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /root/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#11 0x00007ffff482a19d in std::execute_native_thread_routine (__p=0x55555a2668c0) at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#12 0x00007ffff7bbb6db in start_thread (arg=0x7ffa9d7f4700) at pthread_create.c:463
#13 0x00007ffff78e471f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

In Engine::ready_queue_by_index(), if a device_index of NO_DEVICE (-2) is passed in, it is used directly in device_ready_queues_.at(device_index), which throws std::out_of_range.

auto Engine::ready_queue_by_index(std::shared_ptr<ReadyQueue> cpu_ready_queue, int device_index) -> std::shared_ptr<ReadyQueue> {
  if (device_index == CPU_DEVICE) {
    // return the cpu ready queue passed in
    TORCH_INTERNAL_ASSERT(cpu_ready_queue);
    return cpu_ready_queue;
  } else {
    // See Note [Allocating GPUs to autograd threads]
    // NB: This function would become obsolete if we truly allocated a CPU thread
    // per device, rather than colocate.
    return device_ready_queues_.at(device_index);
  }
}

Should the if condition be changed to if (device_index < 0) to capture both the CPU_DEVICE and NO_DEVICE cases? Or should it be if (device_index == CPU_DEVICE || device_index == NO_DEVICE)?
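For illustration, a minimal sketch of the second variant (assuming NO_DEVICE is the -2 sentinel seen in the stack trace, defined alongside CPU_DEVICE in the engine); this is only one possible patch, not necessarily the fix that should land:

// Hypothetical variant: treat NO_DEVICE like CPU_DEVICE and return the CPU
// ready queue instead of indexing device_ready_queues_ with -2.
auto Engine::ready_queue_by_index(std::shared_ptr<ReadyQueue> cpu_ready_queue, int device_index) -> std::shared_ptr<ReadyQueue> {
  if (device_index == CPU_DEVICE || device_index == NO_DEVICE) {
    // return the cpu ready queue passed in
    TORCH_INTERNAL_ASSERT(cpu_ready_queue);
    return cpu_ready_queue;
  } else {
    // See Note [Allocating GPUs to autograd threads]
    return device_ready_queues_.at(device_index);
  }
}

Whether silently falling back to the CPU queue is correct, or whether NO_DEVICE should instead be treated as a hard error, is exactly the open question.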

@zou3519 added the "module: autograd" (Related to torch.autograd, and the autograd engine in general) label on Jan 4, 2021
@mruberry
Collaborator Author

mruberry commented Jan 5, 2021

Awesome, let me cc in our relevant experts here, @albanD @ngimel

@ezyang
Contributor

ezyang commented Jan 5, 2021

I think the invariant is that this should never be called with NO_DEVICE.

The relevant call site is

      local_graph_task->mark_as_completed_and_run_post_processing();

      auto base_owner = local_graph_task->owner_;
      // The current worker thread finish the graph_task, but the owning thread
      // of the graph_task might be sleeping on pop() if it does not have work.
      // So we need to send a dummy function task to the owning thread just to
      // ensure that it's not sleeping, so that we can exit the thread_main.
      // If it has work, it might see that graph_task->outstanding_tasks_ == 0
      // before it gets to the task, but it's a no-op anyway.
      //
      // NB: This is not necessary if the current thread is the owning thread.
      if (worker_device != base_owner) {
        // Synchronize outstanding_tasks_ with queue mutex
        std::atomic_thread_fence(std::memory_order_release);
        ready_queue_by_index(local_graph_task->cpu_ready_queue_, base_owner)
            ->push(NodeTask(local_graph_task, nullptr, InputBuffer(0)));
      }

local_graph_task->owner_ should have a device associated with it at all times.
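If that invariant holds, one way to make a violation fail loudly at this call site, rather than surfacing as an opaque std::out_of_range, would be something like the following sketch (illustrative only, assuming NO_DEVICE is the engine's -2 sentinel):

      if (worker_device != base_owner) {
        // Hypothetical guard: owner_ must have been assigned a real device by now.
        TORCH_INTERNAL_ASSERT(
            base_owner != NO_DEVICE,
            "graph_task owner_ was never set before post-processing");
        // Synchronize outstanding_tasks_ with queue mutex
        std::atomic_thread_fence(std::memory_order_release);
        ready_queue_by_index(local_graph_task->cpu_ready_queue_, base_owner)
            ->push(NodeTask(local_graph_task, nullptr, InputBuffer(0)));
      }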

@albanD
Collaborator

albanD commented Jan 5, 2021

I think we have a race in the engine in the case where the task is queued on a worker thread here, and the worker thread starts processing it before the next line in this function, which sets the owner_ here, has run.

You can argue that this happens because the call site that @ezyang mentions above does not lock the mutex for the local_graph_task before reading its owner.
A better invariant to have here might be that we never push a partially initialized graph task onto the queue. So we should delay pushing onto the queue until the graph task creation is finished and we have released the lock.
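As a standalone illustration of the race pattern being described (hypothetical types, not the actual autograd engine code): a task that is pushed onto a worker queue before it is fully initialized can be observed by the worker with its owner still unset; completing initialization before publishing removes the race.

#include <memory>

constexpr int NO_DEVICE = -2;  // sentinel for "owner not assigned yet"

struct GraphTask {
  int owner_ = NO_DEVICE;
};

// BUG: the worker may pop and read task->owner_ between push() and the
// assignment below, and then see NO_DEVICE.
template <typename Queue>
void schedule_racy(Queue& queue, int device) {
  auto task = std::make_shared<GraphTask>();
  queue.push(task);
  task->owner_ = device;
}

// FIX: finish initializing the task before it becomes visible to the worker.
template <typename Queue>
void schedule_safe(Queue& queue, int device) {
  auto task = std::make_shared<GraphTask>();
  task->owner_ = device;
  queue.push(task);
}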

@ezyang removed the "module: rocm" (AMD GPU support for Pytorch) label on Jan 6, 2021
@ezyang
Contributor

ezyang commented Jan 6, 2021

So this is not only a ROCm problem.

@mruberry removed the "module: flaky-tests" (Problem is a flaky test in CI) label on Jan 6, 2021
@mruberry
Collaborator Author

mruberry commented Jan 6, 2021

Also removing "module: flaky-tests" even though this is causing intermittent test failures. This will let it appear in our triage review meeting.

@jeffdaily
Collaborator

@albanD can we add a TORCH_INTERNAL_ASSERT to ready_queue_by_index() to capture the invariant @ezyang mentioned? Just in case.
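For concreteness, such an assert might look like the sketch below, placed in the else branch of ready_queue_by_index() shown earlier (assuming NO_DEVICE is the -2 sentinel from the stack trace; the wording and placement of any real assert may differ):

    // Hypothetical: fail with a descriptive message instead of letting .at()
    // throw std::out_of_range if a NO_DEVICE index ever reaches this function again.
    TORCH_INTERNAL_ASSERT(
        device_index != NO_DEVICE,
        "ready_queue_by_index called with NO_DEVICE; graph_task owner_ was never set");
    return device_ready_queues_.at(device_index);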

facebook-github-bot pushed a commit that referenced this issue Jan 11, 2021
Summary:
Follow up to  #49652

Pull Request resolved: #50372

Reviewed By: zhangguanheng66

Differential Revision: D25872203

Pulled By: albanD

fbshipit-source-id: 8d6f30f17fba856c5c34c08372767349a250983d
hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this issue Jan 14, 2021
…rch#50164)

Summary:
This solves a race condition where the worker thread might
see a partially initialized graph_task

Fixes pytorch#49652

I don't know how to reliably trigger the race so I didn't add any test. But the rocm build flakiness (it just happens to race more often on rocm builds) should disappear after this PR.

Pull Request resolved: pytorch#50164

Reviewed By: zou3519

Differential Revision: D25824954

Pulled By: albanD

fbshipit-source-id: 6a3391753cb2afd2ab415d3fb2071a837cc565bb