ROCm CI is intermittently failing with std::out_of_range #49652

Closed

mruberry opened this issue Dec 20, 2020 · 9 comments
Labels
high priority, module: autograd (Related to torch.autograd, and the autograd engine in general), triage review

Comments

@mruberry
Collaborator

mruberry commented Dec 20, 2020

See https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm3.10-py3.6-test1/126//console and https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm3.10-py3.6-test1/110//console for examples.

As best I can tell, this error always happens around the same place in the test suite (although in these examples the actual tests where the error occurs are different).

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @albanD @gqchen @pearu @nikitaved @soulitzer @jeffdaily @sunway513

@mruberry added the "module: rocm" (AMD GPU support for Pytorch), "module: flaky-tests" (Problem is a flaky test in CI), and "triage review" labels on Dec 20, 2020
@mruberry
Collaborator Author

This issue may be happening on other builds. This just popped up, for example:

https://app.circleci.com/pipelines/github/pytorch/pytorch/253785/workflows/f0a0c844-62dc-455a-97d4-fd9c221c5a39/jobs/9777526

But the error is different.

@jeffdaily
Collaborator

We do have an engineer looking into this. We started seeing this error intermittently perhaps a month back when our CI was on ROCm 3.9. Right before we switched to ROCm 3.10, the error became more frequent. We've had difficulty reproducing it on our local systems. But we will continue our efforts. Good to know it might not be specific to ROCm, but it certainly seems to be more frequent there.

@jeffdaily
Collaborator

@mruberry our engineer @jaglinux was able to produce a stack trace.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff7803921 in __GI_abort () at abort.c:79
#2  0x00007ffff480f84a in __gnu_cxx::__verbose_terminate_handler () at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007ffff480df47 in __cxxabiv1::__terminate (handler=<optimized out>) at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00007ffff480df7d in std::terminate () at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00007ffff480e15a in __cxxabiv1::__cxa_throw (obj=obj@entry=0x55557c2680a0, tinfo=0x7ffff48c77a8 <typeinfo for std::out_of_range>, dest=0x7ffff481a1b4 <std::out_of_range::~out_of_range()>)
    at /home/nwani/m3/conda-bld/ne-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x00007ffff4828ad2 in std::__throw_out_of_range_fmt (__fmt=<optimized out>) at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/functexcept.cc:96
#7  0x00007fffe8148130 in torch::autograd::Engine::ready_queue_by_index(std::shared_ptr<torch::autograd::ReadyQueue>, int) () from /root/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007fffe8151dcb in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) () from /root/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007fffe81499e9 in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /root/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007fffed5c257a in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /root/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#11 0x00007ffff482a19d in std::execute_native_thread_routine (__p=0x55555a2668c0) at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#12 0x00007ffff7bbb6db in start_thread (arg=0x7ffa9d7f4700) at pthread_create.c:463
#13 0x00007ffff78e471f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

In Engine::ready_queue_by_index(), if a device_index of NO_DEVICE (-2) is passed in, it is used directly in device_ready_queues_.at(device_index), which throws std::out_of_range.

auto Engine::ready_queue_by_index(std::shared_ptr<ReadyQueue> cpu_ready_queue, int device_index) -> std::shared_ptr<ReadyQueue> {
  if (device_index == CPU_DEVICE) {
    // return the cpu ready queue passed in
    TORCH_INTERNAL_ASSERT(cpu_ready_queue);
    return cpu_ready_queue;
  } else {
    // See Note [Allocating GPUs to autograd threads]
    // NB: This function would become obsolete if we truly allocated a CPU thread
    // per device, rather than colocate.
    return device_ready_queues_.at(device_index);
  }
}

Should the if condition be changed to if (device_index < 0) to capture both the CPU_DEVICE and NO_DEVICE cases? Or should it be if (device_index == CPU_DEVICE || device_index == NO_DEVICE)?
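For illustration, a minimal sketch of the second variant (assuming NO_DEVICE is the -2 sentinel seen in the stack trace, defined alongside CPU_DEVICE in the engine); this is only one possible patch, not necessarily the fix that should land:

// Hypothetical variant: treat NO_DEVICE like CPU_DEVICE and return the CPU
// ready queue instead of indexing device_ready_queues_ with -2.
auto Engine::ready_queue_by_index(std::shared_ptr<ReadyQueue> cpu_ready_queue, int device_index) -> std::shared_ptr<ReadyQueue> {
  if (device_index == CPU_DEVICE || device_index == NO_DEVICE) {
    // return the cpu ready queue passed in
    TORCH_INTERNAL_ASSERT(cpu_ready_queue);
    return cpu_ready_queue;
  } else {
    // See Note [Allocating GPUs to autograd threads]
    return device_ready_queues_.at(device_index);
  }
}

Whether silently falling back to the CPU queue is correct, or whether NO_DEVICE should instead be treated as a hard error, is exactly the open question.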

@zou3519 added the "module: autograd" (Related to torch.autograd, and the autograd engine in general) label on Jan 4, 2021
@mruberry
Collaborator Author

mruberry commented Jan 5, 2021

Awesome, let me cc in our relevant experts here, @albanD @ngimel

@ezyang
Contributor

ezyang commented Jan 5, 2021

I think the invariant is that this should never be called with NO_DEVICE.

The relevant call site is

      local_graph_task->mark_as_completed_and_run_post_processing();

      auto base_owner = local_graph_task->owner_;
      // The current worker thread finish the graph_task, but the owning thread
      // of the graph_task might be sleeping on pop() if it does not have work.
      // So we need to send a dummy function task to the owning thread just to
      // ensure that it's not sleeping, so that we can exit the thread_main.
      // If it has work, it might see that graph_task->outstanding_tasks_ == 0
      // before it gets to the task, but it's a no-op anyway.
      //
      // NB: This is not necessary if the current thread is the owning thread.
      if (worker_device != base_owner) {
        // Synchronize outstanding_tasks_ with queue mutex
        std::atomic_thread_fence(std::memory_order_release);
        ready_queue_by_index(local_graph_task->cpu_ready_queue_, base_owner)
            ->push(NodeTask(local_graph_task, nullptr, InputBuffer(0)));
      }

local_graph_task->owner_ should have a device associated with it at all times.
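If that invariant holds, one way to make a violation fail loudly at this call site, rather than surfacing as an opaque std::out_of_range, would be something like the following sketch (illustrative only, assuming NO_DEVICE is the engine's -2 sentinel):

      if (worker_device != base_owner) {
        // Hypothetical guard: owner_ must have been assigned a real device by now.
        TORCH_INTERNAL_ASSERT(
            base_owner != NO_DEVICE,
            "graph_task owner_ was never set before post-processing");
        // Synchronize outstanding_tasks_ with queue mutex
        std::atomic_thread_fence(std::memory_order_release);
        ready_queue_by_index(local_graph_task->cpu_ready_queue_, base_owner)
            ->push(NodeTask(local_graph_task, nullptr, InputBuffer(0)));
      }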

@albanD
Collaborator

albanD commented Jan 5, 2021

I think we have a race in the engine in the case where the task is queued on a worker thread here, and the worker thread starts processing it before the next line in this function, which sets the owner_ here, has run.

You can argue that this happens because the call site that @ezyang mentions above does not lock the mutex for the local_graph_task before reading its owner.
A better invariant to have here might be that we never push a partially initialized graph task onto the queue. So we should delay pushing onto the queue until the graph task creation is finished and we have released the lock.
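As a standalone illustration of the race pattern being described (hypothetical types, not the actual autograd engine code): a task that is pushed onto a worker queue before it is fully initialized can be observed by the worker with its owner still unset; completing initialization before publishing removes the race.

#include <memory>

constexpr int NO_DEVICE = -2;  // sentinel for "owner not assigned yet"

struct GraphTask {
  int owner_ = NO_DEVICE;
};

// BUG: the worker may pop and read task->owner_ between push() and the
// assignment below, and then see NO_DEVICE.
template <typename Queue>
void schedule_racy(Queue& queue, int device) {
  auto task = std::make_shared<GraphTask>();
  queue.push(task);
  task->owner_ = device;
}

// FIX: finish initializing the task before it becomes visible to the worker.
template <typename Queue>
void schedule_safe(Queue& queue, int device) {
  auto task = std::make_shared<GraphTask>();
  task->owner_ = device;
  queue.push(task);
}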

@ezyang removed the "module: rocm" (AMD GPU support for Pytorch) label on Jan 6, 2021
@ezyang
Contributor

ezyang commented Jan 6, 2021

So this is not only a ROCm problem.

@mruberry removed the "module: flaky-tests" (Problem is a flaky test in CI) label on Jan 6, 2021
@mruberry
Collaborator Author

mruberry commented Jan 6, 2021

Also removing "module: flaky-tests" even though this is causing intermittent test failures. This will let it appear in our triage review meeting.

@jeffdaily
Collaborator

@albanD can we add a TORCH_INTERNAL_ASSERT to ready_queue_by_index() to capture the invariant @ezyang mentioned? Just in case.
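For concreteness, such an assert might look like the sketch below, placed in the else branch of ready_queue_by_index() shown earlier (assuming NO_DEVICE is the -2 sentinel from the stack trace; the wording and placement of any real assert may differ):

    // Hypothetical: fail with a descriptive message instead of letting .at()
    // throw std::out_of_range if a NO_DEVICE index ever reaches this function again.
    TORCH_INTERNAL_ASSERT(
        device_index != NO_DEVICE,
        "ready_queue_by_index called with NO_DEVICE; graph_task owner_ was never set");
    return device_ready_queues_.at(device_index);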

facebook-github-bot pushed a commit that referenced this issue Jan 11, 2021
Summary:
Follow up to  #49652

Pull Request resolved: #50372

Reviewed By: zhangguanheng66

Differential Revision: D25872203

Pulled By: albanD

fbshipit-source-id: 8d6f30f17fba856c5c34c08372767349a250983d
hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this issue Jan 14, 2021
…rch#50164)

Summary:
This solves a race condition where the worker thread might
see a partially initialized graph_task

Fixes pytorch#49652

I don't know how to reliably trigger the race so I didn't add any test. But the rocm build flakiness (it just happens to race more often on rocm builds) should disappear after this PR.

Pull Request resolved: pytorch#50164

Reviewed By: zou3519

Differential Revision: D25824954

Pulled By: albanD

fbshipit-source-id: 6a3391753cb2afd2ab415d3fb2071a837cc565bb