
PyTorch framework tests using make_tensor hang with pytest's boxed exec option "--forked" #65522

Open
satheeshxolo opened this issue Sep 23, 2021 · 9 comments
Labels
module: tests Issues related to tests (not the torch.testing module); triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@satheeshxolo

satheeshxolo commented Sep 23, 2021

🐛 Bug

While trying to run the tests under pytorch/test/ with pytest's boxed execution option "--forked" (to isolate crashing tests), I see that many tests hang.

To Reproduce

python -u -m pytest test_ops.py --forked -svk test_out_cos_cpu_float32
Steps to reproduce the behavior:

  1. Install pytest package along with the packages pytest-xdist and pytest-forked
  2. From pytorch/test/ run: python -u -m pytest test_ops.py --forked -svk test_out_cos_cpu_float32
  3. The test hangs unless the process is killed (on Linux, 'ps' still shows the python process running after ^C)

Expected behavior

I expect some supported option for running the tests under pytest's boxed ("--forked") mode.
Tests shouldn't hang with pytest's --forked option.

Environment

Collecting environment information...
PyTorch version: 1.9.0a0+git09dfd6d
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.6
Libc version: glibc-2.17

Python version: 3.7 (64-bit runtime)
Python platform: Linux-4.15.0-145-generic-x86_64-with-debian-buster-sid
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] torch==1.0.0+767d490.dirty
[pip3] torch-dataloader==1.0.0+767d490.dirty
[pip3] numpy==1.21.2
[pip3] torch==1.9.0a0+git09dfd6d
[conda] torch 1.0.0+767d490.dirty pypi_0 pypi
[conda] torch-dataloader 1.0.0+767d490.dirty pypi_0 pypi
[conda] mkl 2019.0 118
[conda] mkl-include 2019.0 118
[conda] numpy 1.21.2 pypi_0 pypi
[conda] numpy-base 1.20.3 py37h39b7dee_0
[conda] torch 1.9.0a0+git09dfd6d pypi_0 pypi

Additional context

I tried a first-level triage of the issue. The problem seems to be triggered by the handling of "noncontiguous" tensors in the make_tensor() utility (torch/testing/_creation.py on latest, but torch/testing/_internal/common_utils.py in 1.9.0).
"noncontiguous" tensors get a conditional additional call to torch.repeat_interleave(result, 2, dim=-1). The kernel implementation of "repeat_interleave" in aten/src/ATen/native/Repeat.cpp uses at::parallel_for(), which can use multiple threads during execution. My guess is that this use of repeat_interleave for "noncontiguous" tensors causes the hang when combined with pytest's "--forked" option.

cc @mruberry

@H-Huang
Member

H-Huang commented Sep 23, 2021

We don't officially support pytest so some features for it may be missing. Can you elaborate on the use case for --forked? Isn't pytest already able to run tests in isolation?

@H-Huang added the module: tests and triaged labels Sep 23, 2021
@satheeshxolo
Author

Use-case: My team is integrating PyTorch with our backend hardware, so we write kernels and hook them up to the dispatched PyTorch op APIs. We run the PyTorch test suite in pytest's "boxed"/forked mode to isolate failing test cases from passing ones. If we ran the tests serially (without --forked), we might see cascaded false failures once one test fails (because our stack is still under development).
Since I found this issue while running with pytest --forked, I was thinking there might be a way to support it, at least through an environment variable. E.g., it would have been good to have something like:

    if noncontiguous and numel > 1:
        if os.environ.get('PYTORCH_TEST_WITH_PYTEST_FORKED', 'OFF') == 'ON':
            ...  # implementation that doesn't use repeat_interleave; might be less efficient
        else:
            ...  # implementation that uses torch.repeat_interleave
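For illustration only, a hypothetical "doesn't use repeat_interleave" branch could look like this (my own sketch, not existing PyTorch code; the copy below may still dispatch to parallel kernels, so this is an idea rather than a verified workaround):

    import torch

    def _make_noncontiguous_without_repeat_interleave(result: torch.Tensor) -> torch.Tensor:
        # Hypothetical alternative: allocate a tensor with a doubled last
        # dimension, write the values into every other slot, and return the
        # strided (noncontiguous) view over those slots.
        if result.numel() > 1:
            expanded = torch.empty(*result.shape[:-1], result.shape[-1] * 2,
                                   dtype=result.dtype, device=result.device)
            expanded[..., ::2] = result
            result = expanded[..., ::2]
        return result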

@mruberry
Collaborator

It sounds like the issue is the use of multiple threads in this mode, however, which suggests that just replacing one repeat_interleave call is unlikely to have the desired effect. What about disabling parallelism and using only a single thread?
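For reference, one way to try that (an assumption on my side, not something the test suite provides out of the box) is to pin ATen to a single thread before anything forks, e.g. via a conftest.py:

    # conftest.py (sketch): force single-threaded execution before torch spawns
    # any worker threads; the env vars must be set before torch is imported.
    import os
    os.environ.setdefault("OMP_NUM_THREADS", "1")
    os.environ.setdefault("MKL_NUM_THREADS", "1")

    import torch
    torch.set_num_threads(1)  # limit intra-op parallelism (at::parallel_for) to one thread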

@satheeshxolo
Author

If I run pytest --forked with a single worker (-n1), then there is no hang. But it would be nice if make_tensor() also had a single-threaded implementation that doesn't end up using at::parallel_for().

@mruberry
Collaborator

@satheeshxolo But is make_tensor() being multithreaded the only reason the test suite doesn't work while doing this? That seems very unlikely.

@satheeshxolo
Author

@mruberry - in my observations, the noncontiguous handling based on torch.repeat_interleave() (in turn based on at::parallel_for()) is the point where execution deadlocks when running with pytest's --forked.

@mruberry
Collaborator

@satheeshxolo Yes but if you fix that what breaks next?

@satheeshxolo
Author

@mruberry - I am not familiar with an alternative that uses an op other than repeat_interleave for noncontiguous tensors. If there is one, please point me to it so that I can check whether it would solve the issue.

@mruberry
Collaborator

@mruberry - I am not familiar with an alternative that uses an op other than repeat_interleave for noncontiguous tensors. If there is one, please point me to it so that I can check whether it would solve the issue.

You can probably just change make_tensor to ignore the noncontiguous kwarg while debugging.
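A hedged sketch of that debugging approach (the module path assumes the 1.9.0 layout, torch/testing/_internal/common_utils.py, mentioned above; whether the tests pick the patch up depends on how they import make_tensor, so this is a debugging aid, not a fix):

    # conftest.py (sketch): monkeypatch make_tensor to drop the noncontiguous request.
    import torch.testing._internal.common_utils as common_utils

    _orig_make_tensor = common_utils.make_tensor

    def _make_tensor_contiguous_only(*args, **kwargs):
        kwargs.pop("noncontiguous", None)  # ignore the noncontiguous kwarg while debugging
        return _orig_make_tensor(*args, **kwargs)

    common_utils.make_tensor = _make_tensor_contiguous_only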
