
PyTorch framework tests using make_tensor hang with pytest's boxed exec option "--forked" #65522

Open
satheeshxolo opened this issue Sep 23, 2021 · 9 comments
Labels
module: tests Issues related to tests (not the torch.testing module); triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@satheeshxolo

satheeshxolo commented Sep 23, 2021

🐛 Bug

While trying to run the tests under pytorch/test/ with pytest's boxed execution option "--forked" (to isolate crashing tests), I see that many tests hang.

To Reproduce

python -u -m pytest test_ops.py --forked -svk test_out_cos_cpu_float32
Steps to reproduce the behavior:

  1. Install pytest package along with the packages pytest-xdist and pytest-forked
  2. From pytorch/test/ run: python -u -m pytest test_ops.py --forked -svk test_out_cos_cpu_float32
  3. The test hangs unless the process is killed (on Linux, 'ps' still shows the python process running after ^C)

Expected behavior

I expect some supported option for running the tests under pytest's boxed ("--forked") mode.
Tests shouldn't hang with pytest's --forked option.

Environment

Collecting environment information...
PyTorch version: 1.9.0a0+git09dfd6d
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.6
Libc version: glibc-2.17

Python version: 3.7 (64-bit runtime)
Python platform: Linux-4.15.0-145-generic-x86_64-with-debian-buster-sid
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] torch==1.0.0+767d490.dirty
[pip3] torch-dataloader==1.0.0+767d490.dirty
[pip3] numpy==1.21.2
[pip3] torch==1.9.0a0+git09dfd6d
[conda] torch 1.0.0+767d490.dirty pypi_0 pypi
[conda] torch-dataloader 1.0.0+767d490.dirty pypi_0 pypi
[conda] mkl 2019.0 118
[conda] mkl-include 2019.0 118
[conda] numpy 1.21.2 pypi_0 pypi
[conda] numpy-base 1.20.3 py37h39b7dee_0
[conda] torch 1.9.0a0+git09dfd6d pypi_0 pypi

Additional context

I tried a first-level triage of the issue. The problem seems to be triggered by the handling of "noncontiguous" tensors in the make_tensor() utility (torch/testing/_creation.py on latest, but torch/testing/_internal/common_utils.py in 1.9.0).
"noncontiguous" tensors get a conditional additional call to torch.repeat_interleave(result, 2, dim=-1). The kernel implementation of "repeat_interleave" in aten/src/ATen/native/Repeat.cpp uses at::parallel_for(), which can use multiple threads during execution. My guess is that this use of repeat_interleave for "noncontiguous" tensors causes the hang when combined with pytest's "--forked" option.

cc @mruberry

@H-Huang
Member

H-Huang commented Sep 23, 2021

We don't officially support pytest so some features for it may be missing. Can you elaborate on the use case for --forked? Isn't pytest already able to run tests in isolation?

@H-Huang added the module: tests and triaged labels Sep 23, 2021
@satheeshxolo
Author

Use-case: My team is integrating PyTorch with our backend hardware, so we write kernels and hook them up to the dispatched PyTorch op APIs. We run the PyTorch test suite in pytest's "boxed"/forked mode to isolate failing test cases from passing ones. If we ran the tests serially (without --forked), we might see cascaded false failures once one test fails (because our stack is still under development).
Since I found this issue while running with pytest --forked, I was thinking there might be a way to support it, at least through an environment variable. E.g., it would have been good to have something like:

    if noncontiguous and numel > 1:
        if os.environ.get('PYTORCH_TEST_WITH_PYTEST_FORKED', 'OFF') == 'ON':
            ...  # implementation that doesn't use repeat_interleave; might be less efficient
        else:
            ...  # implementation that uses torch.repeat_interleave
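For illustration only, a hypothetical "doesn't use repeat_interleave" branch could look like this (my own sketch, not existing PyTorch code; the copy below may still dispatch to parallel kernels, so this is an idea rather than a verified workaround):

    import torch

    def _make_noncontiguous_without_repeat_interleave(result: torch.Tensor) -> torch.Tensor:
        # Hypothetical alternative: allocate a tensor with a doubled last
        # dimension, write the values into every other slot, and return the
        # strided (noncontiguous) view over those slots.
        if result.numel() > 1:
            expanded = torch.empty(*result.shape[:-1], result.shape[-1] * 2,
                                   dtype=result.dtype, device=result.device)
            expanded[..., ::2] = result
            result = expanded[..., ::2]
        return result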

@mruberry
Collaborator

It sounds like the issue is the use of multiple threads in this mode, however, which suggests that just replacing one repeat_interleave call is unlikely to have the desired effect. What about disabling parallelism and using only a single thread?
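For reference, one way to try that (an assumption on my side, not something the test suite provides out of the box) is to pin ATen to a single thread before anything forks, e.g. via a conftest.py:

    # conftest.py (sketch): force single-threaded execution before torch spawns
    # any worker threads; the env vars must be set before torch is imported.
    import os
    os.environ.setdefault("OMP_NUM_THREADS", "1")
    os.environ.setdefault("MKL_NUM_THREADS", "1")

    import torch
    torch.set_num_threads(1)  # limit intra-op parallelism (at::parallel_for) to one thread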

@satheeshxolo
Author

If I run pytest --forked with a single worker (-n1), then there is no hang. But it would be nice if make_tensor() also had a single-threaded implementation that doesn't end up using at::parallel_for().

@mruberry
Collaborator

@satheeshxolo But is make_tensor() being multithreaded the only reason the test suite doesn't work while doing this? That seems very unlikely.

@satheeshxolo
Author

@mruberry - in my observations, the noncontiguous handling based on torch.repeat_interleave() (in turn based on at::parallel_for()) is the point where execution deadlocks when running with pytest's --forked.

@mruberry
Collaborator

@satheeshxolo Yes but if you fix that what breaks next?

@satheeshxolo
Author

@mruberry - I am not familiar with an alternative that uses an op other than repeat_interleave for noncontiguous tensors. If there is one, please point me to it so that I can check whether it would solve the issue.

@mruberry
Collaborator

@mruberry - I am not familiar with an alternative that uses an op other than repeat_interleave for noncontiguous tensors. If there is one, please point me to it so that I can check whether it would solve the issue.

You can probably just change make_tensor to ignore the noncontiguous kwarg while debugging.
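A hedged sketch of that debugging approach (the module path assumes the 1.9.0 layout, torch/testing/_internal/common_utils.py, mentioned above; whether the tests pick the patch up depends on how they import make_tensor, so this is a debugging aid, not a fix):

    # conftest.py (sketch): monkeypatch make_tensor to drop the noncontiguous request.
    import torch.testing._internal.common_utils as common_utils

    _orig_make_tensor = common_utils.make_tensor

    def _make_tensor_contiguous_only(*args, **kwargs):
        kwargs.pop("noncontiguous", None)  # ignore the noncontiguous kwarg while debugging
        return _orig_make_tensor(*args, **kwargs)

    common_utils.make_tensor = _make_tensor_contiguous_only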
