pytorch framework tests using make_tensor hangs with pytest's boxed exec option "--forked" #65522
Comments
We don't officially support pytest so some features for it may be missing. Can you elaborate on the use case for --forked? Isn't pytest already able to run tests in isolation?
Use case: my team is integrating PyTorch with our backend hardware, so we write kernels and hook into the dispatched PyTorch op APIs. We run the PyTorch test suite in pytest's "boxed"/forked mode to isolate failing test cases from passing ones. If we run the tests serially (without --forked), one test failure can cause cascaded false failures, because our stack is still under development.
It sounds like the issue is the use of multiple threads in this mode, however, which suggests that just replacing one repeat_interleave call is unlikely to have the desired effect. What about disabling parallelism and using only a single thread?
If I disable multiple threads in pytest forked (-n1, single thread), then there is no hang. But, it would be nice if make_tensor() also has a single threaded implementation that doesn't end up using at::parallel_for(). |
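A minimal sketch of the single-thread workaround discussed above (an assumption on my part, not an official fix): `torch.set_num_threads(1)` limits ATen's intra-op thread pool, so `at::parallel_for`-based kernels such as `repeat_interleave` stay on the calling thread, avoiding worker threads that a later `fork()` would strand.

```python
import torch

# Force ATen intra-op parallelism onto a single thread. This must run before
# the first parallel kernel executes, since the thread pool is created lazily.
torch.set_num_threads(1)

# repeat_interleave is the kernel implicated in the hang; with one thread it
# runs entirely on the calling thread.
t = torch.repeat_interleave(torch.arange(4), 2, dim=-1)
print(t.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Setting `OMP_NUM_THREADS=1` in the environment before launching pytest achieves the same effect without code changes.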
@satheeshxolo But is make_tensor() being multithreaded the only reason the test suite doesn't work while doing this? That seems very unlikely.
@mruberry - in my observations, the noncontiguous handling based on torch.repeat_interleave() (which in turn uses at::parallel_for()) is the point where execution deadlocks when running with pytest's --forked.
@satheeshxolo Yes but if you fix that what breaks next? |
@mruberry - I am not aware of an alternative that builds a noncontiguous tensor without using repeat_interleave. If there is one, please point me to it so I can check whether it solves the issue.
You can probably just change make_tensor to ignore the noncontiguous kwarg while debugging |
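One way to do that while debugging (a sketch, not an official API; the wrapper name and the patching target below are my assumptions) is to wrap make_tensor so the noncontiguous kwarg is silently dropped, which forces the contiguous path and skips repeat_interleave entirely:

```python
import functools

def drop_noncontiguous(fn):
    """Wrap a make_tensor-like callable so the `noncontiguous` kwarg is
    ignored, forcing the contiguous code path (no repeat_interleave)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        kwargs.pop("noncontiguous", None)  # discard the flag if present
        return fn(*args, **kwargs)
    return wrapper

# Usage sketch, assuming the 1.9-era location of make_tensor:
# from torch.testing._internal import common_utils
# common_utils.make_tensor = drop_noncontiguous(common_utils.make_tensor)
```

Placing the patch in a conftest.py would apply it to the whole test session.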
🐛 Bug
While trying to run the tests under pytorch/test/ using pytest's boxing option "--forked" (to isolate some tests crashing), I see that many tests hang with --forked.
To Reproduce
python -u -m pytest test_ops.py --forked -svk test_out_cos_cpu_float32
Expected behavior
Tests shouldn't hang with pytest's --forked option; alternatively, there should be a supported way to run the test suite in a boxed/forked pytest environment.
Environment
Collecting environment information...
PyTorch version: 1.9.0a0+git09dfd6d
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.6
Libc version: glibc-2.17
Python version: 3.7 (64-bit runtime)
Python platform: Linux-4.15.0-145-generic-x86_64-with-debian-buster-sid
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] torch==1.0.0+767d490.dirty
[pip3] torch-dataloader==1.0.0+767d490.dirty
[pip3] numpy==1.21.2
[pip3] torch==1.9.0a0+git09dfd6d
[conda] torch 1.0.0+767d490.dirty pypi_0 pypi
[conda] torch-dataloader 1.0.0+767d490.dirty pypi_0 pypi
[conda] mkl 2019.0 118
[conda] mkl-include 2019.0 118
[conda] numpy 1.21.2 pypi_0 pypi
[conda] numpy-base 1.20.3 py37h39b7dee_0
[conda] torch 1.9.0a0+git09dfd6d pypi_0 pypi
Additional context
I tried a first-level triage of the issue. The problem seems to be triggered by the handling of "noncontiguous" tensors in the make_tensor() utility (torch/testing/_creation.py on latest, but torch/testing/_internal/common_utils.py in 1.9.0).
For "noncontiguous" tensors, make_tensor() conditionally makes an additional call to torch.repeat_interleave(result, 2, dim=-1). The kernel implementation of "repeat_interleave" in aten/src/ATen/native/Repeat.cpp uses at::parallel_for(), which can execute on multiple threads. My guess is that this use of repeat_interleave for "noncontiguous" tensors causes the hang under pytest's "--forked" option, since forking a process whose thread pool has already been spawned leaves the child without those worker threads.
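To illustrate the code path being described, here is a simplified paraphrase of the noncontiguous handling (not the verbatim make_tensor source; the helper name is mine): the tensor is doubled along the last dimension with repeat_interleave, then a stride-2 slice restores the original values as a noncontiguous view.

```python
import torch

def as_noncontiguous(result: torch.Tensor) -> torch.Tensor:
    """Paraphrased sketch of make_tensor's noncontiguous handling."""
    if result.numel() > 1:
        # repeat_interleave is backed by at::parallel_for (Repeat.cpp),
        # so it may spawn ATen worker threads on first use.
        result = torch.repeat_interleave(result, 2, dim=-1)
        result = result[..., ::2]  # stride-2 view: same values, noncontiguous
    return result

t = as_noncontiguous(torch.arange(6.0))
print(t.is_contiguous())  # False
print(t.tolist())         # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

Once those worker threads exist in the pytest parent process, a subsequent fork() (as --forked does per test) copies only the forking thread into the child, which is the classic setup for a deadlock in the child's next parallel kernel call.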
cc @mruberry