
test_cond_cpu tests fail when running with numpy compiled against OpenBLAS 0.3.15 #67675

Closed
casparvl opened this issue Nov 2, 2021 · 1 comment
Assignees
Labels
module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul · module: NaNs and Infs Problems related to NaN and Inf handling in floating point · module: openblas · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@casparvl

casparvl commented Nov 2, 2021

🐛 Bug

The following tests fail in the PyTorch 1.10.0 test suite when using numpy compiled with OpenBLAS 0.3.15:

ERROR: test_cond_cpu_complex128 (__main__.TestLinalgCPU)
ERROR: test_cond_cpu_complex64 (__main__.TestLinalgCPU)
ERROR: test_cond_cpu_float32 (__main__.TestLinalgCPU)
ERROR: test_cond_cpu_float64 (__main__.TestLinalgCPU)

The failure looks like this:

======================================================================
ERROR: test_cond_cpu_complex128 (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 368, in instantiated_test
    result = test(self, **param_kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_linalg.py", line 1635, in test_cond
    run_test_case(a, p)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_linalg.py", line 1600, in run_test_case
    result_numpy = np.linalg.cond(input.cpu().numpy(), p)
  File "<__array_function__ internals>", line 5, in cond
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 1780, in cond
    r = norm(x, p, axis=(-2, -1)) * norm(invx, p, axis=(-2, -1))
  File "<__array_function__ internals>", line 5, in norm
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 2600, in norm
    ret = _multi_svd_norm(x, row_axis, col_axis, sum)
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 2354, in _multi_svd_norm
    result = op(svd(y, compute_uv=False), axis=-1)
  File "<__array_function__ internals>", line 5, in svd
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 1672, in svd
    s = gufunc(a, signature=signature, extobj=extobj)
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 97, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge

The reason is that OpenBLAS 0.3.15 changed how NaNs are propagated; see numpy/numpy#18914.

In the particular case of these PyTorch tests, the input is

input=np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])

And the call being made when the test fails is:

np.linalg.cond(input,'nuc')

Note that only the 'nuc' norm fails; all the other norm orders appear to return inf.
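The difference between norm orders can be checked directly with a short script (a sketch only; the 'nuc' outcome depends on which OpenBLAS version numpy was built against):

```python
import numpy as np

# Singular 3x3 input from the failing test case.
a = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])

# For these orders, np.linalg.cond converts the NaN intermediates that
# arise from inverting a singular matrix into inf.
for p in (1, 2, np.inf, 'fro'):
    print(p, np.linalg.cond(a, p))

# The 'nuc' order additionally runs an SVD on the NaN-filled inverse;
# with OpenBLAS 0.3.15 that SVD fails instead of propagating NaNs.
try:
    print('nuc', np.linalg.cond(a, 'nuc'))
except np.linalg.LinAlgError as exc:
    print('nuc raised LinAlgError:', exc)
```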

To Reproduce

Steps to reproduce the behavior:

  1. Build PyTorch 1.10.0 from source with a numpy that is built against OpenBLAS 0.3.15
  2. Run run_test.py -i test_linalg.TestLinalgCPU

Instead of running the test case, the failing commands can also be run directly:

import numpy as np

input = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
np.linalg.cond(input, 'nuc')

Expected behavior

I expected the test to pass :)

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version (e.g., 1.0): 1.10.0
  • OS (e.g., Linux): CentOS 8
  • How you installed PyTorch (conda, pip, source): Source
  • Build command you used (if compiling from source):
USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.10.0 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=72 BLAS=Eigen WITH_BLAS=open USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/lib64 CUDNN_INCLUDE_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/tmp/sw_stack_gpu/software/NCCL/2.10.3-GCCcore-10.3.0-CUDA-11.3.1/include USE_METAL=0   /tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python setup.py build
  • Python version: 3.9.5
  • CUDA/cuDNN version: CUDA: 11.3.1 / cuDNN: 8.2.1.32
  • GPU models and configuration: N/A
  • Any other relevant information: numpy 1.20.3, OpenBLAS 0.3.15

Additional context

The PyTorch tests use the np.linalg functions as a reference and check that the equivalent torch.linalg calls produce the same result. In this case, the np.linalg reference itself fails because of the changed OpenBLAS behavior; the failure is LAPACK-implementation-dependent.

There is still discussion in numpy/numpy#18914 about how to deal with the changed OpenBLAS behavior. The likely conclusion is that the behavior of np.linalg.cond is undefined for inputs that contain a NaN or that are singular (non-invertible); at minimum, that is the behavior of every existing numpy version when compiled against OpenBLAS 0.3.15. Thus, using test cases that rely on singular/non-invertible inputs to np.linalg.cond as a reference is probably not a great idea.
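One way a test suite could sidestep the ambiguity is to detect rank deficiency before trusting the reference value. The helper below is a hypothetical sketch, not part of numpy or PyTorch:

```python
import numpy as np

def safe_cond(a, p=None):
    """Return np.linalg.cond(a, p), or np.inf when `a` is singular.

    Hypothetical guard: np.linalg.cond's behavior on singular input is
    LAPACK-implementation-dependent (it may return inf or raise
    LinAlgError), so detect rank deficiency up front instead.
    """
    a = np.asarray(a)
    if np.linalg.matrix_rank(a) < min(a.shape[-2:]):
        return np.inf
    return np.linalg.cond(a, p)
```

With this guard the singular test input yields inf for every norm order, regardless of which BLAS/LAPACK backs numpy.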

The test that fails is the one that uses singular input on purpose

# test for singular input

In my view, this test should simply be skipped.
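In unittest terms, the skip could look like the following sketch (the test name and skip message are illustrative, not the actual PyTorch fix):

```python
import unittest
import numpy as np

class TestCondSingular(unittest.TestCase):
    # np.linalg.cond is effectively undefined for singular input: the
    # result is LAPACK-implementation-dependent, so it cannot serve as a
    # stable reference for torch.linalg.cond.
    @unittest.skip("np.linalg.cond on singular input is LAPACK-dependent; "
                   "OpenBLAS 0.3.15 raises 'SVD did not converge'")
    def test_cond_singular_nuc(self):
        a = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
        np.linalg.cond(a, 'nuc')
```

Run with `python -m unittest` to see the case reported as skipped rather than errored.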

cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano

@mruberry mruberry added module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul · module: NaNs and Infs Problems related to NaN and Inf handling in floating point · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module — Nov 2, 2021
@mruberry
Collaborator

mruberry commented Nov 2, 2021

Thanks for reporting this issue, @casparvl!

lezcano added a commit that referenced this issue Nov 2, 2021
@lezcano lezcano self-assigned this Nov 2, 2021
lezcano added a commit that referenced this issue Nov 2, 2021
implementations

Fixes #67675

cc mruberry

[ghstack-poisoned]
seemethere added a commit that referenced this issue Nov 15, 2021
Summary:
Pull Request resolved: #67679

implementations

Fixes #67675

cc mruberry

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32368698

Pulled By: mruberry

fbshipit-source-id: 3ea6ebc43c061af2f376cdf5da06884859bbbf53
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
ghstack-source-id: 856e69d7e57f4e8cd8c794feda9487f006c7dfde
malfet pushed a commit that referenced this issue Feb 15, 2022