
test_cond_cpu tests fail when running with numpy compiled against OpenBLAS 0.3.15 #67675

Closed
casparvl opened this issue Nov 2, 2021 · 1 comment
Assignees
Labels
module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul · module: NaNs and Infs Problems related to NaN and Inf handling in floating point · module: openblas · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@casparvl

casparvl commented Nov 2, 2021

🐛 Bug

The following tests fail in the PyTorch 1.10.0 test suite when using numpy compiled with OpenBLAS 0.3.15:

ERROR: test_cond_cpu_complex128 (__main__.TestLinalgCPU)
ERROR: test_cond_cpu_complex64 (__main__.TestLinalgCPU)
ERROR: test_cond_cpu_float32 (__main__.TestLinalgCPU)
ERROR: test_cond_cpu_float64 (__main__.TestLinalgCPU)

The failure looks like this:

======================================================================
ERROR: test_cond_cpu_complex128 (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 368, in instantiated_test
    result = test(self, **param_kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_linalg.py", line 1635, in test_cond
    run_test_case(a, p)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_linalg.py", line 1600, in run_test_case
    result_numpy = np.linalg.cond(input.cpu().numpy(), p)
  File "<__array_function__ internals>", line 5, in cond
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 1780, in cond
    r = norm(x, p, axis=(-2, -1)) * norm(invx, p, axis=(-2, -1))
  File "<__array_function__ internals>", line 5, in norm
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 2600, in norm
    ret = _multi_svd_norm(x, row_axis, col_axis, sum)
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 2354, in _multi_svd_norm
    result = op(svd(y, compute_uv=False), axis=-1)
  File "<__array_function__ internals>", line 5, in svd
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 1672, in svd
    s = gufunc(a, signature=signature, extobj=extobj)
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 97, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge

The reason is that OpenBLAS 0.3.15 changed how NaNs are propagated; see numpy/numpy#18914.

In the particular case of these PyTorch tests, the input is

input=np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])

And the call being made when the test fails is:

np.linalg.cond(input,'nuc')

Note that only the 'nuc' norm fails; all the other norm orders appear to return inf.
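The difference between norm orders can be checked directly with a short script (a sketch only; the 'nuc' outcome depends on which OpenBLAS version numpy was built against):

```python
import numpy as np

# Singular 3x3 input from the failing test case.
a = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])

# For these orders, np.linalg.cond converts the NaN intermediates that
# arise from inverting a singular matrix into inf.
for p in (1, 2, np.inf, 'fro'):
    print(p, np.linalg.cond(a, p))

# The 'nuc' order additionally runs an SVD on the NaN-filled inverse;
# with OpenBLAS 0.3.15 that SVD fails instead of propagating NaNs.
try:
    print('nuc', np.linalg.cond(a, 'nuc'))
except np.linalg.LinAlgError as exc:
    print('nuc raised LinAlgError:', exc)
```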

To Reproduce

Steps to reproduce the behavior:

  1. Build PyTorch 1.10.0 from source with a numpy that is built against OpenBLAS 0.3.15
  2. Run run_test.py -i test_linalg.TestLinalgCPU

Instead of running the test case, the failing commands can also be run directly:

import numpy as np

input = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
np.linalg.cond(input, 'nuc')

Expected behavior

I expected the test to pass :)

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version (e.g., 1.0): 1.10.0
  • OS (e.g., Linux): CentOS 8
  • How you installed PyTorch (conda, pip, source): Source
  • Build command you used (if compiling from source):
USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.10.0 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=72 BLAS=Eigen WITH_BLAS=open USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/lib64 CUDNN_INCLUDE_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/tmp/sw_stack_gpu/software/NCCL/2.10.3-GCCcore-10.3.0-CUDA-11.3.1/include USE_METAL=0   /tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python setup.py build
  • Python version: 3.9.5
  • CUDA/cuDNN version: CUDA: 11.3.1 / cuDNN: 8.2.1.32
  • GPU models and configuration: N/A
  • Any other relevant information: numpy 1.20.3, OpenBLAS 0.3.15

Additional context

The PyTorch tests use the np.linalg functions as a reference and check that the equivalent torch.linalg calls produce the same result. In this case, the np.linalg reference itself fails because of the changed OpenBLAS behavior; the failure is LAPACK-implementation-dependent.

There is still discussion in numpy/numpy#18914 about how to deal with the changed OpenBLAS behavior. The likely conclusion is that the behavior of np.linalg.cond is undefined for inputs that contain a NaN or that are singular (non-invertible); at minimum, that is the behavior of every existing numpy version when compiled against OpenBLAS 0.3.15. Thus, using test cases that rely on singular/non-invertible inputs to np.linalg.cond as a reference is probably not a great idea.
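One way a test suite could sidestep the ambiguity is to detect rank deficiency before trusting the reference value. The helper below is a hypothetical sketch, not part of numpy or PyTorch:

```python
import numpy as np

def safe_cond(a, p=None):
    """Return np.linalg.cond(a, p), or np.inf when `a` is singular.

    Hypothetical guard: np.linalg.cond's behavior on singular input is
    LAPACK-implementation-dependent (it may return inf or raise
    LinAlgError), so detect rank deficiency up front instead.
    """
    a = np.asarray(a)
    if np.linalg.matrix_rank(a) < min(a.shape[-2:]):
        return np.inf
    return np.linalg.cond(a, p)
```

With this guard the singular test input yields inf for every norm order, regardless of which BLAS/LAPACK backs numpy.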

The test that fails is the one that uses singular input on purpose

# test for singular input

In my view, this test should simply be skipped.
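In unittest terms, the skip could look like the following sketch (the test name and skip message are illustrative, not the actual PyTorch fix):

```python
import unittest
import numpy as np

class TestCondSingular(unittest.TestCase):
    # np.linalg.cond is effectively undefined for singular input: the
    # result is LAPACK-implementation-dependent, so it cannot serve as a
    # stable reference for torch.linalg.cond.
    @unittest.skip("np.linalg.cond on singular input is LAPACK-dependent; "
                   "OpenBLAS 0.3.15 raises 'SVD did not converge'")
    def test_cond_singular_nuc(self):
        a = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
        np.linalg.cond(a, 'nuc')
```

Run with `python -m unittest` to see the case reported as skipped rather than errored.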

cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano

@mruberry mruberry added module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul · module: NaNs and Infs Problems related to NaN and Inf handling in floating point · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module — Nov 2, 2021
@mruberry
Collaborator

mruberry commented Nov 2, 2021

Thanks for reporting this issue, @casparvl!

lezcano added a commit that referenced this issue Nov 2, 2021
@lezcano lezcano self-assigned this Nov 2, 2021
lezcano added a commit that referenced this issue Nov 2, 2021
implementations

Fixes #67675

cc mruberry

[ghstack-poisoned]
seemethere added a commit that referenced this issue Nov 15, 2021
Summary:
Pull Request resolved: #67679

implementations

Fixes #67675

cc mruberry

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32368698

Pulled By: mruberry

fbshipit-source-id: 3ea6ebc43c061af2f376cdf5da06884859bbbf53
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
ghstack-source-id: 856e69d7e57f4e8cd8c794feda9487f006c7dfde
malfet pushed a commit that referenced this issue Feb 15, 2022