Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad numerical matrix inversion with Ubuntu 18.04, gcc 7.5, and a Xeon Skylake-W processor #31007

Closed
NathanDunfield opened this issue Dec 4, 2020 · 14 comments

Comments

@NathanDunfield
Copy link

I built Sage 9.2 from source on Ubuntu 18.04 with gcc 7.5 in a Docker container running on an iMac Pro with a Xeon W-2150B processor (Skylake-W). Inversion of CDF matrices gives bad answers, which I traced down to scipy and finally to OpenBLAS 0.3.9. Attached is a simple script that takes an 18 x 18 matrix of complex numbers, "inverts" it, multiplies by the original, and compares to the identity. The expected behavior is:

sage -python inverse_prob.py
6e-15 # or similar 

but on the system in question I get a wildly wrong answer:

28.453890763868475

The OpenBLAS library was built with Haswell as the target, as its file name is:

libopenblas_haswellp-r0.3.9.so

Copying a version OpenBLAS built on Sandy Bridge hardware fixes the problem. On the same physical hardware with the Xeon W-2150B processor, the problem does not manifest itself in:

a) macOS 10.15, Python 3.8, scipy 1.5.2

b) Debian Sid, with official SageMath 9.2 package.

c) Ubuntu 18.04 and 20.04, with the python3-scipy package.

d) Latest Anaconda Docker image.

The failing Docker image in question has been posted as computop/sage:9.2a1.

Possibly related to #27961.

CC: @jhpalmieri @vbraun @mkoeppe @embray

Component: numerical

Branch/Commit: u/mkoeppe/bad_numerical_matrix_inversion_with_ubuntu_18_04__gcc_7_5__and_a_xeon_skylake_w_processor @ 2cc103f

Reviewer: Nathan Dunfield, Markus Wageringel

Issue created by migration from https://trac.sagemath.org/ticket/31007

@NathanDunfield NathanDunfield added this to the sage-9.3 milestone Dec 4, 2020
@NathanDunfield
Copy link
Author

Attachment: inverse_prob.py.gz

@NathanDunfield
Copy link
Author

comment:3

I have confirmed that this issue is fixed in the upstream repo:

git clone https://github.com/xianyi/OpenBLAS/
cd OpenBLAS
make NO_AVX512=1  # Sage already adds this option, see #27961
make PREFIX=/sage/local install

However, the problem is present even in the most recent official release of 0.3.12.

@mkoeppe
Copy link
Member

mkoeppe commented Dec 5, 2020

comment:4

Is there a particular commit that we can apply as a patch?

@NathanDunfield
Copy link
Author

Attachment: avx2.patch.gz

@NathanDunfield
Copy link
Author

comment:5

By bisecting, I determined that it is the commit fix avx2 detection which fixes the problem. It is possible to apply this as a patch to our current OpenBLAS 0.3.9, see attachment, and this does resolve the issue on my system.

The underlying issue seems to be this, where OpenBLAS is incorrectly identifying my processor as lacking AVX2, even though it has it. I don't exactly understand how this causes the observed behavior; if I add NO_AVX2=1 the make args the everything works fine (this is equivalent to compiling for Sandy Bridge). However, clearly applying the patch would be an improvement.

@mkoeppe
Copy link
Member

mkoeppe commented Dec 6, 2020

@mkoeppe
Copy link
Member

mkoeppe commented Dec 6, 2020

Commit: 2cc103f

@mkoeppe
Copy link
Member

mkoeppe commented Dec 6, 2020

New commits:

2cc103fbuild/pkgs/openblas/patches: Add avx2.patch

@mkoeppe
Copy link
Member

mkoeppe commented Dec 6, 2020

Author: Nathan Dunfield

@mwageringel
Copy link

comment:8

This does not seem to be relevant anymore, as the fix is included in Openblas 0.3.13, which is contained in Sage since 9.3.beta6.

@NathanDunfield
Copy link
Author

comment:9

Replying to @mwageringel:

This does not seem to be relevant anymore, as the fix is included in Openblas 0.3.13, which is contained in Sage since 9.3.beta6.

Yes, I agree this was fixed by #29913 (Upgrade OpenBLAS to 0.3.13). What is the correct way to close this ticket in this situation?

@mwageringel
Copy link

comment:10

Replying to @NathanDunfield:

What is the correct way to close this ticket in this situation?

Thanks for the confirmation. We can set the milestone to duplicate and give a positive review.

@mwageringel
Copy link

Reviewer: Nathan Dunfield, Markus Wageringel

@mwageringel
Copy link

Changed author from Nathan Dunfield to none

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants