Test failures on Xeon Cascade lake introduced in 0.3.19 #3505
Can you try #3498 please? Unfortunately my only AVX-512 host at the moment is a lowly quad-core where I could not reproduce the (other) problem.
I've just started a build applying that fix (see the […]). Note that we build using […].
Actually I can reproduce the failures on a machine to which I have SSH access, so feel free to suggest things to try.
The full patch from #3498 would remove the special handling of DYNAMIC_ARCH (the revert to the other, older dgemm kernel - though I do not recall earlier errors from that).
I'm unable to reproduce the error if I drop […]. BTW, why don't […]?
Though #3498 with […].
Makes some sense, as the error observed in #3494 may be specific to something in Ubuntu 18.04 (binutils, or their snapshot/build of gcc 7.5.0). The switch to a different dgemm kernel for DYNAMIC_ARCH builds was added to work around an older issue where builds made on Sandybridge were observed to fail on Skylake. Possibly the old 4x8 DGEMM does not coexist well with the small-matrix kernels added recently.
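For readers unfamiliar with DYNAMIC_ARCH: the library probes the CPU at load time and installs the most specialized kernel it supports, which is why a build made on one machine can misbehave on another. A rough illustrative sketch of that selection logic (feature names and kernel names here are made up for illustration, not OpenBLAS's actual dispatch code):

```python
# Hedged sketch of DYNAMIC_ARCH-style kernel dispatch: probe CPU features,
# then pick the most specialized dgemm kernel they allow.
# Kernel and feature names are illustrative, not OpenBLAS's real tables.

def pick_dgemm_kernel(cpu_features):
    """Return the name of the best kernel supported by this CPU."""
    # Ordered from most to least specialized; first supported match wins.
    candidates = [
        ("dgemm_skylakex", {"avx512f", "avx512dq"}),  # AVX-512 path
        ("dgemm_haswell",  {"avx2", "fma"}),          # AVX2/FMA path
        ("dgemm_core2",    set()),                    # generic baseline
    ]
    for name, required in candidates:
        if required <= cpu_features:
            return name
    return "dgemm_core2"

# A Cascade Lake CPU reports AVX-512, so the specialized kernel is chosen;
# the failures in this issue concern that choice going wrong in some builds.
print(pick_dgemm_kernel({"avx512f", "avx512dq", "avx2", "fma"}))
print(pick_dgemm_kernel({"avx2", "fma"}))
```

The workaround discussed above effectively changes which entry such a table selects for DYNAMIC_ARCH builds.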
OK. What do you think we should do in the short term to fix the failures on Fedora?
I see. If/once they are relatively reliable, it would be useful to have a way to make them fail. It's easy to disable that if failures happen at random, but catching bugs early could be useful for Fedora IMO.

BTW, a related question I have: have you considered using a versioning scheme which allows distinguishing pure bugfix releases (which should be backported in Fedora) from feature releases (which should not)? Ideally this kind of bug wouldn't have been backported to existing Fedora versions, so that we would have more time to handle it.
Go with the patch from #3498.
Not doable, I think, given the number of architectures supported (unless you expect me to do a release whenever anything bad pops up anywhere, or more frequent releases in general).
OK, thanks!
My idea was not really to have more frequent releases or more backports, but to communicate more clearly to packagers and users what to expect from a given version. This is just semantic versioning, which many projects follow. If most OpenBLAS releases mix new features/improvements with bug fixes, it would be clearer IMHO to increase the minor version number rather than only the patch number. If you do that, nothing forces you to make bugfix/patch releases; but if at some point a situation arises where, e.g., a serious regression was introduced in the last release, shipping the fix in a bugfix release will make it clear to packagers and distributors that they should upgrade, and that they can do so with minimal risk of breakage.
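To make the request above concrete: under semantic versioning a packager could mechanically auto-approve patch-level bumps only. A minimal sketch, assuming plain MAJOR.MINOR.PATCH strings (`safe_to_backport` is an illustrative helper, not an existing Fedora tool):

```python
# Hedged sketch: under semantic versioning (MAJOR.MINOR.PATCH), only a
# patch-level bump signals "bug fixes only, safe to backport".
# safe_to_backport is a hypothetical helper for illustration.

def parse(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

def safe_to_backport(installed, candidate):
    """True if candidate differs from installed only in the patch number."""
    old, new = parse(installed), parse(candidate)
    return old[:2] == new[:2] and new[2] > old[2]

# The complaint above: 0.3.18 -> 0.3.19 *looks* like a pure bugfix bump,
# so a mechanical rule would backport it, even though it shipped features.
print(safe_to_backport("0.3.18", "0.3.19"))  # True
print(safe_to_backport("0.3.18", "0.4.0"))   # False: minor bump, needs review
```

The point of the comment above is precisely that OpenBLAS's 0.3.x numbering makes this mechanical distinction unreliable, since feature releases only bump the third component.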
The API has not changed in decades; maybe it is worth noting the PRs where bugs are fixed, so that those doing the backporting can pick them up.
I see test failures when running the OpenBLAS 0.3.19 tests on a Xeon Cascade Lake (see below), which did not appear with 0.3.18 (full build output here):

[test-failure excerpt missing from this capture]

Details about the CPU are here. The excerpt above is for the `libopenblasp64.so` library, which is built with `TARGET=CORE2 DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 NUM_THREADS=128 USE_THREAD=1 INTERFACE64=1`. But the build log contains errors for other variants too (we build many, notably with `INTERFACE64=0` and `USE_THREAD=0`). This triggers an incredible number of failures when running Julia tests.
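For anyone trying to reproduce, the flags above correspond to a make invocation along these lines (a sketch assuming an OpenBLAS 0.3.19 source checkout; the Fedora RPM build adds further packaging options not shown here):

```shell
# Reproduce the failing libopenblasp64 variant from the flags quoted above.
# These are standard OpenBLAS Makefile variables; run from the source root.
make TARGET=CORE2 DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 \
     NUM_THREADS=128 USE_THREAD=1 INTERFACE64=1
```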
Unfortunately I do not have direct access to that machine, I can only trigger builds of the RPM package. I can try a few build or runtime options though if that's useful.
Might be related to #3494. See also https://bugzilla.redhat.com/show_bug.cgi?id=1982856.