performance on AMD Ryzen and Threadripper #1461

Open
tkswe88 opened this issue Feb 14, 2018 · 121 comments


tkswe88 commented Feb 14, 2018

This report comes right from #1425, where the discussion drifted off from thread safety in openblas v. 0.2.20 to performance on AMD Ryzen and Threadripper processors (in this particular case a TR 1950X). It seems worthwhile to discuss this in a separate thread. So far we have had the following discussion:

@tkswe88: After plenty of initial tests with the AMD TR 1950X processor, it seems that openblas (tested 0.2.19, 0.2.20 and the development version on Ubuntu 17.10 with kernel 4.13, gcc and gfortran v. 7.2) operates roughly 20% slower on the TR1950X than on an i7-4770 (4 cores) when using 1, 2 and 4 threads. This is somewhat surprising given that both CPUs use at most AVX2 and thus should be comparable in terms of vectorisation potential. I have already adjusted the OpenMP thread affinity to rule out the (hidden) NUMA architecture of the TR1950X as the cause of its lower performance. Other measures I took were 1) a BIOS upgrade, 2) a Linux kernel upgrade to 4.15, and 3) an increase of the DIMM frequency from 2133 to 2666 MHz. Except for the latter, which gave a speedup of roughly 3%, these measures did not have any effect on execution speed. Do you have any idea where the degraded performance on the TR1950X comes from? Is this related to a current lack of optimization in openblas, or do we just have to wait for the next major release of gcc/gfortran to fix the problem? Of course, I would be willing to run tests if this was of help in developing openblas.

@brada4: AMD has slightly slower AVX and AVX2 units per CPU; it is by no means slow in general, and it still has a heap of cores to spare. Sometimes optimal AVX2 saturation means the whole CPU package drops to its base, i.e. non-turbo, frequency.

@martin-frbg: Could also be that getarch is mis-detecting the cache sizes on TR, or the various hardcoded block sizes from param.h for loop unrolling are "more wrong" on TR than they were on the smaller Ryzen. Overall support for the Ryzen architecture is currently limited to treating it like Haswell, see #1133, #1147. There may be other assembly instructions besides AVX2 that are slower on Ryzen (#1147 mentions movntp*).

@tkswe88: Are there any tests I could do to find out about cache size detection errors or more appropriate settings for loop unrolling?

@brada4: Regarding what you asked martin: the copied parameters may need to be doubled or halved, at least here: https://github.com/xianyi/OpenBLAS/pull/1133/files#diff-7a3ef0fabb9c6c40aac5ae459c3565f0

@martin-frbg: You could simply check the autodetected information in config.h against the specification. (As far as I can determine, L3 size seems to be ignored as a parameter).
As far as appropriate settings go, the benchmark directory contains a few tests that can be used for timing individual functions. Adjusting the values in param.h (see https://github.com/xianyi/OpenBLAS/pull/1157/files) is a bit of a black art though.
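For timing an individual function outside the bundled benchmark harness, a minimal stand-alone C sketch along these lines can also be used (assumption: it is linked against the freshly built library with -lopenblas; matrix size and fill values are arbitrary):

/* minimal dgemm timing sketch -- not part of OpenBLAS, just an illustration */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 4000;          /* matrix dimension */
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);    /* C = A*B */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("n=%d  time=%.3f s  %.1f GFLOPS\n", n, secs, 2.0 * n * n * n / secs * 1e-9);
    free(A); free(B); free(C);
    return 0;
}

Running it for a few sizes before and after a param.h change gives a quick indication of whether the change helped.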


tkswe88 commented Feb 14, 2018

Here, I report on a first test with modified cache sizes.

The cache sizes reported from lstopo are
for every core
L1d: 32KB
L1i: 64KB
L2: 512KB
and for each of the four CCXs (4 cores per CCX)
L3: 8MB (i.e. 32 MB in total)

Hence, I performed a test on the current development version.

Assuming L1 and L2 would have to be reported per core and L3 in total, I modified the relevant lines of getarch.c to
#if defined (FORCE_ZEN)
#define FORCE
#define FORCE_INTEL
#define ARCHITECTURE "X86"
#define SUBARCHITECTURE "ZEN"
#define ARCHCONFIG "-DZEN "
"-DL1_CODE_SIZE=65536 -DL1_CODE_LINESIZE=64 -DL1_CODE_ASSOCIATIVE=8 "
"-DL1_DATA_SIZE=32768 -DL1_DATA_LINESIZE=64 -DL2_CODE_ASSOCIATIVE=8 "
"-DL2_SIZE=524288 -DL2_LINESIZE=64 -DL2_ASSOCIATIVE=8 "
"-DL3_SIZE=33554432 -DL3_LINESIZE=64 -DL3_ASSOCIATIVE=8 "
"-DITB_DEFAULT_ENTRIES=64 -DITB_SIZE=4096 "
"-DDTB_DEFAULT_ENTRIES=64 -DDTB_SIZE=4096 "
"-DHAVE_MMX -DHAVE_SSE -DHAVE_SSE2 -DHAVE_SSE3 -DHAVE_SSE4_1 -DHAVE_SSE4_2 "
"-DHAVE_SSE4A -DHAVE_MISALIGNSSE -DHAVE_128BITFPU -DHAVE_FASTMOVU -DHAVE_CFLUSH "
"-DHAVE_AVX -DHAVE_FMA3 -DFMA3"
#define LIBNAME "zen"
#define CORENAME "ZEN"
#endif

I hope these settings are correctly translated from lstopo.
Next, I ran
sudo make clean
sudo make TARGET=ZEN USE_OPENMP=1 BINARY=64 FC=gfortran

After this, config.h reads as
#define OS_LINUX 1
#define ARCH_X86_64 1
#define C_GCC 1
#define 64BIT 1
#define PTHREAD_CREATE_FUNC pthread_create
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define ZEN
#define L1_CODE_SIZE 65536
#define L1_CODE_LINESIZE 64
#define L1_CODE_ASSOCIATIVE 8
#define L1_DATA_SIZE 32768
#define L1_DATA_LINESIZE 64
#define L2_CODE_ASSOCIATIVE 8
#define L2_SIZE 524288
#define L2_LINESIZE 64
#define L2_ASSOCIATIVE 8
#define L3_SIZE 33554432
#define L3_LINESIZE 64
#define L3_ASSOCIATIVE 8
#define ITB_DEFAULT_ENTRIES 64
#define ITB_SIZE 4096
#define DTB_DEFAULT_ENTRIES 64
#define DTB_SIZE 4096
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_SSE4A
#define HAVE_MISALIGNSSE
#define HAVE_128BITFPU
#define HAVE_FASTMOVU
#define HAVE_CFLUSH
#define HAVE_AVX
#define HAVE_FMA3
#define FMA3
#define CORE_ZEN
#define CHAR_CORENAME "ZEN"
#define SLOCAL_BUFFER_SIZE 24576
#define DLOCAL_BUFFER_SIZE 32768
#define CLOCAL_BUFFER_SIZE 12288
#define ZLOCAL_BUFFER_SIZE 8192
#define GEMM_MULTITHREAD_THRESHOLD 4

Eventually, I installed using
sudo make PREFIX=/usr/local install

After this, config.h has changed and reads as
#define OS_LINUX 1
#define ARCH_X86_64 1
#define C_GCC 1
#define 64BIT 1
#define PTHREAD_CREATE_FUNC pthread_create
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define ZEN
#define L1_CODE_SIZE 32768
#define L1_CODE_ASSOCIATIVE 8
#define L1_CODE_LINESIZE 64
#define L1_DATA_SIZE 32768
#define L1_DATA_ASSOCIATIVE 8
#define L1_DATA_LINESIZE 64
#define L2_SIZE 524288
#define L2_ASSOCIATIVE 8
#define L2_LINESIZE 64
#define L3_SIZE 33554432
#define L3_ASSOCIATIVE 10
#define L3_LINESIZE 64
#define ITB_SIZE 4096
#define ITB_ASSOCIATIVE 0
#define ITB_ENTRIES 64
#define DTB_SIZE 4096
#define DTB_ASSOCIATIVE 0
#define DTB_DEFAULT_ENTRIES 64
#define HAVE_CMOV
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_SSE4A
#define HAVE_AVX
#define HAVE_FMA3
#define HAVE_CFLUSH
#define HAVE_MISALIGNSSE
#define HAVE_128BITFPU
#define HAVE_FASTMOVU
#define NUM_SHAREDCACHE 1
#define NUM_CORES 1
#define CORE_ZEN
#define CHAR_CORENAME "ZEN"
#define SLOCAL_BUFFER_SIZE 24576
#define DLOCAL_BUFFER_SIZE 32768
#define CLOCAL_BUFFER_SIZE 12288
#define ZLOCAL_BUFFER_SIZE 8192
#define GEMM_MULTITHREAD_THRESHOLD 4

Note that in particular L1_CODE_SIZE was reset to 32KB. Also some of the L?_ASSOCIATIVE values have changed.

Looking at /usr/local/include/openblas_config.h, which was generated during the installation, I see that it has copied the entries of the config.h generated during compilation (i.e. it has the correct L1_CODE_SIZE of 64KB).

I have not observed any performance improvement from the modification of the cache sizes, but I wonder whether the changes to config.h (L1_CODE_SIZE) during installation may have adverse effects on performance.

I will do more tests using HASWELL targets instead of ZEN in the development version and try to find out about loop unrolling.


martin-frbg commented Feb 14, 2018

I suspect that, as you ran make install without repeating the previous TARGET=ZEN argument, it re-ran getarch, which seems to misdetect at least the L1 code size. As no part of the library got rebuilt and the correct version of openblas_config.h got installed, this should not cause any problems.
Edited to add: there does not appear to be any use of L1_CODE_SIZE in the code anyway; even L1_DATA_LINESIZE appears to be used by a few old targets only. It would seem that the only hardware parameters to get right are DTB_ENTRIES (used in level2 BLAS loop unrolling and POTRF) and L2_DATA_SIZE (used for buffer allocation in driver/others/memory.c). Both seem correct in what you wrote above.
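In other words, the install step wants the same arguments as the build, e.g.:

make TARGET=ZEN USE_OPENMP=1 BINARY=64 FC=gfortran
make TARGET=ZEN USE_OPENMP=1 BINARY=64 FC=gfortran PREFIX=/usr/local install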


tkswe88 commented Feb 14, 2018

@martin-frbg You guessed correctly. Including the TARGET=ZEN argument in the installation led to the right entries in config.h after installation, but this did not improve performance.

@martin-frbg

BTW this was also the case in the original LibGoto: the L2 size was used only to derive the xGEMM_P parameters for Core2, Opteron and earlier, while the L1 and L3 sizes were apparently unused. It seems most of the hardware parameters are detected and reported "just in case" now, but perhaps CPU development has stabilized in the sense that a given type id will no longer have variants that vary in L1 or L2 properties.
(There is one oddity in l2param.h where it uses L1_DATA_LINESIZE to determine an offset, but again your value appears to be correct already.)


brada4 commented Feb 14, 2018

Somewhere deep in Wikipedia it is said that Zen (Epyc/Threadripper) changed the cache from write-through to write-back. That may mean that the effective cache size easily halves.

@martin-frbg

Somehow I doubt that, or at least its relevance for the detected cache sizes. On the other hand, I feel there is a need to find out which of the many parameters in param.h and elsewhere that are faithfully copied for new cpu models are actually used in the current code, and how critical their influence is.
Starting from the fragment of the param.h change I linked to above, it looks to me that SNUMOPT is completely unused, and DNUMOPT has a single appearance in a debug statement where it seems to be part of a theoretical efficiency factor for the syrk function.


tkswe88 commented Feb 16, 2018

I have been looking a bit more at the suspected performance shortfall on the AMD Threadripper 1950X, its causes, and the consequences of thread oversubscription on the TR1950X.

  1. Regarding the previously reported 20% lower performance of the TR1950X compared to the i7-4770 using 1, 2 and 4 threads, I need to correct that this was for the total runtime of my code for one particular example. For dposv, dsyrk and zgbtrf using openblas 0.2.19, 0.2.20 and the 0.3.0 development version, the i7-4770 needs about 30-35% less time than the TR1950X in my example. I stumbled across a document (seemingly from AMD) at
    http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/GDC2017-Optimizing-For-AMD-Ryzen.pdf which strongly recommends avoiding software prefetch on Ryzen platforms, because it would prevent loop unrolling. The test example written in C and presented in the pdf reports a 15% speedup from not prefetching. However, the presented example is compiled with Microsoft Visual Studio, and it is for this setup that software prefetching prevents loop unrolling. Do you think this might be a potential problem in openblas when compiling with gcc, or is there hardcoded loop unrolling in openblas which would not be susceptible to this?
    @brada4: On the TR1950X, I monitored the clock speeds of all active cores in /proc/cpuinfo, and under load they all seem to run at 3.75 GHz with little variation (the base frequency of the TR1950X is 3.4 GHz). Under load the frequencies of the i7-4770 were 3.5 to 3.9 GHz (according to i7z_GUI). So this is not a big difference, it seems.

  2. Using 8-16 threads on the TR1950X, I observed a version-related speedup of 2-5% in parallelised dsyrk and dposv for each of these numbers of threads when switching from v. 0.2.19 to v. 0.2.20 or the development version. For 1 thread, there was no difference. For 4 threads, the improvement was at best 1%. For ZGBTRF and ZGBTRS, I cannot make an educated statement about possible improvements between the versions because of my self-inflicted thread oversubscription (cf. thread safety in openblas 0.2.20 #1425 and below). However, selecting a recent openblas version has advantages unless one oversubscribes the threads (next point).

  3. Regarding the thread oversubscription (calling parallelised ZGBTRF and ZGBTRS from within an OpenMP parallelised loop in my code), I ran my code linked against openblas 0.2.19, 0.2.20 and the development version. There is an interesting behaviour of the total run-time of this loop with respect to the number of threads and the openblas version:

no. threads   runtime 0.2.19 [s]   runtime 0.2.20 [s]   runtime devel [s]
 1                 35                   32.5                 32.5
 4                 11.7                 11.9                 11.8
 8                  8.3                 14.5                 25.3
12                  6.1                 20.3                 32.3
16                  6.1                 24.3                 40.1

These numbers are for compilation of openblas and my code using gcc/gfortran v. 7.2 with optimisation level -O2 (-O3 reduces the run times by about 1 s). So, up to 4 threads the performance is comparable, and the thread oversubscription does not really seem to play a big role. For a larger number of threads, v. 0.2.19 still sees a decent speedup, but for versions 0.2.20 and devel there is a clear deterioration in performance when opting for a higher number of threads. Note that these examples were run after correcting lapack-netlib/SRC/Makefile of versions 0.2.20 and devel and recompiling (cf. #1425). So, I get correct results in all of these tests.

Since there are general improvements in speed when selecting a recent openblas version, I would like to get over the problem with thread oversubscription. However, I would need sequential versions of ZGBTRF and ZGBTRS inside the aforementioned loop and parallelised versions of DSYRK, DPOSV, etc. in other parts of the code. This seems to require compiling sequential and parallel versions of the openblas library, linking to both, and then somehow picking the right (sequential or parallel) versions of the required BLAS and LAPACK routines. Now, if openblas came as a Fortran module, this would be a no-brainer, because one could just use the "use ..., only :: ... => ..." mechanism of Fortran to restrict the import of symbols and to rename them. I have been searching the net for linker-related solutions that provide similar mechanisms, but to no avail. Do you have a document or webpage with a solution to this at hand?

@martin-frbg

Some thoughts -

  1. prefetch should be addressed automatically by gcc for C and Fortran. The Haswell assembly files have a bunch of prefetcht0 instructions that may need looking at.
  2. Need to look again at what these two functions call internally.
  3. There have been both LAPACK version updates and pthread safety fixes between each pair of releases, and develop has new GEMM thread scheduling from 2D thread distribution for multi-threaded GEMMs #1320. In addition, develop has a less efficient AXPY at the moment due to Wrong results for small DTRMV on x86-64 #1332 (which would be easy to revert).
    But offhand there is nothing I would immediately blame the observed and rather serious loss of performance on. (This may warrant another git bisect...)


tkswe88 commented Feb 16, 2018

Fine, I will run git bisect on Monday.
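For reference, a sketch of how such a bisection is usually driven (assuming the release tags are named v0.2.19 and v0.2.20; at every step the library is rebuilt and the OpenMP test case re-run):

git bisect start
git bisect bad v0.2.20      # shows the degraded multi-threaded performance
git bisect good v0.2.19     # shows the good performance
# rebuild (make clean; make TARGET=ZEN USE_OPENMP=1 ...), rerun the test, then mark:
git bisect good             # or: git bisect bad
# repeat until git reports "<commit> is the first bad commit"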

@martin-frbg

Thanks. BTW there is a utility function openblas_set_num_threads() but according to #803 not all parts of the code honor it.
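A minimal sketch of how that function could be used to get sequential ZGBTRF/ZGBTRS inside the user's own OpenMP loop while keeping threaded DSYRK/DPOSV elsewhere (assuming the prototype declared in OpenBLAS's cblas.h; per #803 this is an experiment rather than a guaranteed solution):

/* sketch only: throttle OpenBLAS threading around the user's own OpenMP region */
#include <cblas.h>   /* OpenBLAS declares void openblas_set_num_threads(int) here */

void banded_factorisations(void)
{
    openblas_set_num_threads(1);   /* ZGBTRF/ZGBTRS called below should run single-threaded */
    /* ... the user's "#pragma omp parallel for" loop calling zgbtrf_/zgbtrs_ would go here ... */

    openblas_set_num_threads(16);  /* restore threading for DSYRK, DPOSV, ... elsewhere */
}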

@martin-frbg

ZGBTRF did not change between LAPACK 3.6.1 and 3.8.0, and is mostly IZAMAX + ZSCAL + ZGERU + ZTRSM + ZGEMM. Nothing I'd immediately identify as having undergone any drastic changes since 0.2.19. (Except the post-0.2.20 GEMM thread scheduling mentioned above).

Perhaps a good starting point would be 9e4b697 - a few months after 0.2.19, and shortly before both a big LAPACK update (with risk of collateral damage) and a set of thread safety fixes. If all is well up to that point, it should still display the good performance of 0.2.19.
Next would be something like 99880f7 , with my thread fixes in place and the LAPACK update stabilized, immediately before the first attempt at adding Zen support. (You will need the lapack Makefile fix from here on).
Then fa6a920 - a few weeks before 0.2.20, nothing significant should have changed in between, and 00c42dc, well past 0.2.20 and shortly before the rewrite of the GEMM thread scheduler.


tkswe88 commented Feb 19, 2018

I have finished bisecting between 0.2.19 and 0.2.20 to find the cause of the performance degradation reported in point 3 (see above) when ZGBTRF and ZGBTRS are called from within an OpenMP parallelised loop. The resulting output is:

87c7d10 is the first bad commit
commit 87c7d10
Author: Martin Kroeker martin@ruby.chemie.uni-freiburg.de
Date: Sun Jan 8 23:33:51 2017 +0100

Fix thread data races detected by helgrind 3.12

Ref. #995, may possibly help solve issues seen in 660,883

:040000 040000 9f41e2cd82dc83e84b65d32000d6341cc7e417a8 bcd37e483226009ef28ac179f7268fe419e0b73d M driver

As already reported in point 3 (see above), there was an additional performance degradation between 0.2.20 and the development version. Would you like to have the bisection results on that, too, or shall we see whether fixing the problem between 0.2.19 and 0.2.20 removes the additional degradation?

@martin-frbg

This is bad, as it means we will probably need someone more experienced with thread programming than me to improve this. :-(
(There must be millions of such people, but seeing that my PR went in unchallenged, probably none of them are on this project.) In the worst case, we may be stuck with a decision between fast and safe code.
At least it should be only one of the two files affected by that PR that plays a role here: in an OpenMP build, blas_server_omp.c replaces blas_server.c, so we should have only my changes in memory.c to worry about. As they were driven by helgrind reports I still believe they were substantially correct, but perhaps they can simply be made conditional on ifndef USE_OPENMP.

Given your find, it is entirely possible that the later degradation is from #1299 where I touched the same files again, nine months older but probably no wiser. (A brute force test could be to drop the 0.2.19 memory.c into current develop and see if this restores the original performance - I think it would still compile despite some intervening changes)


tkswe88 commented Feb 20, 2018

I have copied driver/others/memory.c from 0.2.19 to the development version and recompiled successfully. This has restored the good performance observed in 0.2.19.

@martin-frbg

I have now merged a (supposed) fix that uses all the additional locks only in multithreaded builds that do not employ OpenMP. This should restore pre-0.2.20 OpenMP performance without completely reverting the thread safety fixes; hopefully something can be done about their (probable) performance impact on pure pthreads builds in the future. (Though possibly the impact was worst with OpenMP, if the change was adding another layer of locks on top of what OpenMP already imposed.)


tkswe88 commented Feb 20, 2018

The new version does not deliver correct results when compiled with

make TARGET=ZEN USE_OPENMP=1 BINARY=64 COMMON_OPT='-O2 -march=znver1 -mtune=znver1' FC=gfortran
make TARGET=ZEN USE_OPENMP=1 BINARY=64 COMMON_OPT='-O2 -march=znver1 -mtune=znver1' FC=gfortran PREFIX=/usr/local install

and running more than 1 thread.

@martin-frbg

Sorry. Seems I added ifdefs around a few locks that were there unconditionally in 0.2.19. Somehow none of the standard tests was able to flag this on my hardware.


tkswe88 commented Feb 20, 2018

Sorry, I still get wrong results.

@martin-frbg

Hmm, thanks. I had missed one spot near line 1120 that had a blas_unlock from the earlier version still commented out; hopefully this was causing it. Apart from that, there are only cases where I had to move a lock outside an if() block that uses a thread variable in the conditional. I do not see a functional difference, but I can duplicate the offending code block if necessary.


tkswe88 commented Feb 20, 2018

I have recompiled, but still get wrong results except for when only 1 thread is used.

  • Using 2 and 4 threads, I just get wrong results.
  • Using 8 and 12 threads, the computations get so far off that IEEE_DIVIDE_BY_ZERO floating-point exceptions are signalled from my own code.
  • Using 16 threads, I even got an output from somewhere in ZGBTRF or ZGBTRS:
    BLAS : Bad memory unallocation! : 64 0x7f775d046000

Hopefully, the latter message can help to locate the problem.

@martin-frbg

I have reverted the previous commit for now. While that "bad unallocation" message does originate from memory.c, I do not understand how the revised version of my changes could be causing it.

@martin-frbg

Unfortunately I cannot reproduce your problem with any of the tests, nor with the software I normally use.
BLAS-Tester suggests there is currently a problem with TRMV, but this is unrelated to the version of memory.c in use (and may be fallout from attempts to fix #1332)


tkswe88 commented Feb 21, 2018

Do you think there are any further tests I could do, or would you recommend just copying driver/others/memory.c from 0.2.19 to the development version to get higher performance (despite the thread safety issues in this older version)?

@martin-frbg

I am about to upload another PR where absolutely all locking-related changes will be encapsulated in if(n)def USE_OPENMP. If that still does not work for you, some simplified testcase will be needed.


tkswe88 commented Feb 21, 2018

With the updated memory.c, the results are fine, but the run-time performance is as degraded as in v. 0.2.20.

@martin-frbg

Weird. I believe a side-by-side comparison of the old and new memory.c will show that they are now functionally equivalent (with respect to locking) for USE_OPENMP=1. Did you do a full rebuild (starting from make clean)?

@martin-frbg

Running cpp -I../.. -DUSE_OPENMP -DSMP on both the "old" and "new" memory.c definitely leads to functionally equivalent code with just a few shuffled lines. The only other major difference between the two is my addition of cgroup support in get_num_procs (for #1155, see PR #1239 for the actual code change in memory.c); perhaps you could try commenting that one out as well.


brada4 commented Oct 27, 2018

And with NUMA on (aka channel) and 8 threads (OPENBLAS_NUM_THREADS=8 ./testsuite)?
Compiler optimisations make little difference, since the biggest part of the work is done by custom assembly code.

@martin-frbg

Is your compiler getting -march=zen from its default configuration? (Not that I expect much improvement from it, as the bottleneck is probably somewhere in the hand-coded assembly, but perhaps cpu-specific optimizer settings are more important than going from -O2 to -O3?)
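One way to check what the compiler defaults to (standard gcc options; the output format varies between gcc versions):

gcc -Q --help=target | grep -E 'march|mtune'
gcc -march=native -Q --help=target | grep march    # what -march=native resolves to on this machine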


jcolafrancesco commented Oct 27, 2018

My compiler did not get -march automatically.
I've tried with -march=znver1 (zen was not allowed): no significant changes.

Where is testsuite located?

What about SMT (it is currently enabled), is there any recommendation? I will test ASAP.

@martin-frbg

I can only refer to Agner Fog's analysis at https://www.agner.org/optimize/microarchitecture.pdf where (as I understand it) he comes to the conclusion that multithreading is more efficient than it was on earlier AMD and Intel designs (with the caveat that inter-thread communication should be kept within the same 4-cpu die if possible). Unfortunately I cannot identify any obvious bottlenecks in the current (Haswell) kernels even after reading his description of the microarchitectures.

@martin-frbg

And I suspect brada4 was only using "testsuite" as shorthand for your current numpy code. (There are some test codes in the benchmark directory of the distribution and there is xianyi's BLAS-Tester project that is derived from the ATLAS test codes, but I am not aware of anything actually named testsuite)


tkswe88 commented Oct 28, 2018

@jcolafrancesco: Many thanks for the tests!

@martin-frbg and @brada4: Sorry for not being in contact after promising tests. First, one of my RAM modules was broken. After that, I had too much teaching.

With a reservation, I can confirm @jcolafrancesco's finding that changing the SWITCH_RATIO parameter has a pretty marginal effect. I tested this 3 weeks ago or so with a freshly downloaded openblas version and one from spring. Here, the multi-threaded performance was worse in the recently downloaded version and had basically degraded to the level seen before @martin-frbg fixed issue #1468 on February 28.

Unless this description raises immediate suspicions about the degraded performance, I can download the latest development version, recheck and bisect.

Also, AMD published an updated blis/libflame version on their developer webpage a couple of months ago or so. I have not yet had the time to look at how much this has improved threadripper support.


tkswe88 commented Oct 28, 2018

@jcolafrancesco: Regarding SMT, I have not seen any performance benefits from it on the TR1950X. Going beyond 16 threads, there was rather a slight performance degradation of a few per cent. Generally, I recommend that you test setting thread affinity in both UMA and NUMA mode to see what works best for your specific work tasks. From what I have seen, it is beneficial to set thread affinity and avoid having your data move through the Infinity Fabric between the two main parts of the CPU to different DIMM modules. For my typical workloads, it is fastest to run 2 or 4 simulations in parallel on 8 or 4 cores, respectively, and just do that in NUMA mode.
I have never done any tests with SMT switched off, because I assumed that setting the thread affinity would be sufficient to avoid adverse effects.
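For reference, a minimal way to express such pinning with standard OpenMP environment variables (the exact place list depends on how the cores of the two dies are numbered on a given board):

export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=close    # keep the threads on neighbouring cores, i.e. on one die if possible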

@martin-frbg

@tkswe88 memory.c should be basically the same as #1468 from the end of February (where you confirmed that the performance was back to earlier levels). Since then, there has been a very promising rewrite of the code with the intent to use thread-local storage. Unfortunately this new code had unexpected side effects at first, so I decided to make the "old" version the default again after two consecutive releases that were "broken" for some high-profile codes that indirectly rely on OpenBLAS.
This makes the current memory.c twice as big as the original, with an ugly #ifdef to switch between the two implementations at compile time.


brada4 commented Oct 29, 2018

20k * 20k * 8 bytes * 3 matrices makes 9.6 GB for double precision... Should get better with working RAM.


jcolafrancesco commented Oct 30, 2018

Yubeic also posted on AMD's website: https://community.amd.com/thread/228619

He reports the same results, and someone compares them with the theoretical limit he could reach with his Threadripper (canyouseeme, 23 May 2018, 04:35):

2 * 20k^3 / 22.4687 s = 712.1 GFLOPS, 3.85 GHz * 8 ops * 16 cores = 492.8 GFLOPS
...
if it's double precision, your result exceeds the upper limit? I'm wondering if some algorithm like Strassen is used inside.

The same poster obtains 0.36 seconds for dgemm with 4096x4096 matrices, which is totally on par with tkswe88's results (around 0.3 seconds) and mine.

Do you think that canyouseeme's reasoning is correct? In that case it would cast some doubt on yubeic's results. I'm starting to think that we should just stop taking those results into account.
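For what it is worth, the arithmetic behind that sanity check (assuming the usual 2N^3 flop count for GEMM and 8 double-precision FLOPs per core per cycle on Zen, i.e. two 128-bit FMA units):

measured:  2 * 20000^3 / 22.4687 s ≈ 712 GFLOPS
DP peak:   16 cores * 8 FLOPs/cycle * 3.85 GHz ≈ 493 GFLOPS

So a double-precision result above roughly 493 GFLOPS would indeed be implausible on a TR1950X, while a single-precision run (twice the per-cycle throughput) would not be.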

@jcolafrancesco

canyouseeme also reports (22 May 2018, 03:04):

ps. dgemm() run faster when OPENBLAS_CORETYPE=EXCAVATOR than HASWELL on TR 1950x.

I've made some tests with a new C program directly using the cblas API:

Target=EXCAVATOR : 
./test_cblas 1000
took 0.026141 seconds 
./test_cblas 2000
took 0.095920 seconds 
./test_cblas 4000
took 0.625256 seconds 
./test_cblas 8000
took 4.839866 seconds 
./test_cblas 10000
took 9.572862 seconds 
./test_cblas 20000
took 74.236080 seconds 

Target=ZEN : 
./test_cblas 1000
took 0.021527 seconds 
./test_cblas 2000
took 0.090566 seconds 
./test_cblas 4000
took 0.613909 seconds 
./test_cblas 8000
took 5.138182 seconds 
./test_cblas 10000
took 9.232099 seconds 
./test_cblas 20000
took 73.998256 seconds 

I'm not observing any significant change.


tkswe88 commented Oct 31, 2018

@martin-frbg What specific preprocessor flag is required during compilation of memory.c to restore the optimal performance?

@martin-frbg

@tkswe88 there should be no specific flag necessary. The active code in memory.c should be in the same state as on February 28, except that the file is now approximately twice as big, as it also contains a reimplementation that uses thread-local storage (which is available by setting USE_TLS=1).


brada4 commented Oct 31, 2018

USE_TLS=1 ... which was accidentally enabled in Makefile.rule briefly but long enough to land in a release, and should be commented out if found.
Also, for the sake of experimentation, you may restore the older memory.c; the outside interfaces have not changed since.


inoryy commented Dec 29, 2018

What's the status of this ticket, and is there a way to help it move forward?

@martin-frbg

Needs testers, especially for Threadripper, and generally someone who knows how to optimize code for Zen. I just bought a Ryzen 2700 and hope to be able to do at least some quick and simple tests in the next few days. (Top of my list is to see how the recent improvements for Haswell map to Ryzen, e.g. optimal settings of SWITCH_RATIO and GEMM_PREFERED_SIZE in param.h.)


inoryy commented Dec 29, 2018

I don't have much experience optimizing code, but I do have access to a TR1950x and can regularly run tests on it. Could you describe in detail what kind of tests you would need?

@martin-frbg

I am not that sure myself, but for a start it would probably help to run the dgemm and sgemm benchmarks to see if increasing the SWITCH_RATIO for ZEN in param.h from the default of 4 has a similar impact on performance as it did for HASWELL, where it is now 32. My very first tests suggest that 16 may be optimal for the 8-core Ryzen, but 32 did not appear to be much worse. A second task would be to see if fenrus75's recent improvements in the Haswell xGEMM_BETA and xGEMM_ONCOPY routines (see the entries referencing skylake files in kernel/x86_64/KERNEL.HASWELL) are applicable to the Zen architecture despite its allegedly inferior AVX2 hardware.
Going beyond that will probably require some in-depth experience with coding to processor specs, possibly trying things like rearranging or replacing avx2 instructions. Some pointers are in
https://www.agner.org/optimize/blog/read.php?i=838 but this is a bit outside my expertise.
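As a concrete illustration of the first suggestion, the experiment amounts to editing the ZEN section of param.h roughly like this (a sketch, not the literal upstream block) and re-running the dgemm/sgemm benchmarks after a rebuild:

/* param.h, inside the ZEN section -- value to experiment with */
#ifdef ZEN
#undef  SWITCH_RATIO
#define SWITCH_RATIO 16   /* current default is 4; try 16 and 32 as on HASWELL */
#endif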


tkswe88 commented Jan 19, 2019

There is a new best practice guide from the PRACE project for AMD EPYC architectures at http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-AMD.pdf
This might be helpful for optimising code.


brada4 commented Jan 20, 2019

That's more about configuring a node in a compute cluster. AMD has not made 3DNow! processors for quite some time now...


tkswe88 commented Mar 1, 2019

@martin-frbg and @brada4 I see adverse performance effects on the TR1950X when using the memory.c file as provided in the current development version (0.3.6) and running on more than 8 cores. It does not seem to matter whether I define USE_TLS or not. I have also tried commenting out lines 1045-1047 in Makefile.system. When replacing memory.c with the one from #1468, I get good performance again.
Could you give a hint on what compile-time settings are required to have the preprocessor construct from the current memory.c a version that is identical to that of #1468?
I am at a loss with this.

@martin-frbg

Can you tell if 0.3.5 worked for you? (Some earlier changes that might have affected locking were inadvertently removed in 0.3.5 and put back in develop.) From your reference to #1468 I assume you are compiling with OpenMP, so
gcc -DNO_WARMUP -DUSE_OPENMP -DOS_LINUX -DSMP -DSMP_SERVER -E -I../.. memory.c >memory_processed.c
should reflect what the compiler sees; this should be very similar to what was in #1468.

@martin-frbg

Actually #1468 was among the changes that were accidentally dropped in the creation of the "Frankenstein" version of memory.c that combined both the TLS and non-TLS versions. This is supposed to have been rectified by #2004 three weeks ago, but there may be an interaction with intervening changes from #1785 and #1814.

@martin-frbg

gcc -I../.. -DOS_LINUX -DUSE_THREAD=1 -DUSE_OPENMP=1 -E -P memory.c will probably give the most concise preprocessing output.
And from this the crucial difference appears to be in the locking around the loop over memory[position].used (around line 1070 of the old memory.c, or line 2700 of the current version). #1468 had the LOCK_COMMAND(&alloc_lock) around individual calls inside the do loop, and made it conditional on !defined(USE_OPENMP). The change from PR #1785, being based on a version that inadvertently missed the changes from #1468, not only moved the lock outside the loop in response to the observation of too many lock/unlock system calls but already lacked the conditionals. I seem to have missed this in #2004.
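To make the difference concrete, here is a small self-contained mock of the two locking patterns (a pthread mutex stands in for memory.c's alloc_lock, and the memory[] array is a stand-in as well; this is not the upstream code):

#include <pthread.h>
#include <stdio.h>

#define NUM_BUFFERS 8
static struct { int used; } memory[NUM_BUFFERS];
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* #1468-style: lock each test-and-set individually, and skip the lock entirely
   in an OpenMP build (use_openmp != 0). */
static int find_free_fine_grained(int use_openmp)
{
    for (int position = 0; position < NUM_BUFFERS; position++) {
        if (!use_openmp) pthread_mutex_lock(&alloc_lock);
        int got_it = !memory[position].used;
        if (got_it) memory[position].used = 1;
        if (!use_openmp) pthread_mutex_unlock(&alloc_lock);
        if (got_it) return position;
    }
    return -1;
}

/* #1785-style: one lock held across the whole scan, with no OpenMP conditional. */
static int find_free_coarse(void)
{
    int result = -1;
    pthread_mutex_lock(&alloc_lock);
    for (int position = 0; position < NUM_BUFFERS; position++)
        if (!memory[position].used) { memory[position].used = 1; result = position; break; }
    pthread_mutex_unlock(&alloc_lock);
    return result;
}

int main(void)
{
    printf("%d %d\n", find_free_fine_grained(1), find_free_coarse());
    return 0;
}

In an OpenMP build the first variant takes no lock at all on this path, which is what #1468 relied on to restore the pre-0.2.20 performance.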


tkswe88 commented Mar 4, 2019

@martin-frbg I followed #2040 to https://raw.githubusercontent.com/xianyi/OpenBLAS/af480b02a4a45df377acf9be0d6078609bb345c2/driver/others/memory.c, copied this version of memory.c into a snapshot of the openblas development version from Friday and recompiled. This has restored the performance to where I would like it to be. Many thanks for your help and effort!

About a year ago or so, I presented a first comparison of openblas and blis/flame performance. Since then blis and libflame have caught up quite a bit and seem to have become thread-safe. In my standard code, I can use either openblas or blis/flame with OpenMP parallelisation. openblas tends to be around 1-4% faster on small and moderately sized workloads. However, when pure DGEMM performance is required and the matrices become rather big, openblas seems to be the clear winner at the moment. See below for a comparison with 1 and 16 threads. The results were obtained with the openblas development and blis versions downloaded last Friday, compiled with gcc 9.0 on Ubuntu Linux 18.04 (4.15 kernel).
[figure: dgemm_amd_tr1950x – DGEMM performance on the TR1950X, OpenBLAS vs. BLIS, 1 and 16 threads]

@martin-frbg

Thank you very much for the confirmation, merging that PR now.

TiborGY added a commit to TiborGY/OpenBLAS that referenced this issue Jul 7, 2019
* With the Intel compiler on Linux, prefer ifort for the final link step 

icc has known problems with mixed-language builds that ifort can handle just fine. Fixes OpenMathLib#1956

* Rename operands to put lda on the input/output constraint list

* Fix wrong constraints in inline assembly

for OpenMathLib#2009

* Fix inline assembly constraints

rework indices to allow marking argument lda4 as input and output. For OpenMathLib#2009

* Fix inline assembly constraints

rework indices to allow marking argument lda as input and output.

* Fix inline assembly constraints

* Fix inline assembly constraints

* Fix inline assembly constraints in Bulldozer TRSM kernels

rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For OpenMathLib#2009

* Correct range_n limiting

same bug as seen in OpenMathLib#1388, somehow missed in corresponding PR OpenMathLib#1389

* Allow multithreading TRMV again

revert workaround introduced for issue OpenMathLib#1332 as the actual cause appears to be my incorrect fix from OpenMathLib#1262 (see OpenMathLib#1388)

* Fix error introduced during cleanup

* Reduce list of kernels in the dynamic arch build

to make compilation complete reliably within the 1h limit again

* init

* move fix to right place

* Fix missing -c option in AVX512 test

* Fix AVX512 test always returning false due to missing compiler option

* Make x86_32 imply NO_AVX2, NO_AVX512 in addition to NO_AVX

fixes OpenMathLib#2033

* Keep xcode8.3 for osx BINARY=32 build

as xcode10 deprecated i386

* Make sure that AVX512 is disabled in 32bit builds

for OpenMathLib#2033

* Improve handling of NO_STATIC and NO_SHARED

to avoid surprises from defining either as zero. Fixes OpenMathLib#2035 by addressing some concerns from OpenMathLib#1422

* init

* address warning introed with OpenMathLib#1814 et al

* Restore locking optimizations for OpenMP case

restore another accidentally dropped part of OpenMathLib#1468 that was missed in OpenMathLib#2004 to address performance regression reported in OpenMathLib#1461

* HiSilicon tsv110 CPUs optimization branch

add HiSilicon tsv110 CPUs  optimization branch

* add TARGET support for  HiSilicon tsv110 CPUs

* add TARGET support for HiSilicon tsv110 CPUs

* add TARGET support for HiSilicon tsv110 CPUs

* Fix module definition conflicts between LAPACK and ReLAPACK

for OpenMathLib#2043

* Do not compile in AVX512 check if AVX support is disabled

the xgetbv function depends on NO_AVX being undefined - we could change that too, but that combo is unlikely to work anyway

* ctest.c : add __POWERPC__ for PowerMac

* Fix crash in sgemm SSE/nano kernel on x86_64

Fix bug OpenMathLib#2047.

Signed-off-by: Celelibi <celelibi@gmail.com>

* param.h : enable defines for PPC970 on DarwinOS

fixes:
gemm.c: In function 'sgemm_':
../common_param.h:981:18: error: 'SGEMM_DEFAULT_P' undeclared (first use in this function)
 #define SGEMM_P  SGEMM_DEFAULT_P
                  ^

* common_power.h: force DCBT_ARG 0 on PPC970 Darwin

without this, we see
../kernel/power/gemv_n.S:427:Parameter syntax error
and many more similar entries

that relates to this assembly command
dcbt 8, r24, r18

this change makes the DCBT_ARG = 0
and openblas builds through to completion on PowerMac 970
Tests pass

* Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1

for issue OpenMathLib#2048

* make DYNAMIC_ARCH=1 package work on TSV110.

* make DYNAMIC_ARCH=1 package work on TSV110

* Add Intel Denverton

for OpenMathLib#2048

* Add Intel Denverton

* Change 64-bit detection as explained in OpenMathLib#2056

* Trivial typo fix

as suggested in OpenMathLib#2022

* Disable the AVX512 DGEMM kernel (again)

Due to as yet unresolved errors seen in OpenMathLib#1955 and OpenMathLib#2029

* Use POSIX getenv on Cygwin

The Windows-native GetEnvironmentVariable cannot be relied on, as
Cygwin does not always copy environment variables set through Cygwin
to the Windows environment block, particularly after fork().

* Fix for OpenMathLib#2063: The DllMain used in Cygwin did not run the thread memory
pool cleanup upon THREAD_DETACH which is needed when compiled with
USE_TLS=1.

* Also call CloseHandle on each thread, as well as on the event so as to not leak thread handles.

* AIX asm syntax changes needed for shared object creation

* power9 makefile. dgemm based on power8 kernel with following changes : 32x unrolled 16x4 kernel and 8x4 kernel using (lxv stxv butterfly rank1 update). improvement from 17 to 22-23gflops. dtrmm cases were added into dgemm itself

* Expose CBLAS interfaces for I?MIN and I?MAX

* Build CBLAS interfaces for I?MIN and I?MAX

* Add declarations for ?sum and cblas_?sum

* Add interface for ?sum (derived from ?asum)

* Add ?sum

* Add implementations of ssum/dsum and csum/zsum

as trivial copies of asum/zsasum with the fabs calls replaced by fmov to preserve code structure

* Add ARM implementations of ?sum

(trivial copies of the respective ?asum with the fabs calls removed)

* Add ARM64 implementations of ?sum

as trivial copies of the respective ?asum kernels with the fabs calls removed

* Add ia64 implementation of ?sum

as trivial copy of asum with the fabs calls removed

* Add MIPS implementation of ?sum

as trivial copy of ?asum with the fabs calls removed

* Add MIPS64 implementation of ?sum

as trivial copy of ?asum with the fabs replaced by mov to preserve code structure

* Add POWER implementation of ?sum

as trivial copy of ?asum with the fabs replaced by fmr to preserve code structure

* Add SPARC implementation of ?sum

as trivial copy of ?asum with the fabs replaced by fmov to preserve code structure

* Add x86 implementation of ?sum

as trivial copy of ?asum with the fabs calls removed

* Add x86_64 implementation of ?sum

as trivial copy of ?asum with the fabs calls removed

* Add ZARCH implementation of ?sum

as trivial copies of the respective ?asum kernels with the ABS and vflpsb calls removed

* Detect 32bit environment on 64bit ARM hardware

for OpenMathLib#2056, using same approach as OpenMathLib#2058

* Add cmake defaults for ?sum kernels

* Add ?sum

* Add ?sum definitions for generic kernel

* Add declarations for ?sum

* Add -lm and disable EXPRECISION support on *BSD

fixes OpenMathLib#2075

* Add in runtime CPU detection for POWER.

* snprintf define consolidated to common.h

* Support INTERFACE64=1

* Add support for INTERFACE64 and fix XERBLA calls

1. Replaced all instances of "int" with "blasint"
2. Added string length as "hidden" third parameter in calls to fortran XERBLA

* Correct length of name string in xerbla call

* Avoid out-of-bounds accesses in LAPACK EIG tests

see Reference-LAPACK/lapack#333

* Correct INFO=4 condition

* Disable reallocation of work array in xSYTRF

as it appears to cause memory management problems (seen in the LAPACK tests)

* Disable repeated recursion on Ab_BR in ReLAPACK xGBTRF

due to crashes in LAPACK tests

* sgemm/strmm

* Update Changelog with changes from 0.3.6

* Increment version to 0.3.7.dev

* Increment version to 0.3.7.dev

* Misc. typo fixes

Found via `codespell -q 3 -w -L ith,als,dum,nd,amin,nto,wis,ba -S ./relapack,./kernel,./lapack-netlib`

* Correct argument of CPU_ISSET for glibc <2.5

fixes OpenMathLib#2104

* conflict resolve

* Revert reference/ fixes

* Revert Changelog.txt typos

* Disable the SkyLakeX DGEMMITCOPY kernel as well

as a stopgap measure for numpy/numpy#13401 as mentioned in OpenMathLib#1955

* Disable DGEMMINCOPY as well for now

OpenMathLib#1955

* init

* Fix errors in cpu enumeration with glibc 2.6

for OpenMathLib#2114

* Change two http links to https

Closes OpenMathLib#2109

* remove redundant code OpenMathLib#2113

* Set up CI with Azure Pipelines

[skip ci]

* TST: add native POWER8 to CI

* add native POWER8 testing to
Travis CI matrix with ppc64le
os entry

* Update link to IBM MASS library, update cpu support status

* first try migrating one of the arm builds from travis

* fix tabbing in azure commands

* Update azure-pipelines.yml

take out offending lines (although stolen from https://github.com/conda-forge/opencv-feedstock azure-pipelines fiie)

* Update azure-pipelines.yml

* Update azure-pipelines.yml

* Update azure-pipelines.yml

* Update azure-pipelines.yml

* DOC: Add Azure CI status badge

* Add ARMV6 build to azure CI setup (OpenMathLib#2122)

using aytekinar's Alpine image and docker script from the Travis setup

[skip ci]

* TST: Azure manylinux1 & clean-up

* remove some of the steps & comments
from the original Azure yml template

* modify the trigger section to use
develop since OpenBLAS primarily uses
this branch; use the same batching
behavior as downstream projects NumPy/
SciPy

* remove Travis emulated ARMv6 gcc build
because this now happens in Azure

* use documented Ubuntu vmImage name for Azure
and add in a manylinux1 test run to the matrix

[skip appveyor]

* Add NO_AFFINITY to available options on Linux, and set it to ON

to match the gmake default. Fixes second part of OpenMathLib#2114

* Replace ISMIN and ISAMIN kernels on all x86_64 platforms (OpenMathLib#2125)

* Mark iamax_sse.S as unsuitable for MIN due to issue OpenMathLib#2116
* Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for OpenMathLib#2116

* Move ARMv8 gcc build from Travis to Azure

* Move ARMv8 gcc build from Travis to Azure

* Update .travis.yml

* Test drone CI

* install make

* remove sudo

* Install gcc

* Install perl

* Install gfortran and add a clang job

* gfortran->gcc-gfortran

* Switch to ubuntu and parallel jobs

* apt update

* Fix typo

* update yes

* no need of gcc in clang build

* Add a cmake build as well

* Add cmake builds and print options

* build without lapack on cmake

* parallel build

* See if ubuntu 19.04 fixes the ICE

* Remove qemu armv8 builds

* arm32 build

* Fix typo

* TST: add SkylakeX AVX512 CI test

* adapt the C-level reproducer code for some
recent SkylakeX AVX512 kernel issues, provided
by Isuru Fernando and modified by Martin Kroeker,
for usage in the utest suite

* add an Intel SDE SkylakeX emulation utest run to
the Azure CI matrix; a custom Docker build was required
because Ubuntu image provided by Azure does not support
AVX512VL instructions

* Add option USE_LOCKING for single-threaded build with locking support

for calling from concurrent threads

* Add option USE_LOCKING for single-threaded build with locking support

* Add option USE_LOCKING for SMP-like locking in USE_THREAD=0 builds

* Add option USE_LOCKING but keep default settings intact

* Remove unrelated change

* Do not try ancient PGI hacks with recent versions of that compiler

should fix OpenMathLib#2139

* Build and run utests in any case, they do their own checks for fortran availability

* Add softfp support in min/max kernels

fix for OpenMathLib#1912

* Revert "Add softfp support in min/max kernels"

* Separate implementations of AMAX and IAMAX on arm

As noted in OpenMathLib#1912 and comment on OpenMathLib#1942, the combined implementation happens to "do the right thing" on hardfp, but cannot return both value and index on softfp where they would have to share the return register

* Ensure correct output for DAMAX with softfp

* Use generic kernels for complex (I)AMAX to support softfp

* improved zgemm power9 based on power8

* upload thread safety test folder

* hook up c++ thread safety test (main Makefile)

*  add c++ thread test option to Makefile.rule

* Document NO_AVX512 

for OpenMathLib#2151

* sgemm pipeline improved, zgemm rewritten without inner packs, ABI lxvx v20 fixed with vs52

* Fix detection of AVX512 capable compilers in getarch

21eda8b introduced a check in getarch.c to test if the compiler is capable of
AVX512. This check currently fails, since the used __AVX2__ macro is only
defined if getarch itself was compiled with AVX2/AVX512 support. Make sure this
is the case by building getarch with -march=native on x86_64. It is only
supposed to run on the build host anyway.

* c_check: Unlink correct file

* power9 zgemm ztrmm optimized

* conflict resolve

* Add gfortran workaround for ABI violations in LAPACKE

for OpenMathLib#2154 (see gcc bug 90329)

* Add gfortran workaround for ABI violations

for OpenMathLib#2154 (see gcc bug 90329)

* Add gfortran workaround for potential ABI violation 

for OpenMathLib#2154

* Update fc.cmake

* Remove any inadvertent use of -march=native from DYNAMIC_ARCH builds

from OpenMathLib#2143, -march=native precludes use of more specific options like -march=skylake-avx512 in individual kernels, and defeats the purpose of dynamic arch anyway.

* Avoid unintentional activation of TLS code via USE_TLS=0

fixes OpenMathLib#2149

* Do not force gcc options on non-gcc compilers

fixes compile failure with pgi 18.10 as reported on OpenBLAS-users

* Update Makefile.x86_64

* Zero ecx with a mov instruction

PGI assembler does not like the initialization in the constraints.

* Fix mov syntax

* new sgemm 8x16

* Update dtrmm_kernel_16x4_power8.S

* PGI compiler does not like -march=native

* Fix build on FreeBSD/powerpc64.

Signed-off-by: Piotr Kubaj <pkubaj@anongoth.pl>

* Fix build for PPC970 on FreeBSD pt. 1

FreeBSD needs DCBT_ARG=0 as well.

* Fix build for PPC970 on FreeBSD pt.2

FreeBSD needs those macros too.

* cgemm/ctrmm power9

* Utest needs CBLAS but not necessarily FORTRAN

* Add mingw builds to Appveyor config

* Add getarch flags to disable AVX on x86

(and other small fixes to match Makefile behaviour)

* Make disabling DYNAMIC_ARCH on unsupported systems work

needs to be unset in the cache for the change to have any effect

* Mingw32 needs leading underscore on object names

(also copy BUNDERSCORE settings for FORTRAN from the corresponding Makefile)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment