Add ZEN support #1133

Merged
merged 1 commit into from Mar 24, 2017

3 participants
@steckdenis
Contributor

steckdenis commented Mar 19, 2017

This patch adds the following features

  • ZEN target for static builds
  • Zen/Ryzen auto-detection (CPUID base family Fh plus extended family 8, i.e. family 17h) so that "make" compiles for Zen
  • Dynamic architecture support for Zen
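The family arithmetic behind the auto-detection can be sketched as follows. This is only an illustration of the standard CPUID leaf 1 decoding, not the code from this patch: the function name `decode_family` is hypothetical, and the sample signature `0x00800F11` is assumed to be what a Ryzen 7 1700 reports.

```python
def decode_family(eax: int) -> dict:
    """Decode the family/model fields of CPUID leaf 1 (EAX register)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # AMD convention: the extended fields only count when the base family is 0xF.
    if base_family == 0xF:
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model

    return {
        "base_family": base_family,
        "ext_family": ext_family,
        "family": family,
        "model": model,
        "stepping": stepping,
    }


# Assumed Ryzen 7 1700 signature (not taken from this thread).
info = decode_family(0x00800F11)
print(f"family {info['family']:X}h, model {info['model']:X}h, stepping {info['stepping']}")
```

With that assumed signature, the base family is Fh and the extended family is 8, which add up to family 17h.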

The ZEN target currently borrows heavily from existing targets (Excavator param.h tuning, Haswell kernels). I tried to tune OpenBLAS for Zen but started to get incorrect-result notifications, so this patch does not do any Zen-specific tuning.

If you are interested, here is what I have observed by trying to optimize several parameters from param.h using blind brute-force:

  • SNUMOPT=8 and DNUMOPT=4 seem to work best (I tested 8/8, 16/8 and 16/16)
  • GEMM_DEFAULT_ALIGN=0x1fffUL is a couple percent faster than any other alignment value.
  • SYMV_P=8 is faster than 4 or 16
  • SWITCH_RATIO=4 works well (any other value decreases performance)
  • The sgemm and zgemm kernels like having their N and M parameters set to 4 (for both kernels). zlinpack goes from ~15 GFLOPS to ~22 GFLOPS with that change, but I start to get incorrect results. I have seen that KERNEL.ZEN should be changed when N and M are changed, but I don't know which combinations are valid.
  • GEMM_DEFAULT_OFFSET_A=256 and GEMM_DEFAULT_OFFSET_B=1024 was the fastest combination for the zlinpack.goto benchmark
  • Regarding benchmarks, zlinpack is quite sensitive to the parameters and works best with 8 threads. slinpack really wants only one thread to be used and seems memory-bound: I can do whatever I want with the parameters, performance doesn't change in any meaningful way.

Please note that I have tested all the above values without really knowing what they mean. Some of them may not make any sense.
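For readers wondering what a value such as GEMM_DEFAULT_ALIGN=0x1fffUL means: a minimal sketch, assuming the constant is used as a bit mask in the usual round-up idiom `(addr + mask) & ~mask`. Under that assumption 0x1fff corresponds to 8192-byte alignment. The helper name `align_up` is hypothetical and is not an OpenBLAS function.

```python
GEMM_DEFAULT_ALIGN = 0x1FFF  # value reported as fastest in the list above


def align_up(address: int, mask: int = GEMM_DEFAULT_ALIGN) -> int:
    """Round address up to the next multiple of (mask + 1)."""
    # The idiom only works when mask + 1 is a power of two.
    if (mask + 1) & mask:
        raise ValueError("mask + 1 must be a power of two")
    return (address + mask) & ~mask


print(GEMM_DEFAULT_ALIGN + 1)    # alignment in bytes implied by the mask
print(hex(align_up(0x12345)))    # an arbitrary address rounded up
```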

As a remaining problem, OpenBLAS detects 16 cores while my Ryzen CPU has 8 cores and 16 threads. Manually forcing OMP_NUM_THREADS to 8 leads to quite a nice performance boost as the threads stop competing for cache and memory accesses.

If you want SSH access to a Ryzen 1700 machine (that has a public IP address), we can arrange that.

@brada4

brada4 Mar 20, 2017

Contributor

In the second-to-last paragraph: is it slower with both threads of a core active, or is there just no gain over a single thread?
On Linux it would be something like

OPENBLAS_NUM_THREADS=1 /usr/bin/time taskset 0x1 (random benchmark)
OPENBLAS_NUM_THREADS=2 /usr/bin/time taskset 0x3 (same benchmark)
@steckdenis

steckdenis Mar 20, 2017

Contributor

Performance depends on the number of threads in quite a complex way. Here are detailed timing results for zlinpack (last row, "200"):

OPENBLAS_NUM_THREADS=1 taskset 0x1 ./zlinpack.goto => 10.4 GFLOPS
OPENBLAS_NUM_THREADS=2 taskset 0x5 ./zlinpack.goto => 11.7 GFLOPS
OPENBLAS_NUM_THREADS=3 taskset 0x15 ./zlinpack.goto => 15.2 GFLOPS
OPENBLAS_NUM_THREADS=4 taskset 0x55 ./zlinpack.goto => 18.8 GFLOPS
OPENBLAS_NUM_THREADS=5 taskset 0x155 ./zlinpack.goto => 19.1 GFLOPS*
OPENBLAS_NUM_THREADS=6 taskset 0x555 ./zlinpack.goto => 19.4 GFLOPS
OPENBLAS_NUM_THREADS=7 taskset 0x1555 ./zlinpack.goto => 18.8 GFLOPS
OPENBLAS_NUM_THREADS=8 taskset 0x5555 ./zlinpack.goto => 18.0 GFLOPS (with high variance)
OPENBLAS_NUM_THREADS=9 taskset 0x5557 ./zlinpack.goto => 12.6 GFLOPS
OPENBLAS_NUM_THREADS=9 ./zlinpack.goto => 12.3 GFLOPS
OPENBLAS_NUM_THREADS=10 taskset 0x555F ./zlinpack.goto => 12.9 GFLOPS*
OPENBLAS_NUM_THREADS=11 taskset 0x557F ./zlinpack.goto => 12.7 GFLOPS
OPENBLAS_NUM_THREADS=16 taskset 0xFFFF ./zlinpack.goto => 10.6 GFLOPS (with high variance)

* Runs marked with an asterisk show 0.004 GFLOPS for sizes 20 to 150 (the exact sizes affected vary from run to run)! About 50% of the CPU time is spent in inner_advanced_thread when this performance hit occurs. The bad performance happens in the "Decompose" phase of the benchmark.

We see that going above 8 threads starts to use SMT siblings and more or less kills performance. Odd thread counts also seem to exhibit strange behavior.
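The taskset masks used in the listing above follow a simple pattern: one logical CPU per core first, then the SMT siblings. A minimal sketch of how such masks can be generated, assuming logical CPUs 2k and 2k+1 are the two siblings of core k (which matches the lstopo output posted later in this thread). The function name `spread_mask` is hypothetical.

```python
def spread_mask(n_threads: int, n_cores: int = 8, smt: int = 2) -> int:
    """Return a taskset-style CPU mask that fills distinct cores first."""
    if not 0 < n_threads <= n_cores * smt:
        raise ValueError("n_threads must be between 1 and n_cores * smt")

    # Visit sibling 0 of every core, then sibling 1 of every core, and so on.
    order = [core * smt + sibling
             for sibling in range(smt)
             for core in range(n_cores)]

    mask = 0
    for pu in order[:n_threads]:
        mask |= 1 << pu
    return mask


for n in (1, 2, 4, 8, 9, 10, 11, 16):
    print(f"OPENBLAS_NUM_THREADS={n} taskset {spread_mask(n):#x}")
```

For 1 to 8 threads this reproduces the 0x1, 0x5, ... 0x5555 masks, and from 9 threads on it starts doubling up on cores, as in 0x5557.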

Here are the results for slinpack:

OPENBLAS_NUM_THREADS=1 taskset 0x1 ./slinpack.goto => 13.5 GFLOPS
OPENBLAS_NUM_THREADS=2 taskset 0x5 ./slinpack.goto => 13.7 GFLOPS
OPENBLAS_NUM_THREADS=3 taskset 0x15 ./slinpack.goto => 15.1 GFLOPS
OPENBLAS_NUM_THREADS=4 taskset 0x55 ./slinpack.goto => 16.3 GFLOPS
OPENBLAS_NUM_THREADS=5 taskset 0x155 ./slinpack.goto => 13.2 GFLOPS*
OPENBLAS_NUM_THREADS=6 taskset 0x555 ./slinpack.goto => 12.0 GFLOPS
OPENBLAS_NUM_THREADS=7 taskset 0x1555 ./slinpack.goto => 11.4 GFLOPS
OPENBLAS_NUM_THREADS=8 taskset 0x5555 ./slinpack.goto => 10.5 GFLOPS
OPENBLAS_NUM_THREADS=9 taskset 0x5557 ./slinpack.goto => 7.7 GFLOPS
OPENBLAS_NUM_THREADS=10 ./slinpack.goto => 6.7 GFLOPS
OPENBLAS_NUM_THREADS=16 ./slinpack.goto => 5.6 GFLOPS

Tests run on an AMD Ryzen 7 1700 at stock clock speeds (3.0 GHz base, 3.2 GHz all-core boost, I cannot see on Linux whether boost was enabled), on an MSI B350 Tomahawk board with 2x 4GB DDR4 2600 MHz RAM.

@brada4

brada4 Mar 20, 2017

Contributor

Something like this: the first hyperthread alone vs. both hyperthreads of one core, running for a second or ten, i.e. to see whether there is a gain or a loss from concurrent use of the same core (as you can see, on my Ivy Bridge laptop i3 the result is not a regression):

>OPENBLAS_NUM_THREADS=1 taskset 0x1 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   6.198883e-03    35147.91 MFlops    4225.47 MFlops   34994.34 MFlops
>OPENBLAS_NUM_THREADS=1 taskset 0x2 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   4.708052e-03    35225.97 MFlops    4372.92 MFlops   35077.56 MFlops
>OPENBLAS_NUM_THREADS=2 taskset 0x3 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   2.069950e-03    40787.35 MFlops    4014.45 MFlops   40564.54 MFlops
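To compare runs like these without copying numbers by hand, the result rows can be parsed mechanically. A minimal sketch, assuming the row layout shown above (size, residual, then three "MFlops" columns). The function name `parse_linpack_row` is hypothetical and is not part of the benchmark.

```python
import re

# One result row: "SIZE : residual  decompose MFlops  solve MFlops  total MFlops"
ROW = re.compile(
    r"^\s*(\d+)\s*:\s*(\S+)\s+(\S+) MFlops\s+(\S+) MFlops\s+(\S+) MFlops"
)


def parse_linpack_row(line: str):
    """Return the fields of a result row as a dict, or None for other lines."""
    match = ROW.match(line)
    if match is None:
        return None
    size, residual, decompose, solve, total = match.groups()
    return {
        "size": int(size),
        "residual": float(residual),
        "decompose_mflops": float(decompose),
        "solve_mflops": float(solve),
        "total_mflops": float(total),
    }


row = parse_linpack_row(
    "   5000 :   2.069950e-03    40787.35 MFlops    4014.45 MFlops   40564.54 MFlops"
)
print(row["size"], row["total_mflops"])
```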
@steckdenis

steckdenis Mar 20, 2017

Contributor

Ok, I understand now (I purposefully avoided putting threads on the same SMT cores):

> OPENBLAS_NUM_THREADS=1 taskset 0x1 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   4.219890e-03    33108.05 MFlops    6485.93 MFlops   33026.76 MFlops
> OPENBLAS_NUM_THREADS=1 taskset 0x2 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   4.902124e-03    32844.99 MFlops    6450.78 MFlops   32764.61 MFlops
> OPENBLAS_NUM_THREADS=2 taskset 0x3 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   1.134348e-02    25569.05 MFlops    3279.55 MFlops   25465.27 MFlops

By the way, I'm very impressed by your laptop CPU. Fortunately, I've tested with 8 threads on my Ryzen and I get 195343.00 MFlops, so scaling seems to work well. With all 16 threads used, I get 157717.47 MFlops.
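The figures above can be turned into speedup and per-thread efficiency relative to the single-thread run. A small sketch using only the totals quoted in this comment; the helper names are hypothetical.

```python
BASELINE = 33026.76  # MFlops, 1 thread pinned with taskset 0x1

# label -> (thread count, total MFlops), all taken from the comment above
RUNS = {
    "2 threads on one core (0x3)": (2, 25465.27),
    "8 threads": (8, 195343.00),
    "16 threads": (16, 157717.47),
}


def speedup(total: float, baseline: float = BASELINE) -> float:
    """Throughput relative to the single-thread run."""
    return total / baseline


def efficiency(total: float, n_threads: int, baseline: float = BASELINE) -> float:
    """Speedup divided by the number of threads used."""
    return speedup(total, baseline) / n_threads


for label, (n, total) in RUNS.items():
    print(f"{label}: speedup {speedup(total):.2f}x, "
          f"efficiency {efficiency(total, n):.0%}")
```

Two threads sharing one core come out below 1x, which is the loss discussed here, while 8 threads reach a little under 6x.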

@brada4

brada4 Mar 20, 2017

Contributor

Indeed, your results back your point that 2 threads per core is a loss.
Can you run something like lstopo, i.e. to check whether the kernel recognizes the topology correctly:
lstopo --of console

@steckdenis

steckdenis Mar 20, 2017

Contributor

Here it is

Machine (7997MB)
  Socket L#0
    L3 L#0 (8192KB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
    L3 L#1 (8192KB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6
        PU L#12 (P#12)
        PU L#13 (P#13)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7
        PU L#14 (P#14)
        PU L#15 (P#15)
  HostBridge L#0
    PCIBridge
      PCI 1022:43b7
        Block L#0 "sda"
      PCIBridge
        PCIBridge
          PCI 10ec:8168
            Net L#1 "eth0"
    PCIBridge
      PCI 1002:6779
        GPU L#2 "renderD128"
        GPU L#3 "card0"
        GPU L#4 "controlD64"
    PCIBridge
      PCI 1022:7901
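The gap between cores and hardware threads can be read off this output programmatically, for example to choose a thread count equal to the number of physical cores. A minimal sketch, assuming the `lstopo --of console` text format shown above. The function name `count_cores_and_pus` is hypothetical, and the sample is a shortened two-core excerpt rather than the full output.

```python
import re


def count_cores_and_pus(lstopo_text: str) -> tuple:
    """Count "Core L#n" and "PU L#n" objects in lstopo console output."""
    cores = len(re.findall(r"\bCore L#\d+", lstopo_text))
    # \b keeps "GPU L#2" lines from being counted as processing units.
    pus = len(re.findall(r"\bPU L#\d+", lstopo_text))
    return cores, pus


sample = """\
    L3 L#0 (8192KB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
        GPU L#2 "renderD128"
"""

cores, pus = count_cores_and_pus(sample)
print(f"{cores} cores, {pus} hardware threads, {pus // cores} threads per core")
```

Applied to the full output above, the same counting gives 8 cores and 16 processing units.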
@brada4

brada4 Mar 21, 2017

Contributor

Looks reasonable, most likely absolutely correct.

@martin-frbg martin-frbg referenced this pull request Mar 22, 2017

Closed

New release #1118

@martin-frbg

martin-frbg Mar 24, 2017

Collaborator

Thanks for the patch - looks good to me, I had only held back on committing to give more senior team members a chance to comment.
I think it is normal (though not always desirable) for OpenBLAS to detect 16 cores on a system capable of "hyperthreading" (MAX_CPU_NUMBER gets set from NUM_THREADS in the build system). I have not been able to find any whitepaper on optimizing for Ryzen yet, only some rather dubious claims of "avoid avx" or "avoid software prefetch". Perhaps it would make sense to copy your implementation notes to the wiki so that they do not get buried here.

@martin-frbg martin-frbg merged commit 66dc10b into xianyi:develop Mar 24, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed