threaded blas performance on windows is terrible #284

Closed
vtjnash opened this Issue · 8 comments

3 participants

@vtjnash

Using threaded BLAS on Windows (32- or 64-bit) results in a BLAS that performs very, very poorly; for example, the LAPACK tests take several hours each to run.

Compiled with:

make CC="i686-w64-mingw32-gcc" FC="i686-w64-mingw32-gfortran" \
     RANLIB="i686-w64-mingw32-ranlib" FFLAGS=" -O2 " TARGET= USE_THREAD=1 \
     NUM_THREADS=8 NO_AFFINITY=1 DYNAMIC_ARCH=1 OSNAME=WINNT CROSS=1 \
     HOSTCC=gcc BINARY=32 CFLAGS=" -mincoming-stack-boundary=2" \
     FFLAGS=" -mincoming-stack-boundary=2"

For example, the Julia BLAS performance test has the following timing results.

(with threading)

suite, test, min, max, mean, std. dev.
julia,dot_tiny,58.058200,73.324103,67.750435,5.823626
julia,dot_small,71.965551,75.205627,72.786046,1.372864
julia,dot_medium,102.544289,105.964556,103.505417,1.406535
julia,dot_large,21.153521,21.542396,21.368576,0.151678
julia,dot_huge,70.526542,70.837475,70.638176,0.118599
julia,axpy_tiny,43.208831,43.528984,43.384161,0.133417
julia,axpy_small,47.944069,48.138507,48.022850,0.071862
julia,axpy_medium,79.940864,80.391760,80.204529,0.211452
julia,axpy_large,20.125178,23.187577,20.815936,1.328680
julia,axpy_huge,69.271075,95.309852,84.800140,14.169267
julia,gemv_tiny,56.256296,69.034732,61.171892,5.993658
julia,gemv_small,40284.989214,82035.794205,59414.517316,18766.900620
julia,gemv_medium,8413.679556,10810.368410,9420.735144,909.435725
julia,gemv_large,780.309291,1388.054798,1047.426965,244.746852
julia,gemv_huge,184.613914,214.284996,196.979338,13.757023
julia,matmul_tiny,152.353958,183.421306,165.649163,14.362720
julia,matmul_small,292.335956,321.847799,300.112640,12.260501
julia,matmul_medium,8990.807773,19287.564680,13197.082591,4304.717517
julia,matmul_large,199.750751,239.018021,220.484218,15.000744

(without threading)

julia,dot_tiny,60.835648,73.707950,68.594230,4.825368
julia,dot_small,72.158313,75.374084,73.595981,1.420370
julia,dot_medium,102.848518,105.566460,104.165333,1.111246
julia,dot_large,21.519768,21.684873,21.584302,0.066756
julia,dot_huge,70.475418,70.773780,70.603479,0.107726
julia,axpy_tiny,71.688980,73.323265,72.164347,0.656889
julia,axpy_small,76.456065,76.585131,76.516743,0.054549
julia,axpy_medium,108.013697,108.359831,108.133043,0.136883
julia,axpy_large,23.016606,23.304910,23.119859,0.108602
julia,axpy_huge,72.085398,98.024442,87.581442,14.143001
julia,gemv_tiny,60.326925,73.127989,67.270542,6.082114
julia,gemv_small,98.632899,108.664897,101.595733,4.027393
julia,gemv_medium,39.360298,51.872221,43.615475,5.693509
julia,gemv_large,48.801440,70.924637,56.904144,8.895629
julia,gemv_huge,80.386730,96.202423,86.608078,7.468927
julia,matmul_tiny,99.737509,108.835869,105.027565,4.034369
julia,matmul_small,262.743655,278.695120,268.484607,6.346791
julia,matmul_medium,753.504495,758.774437,755.278407,2.105928
julia,matmul_large,413.714756,417.176089,414.464516,1.516495
@vtjnash referenced this issue in JuliaLang/julia: blas performance on windows #4139 (closed)

@xianyi
Owner

Thank you for the test. I will investigate this issue next week.

@xianyi
Owner

Hi @vtjnash ,

I just compiled OpenBLAS on an Intel Core i7 (4 cores) 32-bit Windows machine.

Then I tested m=n=k=4096 DGEMM with 1, 2, and 4 threads. The performance is 13.9 GFLOPS, 27 GFLOPS, and 51.8 GFLOPS, respectively.

Now I am trying to run lapack_testing.

Xianyi

@xianyi
Owner

Hi @vtjnash ,

Could you use this test (https://gist.github.com/xianyi/5780018) to time dgemm on your machine?

By the way, what is your CPU type? Have you disabled hyper-threading?

Xianyi

@vtjnash
1 thread  -> 4096x4096x4096  17.779832 s   7.730047926 GFLOPS
2 threads -> 4096x4096x4096   9.629764 s  14.272307553 GFLOPS
4 threads -> 4096x4096x4096   4.827839 s  28.468006798 GFLOPS
8 threads -> 4096x4096x4096   2.410538 s  57.015883372 GFLOPS

This is on the julia.mit.edu machine (Intel(R) Xeon(R) CPU E7-8850 @ 2.00GHz, 80 cores). Hyper-threading is disabled. Note that I have only run the tests in Wine and in VirtualBox/Win7, but I got the same performance in both.

My suspicion is that BLAS itself is running fine but is getting stuck at some checkpoints for an exceptionally long time, so certain tests perform very poorly while others run fine. If the LAPACK test works fine for you, then perhaps it is a generic problem with emulators (I didn't previously have ready access to a Windows box for broader testing).

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... WINNT             
  Architecture     ... x86               
  BINARY           ... 32bit                 
  C compiler       ... GCC  (command line : i686-w64-mingw32-gcc)
  Fortran compiler ... GFORTRAN  (command line : i686-w64-mingw32-gfortran)
  Library Name     ... libopenblas_nehalemp-r0.2.8.a (Multi threaded; Max num-threads is 8)
@cbergstrom

Try 64-bit Linux to rule out a problem with the OS.

@vtjnash

Rerunning the test in the machine's primary OS (64-bit Ubuntu) shows roughly the same results as above:

4096x4096x4096  17.191794 s   7.994450927 GFLOPS
4096x4096x4096   9.333568 s  14.725231923 GFLOPS
4096x4096x4096   4.686359 s  29.327448766 GFLOPS
4096x4096x4096   2.316957 s  59.318732921 GFLOPS
@xianyi
Owner

lapack_testing works fine on my 32-bit Windows machine; it only takes about 10 minutes.

@vtjnash

Instead, I was able to use my mobile dual-core i7 (with hyper-threading) running Windows 7, both in VMware and natively, and the vast performance differential disappeared.

time OPENBLAS_NUM_THREADS=1 make lapack-test

real    2m1.210s
user    0m18.233s
sys     0m6.959s

vs.

time OPENBLAS_NUM_THREADS=2 make lapack-test

real    5m30.863s
user    0m19.779s
sys     0m6.741s

While the single-threaded version was still faster (by about a factor of 2.5), repeating this test on Linux gave identical results. My suspicion is that memory allocation growth (for spawning threads) is handled poorly by both VirtualBox and Wine, and thus this "bug" had nothing to do with OpenBLAS.

Thanks for looking into this; sorry for wasting your time.

@vtjnash vtjnash closed this