Using threaded BLAS on Windows (32- or 64-bit) results in a BLAS that performs very poorly. For example, the LAPACK tests take several hours each to run.
make CC="i686-w64-mingw32-gcc" FC="i686-w64-mingw32-gfortran" RANLIB="i686-w64-mingw32-ranlib" FFLAGS=" -O2 " TARGET= USE_THREAD=1 NUM_THREADS=8 NO_AFFINITY=1 DYNAMIC_ARCH=1 OSNAME=WINNT CROSS=1 HOSTCC=gcc BINARY=32 CFLAGS=" -mincoming-stack-boundary=2" FFLAGS=" -mincoming-stack-boundary=2"
As another example, the Julia BLAS performance test has the following timing results:
min, max, mean, std. dev.
Thank you for the test. I will investigate this issue next week.
Hi @vtjnash ,
I just compiled OpenBLAS on Intel Core-i7 (4 cores) Windows 32-bit machine.
Then, I tested m=n=k=4096 DGEMM with 1, 2, 4 threads.
The performance is 13.9 GFLOPS, 27 GFLOPS, and 51.8 GFLOPS, respectively.
Now I am trying to run lapack_testing.
Could you use this test (https://gist.github.com/xianyi/5780018) to time dgemm on your machine?
BTW, what is your CPU type? Did you disable hyper-threading?
1 thread -> 4096x4096x4096 17.779832 s 7.730047926 GFLOPS
2 threads -> 4096x4096x4096 9.629764 s 14.272307553 GFLOPS
4 threads -> 4096x4096x4096 4.827839 s 28.468006798 GFLOPS
8 threads -> 4096x4096x4096 2.410538 s 57.015883372 GFLOPS
This is on the julia.mit.edu machine (Intel(R) Xeon(R) CPU E7- 8850 @ 2.00GHz, 80 cores). Hyper-threading is disabled. Note that I have only run the tests in Wine and VirtualBox/Win7, but I got the same performance on both.
My suspicion is that BLAS is running just fine but is getting stuck at some checkpoints for an exceptionally long time, so certain tests perform very poorly while others run without problems. If the LAPACK test works fine for you, then perhaps it is a generic problem with emulators (I didn't have ready access to a Windows box before for doing broader testing).
OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)
OS ... WINNT
Architecture ... x86
BINARY ... 32bit
C compiler ... GCC (command line : i686-w64-mingw32-gcc)
Fortran compiler ... GFORTRAN (command line : i686-w64-mingw32-gfortran)
Library Name ... libopenblas_nehalemp-r0.2.8.a (Multi threaded; Max num-threads is 8)
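For anyone cross-checking the numbers above: the reported GFLOPS values are consistent with the usual flop count for a matrix product, 2·m·n·k, divided by the wall-clock time. The sketch below assumes that is what the timing test does (the helper name `gemm_gflops` is made up for illustration) and simply redoes the arithmetic on the timings posted in this comment.

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """GFLOPS of an (m x k) times (k x n) product, assuming 2*m*n*k flops."""
    flops = 2.0 * m * n * k  # one multiply and one add per inner-product term
    return flops / seconds / 1e9


# Wall-clock timings posted above for m = n = k = 4096, keyed by thread count.
timings = {1: 17.779832, 2: 9.629764, 4: 4.827839, 8: 2.410538}

for threads, seconds in timings.items():
    rate = gemm_gflops(4096, 4096, 4096, seconds)
    print(f"{threads} thread(s): {rate:.3f} GFLOPS")
```

The recomputed values agree with the posted ones, which suggests the DGEMM timings themselves are sane and the slowdown lies elsewhere.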
Try 64-bit Linux to rule out a problem with the OS.
Rerunning the test in the primary OS on the computer (64-bit Ubuntu) shows roughly the same results as above:
4096x4096x4096 17.191794 s 7.994450927 GFLOPS
4096x4096x4096 9.333568 s 14.725231923 GFLOPS
4096x4096x4096 4.686359 s 29.327448766 GFLOPS
4096x4096x4096 2.316957 s 59.318732921 GFLOPS
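A quick way to read these timings is as speedup and parallel efficiency relative to the single-threaded run. The sketch below assumes the four rows are listed in the same order as the earlier run (1, 2, 4, and 8 threads), since the thread counts are not printed here; `scaling` is a hypothetical helper name.

```python
def scaling(timings: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Map thread count -> (speedup, efficiency) relative to the 1-thread time."""
    base = timings[1]
    result = {}
    for threads, seconds in timings.items():
        speedup = base / seconds
        result[threads] = (speedup, speedup / threads)
    return result


# Assumed mapping of the rows above to thread counts (order taken from the
# earlier run; the thread counts are not shown in this output).
linux_timings = {1: 17.191794, 2: 9.333568, 4: 4.686359, 8: 2.316957}

for threads, (speedup, efficiency) in scaling(linux_timings).items():
    print(f"{threads} thread(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under that assumption DGEMM scales well up to 8 threads, so the threaded kernels are not what makes the LAPACK tests slow.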
lapack_testing works fine on my 32-bit Windows machine. It takes only about 10 minutes.
I was able to instead use my mobile dual core i7 (plus hyper-threading) running Windows 7, in VMware and natively, and the vast performance differential disappeared.
time OPENBLAS_NUM_THREADS=1 make lapack-test
vs.
time OPENBLAS_NUM_THREADS=2 make lapack-test
While the single-threaded version was still faster (by about a factor of 2.5), repeating this test on Linux gave identical performance. My suspicion is that memory allocation growth (for spawning threads) is implemented poorly in both VirtualBox and Wine, and thus this "bug" had nothing to do with OpenBLAS.
Thanks for looking into this. Sorry for wasting your time.
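In case someone wants to repeat the comparison on another machine or emulator, here is a minimal sketch that times the same command under different `OPENBLAS_NUM_THREADS` settings. The function names are hypothetical, and it only measures wall-clock time, unlike the shell `time` builtin.

```python
import os
import subprocess
import time


def time_command(cmd: list[str], num_threads: int) -> float:
    """Run cmd with OPENBLAS_NUM_THREADS set and return elapsed seconds."""
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(num_threads))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)  # raises if the command fails
    return time.perf_counter() - start


def compare(cmd: list[str], thread_counts=(1, 2)) -> dict[int, float]:
    """Time cmd once per thread count and print the wall-clock results."""
    results = {}
    for n in thread_counts:
        results[n] = time_command(cmd, n)
        print(f"OPENBLAS_NUM_THREADS={n}: {results[n]:.1f} s")
    return results
```

Calling `compare(["make", "lapack-test"])` from the OpenBLAS source directory would correspond to the two commands above.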