Benchmark OpenBLAS, Intel MKL vs ATLAS #18

Open
ytakeyasu opened this issue Mar 15, 2015 · 6 comments

@ytakeyasu

Hi,

This is not a problem report, but I'd like to share my benchmarks of LAPACK/BLAS libraries. Because of my huge simulation model, I have been replacing my CPU and math library. My conclusion is that Intel MKL is the best, and OpenBLAS is worth trying.

| No. of surface patches | Memory (GB) | 3.16 GHz Core2 Duo, ATLAS (sec) | 3.0 GHz Core2 Quad, OpenBLAS (sec) |
| --- | --- | --- | --- |
| 6,319 | 2.5 | 360 | 135 |
| 9,968 | 6.0 | 1,380 | 510 |
| 13,992 | 11.8 | 3,600 | 1,360 |

The simulation model (a smooth-walled 3-section conical horn antenna) consists of surface patches defined by SP & SC cards. Total run time is measured with gettimeofday() instead of sysconf(), so it reflects wall-clock time. Note that OpenBLAS speeds up by more than the core-count ratio (Duo vs. Quad). As the flat profile below shows, about 90% of the computation is in zgemm_kernel_n, which is parallelized across the cores.

Flat profile:

```
Each sample counts as 0.01 seconds.
  %   cumulative    self               self    total
 time   seconds    seconds    calls   s/call   s/call  name
89.99    289.89     289.89                             zgemm_kernel_n
 3.25    300.37      10.48                             sched_yield
 1.62    305.60       5.23                             ztrsm_kernel_LT
 1.45    310.26       4.66                             inner_advanced_thread
 0.73    312.61       2.35  39929761    0.00    0.00   nec_context::hintg(double, double, double)
```

matrix_algebra.cpp is modified for OpenBLAS:

```cpp
extern "C"
{
#include </usr/lib/openblas-base/include/lapacke.h>
#include </usr/lib/openblas-base/include/cblas.h>
}

int info = LAPACKE_zgetrf((int) CblasColMajor, (lapack_int) n, (lapack_int) n,
                          (lapack_complex_double*) a_in.data(), (lapack_int) ndim,
                          (lapack_int*) ip.data());
int info = LAPACKE_zgetrs((int) CblasColMajor, (char) CblasNoTrans, (lapack_int) n,
                          (lapack_int) 1, (const lapack_complex_double*) a.data(),
                          (lapack_int) ndim, (const lapack_int*) ip.data(),
                          (lapack_complex_double*) b.data(), (lapack_int) n);
```

To handle the transpose argument, zgetrs.c of OpenBLAS is also modified:

```c
if (trans_arg == 'O') trans = 0;
if (trans_arg == 'P') trans = 1;
if (trans_arg == 'Q') trans = 2;
if (trans_arg == 'R') trans = 3;
```

This is a dirty solution; I would appreciate it if someone could suggest a better one.
OpenBLAS is superb, but I experienced a segmentation fault when memory usage exceeded 60 GB on an 8-core CPU. I confirmed that this segfault is NOT caused by NEC2++, but fixing the problem in OpenBLAS was beyond my capability, so I migrated to Intel MKL.

| No. of surface patches | Memory (GB) | 3.0 GHz Core2 Quad, OpenBLAS (sec) | 2.93 GHz Dual X5570 (8 cores), OpenBLAS (sec) | 2.93 GHz Dual X5570 (8 cores), Intel MKL (sec) |
| --- | --- | --- | --- | --- |
| 6,319 | 2.5 | 135 | 68 | 67 |
| 9,968 | 6.0 | 510 | 253 | 247 |
| 13,992 | 11.8 | 1,360 | 669 | 663 |
| 19,096 | 21.9 | - | 1,671 | 1,663 |
| 24,957 | 37.3 | - | 3,760 | 3,659 |
| 31,641 | 59.9 | - | 7,633 | 7,417 |
| 39,117 | 91.5 | - | Seg-Fault | 14,004 |

matrix_algebra.cpp is modified for Intel MKL:

```cpp
#include </opt/intel/composer_xe_201.1.117/mkl/include/mkl_lapacke.h>
#include </opt/intel/composer_xe_201.1.117/mkl/include/mkl_cblas.h>

int info = LAPACKE_zgetrf(CblasColMajor, n, n, (MKL_Complex16*) a_in.data(),
                          ndim, (int*) ip.data());
int info = LAPACKE_zgetrs(CblasColMajor, 'N', n, 1, (const MKL_Complex16*) a.data(),
                          ndim, (const int*) ip.data(), (MKL_Complex16*) b.data(), n);
```

Link options are:

```
-Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -openmp -I$(MKLROOT)/include
```
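For reference, a minimal GNU Make fragment showing where such options go in a build; the target, variable names, and MKLROOT default are mine, purely illustrative:

```makefile
MKLROOT ?= /opt/intel/mkl

CXX      = icc
CXXFLAGS = -O2 -I$(MKLROOT)/include

# Static, threaded, LP64 MKL link line (as produced by the Link Line Advisor)
MKL_LIBS = -Wl,--start-group \
           $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a \
           $(MKLROOT)/lib/intel64/libmkl_core.a \
           $(MKLROOT)/lib/intel64/libmkl_intel_thread.a \
           -Wl,--end-group -lpthread -lm -openmp

nec2++: matrix_algebra.o
	$(CXX) -o $@ $^ $(MKL_LIBS)
```

The `--start-group`/`--end-group` pair lets the linker resolve the circular dependencies among the three static MKL archives.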

The Intel Math Kernel Library Link Line Advisor suggests these options. I used slightly older versions of these resources:
NEC2++ : ver.1.5.1
OpenBLAS : ver.2.5
Intel MKL : ver.11.1
gcc : ver.4.7.2
icc : ver.13.0.1

I hope this may help your serious number-crunching.

Best regards.

Yoshi Takeyasu

@tmolteno
Owner

Hi Yoshi

This is very interesting information. I have been working on getting necpp to work with Eigen (eigen.tuxfamily.org); however, it has been difficult because Eigen aligns the rows and columns of matrices on 4-byte address boundaries. I will keep trying, as it will make an interesting comparison as well.

Kind Regards

Tim Molteno

@gaming-hacker

You ran ATLAS on a dual-core machine and the others on a quad-core, so if you adjust the ATLAS numbers for core count they are about the same. Plus, what is the cost of OpenBLAS? $0. MKL? $$$$. I'll stick with OpenBLAS.

I'd like to see you benchmark PLASMA, BLIS, and libflame; I think these will be faster than OpenBLAS, as they have been updated with current kernels. Throw in an OpenCL comparison if you can, and also try an OpenMPI build.

OpenBLAS doesn't have a lot of kernels to tune, so when it ran its generic x86_64 configure it probably didn't determine your cache size correctly, and it is crashing when malloc returns a null pointer. Probably.

@ytakeyasu
Author

http://gcdart.blogspot.jp/2013/06/fast-matrix-multiply-and-ml.html

This is a good reference for the discussion.

@ldmtwo

ldmtwo commented Mar 28, 2016

Just FYI. MKL is now FREE, free as in free beer, or a free couch on the side of the road used by a guy who looks like Homer Simpson, or free as in the US' ideology on speech. https://software.intel.com/en-us/articles/free_mkl *Disclaimer: These words above are my own and do not reflect the opinion or ideals of Intel. This is not endorsed by any entity.

@ytakeyasu Can you share the full compile args you used to link OpenBLAS and MKL? Thanks

@ytakeyasu
Author

Hi,
As I reported in my first post, I used the Intel Math Kernel Library Link Line Advisor to find my link options. The parameters I entered into the Advisor are:

• Intel(R) product : Intel(R) MKL 11.1
• OS : Linux
• Usage model of Intel(R) Xeon Phi(TM) Coprocessor : None
• Compiler : Intel(R) C/C++
• Architecture : Intel(R) 64
• Dynamic or static linking : Static
• Interface layer : LP64 (32-bit integer)
• Sequential or multi-threaded layer : Multi-threaded
• OpenMP library : Intel(R) (libiomp5)

Then I got the following link options:

```
-Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -openmp -I$(MKLROOT)/include
```

Regards.

Yoshi Takeyasu

@nardis-miles

I hope this is still active. Did you install the libraries yourself from source, or did you use the stock ATLAS and OpenBLAS from a repository? ATLAS really has to be tuned to your system; the tuning can give factors of at least 2-3.
