Benchmark OpenBLAS, Intel MKL vs ATLAS #18

Open
ytakeyasu opened this Issue Mar 15, 2015 · 5 comments

@ytakeyasu

Hi,

This is not a problem report, but I'd like to share my benchmarks of LAPACK/BLAS libraries. Because of my huge simulation model, I have been upgrading my CPU and math library. My conclusion is that Intel MKL is the best, and OpenBLAS is worth trying.

No. of surface patches   Memory (GB)   3.16 GHz Core2 Duo   3.0 GHz Core2 Quad
                                       ATLAS (sec)          OpenBLAS (sec)
 6,319                    2.5            360                   135
 9,968                    6.0          1,380                   510
13,992                   11.8          3,600                 1,360

The simulation model (a smooth-walled 3-section conical horn antenna) consists of surface patches defined by SP and SC cards. Total run time is measured with gettimeofday() rather than sysconf(). Note that OpenBLAS speeds up by more than the ratio of CPU cores (Duo vs. Quad). As the flat profile below shows, about 90 % of the computation is in zgemm_kernel_n, which is parallelized across the cores.

Flat profile:

Each sample counts as 0.01 seconds.
  %      cumulative   self                 self     total
 time     seconds    seconds     calls    s/call   s/call   name
 89.99    289.89      289.89                                zgemm_kernel_n
  3.25    300.37       10.48                                sched_yield
  1.62    305.60        5.23                                ztrsm_kernel_LT
  1.45    310.26        4.66                                inner_advanced_thread
  0.73    312.61        2.35   39929761    0.00     0.00    nec_context::hintg(double, double, double)

matrix_algebra.cpp is modified for OpenBLAS:

extern "C" {
#include </usr/lib/openblas-base/include/lapacke.h>
#include </usr/lib/openblas-base/include/cblas.h>
}

int info = LAPACKE_zgetrf((int) CblasColMajor, (lapack_int) n, (lapack_int) n,
                          (lapack_complex_double*) a_in.data(), (lapack_int) ndim,
                          (lapack_int*) ip.data());
int info = LAPACKE_zgetrs((int) CblasColMajor, (char) CblasNoTrans, (lapack_int) n,
                          (lapack_int) 1, (const lapack_complex_double*) a.data(),
                          (lapack_int) ndim, (const lapack_int*) ip.data(),
                          (lapack_complex_double*) b.data(), (lapack_int) n);

To handle the transposed-matrix case, zgetrs.c in OpenBLAS was also modified:

if (trans_arg == 'O') trans = 0;
if (trans_arg == 'P') trans = 1;
if (trans_arg == 'Q') trans = 2;
if (trans_arg == 'R') trans = 3;

This is a dirty workaround. It would be appreciated if someone could suggest a better solution.
OpenBLAS is superb, but I experienced segmentation faults when memory usage exceeded about 60 GB on an 8-core CPU. I confirmed that this segfault is NOT caused by NEC2++, but fixing the problem in OpenBLAS was beyond my capability, so I migrated to Intel MKL.

No. of surface patches   Memory (GB)   3.0 GHz Core2 Quad   2.93 GHz Dual X5570 (8 cores)   2.93 GHz Dual X5570 (8 cores)
                                       OpenBLAS (sec)       OpenBLAS (sec)                  Intel MKL (sec)
 6,319                    2.5            135                    68                              67
 9,968                    6.0            510                   253                             247
13,992                   11.8          1,360                   669                             663
19,096                   21.9              -                 1,671                           1,663
24,957                   37.3              -                 3,760                           3,659
31,641                   59.9              -                 7,633                           7,417
39,117                   91.5              -                 Seg-Fault                      14,004

matrix_algebra.cpp is modified for Intel MKL:

#include </opt/intel/composer_xe_201.1.117/mkl/include/mkl_lapacke.h>
#include </opt/intel/composer_xe_201.1.117/mkl/include/mkl_cblas.h>

int info = LAPACKE_zgetrf(CblasColMajor, n, n, (MKL_Complex16*) a_in.data(),
                          ndim, (int*) ip.data());
int info = LAPACKE_zgetrs(CblasColMajor, 'N', n, 1, (const MKL_Complex16*) a.data(),
                          ndim, (const int*) ip.data(), (MKL_Complex16*) b.data(), n);

Link options are:

-Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -openmp -I$(MKLROOT)/include

These options were suggested by the Intel Math Kernel Library Link Line Advisor. I used somewhat older versions of the software:
NEC2++ : ver.1.5.1
OpenBLAS : ver.2.5
Intel MKL : ver.11.1
gcc : ver.4.7.2
icc : ver.13.0.1

I hope this may help your serious number-crunching.

Best regards.

Yoshi Takeyasu

@tmolteno
tmolteno commented Mar 15, 2015 (Owner)
Hi Yoshi

This is very interesting information. I have been working on getting necpp to work with Eigen (eigen.tuxfamily.org); however, it has been difficult because Eigen aligns the rows and columns of matrices with 4-byte address boundaries. I will keep trying, as it will make an interesting comparison as well.

Kind Regards

Tim Molteno


@gaming-hacker
gaming-hacker commented Feb 16, 2016
You ran ATLAS on a dual-core machine and OpenBLAS on a quad-core, so if you adjust the ATLAS numbers for core count they are about the same. Plus, what is the cost of OpenBLAS? $0. MKL? $$$$. I'll stick with OpenBLAS.

I'd like to see you benchmark PLASMA, libblis and libflame; I think these will be faster than OpenBLAS, as they have been updated with current kernels. Throw in an OpenCL comparison if you can, and also try an OpenMPI build.

OpenBLAS doesn't have a lot of kernels to tune, so when it ran its generic x86_64 configure it probably didn't determine your cache size correctly and is bombing when malloc returns a null pointer. Probably.

@ytakeyasu


http://gcdart.blogspot.jp/2013/06/fast-matrix-multiply-and-ml.html

This is a good reference for the discussion.

@ldmtwo
ldmtwo commented Mar 28, 2016

Just FYI. MKL is now FREE, free as in free beer, or a free couch on the side of the road used by a guy who looks like Homer Simpson, or free as in the US' ideology on speech. https://software.intel.com/en-us/articles/free_mkl *Disclaimer: These words above are my own and do not reflect the opinion or ideals of Intel. This is not endorsed by any entity.

@ytakeyasu Can you share the full compile args you used to link OpenBLAS and MKL? Thanks


@ytakeyasu
ytakeyasu commented Mar 29, 2016

Hi,
As I reported in my first post, I used the Intel Math Kernel Library Link Line Advisor to find my link options. The parameters I input to the Advisor are:

• Intel(R) product : Intel(R) MKL 11.1
• OS : Linux
• Usage model of Intel(R) Xeon Phi(TM) Coprocessor : None
• Compiler : Intel(R) C/C++
• Architecture : Intel(R) 64
• Dynamic or static linking : Static
• Interface layer : LP64 (32-bit integer)
• Sequential or multi-threaded layer : Multi-threaded
• OpenMP library : Intel(R) (libiomp5)

Then, I got the following link options:

-Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -openmp -I$(MKLROOT)/include

Regards.

Yoshi Takeyasu

