Lapack tests slower with OPENBLAS_NUM_THREADS>1 #192

Closed
bashseb opened this issue Jan 29, 2013 · 9 comments

@bashseb
commented Jan 29, 2013

Hello,
Thank you for providing the OpenBLAS library. I would like to use your optimized library. I'm not sure whether this is the correct place to ask such questions; if not, please let me know.

 OpenBLAS build complete.

  OS               ... Linux
  Architecture     ... x86_64
  BINARY           ... 64bit
  C compiler       ... GCC  (command line : gcc)
  Fortran compiler ... GFORTRAN  (command line : gfortran)
  Library Name     ... libopenblas_sandybridgep-r0.2.5.a (Multi threaded; Max num-threads is 32)

I'm trying to use openblas (git) and as a first test tried to compile lapack-3.4.2. I added BLASLIB = /path/to/libopenblas.a -lpthread to make.inc.

I then ran the LAPACK tests with different values of export OPENBLAS_NUM_THREADS=i. I did make cleantesting and make. I can observe that i cores are utilized, but the tests run slower for i>1 than for i=1 or with the system BLAS (which beats OpenBLAS on most tests for i=1). Furthermore, a value of i=32 is treated as i=16.

I have obviously done something wrong, but I don't know how to narrow it down. I'd appreciate any hints.
I've put the result of i=32 on a pastebin http://pastebin.com/BQATuymz
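
One way to make the comparison across thread counts repeatable is to rerun the same command with different values of OPENBLAS_NUM_THREADS and record the wall-clock time. The sketch below is only an illustration: the helper names (env_with_threads, time_command, sweep) are hypothetical, and the command shown is a placeholder that would need to be replaced by the real test binary or make target. The variable is set in the child's environment before it starts, on the assumption that the library reads it when it initializes.

```python
import os
import subprocess
import sys
import time


def env_with_threads(num_threads, base_env=None):
    """Return a copy of the environment with OPENBLAS_NUM_THREADS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return env


def time_command(cmd, num_threads):
    """Run cmd once with the given thread count; return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env_with_threads(num_threads), check=True,
                   stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def sweep(cmd, thread_counts):
    """Time cmd for each thread count; returns {threads: seconds}."""
    return {n: time_command(cmd, n) for n in thread_counts}


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the real test run here.
    placeholder = [sys.executable, "-c", "pass"]
    for threads, seconds in sweep(placeholder, [1, 2, 4, 8, 16]).items():
        print(f"OPENBLAS_NUM_THREADS={threads}: {seconds:.3f} s")
```

A single run per thread count is noisy, so repeating the sweep a few times and comparing the best times is probably more informative.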

@xianyi

Owner

commented Jan 30, 2013

Hi @bashseb ,

Thank you for the feedback.

What's your CPU? I think hyper-threading is enabled on it, so the performance with i=32 is the same as with i=16.

How do you run the LAPACK tests?

What's your system BLAS? Is it Intel MKL?

Xianyi

@bashseb

Author

commented Jan 30, 2013

thanks @xianyi for your assistance. My CPU is a "Xeon(R) CPU E5-2690 0 @ 2.90GHz". It has 32 threads, 16 physical cores and 2 sockets. I just looked at htop and saw that 16 cores are fully utilized. Now I know that this is the expected behaviour.

I ran the test suite that is included in lapack-3.4.2. I guess it's not really a benchmark, just a verification that the library works. But it does report run times (see the pastebin). I'm trying to run the HPL benchmark (http://www.netlib.org/benchmark/hpl/) and will report the numbers. Is this a good idea?

My system blas is the default CentOS release 6.3 (Final) blas-devel-3.2.1-4.el6.x86_64. gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) (same for gfortran). Here is the log of the lapack tests with the system BLAS http://pastebin.com/ATaUNti4.

EDIT: I looked at HPL and found it a bit confusing. Also, it requires MPI. I'd prefer a more basic benchmark. What's the easiest way to assess the speed of OpenBLAS vs. the system BLAS?

EDIT2: I guess this CPU counts as 'sandybridge'. I think in this case my compiler is too old and I need AVX, right?

thanks a load!

@bashseb bashseb closed this Jan 30, 2013

@bashseb bashseb reopened this Jan 30, 2013

@xianyi

Owner

commented Jan 30, 2013

Hi @bashseb ,

I just checked the output of the LAPACK tests. Because OpenBLAS uses multithreading and the input matrices are very small, the performance is lower than with the default BLAS. This is the same issue as #103 .

HPL is a very good benchmark. Yes, it needs MPI. I don't know of any other basic benchmark for BLAS.
@zchothia ,@ViralBShah, any comments?

I think Red Hat has already applied the Sandy Bridge patch to this gcc version, so you don't need to update your compiler.

Xianyi

@bashseb

Author

commented Jan 30, 2013

@xianyi thanks a lot. The HPL benchmarks clearly show a run-time advantage for OpenBLAS. I multiplied the default Ns by 100 and NBs by 10 and tested with 1, 2 and 4 OpenBLAS threads and 4 Open MPI jobs. I see a speedup of more than a factor of 5. i=1 yields the best performance, probably because the problem is still quite small.

@bashseb bashseb closed this Jan 30, 2013

@ViralBShah

Contributor

commented Feb 3, 2013

HPL is perhaps the best benchmark, but the MPI makes it difficult to measure BLAS performance.

@concretevitamin

commented Aug 4, 2014

Hi @xianyi, can you elaborate on OpenBLAS's behavior when it encounters hyper-threading on a virtual machine (e.g. Amazon EC2 instances)?

A concrete scenario: let's say we have 4 physical cores and, with HT, 8 threads (2 logical cores per physical core). Is it then valid to set OPENBLAS_NUM_THREADS from 1 up to 4?

What if I use export OPENBLAS_MAIN_FREE=1? Is OPENBLAS_NUM_THREADS=8 then expected to perform better than 4?

@xianyi

Owner

commented Aug 4, 2014

Hi @concretevitamin ,

It depends on your benchmark. For example, DGEMM and other BLAS3 functions are compute-intensive, so those functions cannot benefit from HT. With OPENBLAS_NUM_THREADS=8, performance may be slower than with 4.

Xianyi
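
Since the advice above hinges on physical versus logical cores, it can help to check both counts before choosing OPENBLAS_NUM_THREADS. The following is a minimal sketch for Linux only, assuming /proc/cpuinfo exposes the usual "physical id" and "core id" fields; count_physical_cores is a hypothetical helper, and inside a virtual machine these fields may be missing or may not reflect the host topology.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo text."""
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            physical_id = None  # a new logical CPU entry starts here
        elif key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            cores.add((physical_id, value.strip()))
    return len(cores)


if __name__ == "__main__":
    path = "/proc/cpuinfo"
    if os.path.exists(path):
        with open(path) as handle:
            physical = count_physical_cores(handle.read())
        print(f"logical CPUs:   {os.cpu_count()}")
        print(f"physical cores: {physical}")
```

If the logical count is twice the physical count, hyper-threading is likely enabled, and the physical count is a reasonable starting value for the thread setting on compute-bound routines.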

@ViralBShah

Contributor

commented Aug 4, 2014

Yes, HPL is a good benchmark for a distributed machine, but it requires MPI and a whole host of other tuning to do seriously. If the goal is just to measure BLAS performance, I would just time the DGEMMs, or run peakflops if you are using julia, for example.
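
Timing DGEMM directly could look roughly like the sketch below. It assumes NumPy is installed and linked against the BLAS under test (which is not guaranteed on any given system), and it uses the standard 2*n^3 operation count for an n x n matrix product. The function names are hypothetical.

```python
import time


def gemm_flops(n):
    """Floating-point operations for an n x n by n x n multiply (2*n^3)."""
    return 2.0 * n ** 3


def gflops(n, seconds):
    """Convert a timing for an n x n multiply into GFLOP/s."""
    return gemm_flops(n) / seconds / 1e9


def time_dgemm(n, repeats=3):
    """Best-of-repeats wall time for an n x n double-precision matmul.

    Assumption: NumPy dispatches the product to the BLAS being measured.
    """
    import numpy as np  # imported lazily so the helpers above work without it

    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    a @ b  # warm-up run so one-time start-up costs are not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)
    return best
```

Calling gflops(n, time_dgemm(n)) for a small n (say 100) and a large n (say 4000) under different OPENBLAS_NUM_THREADS values may make the small-matrix threading overhead discussed earlier in this thread visible.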

@concretevitamin

commented Aug 4, 2014

Thanks for your response. In my case, I am wondering if GEQRF could benefit
from HT.
