openBLAS nested parallelism #2052

Open
SanazGheibi opened this issue Mar 9, 2019 · 44 comments

@SanazGheibi

Hi,
we are trying to run two instances of cblas_dgemm in parallel. If the total number of threads is 16, we would like each instance to run using 8 threads. Currently, we are using a structure like this:
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs1, bs2, bs3, alpha, pTmpA, bs3, pTmpB, bs2, beta, pTmpC, bs2);
    } else {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs1, bs2, bs3, alpha, pTmpA2, bs3, pTmpB2, bs2, beta, pTmpC2, bs2);
    }
}
Here is the issue:

  • At the top level, there are two OpenMP threads, each active inside one of the if/else blocks. We expect those threads to call the cblas_dgemm functions in parallel, and inside those cblas_dgemm calls we expect new threads to be spawned.

  • To set the number of threads internal to each cblas_dgemm, we set the corresponding environment variable: setenv OPENBLAS_NUM_THREADS 8
    However, this doesn't seem to work. If we measure the runtime of each of the parallel calls, the two values are equal, but they match the runtime of a single cblas_dgemm call when nested parallelism is not used and OPENBLAS_NUM_THREADS is set to 1.

What is going wrong, and how can we get the desired behavior?
Is there any way we could know the number of threads used inside the cblas_dgemm function?

Thank you very much for your time and help

@martin-frbg
Collaborator

How big are your matrices? OpenBLAS will not use more than one thread if the product of the dimensions M, N and K is smaller than SMP_THRESHOLD_MIN*GEMM_MULTITHREAD_THRESHOLD (65535*4 = 256k by default).
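
The check in interface/gemm.c looks roughly like this (a simplified sketch of the decision described above, not the verbatim source):

    double MNK = (double)args.m * (double)args.n * (double)args.k;
    if (MNK <= (double)SMP_THRESHOLD_MIN * (double)GEMM_MULTITHREAD_THRESHOLD)
        args.nthreads = 1;                 /* stay single-threaded for small problems */
    else
        args.nthreads = num_cpu_avail(3);  /* otherwise use the available CPUs */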

@SanazGheibi
Author

Thank you very much @martin-frbg.
The matrices are rather large (M = N = K = 1024 or above). So I don't think that is the issue.

@martin-frbg
Collaborator

I do not think there is a direct way to get the number of threads inside dgemm; you'd either need to look at your running program in a debugger, or instrument interface/gemm.c to print the args.nthreads it has decided to use. Which version of OpenBLAS, and what hardware and operating system are you using?

@SanazGheibi
Author

We are using OpenBLAS 0.3.5 on an AMD Opteron 6168, and the OS is Ubuntu 16.04 (Xenial).
We have actually done the following:
We modified cblas_dgemm.c inside the OpenBLAS directory to print out the number of threads at the very beginning of the function, using printf("%d\n", omp_get_num_threads()). We then compiled the whole library and linked it to our code. We expected that calling cblas_dgemm would cause the number of its internal threads to be printed, but that didn't happen.

@martin-frbg
Collaborator

You can try the BLAS extension openblas_get_num_threads()
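
For example (openblas_get_num_threads() is the extension declared in OpenBLAS's cblas.h; the printout itself is just a suggestion):

    printf("OpenBLAS threads: %d\n", openblas_get_num_threads());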

@SanazGheibi
Author

Thank you very much @martin-frbg . I made this change, but still nothing is printed out.

@martin-frbg
Collaborator

That is a bit suspicious - are you sure that your program actually loads OpenBLAS at runtime, and not something else (like the single-threaded reference BLAS from netlib) through the "alternatives" mechanism of Ubuntu?

@SanazGheibi
Author

We explicitly provide the link to libopenblas.so. However, the source code we modified is from an OpenBLAS folder where the only cblas_dgemm.c is inside a folder called lapack-netlib. So that is suspicious, as you say.
However, if we remove the nested parallelism structure and leave only one call to cblas_dgemm, and if we set the number of OpenBLAS threads to different values using the environment variable OPENBLAS_NUM_THREADS, then the resulting runtime is sensitive to the number of threads.

@brada4
Contributor

brada4 commented Mar 10, 2019

That's upstream (Netlib LAPACK) code that does not run in parallel.
The cblas_ symbols are provided directly by OpenBLAS without an extra wrapper.

@martin-frbg
Collaborator

martin-frbg commented Mar 10, 2019

Try adding your printout in interface/gemm.c - this file gets compiled twice from the Makefile, once with -DCBLAS and once without, to give both cblas_dgemm and dgemm (as well as sgemm, cgemm, zgemm and their cblas counterparts, by (un)defining DOUBLE and COMPLEX as needed). The BLAS parts of lapack-netlib are not used in OpenBLAS; that directory is only included for LAPACK.
(Sorry for not spotting this last night)
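
For instance, a hypothetical instrumentation line in that file, placed once args.nthreads has been assigned:

    fprintf(stderr, "gemm interface: nthreads = %d\n", (int)args.nthreads);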

@brada4
Contributor

brada4 commented Mar 10, 2019

Seeing OpenMP in your code: you need to build OpenBLAS with OpenMP support. That "support" is quite rudimentary and turns into single-threaded OpenBLAS computation inside your parallel sections.

Complementing what martin said - you can use ltrace to get a list of which functions were called from which libraries, or use perf record ./program ; perf report to find the ones using most of the CPU time.

A more pragmatic approach would be to build against the Netlib BLAS provided by Ubuntu, confirm that it works at all, then use the alternatives mechanism to supplant that library with OpenBLAS.

@SanazGheibi
Author

Thank you very much @martin-frbg . I modified interface/gemm.c and put a print statement in each of the functions, but still nothing is printed out when I run my code. I suspect I may be linking incorrectly.

  • We don't have root access to the system, so we cannot install the library after it is compiled.
  • In the parent folder (OpenBLAS-0.3.5), there is a file named libopenblas.so.
  • In the directory where our code resides, I make a new directory called newDIR and copy libopenblas.so into it.
  • I compile our code using gcc -O3 -fopenmp OURCODE.c -o OUTPUT.out -LnewDIR -lopenblas
    Is there something wrong with how I am linking the library? I would really appreciate your help.

@SanazGheibi
Author

Thank you very much @brada4 . I have a question. Could you please explain a little more about

you need to build OpenBLAS with OpenMP support, that "support" is quite rudimentary and turns into single-threaded OpenBLAS computation inside your parallel sections.

Actually, I am compiling the code with the -fopenmp flag, and there are two threads at the outer level of the nested parallel section. Is that enough, or is there anything else I should do? I am asking because I read somewhere that OpenMP threads may conflict with OpenBLAS threads, and I suspect that may be related to the support you are talking about.

@martin-frbg
Collaborator

When you compile your code with "-lopenblas", this does not automatically ensure that exactly the same version of OpenBLAS will be loaded at runtime - there might be some other (and potentially older) version installed somewhere in the default library search paths on the system (like /lib, /usr/lib or /usr/local/lib).
Running ldd on your program should show which libopenblas gets loaded by default; setting the LD_LIBRARY_PATH environment variable to your directory should make it look there.
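
For example (the path is a placeholder; setenv is the csh equivalent of export):

    ldd OUTPUT.out | grep blas
    setenv LD_LIBRARY_PATH /path/to/newDIR
    ./OUTPUT.out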

@SanazGheibi
Author

Thank you very much @martin-frbg . It worked, and now the number of threads is printed out. There is just one other issue:
The first time I compiled and linked the library, there was a warning:

OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option

However, then the number of threads was sensitive to the environment variable OPENBLAS_NUM_THREADS and by changing this variable, the number of threads that were printed out did vary.

After I recompiled the library using USE_OPENMP=1, there are no more warnings, but now, no matter how I modify OPENBLAS_NUM_THREADS, the number of threads that is printed out is always 24 (the maximum number of threads on the system). Is there any way I can fix this problem? Thank you again.

@SanazGheibi
Author

Thank you very much @brada4

@brada4
Contributor

brada4 commented Mar 11, 2019

Thread safety has probably improved a lot since that warning was introduced, and nothing hangs these days. The detected thread number is not as important as the total runtime reduction.

@martin-frbg
Collaborator

Could be that it is always returning the value of OMP_NUM_THREADS now, unfortunately. You can try removing the "#ifndef USE_OPENMP" (and matching #endif) around line 1952 of memory.c (this could be another bug related to my earlier mis-edit uncovered in #2002 - memory.c basically contains two versions of the thread setup code, so you will see two definitions of blas_get_cpu_number there).
Despite the recent thread safety improvements, I do not think it is safe to mix OpenMP and non-OpenMP codes - the OpenMP management functions will not know anything about plain pthreads outside their control...

@SanazGheibi
Author

SanazGheibi commented Mar 12, 2019

Thank you very much @martin-frbg . I removed the #ifndef USE_OPENMP (and matching #endif) around line 1952 of memory.c, but it still doesn't work.

And there is another issue:
Inside interface/gemm.c, I have put two print statements:

  • One at the beginning of the function CNAME. This one prints the number of threads returned by openblas_get_num_threads().
  • Another one at the very end of the function CNAME. This print statement prints the value of args.nthreads.

If we remove the nested parallel structure and only call one instance of cblas_dgemm, both printed values are 24.
However, if we use the nested parallel structure, the printf at the beginning of CNAME prints 24, but the one at the end of CNAME prints 1. What could be going wrong?

And here is our nested parallel structure (so that you don't have to scroll all the way up to the early posts):

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        // First call, with first set of arguments
        cblas_dgemm();
    } else {
        // Second call, with second set of arguments
        cblas_dgemm();
    }
}

@SanazGheibi
Author

Thank you very much @brada4 , but in our case we need to know the number of threads in each block. The other thing is that we are not getting any runtime improvement compared to the case where we call the two functions sequentially, and that is really strange. So there may be something wrong with the thread distribution, and we need to figure that out.

@brada4
Contributor

brada4 commented Mar 12, 2019

You can count CPU usage with the "time" command - if user + system time exceeds the total (real) elapsed time, then you are using multiple threads.
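
For example, with the binary name from the earlier posts:

    time ./OUTPUT.out

If the reported user + sys time is larger than real, more than one thread was doing work.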

@martin-frbg
Collaborator

args.nthreads in interface/gemm.c should only become 1 when the product of the matrix dimensions is small; perhaps print args.m, args.n and args.k at that point as well, in case your code divides the workload unevenly between the two instances. (Print num_cpu_avail(3) too, just to be sure, though I do not think it could be 1.)
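
For instance (a hypothetical instrumentation line):

    fprintf(stderr, "gemm: m=%d n=%d k=%d nthreads=%d num_cpu_avail=%d\n",
            (int)args.m, (int)args.n, (int)args.k, (int)args.nthreads, num_cpu_avail(3));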

@SanazGheibi
Author

SanazGheibi commented Mar 12, 2019

Thank you @martin-frbg . For our problem args.m = args.n = args.k >= 512. That was verified after interface/gemm.c printed out these values.

However, the return value of num_cpu_avail(3) is printed out as 1. That is quite surprising, because there are 24 CPUs available in our system.

@SanazGheibi
Author

Thank you @brada4 .

@SanazGheibi
Author

Following up on my previous comment:
If we only call one instance of cblas_dgemm and remove the nested parallelism, then the output of num_cpu_avail(3) will be 24. Therefore, the idea that the system might be in use by other programs cannot hold in this case.

@SanazGheibi
Author

Another thing that is somewhat surprising to me is that if I use the following setting for CPU affinity:

setenv GOMP_CPU_AFFINITY "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16"

Then regardless of whether or not we are using nested parallelism, the return value of openblas_get_num_threads() at the beginning of CNAME, the value of args.nthreads at the end of CNAME and the return value of num_cpu_avail(3) will all be 1. What can be the reason for all of this? Thank you again.

@brada4
Contributor

brada4 commented Mar 12, 2019

Are you certain you use the same OpenBLAS library for each test?

@SanazGheibi
Author

SanazGheibi commented Mar 12, 2019

Yes, I am sure. There is only one OpenBLAS library, the one modified to print out the number of threads and the number of CPUs available, and I am using that.

@brada4
Contributor

brada4 commented Mar 13, 2019

You can try omp_get_num_threads(); I think openblas_get_num_threads() just gets the number from there.

@martin-frbg
Collaborator

Meh. Reading the implementation of num_cpu_avail() in common_thread.h, it is hardcoded to return "1" when in an OMP parallel region. (And has been like this since the days of GotoBLAS.) This could be a very old workaround for problems related to thread buffer memory allocation (and rogue overwriting). It will probably take some careful testing to see if the relatively recent introduction of MAX_PARALLEL_NUMBER (NUM_PARALLEL in Makefile.rule) from #1536 is sufficient on its own.
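
The check in question looks roughly like this (a simplified sketch based on the 0.3.5 source quoted later in this thread, not verbatim):

    static __inline int num_cpu_avail(int level) {
        if (blas_cpu_number == 1        /* single-CPU build or setting */
    #ifdef USE_OPENMP
            || omp_in_parallel()        /* hardcoded: 1 inside an OMP parallel region */
    #endif
           ) return 1;
        return blas_cpu_number;
    }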

@SanazGheibi
Author

Thank you very much @brada4 and @martin-frbg . I will go through #1536 and see what I can figure out.

@SanazGheibi
Author

SanazGheibi commented Mar 14, 2019

Thank you again @martin-frbg . I simply commented out

#ifdef USE_OPENMP
|| omp_in_parallel()
#endif

from common_thread.h, and it seems to be working. Now the number of threads inside the OpenBLAS function can be controlled from the calling function using omp_set_num_threads().

However, there is a problem remaining. If we use any of the following affinity settings:
setenv OMP_PLACES cores
setenv OMP_PROC_BIND close
or
setenv GOMP_CPU_AFFINITY "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23"
then the number of threads and the number of CPUs available drop to 1. I really have no idea why that happens.

@brada4
Contributor

brada4 commented Mar 14, 2019

Each calling thread is, in effect, constrained to one CPU; there is no easy way out of that. OpenBLAS also restricts the available CPUs based on the existing affinity mask, so as not to oversubscribe Docker containers, LXC instances, etc.

@martin-frbg
Collaborator

I do not want to put it like that - some of the observed behaviour may simply be the result of even more hidden bugs.

(I am not sure what the documented/expected result of setting GOMP_CPU_AFFINITY to the entire range of available cores is - I would have expected OpenBLAS to handle it the same as if no affinity mask had been set, unless OpenMP itself creates an affinity mask of "all cores" for the first instance and "none" for the second. There is already an open issue - #1653 - about finding and documenting best practices for using OpenBLAS with OpenMP but I feel "more research is needed")

@brada4
Contributor

brada4 commented Mar 14, 2019

If OMP_PLACES was set to sockets, it would be optimal for a multi-socket system.
If our affinity mask is one CPU, we really don't know whether we have any right to break free of it.

@SanazGheibi
Author

Thank you very much @brada4 and @martin-frbg . Where in the OpenBLAS library are the affinity masks handled? Is there any way we could check and possibly modify that?

@brada4
Contributor

brada4 commented Mar 14, 2019

See the patches linked in #1155 - they introduce parsing the affinity mask as a constraint on the available CPUs.
Please try OMP_PLACES=sockets; that should address your problem completely, as your 2 OMP parallel threads will settle in one socket each.
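
In the csh syntax used earlier in this thread, that would be:

    setenv OMP_PLACES sockets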

@martin-frbg
Collaborator

CPU enumeration happens in the function get_num_procs() of driver/others/memory.c, most recently updated in #2008. Beware that there are two occurrences of this function in memory.c: one for the USE_TLS=1 branch (experimental code using thread-local storage) and one for USE_TLS=0 - you will probably want to use/change the second instance.

@brada4
Contributor

brada4 commented Mar 15, 2019

The problem here is that the pthread-based OpenBLAS picks up side effects of the GOMP pthread setup, which is not the most orthodox configuration.

@martin-frbg
Collaborator

IIRC GOMP on Linux is implemented on top of pthreads, and the data returned by sched_getaffinity should reflect whatever was defined through GOMP_CPU_AFFINITY. So I do not think there is anything unorthodox about this configuration or its interpretation by get_num_procs(). It could simply be that we have another "ifdef USE_OPENMP, report a single core" elsewhere.

@brada4
Contributor

brada4 commented Mar 15, 2019

Here is the reference showing that it is the heavily bent configurations that are not working:
#2052 (comment)

@SanazGheibi
Author

Thank you very much @martin-frbg and @brada4 . I will check and see what I can do.

@jakub-homola

I am currently dealing with something similar.

In the main top-level readme, there is this line:

If you compile this library with USE_OPENMP=1, you should set the OMP_NUM_THREADS environment variable; OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1.

So the reason why openblas_get_num_threads() returned the same value as omp_get_num_threads() is that they are both based on the same environment variable.

To achieve the 2x8 nested parallelism, I think you will have to call openblas_set_num_threads() manually inside the parallel region, along with allowing nested OpenMP using e.g. export OMP_MAX_ACTIVE_LEVELS=2. Nested OpenMP being disabled by default could have been the reason behind the original issue. I didn't test it; it is just a suggestion.
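
A minimal sketch of that suggestion (untested; it assumes an OpenBLAS built with USE_OPENMP=1 and the openblas_set_num_threads() extension declared as in OpenBLAS's cblas.h; the 1024x1024 size follows the earlier posts):

    #include <stdlib.h>
    #include <cblas.h>
    #include <omp.h>

    /* BLAS extension provided by OpenBLAS */
    void openblas_set_num_threads(int num_threads);

    int main(void)
    {
        const int n = 1024;
        /* two independent problems, one per outer OpenMP thread */
        double *A1 = calloc((size_t)n * n, sizeof(double));
        double *B1 = calloc((size_t)n * n, sizeof(double));
        double *C1 = calloc((size_t)n * n, sizeof(double));
        double *A2 = calloc((size_t)n * n, sizeof(double));
        double *B2 = calloc((size_t)n * n, sizeof(double));
        double *C2 = calloc((size_t)n * n, sizeof(double));

        omp_set_max_active_levels(2);        /* allow nested parallelism */

        #pragma omp parallel num_threads(2)
        {
            openblas_set_num_threads(8);     /* 8 OpenBLAS threads per dgemm instance */
            if (omp_get_thread_num() == 0)
                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                            n, n, n, 1.0, A1, n, B1, n, 0.0, C1, n);
            else
                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                            n, n, n, 1.0, A2, n, B2, n, 0.0, C2, n);
        }

        free(A1); free(B1); free(C1);
        free(A2); free(B2); free(C2);
        return 0;
    }

Built e.g. with gcc -O3 -fopenmp example.c -lopenblas, this should give each dgemm call 8 worker threads - provided the num_cpu_avail() behaviour discussed above does not force the count back to 1.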
