openBLAS nested parallelism #2052

Open
SanazGheibi opened this issue Mar 9, 2019 · 44 comments

@SanazGheibi

Hi,
we are trying to run two instances of cblas_dgemm in parallel. If the total number of threads is 16, we would like each instance to run using 8 threads. Currently, we are using a structure like this:
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs1, bs2, bs3, alpha, pTmpA, bs3, pTmpB, bs2, beta, pTmpC, bs2);
    } else {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs1, bs2, bs3, alpha, pTmpA2, bs3, pTmpB2, bs2, beta, pTmpC2, bs2);
    }
}
Here is the issue:

  • At the top level, there are two OpenMP threads, each active inside one of the if/else blocks. We expect those threads to call the cblas_dgemm functions in parallel, and inside those cblas_dgemm calls we expect new threads to be spawned.

  • To set the number of threads internal to each cblas_dgemm, we set the corresponding environment variable: setenv OPENBLAS_NUM_THREADS 8
    However, this doesn't seem to work. If we measure the runtime of each of the parallel calls, the two values are equal, but they match the runtime of a single cblas_dgemm call when nested parallelism is not used and OPENBLAS_NUM_THREADS is set to 1.

What is going wrong, and how can we get the desired behavior?
Is there any way we could know the number of threads used inside the cblas_dgemm function?

Thank you very much for your time and help

@martin-frbg
Collaborator

How big are your matrices? OpenBLAS will not use more than one thread if the product of the dimensions M, N and K is smaller than SMP_THRESHOLD_MIN*GEMM_MULTITHREAD_THRESHOLD (65535*4 = 256k by default).
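
The check in interface/gemm.c looks roughly like this (a simplified sketch of the decision described above, not the verbatim source):

    double MNK = (double)args.m * (double)args.n * (double)args.k;
    if (MNK <= (double)SMP_THRESHOLD_MIN * (double)GEMM_MULTITHREAD_THRESHOLD)
        args.nthreads = 1;                 /* stay single-threaded for small problems */
    else
        args.nthreads = num_cpu_avail(3);  /* otherwise use the available CPUs */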

@SanazGheibi
Author

Thank you very much @martin-frbg.
The matrices are rather large (M = N = K = 1024 or above). So I don't think that is the issue.

@martin-frbg
Collaborator

I do not think there is a direct way to get the number of threads inside dgemm; you'd either need to look at your running program in a debugger, or instrument interface/gemm.c to print the args.nthreads it has decided to use. Which version of OpenBLAS, and what hardware and operating system are you using?

@SanazGheibi
Author

We are using OpenBLAS 0.3.5 on an AMD Opteron 6168, and the OS is Ubuntu 16.04 (Xenial).
We have actually done the following:
We modified cblas_dgemm.c inside the OpenBLAS directory to print out the number of threads at the very beginning of the function, using printf("%d\n", omp_get_num_threads()). We then compiled the whole library and linked it to our code. We expected that calling cblas_dgemm would cause the number of its internal threads to be printed, but that didn't happen.

@martin-frbg
Collaborator

You can try the BLAS extension openblas_get_num_threads()
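
For example (openblas_get_num_threads() is the extension declared in OpenBLAS's cblas.h; the printout itself is just a suggestion):

    printf("OpenBLAS threads: %d\n", openblas_get_num_threads());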

@SanazGheibi
Author

Thank you very much @martin-frbg . I made this change, but still nothing is printed out.

@martin-frbg
Collaborator

That is a bit suspicious - are you sure that your program actually loads OpenBLAS at runtime, and not something else (like the single-threaded reference BLAS from netlib) through the "alternatives" mechanism of Ubuntu?

@SanazGheibi
Author

We explicitly provide the link to libopenblas.so. However, the source code we modified is from an OpenBLAS folder where the only cblas_dgemm.c is inside a folder called lapack-netlib. So that is suspicious, as you say.
However, if we remove the nested parallelism structure and leave only one call to cblas_dgemm, and if we set the number of OpenBLAS threads to different values using the environment variable OPENBLAS_NUM_THREADS, then the resulting runtime is sensitive to the number of threads.

@brada4
Contributor

brada4 commented Mar 10, 2019

That's upstream (Netlib LAPACK) code that does not run in parallel.
The cblas_ symbols are provided directly by OpenBLAS without an extra wrapper.

@martin-frbg
Collaborator

martin-frbg commented Mar 10, 2019

Try adding your printout in interface/gemm.c - this file gets compiled twice from the Makefile, once with -DCBLAS and once without, to give both cblas_dgemm and dgemm (as well as sgemm, cgemm, zgemm and their cblas counterparts, by (un)defining DOUBLE and COMPLEX as needed). The BLAS parts of lapack-netlib are not used in OpenBLAS; that directory is only included for LAPACK.
(Sorry for not spotting this last night)
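
For instance, a hypothetical instrumentation line in that file, placed once args.nthreads has been assigned:

    fprintf(stderr, "gemm interface: nthreads = %d\n", (int)args.nthreads);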

@brada4
Contributor

brada4 commented Mar 10, 2019

Seeing OpenMP in your code: you need to build OpenBLAS with OpenMP support. That "support" is quite rudimentary and turns into single-threaded OpenBLAS computation inside your parallel sections.

Complementing what martin said - you can use ltrace to get a list of which functions were called from which libraries, or use perf record ./program ; perf report to find the ones using most of the CPU time.

A more pragmatic approach would be to build against the Netlib BLAS provided by Ubuntu, confirm that it works at all, then use the alternatives mechanism to supplant that library with OpenBLAS.

@SanazGheibi
Author

Thank you very much @martin-frbg . I modified interface/gemm.c and put a print statement in each of the functions, but still nothing is printed out when I run my code. I suspect I may be linking incorrectly.

  • We don't have root access to the system, so we cannot install the library after it is compiled.
  • In the parent folder (OpenBLAS-0.3.5), there is a file named libopenblas.so.
  • In the directory where our code resides, I make a new directory called newDIR and copy libopenblas.so into it.
  • I compile our code using gcc -O3 -fopenmp OURCODE.c -o OUTPUT.out -LnewDIR -lopenblas
    Is there something wrong with how I am linking the library? I would really appreciate your help.

@SanazGheibi
Author

Thank you very much @brada4 . I have a question. Could you please explain a little more about

you need to build OpenBLAS with OpenMP support, that "support" is quite rudimentary and turns into single-threaded OpenBLAS computation inside your parallel sections.

Actually, I am compiling the code with the -fopenmp flag, and there are two threads at the outer level of the nested parallel section. Is that enough, or is there anything else I should do? I am asking because I read somewhere that OpenMP threads may conflict with OpenBLAS threads, and I suspect that may be related to the support you are talking about.

@martin-frbg
Collaborator

When you compile your code with "-lopenblas", this does not automatically ensure that exactly the same version of OpenBLAS will be loaded at runtime - there might be some other (and potentially older) version installed somewhere in the default library search paths on the system (like /lib, /usr/lib or /usr/local/lib).
Running ldd on your program should show which libopenblas gets loaded by default; setting the LD_LIBRARY_PATH environment variable to your directory should make it look there.
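
For example (the path is a placeholder; setenv is the csh equivalent of export):

    ldd OUTPUT.out | grep blas
    setenv LD_LIBRARY_PATH /path/to/newDIR
    ./OUTPUT.out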

@SanazGheibi
Author

Thank you very much @martin-frbg . It worked, and now the number of threads is printed out. There is just one other issue:
The first time I compiled and linked the library, there was a warning:

OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option

However, then the number of threads was sensitive to the environment variable OPENBLAS_NUM_THREADS and by changing this variable, the number of threads that were printed out did vary.

After I recompiled the library using USE_OPENMP=1, there are no more warnings, but now, no matter how I modify OPENBLAS_NUM_THREADS, the number of threads that is printed out is always 24 (the maximum number of threads on the system). Is there any way I can fix this problem? Thank you again.

@SanazGheibi
Author

Thank you very much @brada4

@brada4
Contributor

brada4 commented Mar 11, 2019

Thread safety has probably improved a lot since that warning was introduced, and nothing hangs these days. The detected thread number is not as important as the total runtime reduction.

@martin-frbg
Collaborator

Could be that it is always returning the value of OMP_NUM_THREADS now, unfortunately. You can try removing the "#ifndef USE_OPENMP" (and matching #endif) around line 1952 of memory.c (this could be another bug related to my earlier mis-edit uncovered in #2002 - memory.c basically contains two versions of the thread setup code, so you will see two definitions of blas_get_cpu_number there).
Despite the recent thread safety improvements, I do not think it is safe to mix OpenMP and non-OpenMP codes - the OpenMP management functions will not know anything about plain pthreads outside their control...

@SanazGheibi
Author

SanazGheibi commented Mar 12, 2019

Thank you very much @martin-frbg . I removed the #ifndef USE_OPENMP (and matching #endif) around line 1952 of memory.c, but it still doesn't work.

And there is another issue:
Inside interface/gemm.c, I have put two print statements:

  • One at the beginning of the function CNAME. This one prints the number of threads returned by openblas_get_num_threads().
  • Another one at the very end of the function CNAME. This print statement prints the value of args.nthreads.

If we remove the nested parallel structure and only call one instance of cblas_dgemm, both printed values are 24.
However, if we use the nested parallel structure, the printf at the beginning of CNAME prints 24, but the one at the end of CNAME prints 1. What could be going wrong?

And here is our nested parallel structure (so that you don't have to scroll all the way up to the early posts):

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        // First call, with first set of arguments
        cblas_dgemm();
    } else {
        // Second call, with second set of arguments
        cblas_dgemm();
    }
}

@SanazGheibi
Author

Thank you very much @brada4 , but in our case we need to know the number of threads in each block. The other thing is that we are not getting any runtime improvement compared to the case where we call the two functions sequentially, and that is really strange. So there may be something wrong with the thread distribution, and we need to figure that out.

@brada4
Contributor

brada4 commented Mar 12, 2019

You can count CPU usage with the "time" command - if user + system time exceeds the total (real) elapsed time, then you are using multiple threads.
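
For example, with the binary name from the earlier posts:

    time ./OUTPUT.out

If the reported user + sys time is larger than real, more than one thread was doing work.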

@martin-frbg
Collaborator

args.nthreads in interface/gemm.c should only become 1 when the product of the matrix dimensions is small; perhaps print args.m, args.n and args.k at that point as well, in case your code divides the workload unevenly between the two instances. (Print num_cpu_avail(3) too, just to be sure, though I do not think it could be 1.)
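
For instance (a hypothetical instrumentation line):

    fprintf(stderr, "gemm: m=%d n=%d k=%d nthreads=%d num_cpu_avail=%d\n",
            (int)args.m, (int)args.n, (int)args.k, (int)args.nthreads, num_cpu_avail(3));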

@SanazGheibi
Author

SanazGheibi commented Mar 12, 2019

Thank you @martin-frbg . For our problem args.m = args.n = args.k >= 512. That was verified after interface/gemm.c printed out these values.

However, the return value of num_cpu_avail(3) is printed out as 1. That is quite surprising, because there are 24 CPUs available in our system.

@SanazGheibi
Author

Thank you @brada4 .

@SanazGheibi
Author

Following up on my previous comment:
If we only call one instance of cblas_dgemm and remove the nested parallelism, then the output of num_cpu_avail(3) will be 24. Therefore, the idea that the system might be in use by other programs cannot hold in this case.

@SanazGheibi
Author

Another thing that is somewhat surprising to me is that if I use the following setting for CPU affinity:

setenv GOMP_CPU_AFFINITY "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16"

Then regardless of whether or not we are using nested parallelism, the return value of openblas_get_num_threads() at the beginning of CNAME, the value of args.nthreads at the end of CNAME and the return value of num_cpu_avail(3) will all be 1. What can be the reason for all of this? Thank you again.

@brada4
Contributor

brada4 commented Mar 12, 2019

Are you certain you use the same OpenBLAS library for each test?

@SanazGheibi
Author

SanazGheibi commented Mar 12, 2019

Yes, I am sure. There is only one OpenBLAS library, the one modified to print out the number of threads and the number of CPUs available, and I am using that.

@brada4
Contributor

brada4 commented Mar 13, 2019

You can try omp_get_num_threads(); I think openblas_get_num_threads() just gets the number from there.

@martin-frbg
Collaborator

Meh. Reading the implementation of num_cpu_avail() in common_thread.h, it is hardcoded to return "1" when in an OMP parallel region. (And has been like this since the days of GotoBLAS.) This could be a very old workaround for problems related to thread buffer memory allocation (and rogue overwriting). It will probably take some careful testing to see if the relatively recent introduction of MAX_PARALLEL_NUMBER (NUM_PARALLEL in Makefile.rule) from #1536 is sufficient on its own.
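
The check in question looks roughly like this (a simplified sketch based on the 0.3.5 source quoted later in this thread, not verbatim):

    static __inline int num_cpu_avail(int level) {
        if (blas_cpu_number == 1        /* single-CPU build or setting */
    #ifdef USE_OPENMP
            || omp_in_parallel()        /* hardcoded: 1 inside an OMP parallel region */
    #endif
           ) return 1;
        return blas_cpu_number;
    }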

@SanazGheibi
Author

Thank you very much @brada4 and @martin-frbg . I will go through #1536 and see what I can figure out.

@SanazGheibi
Author

SanazGheibi commented Mar 14, 2019

Thank you again @martin-frbg . I simply commented out

#ifdef USE_OPENMP
|| omp_in_parallel()
#endif

from common_thread.h, and it seems to be working. Now the number of threads inside the OpenBLAS function can be controlled from the calling function using omp_set_num_threads().

However, there is a problem remaining. If we use any of the following affinity settings:
setenv OMP_PLACES cores
setenv OMP_PROC_BIND close
or
setenv GOMP_CPU_AFFINITY "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23"
then the number of threads and the number of CPUs available drop to 1. I really have no idea why that happens.

@brada4
Contributor

brada4 commented Mar 14, 2019

Each calling thread is, in effect, constrained to one CPU; there is no easy way out of that. OpenBLAS also restricts the available CPUs based on the existing affinity mask, so as not to oversubscribe Docker containers, LXC instances, etc.

@martin-frbg
Collaborator

I do not want to put it like that - some of the observed behaviour may simply be the result of even more hidden bugs.

(I am not sure what the documented/expected result of setting GOMP_CPU_AFFINITY to the entire range of available cores is - I would have expected OpenBLAS to handle it the same as if no affinity mask had been set, unless OpenMP itself creates an affinity mask of "all cores" for the first instance and "none" for the second. There is already an open issue - #1653 - about finding and documenting best practices for using OpenBLAS with OpenMP but I feel "more research is needed")

@brada4
Contributor

brada4 commented Mar 14, 2019

If OMP_PLACES was set to sockets, it would be optimal for a multi-socket system.
If our affinity mask is one CPU, we really don't know whether we have any right to break free of it.

@SanazGheibi
Author

Thank you very much @brada4 and @martin-frbg . Where in the OpenBLAS library are the affinity masks handled? Is there any way we could check and possibly modify that?

@brada4
Contributor

brada4 commented Mar 14, 2019

See the patches linked in #1155 - they introduce parsing the affinity mask as a constraint on the available CPUs.
Please try OMP_PLACES=sockets; that should address your problem completely, as your 2 OMP parallel threads will settle in one socket each.
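
In the csh syntax used earlier in this thread, that would be:

    setenv OMP_PLACES sockets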

@martin-frbg
Collaborator

CPU enumeration happens in the function get_num_procs() of driver/others/memory.c, most recently updated in #2008. Beware that there are two occurrences of this function in memory.c: one for the USE_TLS=1 branch (experimental code using thread-local storage) and one for USE_TLS=0 - you will probably want to use/change the second instance.

@brada4
Contributor

brada4 commented Mar 15, 2019

The problem here is that the pthread-based OpenBLAS picks up side effects of the GOMP pthread setup, which is not the most orthodox configuration.

@martin-frbg
Collaborator

IIRC GOMP on Linux is implemented on top of pthreads, and the data returned by sched_getaffinity should reflect whatever was defined through GOMP_CPU_AFFINITY. So I do not think there is anything unorthodox about this configuration or its interpretation by get_num_procs(). It could simply be that we have another "ifdef USE_OPENMP, report a single core" elsewhere.

@brada4
Contributor

brada4 commented Mar 15, 2019

Here is the reference showing that it is the heavily bent configurations that are not working:
#2052 (comment)

@SanazGheibi
Author

Thank you very much @martin-frbg and @brada4 . I will check and see what I can do.

@jakub-homola

I am currently dealing with something similar.

In the main top-level readme, there is this line:

If you compile this library with USE_OPENMP=1, you should set the OMP_NUM_THREADS environment variable; OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1.

So the reason why openblas_get_num_threads() returned the same value as omp_get_num_threads() is that they are both based on the same environment variable.

To achieve the 2x8 nested parallelism, I think you will have to call openblas_set_num_threads() manually inside the parallel region, along with allowing nested OpenMP using e.g. export OMP_MAX_ACTIVE_LEVELS=2. Nested OpenMP being disabled by default could have been the reason behind the original issue. I didn't test it; it is just a suggestion.
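
A minimal sketch of that suggestion (untested; it assumes an OpenBLAS built with USE_OPENMP=1 and the openblas_set_num_threads() extension declared as in OpenBLAS's cblas.h; the 1024x1024 size follows the earlier posts):

    #include <stdlib.h>
    #include <cblas.h>
    #include <omp.h>

    /* BLAS extension provided by OpenBLAS */
    void openblas_set_num_threads(int num_threads);

    int main(void)
    {
        const int n = 1024;
        /* two independent problems, one per outer OpenMP thread */
        double *A1 = calloc((size_t)n * n, sizeof(double));
        double *B1 = calloc((size_t)n * n, sizeof(double));
        double *C1 = calloc((size_t)n * n, sizeof(double));
        double *A2 = calloc((size_t)n * n, sizeof(double));
        double *B2 = calloc((size_t)n * n, sizeof(double));
        double *C2 = calloc((size_t)n * n, sizeof(double));

        omp_set_max_active_levels(2);        /* allow nested parallelism */

        #pragma omp parallel num_threads(2)
        {
            openblas_set_num_threads(8);     /* 8 OpenBLAS threads per dgemm instance */
            if (omp_get_thread_num() == 0)
                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                            n, n, n, 1.0, A1, n, B1, n, 0.0, C1, n);
            else
                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                            n, n, n, 1.0, A2, n, B2, n, 0.0, C2, n);
        }

        free(A1); free(B1); free(C1);
        free(A2); free(B2); free(C2);
        return 0;
    }

Built e.g. with gcc -O3 -fopenmp example.c -lopenblas, this should give each dgemm call 8 worker threads - provided the num_cpu_avail() behaviour discussed above does not force the count back to 1.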
