OpenBLAS nested parallelism #2052
How big is your matrix? OpenBLAS will not use more than one thread if the product of the dimensions M, N and K is smaller than a built-in threshold.
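For reference, the decision being described looks roughly like this in interface/gemm.c (paraphrased from the OpenBLAS 0.3.x source, so not standalone code; SMP_THRESHOLD_MIN and GEMM_MULTITHREAD_THRESHOLD are real compile-time macros, but the exact arithmetic may differ between versions):

    /* Paraphrased from interface/gemm.c: fall back to a single thread
     * when the total work M*N*K is below a compile-time threshold. */
    double MNK = (double)args.m * (double)args.n * (double)args.k;
    if (MNK <= SMP_THRESHOLD_MIN * (double)GEMM_MULTITHREAD_THRESHOLD)
        args.nthreads = 1;
    else
        args.nthreads = num_cpu_avail(3);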
Thank you very much @martin-frbg.
I do not think there is a direct way to get the number of threads inside dgemm; you'd either need to look at your running program in a debugger, or instrument interface/gemm.c to print the args.nthreads it has decided to use. Which version of OpenBLAS, and what hardware and operating system are you using?
We are using OpenBLAS 0.3.5 on an AMD Opteron 6168, and the OS is Ubuntu 16.04 (Xenial).
You can try the BLAS extension openblas_get_num_threads()
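A minimal check using that extension might look like this (a sketch assuming the cblas.h header shipped with OpenBLAS, which declares the openblas_* extensions):

    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        /* openblas_get_num_threads() is an OpenBLAS-specific extension,
         * not part of standard CBLAS, so this only links correctly if
         * the BLAS actually being used is OpenBLAS. */
        printf("OpenBLAS reports %d threads\n", openblas_get_num_threads());
        return 0;
    }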
Thank you very much @martin-frbg . I made this change, but still nothing is printed out.
That is a bit suspicious. Are you sure that your program actually loads OpenBLAS at runtime, and not something else (like the single-threaded reference BLAS from Netlib) through the "alternatives" mechanism of Ubuntu?
We explicitly provide the link to libopenblas.so. However, the file we have been modifying is the only cblas_dgemm.c in the OpenBLAS tree, and it sits inside a folder called lapack-netlib. So that is suspicious, as you say.
That's upstream (Netlib LAPACK) code that does not run in parallel.
Try adding your printout in interface/gemm.c - this file gets compiled twice from the Makefile, once with -DCBLAS and once without, to give both the CBLAS and the standard BLAS entry points.
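A sketch of the kind of printout being suggested (illustrative only; args is the blas_arg_t that CNAME() in interface/gemm.c sets up, and the right place for the printout is after args.nthreads has been decided, which varies a little between versions):

    /* Temporary debug printout for interface/gemm.c; remove when done. */
    fprintf(stderr, "gemm: m=%ld n=%ld k=%ld -> nthreads=%d\n",
            (long)args.m, (long)args.n, (long)args.k, (int)args.nthreads);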
Seeing OpenMP in your code - you need to build OpenBLAS with OpenMP support, and even then that "support" is quite rudimentary: it turns into single-threaded OpenBLAS computation inside your parallel sections. Complementing what martin said - you can use ltrace to get a list of which functions got called from which libraries. A more pragmatic approach would be to build against the Netlib BLAS provided by Ubuntu, confirm that it works at all, then use the alternatives mechanism to supplant that library with OpenBLAS.
Thank you very much @martin-frbg . I modified interface/gemm.c and put a print statement in each of the functions, but still nothing is printed out when I run my code. I suspect I may be doing the linking in the wrong way.
Thank you very much @brada4 . I have a question: could you please explain a little more about the approach you suggested?
Actually, I am compiling the code using -lopenblas.
When you compile your code with "-lopenblas", this does not automatically ensure that exactly the same version of OpenBLAS will be loaded at runtime - there might be some other (and potentially older) version installed somewhere in the default library search paths on the system (like /lib, /usr/lib or /usr/local/lib).
Namely, the following FAQ entries apply:
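One quick way to confirm at runtime which build actually got loaded is the openblas_get_config() extension (again assuming OpenBLAS's cblas.h; the returned string includes version, target and threading model):

    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        /* If this links and prints an OpenBLAS configuration string,
         * the process really resolved its BLAS symbols to OpenBLAS. */
        printf("%s\n", openblas_get_config());
        return 0;
    }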
Thank you very much @martin-frbg . It worked, and now the number of threads is printed out. There is just one other issue:
Previously, the number of threads was sensitive to the environment variable OPENBLAS_NUM_THREADS, and by changing this variable the number of threads that was printed out did vary. After I recompiled the library using USE_OPENMP=1 there are no more warnings, but now, no matter how I modify OPENBLAS_NUM_THREADS, the number of threads that is printed out is always 24 (the maximum number of threads in the system). Is there any way I can fix this problem? Thank you again.
Thank you very much @brada4 |
Probably thread safety has improved a lot since that warning was introduced, and nothing hangs these days. The detected thread number is not as important as the reduction in total run time.
Could be that it is always returning the value of OMP_NUM_THREADS now, unfortunately. You can try removing the "#ifndef USE_OPENMP" (and matching #endif) around line 1952 of memory.c (this could be another bug related to my earlier mis-edit uncovered in #2002 - memory.c basically contains two versions of the thread setup code, so you will see two definitions of blas_get_cpu_number there).
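As a quick cross-check of that theory: with an OpenMP build, the effective thread count should follow the OpenMP runtime, which can be observed from a test program like this sketch (openblas_set_num_threads() is the documented OpenBLAS extension; in OpenMP builds it is expected to act on the OpenMP thread limit):

    #include <stdio.h>
    #include <omp.h>
    #include <cblas.h>

    int main(void) {
        omp_set_num_threads(8);        /* OpenMP-level limit */
        printf("after omp_set_num_threads(8): %d\n",
               openblas_get_num_threads());
        openblas_set_num_threads(4);   /* OpenBLAS extension */
        printf("after openblas_set_num_threads(4): %d\n",
               openblas_get_num_threads());
        return 0;
    }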
Thank you very much @martin-frbg . I removed the #ifndef USE_OPENMP (and the matching #endif) as you suggested.
And there is another issue:
If we remove the nested parallel structure and only call one instance of cblas_dgemm, both of the printed values are 24. And here is our nested parallel structure (so that you don't have to go all the way up to the early posts):
Thank you very much @brada4 , but in our case we need to know the number of threads in each block. The other thing is that we are not getting any runtime improvement compared to calling the two functions sequentially, which is really strange. So there may be something wrong with the thread distribution, and we need to figure that out.
You can gauge CPU usage with the "time" command - if user+system time is greater than the total (real) time, then you are using more than one thread.
args.nthreads in interface/gemm.c should only become 1 when the product of the matrix dimensions is small; perhaps print args.m, args.n, args.k at that point as well, in case your code divides the workload unevenly between the two instances. (Print num_cpu_avail(3) too, just to be sure, though I do not think it could be 1.)
Thank you @martin-frbg . For our problem, args.m = args.n = args.k >= 512; that was verified after interface/gemm.c printed out these values. However, the return value of num_cpu_avail(3) is printed out as 1. That is quite surprising, because there are 24 CPUs available in our system.
Thank you @brada4 .
In accordance with my previous comment:
Another thing that is somewhat surprising to me is that if I use the following setting for CPU affinity:
Then regardless of whether or not we are using nested parallelism, the return value of openblas_get_num_threads() at the beginning of CNAME, the value of args.nthreads at the end of CNAME, and the return value of num_cpu_avail(3) will all be 1. What can be the reason for all of this? Thank you again.
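One way to see what an affinity setting actually leaves visible to each thread is to query the mask directly (a Linux-specific sketch; if GOMP_CPU_AFFINITY pins each OpenMP thread to a single core, this prints 1 inside the parallel region, which would explain OpenBLAS seeing only one usable CPU):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            cpu_set_t set;
            /* pid 0 = the calling thread; CPU_COUNT gives the number of
             * CPUs this particular thread is allowed to run on. */
            if (sched_getaffinity(0, sizeof(set), &set) == 0)
                printf("thread %d: %d CPUs in affinity mask\n",
                       omp_get_thread_num(), CPU_COUNT(&set));
        }
        return 0;
    }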
Are you certain you use the same OpenBLAS library for each test?
Yes, I am sure. There is only one OpenBLAS library that is modified to print out the number of threads and the number of CPUs available, and I am using that.
You can try omp_get_num_threads() ; I think openblas_get_num_threads() just gets the number from there.
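One caveat with that (worth noting as a possible source of confusion): omp_get_num_threads() returns the size of the current thread team, so it reports 1 outside of any parallel region, while omp_get_max_threads() reports how many threads a new region would get. A quick illustration:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* Outside a parallel region the current team has exactly one thread. */
        printf("outside: num=%d max=%d\n",
               omp_get_num_threads(), omp_get_max_threads());
        #pragma omp parallel
        #pragma omp single
        printf("inside:  num=%d\n", omp_get_num_threads());
        return 0;
    }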
Meh. Reading the implementation of num_cpu_avail() in common_thread.h, it is hardcoded to return 1 when in an OMP parallel region (and has been like this since the days of GotoBLAS). This could be a very old workaround for problems related to thread buffer memory allocation (and rogue overwriting). It will probably take some careful testing to see if the relatively recent introduction of MAX_PARALLEL_NUMBER (NUM_PARALLEL in Makefile.rule) from #1536 is sufficient on its own.
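The pattern in question looks roughly like this (paraphrased from common_thread.h rather than copied, so treat the details as approximate):

    /* Paraphrased from num_cpu_avail() in common_thread.h: any BLAS call
     * issued from inside an OpenMP parallel region is forced to a single
     * thread. */
    static __inline int num_cpu_avail(int level) {
        if (blas_cpu_number == 1
    #ifdef USE_OPENMP
            || omp_in_parallel()
    #endif
           ) return 1;
        return blas_cpu_number;
    }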
Thank you very much @brada4 and @martin-frbg . I will go through #1536 and see what I can figure out.
Thank you again @martin-frbg . I simply commented out that check. However, there is a problem remaining. If we use any of the following affinity settings:
Each calling thread is sort of constrained to one CPU; there is no easy way out of that. OpenBLAS also restricts the available CPUs based on the existing affinity mask, so as not to oversubscribe Docker containers, LXC instances, etc.
I do not want to put it like that - some of the observed behaviour may simply be the result of even more hidden bugs. (I am not sure what the documented/expected result of setting GOMP_CPU_AFFINITY to the entire range of available cores is - I would have expected OpenBLAS to handle it the same as if no affinity mask had been set, unless OpenMP itself creates an affinity mask of "all cores" for the first instance and "none" for the second. There is already an open issue - #1653 - about finding and documenting best practices for using OpenBLAS with OpenMP, but I feel "more research is needed".)
If OMP_PLACES was set to sockets, it would be optimal for a multi-socket system.
Thank you very much @brada4 and @martin-frbg . Where in the OpenBLAS library are the affinity masks handled? Is there any way we could check and possibly modify that?
See the patches linked in #1155 - they introduce parsing the affinity mask as a constraint on the available CPUs.
CPU enumeration happens in the function get_num_procs() of the file driver/others/memory.c, most recently updated in #2008 - beware that there are two occurrences of this in memory.c, one for the USE_TLS=1 branch (experimental code using thread-local storage) and one for USE_TLS=0 - you will probably want to use/change the second instance.
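Paraphrased, the non-TLS get_num_procs() on Linux does something along these lines (heavily simplified; the real code has more cases for cgroups and differs between versions):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    /* Simplified sketch of get_num_procs() in driver/others/memory.c:
     * start from the number of configured CPUs, then shrink to the size
     * of the process affinity mask so constrained environments (docker,
     * lxc, taskset, GOMP_CPU_AFFINITY) are not oversubscribed. */
    static int get_num_procs(void) {
        int nums = (int)sysconf(_SC_NPROCESSORS_CONF);
        cpu_set_t cpuset;
        if (sched_getaffinity(0, sizeof(cpuset), &cpuset) == 0) {
            int avail = CPU_COUNT(&cpuset);
            if (avail > 0 && avail < nums) nums = avail;
        }
        return nums;
    }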
The problem here is that a pthread build of OpenBLAS picks up side effects of the GOMP pthread setup, which is not the most orthodox configuration.
IIRC GOMP on Linux is implemented on top of pthreads, and the data returned by sched_getaffinity should reflect whatever was defined through GOMP_CPU_AFFINITY. So I do not think there is anything unorthodox about this configuration or its interpretation by get_num_procs(). It could simply be that we have another "ifdef USE_OPENMP, report a single core" elsewhere.
Here is the reference showing that those heavily bent configurations are the ones that do not work:
Thank you very much @martin-frbg and @brada4 . I will check and see what I can do.
I am currently dealing with something similar. In the main top-level readme, there is this line:
So that is the reason why the calls behave this way. To achieve the 2x8 nested parallelism, I think you will have to use openblas_set_num_threads() manually inside the parallel region, along with allowing nested OpenMP.
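Following that suggestion, a sketch of the changed region might look like this (the nested-OpenMP enabling call is an assumption, since the comment above is truncated at that point; buffer pointers and sizes are the ones from the original code below, and openblas_set_num_threads() sets a global limit in most builds):

    omp_set_nested(1);               /* assumed way of allowing nested OpenMP;
                                      * OMP_NESTED=true in the environment is
                                      * the equivalent knob */
    #pragma omp parallel num_threads(2)
    {
        openblas_set_num_threads(8); /* aim for 8 BLAS threads per instance */
        if (omp_get_thread_num() == 0)
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        bs1, bs2, bs3, alpha, pTmpA, bs3, pTmpB, bs2, beta, pTmpC, bs2);
        else
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        bs1, bs2, bs3, alpha, pTmpA2, bs3, pTmpB2, bs2, beta, pTmpC2, bs2);
    }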
Hi,
we are trying to run two instances of cblas_dgemm in parallel. If the total number of threads is 16, we would like each instance to run using 8 threads. Currently, we are using a structure like this:
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs1, bs2, bs3, alpha, pTmpA, bs3, pTmpB, bs2, beta, pTmpC, bs2);
    } else {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs1, bs2, bs3, alpha, pTmpA2, bs3, pTmpB2, bs2, beta, pTmpC2, bs2);
    }
}
Here is the issue:
At the top level, there are two OpenMP threads, each of which is active inside one of the if/else blocks. Now, we expect those threads to call the cblas_dgemm functions in parallel, and inside those cblas_dgemm functions, we expect new threads to be spawned.
To set the number of threads internal to each cblas_dgemm, we set the corresponding environment variable: setenv OPENBLAS_NUM_THREADS 8
However, it doesn't seem to be working. If we measure the runtime of each of the parallel calls, the two values are equal, but they are equal to the runtime of a single cblas_dgemm call when nested parallelism is not used and the environment variable OPENBLAS_NUM_THREADS is set to 1.
What is going wrong? And how can we get the desired behavior?
Is there any way we could know the number of threads inside the cblas_dgemm function?
Thank you very much for your time and help