
openblas and openmp #2265

Closed
bill-hager opened this issue Sep 21, 2019 · 8 comments · Fixed by #4441

Comments

@bill-hager

I have tried to use OpenBLAS with Tim Davis' SuiteSparse package. I have downloaded OpenBLAS from Red Hat on my Dell desktop and from Ubuntu on my ThinkPad laptop; in either case I have similar problems. The problem occurs when his software tries to perform a supernodal Cholesky factorization, which requires dgemm from BLAS. On my 32-processor desktop, the factorization is 1000 times slower than it should be; on my 8-processor laptop, it is 7 times slower. When I profile, I find that 57% of the time is spent in blas_thread_server and 35% in alloc_map.

If I perform the factorization again immediately after it completes, the time drops to 0.1 seconds on either machine, which is the correct factorization time (on the 32-processor desktop, the initial factorization took 86 seconds). The current version of SuiteSparse uses OpenMP, so there seems to be some problem with the OpenMP code inside OpenBLAS. If I essentially turn off threading with "setenv OMP_NUM_THREADS 1", the factorization time is 0.2 seconds, so the huge run times are gone; nonetheless, that is still twice what it would be if threading worked. Is it possible to fix dgemm so that multiprocessor threading works with OpenMP? dgemm in OpenBLAS does work correctly with pthreads; it is with OpenMP threading that it does not seem to work.

To be clear about the timing: the initial factorization takes 86 seconds, and if I immediately refactor the matrix, it takes 0.1 seconds. On the other hand, if I factor the matrix, exit the routine where I factor it, do some work in other routines, and then return to the factorization routine, it again takes 86 seconds. The drop from 86 seconds to 0.1 seconds only happens when the second factorization occurs immediately after the first one.
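For reference, the workaround described above as shell commands (./myapp stands in for the actual SuiteSparse-based executable):

# Workaround: force single-threaded BLAS (csh syntax, as in the report):
setenv OMP_NUM_THREADS 1
./myapp

# sh/bash equivalents; OPENBLAS_NUM_THREADS limits only OpenBLAS and
# leaves SuiteSparse's own OpenMP parallelism untouched:
OMP_NUM_THREADS=1 ./myapp
OPENBLAS_NUM_THREADS=1 ./myapp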

@martin-frbg
Collaborator

Which version(s) of OpenBLAS? Slowness on (only) the first run makes it sound like some cache contention issue. What are your other OpenMP environment variables? (This could be related to #1653, which unfortunately has no clear resolution so far.)
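A quick way to list what is set, assuming a POSIX shell:

# Show every OpenMP/OpenBLAS-related environment variable currently set:
env | grep -E 'OMP|OPENBLAS|GOMP|KMP'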

@brada4
Contributor

brada4 commented Sep 22, 2019

What CPU? 32 processors (which would barely fit under the desk), 32 cores, or 32 hyperthreads?
About the "immediately" part: are you loading data from the hard drive between factorizations?

EDIT: what do you mean by "from Red Hat"? They favor ATLAS, not OpenBLAS. You can get OpenBLAS 0.3.3 from Fedora EPEL, or better, do your own rpmbuild from Fedora's 0.3.7 SRPM.
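A sketch of that rebuild; the exact release suffix of the source RPM will vary:

# Rebuild the Fedora OpenBLAS source package locally:
rpmbuild --rebuild openblas-0.3.7-1.src.rpm
# The binary RPMs land under ~/rpmbuild/RPMS/ by default.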

@brada4
Contributor

brada4 commented Sep 22, 2019

To narrow this down, please compare

perf record ./sample ; perf report

vs

OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 perf record ./sample ; perf report

and try to piece together what happens inside the SuiteSparse calls versus inside the OpenBLAS library.

@bill-hager
Author

bill-hager commented Sep 22, 2019 via email

@martin-frbg
Collaborator

The E5-2687W v2 will use the Sandybridge target (8 cores/16 threads per socket), and on a two-socket system an added problem could be tasks getting pushed from one socket to the other. The i7-8650U will use the Haswell target.
The OpenBLAS version is still of interest, as early 0.3.x releases had some performance issues due to unnecessary locking (though if pthreads performance is normal, it is probably not one of those).
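For reference, the socket/core layout can be confirmed with standard Linux tools (nothing OpenBLAS-specific assumed):

# Show model name, core counts, and NUMA layout:
lscpu | grep -E 'Model name|Socket|Core|Thread|NUMA'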

@brada4
Contributor

brada4 commented Sep 22, 2019

A first test with the Xeon would be to restrict it to the 8 physical cores on one NUMA node, without the hyperthreaded pseudo-cores.
Does it then get close to 8x better than 1 core?
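A minimal sketch of that test, assuming node 0 holds physical cores 0-7 (check with numactl --hardware first; ./sample is a placeholder):

# Pin execution and memory to NUMA node 0 and run 8 threads:
OMP_NUM_THREADS=8 numactl --cpunodebind=0 --membind=0 ./sample
# Alternative with an explicit core list:
OMP_NUM_THREADS=8 taskset -c 0-7 ./sample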

@brada4
Contributor

brada4 commented Sep 25, 2019

Include/cholmod_supernodal.h

 * BLAS routines:
 * dtrsv        solve Lx=b or L'x=b, L non-unit diagonal, x and b stride-1
 * dtrsm        solve LX=B or L'X=b, L non-unit diagonal
 * dgemv        y=y-A*x or y=y-A'*x (x and y stride-1)
 * dgemm        C=A*B', C=C-A*B, or C=C-A'*B
 * dsyrk        C=tril(A*A')

dtrsv is not parallel
... the rest are guarded ...
dsyrk is not guarded against excess parallelism the way we planned a while ago in #1886; it would be nice to re-confirm with the profiler that this is the routine that is failing.
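One way to do that re-confirmation, assuming the test program is ./sample:

# Record with call graphs, then look for the syrk/gemm kernels and the
# thread server in the profile:
perf record -g ./sample
perf report --stdio | grep -i -E 'syrk|gemm|blas_thread'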

@martin-frbg
Collaborator

Whatever went wrong there in 2019... with current OpenBLAS I get to within 5 percent of the speed of MKL 2024.0 on comparable hardware when running SuiteSparse 7.5.1's CHOLMOD on large matrix problems from the SuiteSparse Matrix Collection. The speed difference becomes negligible when the (already suspect) multithreading threshold in GEMV is increased.
