No need to set OMP num_threads #3546
The reason given for this change in #2775 (@Guobing-Chen) was:

> Yes. We are using different workloads. Rbenchmark calculates the eigenvalues of a 640x640 random matrix. The "eigen" function repeatedly calls "dgeev_", which calls "exec_blas", and each time a new thread count is calculated. With "num_threads(num)" — say there are 112 logical cores on the hardware — if the first calculated num is 50, OpenMP creates 50 threads for the workload. If the second calculated num is 112, OpenMP creates 112 new threads for the workload and does not reuse the old ones. Without "num_threads(num)", there are 112 threads in the OpenMP thread pool and they are reused across both operations. Creating new threads causes a lot of overhead, and that is the root cause of rbenchmark's very poor performance without this patch.

and this PR's own rationale:

> num_threads(num) will cause extra new-thread overhead in some scenarios. In OpenMP, if the required thread count is larger than the last used count, that many new threads will be created.
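The pattern described above can be illustrated with a small standalone C program (a minimal sketch, not OpenBLAS code; the `region` helper is hypothetical). With libgomp, a parallel region requesting more threads than the previous one forces new worker threads to be created, while a smaller or absent `num_threads()` request reuses the existing pool:

```c
/* Minimal sketch (not OpenBLAS code) of the overhead pattern:
 * with libgomp, a parallel region that requests MORE threads than
 * the previous region forces new worker threads to be created,
 * while smaller requests reuse the existing pool. Falls back
 * gracefully when compiled without -fopenmp. */
#ifdef _OPENMP
#include <omp.h>
#else
/* Stub so the sketch also compiles without OpenMP support. */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Run one parallel region requesting `nthreads` threads and report
 * how many actually ran. Calling this with a growing argument
 * (e.g. 50, then 112, as in the rbenchmark case above) is what
 * triggers repeated thread creation in libgomp. */
static int region(int nthreads)
{
    int used = 1;
    #pragma omp parallel num_threads(nthreads)
    {
        #pragma omp master
        used = omp_get_num_threads();
    }
    return used;
}
```

Each call with a larger argument than the last grows the libgomp pool; the threads created for the larger request are kept, but the creation itself is the overhead the patch avoids.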
That is a rather ancient benchmark script; please re-check with benchmarks/scripts/R/deig.R.
I tried "benchmarks/scripts/R/deig.R" and here are the OpenMP version (USE_THREAD=1 USE_OPENMP=1) results. With the patch:
Impressive indeed. How does it compare to the pthread version, i.e. building without the USE_OPENMP parameter? Not picking on you, just curious; over the course of the day I will measure it myself.
The pthread version is "USE_THREAD=1 USE_OPENMP=0", and here are its results:
Looking deeper at the benchmark script:

I wonder if this can be fixed without bringing back the original problem. Perhaps by making it conditional on the fraction of the total threads available (e.g. run with
That's the other problem, with the threading thresholds. Yes, lots of time is wasted when the heuristics there fail and spin up too many threads. Characteristic part on a Sandy Bridge 2-core non-HT NUC, pre-fix pthread dgemm.R, recent versions:
thr=1
EDIT: I doubted the initial measurements as they were just a summary, not raw data, so I asked for a known consistent benchmark to be used. The more precise tool turned up a more favourable result than the initial assessment.
@martin-frbg Besides benchmarks/scripts/R/deig.R, what other benchmarks do I need to verify? The deig.R benchmark does not seem stable enough, so I increased the loop count to 20 and collected the following data on Ice Lake.
with patch
with patch VS without patch
Without this patch, the larger the size, the more time it takes to complete, so I only picked 3 sizes. The patch improves this benchmark. I have no idea what other benchmarks I need to verify with the patch.
We found a similar slowdown case in the EasyBuild community, using numpy and svd (https://gist.githubusercontent.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276/raw/660904cb770197c3c841ab9b7084657b1aea5f32/numpy-benchmark.py). In the end the problem is that both #2775 and this fix are right, depending on the use case and the compiler used. When you do an svd via

Here's an example for
Using a little

After a Google search, and inspired by https://stackoverflow.com/questions/24440118/openmp-parallell-region-overhead-increase-when-num-threads-varies
Intel (or clang for that matter) doesn't have this issue to the same extent, but also has no speedup for 2 threads:
I'm not quite sure what the best solution is. Clearly, if the number of threads stays low, there is a performance benefit (with libgomp), in this case about 32x (which happens to be exactly 64/2 threads, so mostly linear in the number of threads). But switching too often is catastrophic, by a factor of 40 or so. Perhaps a heuristic that could work is to keep track of, say, the latest 32 OpenMP regions and set
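The windowed heuristic suggested above might look roughly like this (a hypothetical sketch; `WINDOW`, `adaptive_num_threads`, and the history array are illustrative names, not OpenBLAS internals). The idea is to request the maximum thread count seen over the recent window, so the pool only shrinks after sustained low demand and is not torn down and rebuilt on every fluctuation:

```c
/* Hypothetical sketch of a "remember recent regions" heuristic.
 * None of these names are OpenBLAS internals. */
#define WINDOW 32

static int history[WINDOW];  /* thread counts of recent regions */
static int pos = 0;          /* ring-buffer write position */

/* Record the thread count needed by the current call and return the
 * count to actually pass to num_threads(): the maximum over the
 * last WINDOW regions, so transient dips don't shrink the pool. */
static int adaptive_num_threads(int needed)
{
    history[pos] = needed;
    pos = (pos + 1) % WINDOW;

    int max = 1;
    for (int i = 0; i < WINDOW; i++)
        if (history[i] > max)
            max = history[i];
    return max;
}
```

The trade-off noted below still applies: the bookkeeping itself adds overhead to every region, and the window only helps when past demand predicts future demand.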
Thanks. I have not found a solution I like - keeping track of past behaviour adds its own overhead and is not particularly good at predicting the future unless the program really does the same computation over and over. Perhaps something as trivial as introducing a new environment variable "OPENBLAS_ADAPTIVE" to choose between pre- and post-#2775 behaviour on startup would already help?
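The environment-variable idea could be sketched like this (the variable name OPENBLAS_ADAPTIVE comes from the comment above; the `use_num_threads` flag, the init function, and the exact value semantics are illustrative assumptions, not the merged implementation):

```c
/* Sketch of startup-time selection between the two behaviours via
 * an environment variable. Names and value semantics are assumed
 * for illustration. */
#include <stdlib.h>
#include <string.h>

/* Default: post-#2775 behaviour, i.e. pass num_threads(num). */
static int use_num_threads = 1;

/* Read OPENBLAS_ADAPTIVE once at startup. Any non-empty value
 * other than "0" switches to the pre-#2775 behaviour of leaving
 * the OpenMP pool size alone. */
static void init_threading_mode(void)
{
    const char *env = getenv("OPENBLAS_ADAPTIVE");
    use_num_threads = (env == NULL || *env == '\0'
                       || strcmp(env, "0") == 0);
}
```

Reading the variable once at library initialization keeps the per-call hot path free of `getenv` overhead, which matters for small BLAS calls.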
Attempting to supersede this with #3703, using a new environment variable to choose between the two modes of operation.
Since #3703 has been merged, can this PR now be closed? |
No need to set num_threads: num_threads(num) causes extra thread-creation overhead in some scenarios. In OpenMP, if the required thread count is larger than the last used count, that many new threads will be created.
With this patch, pts/rbenchmark-1.0.3 is improved by 2.375x (0.385 secs vs 0.164 secs) on an Ice Lake server under CentOS 8.
Here are the steps on how to run rbenchmark on CentOS 8:
$ sudo dnf install R
$ make TARGET=CORE2 USE_THREAD=1 USE_OPENMP=1 FC=gfortran CC=gcc LIBPREFIX="libopenblas" INTERFACE64=0
$ wget http://www.phoronix-test-suite.com/benchmark-files/rbenchmarks-20160105.tar.bz2
$ tar -xf rbenchmarks-20160105.tar.bz2
$ cd rbenchmarks
$ export LD_LIBRARY_PATH=
$ Rscript R-benchmark-25/R-benchmark-25.R
The benchmark result looks like "Overall mean (sum of I, II and III trimmed means/3)_ (sec): 0.166433631462761".