
No need to set OMP num_threads #3546

Closed
wants to merge 1 commit

Conversation

@kangshan1157 commented Feb 25, 2022

There is no need to set num_threads: num_threads(num) causes extra thread-creation overhead in some scenarios. In OpenMP, if the requested thread count is larger than the count used by the previous parallel region, new threads have to be created.
With this patch, pts/rbenchmark-1.0.3 improves by 2.375x (0.385 secs vs. 0.164 secs) on an Ice Lake server under CentOS 8.
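For context, the change is roughly of the following shape. This is a simplified sketch, not the verbatim OpenBLAS source; `blas_queue_t` and `exec_blas_work_item` are stand-in names for the real work-queue structures in the OpenMP server code.

```c
typedef struct { int dummy; /* stands in for a queued BLAS work item */ } blas_queue_t;

static void exec_blas_work_item(blas_queue_t *q) { (void)q; /* do the work here */ }

/* Post-#2775 behaviour: the team size is pinned to the per-call `num`,
 * so the OpenMP runtime may have to spawn new worker threads whenever
 * `num` grows between calls. */
static void exec_blas_pinned(int num, blas_queue_t *queue) {
    int i;
    #pragma omp parallel for schedule(static) num_threads(num)
    for (i = 0; i < num; i++)
        exec_blas_work_item(&queue[i]);
}

/* Behaviour proposed by this PR: drop num_threads(num) and let the
 * runtime reuse its existing thread team; only `num` loop iterations
 * exist, so surplus threads simply get no work. */
static void exec_blas_reuse(int num, blas_queue_t *queue) {
    int i;
    #pragma omp parallel for schedule(static)
    for (i = 0; i < num; i++)
        exec_blas_work_item(&queue[i]);
}
```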

Here are the steps to run rbenchmark on CentOS 8:

  1. Install the R package
    $ sudo dnf install R
  2. Build OpenBLAS
    $ make TARGET=CORE2 USE_THREAD=1 USE_OPENMP=1 FC=gfortran CC=gcc LIBPREFIX="libopenblas" INTERFACE64=0
  3. Download and run the R benchmark
    $ wget http://www.phoronix-test-suite.com/benchmark-files/rbenchmarks-20160105.tar.bz2
    $ tar -xf rbenchmarks-20160105.tar.bz2
    $ cd rbenchmarks
    $ export LD_LIBRARY_PATH=
    $ Rscript R-benchmark-25/R-benchmark-25.R
    The benchmark's result looks like "Overall mean (sum of I, II and III trimmed means/3)_ (sec): 0.166433631462761".

@martin-frbg (Collaborator)
Reason given for this change in #2775 (@Guobing-Chen) was
"In current code, no matter what number of threads specified, all
available CPU count is used when invoking OMP, which leads to very bad
performance if the workload is small while all available CPUs are big.
Lots of time are wasted on inter-thread sync. Fix this issue by really
using the number specified by the variable 'num' from calling API."
so I am a bit sceptical; you may just be comparing different situations/workloads.

@kangshan1157 (Author)
Reason given for this change in #2775 (@Guobing-Chen) was "In current code, no matter what number of threads specified, all available CPU count is used when invoking OMP, which leads to very bad performance if the workload is small while all available CPUs are big. Lots of time are wasted on inter-thread sync. Fix this issue by really using the number specified by the variable 'num' from calling API." so I am a bit sceptical; you may just be comparing different situations/workloads.

Yes, we are using different workloads. Rbenchmark calculates the eigenvalues of a 640x640 random matrix. The "eigen" function continuously calls "dgeev_", which in turn calls "exec_blas", and each call computes a new num. With "num_threads(num)", suppose there are 112 logical cores in total on the hardware: if the first calculated num is 50, OpenMP creates 50 threads to handle the workload; if the second calculated num is 112, OpenMP creates 112 new threads to handle that workload and does not reuse the old ones. Without "num_threads(num)", there are 112 threads in the OpenMP thread pool and they are reused for both operations. Creating new threads causes a lot of overhead, and that is the root cause of rbenchmark's very poor performance without this patch.
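The thread re-creation cost described above can be observed with a small standalone timing program like the one below (an illustration only, not OpenBLAS code; the exact behaviour is implementation specific, and the 50/112 numbers simply mirror the example in this comment). Build with something like gcc -O2 -fopenmp.

```c
#include <omp.h>
#include <stdio.h>

/* Time one (empty) parallel region so only the fork/join cost is measured. */
static double time_region(int nthreads) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(nthreads)
    { /* no work: we only care about team setup/teardown */ }
    return omp_get_wtime() - t0;
}

int main(void) {
    /* With libgomp, growing the requested team typically forces new
     * threads to be spawned, while repeating the same size reuses them. */
    printf("first region,  50 threads: %.6f s\n", time_region(50));
    printf("second region, 50 threads: %.6f s\n", time_region(50));
    printf("grow to 112 threads:       %.6f s\n", time_region(112));
    printf("112 threads again:         %.6f s\n", time_region(112));
    return 0;
}
```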

@brada4 (Contributor) commented Mar 3, 2022

That is a rather ancient benchmark script; notably, it permits gc() in the metered section. Please re-check with benchmarks/scripts/R/deig.R.
How does the pthread version perform?

@kangshan1157 (Author)
That is a rather ancient benchmark script; notably, it permits gc() in the metered section. Please re-check with benchmarks/scripts/R/deig.R. How does the pthread version perform?

I tried "benchmarks/scripts/R/deig.R" and here are the openMP version (USE_THREAD=1 USE_OPENMP=1) results.
Without the patch
SIZE Flops Time
128x128 : 1927.93 MFlops 0.029000 sec
256x256 : 570.51 MFlops 0.784000 sec
384x384 : 721.25 MFlops 2.093000 sec
512x512 : 378.53 MFlops 9.453000 sec
640x640 : 409.15 MFlops 17.081000 sec
768x768 : 475.10 MFlops 25.419000 sec
896x896 : 582.22 MFlops 32.938000 sec
1024x1024 : 679.14 MFlops 42.150000 sec
1152x1152 : 784.43 MFlops 51.959000 sec
1280x1280 : 902.08 MFlops 61.979000 sec
1408x1408 : 1038.00 MFlops 71.692000 sec
1536x1536 : 1170.07 MFlops 82.570000 sec
1664x1664 : 1223.62 MFlops 100.386000 sec
1792x1792 : 1371.79 MFlops 111.837000 sec
1920x1920 : 1530.71 MFlops 123.274000 sec
2048x2048 : 1683.94 MFlops 135.995000 sec

With the patch:
SIZE Flops Time
128x128 : 1189.58 MFlops 0.047000 sec
256x256 : 4758.30 MFlops 0.094000 sec
384x384 : 7987.15 MFlops 0.189000 sec
512x512 : 8321.50 MFlops 0.430000 sec
640x640 : 11805.34 MFlops 0.592000 sec
768x768 : 14946.26 MFlops 0.808000 sec
896x896 : 18801.13 MFlops 1.020000 sec
1024x1024 : 15634.06 MFlops 1.831000 sec
1152x1152 : 17147.01 MFlops 2.377000 sec
1280x1280 : 15217.77 MFlops 3.674000 sec
1408x1408 : 24641.16 MFlops 3.020000 sec
1536x1536 : 32376.88 MFlops 2.984000 sec
1664x1664 : 22427.32 MFlops 5.477000 sec
1792x1792 : 29691.74 MFlops 5.167000 sec
1920x1920 : 27672.17 MFlops 6.819000 sec
2048x2048 : 37297.66 MFlops 6.140000 sec

@brada4 (Contributor) commented Mar 3, 2022

640x640 : 409.15 MFlops 17.081000 sec
640x640 : 11805.34 MFlops 0.592000 sec

Impressive indeed.

How does it compare to the pthread version, i.e. building without the USE_OPENMP parameter? Not picking on you, just curious; over the course of the day I will measure it myself.

@kangshan1157 (Author) commented Mar 3, 2022

640x640 : 409.15 MFlops 17.081000 sec
640x640 : 11805.34 MFlops 0.592000 sec

Impressive indeed.

How does it compare to the pthread version, i.e. building without the USE_OPENMP parameter? Not picking on you, just curious; over the course of the day I will measure it myself.

The pthread version is built with "USE_THREAD=1 USE_OPENMP=0"; here are its results:
Without the patch:
SIZE Flops Time
128x128 : 1747.19 MFlops 0.032000 sec
256x256 : 6675.83 MFlops 0.067000 sec
384x384 : 11265.46 MFlops 0.134000 sec
512x512 : 10194.43 MFlops 0.351000 sec
640x640 : 14147.29 MFlops 0.494000 sec
768x768 : 17733.59 MFlops 0.681000 sec
896x896 : 23049.46 MFlops 0.832000 sec
1024x1024 : 25355.14 MFlops 1.129000 sec
1152x1152 : 32169.25 MFlops 1.267000 sec
1280x1280 : 33987.89 MFlops 1.645000 sec
1408x1408 : 37832.39 MFlops 1.967000 sec
1536x1536 : 36293.24 MFlops 2.662000 sec
1664x1664 : 40050.35 MFlops 3.067000 sec
1792x1792 : 41009.69 MFlops 3.741000 sec
1920x1920 : 45175.12 MFlops 4.177000 sec
2048x2048 : 43463.21 MFlops 5.269000 sec
With patch:
SIZE Flops Time
128x128 : 1694.24 MFlops 0.033000 sec
256x256 : 7214.20 MFlops 0.062000 sec
384x384 : 11182.01 MFlops 0.135000 sec
512x512 : 11848.49 MFlops 0.302000 sec
640x640 : 18991.19 MFlops 0.368000 sec
768x768 : 24153.15 MFlops 0.500000 sec
896x896 : 30247.88 MFlops 0.634000 sec
1024x1024 : 30260.00 MFlops 0.946000 sec
1152x1152 : 42412.53 MFlops 0.961000 sec
1280x1280 : 42324.05 MFlops 1.321000 sec
1408x1408 : 41924.68 MFlops 1.775000 sec
1536x1536 : 35467.18 MFlops 2.724000 sec
1664x1664 : 40931.17 MFlops 3.001000 sec
1792x1792 : 41318.94 MFlops 3.713000 sec
1920x1920 : 45755.70 MFlops 4.124000 sec
2048x2048 : 42813.17 MFlops 5.349000 sec

@brada4 (Contributor) commented Mar 3, 2022

Looking deeper at the benchmark script:

  • an additional chol() (DGETRF+DPOTRF) is there, but otherwise it is the same O(n^3) drilldown as solve/eig.
  • apart from those few BLAS/LAPACK functions (interleaved with single-threaded gc()), the rest is single-threaded, but that was likely the best option back then. The summary result will be worse on a 30-core 1 GHz CPU than on a 10-core 3 GHz CPU.

@martin-frbg (Collaborator)
I wonder if this can be fixed without bringing back the original problem. Perhaps by making it conditional on the fraction of the total threads available (e.g. run with num_threads(num) only if num is less than half of the CPU count)? I guess this would introduce some weird new crossover points where performance suddenly changes for no apparent reason...
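That condition might look roughly like this (a sketch of the idea only, with a hypothetical dispatch_work_item helper; not a proposed patch):

```c
#include <omp.h>

extern void dispatch_work_item(int i);  /* hypothetical: run queued work item i */

static void exec_blas_conditional(int num) {
    int i;
    if (num < omp_get_max_threads() / 2) {
        /* Small request: pin the team so tiny workloads do not pay
         * for synchronizing every available CPU (the #2775 rationale). */
        #pragma omp parallel for num_threads(num)
        for (i = 0; i < num; i++)
            dispatch_work_item(i);
    } else {
        /* Large request: reuse the full existing team to avoid the
         * thread re-creation overhead this PR is about. */
        #pragma omp parallel for
        for (i = 0; i < num; i++)
            dispatch_work_item(i);
    }
}
```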

@brada4 (Contributor) commented Mar 3, 2022

That's a different problem, with the threading thresholds. Yes, lots of time is wasted when the heuristics there fail and spin up too many threads.
Actually you can see that the 128x128 sample is slightly slower after the change; if you set O_N_THR=1 it gets faster, roughly the same in OMP and pthread builds, with the fix applied or not.
This fix just exposes another long-hidden problem, in a different place.

Characteristic part on a Sandy Bridge 2-core non-HT NUC, pre-fix pthread build, dgemm.R, recent versions.

      SIZE             Flops                   Time
           128x128 :    4194.30 MFlops   0.001000 sec
!!         256x256 :    3050.40 MFlops   0.011000 sec
           384x384 :   28311.55 MFlops   0.004000 sec

thr=1

           128x128 :    4194.30 MFlops   0.001000 sec
           256x256 :   16777.22 MFlops   0.002000 sec
           384x384 :   18874.37 MFlops   0.006000 sec

EDIT: I doubted the initial measurements as they were only a summary, not raw data, so I asked for a known, consistent benchmark to be used. The more precise tool turned up a more favourable result than the initial assessment.

@kangshan1157 (Author)
@martin-frbg Apart from benchmarks/scripts/R/deig.R, what other benchmarks do I need to verify with? The deig.R benchmark does not seem stable enough, so I increased the loop count to 20 and collected the following data on Ice Lake.
From 128 To 384 Step=128 Loops=20
Without the patch (Flops in MFlops, Time in seconds):

| SIZE | round 0 Flops | round 0 Time | round 1 Flops | round 1 Time | round 2 Flops | round 2 Time | round 3 Flops | round 3 Time | round 4 Flops | round 4 Time | mean Flops | mean Time | rsd Flops | rsd Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128x128 | 2167.06 | 0.516 | 2188.26 | 0.511 | 2129.91 | 0.525 | 2101.88 | 0.532 | 2146.26 | 0.521 | 2146.674 | 0.521 | 1.55% | 1.55% |
| 256x256 | 457.25 | 19.564 | 432.09 | 20.703 | 458.37 | 19.516 | 438.19 | 20.415 | 456.18 | 19.61 | 448.416 | 19.9616 | 2.75% | 2.78% |
| 384x384 | 422.31 | 71.492 | 403.81 | 74.766 | 426.01 | 70.871 | 406.3 | 74.308 | 423.49 | 71.292 | 416.384 | 72.5458 | 2.51% | 2.53% |

With the patch (Flops in MFlops, Time in seconds):

| SIZE | round 0 Flops | round 0 Time | round 1 Flops | round 1 Time | round 2 Flops | round 2 Time | round 3 Flops | round 3 Time | round 4 Flops | round 4 Time | mean Flops | mean Time | rsd Flops | rsd Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128x128 | 2733.99 | 0.409 | 2675.12 | 0.418 | 2463 | 0.454 | 2564.68 | 0.436 | 2733.99 | 0.409 | 2774.69 | 0.403 | 4.25% | 4.84% |
| 256x256 | 5850.63 | 1.529 | 5591.01 | 1.6 | 5305.82 | 1.686 | 5580.54 | 1.603 | 5756.51 | 1.554 | 5893.02 | 1.518 | 3.53% | 3.95% |
| 384x384 | 8569.81 | 3.523 | 8890.29 | 3.396 | 7995.61 | 3.776 | 8521.43 | 3.543 | 8562.52 | 3.526 | 8398.17 | 3.595 | 3.84% | 3.84% |

With the patch vs. without the patch (ratio of mean Flops):

| SIZE | Speedup |
| --- | --- |
| 128x128 | 1.29x |
| 256x256 | 13.14x |
| 384x384 | 20.17x |

Without this patch, the larger the size, the more time it takes to complete, so I only picked 3 sizes. The patch improves this benchmark. I don't know what other benchmarks I need to verify with the patch.

@bartoldeman (Contributor) commented Jun 13, 2022

We found a similar slowdown case in the easybuild community, using numpy and svd (https://gist.githubusercontent.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276/raw/660904cb770197c3c841ab9b7084657b1aea5f32/numpy-benchmark.py)

In the end the problem is that both #2775 and this fix are right, depending on the use case, and the compiler used.

When you do an SVD via dgesdd, a lot of BLAS functions are called with wildly varying numbers of threads. If you vary num_threads between parallel regions there is a huge overhead, particularly with libgomp (GCC).

Here's an example for DGESDD (via mpimd-csc/flexiblas#7 (comment))
https://gist.github.com/0a2e1783e68b5aca8b69e0947c833082
from best to worst:

$ gfortran -lopenblas -O2 -fopenmp test_dgesdd.f90 -o test_dgesdd
$ OMP_NUM_THREADS=2 OMP_PROC_BIND=true ./test_dgesdd
 Time =    1.1620110740000000
$ OMP_NUM_THREADS=64 OMP_PROC_BIND=true ./test_dgesdd 
 Time =    7.8668489140000002
$ OMP_NUM_THREADS=64 OMP_PROC_BIND=false ./test_dgesdd 
 Time =    42.826767384000000

Using a little printf("%ld ", num); above the #pragma omp parallel for... I found that this single dgesdd call invokes the parallel region
16228 times and switches the number of threads 7900 times. A small sample looks like this: 1 3 1 4 1 4 1 4 1 4 1 5 2 5 2 5 2 5 2 6 2 6 2 6 and 64 64 8 64 8 64 64 64 64 8 64 8 64 (https://gist.github.com/c367d0bf460ed385b2d994fcee5723e6)

After a Google search, inspired by https://stackoverflow.com/questions/24440118/openmp-parallell-region-overhead-increase-when-num-threads-varies,
I adapted that test (https://gist.github.com/cb7b050f6d1f5a3893df4a1352714668), ran it on a node with 64 cores and watched the huge overhead:

$ gcc -O2 -fopenmp test.c -o test
$ OMP_PROC_BIND=true ./test
2 threads 137.097901
64 threads 4427.660024
2/64 alternating 175142.989028

Intel (or clang for that matter) doesn't have this issue to the same extent, but also has no speedup for 2 threads:

$ icc -O2 -fopenmp test.c -o test
$ OMP_PROC_BIND=true ./test
2 threads 4647.016525
64 threads 4935.026169
2/64 alternating 10691.881180
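The adapted test is presumably along these lines (a sketch assuming it mirrors the StackOverflow example; the exact code and reporting units are in the linked gist):

```c
#include <omp.h>
#include <stdio.h>

#define ITERS 10000

/* Time ITERS empty parallel regions whose team size alternates
 * between `a` and `b`; with a == b the size never changes. */
static double run(int a, int b) {
    double t0 = omp_get_wtime();
    for (int i = 0; i < ITERS; i++) {
        int n = (i % 2) ? b : a;
        #pragma omp parallel num_threads(n)
        { /* empty body: measure region overhead only */ }
    }
    return (omp_get_wtime() - t0) * 1e6;  /* report in microseconds */
}

int main(void) {
    printf("2 threads          %f\n", run(2, 2));
    printf("64 threads         %f\n", run(64, 64));
    printf("2/64 alternating   %f\n", run(2, 64));
    return 0;
}
```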

I'm not quite sure what the best solution is. Clearly, if the number of threads stays low there is a performance benefit (with libgomp), in this case a factor of about 32 (which happens to be exactly 64/2, so roughly linear in the number of threads).

But switching too often is catastrophic, by a factor of 40 or so.

Perhaps a heuristic that could work is to keep track of, say, the latest 32 OpenMP regions and set num_threads to the maximum num of those regions? Then if you have 32 low-threaded calls in a row you can still take advantage of the performance benefit.
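A minimal sketch of that heuristic, assuming it would sit next to the existing computation of `num` (illustration only; a real version would need to be thread-safe):

```c
#define HISTORY 32

static int history[HISTORY];
static int history_len = 0;
static int history_pos = 0;

/* Remember the thread counts of the last HISTORY parallel regions and
 * return their maximum, so the team only shrinks after a sustained run
 * of small requests.  Not thread-safe as written. */
static int smoothed_num_threads(int requested) {
    int i, max = requested;

    history[history_pos] = requested;
    history_pos = (history_pos + 1) % HISTORY;
    if (history_len < HISTORY) history_len++;

    for (i = 0; i < history_len; i++)
        if (history[i] > max)
            max = history[i];
    return max;  /* never smaller than the current request */
}

/* usage sketch:
 *   num = smoothed_num_threads(num);
 *   #pragma omp parallel for num_threads(num) ...
 */
```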

@martin-frbg (Collaborator)
Thanks. I have not found a solution I like: keeping track of past behaviour adds its own overhead and is not particularly good at predicting the future unless the program really does the same computation over and over. Perhaps something as trivial as introducing a new environment variable "OPENBLAS_ADAPTIVE" to choose between pre- and post-#2775 behaviour on startup would already help?
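A minimal sketch of that idea, using the variable name from the comment above (the name and default eventually chosen in #3703 may differ):

```c
#include <stdlib.h>
#include <string.h>

/* Which behaviour to use; the default here is arbitrary for the sketch. */
static int openblas_adaptive = 1;   /* 1 = post-#2775 (num_threads(num)) */

/* Read the setting once at library initialization. */
static void read_adaptive_setting(void) {
    const char *env = getenv("OPENBLAS_ADAPTIVE");
    if (env != NULL && strcmp(env, "0") == 0)
        openblas_adaptive = 0;      /* fall back to pre-#2775 behaviour */
}
```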

@martin-frbg (Collaborator)
attempting to supersede this with #3703, using a new environment variable to choose between the two modes of operation

@kevincwells
Since #3703 has been merged, can this PR now be closed?
