
No need to set OMP num_threads #3546

Closed
wants to merge 1 commit

Conversation

@kangshan1157 commented Feb 25, 2022

There is no need to set num_threads: num_threads(num) causes extra thread-creation overhead in some scenarios. In OpenMP, if the requested thread count is larger than the count used by the previous parallel region, new threads have to be created.
With this patch, pts/rbenchmark-1.0.3 improves by 2.375x (0.385 secs vs. 0.164 secs) on an Ice Lake server under CentOS 8.
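For context, the change is roughly of the following shape. This is a simplified sketch, not the verbatim OpenBLAS source; `blas_queue_t` and `exec_blas_work_item` are stand-in names for the real work-queue structures in the OpenMP server code.

```c
typedef struct { int dummy; /* stands in for a queued BLAS work item */ } blas_queue_t;

static void exec_blas_work_item(blas_queue_t *q) { (void)q; /* do the work here */ }

/* Post-#2775 behaviour: the team size is pinned to the per-call `num`,
 * so the OpenMP runtime may have to spawn new worker threads whenever
 * `num` grows between calls. */
static void exec_blas_pinned(int num, blas_queue_t *queue) {
    int i;
    #pragma omp parallel for schedule(static) num_threads(num)
    for (i = 0; i < num; i++)
        exec_blas_work_item(&queue[i]);
}

/* Behaviour proposed by this PR: drop num_threads(num) and let the
 * runtime reuse its existing thread team; only `num` loop iterations
 * exist, so surplus threads simply get no work. */
static void exec_blas_reuse(int num, blas_queue_t *queue) {
    int i;
    #pragma omp parallel for schedule(static)
    for (i = 0; i < num; i++)
        exec_blas_work_item(&queue[i]);
}
```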

Here are the steps to run rbenchmark on CentOS 8:

  1. Install the R package
    $ sudo dnf install R
  2. Build OpenBLAS
    $ make TARGET=CORE2 USE_THREAD=1 USE_OPENMP=1 FC=gfortran CC=gcc LIBPREFIX="libopenblas" INTERFACE64=0
  3. Download and run the R benchmark
    $ wget http://www.phoronix-test-suite.com/benchmark-files/rbenchmarks-20160105.tar.bz2
    $ tar -xf rbenchmarks-20160105.tar.bz2
    $ cd rbenchmarks
    $ export LD_LIBRARY_PATH=
    $ Rscript R-benchmark-25/R-benchmark-25.R
    The benchmark's result looks like "Overall mean (sum of I, II and III trimmed means/3)_ (sec): 0.166433631462761".

@martin-frbg (Collaborator)
Reason given for this change in #2775 (@Guobing-Chen) was
"In current code, no matter what number of threads specified, all
available CPU count is used when invoking OMP, which leads to very bad
performance if the workload is small while all available CPUs are big.
Lots of time are wasted on inter-thread sync. Fix this issue by really
using the number specified by the variable 'num' from calling API."
so I am a bit sceptical; you may just be comparing different situations/workloads.

@kangshan1157 (Author)
Reason given for this change in #2775 (@Guobing-Chen) was "In current code, no matter what number of threads specified, all available CPU count is used when invoking OMP, which leads to very bad performance if the workload is small while all available CPUs are big. Lots of time are wasted on inter-thread sync. Fix this issue by really using the number specified by the variable 'num' from calling API." so I am a bit sceptical; you may just be comparing different situations/workloads.

Yes, we are using different workloads. Rbenchmark calculates the eigenvalues of a 640x640 random matrix. The "eigen" function continuously calls "dgeev_", which in turn calls "exec_blas", and each call computes a new num. With "num_threads(num)", suppose there are 112 logical cores in total on the hardware: if the first calculated num is 50, OpenMP creates 50 threads to handle the workload; if the second calculated num is 112, OpenMP creates 112 new threads to handle that workload and does not reuse the old ones. Without "num_threads(num)", there are 112 threads in the OpenMP thread pool and they are reused for both operations. Creating new threads causes a lot of overhead, and that is the root cause of rbenchmark's very poor performance without this patch.
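The thread re-creation cost described above can be observed with a small standalone timing program like the one below (an illustration only, not OpenBLAS code; the exact behaviour is implementation specific, and the 50/112 numbers simply mirror the example in this comment). Build with something like gcc -O2 -fopenmp.

```c
#include <omp.h>
#include <stdio.h>

/* Time one (empty) parallel region so only the fork/join cost is measured. */
static double time_region(int nthreads) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(nthreads)
    { /* no work: we only care about team setup/teardown */ }
    return omp_get_wtime() - t0;
}

int main(void) {
    /* With libgomp, growing the requested team typically forces new
     * threads to be spawned, while repeating the same size reuses them. */
    printf("first region,  50 threads: %.6f s\n", time_region(50));
    printf("second region, 50 threads: %.6f s\n", time_region(50));
    printf("grow to 112 threads:       %.6f s\n", time_region(112));
    printf("112 threads again:         %.6f s\n", time_region(112));
    return 0;
}
```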

@brada4 (Contributor) commented Mar 3, 2022

That is a rather ancient benchmark script; notably, it permits gc() in the metered section. Please re-check with benchmarks/scripts/R/deig.R.
How does the pthread version perform?

@kangshan1157 (Author)
That is a rather ancient benchmark script; notably, it permits gc() in the metered section. Please re-check with benchmarks/scripts/R/deig.R. How does the pthread version perform?

I tried "benchmarks/scripts/R/deig.R" and here are the openMP version (USE_THREAD=1 USE_OPENMP=1) results.
Without the patch
SIZE Flops Time
128x128 : 1927.93 MFlops 0.029000 sec
256x256 : 570.51 MFlops 0.784000 sec
384x384 : 721.25 MFlops 2.093000 sec
512x512 : 378.53 MFlops 9.453000 sec
640x640 : 409.15 MFlops 17.081000 sec
768x768 : 475.10 MFlops 25.419000 sec
896x896 : 582.22 MFlops 32.938000 sec
1024x1024 : 679.14 MFlops 42.150000 sec
1152x1152 : 784.43 MFlops 51.959000 sec
1280x1280 : 902.08 MFlops 61.979000 sec
1408x1408 : 1038.00 MFlops 71.692000 sec
1536x1536 : 1170.07 MFlops 82.570000 sec
1664x1664 : 1223.62 MFlops 100.386000 sec
1792x1792 : 1371.79 MFlops 111.837000 sec
1920x1920 : 1530.71 MFlops 123.274000 sec
2048x2048 : 1683.94 MFlops 135.995000 sec

With the patch:
SIZE Flops Time
128x128 : 1189.58 MFlops 0.047000 sec
256x256 : 4758.30 MFlops 0.094000 sec
384x384 : 7987.15 MFlops 0.189000 sec
512x512 : 8321.50 MFlops 0.430000 sec
640x640 : 11805.34 MFlops 0.592000 sec
768x768 : 14946.26 MFlops 0.808000 sec
896x896 : 18801.13 MFlops 1.020000 sec
1024x1024 : 15634.06 MFlops 1.831000 sec
1152x1152 : 17147.01 MFlops 2.377000 sec
1280x1280 : 15217.77 MFlops 3.674000 sec
1408x1408 : 24641.16 MFlops 3.020000 sec
1536x1536 : 32376.88 MFlops 2.984000 sec
1664x1664 : 22427.32 MFlops 5.477000 sec
1792x1792 : 29691.74 MFlops 5.167000 sec
1920x1920 : 27672.17 MFlops 6.819000 sec
2048x2048 : 37297.66 MFlops 6.140000 sec

@brada4 (Contributor) commented Mar 3, 2022

640x640 : 409.15 MFlops 17.081000 sec
640x640 : 11805.34 MFlops 0.592000 sec

Impressive indeed.

How does it compare to the pthread version, i.e. building without the USE_OPENMP parameter? Not picking on you, just curious; over the course of the day I will measure it myself.

@kangshan1157 (Author) commented Mar 3, 2022

640x640 : 409.15 MFlops 17.081000 sec
640x640 : 11805.34 MFlops 0.592000 sec

Impressive indeed.

How does it compare to the pthread version, i.e. building without the USE_OPENMP parameter? Not picking on you, just curious; over the course of the day I will measure it myself.

The pthread version is built with "USE_THREAD=1 USE_OPENMP=0"; here are its results:
Without the patch:
SIZE Flops Time
128x128 : 1747.19 MFlops 0.032000 sec
256x256 : 6675.83 MFlops 0.067000 sec
384x384 : 11265.46 MFlops 0.134000 sec
512x512 : 10194.43 MFlops 0.351000 sec
640x640 : 14147.29 MFlops 0.494000 sec
768x768 : 17733.59 MFlops 0.681000 sec
896x896 : 23049.46 MFlops 0.832000 sec
1024x1024 : 25355.14 MFlops 1.129000 sec
1152x1152 : 32169.25 MFlops 1.267000 sec
1280x1280 : 33987.89 MFlops 1.645000 sec
1408x1408 : 37832.39 MFlops 1.967000 sec
1536x1536 : 36293.24 MFlops 2.662000 sec
1664x1664 : 40050.35 MFlops 3.067000 sec
1792x1792 : 41009.69 MFlops 3.741000 sec
1920x1920 : 45175.12 MFlops 4.177000 sec
2048x2048 : 43463.21 MFlops 5.269000 sec
With patch:
SIZE Flops Time
128x128 : 1694.24 MFlops 0.033000 sec
256x256 : 7214.20 MFlops 0.062000 sec
384x384 : 11182.01 MFlops 0.135000 sec
512x512 : 11848.49 MFlops 0.302000 sec
640x640 : 18991.19 MFlops 0.368000 sec
768x768 : 24153.15 MFlops 0.500000 sec
896x896 : 30247.88 MFlops 0.634000 sec
1024x1024 : 30260.00 MFlops 0.946000 sec
1152x1152 : 42412.53 MFlops 0.961000 sec
1280x1280 : 42324.05 MFlops 1.321000 sec
1408x1408 : 41924.68 MFlops 1.775000 sec
1536x1536 : 35467.18 MFlops 2.724000 sec
1664x1664 : 40931.17 MFlops 3.001000 sec
1792x1792 : 41318.94 MFlops 3.713000 sec
1920x1920 : 45755.70 MFlops 4.124000 sec
2048x2048 : 42813.17 MFlops 5.349000 sec

@brada4 (Contributor) commented Mar 3, 2022

Looking deeper at the benchmark script:

  • an additional chol() (DGETRF+DPOTRF) is there, but otherwise it is the same O(n^3) drilldown as solve/eig.
  • apart from those few BLAS/LAPACK functions (interleaved with single-threaded gc()), the rest is single-threaded, but that was likely the best option back then. The summary result will be worse on a 30-core 1 GHz CPU than on a 10-core 3 GHz CPU.

@martin-frbg (Collaborator)
I wonder if this can be fixed without bringing back the original problem. Perhaps by making it conditional on the fraction of the total threads available (e.g. run with num_threads(num) only if num is less than half of the CPU count)? I guess this would introduce some weird new crossover points where performance suddenly changes for no apparent reason...
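That condition might look roughly like this (a sketch of the idea only, with a hypothetical dispatch_work_item helper; not a proposed patch):

```c
#include <omp.h>

extern void dispatch_work_item(int i);  /* hypothetical: run queued work item i */

static void exec_blas_conditional(int num) {
    int i;
    if (num < omp_get_max_threads() / 2) {
        /* Small request: pin the team so tiny workloads do not pay
         * for synchronizing every available CPU (the #2775 rationale). */
        #pragma omp parallel for num_threads(num)
        for (i = 0; i < num; i++)
            dispatch_work_item(i);
    } else {
        /* Large request: reuse the full existing team to avoid the
         * thread re-creation overhead this PR is about. */
        #pragma omp parallel for
        for (i = 0; i < num; i++)
            dispatch_work_item(i);
    }
}
```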

@brada4 (Contributor) commented Mar 3, 2022

That's a different problem, with the threading thresholds. Yes, lots of time is wasted when the heuristics there fail and spin up too many threads.
Actually you can see that the 128x128 sample is slightly slower after the change; if you set O_N_THR=1 it gets faster, roughly the same in OMP and pthread builds, with the fix applied or not.
This fix just exposes another long-hidden problem, in a different place.

Characteristic part on a Sandy Bridge 2-core non-HT NUC, pre-fix pthread build, dgemm.R, recent versions.

      SIZE             Flops                   Time
           128x128 :    4194.30 MFlops   0.001000 sec
!!         256x256 :    3050.40 MFlops   0.011000 sec
           384x384 :   28311.55 MFlops   0.004000 sec

thr=1

           128x128 :    4194.30 MFlops   0.001000 sec
           256x256 :   16777.22 MFlops   0.002000 sec
           384x384 :   18874.37 MFlops   0.006000 sec

EDIT: I doubted the initial measurements as they were only a summary, not raw data, so I asked for a known, consistent benchmark to be used. The more precise tool turned up a more favourable result than the initial assessment.

@kangshan1157 (Author)
@martin-frbg Apart from benchmarks/scripts/R/deig.R, what other benchmarks do I need to verify with? The deig.R benchmark does not seem stable enough, so I increased the loop count to 20 and collected the following data on Ice Lake.
From 128 To 384 Step=128 Loops=20
Without the patch (Flops in MFlops, Time in seconds):

| SIZE | round 0 Flops | round 0 Time | round 1 Flops | round 1 Time | round 2 Flops | round 2 Time | round 3 Flops | round 3 Time | round 4 Flops | round 4 Time | mean Flops | mean Time | rsd Flops | rsd Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128x128 | 2167.06 | 0.516 | 2188.26 | 0.511 | 2129.91 | 0.525 | 2101.88 | 0.532 | 2146.26 | 0.521 | 2146.674 | 0.521 | 1.55% | 1.55% |
| 256x256 | 457.25 | 19.564 | 432.09 | 20.703 | 458.37 | 19.516 | 438.19 | 20.415 | 456.18 | 19.61 | 448.416 | 19.9616 | 2.75% | 2.78% |
| 384x384 | 422.31 | 71.492 | 403.81 | 74.766 | 426.01 | 70.871 | 406.3 | 74.308 | 423.49 | 71.292 | 416.384 | 72.5458 | 2.51% | 2.53% |

With the patch (Flops in MFlops, Time in seconds):

| SIZE | round 0 Flops | round 0 Time | round 1 Flops | round 1 Time | round 2 Flops | round 2 Time | round 3 Flops | round 3 Time | round 4 Flops | round 4 Time | mean Flops | mean Time | rsd Flops | rsd Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128x128 | 2733.99 | 0.409 | 2675.12 | 0.418 | 2463 | 0.454 | 2564.68 | 0.436 | 2733.99 | 0.409 | 2774.69 | 0.403 | 4.25% | 4.84% |
| 256x256 | 5850.63 | 1.529 | 5591.01 | 1.6 | 5305.82 | 1.686 | 5580.54 | 1.603 | 5756.51 | 1.554 | 5893.02 | 1.518 | 3.53% | 3.95% |
| 384x384 | 8569.81 | 3.523 | 8890.29 | 3.396 | 7995.61 | 3.776 | 8521.43 | 3.543 | 8562.52 | 3.526 | 8398.17 | 3.595 | 3.84% | 3.84% |

With the patch vs. without the patch (ratio of mean Flops):

| SIZE | Speedup |
| --- | --- |
| 128x128 | 1.29x |
| 256x256 | 13.14x |
| 384x384 | 20.17x |

Without this patch, the larger the size, the more time it takes to complete, so I only picked 3 sizes. The patch improves this benchmark. I don't know what other benchmarks I need to verify with the patch.

@bartoldeman (Contributor) commented Jun 13, 2022

We found a similar slowdown case in the easybuild community, using numpy and svd (https://gist.githubusercontent.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276/raw/660904cb770197c3c841ab9b7084657b1aea5f32/numpy-benchmark.py)

In the end the problem is that both #2775 and this fix are right, depending on the use case, and the compiler used.

When you do an SVD via dgesdd, a lot of BLAS functions are called with wildly varying numbers of threads. If you vary num_threads between parallel regions there is a huge overhead, particularly with libgomp (GCC).

Here's an example for DGESDD (via mpimd-csc/flexiblas#7 (comment))
https://gist.github.com/0a2e1783e68b5aca8b69e0947c833082
from best to worst:

$ gfortran -lopenblas -O2 -fopenmp test_dgesdd.f90 -o test_dgesdd
$ OMP_NUM_THREADS=2 OMP_PROC_BIND=true ./test_dgesdd
 Time =    1.1620110740000000
$ OMP_NUM_THREADS=64 OMP_PROC_BIND=true ./test_dgesdd 
 Time =    7.8668489140000002
$ OMP_NUM_THREADS=64 OMP_PROC_BIND=false ./test_dgesdd 
 Time =    42.826767384000000

Using a little printf("%ld ", num); above the #pragma omp parallel for... I found that this single dgesdd call invokes the parallel region
16228 times and switches the number of threads 7900 times. A small sample looks like this: 1 3 1 4 1 4 1 4 1 4 1 5 2 5 2 5 2 5 2 6 2 6 2 6 and 64 64 8 64 8 64 64 64 64 8 64 8 64 (https://gist.github.com/c367d0bf460ed385b2d994fcee5723e6)

After a Google search, inspired by https://stackoverflow.com/questions/24440118/openmp-parallell-region-overhead-increase-when-num-threads-varies,
I adapted that test (https://gist.github.com/cb7b050f6d1f5a3893df4a1352714668), ran it on a node with 64 cores and watched the huge overhead:

$ gcc -O2 -fopenmp test.c -o test
$ OMP_PROC_BIND=true ./test
2 threads 137.097901
64 threads 4427.660024
2/64 alternating 175142.989028

Intel (or clang for that matter) doesn't have this issue to the same extent, but also has no speedup for 2 threads:

$ icc -O2 -fopenmp test.c -o test
$ OMP_PROC_BIND=true ./test
2 threads 4647.016525
64 threads 4935.026169
2/64 alternating 10691.881180
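The adapted test is presumably along these lines (a sketch assuming it mirrors the StackOverflow example; the exact code and reporting units are in the linked gist):

```c
#include <omp.h>
#include <stdio.h>

#define ITERS 10000

/* Time ITERS empty parallel regions whose team size alternates
 * between `a` and `b`; with a == b the size never changes. */
static double run(int a, int b) {
    double t0 = omp_get_wtime();
    for (int i = 0; i < ITERS; i++) {
        int n = (i % 2) ? b : a;
        #pragma omp parallel num_threads(n)
        { /* empty body: measure region overhead only */ }
    }
    return (omp_get_wtime() - t0) * 1e6;  /* report in microseconds */
}

int main(void) {
    printf("2 threads          %f\n", run(2, 2));
    printf("64 threads         %f\n", run(64, 64));
    printf("2/64 alternating   %f\n", run(2, 64));
    return 0;
}
```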

I'm not quite sure what the best solution is. Clearly, if the number of threads stays low there is a performance benefit (with libgomp), in this case a factor of about 32 (which happens to be exactly 64/2, so roughly linear in the number of threads).

But switching too often is catastrophic, by a factor of 40 or so.

Perhaps a heuristic that could work is to keep track of, say, the latest 32 OpenMP regions and set num_threads to the maximum num of those regions? Then if you have 32 low-threaded calls in a row you can still take advantage of the performance benefit.
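A minimal sketch of that heuristic, assuming it would sit next to the existing computation of `num` (illustration only; a real version would need to be thread-safe):

```c
#define HISTORY 32

static int history[HISTORY];
static int history_len = 0;
static int history_pos = 0;

/* Remember the thread counts of the last HISTORY parallel regions and
 * return their maximum, so the team only shrinks after a sustained run
 * of small requests.  Not thread-safe as written. */
static int smoothed_num_threads(int requested) {
    int i, max = requested;

    history[history_pos] = requested;
    history_pos = (history_pos + 1) % HISTORY;
    if (history_len < HISTORY) history_len++;

    for (i = 0; i < history_len; i++)
        if (history[i] > max)
            max = history[i];
    return max;  /* never smaller than the current request */
}

/* usage sketch:
 *   num = smoothed_num_threads(num);
 *   #pragma omp parallel for num_threads(num) ...
 */
```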

@martin-frbg (Collaborator)
Thanks. I have not found a solution I like: keeping track of past behaviour adds its own overhead and is not particularly good at predicting the future unless the program really does the same computation over and over. Perhaps something as trivial as introducing a new environment variable "OPENBLAS_ADAPTIVE" to choose between pre- and post-#2775 behaviour on startup would already help?
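A minimal sketch of that idea, using the variable name from the comment above (the name and default eventually chosen in #3703 may differ):

```c
#include <stdlib.h>
#include <string.h>

/* Which behaviour to use; the default here is arbitrary for the sketch. */
static int openblas_adaptive = 1;   /* 1 = post-#2775 (num_threads(num)) */

/* Read the setting once at library initialization. */
static void read_adaptive_setting(void) {
    const char *env = getenv("OPENBLAS_ADAPTIVE");
    if (env != NULL && strcmp(env, "0") == 0)
        openblas_adaptive = 0;      /* fall back to pre-#2775 behaviour */
}
```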

@martin-frbg (Collaborator)
attempting to supersede this with #3703, using a new environment variable to choose between the two modes of operation

@kevincwells
Since #3703 has been merged, can this PR now be closed?
