Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds OpenMP to qsort, should also improve test speed a bit #179

Merged
merged 7 commits into from
Mar 28, 2025

Conversation

sterrettm2
Copy link
Contributor

This adds OpenMP acceleration to quicksort, in addition to making some changes to the testing code. On my current testing system, this gives up to a 3x speedup, though I think more may be achieved on a stronger system.

Benchmarks:

10m
Comparing simdsort/random_10m (from /home/sterrettm/simd-sort/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_10m (from /home/sterrettm/simd-sort/x86-simd-sort/.bench/qsort-openmp/builddir/benchexe)
Benchmark                                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------------ 50815267
[simdsort/random_10m vs. simdsort/random_10m]/uint64_t_mean                  -0.6462         -0.6461     161374745      57099765     161274843      57081772
[simdsort/random_10m vs. simdsort/random_10m]/int64_t_mean                   -0.6722         -0.6720     161381007      52894903     161242920      52883775
[simdsort/random_10m vs. simdsort/random_10m]/uint32_t_mean                  -0.6683         -0.6684      76346854      25321800      76335248      25316127
[simdsort/random_10m vs. simdsort/random_10m]/int32_t_mean                   -0.6461         -0.6462      74568936      26387521      74555377      26374643
[simdsort/random_10m vs. simdsort/random_10m]/uint16_t_mean                  +0.0286         +0.0286     306422934     315196752     306379370     315156800
[simdsort/random_10m vs. simdsort/random_10m]/int16_t_mean                   +0.0159         +0.0159     308471111     313372843     308432095     313323090
[simdsort/random_10m vs. simdsort/random_10m]/float_mean                     -0.6593         -0.6593      75784101      25821957      75777886      25815941
[simdsort/random_10m vs. simdsort/random_10m]/double_mean                    -0.6613         -0.6613     140408876      47560519     140340074      47530458
OVERALL_GEOMEAN                                                              -0.5520         -0.5520             0             0             0             0
1m
Comparing simdsort/random_1m (from /home/sterrettm/simd-sort/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_1m (from /home/sterrettm/simd-sort/x86-simd-sort/.bench/qsort-openmp/builddir/benchexe)
Benchmark                                                                     Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_1m vs. simdsort/random_1m]/uint64_t_mean                  -0.5563         -0.5564      13892838       6163743      13893768       6162738
[simdsort/random_1m vs. simdsort/random_1m]/int64_t_mean                   -0.5439         -0.5441      13943919       6359981      13944683       6358016
[simdsort/random_1m vs. simdsort/random_1m]/uint32_t_mean                  -0.4661         -0.4662       6244794       3334286       6244370       3333435
[simdsort/random_1m vs. simdsort/random_1m]/int32_t_mean                   -0.3862         -0.3861       6172364       3788625       6173740       3789904
[simdsort/random_1m vs. simdsort/random_1m]/uint16_t_mean                  -0.0138         -0.0138      31545637      31108834      31548000      31111137
[simdsort/random_1m vs. simdsort/random_1m]/int16_t_mean                   -0.0076         -0.0076      30964472      30728775      30967064      30730877467       3261958
[simdsort/random_1m vs. simdsort/random_1m]/float_mean                     -0.4568         -0.4566       6418410       3486598       6419908       3488443
[simdsort/random_1m vs. simdsort/random_1m]/double_mean                    -0.5486         -0.5488      11469483       5177166      11469799       5174866
OVERALL_GEOMEAN                                                            -0.4088         -0.4089             0             0             0             0

And to show that there does not seem to be a regression with small sizes due to the SIMD logic:

128
Benchmark                                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_128 vs. simdsort/random_128]/uint64_t_mean                  -0.0442         -0.0512           866           828           878           833
[simdsort/random_128 vs. simdsort/random_128]/int64_t_mean                   -0.0006         -0.0028           821           821           828           826
[simdsort/random_128 vs. simdsort/random_128]/uint32_t_mean                  +0.0152         +0.0150           583           591           587           596
[simdsort/random_128 vs. simdsort/random_128]/int32_t_mean                   +0.0210         +0.0150           582           594           588           597
[simdsort/random_128 vs. simdsort/random_128]/uint16_t_mean                  -0.0034         +0.0008          1933          1926          1939          1941
[simdsort/random_128 vs. simdsort/random_128]/int16_t_mean                   +0.0061         +0.0104          1889          1901          1896          1916
[simdsort/random_128 vs. simdsort/random_128]/float_mean                     +0.0012         +0.0022           599           600           605           606
[simdsort/random_128 vs. simdsort/random_128]/double_mean                    +0.0079         +0.0086           707           712           711           717
OVERALL_GEOMEAN                                                              +0.0003         -0.0003             0             0             0             0

In addition, this adds larger tests for quicksort to test the OpenMP logic. Out of concern for the runtime of the test suite, I modified both these new tests and the older kv tests to only use the larger sizes for the main sort, the only one that uses OpenMP and thus needs them. I believe this should result in a noticeable net reduction in the runtime of the test suite.

@r-devulap
Copy link
Contributor

[simdsort/random_10m vs. simdsort/random_10m]/uint16_t_mean                  +0.0286         +0.0286     306422934     315196752     306379370     315156800
[simdsort/random_10m vs. simdsort/random_10m]/int16_t_mean                   +0.0159         +0.0159     308471111     313372843     308432095     313323090

I assume this was benchmarked on a SKX?

r-devulap
r-devulap previously approved these changes Mar 25, 2025
Copy link
Contributor

@r-devulap r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I noticed you use the same thresholds mostly from the key-value sort. Did you get a chance to experiment with it?

@sterrettm2
Copy link
Contributor Author

Oops, that int16 slowness seems to be a real issue. Let me fix that quickly

@sterrettm2
Copy link
Contributor Author

sterrettm2 commented Mar 26, 2025

Okay, this should be fixed now. Here are some benchmarks from my SPR machine:
EDIT: Added the 16-bit swizzle patch

SPR Benchmarks
Benchmark                                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_10m/ vs. simdsort/random_10m/]uint64_t                -0.6771         -0.6771      76630598      24743835      76628741      24743600
[simdsort/random_10m/ vs. simdsort/random_10m/]int64_t                 -0.6819         -0.6819      77024679      24503926      77018065      24497835
[simdsort/random_10m/ vs. simdsort/random_10m/]uint32_t                -0.7413         -0.7414      31002898       8020485      30998289       8016108
[simdsort/random_10m/ vs. simdsort/random_10m/]int32_t                 -0.7382         -0.7382      30928995       8098572      30928261       8097811
[simdsort/random_10m/ vs. simdsort/random_10m/]uint16_t                -0.9503         -0.9503     113969113       5663682     113952771       5663444
[simdsort/random_10m/ vs. simdsort/random_10m/]int16_t                 -0.9505         -0.9505     114573142       5673764     114556807       5673571
[simdsort/random_10m/ vs. simdsort/random_10m/]float                   -0.7369         -0.7370      30276061       7965491      30275684       7963746
[simdsort/random_10m/ vs. simdsort/random_10m/]double                  -0.6361         -0.6360      63558447      23129980      63549471      23129564
[simdsort/random_10m/ vs. simdsort/random_10m/]_Float16                -0.9435         -0.9435      77698793       4391234      77689627       4390512
OVERALL_GEOMEAN                                                        -0.8346         -0.8346             0             0             0             0

@r-devulap r-devulap merged commit b12fe4d into intel:main Mar 28, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants