
std: hash_map: optimize isFree/isTombstone #10562

Merged

Conversation

sentientwaffle
Contributor

Optimize HashMap.

  • Add a `Metadata.isFree` helper method (not an optimization, just for clarity).
  • Implement `Metadata.isTombstone` and `Metadata.isFree` with a `@bitCast` followed by a comparison against a constant (a standalone sketch follows below). I assume the `@bitCast`-then-compare form is faster than the old method because it involves only one comparison and no bitmasking.
  • Summary of benchmark results (`gotta-go-fast`, run locally, compared to master):
    • 3/4 of the hash map benchmarks use ~10% fewer cycles.
    • The last one (project Euler) uses ~4% fewer cycles.
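
For concreteness, here is a hedged, standalone sketch of the change (not the actual `std/hash_map.zig` source): the field names, the `slot_free` constant, and the `@bitCast(u8, ...)` comparison follow the discussion in this PR, while the fingerprint width and the exact free/tombstone encodings are assumptions. The two-argument `@bitCast` matches the Zig syntax current when this PR was merged.

```zig
const std = @import("std");

// Hypothetical reduction of the per-slot metadata byte; the u7 fingerprint and
// the 0/1 encodings for free/tombstone are assumptions for illustration.
const Metadata = packed struct {
    const FingerPrint = u7;
    const free: FingerPrint = 0;
    const tombstone: FingerPrint = 1;

    fingerprint: FingerPrint = free,
    used: u1 = 0,

    // Whole-byte patterns for the two special slot states.
    const slot_free = @bitCast(u8, Metadata{ .fingerprint = free });
    const slot_tombstone = @bitCast(u8, Metadata{ .fingerprint = tombstone });

    pub fn isUsed(self: Metadata) bool {
        return self.used == 1;
    }

    // Old style: check the fields independently (mask + compare per field).
    pub fn isTombstoneFieldwise(self: Metadata) bool {
        return !self.isUsed() and self.fingerprint == tombstone;
    }

    // New style: reinterpret the whole byte and compare against one constant.
    pub fn isTombstone(self: Metadata) bool {
        return @bitCast(u8, self) == slot_tombstone;
    }

    pub fn isFree(self: Metadata) bool {
        return @bitCast(u8, self) == slot_free;
    }
};

test "free, tombstone, and used states are distinguishable" {
    const empty = Metadata{};
    try std.testing.expect(empty.isFree());
    try std.testing.expect(!empty.isTombstone());

    const dead = Metadata{ .fingerprint = Metadata.tombstone };
    try std.testing.expect(dead.isTombstone());
    try std.testing.expect(!dead.isFree());

    const live = Metadata{ .fingerprint = 42, .used = 1 };
    try std.testing.expect(live.isUsed());
    try std.testing.expect(!live.isFree() and !live.isTombstone());
}
```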
Benchmarks
random-find

| metric | available (ef0566d) | pre-available-fix (d54ba76) | available-faster (3861c96) | is-free-tomb-bitcasts (d1c8fb5) |
| --- | --- | --- | --- | --- |
| task-clock (ms) | 16,534 | 13,772 | 18,384 | 14,945 |
| context-switches | 0 | 0 | 0 | 0 |
| cpu-migrations | 0 | 0 | 0 | 0 |
| page-faults | 467 | 469 | 467 | 467 |
| cycles | 31,104,803,977 | 25,945,119,875 | 34,588,214,725 | 28,155,425,962 |
| instructions | 80,226,182,400 | 70,491,433,560 | 86,563,110,421 | 90,400,354,978 |
| branches | 14,189,529,026 | 11,809,816,759 | 12,769,551,259 | 14,083,913,924 |
| branch misses | 319,005,834 | 229,735,717 | 252,086,595 | 122,622,922 |
| L1-dcache-loads | 8,896,046,933 | 7,907,823,608 | 10,536,298,474 | 10,919,177,972 |
| L1-dcache-load-misses | 174,337,652 | 164,790,077 | 172,284,377 | 148,872,050 |
| LLC-loads | 3,725,037 | 2,245,202 | 3,826,286 | 1,444,452 |
| LLC-load-misses | 343,656 | 93,780 | 354,132 | 51,303 |

random-distinct

| metric | available (ef0566d) | pre-available-fix (d54ba76) | available-faster (3861c96) | is-free-tomb-bitcasts (d1c8fb5) |
| --- | --- | --- | --- | --- |
| task-clock (ms) | 1,892 | 1,860 | 1,993 | 1,700 |
| context-switches | 0 | 0 | 0 | 0 |
| cpu-migrations | 0 | 0 | 0 | 0 |
| page-faults | 66,911 | 66,912 | 66,910 | 66,910 |
| cycles | 3,269,601,475 | 3,163,723,978 | 3,452,727,683 | 2,889,333,627 |
| instructions | 3,426,699,012 | 2,677,609,405 | 3,699,799,236 | 3,342,951,341 |
| branches | 517,835,326 | 433,026,244 | 542,492,094 | 467,979,288 |
| branch misses | 24,678,474 | 24,636,232 | 32,133,887 | 25,161,544 |
| L1-dcache-loads | 552,992,863 | 394,913,076 | 631,125,865 | 575,465,372 |
| L1-dcache-load-misses | 106,111,926 | 109,344,475 | 112,017,648 | 107,277,507 |
| LLC-loads | 51,676,188 | 56,874,928 | 53,674,073 | 52,664,954 |
| LLC-load-misses | 17,796,075 | 22,959,431 | 16,115,924 | 15,356,085 |

insert-10M-int

| metric | available (ef0566d) | pre-available-fix (d54ba76) | available-faster (3861c96) | is-free-tomb-bitcasts (d1c8fb5) |
| --- | --- | --- | --- | --- |
| task-clock (ms) | 3,021 | 2,960 | 4,185 | 2,781 |
| context-switches | 0 | 0 | 0 | 0 |
| cpu-migrations | 0 | 0 | 0 | 0 |
| page-faults | 73,759 | 73,760 | 73,759 | 73,759 |
| cycles | 5,348,995,897 | 5,191,935,672 | 7,583,195,490 | 4,891,625,417 |
| instructions | 3,798,916,797 | 3,144,621,177 | 4,629,854,698 | 3,733,511,249 |
| branches | 619,672,586 | 553,510,387 | 720,072,866 | 502,442,844 |
| branch misses | 32,061,922 | 31,377,103 | 43,435,586 | 31,756,749 |
| L1-dcache-loads | 570,232,081 | 458,894,618 | 711,517,039 | 623,767,204 |
| L1-dcache-load-misses | 179,585,635 | 182,700,348 | 191,241,512 | 180,021,265 |
| LLC-loads | 80,555,904 | 83,639,512 | 81,533,360 | 79,334,076 |
| LLC-load-misses | 32,926,971 | 34,002,117 | 31,937,085 | 31,143,504 |

project-euler-14-main

| metric | available (ef0566d) | pre-available-fix (d54ba76) | available-faster (3861c96) | is-free-tomb-bitcasts (d1c8fb5) |
| --- | --- | --- | --- | --- |
| task-clock (ms) | 325 | 388 | 430 | 315 |
| context-switches | 0 | 0 | 0 | 0 |
| cpu-migrations | 0 | 0 | 0 | 0 |
| page-faults | 17,416 | 17,416 | 17,417 | 17,416 |
| cycles | 534,002,560 | 651,213,437 | 718,199,420 | 511,703,222 |
| instructions | 396,849,402 | 395,656,209 | 467,199,680 | 382,598,844 |
| branches | 71,840,304 | 69,010,106 | 72,189,551 | 60,020,886 |
| branch misses | 5,080,626 | 4,667,034 | 5,949,324 | 4,655,130 |
| L1-dcache-loads | 79,627,594 | 77,122,812 | 95,701,841 | 81,966,220 |
| L1-dcache-load-misses | 20,540,324 | 21,259,274 | 22,035,398 | 20,491,521 |
| LLC-loads | 9,833,842 | 10,622,492 | 10,177,942 | 9,656,479 |
| LLC-load-misses | 4,140,400 | 5,168,282 | 4,763,765 | 3,520,579 |

I also experimented with a few other optimizations that didn't pan out -- see here for all the data.

cc: @jorangreef @Sahnvour

@Sahnvour
Contributor

I would have expected the optimizer to catch this simplification!

@Sahnvour Sahnvour self-requested a review January 10, 2022 21:16
@SpexGuy
Contributor

SpexGuy commented Jan 10, 2022

> I would have expected the optimizer to catch this simplification!

It's not quite equivalent: the old check of `isUsed() or isTombstone()` would stop looping for slots that were marked as unused but had nonzero fingerprint bits. The new version keeps looping in that case, because it doesn't recognize the slot as free. The optimizer can't prove that this case never happens, so it can't make this transformation.
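
A hedged sketch of why the two loop conditions differ, building on the hypothetical `Metadata` layout sketched in the PR description above (linear probing stands in for the real probe sequence, purely for illustration):

```zig
// Old loop shape: probe while the slot is used or a tombstone. A slot with
// used == 0 but nonzero fingerprint bits fails both checks, so probing stops.
fn probeOld(slots: []const Metadata, start: usize) usize {
    var idx = start;
    while (slots[idx].isUsed() or slots[idx].isTombstone()) {
        idx = (idx + 1) % slots.len;
    }
    return idx;
}

// New loop shape: probe until the slot is exactly the free pattern
// (used == 0 and fingerprint == 0). The same odd slot keeps this loop going,
// so the two conditions are only interchangeable if that state never occurs,
// which the optimizer cannot prove on its own.
fn probeNew(slots: []const Metadata, start: usize) usize {
    var idx = start;
    while (!slots[idx].isFree()) {
        idx = (idx + 1) % slots.len;
    }
    return idx;
}
```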

@sentientwaffle
Contributor Author

Restructuring `isUsed or isTombstone` into `!isFree` is a change for clarity, not performance. The actual optimization is `@bitCast(u8, self) == slot_free` (as opposed to checking the `used` and `fingerprint` fields independently).

I too was surprised that the compiler didn't handle that optimization (i.e. combining comparisons of a packed struct's fields) automatically.
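
A hypothetical minimal reduction (not from the PR) of the pattern being discussed; the struct and function names here are illustrative only:

```zig
const Flags = packed struct {
    low: u7 = 0,
    high: u1 = 0,
};

// Field-wise form: each field is extracted (masked/shifted) and compared on
// its own; per the discussion above, the optimizer was not observed to merge
// the two checks into a single compare.
fn isZeroFieldwise(f: Flags) bool {
    return f.low == 0 and f.high == 0;
}

// Bit-cast form: a single comparison of the backing byte against a constant.
fn isZeroBitcast(f: Flags) bool {
    return @bitCast(u8, f) == 0;
}
```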

@andrewrk andrewrk merged commit 4731a6e into ziglang:master Jan 11, 2022
andrewrk pushed a commit that referenced this pull request Jan 12, 2022
scorphus pushed a commit to scorphus/zig that referenced this pull request Jan 15, 2022