Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better ARM intrinsics implementation for dn_simdhash #113095

Merged
merged 7 commits into from
Mar 7, 2025

Conversation

kg
Copy link
Member

@kg kg commented Mar 3, 2025

See #113074
Speculative better implementation of dn_simdhash search for ARM NEON. I've examined the generated assembly and it looks right, but I don't have a way to test this locally right now so I'm testing it on CI.
This by necessity restricts NEON support to ARM64 (ARM32's vector registers are too narrow and not all the intrinsics I need exist on it). On ARM32 we should use scalar fallback.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Mar 3, 2025
@kg kg changed the title Simdhash arm shrn Better ARM intrinsics implementation for dn_simdhash Mar 3, 2025
@kg kg added area-Infrastructure-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Mar 3, 2025
Copy link
Contributor

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

@kotlarmilos
Copy link
Member

The results are comparable with the scalar approach.

This approach - Total AOT time: 47399 ms
Scalar fallback - Total AOT time: 46708 ms

@steveisok
Copy link
Member

/azp run runtime-ioslike

Copy link

Pull request contains merge conflicts.

@kg kg force-pushed the simdhash-arm-shrn branch from 4d0fc8d to 6f7e43e Compare March 4, 2025 21:41
@steveisok
Copy link
Member

/azp run runtime-ioslike

Copy link

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@steveisok
Copy link
Member

/azp run runtime-android

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kg
Copy link
Member Author

kg commented Mar 4, 2025

Timings from an Ampere ARM64 VM with this PR:

ght_find_random_keys: Warmed 6 time(s). Running 2048 iterations... 15 step(s): avg 68.540ns min 67.670ns max 69.906ns
dn_find_random_keys: Warmed 8 time(s). Running 2048 iterations... 19 step(s): avg 52.125ns min 51.060ns max 53.129ns
dn_find_missing_key: Warmed 10 time(s). Running 4096 iterations... 13 step(s): avg 39.811ns min 39.321ns max 40.401ns
ght_find_missing_key: Warmed 6 time(s). Running 2048 iterations... 14 step(s): avg 74.905ns min 73.549ns max 77.544ns

vs the scalar implementation:

ght_find_random_keys: Warmed 6 time(s). Running 2048 iterations... 15 step(s): avg 66.992ns min 66.052ns max 68.372ns
dn_find_random_keys: Warmed 6 time(s). Running 2048 iterations... 15 step(s): avg 69.237ns min 68.365ns max 70.181ns
dn_find_missing_key: Warmed 8 time(s). Running 2048 iterations... 18 step(s): avg 55.457ns min 54.583ns max 55.913ns
ght_find_missing_key: Warmed 5 time(s). Running 2048 iterations... 13 step(s): avg 78.604ns min 77.465ns max 79.513ns

Based on this it seems like Apple's CPUs are just really good at scalar code, but the same is not necessarily true for ARM64 chips from other manufacturers. So it's worth landing this to improve performance on i.e. random Android devices even if it won't improve MacOS/iOS.

@steveisok
Copy link
Member

/azp run runtime-android

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kg
Copy link
Member Author

kg commented Mar 4, 2025

Determined that using vector loads instead of byte loads generates smaller, faster ARM assembly, so we do that for ARM64 targets now. This means byte loads are used on x64 and any unsupported target, and vector loads are used on WASM and ARM64.

ght_find_random_keys: Warmed 6 time(s). Running 2048 iterations... 15 step(s): avg 65.169ns min 63.989ns max 66.211ns
dn_find_random_keys: Warmed 10 time(s). Running 4096 iterations... 12 step(s): avg 42.675ns min 41.976ns max 43.275ns
dn_find_missing_key: Warmed 12 time(s). Running 4096 iterations... 15 step(s): avg 33.815ns min 33.653ns max 34.333ns
ght_find_missing_key: Warmed 6 time(s). Running 2048 iterations... 13 step(s): avg 77.407ns min 76.249ns max 79.503ns

@steveisok
Copy link
Member

/azp run runtime-android

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kg
Copy link
Member Author

kg commented Mar 6, 2025

/backport to release/9.0-staging

Copy link
Contributor

github-actions bot commented Mar 6, 2025

Started backporting to release/9.0-staging: https://github.com/dotnet/runtime/actions/runs/13708261750

Copy link
Contributor

github-actions bot commented Mar 6, 2025

@kg backporting to "release/9.0-staging" failed, the patch most likely resulted in conflicts:

$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch

Applying: New ARM NEON implementation of find_first_matching_suffix_simd
Using index info to reconstruct a base tree...
M	src/native/containers/dn-simdhash-arch.h
Falling back to patching base and 3-way merge...
Auto-merging src/native/containers/dn-simdhash-arch.h
CONFLICT (content): Merge conflict in src/native/containers/dn-simdhash-arch.h
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 New ARM NEON implementation of find_first_matching_suffix_simd
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants