Further improve ProbabilisticMap on Avx512 #107798

Merged 3 commits on Feb 28, 2025


MihaZupan

This PR contains 3 separate changes that take advantage of Avx512's permute behavior.

The first change is skipping an AND instruction by taking advantage of the fact that PermuteVar64x8 will only look at the bottom 6 bits (64 values) of the control.

-Vector512<byte> index = values & Vector512.Create((byte)31);
-Vector512<byte> bitMask = Avx512Vbmi.PermuteVar64x8(charMap, index);
+Vector512<byte> bitMask = Avx512Vbmi.PermuteVar64x8(charMap, values);

Note that the old code masked with 31 rather than 63.
Dropping the mask is still correct here because we know that charMap is just a 256-bit (32-byte) lookup duplicated across both halves of the 512-bit vector.
The 6th bit, which we no longer mask off, decides whether we pick from entries 0-31 or 32-63, but that doesn't matter since the two halves are identical.
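
The equivalence can be checked with a small scalar model. This is only a sketch and not code from the PR: `PermuteModel` and its methods are hypothetical names, `Permute64` merely imitates how one lane of PermuteVar64x8 uses the low 6 bits of its control byte, and the lookup contents are arbitrary.

```csharp
using System;

public static class PermuteModel
{
    // Hypothetical scalar model of one lane of PermuteVar64x8:
    // only the low 6 bits of the control byte select a table entry.
    public static byte Permute64(byte[] table, byte control) => table[control & 63];

    // Builds a 64-byte table whose upper half repeats the lower 32 bytes,
    // mirroring the duplicated 256-bit lookup described above.
    public static byte[] DuplicateLookup(byte[] lookup32)
    {
        if (lookup32.Length != 32)
            throw new ArgumentException("Expected exactly 32 bytes.", nameof(lookup32));

        byte[] table = new byte[64];
        lookup32.CopyTo(table, 0);
        lookup32.CopyTo(table, 32);
        return table;
    }

    // Returns true if dropping the "& 31" never changes the selected byte
    // for any of the 256 possible control values.
    public static bool MaskIsRedundant(byte[] lookup32)
    {
        byte[] table = DuplicateLookup(lookup32);
        for (int value = 0; value <= byte.MaxValue; value++)
        {
            byte withMask = Permute64(table, (byte)(value & 31));
            byte withoutMask = Permute64(table, (byte)value);
            if (withMask != withoutMask)
                return false;
        }
        return true;
    }
}
```

The check only holds because both halves of the table are identical; with a table that differs between entries 0-31 and 32-63, the unmasked 6th bit would change the result.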


The second change is also taking advantage of the above observation.
It recognizes that the values >>> 5 operation is emulated as (values.AsInt32() >>> 5).AsByte() & Vector512.Create((byte)7), since X86 has no instruction for >>> on bytes.
We can skip that & 7 operation if we swap out the shuffle for a permute, because the permute again only looks at the lower 6 bits. As before, the 4th, 5th and 6th bits are no longer masked off, but that's okay since the 8-byte pattern is duplicated 8 times across the vector.

-Vector512<byte> shifted = values >>> 5;
-Vector512<byte> bitPositions = Avx512BW.Shuffle(Vector512.Create(0x8040201008040201).AsByte(), shifted);
+Vector512<byte> shifted = (values.AsInt32() >>> 5).AsByte();
+Vector512<byte> bitPositions = Avx512Vbmi.PermuteVar64x8(Vector512.Create(0x8040201008040201).AsByte(), shifted);
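
A scalar sketch of why the unmasked bits are harmless, under the assumption that a 32-bit logical shift pulls the low bits of the neighboring byte into the upper bits of each shifted byte. `ShiftModel` and its methods are hypothetical helpers for illustration, not part of the PR.

```csharp
using System;

public static class ShiftModel
{
    // Scalar stand-in for Vector512.Create(0x8040201008040201).AsByte():
    // the 8-byte pattern 1, 2, 4, ..., 128 (little-endian) repeated 8 times.
    public static byte[] BitPositionTable()
    {
        byte[] table = new byte[64];
        for (int i = 0; i < table.Length; i++)
            table[i] = (byte)(1 << (i & 7));
        return table;
    }

    // Models one byte of (values.AsInt32() >>> 5).AsByte(): the byte is shifted
    // as part of a wider element, so the low bits of the next-higher byte
    // ("neighbor") end up in bits 3-7 of the result.
    public static byte ShiftedLowByte(byte value, byte neighbor)
    {
        uint element = (uint)(value | (neighbor << 8));
        return (byte)(element >> 5);
    }

    // Returns true if a 6-bit table lookup of the unmasked shifted byte always
    // equals the bit position computed from (value >> 5) alone.
    public static bool MaskIsRedundant()
    {
        byte[] table = BitPositionTable();
        for (int value = 0; value < 256; value++)
        {
            for (int neighbor = 0; neighbor < 256; neighbor++)
            {
                byte shifted = ShiftedLowByte((byte)value, (byte)neighbor);
                byte expected = (byte)(1 << (value >> 5));
                byte actual = table[shifted & 63]; // permute semantics: low 6 bits only
                if (actual != expected)
                    return false;
            }
        }
        return true;
    }
}
```

The same lookup through a 16-byte-lane shuffle would not be safe without the mask, which is why the change pairs the dropped & 7 with the switch to a permute.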

The third change is taking advantage of the PermuteVar64x8x2 instruction to pick alternating bytes from the two source vectors, instead of shifting the bytes around and doing a saturating pack.

-Vector512<byte> sourceLower = Avx512BW.PackUnsignedSaturate(
-    (source0 & Vector512.Create((ushort)255)).AsInt16(),
-    (source1 & Vector512.Create((ushort)255)).AsInt16());
+Vector512<byte> sourceLower = Avx512Vbmi.PermuteVar64x8x2(source0.AsByte(), Vector512.CreateSequence<byte>(0, 2), source1.AsByte());

-Vector512<byte> sourceUpper = Avx512BW.PackUnsignedSaturate(
-    (source0 >>> 8).AsInt16(),
-    (source1 >>> 8).AsInt16());
+Vector512<byte> sourceUpper = Avx512Vbmi.PermuteVar64x8x2(source0.AsByte(), Vector512.CreateSequence<byte>(1, 2), source1.AsByte());

Since PackUnsignedSaturate also shuffles inputs around a bit, we needed to reverse that by calling FixUpPackedVector512Result (another permute) if we did find any potential matches.
PermuteVar64x8x2 keeps the input order as-is, meaning we can skip that "fix up".

-if (TryFindMatchAvx512<TUseFastContains>(ref cur, PackedSpanHelpers.FixUpPackedVector512Result(result).ExtractMostSignificantBits(), ref state, out int index))
+if (TryFindMatchAvx512<TUseFastContains>(ref cur, result.ExtractMostSignificantBits(), ref state, out int index))
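
The ordering claim can be illustrated with a scalar model of the two-source permute. This is a sketch under stated assumptions, not the PR's implementation: `NarrowModel` and its methods are hypothetical, `Permute64x2` assumes that bits 0-5 of each index select a byte and bit 6 selects the source vector, and `Sequence` stands in for Vector512.CreateSequence<byte>.

```csharp
using System;

public static class NarrowModel
{
    // Hypothetical scalar model of PermuteVar64x8x2(lower, indices, upper):
    // bits 0-5 of each index pick a byte, bit 6 picks the source vector.
    public static byte[] Permute64x2(byte[] lower, byte[] indices, byte[] upper)
    {
        byte[] result = new byte[64];
        for (int i = 0; i < 64; i++)
        {
            byte[] source = (indices[i] & 64) == 0 ? lower : upper;
            result[i] = source[indices[i] & 63];
        }
        return result;
    }

    // Scalar equivalent of Vector512.CreateSequence<byte>(start, step).
    public static byte[] Sequence(byte start, byte step)
    {
        byte[] result = new byte[64];
        for (int i = 0; i < 64; i++)
            result[i] = (byte)(start + i * step);
        return result;
    }

    // Splits 64 chars into their low and high bytes, keeping the input order.
    public static (byte[] Low, byte[] High) SplitChars(ReadOnlySpan<char> chars)
    {
        if (chars.Length != 64)
            throw new ArgumentException("Expected exactly 64 chars.", nameof(chars));

        // View the chars as two 64-byte little-endian vectors of 32 chars each.
        byte[] source0 = new byte[64];
        byte[] source1 = new byte[64];
        for (int i = 0; i < 32; i++)
        {
            source0[2 * i] = (byte)chars[i];
            source0[2 * i + 1] = (byte)(chars[i] >> 8);
            source1[2 * i] = (byte)chars[32 + i];
            source1[2 * i + 1] = (byte)(chars[32 + i] >> 8);
        }

        // Even indices (0, 2, ..., 126) select low bytes; odd ones select high bytes.
        byte[] low = Permute64x2(source0, Sequence(0, 2), source1);
        byte[] high = Permute64x2(source0, Sequence(1, 2), source1);
        return (low, high);
    }
}
```

In this model, element i of each result corresponds to input char i, so no reordering step is needed afterwards. The pack-based version interleaves results per 128-bit lane, which is what the fix-up permute had to undo.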

Overall, it adds up to a ~20% improvement on longer inputs (7-26% across the lengths measured below).

using System;
using System.Buffers;
using BenchmarkDotNet.Attributes;

public class ProbMap
{
    [Params(32, 64, 128, 256, 512, 1024, 10_000)]
    public int Length;

    private static readonly SearchValues<char> s_searchValues = SearchValues.Create("ßäöüÄÖÜ");
    private char[] _text;

    [GlobalSetup]
    public void Setup() => _text = new string('\n', Length).ToCharArray();

    [Benchmark]
    public int IndexOfAny() => _text.AsSpan().IndexOfAny(s_searchValues);
}

| Method     | Toolchain | Length | Mean       | Error     | Ratio |
|------------|-----------|-------:|-----------:|----------:|------:|
| IndexOfAny | main      | 32     | 2.606 ns   | 0.0793 ns | 1.00  |
| IndexOfAny | pr        | 32     | 2.414 ns   | 0.0560 ns | 0.93  |
| IndexOfAny | main      | 64     | 2.750 ns   | 0.0505 ns | 1.00  |
| IndexOfAny | pr        | 64     | 2.540 ns   | 0.0271 ns | 0.92  |
| IndexOfAny | main      | 128    | 4.307 ns   | 0.0289 ns | 1.00  |
| IndexOfAny | pr        | 128    | 3.175 ns   | 0.0483 ns | 0.74  |
| IndexOfAny | main      | 256    | 5.637 ns   | 0.0459 ns | 1.00  |
| IndexOfAny | pr        | 256    | 4.932 ns   | 0.0445 ns | 0.87  |
| IndexOfAny | main      | 512    | 10.615 ns  | 0.0535 ns | 1.00  |
| IndexOfAny | pr        | 512    | 8.577 ns   | 0.1427 ns | 0.81  |
| IndexOfAny | main      | 1024   | 22.187 ns  | 0.1047 ns | 1.00  |
| IndexOfAny | pr        | 1024   | 18.331 ns  | 0.3958 ns | 0.83  |
| IndexOfAny | main      | 10000  | 231.458 ns | 1.1701 ns | 1.00  |
| IndexOfAny | pr        | 10000  | 185.283 ns | 0.4801 ns | 0.80  |

@MihaZupan MihaZupan added this to the 10.0.0 milestone Sep 13, 2024
@MihaZupan MihaZupan self-assigned this Sep 13, 2024

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

@MihaZupan MihaZupan requested a review from Copilot February 7, 2025 22:29


Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

@MihaZupan MihaZupan merged commit 6422286 into dotnet:main Feb 28, 2025
137 of 139 checks passed