Further improve ProbabilisticMap on Avx512 #107798
Merged
This PR contains 3 separate changes that take advantage of Avx512's permute behavior.

The first change skips an `AND` instruction by taking advantage of the fact that `PermuteVar64x8` only looks at the bottom 6 bits (64 values) of the control. Note that we were doing `AND 31` instead of `63`, though. In this case that's okay because we know that the `charMap` is just a duplicated 256-bit lookup. The 6th bit that we aren't masking off anymore will impact whether we pick from the first 0-31 or the 32-63 values, but that doesn't matter since the two are the same.
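
To make the first change concrete, here is a minimal scalar sketch of why the mask is redundant. It does not use the real intrinsics: `Permute64` is a hypothetical stand-in that models only the "low 6 bits of the control" behavior, and `Duplicate` and `MaskIsRedundant` are illustrative names, not code from this PR.

```csharp
using System;

public static class PermuteMaskSketch
{
    // Scalar model of a 64-byte permute: only the low 6 bits of the index are used.
    public static byte Permute64(byte[] table64, byte index) => table64[index & 63];

    // Builds a 64-byte table by duplicating a 32-byte (256-bit) lookup.
    public static byte[] Duplicate(byte[] table32)
    {
        var table64 = new byte[64];
        Array.Copy(table32, 0, table64, 0, 32);
        Array.Copy(table32, 0, table64, 32, 32);
        return table64;
    }

    // Returns true if an explicit AND 31 never changes the looked-up value.
    public static bool MaskIsRedundant(byte[] table32)
    {
        byte[] table64 = Duplicate(table32);
        for (int v = 0; v < 256; v++)
        {
            byte withMask = Permute64(table64, (byte)(v & 31));
            byte withoutMask = Permute64(table64, (byte)v);
            if (withMask != withoutMask)
            {
                return false;
            }
        }
        return true;
    }
}
```

Calling `PermuteMaskSketch.MaskIsRedundant` with any 32-byte table should return `true`, since the unmasked 6th bit only chooses between two identical halves.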
The second change also takes advantage of the above observation. It recognizes that the `values >>> 5` operation is emulated as `(values.AsInt32() >>> 5).AsByte() & Vector128.Create((byte)7)` since there's no instruction for `>>>` on bytes on X86. We can skip that `& 7` operation if we swap out the shuffle for a permute. This way we're again only looking at the lower 6 bits. As before, we now have bits 4/5/6 that aren't getting masked off, but that's okay since the values are duplicated 8 times.

The third change takes advantage of the `PermuteVar32x8x2` instruction to pick alternating bytes from the two source vectors, instead of shifting the bytes around and doing a saturating pack. Since `PackUnsignedSaturate` also shuffles inputs around a bit, we needed to reverse that by calling `FixUpPackedVector512Result` (another permute) if we did find any potential matches. `PermuteVar64x8x2` keeps the input order as-is, meaning we can skip that "fix up".

Overall, it adds up to a ~20% improvement.
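
The ordering difference behind the third change can be illustrated with a scalar sketch. Everything here is a model under stated assumptions rather than the PR's code: the pack is modeled per 128-bit lane, `FixUp` is a guess at what a fix-up permute has to do (the real `FixUpPackedVector512Result` may differ), the permute path is assumed to pick the low byte of each 16-bit element, and the inputs are kept within 0..255 so saturation plays no role.

```csharp
using System;

public static class PackVersusPermuteSketch
{
    // Scalar model of a 512-bit unsigned saturating pack (16-bit to 8-bit).
    // Each 128-bit lane of the result holds 8 bytes from a, then 8 bytes from b.
    public static byte[] PackUnsignedSaturate(ushort[] a, ushort[] b)
    {
        var result = new byte[64];
        for (int lane = 0; lane < 4; lane++)
        {
            for (int i = 0; i < 8; i++)
            {
                result[lane * 16 + i] = Saturate(a[lane * 8 + i]);
                result[lane * 16 + 8 + i] = Saturate(b[lane * 8 + i]);
            }
        }
        return result;
    }

    // Inputs are treated as signed 16-bit values and clamped to 0..255.
    private static byte Saturate(ushort value) =>
        (byte)Math.Clamp((int)unchecked((short)value), 0, 255);

    // Assumed fix-up: reorder 8-byte chunks so all of a's bytes precede b's.
    public static byte[] FixUp(byte[] packed)
    {
        int[] chunkOrder = { 0, 2, 4, 6, 1, 3, 5, 7 };
        var result = new byte[64];
        for (int chunk = 0; chunk < 8; chunk++)
        {
            Array.Copy(packed, chunkOrder[chunk] * 8, result, chunk * 8, 8);
        }
        return result;
    }

    // Scalar model of a two-source byte permute:
    // bit 6 of each index picks the source, the low 6 bits pick the byte.
    public static byte[] PermuteTwoSources(byte[] first, byte[] second, byte[] indices)
    {
        var result = new byte[64];
        for (int i = 0; i < 64; i++)
        {
            byte[] source = (indices[i] & 64) == 0 ? first : second;
            result[i] = source[indices[i] & 63];
        }
        return result;
    }

    // Little-endian byte view of 16-bit elements.
    public static byte[] ToBytes(ushort[] values)
    {
        var bytes = new byte[values.Length * 2];
        for (int i = 0; i < values.Length; i++)
        {
            bytes[2 * i] = (byte)(values[i] & 0xFF);
            bytes[2 * i + 1] = (byte)(values[i] >> 8);
        }
        return bytes;
    }

    // Old shape: pack, then undo the lane interleaving with a second permute.
    public static byte[] NarrowWithPack(ushort[] a, ushort[] b) =>
        FixUp(PackUnsignedSaturate(a, b));

    // New shape: one permute with even indices 0, 2, ..., 126 keeps input order.
    public static byte[] NarrowWithPermute(ushort[] a, ushort[] b)
    {
        var indices = new byte[64];
        for (int i = 0; i < 64; i++)
        {
            indices[i] = (byte)(2 * i);
        }
        return PermuteTwoSources(ToBytes(a), ToBytes(b), indices);
    }
}
```

In this model, `NarrowWithPack` and `NarrowWithPermute` agree for inputs in the 0..255 range, while the raw `PackUnsignedSaturate` output is interleaved per lane and needs the extra reordering step.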