AVX runtime float/double swizzle small improvement #1189
Conversation
Thanks! Nice fine tuning \o/
Looks like the AVX tests are good now, but I'm unsure what the remaining failure is.
LGTM. I'll fix the emulated part; that's not something you should worry about. Would you mind squashing the last two commits?
@AntoinePrv when you squash, you can also rebase on
__m256 swapped = _mm256_permute2f128_ps(self, self, 0x01); // [high | low]

// normalize mask
batch<uint32_t, A> half_mask = mask % 4;
Does this actually generate different asm?
PS: I am fine either way. It might be worth having a normalize<value> API that converts a value into a mask when it is a power of two, so that we can use it everywhere.
I'm unsure. For a regular integer operation I would say so, but with intrinsics I had a doubt.
I like this PR! Nice that you found another way to optimize this! I would suggest making the blend-mask generation and the normalization pure functions or class methods that we can reuse elsewhere; I think we might need them here and there. This would also allow unit testing them individually, making debugging and development easier. Again, this is just a suggestion. Feel free to ignore it.
Merged as 9d41ad9 (once squashed)
I have a number of swizzle improvements to suggest, but I am starting small to get better accustomed to xsimd.
What do you make of the following change? My motivation was that _mm256_permute2f128_ps is the most expensive operation (though I'm not sure whether that's a problem in a CPU pipeline), so this PR suggests using it only once. It also replaces the modulo with a select mask to make sure it is properly optimized.