@AntoinePrv
Contributor

I have a number of swizzle improvements to suggest, but I am starting small to get better accustomed to xsimd.

What do you make of the following change? My motivation was that _mm256_permute2f128_ps is the most expensive operation here (though I'm not sure whether that is a problem in a CPU pipeline), so this PR suggests using it only once.

It also replaces modulo with a select mask to make sure this is properly optimized.
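
For readers following along, here is a portable scalar sketch of the idea described above, not the code in the PR. It assumes an 8-element float vector split into two 128-bit lanes; the helper names (`swap_halves`, `permute_in_lane`, `swizzle_model`) are hypothetical and only model what _mm256_permute2f128_ps, _mm256_permutevar_ps and a per-element blend would do.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using vec8 = std::array<float, 8>;
using idx8 = std::array<std::uint32_t, 8>;

// Models _mm256_permute2f128_ps(v, v, 0x01): swap the two 128-bit halves.
inline vec8 swap_halves(const vec8& v) {
    return vec8{{v[4], v[5], v[6], v[7], v[0], v[1], v[2], v[3]}};
}

// Models _mm256_permutevar_ps: each 128-bit lane is permuted on its own,
// and only the low 2 bits of every index are used.
inline vec8 permute_in_lane(const vec8& v, const idx8& idx) {
    vec8 out{};
    for (std::size_t i = 0; i < 8; ++i) {
        const std::size_t base = (i / 4) * 4;  // first element of this lane
        out[i] = v[base + (idx[i] & 3u)];
    }
    return out;
}

// Hypothetical scalar model of the swizzle: one lane swap, two in-lane
// permutes, then a per-element choice between the two results.
inline vec8 swizzle_model(const vec8& self, const idx8& mask) {
    const vec8 swapped = swap_halves(self);              // the single cross-lane op
    const vec8 r0 = permute_in_lane(self, mask);         // source in the same lane
    const vec8 r1 = permute_in_lane(swapped, mask);      // source in the other lane
    vec8 out{};
    for (std::size_t i = 0; i < 8; ++i) {
        const bool wants_high = (mask[i] & 4u) != 0;     // index 4..7
        const bool sits_high = i >= 4;                   // output element 4..7
        out[i] = (wants_high != sits_high) ? r1[i] : r0[i];
    }
    return out;
}
```

In this model the cross-lane swap happens exactly once, and the final selection plays the role that a blend on a per-element mask would play in the intrinsics version.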

@AntoinePrv AntoinePrv changed the title AVX runtime float/double swizzle improvement AVX runtime float/double swizzle small improvement Oct 31, 2025
@serge-sans-paille
Contributor

Thanks! nice fine tuning \o/

@AntoinePrv
Contributor Author

Looks like AVX tests are good now, but I'm unsure what the remaining failure is.
👀 @serge-sans-paille @JohanMabille

@serge-sans-paille
Contributor

LGTM. I'll fix the emulated part; it's not something you should worry about. Would you mind squashing the last two commits?

@serge-sans-paille
Contributor

@AntoinePrv when you squash, you can also rebase on master which now contains a fix for the emulated part

__m256 swapped = _mm256_permute2f128_ps(self, self, 0x01); // [high | low]

// normalize mask
batch<uint32_t, A> half_mask = mask % 4;
@DiamonDinoia
Contributor

DiamonDinoia commented Nov 1, 2025

does this generate different asm actually?

PS: I am fine either way. It might be worth having a normalize<value> API that we can use everywhere, which turns the modulo by value into a bit mask when value is a power of two.

@AntoinePrv
Contributor Author

I'm unsure. For a regular integer operation I would say so but with intrinsics I had a doubt.
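
To make the normalize<value> suggestion in this thread concrete, here is a minimal scalar sketch. The `normalize` template below is hypothetical and is not an existing xsimd API; it only shows that, for unsigned values and a power-of-two modulus, the modulo can be written as a bitwise AND. Whether a batch-level `%` and `&` compile to the same instructions is left open here, as it was in the thread.

```cpp
#include <cstdint>

// Hypothetical helper (not part of xsimd): reduce an unsigned index modulo N,
// where N must be a power of two so that the reduction is a single AND.
template <std::uint32_t N>
constexpr std::uint32_t normalize(std::uint32_t index) noexcept {
    static_assert(N != 0 && (N & (N - 1u)) == 0, "N must be a power of two");
    return index & (N - 1u);
}

// For unsigned operands the two forms agree.
static_assert(normalize<4>(6u) == 6u % 4u, "AND and modulo agree for N = 4");
static_assert(normalize<8>(13u) == 13u % 8u, "AND and modulo agree for N = 8");
```

Spelling the operation as an AND keeps the intent explicit instead of relying on the optimizer to recognise the power-of-two modulo.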

@DiamonDinoia
Contributor

I like this PR! Nice that you found a further way to optimize this!

I would suggest making the blend mask generation and the normalization pure functions or class methods that we can reuse elsewhere. I think we might need them here and there.

This would also allow unit testing them individually, making debugging and development easier.

Again, this is just a suggestion. Feel free to ignore it.
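
As an illustration of the suggestion above, a scalar sketch of the blend mask as a standalone pure function could look like the following. The name `cross_lane_mask` and the array-based types are assumptions made for this example, not xsimd code; the function marks the output elements whose source sits in the other 128-bit lane.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using idx8 = std::array<std::uint32_t, 8>;
using mask8 = std::array<bool, 8>;

// Hypothetical pure helper: out[i] is true when element i must be taken from
// the lane-swapped copy, i.e. when the requested index lives in the other lane.
inline mask8 cross_lane_mask(const idx8& mask) {
    mask8 out{};
    for (std::size_t i = 0; i < 8; ++i) {
        const bool wants_high = (mask[i] & 4u) != 0;  // requested index is 4..7
        const bool sits_high = i >= 4;                // output element is 4..7
        out[i] = wants_high != sits_high;
    }
    return out;
}
```

Because the helper has no side effects, it can be checked on its own: an identity index vector should yield an all-false mask, and a fully reversed one an all-true mask.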

@serge-sans-paille
Contributor

serge-sans-paille commented Nov 3, 2025

Merged as 9d41ad9 (once squashed)
