New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add basic binary operators (and, or, xor, not) to IO::Buffer
.
#5893
Conversation
9646ba3
to
6f63ec3
Compare
memory_xor(unsigned char * restrict output, unsigned char * restrict base, size_t size, unsigned char * restrict mask, size_t mask_size) | ||
{ | ||
for (size_t offset = 0; offset < size; offset += 1) { | ||
output[offset] = base[offset] ^ mask[offset % mask_size]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
% mask_size
is likely quite slower than a bit mask (if the size is power of two), or condition with a separate mask_index resetting to 0 if >= mask_size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the assembly and it looks to generate fairly efficient SSE/AVX instructions, so I trust the compiler is doing the right things here. In any case, simple and correct is a good first implementation, and later on we can unroll it by hand if performance is an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not about unrolling. It's about a modulo operation is as expensive as division (many cycles), and a lot more than + - * & ! ^
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played around with this: https://godbolt.org/z/7MEfsrz9z - change -Os to -O3 and you see the compiler will unroll the loop and load the data in chunks. My understanding is that by unrolling the loop carefully, some division operations can be skipped.
If you know how we can count the number of divs being performed, that would be awesome, otherwise, I'm fine with slow but correct to start with, and figuring out a faster implementation later as an improvement. If you think adding a branch to the inner loop will ultimately be faster, let's do that? But at that point I think we need to benchmark it. It's entirely possible the compiler would do a worse job at unrolling a loop with an if statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly confident it does not remove the division operations, it could only do that if it did the branch I suggest, and there are plenty of divl
and divq
left in that assembly.
Maybe it does a few less, but it could only do that without branches if it can make sure the mask_size is e.g. > 1, and it cannot know that.
I agree, we should benchmark it, then it should be clear. With the explicit condition there would be 0 division left, but the extra branch. It should still unroll fine.
The best would be if we could have the mask_size
as a constant in the source, this should be possible by doing an early check and call with a constant for the mask_size
we care about + force inline this function.
This introduces both the operators (
&
,|
,^
and~
) as well as in-place methods:and!
,or!
,xor!
,not!
.