-
-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Partially refactor SSE2 blitters with macros #2656
Partially refactor SSE2 blitters with macros #2656
Conversation
So funny story, guess what I was working on last night before you posted this PR? Starbuck5@4233e2c. I guess I'll shelve that. Alright, here's what I think. I don't like having the DUFF loop in the non-4-at-a-time pixel segments. It seems wasteful to me to duplicate the code a lot just so it can be run through once per height. The 4-at-a-time pixel segments could go through 800 pixels every row, the non-4-at-a-time will only ever go through 3 at most. I'd like to keep the functions themselves (like the function definitions, the names) handwritten in the code, rather than having them defined by the macro. This way you can jump around the code in an IDE to those functions or search for them directly. This is also a style preference. Thoughts? I'd like the blit macros to be generic seeming. Rather than a specific slot for a code, an alpha code and an alpha prep, I'd prefer a slot for code and a slot for prep. Slot for prep wouldn't even be necessary, if you handwrite the functions. Here's an example of what I'm talking about: void
blit_blend_rgb_add_sse2(SDL_BlitInfo *info)
{
SETUP_SSE2_BLITTER;
const __m128i mm128_rgbmask =
_mm_set1_epi32(~(info->src->Amask | info->dst->Amask));
RUN_SSE2_BLITTER({
mm128_src = _mm_and_si128(mm128_src, mm128_rgbmask);
mm128_dst = _mm_adds_epu8(mm128_dst, mm128_src);
})
} This macro strategy would be more consistent with the AVX macro strategy and I believe it would be better suited to expand to the SSE2 alpha blitters in the future. I understand this will be more lines of code than what you have now, but I don't think it's bloated at all. I understand you're using the same strategy you used in the SSE2 fill and I didn't complain then, but this has a better argument for wanting to be more generic. Also, tell me about the |
I'm fine with this, I'll test the perf difference out of curiosity but I'll change that no matter what.
I was trying to replicate the filler strategy. My idea was to keep the macros generic for now. This way, if we want to modify a function like ADD in the future, we can simply remove one line and apply your strategy of using macros within a function definition. This means we won’t have to change all the macros, just replace individual instances with custom code. I understand your point and partially agree. However, one can simply look at the SSE blitters file to see that the blitters are defined with macros, which should give a clear picture of what’s happening.
The only reason I did that was to help the people working on this stuff when reading to make them know what's constant and what's not. So basically to help them read the code. |
I did not do any performance testing, I trust your results. I did do some correctness testing, with the following program: import random
import hashlib
import pygame
random.seed(36)
test_flags = [
pygame.BLEND_RGB_ADD,
pygame.BLEND_RGBA_ADD,
pygame.BLEND_RGB_SUB,
pygame.BLEND_RGBA_SUB,
pygame.BLEND_RGB_MAX,
pygame.BLEND_RGBA_MAX,
pygame.BLEND_RGB_MIN,
pygame.BLEND_RGBA_MIN,
pygame.BLEND_RGB_MULT,
pygame.BLEND_RGBA_MULT,
]
expected_hashes = [
"a32c63faa4e6485f523d89d6abf36915e3577666a15f582a6355f0928460ac0e",
"7a28ea1660965a1d696751fd752243e71a57b15ce093b4bf0161b77be44a0866",
"3eb1ce3593d4e5c849484476633426e206c28055f21670dacf4b75a889153150",
"d167fe901dbe6eabf4072642f6281ff259ee95f18488b56b8d2c8085c5100cf8",
"28ac1f3bd5b264780c270abeddfcbbf0db5fa60470a4d5bd95de85e560b205f6",
"5d0524379429a3d446573630bb0c0a84986685bffaf064f4c9e6a417a040a327",
"caa5a5c310ae532e3310ce1b4b52d0f76680e613c1d12349aa171b9d937c3250",
"1e8a730cbad33abc8be6e01338a8d9d2065cebf314d2dcae4c8c09e4ff648a0e",
"d4e17e77fd276f631860d51e861087cfb0e3c7bae04728a917fc645c8e92bbdb",
"2b63502cb372733967e476d076a044ae51566cbc70aa4e6e5295748505a3aaff",
]
surf_size = (20, 10)
offset = (3, 7)
def populate_surf(surf):
for y in range(surf.get_height()):
for x in range(surf.get_width()):
surf.set_at(
(x, y),
(
random.randint(0, 255),
random.randint(0, 255),
random.randint(0, 255),
random.randint(0, 255),
),
)
hashes = []
for flag in test_flags:
source = pygame.Surface(surf_size, pygame.SRCALPHA)
dest = pygame.Surface(surf_size, pygame.SRCALPHA)
populate_surf(source)
populate_surf(dest)
dest.blit(source, offset, special_flags=flag)
sha256 = hashlib.sha256()
sha256.update(pygame.image.tobytes(dest, "RGBA"))
digest = sha256.hexdigest()
assert digest == expected_hashes.pop(0)
hashes.append(digest)
print(digest)
print(hashes) I tested the results with SSE2 on main, AVX2 on main, and SSE2 on this branch. All of them had the same hashes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice code quality improvement! (And also perf improvement for mult)
02b7420
to
c648be6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I can confirm the small performance enhancement on my machine, and the reduction in code size speaks for itself. Nice work! 🎉
Similar to #2055, this PR tries to streamline and condense our blitter SSE2 code to make it easier to work with and understand/fix. This PR only implements the respective RGB and RGBA variants of the
ADD
,SUB
,MULT
,MIN
andMAX
. Makes the file 514 lines shorter.These changes don't affect performance negatively in any case, instead they improve performance for the SSE2 MULT/RGBA_MULT cases by about 15-21% due to the old version using a worse strategy.
I used two separate programs to test, one for generating the data and another for plotting and printing results.
And a warning, the generation program has some option at the top of the files which are to ensure the testing is fair and precise. This current setup generates .json files that could be very big so watch out or reduce the
REP
value.Data Program:
Visualization Program: