AVX Surface.fill() setup, AVX BLEND_ADD #2382

itzpr3d4t0r · 2023-08-05T16:38:17Z

When a blend flag is passed, our current implementation of Surface.fill() only processes one pixel at a time. That leaves a massive opportunity to speed things up with SIMD.
This PR starts those changes with BLEND_ADD.
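
For readers unfamiliar with the flag, here is a minimal pure-Python sketch of the per-pixel operation being sped up. It assumes BLEND_ADD adds the fill color to each channel and clamps at 255, and it ignores alpha; `blend_add_fill` is a hypothetical helper for illustration, not pygame's C code.

```python
def blend_add_fill(pixels, color):
    """Return pixels with `color` added per channel, clamped at 255.

    `pixels` is a list of (r, g, b) tuples; `color` is one (r, g, b) tuple.
    Hypothetical reference helper: one pixel at a time, like the old path.
    """
    return [
        tuple(min(channel + add, 255) for channel, add in zip(pixel, color))
        for pixel in pixels
    ]


# Same colors as the test program in this PR description.
print(blend_add_fill([(132, 33, 200)], (24, 24, 24)))
```

A SIMD version performs the same clamped addition on many channels per instruction instead of looping over them one by one.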

Results (seconds per 1000 calls on a 500x500 surface; each line is the mean of 10 repeats):

OLD FILL
2.042152139999234
2.060880370000814
1.977689829999872
2.0481357899989234
1.9969090399994456

NEW FILL
0.01658679000029224
0.01656728999951156
0.01650887000068906
0.016579669998463942
0.016620350000448526

BLIT (avx2 with cached surface)
0.02195337999946787
0.021780080000462478
0.02219769999937853
0.022039489999588113
0.02213338999863481

BLIT (avx2 no cached surface)
0.33580584000083036
0.34673017999957667
0.34221240999904695
0.33031794000053194
0.33174736999790183
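
As a rough reading of the numbers above, the sketch below averages each group and prints the ratios. The values are copied from the results; the variable names are only for illustration.

```python
from statistics import mean

# Timings copied from the results above (seconds per 1000 calls).
old_fill = [2.042152139999234, 2.060880370000814, 1.977689829999872,
            2.0481357899989234, 1.9969090399994456]
new_fill = [0.01658679000029224, 0.01656728999951156, 0.01650887000068906,
            0.016579669998463942, 0.016620350000448526]
blit_cached = [0.02195337999946787, 0.021780080000462478, 0.02219769999937853,
               0.022039489999588113, 0.02213338999863481]

# How many times faster the new fill is than the old fill.
fill_speedup = mean(old_fill) / mean(new_fill)
# How the new fill compares with the cached-surface AVX2 blit.
blit_ratio = mean(blit_cached) / mean(new_fill)

print(f"new fill vs old fill: {fill_speedup:.1f}x faster")
print(f"new fill vs cached blit: {blit_ratio:.2f}x faster")
```

On these figures the new fill comes out at a bit over 120 times faster than the old one, and somewhat faster than the cached-surface blit.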

Test Program:

from statistics import mean
from timeit import repeat

import pygame

pygame.init()

surf = pygame.Surface((500, 500))
surf.fill((132, 33, 200))

color = pygame.Surface((500, 500))
color.fill((24, 24, 24))

G = globals()

teststr = """
surf.fill((24, 24, 24), None, pygame.BLEND_ADD)
"""
for _ in range(5):
    print(mean(repeat(teststr, globals=G, number=1000, repeat=10)))

print()
teststr = """
surf.blit(color, (0, 0), None, pygame.BLEND_ADD)
"""
for _ in range(5):
    print(mean(repeat(teststr, globals=G, number=1000, repeat=10)))

Starbuck5 · 2023-08-09T05:28:59Z

Couple preliminary things.

I was interested to see if I could find any examples of people actually using Surface.fill with a blend flag online. I did, I found this: https://github.com/Rabbid76/PyGameExamplesAndAnswers/blob/master/documentation/pygame/pygame_blending_and_transaprency.md#change-the-color-of-a-surface-area-mask

AVX2 only works on x86, our SSE2 code also runs on ARM due to SSE2Neon.h. So SSE2 is more broadly important to us, since it will help x86 computers as well as ARM macs and other ARM devices.

I like that you're taking inspiration from my macro strategy. I see that you're using the SSE and AVX2 registers together, my blit macros use masked stores on the non-aligned edges so everything can be done with AVX2 registers and instructions. Is that something you'd like to do here?
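
The masked-store idea can be sketched in plain Python as follows. This is a model under stated assumptions, not the pygame macros: it assumes 32-bit pixels, eight lanes per 256-bit register, and a per-byte saturating add; `sat_add_bytes` and `fill_row_add` are hypothetical names. In C, the masked load/store and the add would correspond to intrinsics such as `_mm256_maskload_epi32`, `_mm256_maskstore_epi32` and `_mm256_adds_epu8`; whether the PR uses exactly these is an assumption.

```python
LANES = 8  # a 256-bit AVX2 register holds eight 32-bit pixels


def sat_add_bytes(pixel, color):
    """Add two packed 32-bit pixels byte by byte, clamping each byte at 255."""
    out = 0
    for shift in (0, 8, 16, 24):
        total = ((pixel >> shift) & 0xFF) + ((color >> shift) & 0xFF)
        out |= min(total, 255) << shift
    return out


def fill_row_add(row, color):
    """Fill one row in place, LANES pixels per step, masking the ragged tail."""
    n = len(row)
    full = n - n % LANES

    # Full-width steps: every lane is active.
    for start in range(0, full, LANES):
        chunk = row[start:start + LANES]
        row[start:start + LANES] = [sat_add_bytes(p, color) for p in chunk]

    # Tail: same register width, but only the first `remaining` lanes are active.
    remaining = n - full
    if remaining:
        mask = [lane < remaining for lane in range(LANES)]
        # Masked load: inactive lanes read as zero and never touch memory.
        reg = [row[full + lane] if mask[lane] else 0 for lane in range(LANES)]
        reg = [sat_add_bytes(p, color) for p in reg]
        # Masked store: only active lanes are written back.
        for lane in range(LANES):
            if mask[lane]:
                row[full + lane] = reg[lane]


row = [0x00102030] * 11  # 11 pixels: one full step plus a 3-pixel tail
fill_row_add(row, 0x00F0F0F0)
print([hex(p) for p in row])
```

The point of the design is that the tail reuses the same wide code path as the body, so no separate narrower (SSE) routine is needed for leftover pixels.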

itzpr3d4t0r · 2023-08-09T07:15:36Z

> AVX2 only works on x86, our SSE2 code also runs on ARM due to SSE2Neon.h. So SSE2 is more broadly important to us, since it will help x86 computers as well as ARM macs and other ARM devices.

If you mean that I should do the SSE2 version first, I guess it's fine to treat this PR as the setup for AVX2, with the SSE2 setup and BLEND_ADD version coming next instead of further AVX2 work. After those two PRs we could implement the SSE2 and AVX2 versions of each remaining flag either together in a single PR or separately.

> I like that you're taking inspiration from my macro strategy.

yep =).

> I see that you're using the SSE and AVX2 registers together, my blit macros use masked stores on the non-aligned edges so everything can be done with AVX2 registers and instructions. Is that something you'd like to do here?

Yeah, I've already replicated your approach to compare performance and see the benefits. Tbh I thought about the "AVX only" strategy myself before seeing your implementation. In practice I didn't see much of a difference; performance was basically the same. I've also replaced the unrolled loops with a plain for loop. The main benefit is that we would need a single code path for filling instead of two, which is good. I haven't pushed that yet, but it might be wise for simplicity.

Starbuck5 · 2023-08-13T09:32:51Z

You’re talking about two codes, in my macro you only need 1, and you only need AVX for the AVX blitter. This is not about performance, this is about code simplicity. And this is the approach I would prefer.

Unrolled vs normal for loop: I don't care too much. Unrolled has a larger code size, so if there's no measurable benefit I'd do a normal for loop.

Starbuck5 · 2023-08-21T08:12:05Z

I saw that you moved to my favored strategy and then moved away from it. I think there are speedups to be had in your implementation of my suggested strategy.

Another consideration is code size: your current implementation copy-pastes the "add" code into the final binary 14 times by my count, because of the loop unrolling and the macro. Code size would have a bigger impact if this were a bigger routine (like a blit), rather than just a handful of instructions, but it's something to keep in mind.

MyreMylar · 2023-10-15T09:50:40Z

looks like this needs a merge with main to get past the old CircleCI failure.

MyreMylar

Alright, LGTM 👍 (passes all my visual tests, and I also see the expected speedup)

SIMD with the add blend is so nice and straightforward! 🎉

Starbuck5

This looks good to me.

On to an SSE2 implementation, and roll out to other flags?

Starbuck5 · 2023-11-12T06:29:49Z

I'd like this to be squashed down a bit before merge, please.

itzpr3d4t0r added the Performance (related to the speed or resource usage of the project) label Aug 5, 2023

itzpr3d4t0r marked this pull request as ready for review August 5, 2023 19:51

itzpr3d4t0r requested a review from a team as a code owner August 5, 2023 19:51

itzpr3d4t0r added the Surface (pygame.Surface) label Aug 6, 2023

itzpr3d4t0r force-pushed the surface-fill-add-optimization branch from b2bd30d to 5d7f47c Compare August 6, 2023 08:03

itzpr3d4t0r mentioned this pull request Aug 11, 2023

Surface.fill() performance improvements #2390

Closed

itzpr3d4t0r changed the title ~~SIMD'd Surface.fill when using BLEND_ADD~~ AVX Surface.fill() setup, AVX BLEND_ADD Aug 15, 2023

Starbuck5 reviewed Aug 21, 2023

src_c/surface_fill.c (outdated)

Starbuck5 reviewed Aug 21, 2023

src_c/simd_surface_fill_avx2.c (outdated)

dr0id reviewed Sep 10, 2023

src_c/surface_fill.c

itzpr3d4t0r force-pushed the surface-fill-add-optimization branch from da66157 to 46a4483 Compare September 11, 2023 21:56

Temmie3754 reviewed Oct 9, 2023

src_c/simd_fill.h (outdated)

itzpr3d4t0r and others added 11 commits October 15, 2023 11:54

optimizes the BLEND_ADD flag when used in surface.fill

f876bd8

now loading color via macro, enabling direct color preprocessing instead of vector calculations.

21debc5

addressed concerns. Less code required. Refactors.

40537fe

optimization and fixes

ef127c7

rollback to LOOP_UNROLLED4 for the 8-pixels case to restore performance

b2dd365

fix

cd09175

Now using for loop instead of loop unrolled. Excess pixels processed at the end.

4622395

fix

97601cb

format

7e063bf

refactors, added more comments

f24f290

Shortened required code

567366c

itzpr3d4t0r added 8 commits October 15, 2023 11:54

reverted back to old strategy without dispatch

c677085

try to setup sse2

326b9b7

preliminary stuff

4fb84b3

removed simd_shared include and function bodies

1cbb731

fixes?

9d6123b

removed pitch safeguards

9287f5e

remove sse2

7d5354e

removed a comment, simplified bpp calculations

de7b49b

itzpr3d4t0r force-pushed the surface-fill-add-optimization branch from 1e75d63 to de7b49b Compare October 15, 2023 09:55

undo unwanted changes

ac6ce17

MyreMylar approved these changes Oct 15, 2023

Starbuck5 added the SIMD label Oct 26, 2023

itzpr3d4t0r requested a review from Starbuck5 November 11, 2023 10:55

Starbuck5 approved these changes Nov 12, 2023

itzpr3d4t0r merged commit 3ac78fc into pygame-community:main Nov 12, 2023
30 checks passed

itzpr3d4t0r added this to the 2.4.0 milestone Nov 12, 2023

itzpr3d4t0r mentioned this pull request Nov 12, 2023

Added missing AVX2 fillers #2565

Merged

itzpr3d4t0r deleted the surface-fill-add-optimization branch November 12, 2023 10:22

itzpr3d4t0r mentioned this pull request Nov 12, 2023

Add SSE2 fillers #2566

Merged

itzpr3d4t0r mentioned this pull request Jan 21, 2024

Alpha fillers #2682

Open
