Add SSE2 intrinsics smoothscale backend #2473

Starbuck5 · 2023-09-29T08:15:54Z

Tackles the bulk of what I wanted in #1767

This was more difficult than I thought, this was quite difficult. I read through the existing smoothscale assembly (written assembly, not intrinsics) and the C implementations and combined what I saw into an SSE2 intrinsic version of the same algorithm.

I then had lots of ideas for optimizations on top of that, so for three of the filters it processes more pixels at a time than the existing SSE/MMX implementations.

This PR "just" implements the SSE2 backend, there are several follow on steps that would be quite nice:

Get it compiling and up at runtime on NEON, so ARM folks can get fast smoothscale
Remove legacy MMX/SSE backends eventually (we've been linking the compiled code for this in, at least on Windows, this is a part of the pygame codebase we don't know how to compile). Wanting to remove this uncompileable, unportable and ancient bit of our codebase was my chief motivation for opening this PR.
AVX2 backend? Now that we've got it in intrinsics, it's much easier to scale up the parallelism with other intrinsics.
Fix so if a malloc fails in the filters it will accurately report an error

Performance relative to SSE backend:

X-shrinking: 25% faster
Y-shrinking: 28% faster
X-expanding: neglible
Y-expanding: 25% faster

These are pixel perfect accurate relative to the SSE backend. Verified by saving out to pngs and checking the file hashes.

src_c/transform.c

MyreMylar · 2023-09-29T18:22:01Z

src_c/simd_transform_sse2.c

+    Uint16 *templine;
+    /* allocate a clear memory area for storing the accumulator line */
+    templine = (Uint16 *)calloc(dstpitch, 2);
+    if (templine == NULL) {
+        return;
+    }


Would it potentially be faster here to have this line allocated as a __m128i* type and then use the bit shifting intrinsic to do the equivalent of incrementing the accumulator? Probably not but I just noticed there was a bit of jumping into SIMD types and then back out again.

I often feel like I need a simple intrinsic testbed for this stuff where I can see what is happening in memory doing some basic operations like allocating, shifting a pointer. Maybe then I could master the nuances of aligning memory as well and see if that makes any difference to performance.

I'm always curious about what the intrinsics actually get compiled to. At first, I thought it was 1-to-1 script with the instructions but now I feel it's much more loose and compiler specific. After your comment here I played with putting some of this into godbolt, which did work to show the assembly that is generated, for what that's worth.

I did some tests last year that showed chaining intrinsics together into larger statements was more performant. So not defining middle states of the calculation into variables gives the compiler more freedom?

Alignment is an interesting thing to bring up here, I did a test on this using SDL_SIMDAlloc() so the accumulate could be loaded and stored with aligned instructions, and did that with __m128i*, and I got a small 7% improvement over the current iteration. However SDL_SIMDAlloc is only available in SDL 2.0.10, and we support down to SDL 2.0.9 for now, so I'll leave a comment. I've heard that alignment used to be extremely important and that it is less important for performance these days.

Reading through the algorithm again showed me an unnecessary pointer add though, I'll take that out and add a comment about pursuing SDL_SimdAlloc for that data region in the future.

MyreMylar · 2023-09-29T19:41:39Z

src_c/simd_transform_sse2.c

+                // The operation! Translated from old filter_shrink_X_SSE
+                // assembly. I don't understand the equivalence between
+                // these operations and the C version of these operations,
+                // but it works.
+                src = _mm_slli_epi16(src, 2);
+                dst = _mm_mulhi_epu16(src, mm_xcounter);
+                dst = _mm_add_epi16(dst, accumulate);
+                accumulate = _mm_mulhi_epu16(src, mm_xfrac);
+
+                dst = _mm_mulhi_epu16(dst, xrecip);


You could probably work through this in python. Something a bit like this, but with more than one test pixel and more looping:

red_channel = 128 green_channel = 2 blue_channel = 255 alpha_channel = 255 print("red_channel:", "{:016b}".format(red_channel)) print("green_channel:", "{:016b}".format(green_channel)) print("blue_channel:", "{:016b}".format(blue_channel), "\n") # doing _mm_slli_epi16(src_pixel, 2) red_shift = red_channel << 2 green_shift = green_channel << 2 blue_shift = blue_channel << 2 alpha_shift = alpha_channel << 2 print("red_shift:", "{:016b}".format(red_shift)) print("green_shift:", "{:016b}".format(green_shift)) print("blue_shift:", "{:016b}".format(blue_shift), "\n") xspace = 32768 # 0x04000 * 256 / 128 - from: 0x04000 * srcwidth / dstwidth xrecip = 1073741824 // xspace # 0x40000000 / xspace xcounter = 16384 print("xspace:", "{:016b}".format(xspace)) print("xcounter:", "{:016b}".format(xcounter)) print("xrecip:", "{:016b}".format(xrecip), "\n") # doing _mm_mulhi_epu16(src_pixel, xcounter) red_mul = (red_shift * xcounter) >> 16 green_mul = (green_shift * xcounter) >> 16 blue_mul = (blue_shift * xcounter) >> 16 alpha_mul = (alpha_shift * xcounter) >> 16 print("red_mul:", "{:016b}".format(red_mul)) print("green_mul:", "{:016b}".format(green_mul)) print("blue_mul:", "{:016b}".format(blue_mul), "\n") # doing _mm_mulhi_epu16(dst_pixel, xrecip) red_dst = (red_mul * xrecip) >> 16 green_dst = (green_mul * xrecip) >> 16 blue_dst = (blue_mul * xrecip) >> 16 alpha_dst = (alpha_mul * xrecip) >> 16 print("red_dst:", "{:016b}".format(red_dst)) print("green_dst:", "{:016b}".format(green_dst)) print("blue_dst:", "{:016b}".format(blue_dst), "\n")

MyreMylar

Looking exciting! Probably the most complicated conversion to SIMD operation yet.

This reminds me that we should probably have an issue tracking the functions in the transform module and what is potentially SIMD-able, what is not and what has already been SIMD'd.

Starbuck5 · 2023-09-30T05:45:36Z

Looking exciting! Probably the most complicated conversion to SIMD operation yet.

I'm glad to have someone to share it with! I'm very excited about this PR.

Starbuck5 · 2023-10-01T01:24:13Z

Here's one of my test scripts, an interactive one:

Interactive smoothscale test

# Change out background path to get running, change smoothscale backend to test behavior without rebuilding
import pygame

pygame.init()
screen = pygame.display.set_mode((800,600))

pygame.transform.set_smoothscale_backend("SSE2")
print(pygame.transform.get_smoothscale_backend())

background_orig = pygame.image.load(r"C:\Users\charl\Desktop\pygame-ce\examples\data\fist.png")
background_orig.set_colorkey((0,38,255))
background = pygame.Surface(background_orig.get_size(), pygame.SRCALPHA)
background.blit(background_orig, (0,0))

print(background)

clock = pygame.Clock()

while True:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit()
            raise SystemExit

    screen.fill("purple")
    background2 = pygame.transform.smoothscale(background, pygame.mouse.get_pos())
    screen.blit(background2, (0,0))
    pygame.display.flip()
    clock.tick(60)

MyreMylar

I ended up creating my own performance testing program to put this through it's paces:

Expand to see code

from sys import stdout
from pstats import Stats
from cProfile import Profile


import pygame

pygame.init()


def scale_down():
    dimensions = [
        (0, 0),
        (3, 3),
        (65, 65),
        (126, 127),
        (254, 255),
        (1024, 1024),
        (2047, 2048),
    ]
    scales = [1.3, 2, 3.2]
    for iterations in range(0, 1000):
        for w, h in dimensions:
            for scale in scales:
                src_surf = pygame.Surface((w, h), pygame.SRCALPHA, 32)
                src_surf.fill((100, 255, 128, 200))

                pygame.transform.smoothscale(
                    src_surf,
                    (
                        int(src_surf.get_width() / scale),
                        int(src_surf.get_height() / scale),
                    ),
                )


def scale_up():
    dimensions = [
        (0, 0),
        (3, 3),
        (65, 65),
        (126, 127),
        (254, 255),
        (1024, 1024),
        (2047, 2048),
    ]
    scales = [1.3, 2, 3.2]
    for iterations in range(0, 1000):
        for w, h in dimensions:
            for scale in scales:
                src_surf = pygame.Surface((w, h), pygame.SRCALPHA, 32)
                src_surf.fill((100, 255, 128, 200))

                pygame.transform.smoothscale(
                    src_surf,
                    (
                        int(src_surf.get_width() * scale),
                        int(src_surf.get_height() * scale),
                    ),
                )


if __name__ == "__main__":
    print("Scale Down")
    profiler = Profile()
    profiler.runcall(scale_down)
    stats = Stats(profiler, stream=stdout)
    stats.strip_dirs()
    stats.sort_stats("cumulative")
    stats.print_stats()

    print("Scale Up")
    profiler = Profile()
    profiler.runcall(scale_up)
    stats = Stats(profiler, stream=stdout)
    stats.strip_dirs()
    stats.sort_stats("cumulative")
    stats.print_stats()

Current results on main branch:

Scale Down
         84002 function calls in 50.543 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.715   12.715   50.543   50.543 smoothscale_speed_test.py:11(scale_down)
    21000   32.091    0.002   32.091    0.002 {built-in method pygame.transform.smoothscale}
    21000    5.731    0.000    5.731    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
    21000    0.004    0.000    0.004    0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
    21000    0.002    0.000    0.002    0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


Scale Up
         84002 function calls in 203.356 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   30.437   30.437  203.356  203.356 smoothscale_speed_test.py:37(scale_up)
    21000  166.979    0.008  166.979    0.008 {built-in method pygame.transform.smoothscale}
    21000    5.933    0.000    5.933    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
    21000    0.005    0.000    0.005    0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
    21000    0.002    0.000    0.002    0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Then the current results with this PR:

Scale Down
         84002 function calls in 41.358 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.520   12.520   41.358   41.358 smoothscale_speed_test.py:11(scale_down)
    21000   22.729    0.001   22.729    0.001 {built-in method pygame.transform.smoothscale}
    21000    6.103    0.000    6.103    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
    21000    0.004    0.000    0.004    0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
    21000    0.002    0.000    0.002    0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


Scale Up
         84002 function calls in 178.964 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   30.329   30.329  178.964  178.964 smoothscale_speed_test.py:37(scale_up)
    21000  142.608    0.007  142.608    0.007 {built-in method pygame.transform.smoothscale}
    21000    6.020    0.000    6.020    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
    21000    0.005    0.000    0.005    0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
    21000    0.002    0.000    0.002    0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Which looks to me like 30% faster scaling down and 15% faster scaling up.

Everything looks good to me here. 👍

src_c/simd_transform_sse2.c

itzpr3d4t0r

Alright I'm not gonna pretend like i know exactly how this works but i didn't see anything weird and looks like it's working fine. We can always get back to it if we find ways to improve the code further.

Better parallelization for Y-expand, Y-shrink, X-shrink + cleanups and fixes

Starbuck5 requested a review from a team as a code owner September 29, 2023 08:15

Starbuck5 marked this pull request as draft September 29, 2023 08:16

Starbuck5 added Performance Related to the speed or resource usage of the project transform pygame.transform labels Sep 29, 2023

Starbuck5 changed the title ~~Smoothscale intrinsics~~ Add SSE2 intrinsics smoothscale backend Sep 29, 2023

Starbuck5 commented Sep 29, 2023

View reviewed changes

src_c/transform.c Show resolved Hide resolved

ankith26 added the SIMD label Sep 29, 2023

MyreMylar reviewed Sep 29, 2023

View reviewed changes

Starbuck5 force-pushed the smoothscale-intrinsics branch 2 times, most recently from bcf8dad to bce8af0 Compare September 30, 2023 01:29

Starbuck5 marked this pull request as ready for review September 30, 2023 05:29

Starbuck5 added this to the 2.4.0 milestone Sep 30, 2023

MyreMylar mentioned this pull request Sep 30, 2023

SIMD-ifying the transform submodule - checklist #2476

Open

14 tasks

MyreMylar approved these changes Oct 1, 2023

View reviewed changes

MyreMylar assigned Starbuck5 Oct 7, 2023

MyreMylar requested a review from itzpr3d4t0r October 15, 2023 10:16

itzpr3d4t0r reviewed Oct 15, 2023

View reviewed changes

src_c/simd_transform_sse2.c Show resolved Hide resolved

src_c/simd_transform_sse2.c Show resolved Hide resolved

itzpr3d4t0r approved these changes Nov 2, 2023

View reviewed changes

Starbuck5 added 2 commits November 3, 2023 00:45

Initial smoothscale intrinsic implementation

4e9418a

Smoothscale intrinsic improvements

3027208

Better parallelization for Y-expand, Y-shrink, X-shrink + cleanups and fixes

Starbuck5 force-pushed the smoothscale-intrinsics branch from edf1a4e to 3027208 Compare November 3, 2023 07:47

Starbuck5 merged commit 0062ac4 into pygame-community:main Nov 3, 2023
29 of 30 checks passed

Starbuck5 deleted the smoothscale-intrinsics branch November 3, 2023 08:04

Starbuck5 mentioned this pull request Nov 4, 2023

Compile SSE2 smoothscale intrinsics on NEON #2544

Merged

Starbuck5 mentioned this pull request Nov 22, 2023

Smoothscale: deprecate old backends, update docs #2583

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SSE2 intrinsics smoothscale backend #2473

Add SSE2 intrinsics smoothscale backend #2473

Starbuck5 commented Sep 29, 2023 •

edited

MyreMylar Sep 29, 2023

Starbuck5 Sep 30, 2023 •

edited

MyreMylar Sep 29, 2023

MyreMylar left a comment

Starbuck5 commented Sep 30, 2023

Starbuck5 commented Oct 1, 2023

MyreMylar left a comment

itzpr3d4t0r left a comment

Add SSE2 intrinsics smoothscale backend #2473

Add SSE2 intrinsics smoothscale backend #2473

Conversation

Starbuck5 commented Sep 29, 2023 • edited

MyreMylar Sep 29, 2023

Choose a reason for hiding this comment

Starbuck5 Sep 30, 2023 • edited

Choose a reason for hiding this comment

MyreMylar Sep 29, 2023

Choose a reason for hiding this comment

MyreMylar left a comment

Choose a reason for hiding this comment

Starbuck5 commented Sep 30, 2023

Starbuck5 commented Oct 1, 2023

MyreMylar left a comment

Choose a reason for hiding this comment

itzpr3d4t0r left a comment

Choose a reason for hiding this comment

Starbuck5 commented Sep 29, 2023 •

edited

Starbuck5 Sep 30, 2023 •

edited