Optimize vector parsing in math.c #2443

Starbuck5 · 2023-09-10T02:25:53Z

Strategy: the vector-compatible check (pgVectorCompatible_Check) and the sequence-to-coords conversion (PySequence_AsVectorCoords) duplicated some checks, so a single unified function could reduce the overall runtime.

Conclusions: I had hoped to optimize vector_generic_math, but the results are mixed as you can see below. The runs are too variable for me to confidently say it's a speedup or slowdown in that area. However, this is definitely a speedup for Vector methods that take in and process vectors, especially when called with sequences instead of other vectors. ~~Vector methods are now 7% faster on average when taking vector arguments, 18% faster when taking sequence arguments.~~ Vector methods are now 5.5% faster on average when taking vector arguments, 20% faster when taking sequence arguments.

There are more places in the code where pgVectorCompatible_Check and PySequence_AsVectorCoords are used together, but they are in harder-to-test areas such as the elementwise proxy, so I'm leaving those alone for now.
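To illustrate the strategy, here is a minimal Python sketch of the idea, not the C code from this PR. All names (`Vec`, `is_vector_compatible`, `as_vector_coords`, `coords_from_obj`) are hypothetical stand-ins. The first two functions repeat the same type and length checks when called back to back, while the unified one validates and converts in a single pass.

```python
from typing import Optional, Tuple


class Vec:
    """Hypothetical stand-in for a pygame vector: just holds float coords."""

    def __init__(self, *coords: float) -> None:
        self.coords = tuple(float(c) for c in coords)


def is_vector_compatible(obj: object, dim: int) -> bool:
    """Check only: a Vec of the right size, or a sequence of `dim` numbers."""
    if isinstance(obj, Vec):
        return len(obj.coords) == dim
    if not isinstance(obj, (list, tuple)) or len(obj) != dim:
        return False
    return all(isinstance(c, (int, float)) for c in obj)


def as_vector_coords(obj: object, dim: int) -> Tuple[float, ...]:
    """Convert only: repeats the type and length checks done above."""
    if isinstance(obj, Vec):
        return obj.coords
    if not isinstance(obj, (list, tuple)) or len(obj) != dim:
        raise TypeError("expected a vector-like object")
    return tuple(float(c) for c in obj)


def coords_from_obj(obj: object, dim: int) -> Optional[Tuple[float, ...]]:
    """Unified: one pass that validates and converts, None on failure."""
    if isinstance(obj, Vec):
        return obj.coords if len(obj.coords) == dim else None
    if not isinstance(obj, (list, tuple)) or len(obj) != dim:
        return None
    out = []
    for c in obj:
        if not isinstance(c, (int, float)):
            return None
        out.append(float(c))
    return tuple(out)
```

A caller that previously wrote `if is_vector_compatible(x, 2): coords = as_vector_coords(x, 2)` would instead call `coords_from_obj(x, 2)` once and test the result for `None`.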

Benchmarking results (several runs before/after averaged):

vec2_1.move_towards(vec2_2, 4): 0.8714175000000001 (3.349% faster)
vec2_1.move_towards(vec2like, 4): 1.06382 (12.487% faster)
vec2_1.move_towards_ip(vec2_2, 4): 0.7681425000000001 (3.353% faster)
vec2_1.move_towards_ip(vec2like, 4): 0.89942 (17.049% faster)
vec3_1.move_towards(vec3_2, 4): 0.9133575 (0.082% faster)
vec3_1.move_towards(vec3like, 4): 1.0817775 (16.579% faster)
vec3_1.move_towards_ip(vec3_2, 4): 0.7777825 (4.29% faster)
vec3_1.move_towards_ip(vec3like, 4): 0.9343374999999999 (20.735% faster)
vec2_1.cross(vec2_2): 0.4231025 (9.083% faster)
vec2_1.cross(vec2like): 0.6134875 (21.03% faster)
vec3_1.cross(vec3_2): 0.488765 (6.727% faster)
vec3_1.cross(vec3like): 0.7261725 (23.767% faster)
vec2_1.angle_to(vec2_2): 0.5835575 (8.587% faster)
vec2_1.angle_to(vec2like): 0.7513475 (22.943% faster)
vec3_1.angle_to(vec3_2): 0.5254425 (14.304% faster)
vec3_1.angle_to(vec3like): 0.7385475 (23.836% faster)
vec3_1.rotate(34, vec3_2): 0.9605474999999999 (1.951% faster)
vec3_1.rotate_ip(34, vec3_2): 0.85734 (3.164% faster)
vec2_1 + vec2_2: 0.27509249999999996 (3.111% faster)
vec2_1 + vec2like: 0.477885 (25.886% faster)
vec2_1 * vec2_2: 0.2107525 (2.365% faster)
vec2_1 * vec2like: 0.4136725 (26.987% faster)
vec2_1 * 2.3: 0.356165 (-9.052% faster)
vec3_1 + vec3_2: 0.3149825 (2.445% faster)
vec3_1 + vec3like: 0.52228 (30.443% faster)
vec3_1 * vec3_2: 0.245065 (8.35% faster)
vec3_1 * vec3like: 0.4576 (31.846% faster)
vec3_1 * 2.3: 0.37668 (-7.955% faster)

Average change = 11.705 %
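For reading the numbers above: the "% faster" figure comes from the formula in the benchmarking script, so a negative value means the new build was slower for that statement. The sketch below restates that formula and inverts it to estimate the "before" time, which the output does not print. The helper names are hypothetical, and because the listed values are averages over several runs, the recovered times are only approximate.

```python
def percent_faster(prev: float, new: float) -> float:
    """Same formula as the benchmarking script: positive means `new` is quicker."""
    return -round((new - prev) / abs(prev) * 100, 3)


def implied_previous(new: float, change: float) -> float:
    """Invert the formula to estimate the 'before' time, which the output omits."""
    return new / (1 - change / 100)


# vec2_1.cross(vec2like): 0.6134875 s after the change, reported as 21.03% faster
before = implied_previous(0.6134875, 21.03)

# vec2_1 * 2.3: reported as -9.052% faster, so the 'before' time was smaller
before_scalar = implied_previous(0.356165, -9.052)
```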

Benchmarking script

import timeit

import pygame

vec2_1 = pygame.Vector2(36,6.4)
vec2_2 = pygame.Vector2(4,5)
vec2like = (50, 60)

vec3_1 = pygame.Vector3(36,6.4,-99)
vec3_2 = pygame.Vector3(4,5,56)
vec3like = (50, 60, 20.4)

def bench(statement, globals=globals(), number=5000000):
    return round(timeit.timeit(statement, globals=globals, number=number), 5)

# Put previous output results here to automatically generate comparison
prevresults = []

thisresults = []

thischanges = []

def printbench(statement):
    v2 = bench(statement)
    thisresults.append(v2)

    if prevresults:
        v1 = prevresults[len(thisresults)-1]
        change = -round((v2 - v1) / abs(v1) * 100, 3)
        thischanges.append(change)
        print(f"{statement}: {v2} ({change}% faster)")
    else:
        print(f"{statement}: {v2}")

printbench("vec2_1.move_towards(vec2_2, 4)")
printbench("vec2_1.move_towards(vec2like, 4)")
printbench("vec2_1.move_towards_ip(vec2_2, 4)")
printbench("vec2_1.move_towards_ip(vec2like, 4)")

printbench("vec3_1.move_towards(vec3_2, 4)")
printbench("vec3_1.move_towards(vec3like, 4)")
printbench("vec3_1.move_towards_ip(vec3_2, 4)")
printbench("vec3_1.move_towards_ip(vec3like, 4)")

printbench("vec2_1.cross(vec2_2)")
printbench("vec2_1.cross(vec2like)")

printbench("vec3_1.cross(vec3_2)")
printbench("vec3_1.cross(vec3like)")

printbench("vec2_1.angle_to(vec2_2)")
printbench("vec2_1.angle_to(vec2like)")

printbench("vec3_1.angle_to(vec3_2)")
printbench("vec3_1.angle_to(vec3like)")

printbench("vec3_1.rotate(34, vec3_2)")
printbench("vec3_1.rotate_ip(34, vec3_2)")

printbench("vec2_1 + vec2_2")
printbench("vec2_1 + vec2like")
printbench("vec2_1 * vec2_2")
printbench("vec2_1 * vec2like")
printbench("vec2_1 * 2.3")

printbench("vec3_1 + vec3_2")
printbench("vec3_1 + vec3like")
printbench("vec3_1 * vec3_2")
printbench("vec3_1 * vec3like")
printbench("vec3_1 * 2.3")

if thischanges:
    print()
    print("Average change =", round(sum(thischanges)/len(thischanges), 3), "%")
    print()

print(thisresults)

itzpr3d4t0r

Nice PR. I'm sure we can build on it further in the future; the whole module is full of places we can work on.
I'd suggest using mean(timeit.repeat(...)) with a high repeat count for benchmarking instead of a single timeit call that accumulates runtimes. It yields smoother, more reliable results because outliers don't inflate the measured times as much.

And a side note.
Generally speaking, the idea of keeping a stack-allocated array for the case where a sequence is passed, plus a single pointer that points either at that array or at an actual vector's coords, is a valid optimization across the whole module. It avoids a memcpy in the vector case, which is probably what matters most.
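As a rough illustration of the side note, here is a Python analogy of the pointer-switching idea. It is not the module's C code, and `Vec`, `borrow_coords`, and `dot` are hypothetical names. A reference stands in for the C pointer and a preallocated list stands in for the stack array.

```python
from typing import List, Sequence, Union


class Vec:
    """Hypothetical stand-in for a vector object that owns its coords."""

    def __init__(self, *coords: float) -> None:
        self.coords: List[float] = [float(c) for c in coords]


def borrow_coords(obj: Union[Vec, Sequence[float]], scratch: List[float]) -> List[float]:
    """Return the vector's own storage, or `scratch` filled from a sequence."""
    if isinstance(obj, Vec):
        return obj.coords  # vector case: no copy, just another reference
    if len(obj) != len(scratch):
        raise ValueError("sequence has the wrong number of coordinates")
    for i, c in enumerate(obj):
        scratch[i] = float(c)  # sequence case: convert into the local buffer
    return scratch


def dot(a: Vec, other: Union[Vec, Sequence[float]]) -> float:
    scratch = [0.0] * len(a.coords)  # plays the role of the stack array
    b = borrow_coords(other, scratch)  # the single "pointer" used afterwards
    return sum(x * y for x, y in zip(a.coords, b))
```

Whether skipping the copy actually pays off in C depends on the compiler, so it would need to be measured rather than assumed.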

Starbuck5 · 2023-09-12T04:09:59Z

> I'd suggest using mean(timeit.repeat(...)) with a high repeat count for benchmarking instead of a single timeit call that accumulates runtimes. It yields smoother, more reliable results because outliers don't inflate the measured times as much.

I swapped out my timeit.timeit call with min(timeit.repeat(statement, globals=globals, repeat=1000, number=1000)) * 1000 and got the following results:

Results

vec2_1.move_towards(vec2_2, 4): 0.16205 (6.774% faster)
vec2_1.move_towards(vec2like, 4): 0.1976 (14.809% faster)
vec2_1.move_towards_ip(vec2_2, 4): 0.14465 (6.269% faster)
vec2_1.move_towards_ip(vec2like, 4): 0.1703 (19.088% faster)
vec3_1.move_towards(vec3_2, 4): 0.165325 (7.212% faster)
vec3_1.move_towards(vec3like, 4): 0.20437500000000003 (18.987% faster)
vec3_1.move_towards_ip(vec3_2, 4): 0.145225 (9.658% faster)
vec3_1.move_towards_ip(vec3like, 4): 0.18039999999999998 (22.124% faster)
vec2_1.cross(vec2_2): 0.080625 (12.029% faster)
vec2_1.cross(vec2like): 0.1172 (22.653% faster)
vec3_1.cross(vec3_2): 0.09242500000000001 (11.555% faster)
vec3_1.cross(vec3like): 0.1359 (27.034% faster)
vec2_1.angle_to(vec2_2): 0.1089 (13.037% faster)
vec2_1.angle_to(vec2like): 0.144625 (23.872% faster)
vec3_1.angle_to(vec3_2): 0.10072500000000001 (16.428% faster)
vec3_1.angle_to(vec3like): 0.141625 (24.285% faster)
vec3_1.rotate(34, vec3_2): 0.1841 (5.674% faster)
vec3_1.rotate_ip(34, vec3_2): 0.16610000000000003 (5.504% faster)
vec2_1 + vec2_2: 0.05325 (2.473% faster)
vec2_1 + vec2like: 0.093575 (21.349% faster)
vec2_1 * vec2_2: 0.041475 (4.983% faster)
vec2_1 * vec2like: 0.08097499999999999 (24.797% faster)
vec2_1 * 2.3: 0.07072500000000001 (-12.934% faster)
vec3_1 + vec3_2: 0.053875 (18.556% faster)
vec3_1 + vec3like: 0.097925 (31.521% faster)
vec3_1 * vec3_2: 0.042475 (24.455% faster)
vec3_1 * vec3like: 0.08814999999999999 (32.94% faster)
vec3_1 * 2.3: 0.06772500000000001 (1.24% faster)

Average change = 14.87 %

I'm happy to see it shows a higher speedup, with a +14.87% average now.
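For comparison, the timing strategies discussed in this thread can be sketched side by side. This is a minimal, self-contained example and not the script used for the numbers above: the helper names are hypothetical, the statement uses plain tuples so pygame is not required, and the repeat counts are reduced to keep it quick. The Python documentation for `timeit.repeat` notes that the minimum of the results is usually the most useful figure, since higher values typically reflect interference from other processes.

```python
import statistics
import timeit


def bench_single(statement: str, env: dict, number: int = 100_000) -> float:
    """One accumulated measurement, in the style of a single timeit.timeit call."""
    return timeit.timeit(statement, globals=env, number=number)


def bench_repeat(statement: str, env: dict, repeat: int = 50, number: int = 1000):
    """Many short measurements; return (min, mean) of the batch times in seconds."""
    times = timeit.repeat(statement, globals=env, repeat=repeat, number=number)
    return min(times), statistics.mean(times)


env = {"a": (36, 6.4), "b": (4, 5)}  # plain tuples so pygame is not required
stmt = "a[0] * b[0] + a[1] * b[1]"

single = bench_single(stmt, env)
fastest, average = bench_repeat(stmt, env)
print(f"single: {single:.5f}s  min: {fastest:.6f}s  mean: {average:.6f}s")
```

The minimum is never larger than the mean, and the gap between the two gives a rough sense of how noisy the machine was during the run.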

Starbuck5 · 2023-09-12T04:14:04Z

@itzpr3d4t0r

> And a side note.
> Generally speaking, the idea of having 2 pointers, one with a stack allocated array in the case of a sequence being passed and a single pointer that switches between the array and an actual vector's coords memory address is a valid optimization across the whole module. It avoids having to use memcpy in the vector case, which is probably what matters most.

You've suggested this in 4 separate comments on this PR; I replied when you brought it up in Ankith's thread: #2443 (comment). My initial testing actually found the code was slower without the memcpy, somehow. Maybe the compiler is optimizing a fixed-size memcpy away? Assuming this PR gets merged, you can try this yourself, and if it shows a speedup, make a follow-up PR.

ankith26

LGTM, thanks for the PR 🎉

Left a review for your consideration, resolve at will

Starbuck5 · 2023-09-14T04:58:59Z

Alright @itzpr3d4t0r now it's on you: further optimizations 😄 📈

Starbuck5 requested a review from a team as a code owner September 10, 2023 02:25

Starbuck5 added the Performance (related to the speed or resource usage of the project) and math (pygame.math) labels Sep 10, 2023

dr0id reviewed Sep 10, 2023

ankith26 reviewed Sep 10, 2023

Starbuck5 added 2 commits September 10, 2023 20:27

Add pg_VectorCoordsFromObj to optimize vectors

7c801bf

It combines the functionality of pgVectorCompatible_Check and PySequence_AsVectorCoords, which optimizes the overall runtime because those two functions have duplicate checks.

Rework error strategy a bit for optimization

f6e5d4c

Starbuck5 force-pushed the Add-unified-vector-check-and-get branch from 9ab246c to f6e5d4c Compare September 11, 2023 03:27

Use PySequence_ITEM since already checked sequence

c128495

+Fix comment

itzpr3d4t0r requested changes Sep 11, 2023

Use Py_XDECREF

1f5665d

itzpr3d4t0r approved these changes Sep 12, 2023

ankith26 approved these changes Sep 12, 2023

Starbuck5 merged commit c69d9a6 into pygame-community:main Sep 14, 2023
32 checks passed

Starbuck5 deleted the Add-unified-vector-check-and-get branch September 14, 2023 04:58

Starbuck5 added this to the 2.4.0 milestone Sep 16, 2023
