Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPU specific modules (neon, SSE4, etc). #1723

Open
illume opened this issue May 12, 2020 · 17 comments
Open

CPU specific modules (neon, SSE4, etc). #1723

illume opened this issue May 12, 2020 · 17 comments
Labels
hard A hard challenge to solve Performance Related to the speed or resource usage of the project Platform: Raspberry Pi Issues with the Raspberry Pi platform tooling

Comments

@illume
Copy link
Member

illume commented May 12, 2020

As a user I want to install and distribute pygame where the CPU specific optimizations are included for optimized functions.

As part of the PR Adds MMX, SSE2 & Arm NEON optimised versions of blit_blend_premultiplied() some more SIMD optimized functions were made. However there's no runtime detection for those functions, so some are disabled by default.

@illume
Copy link
Member Author

illume commented May 12, 2020

This is based on some of the discussion in #1711 ...


Here's another way we could hack up distutils based on python modules and compilation flags. (obviously these ways would be out of scope for this PR)

Currently Setup.SDLX.in has stuff like this.

surface src_c/surface.c src_c/alphablit.c src_c/surface_fill.c $(SDL) $(DEBUG)

Adding CPU specific modules could be done by adding flags (-mfpu=neon) to the end like this:

surface_fpu_neon src_c/surface.c src_c/alphablit.c src_c/surface_fill.c $(SDL) $(DEBUG) -mfpu=neon

Then the init machinery (__init__.py) can load the correct surface (or surface_fpu_neon) module based on the CPU capabilities. I guess all the other C based routines would also be sped up because of better compiler optimizations too.

@MyreMylar
Copy link
Contributor

MyreMylar commented May 12, 2020

Was just quoting that :)

I think it would require us to:

  • Pull out the SSE2/ARM NEON optimised functions into a separate .c file.
  • Test it to see if it works. In theory with the current state of the code the intrinsic functions are never run at runtime anyway if there is no platform support, so that part shouldn't need any changes.

All we are really trying to do is isolate the impact of the -mfpu=neon option to a single bit of code rather than have it accidentally affect the whole project.

@illume
Copy link
Member Author

illume commented May 13, 2020

Another possibility is compiling the whole wheel into a separate package.
Such as pygame_neon or pygame_sse42. This could potentially work better with other tools. Then the main pygame could try and import that.

Speaking of other tools: things like cx_freeze and pyinstaller need to be considered.

Also, we can distribute our own wheels however we want, but how can other distro packages (such as Debian pygame) take advantage of the solution?

@MyreMylar
Copy link
Contributor

A discussion of problems Google had with compiler specific modules using AVX:

https://randomascii.wordpress.com/2016/12/05/vc-archavx-option-unsafe-at-any-speed/

@MyreMylar
Copy link
Contributor

I discussed this a little bit on the end of this issue too:

#1229

@MyreMylar MyreMylar added the hard A hard challenge to solve label May 15, 2020
@illume
Copy link
Member Author

illume commented Dec 21, 2020

There was some more discussion in here about an issue with neon compilation on pi: #2373

@MyreMylar MyreMylar added the Platform: Raspberry Pi Issues with the Raspberry Pi platform label Jan 20, 2021
@ankith26
Copy link
Contributor

ankith26 commented Jan 20, 2021

The arm-neon optimisations are much needed, as they give a significant speed boost for pygame (checked on a raspberry pi 3). They might also be of advantage to other arm based systems, like android and the new M1 macs. Also, my theory is that with Apple shifting to ARM, and the switch proving effective, more companies are likely follow the lead. Soon we might see more ARM systems entering the market, and one fine day in the future, ARM might even end up overtaking intel/amd.
So setting up arm optimisations and enabling them by default sure seems like a good idea to me.

Creating seperate files, conditional compilation, runtime detection, all that sounds very good. But its a bit too complicated, and is a lot of work. I have a much simpler solution, and that is changing this line of code

pygame/setup.py

Line 136 in 91df485

if consume_arg('-enable-arm-neon'):

With

# ref: https://en.wikipedia.org/wiki/Uname
if platform.machine() in ["armv7l", "armv8l", "arm64", "aarch64"]:

Yes, this is compile-time platform detection and wheels compiled for these platforms will only work on these platforms (isn't that the point of wheels anyways). Things like creating an armv7l wheel and later renaming the wheel to armv6l will no longer work if this change is applied (that is a bad idea anyways, because you never know what other problems doing that creates). This change might create some issues for the folk at piwheels, so we must make a decision considering that in mind (we do not want to bother the people at piwheels, if we end up doing a change like this, we might have to consider building arm linux wheels ourselves or something like that).
Also, the current behavior applies the compiler flag -mfpu=neon to all the files. I am in favour of retaining this behavior, because it seems to optimise more stuff. This might need more testing (I have not come across any issues while testing on my pi3b, though)

@MyreMylar
Copy link
Contributor

This build issue was discussed a while ago with piwheels here in case anyone else is missing the context:

#1229

Essentially right now piwheels hosts an 'arm6' and an 'arm7' wheel publicly but in 99% of cases those are actually the same 'arm7' wheel built from source & renamed automatically by script. Having actually separate arm7 and arm6 wheels is instead a manual hand compilation process which (I think) was only being done for new releases of openCV.

Why this matters is because on raspberry pi, piwheels is the default repository when users do a pip install pygame rather than the PyPI that most other platforms use (and that we upload to) and I suspect where they get the pygame from that gets preinstalled on Raspberry Pi's as well.

So while making separate arm6 & arm7 wheels seems like an easy solution for us working on pygame, it is not for the piwheels maintainers.

@MyreMylar
Copy link
Contributor

...though I see in this blog post:

https://blog.piwheels.org/new-opencv-builds-including-opencv-4-4-x/

..That piwheels are, as of September last year, building OpenCV automatically too. So maybe there is something to be worked out here with @bennuttall to have separate genuine arm6 and arm7 wheels of pygame.

@bennuttall
Copy link

I'd be happy to build armv7 wheels separately (because it's pygame) if possible.

Is there a way to supply the compiler flag to the pip command?

@MyreMylar
Copy link
Contributor

MyreMylar commented Jan 20, 2021

Is there a way to supply the compiler flag to the pip command?

Do you mean - does this work:

python setup.py install -enable-arm-neon

Because if so, yes.

I don't know of the top of my head if pip passes along arguments to setup.py if it can't find a wheel. I'm generally not much of an expert on our building process, but I did add that one flag.

I believe we would just need that neon flag enabled for the arm7 (pi3 & pi4) wheel and disabled for the arm6 (pi Zero etc) wheel.

@bennuttall
Copy link

I meant something like pip wheel pygame==2.0.0 -enable-arm-neon (I guess that won't work as enable-arm-neon is not a pip flag) - but maybe an env var like ENABLE_ARM_NEON=1 pip wheel pygame==2.0.0

I ask because that's the normal way we build wheels - using the pip wheel command - it means you get the source dist that was released to PyPI for any given version, rather than cloning from github and finding the right commits to build from.

@robertpfeiffer
Copy link
Contributor

robertpfeiffer commented Jan 21, 2021

You may need to compile SDL2 itself with the right flags for aarch64 on the pi4. The last time I looked at the SDL2 that ships with Raspberry Pi OS, I found that a couple of spicy features were disabled. I don't know if we even need the rpi SDL video driver on the Pi 4, or if OpenGL ES under X11 just works now. I also don't know which optimisations are enabled. As far as I can tell, nobody has benchmarked the impact of building SDL2 with --enable-arm-simd or --enable-arm-neon on the Pi 4.

In another thread, a Gentoo user reported that on aarch64, neon does not work, but it was unclear whether SDL2 or PyGame was the culprit. It might be Gentoo itself.

In the short term, I quite like @ankith26's idea for detecting running on arm and doing the right thing be default, but 1. I would keep the command line flag for compatibility 2. I would disable these kinds of detection when a certain environment var is set, so you can cross-compile from a RPi to another target arch (I know, pretty niche use case, but theoretically a source of bugs). I set PYGAME_CROSS_COMPILE in the p4a build recipe so I could disable detection of ABI or CPU features in setup.py, but then I didn't use that. We should probably document this or I'll have to make a pull request at kivy to remove it from the p4a recipe.

@bennuttall
Copy link

Ok, one red line I'd have to draw would be creating a Pi4-only wheel and publishing it to piwheels as an armv7 wheel (because Pi 3 users would get it and it wouldn't work for them), as there's no way to distinguish.

If there's a way I can build an armv7 wheel that's more optimised but not limited to Pi 4 then I'd be happy to. Otherwise, best bet is for pygame to host its own optimised wheels for individual Pi models.

piwheels doesn't build for aarch64 yet but we intend to. Maintainers can already publish aarch64 wheels to PyPI which can be picked up by Pi 4 aarch64 users (e.g. numpy does this). But for packages that don't produce aarch64 wheels, we intend to build them in future.

@ankith26
Copy link
Contributor

ankith26 commented Jan 21, 2021

Yes, those armv7l wheels would indeed work on the pi2, pi3 and pi4 regardless of which pi it was compiled on. It’s just that an armv7l compiled (optimised) wheel would not work on armv6l by just renaming the wheel.

We need to do some work from our end to properly optimize stuff for aarch64, do some platform detection based compilation and also make it configurable via env vars, so that it works even with the pip wheel command. I will try to open a PR for this (soonish)

@bennuttall
Copy link

Great. That sounds ideal. Ping me when you want me to take a look.

@ankith26
Copy link
Contributor

ankith26 commented Dec 28, 2021

Okay, to make things easier for everyone (like I already mentioned in the linked issue), runtime detection is the way to go. That would also make armv7 to armv6 wheel renaming work fine, so piwheels workflow is not disturbed and we don't give you any extra work 😄
OFC, renaming an aarch64 wheel to armv7 or armv6 is a bad idea and we shouldn't be doing this, so we only need runtime detection for armv7 and not armv8. The optimisations are enabled by default on aarch64 wheels

We generally need a more robust system for platform detection for SIMD, and even on intel codepaths, perhaps refactoring SIMD code into seperate files loaded after a runtime check is the way to go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hard A hard challenge to solve Performance Related to the speed or resource usage of the project Platform: Raspberry Pi Issues with the Raspberry Pi platform tooling
Projects
None yet
Development

No branches or pull requests

5 participants