Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFE: Use Python Stable ABI for the bindings #2345

Closed
encukou opened this issue Jan 11, 2023 · 15 comments · Fixed by #2674
Closed

RFE: Use Python Stable ABI for the bindings #2345

encukou opened this issue Jan 11, 2023 · 15 comments · Fixed by #2674
Labels
python Python bindings RFE

Comments

@encukou
Copy link
Contributor

encukou commented Jan 11, 2023

Hello,
RPM is typically built/installed as a system component, and the Python bindings are tied to a particular minor (3.x) version of Python. Using the bindings with other Python versions is difficult.
I'd like to solve the issue for rpm and similar bindings to “system” packages. There's a few chicken-and-egg problems to untangle in building and installation, but I think a good first step would be building the _rpm module with Python's Stable ABI, which means the resulting compiled module would be usable with multiple versions of Python (once copied/linked to the appropriate directory for each Python).

The stable ABI requires using a limited subset of Python's C API, so it'd need some porting. I'm willing to send patches, but I would need reviews and some RPM-specific guidance.

Is this interesting in general?

If so, some details:

  • RPM would either need to drop support for Python 3.1, or carry rather large #ifdef blocks.
  • The implementation could be much less invasive if we disallow loading the _rpm module more than once per process. This would limit rpm's use in applications that embed Python and call Py_Initialize/Py_Finalize multiple times, or use sub-interpreters. AFAIK, those aren't very relevant for rpm. I can go into details if this is an issue.
  • The stable ABI would ideally be used for Python 3.10+, with older versions RPM would continue to build version-specific modules. Using the stable ABI for older versions (down to 3.2) is possible, but the lower the version is the more caveats there are.
@Conan-Kudo
Copy link
Member

In general, this seems interesting to me, though I'm curious about the caveats.

(I also generally wonder why this isn't just being introduced as a Python 4 so we can switch Python to major versioned directories instead of major+minor versioned directories...)

@DemiMarie
Copy link
Contributor

An alternative would be to use CFFI.

@pmatilai pmatilai added RFE python Python bindings labels Jan 12, 2023
@pmatilai
Copy link
Member

I've always considered the Python API instability between minor versions a massive pain in the behind, so I'm glad to hear this is being addressed - I haven't been paying attention to Python world so this is (good) news to me.

So yes, this would be most welcome unless it turns out to somehow unreasonably encumber us. On the outset, the caveats listed here seem non-issues to me:

  • Dropping support for Python 3.1 is not a problem at all, that's an ancient version and we could bump the requirement quite a bit without too many concerns. 3.10 is asking a bit too much perhaps but there's plenty to choose from in between, 3.6 (for rhel-8 support) might be a starting point to think about, but note that rhel-8 support is not a requirement at all.
  • The only embedded use of Python in an rpm context that I know of is in gdb, but then gdb uses the C API and isn't really relevant at all. Doesn't seem like a big deal to me to disallow loading multiple times.
  • Some performance loss is tolerable for the cause, we don't exactly count CPU cycles per instruction in the Python bindings (or librpm) anyhow...

Thanks for the initiative, and looking forward for the patches 😁

@encukou
Copy link
Contributor Author

encukou commented Jan 12, 2023

OK!
A proof of concept is at my fork. The diff is big but most changes are rather mechanical. Read below for context.

How would you prefer me to contribute this? Polish it and submit one big PR, or several smaller ones (with the initial ones not making sense on their own, like dropping Python 3.1)?

(I also generally wonder why this isn't just being introduced as a Python 4 so we can switch Python to major versioned directories instead of major+minor versioned directories...)

The feature is there since 3.2 and has been improved ever since. You can hop on the train at any time, it's not a major version change.
It works by defining the limited API version you want to support, which makes the compiled extension compatible with that version and anything newer (up to Python 4, or a major policy change).

Figuring out a better installation layout is probably something distros will need to drive, and this change should give them an incentive :)

Dropping support for Python 3.1 is not a problem at all, that's an ancient version and we could bump the requirement quite a bit without too many concerns. 3.10 is asking a bit too much perhaps but there's plenty to choose from in between, 3.6 (for rhel-8 support) might be a starting point to think about, but note that rhel-8 support is not a requirement at all.

I'm suggesting the 3.10 minimum for the stable ABI only. For older versions (down to 3.2), RPM would build a version-specific module, just like before. However even those would have some of the caveats below (unless we put in big #ifdef blocks).
That limit can be lowered later. (Maybe soon -- I haven't yet checked details. I know there's a 3.7+ feature I'd rather rely on for the initial implementation, and RPM probably wants to keep 3.6 support.)
FWIW, in this area I'll probably struggle with telling CMake what to do -- I'll ask for help on that later.

Some performance loss is tolerable for the cause, we don't exactly count CPU cycles per instruction in the Python bindings (or librpm) anyhow...

Performance loss comes mostly from macros/inline functions being actual function calls. RPM uses sequence item assignment (PyList_SET_ITEM, PyTuple_SET_ITEM) and reference counting (Py_INCREF, Py_DECREF and similar) which would go from being simple memory access to a C function call.
A slowdown might be measurable but I don't think it'll be noticeable.

I'm curious about the caveats.

Here's the biggest change/caveat by far: Since the stable ABI doesn't expose memory layout, type definitions need to change.

RPM uses static types, e.g. hdr_Type is the type object exposed to Python as <class 'rpm.hdr'>. Since it's C static it doesn't need reference counting & lifetime management.

For stable ABI, we need to use heap types, which are more similar to classes defined in Python. They're created dynamically from a “spec”.
Being heap-allocated (like any other Python object), they need lifetime management. They can also be created more than once per process, and code that uses them needs to find “its” copy. Both of these issues are manageable, but hard to address in an existing codebase.
Hence the hack I'm asking you to accept: the module, and its types, would only be initialized once per process, and heap-allocated types would never be dropped. We'd get a one-time-per-process memory “leak”. Tools like Valgrind might complain (though they probably already do if you're using Python).
It's possible to remove the hack later and I'll be happy to send patches (or even improve Python to allow it if I hit an unexpected snag), but on a longer time scale and possibly only for newer Python versions.

Another difference in heap types is that, like classes defined in Python code, the type object is mutable: for example a user can do rpm.hdr.write = None to disable the write method on RPM headers. Only Python 3.10+ allows disabling this (lower versions would need very messy hacks, or #ifdefing all the changes away).
But I don't think allowing the user to deliberately shoot themselves in the foot is a problem. (It only affects the Python level, you won't get segfaults or memory corruption.)

@pmatilai
Copy link
Member

I'll need to digest this all for a bit, but just a quick note on this:

I know there's a 3.7+ feature I'd rather rely on for the initial implementation, and RPM probably wants to keep 3.6 support.

A feature that makes a big difference makes shooting for 3.7 totally reasonable.

@Conan-Kudo
Copy link
Member

From my perspective, the minimum I care about is Python 3.9, since that's the RHEL 9 minimum. I would like to have Python 3.6 if we can work it to maintain SLE 15 too, but if there's a big QoL enhancement, I'm okay with just Python 3.9+.

@pmatilai
Copy link
Member

"Must build on RHEL latest" is the unwritten rule, so if RHEL 9 has 3.9 then that would be a "hard limit" on how new we can go.

@encukou
Copy link
Contributor Author

encukou commented Jan 12, 2023

I'm only asking to drop Python 3.1.
With the change, a single Stable ABI .so would support Python 3.10+ (or even 3.7+), but for older Pythons you could still build version-specific extensions, like now.
Sorry I didn't make that easy to skim.

@pmatilai
Copy link
Member

pmatilai commented Jan 12, 2023

Yes I understood about dropping 3.1, but I'm saying we really do not need to support all those old versions either. Python 3.7 would be an entirely reasonable minimum version to support.

@pmatilai
Copy link
Member

I had a brief look at the PoC and I don't see anything fundamentally offending in there, more to the contrary (for one, I will not miss those gigantic PyTypeObject initializers). I didn't fine-comb the changes but AIUI there's no functionality change, except for the init-only-once part? In that case, please proceed.

How would you prefer me to contribute this? Polish it and submit one big PR, or several smaller ones (with the initial ones not making sense on their own, like dropping Python 3.1)?

I guess one big PR makes sense because it's all towards that one goal and don't make that much sense otherwise, but I'm ok with either way. I'll want more explanations in the commit messages in any case (of course in a quick PoC you don't bother with that). No need to write novels, just as long as it makes it clear it's working towards the goal of using Python stable ABI and a brief explanation why that particular change is needed for the uninitiated, "because foo is not available in the stable ABI" level being quite adequate.

@encukou
Copy link
Contributor Author

encukou commented Jan 18, 2023

AIUI there's no functionality change, except for the init-only-once part?

Init once, dropping 3.1, and mutable type objects (the PoC uses Py_TPFLAGS_IMMUTABLETYPE but that's 3.10+ only).
And the unknown unknowns – I've helped design a lot of this API but I don't have that much experience actually using it.

I guess one big PR makes sense [...] but I'm ok with either way.

OK, I'll do some kind of middle ground :)
Thanks!

@pmatilai
Copy link
Member

Hey, it's been quiet on this front for quite a while. Is this a case of "too busy with other things", or is there something more to it? Just asking in case somebody else wants to pickup this work.

@encukou
Copy link
Contributor Author

encukou commented Sep 18, 2023

Yup, too busy, mostly with a certain security issue.
I've actually picked this up again just last week (but left for a conference on Thursday). It's now on the top of my TODO list :)

@encukou
Copy link
Contributor Author

encukou commented Sep 21, 2023

The PR is up at #2674.
I'd appreciate help with CMake -- I'm not familiar with it and I don't exactly know what I want from it in terms of versions and options :)

@encukou
Copy link
Contributor Author

encukou commented Nov 9, 2023

the module, and its types, can only be initialized once per process

For reference, the Python version where this limitation can be lifted relatively easily is 3.10 (which adds API for type/module association, like PyType_GetModule). Later versions make it easier still.

I know because I made a working proof of concept.
I will work on the CPython side to enable making this patch more elegant :)

If issues around the limitation (or reference counting, or the Python wrappers in general) come up, please let me know.
If not -- see you in 3+ years, when it's time for _rpm to start using Python 3.10+ API! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Python bindings RFE
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants