-
-
Notifications
You must be signed in to change notification settings - Fork 31.7k
Problems with Py_buffer management in memoryobject.c (and elsewhere?) #54390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
There are a number of places in memoryobject.c where get_shape0 is used without the return value being checked. If it fails, this leads to hanging exceptions and crashes. |
Yes, there are probably many holes like this. I've done what I could to make the simple cases (builtin types) to work, but the spec is rotten from the start. Blame the numpy people, sorry. |
As far as I know, PEP-3118 serves its purpose of allowing extension modules like numpy and PIL to share data without needing to copy it around all the time. It's just that memoryview wasn't really part of that purpose (since all the affected third party libraries have their own objects for looking at memory), and it shows :P Our near term goal should be to get memoryview in a good place for handling 1D contiguous data and throw exceptions (rather than crashing) for everything else. I think Antoine has achieved the former, but it sounds like there is still some work to do on the latter. |
The deeper issue is that there are no ownership or lifetime rules for (and I'm not even talking of the issue of making slices of memoryviews, |
Read the "Releasing the buffer" section in PEP-3118 again. Unless I'm misunderstanding you completely, the rules you're asking for are already in place: those fields are entirely the responsibility of the exporting object, and it needs to ensure they remain valid until the buffer is released. Now, it may be that we haven't *implemented* this correctly or consistently in the builtin objects, in which case we should do something about it. But the rules are there. |
In practice, though, we copy Py_buffer structs around and there's no way |
The culprit wrt copying Py_buffer structs seems mainly to be dup_buffer, which is called in memory_getbuf. This seems unnecessary in the case where there's an underlying object and it has returned the view thru its own tp_as_buffer. The underlying object at that point is solely responsible for releasing the buffer, so memory_getbuf shouldn't mess with it at all. In the case where there is no underlying object (mainly thru PyMemoryView_FromBuffer), it probably should allocate any memory in the view in such a way that it can be freed in memory_releasebuf when the view->obj is NULL. |
While on the subject, the docs for PyMemoryView_FromBuffer state that the resulting memoryview takes ownership of the Py_buffer struct and is responsible for freeing any associated memory. It does not do so, which is not surprising. The absence of a standard for allocation makes it absolutely impossible for it to do so safely. |
Hmm, if we're ever creating a new copy of a Py_buffer without calling GetBuffer again on the original object, then *that* is a bug (exactly comparable to copying a PyObject pointer without incrementing the reference count - it's OK if you can guarantee the lifecycle of your borrowed pointer is shorter than the lifecycle of the original, but disallowed otherwise). This still doesn't sound like a problem with the spec to me, it sounds like a problem with the implementation strategy that was originally used when integrating the spec into the interpeter core (which I readily agree received far less attention than the spec itself did). It really sounds like we should be creating a _Py_ManagedBuffer internal object, with each instance containing a PyObject* and a Py_buffer instance. We don't need to alter PEP-3118 to do that - such a change is entirely consumer side, so the objects providing the buffers don't need to care how the lifecycle of the Py_buffer struct is managed. When we need another reference to the buffer, we can then just increment the refcount of the _Py_ManagedBuffer instance rather than having to call GetBuffer again on the original object. The contents of Py_buffer objects that have been passed to GetBuffer really shouldn't be getting modified at all - for those cases, we should maintain *two* Py_buffer structs, one (unmodified) with the original results of the GetBuffer call, and a second owned by the caller (for manipulation). |
It doesn't help that neither PEP-3118 nor the Py_buffer docs mention the "obj" member of the Py_buffer struct that refers back to the original object providing the buffer - that's fairly fundamental to understanding how PyBuffer_Release and PyMemoryView_FromBuffer can work even in theory. (Given that, an additional _PyManagedBuffer object shouldn't be necessary - MemoryView just needs to call ReleaseBuffer and GetBuffer at the appropriate times) |
So the overall to-do list here is sounding like:
|
It might be instructive to look at how NumPy itself manages sharing of ndarray data and ownership of the corresponding structs. I meant to find time to look at this over the break. |
BTW, I agree that it looks as though significant changes might be needed. I wonder whether it would make sense to exclude the Py_buffer struct fro m the Stable ABI PEP for now. |
Agreed. I'll put something on python-dev about that, so MvL sees it. |
More direct - added MvL to nosy list. Martin, we would like to exclude Py_buffer from the stable ABI for Python 3.2, until we have a chance to thrash out the missing pieces of the documentation for 3.3. I *think* it is a documentation problem, but until we're certain, it seems safer to leave it out. |
From Lenard Lindstrom in bpo-10001 (closed as dupe of this issue): The ~Py_buffer.obj field is undocumented. Yet memoryview, that acts as a wrapper, includes the field in gc traversal. Also, if the field is not initialized by bf_getbuffer its value can be indeterminate. For memoryview the gc can crash (see attached C demo program). |
Added Travis to nosy list - even if he doesn't have time to finish this off himself, hopefully he can point us in the right direction. |
Nick, it sounds as though you have an idea of how you think things should be working here---perhaps you can help me out. I'd really like to understand what the implementation should look like from the POV of a 3rd party module that defines some object exporting the buffer interface. Here's a specific scenario I'd like to understand: module foo defines a type Foo that implements the buffer protocol. For simplicity, suppose it's exporting 1-dim buffers. When I do:
what's the sequence of getbuffer and releasebuffer calls that foo_object should expect to see? Q1. Does foo get 2 requests to getbuffer (and 2 to releasebuffer), or just one each? I'm assuming at least that getbuffer and releasebuffer calls should be paired. Q2. For each pair of getbuffer/releasebuffer calls, should the 'view' parameter passed into releasebuffer be identical to that provided to getbuffer? Or is it acceptable for the actual Py_buffer* pointers to be distinct, but the pointed-to Py_buffers to be exact copies. (The existence of the smalltable field complicates the notion of an exact copy a little bit, but I think that's a detail that can be ignored for these questions.) |
PEP-3118 makes it clear that the underlying object should see *two* pairs of calls to the buffer methods: Even if we ignore the undocumented "obj" field, the target object needs to ensure the exported buffer remains valid as long as any memory views are referencing it. The only way to do that is to treat GetBuffer/ReleaseBuffer as the moral equivalent of INCREF/DECREF. However, I believe the current memoryview implementation does the wrong thing and only calls them once, and then duplicates the Py_buffer structures without ever going back to the original objects (that opinion was based on a quick scan of the code a while back, but it would fit with the uncomplimentary sentiments Antoine has expressed in trying to get all this to exhibit some kind of consistency) For point 2, it must be the same pointer. When the PEP says "the same", I agree it could be taken as ambiguous, but the later reference to the exporter managing a linked-list of exported views makes it clear that identity is what matters. As far as I can see, some of things in the PEP were found to be a PITA in practice (such as every consumer of the API having to implement the equivalent of the "base" attribute in the original memoryview design), so Travis changed them. Unfortunately, those changes never made it back into the protocol documentation, leading to the current confusion. |
Actually, and unless I made a mistake, it does call them twice.
The common idiom (including in code not written by me :-)) is currently Also, we have the C API function PyMemoryView_FromBuffer() which |
Fine with me. Attached is a patch; it would be good if someone could |
It's OK if the Py_buffer is on the stack - it's just a unique identifier for the exporter to use as a key, not something the exporter controls the lifecycle of (the latter is true only for the pointers *inside* the struct, such as buf, shape, strides, etc). PyMemoryView_FromBuffer should be calling PyObject_Getbuffer on the view->obj member (it's one of the things that embedding the reference allows, just as it allowed removal of the separate obj argument from the PyObject_ReleaseBuffer signature). That way the source object knows there is now a *second* Py_buffer struct kicking around, and can decide whether to re-use the same internal pointers or create new ones. |
Martin's patch for the PEP-384 adjustments looks good to me. A note in the PEP that we've deliberately excluded this on a temporary basis due to the implementation/documentation mismatch and expect to have it back in for 3.3 might be helpful, too. |
One use case is anonymous memory, i.e. view->obj == NULL. |
[Nick]
Okay, thanks. Next point of confusion: does that mean that in the example I gave, the object 'n' (the memoryview slice) needs to know about *two* distinct Py_buffer structs---one to pass back to releasebuffer and one to store its own information? It seems to me that these can't be the same, since (a) the Py_buffer for the slice will have different shape and stride from that returned by getbuffer, (b) what's passed back to releasebuffer should be identical to what came from getbuffer, so we can't just modify the shape and stride information in place. |
Removal of the buffer interface from the ABI has been committed as r87805, r87806, r87805, r87808, and r87809. |
I've added some comments on Rietveld. This time, I pasted the email Regarding the use of _testbuffer in the docs: I agree that it's strange, So I'm not sure what to do. Of course I'll take out the examples if you |
Radical suggestion: make it public as collections.simple_ndarray? (prefixing with simple_ to be explicit that this is not even *close* to being the all-singing, all-dancing NumPy.ndarray) |
I've been busy fixing the memoryview.h issues first. Could you take a http://hg.python.org/features/pep-3118/file/f020716704d4/Include/memoryobject.h Changes: a) Make all functions and the two buffer access macros part b) Make the managed buffer API definitely private. c) Make PyBUF_MAX_NDIM=64 part of the official buffer API. I think it might be OK to defer the decision about Py_MEMORYVIEW_C etc., |
Nick Coghlan <report@bugs.python.org> wrote:
Heh, that's a nice one. :) While I think that the code in _testbuffer.c is sound and very well tested, Examples: _testbuffer.ndarray has the questionable capability of changing Then, similar to NumPy's ndarray, _testbuffer.ndarray constructs arrays NumPy's ndarray is also the low level API. I think most people use >>> a = array([[1,2], [3,4]], dtype="L")
>>> a
array([[1, 2],
[3, 4]], dtype=uint64) So it would take some effort to polish the API. Meanwhile, to eliminate the use of _testbuffer in the documentation, I think I've just added complete support for displaying multi-dimensional lists and The code is quite succinct and follows exactly the same pattern as Then, in the documentation we can use cast() and memoryview.tolist(), |
Well, the question is: does it have a point? Is this ndarray useful You might answer that we already have array.array but ISTM that |
Antoine Pitrou <report@bugs.python.org> wrote:
I think it would be mainly educational. It's easier to understand The other issue is that's it's slightly strange to have a fully Anyway, right now I don't know myself if I'm for or against it. :) |
Hmm, I don't think buffer access macros should be part of the limited
My question is whether there is any point in making these flags. Does (and of course the memoryview object is becoming more and more like |
Antoine Pitrou <report@bugs.python.org> wrote:
I thought the whole Py_buffer API was only temporarily removed from the http://bugs.python.org/issue10181#msg125462 For instance, here the macros are not excluded: http://hg.python.org/cpython/file/3292cc222d5c/Include/memoryobject.h Since the issues seem resolved in general, I thought it was time to As for the two macros specifically, I know Pauli was using them: http://bugs.python.org/issue10181#msg139775
The flags are primarily for the memoryview itself to avoid repeated calls Pauli: If you are still following the issue, do you think having access |
I'm talking about the memoryview access macros. It's like The limited API promises binary (ABI) compatibility accross releases.
It might be a mistake, then.
Well, Pauli might use them, but it just means his code isn't compatible
But why would 3rd-party code use PyMemoryView_GET_BUFFER instead of, for |
I'm with Antoine here - we want to be *very* conservative with what we expose through the limited API. Due to the ABI compatibility promise, anything exposed that way is very hard to change. Keeping things out of the limited API isn't really an issue - it just means that anyone that wants to use them needs to rebuild their extensions for each new version of the C API (just as they do now). Addings things to the stable ABI later is easy though - that's why we had Martin take all the Py_buffer related stuff out of the initial version of the limited API. |
Also, +1 to replacing _testbuffer with .cast() to demonstrate the multi-dimensional support. Without an actual multi-dimensional array object in the standard library, demonstrating those aspects is always going to be a little tricky. However, it's worth doing so that the likes of NumPy and PIL don't need to explain those interactions. |
Antoine Pitrou <report@bugs.python.org> wrote:
You're right: I presumed that the macros were excluded temporarily
That's a good question. It looks to me that the macro was present as I think 3rd-party code uses the macros mainly because they are Most use cases I can think of would also involve having access to Suppose you have an initialized bytes object that you want to
Of course this could also be exposed as a function, e.g.: /* stealing a reference to bytes */
PyMemoryView_FromBytesAndInfo(PyObject *bytes, Py_buffer *info); So let's make the flags private. What do you prefer?
Also, I'll add a note to the docs that PyMemoryView_GET_BUFFER can |
I think we should minimize the number of reference-stealing functions.
I don't really mind, whatever you think is best :) |
I've uploaded a new patch that should address the remaining issues: o In the documentation _testbuffer has been replaced by o I restored the state of the limited API. If we want o Flags of the memoryview object are private. Additionally, because NumPy allows non-aligned array accesses, On x86/amd64 gcc is smart enough to produce almost exactly the same On other platforms the situation might be worse, but I don't have |
Latest version looks good to me - I vote for landing it whenever you're ready :) |
Thanks, Nick. I'll try to get it done this weekend. I've uploaded Misc/NEWS and Doc/whatsnew/3.3.rst (my apologies to Antoine I wasn't sure whether to put the "whatsnew" entry into the section |
PEP section makes sense - I plan to mark PEP-3118 as Final once you commit this (or you can do that yourself, for that matter). |
New changeset 3f9b3b6f7ff0 by Stefan Krah in branch 'default':
|
Nick Coghlan <report@bugs.python.org> wrote:
Array features are complete except for multi-dimensional indexing and slicing. The overall array handling scheme is sound. Life would be a bit easier Then there is an open issue (bpo-3132) for expanding the struct module syntax, In another issue (bpo-13072) the question came up whether the proposed 'u' and 'w' We need to decide what to do about 2.7 and 3.2. It's pretty difficult by Another problem for 2.7 and 3.2 is that the 'B' format would still need to |
Aw, I guess I was too optimistic in thinking this was the last gap before we could declare it Final... +1 on creating a new feature request for NumPy style multi-dimensional indexing and slicing support on memoryview (I'm assuming that's what you meant). As far as your last question goes, I'm not sure we *can* salvage this in 2.7/3.2. We do have the option of throwing our hands up in the air and saying "Sorry, we couldn't figure out how to fix it without completely rewriting it. Take a look at 3.3 if you want to see how it's *supposed* to work." I hesitate to suggest this (since I'm not volunteering to do the work) but perhaps it would be worth packaging up the 3.3. memoryview and publishing it on PyPI for the benefit of 3.2 and 2.7? May also be worth bringing up on python-dev to see if anyone else has any bright ideas. Myself, I'm not seeing any real choices beyond "won't fix", "backport 3.3 version and remove/disable the new features", "backport 3.3 version, new features and all" and "publish a full featured version on PyPI". |
I'm busy adding the C-API changes to the docs. Regarding the stable ABI: The general mood was to *keep* the removal of the buffer interface In that case this removal (especially of the Py_buffer struct) should |
New changeset 9d3c7dd55c7f by Stefan Krah in branch 'default': |
For 3.2, there's no removal to document - we asked MvL to drop the buffer APIs from PEP-384, so they've never been part of the stable ABI. From the PEP: "The buffer interface (type Py_buffer, type slots bf_getbuffer and bf_releasebuffer, etc) has been omitted from the ABI, since the stability of the Py_buffer structure is not clear at this time. Inclusion in the ABI can be considered in future releases." I agree we shouldn't be hasty in making even the updated buffer interface implementation officially part of the stable ABI. Better to leave it out for 3.3, get some real world feedback on the latest version, then commit to stability for 3.4+. |
Great, that's exactly what I was looking for. - As far as I can see the issue |
It occurs to me that one thing that *could* be backported to early versions are some of the documentation improvements on how to correctly handle the lifecycle of fields in Py_buffer. That gets messy though because memoryview doesn't behave nicely in those versions, so we'd be violating our own guidelines. Perhaps the relevant sections should be updated with a reference saying that the semantics have been cleaned up in 3.3, and if in doubt, try to follow the 3.3 documentation? |
New changeset 373f6cdc6d08 by Stefan Krah in branch 'default': |
I currently use Python 2.7, and I'd like to make use of memoryview. Specifically, I work on BITS (http://biosbits.org/), which runs Python in ring 0 as part of GRUB, and I'd like to use memoryview to give Python access to data in physical memory. I ran into several of the problems documented here when trying to do so. I'd really love to see a backport of this fixed version into 2.7. |
More specifically, the primary functionality that I'd like to use exists in 3.3 as PyMemoryView_FromMemory. I've tried to approximate that function using the available API in 2.7, but that led me here. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: