Fix undefined memoryview format #2223

kyamagu · 2020-05-22T05:39:40Z

Current py::memoryview constructor accepts py::buffer_info, and use info.format.c_str() as a format to PyBuffer object. However, this char* pointer is used after info goes out of stack, leading to undefined behavior.

This PR attempts to fix this, using static variable inside the constructor. test_memoryview calls the constructor two times with different format and check that the first memoryview is not affected by the second call.

Note the format strings might be replaced by instantiated py::format_string<T> types, but initializer becomes complicated.

include/pybind11/pytypes.h

tests/test_pytypes.py

include/pybind11/pytypes.h

rwgk · 2020-05-23T12:04:49Z

The code you're trying to fix has a far more serious bug: it doesn't keep a Python reference to the object it's providing a view for (buf.obj is not set). One indication is sys.getrefcount(a) before and after you create the memoryview view with your test_memoryview() function. It's unchanged. If you use the built-in memoryview the reference count increases by 1. Another way to show the issue is to use weakref, e.g. I tried this with Python 3.6:

    l = [38, -84, 15]                                                           
    a = array.array('i', l)                                                     
    w = weakref.ref(a)                                                          
    if 0:  # SWITCH                                                                      
      mv = m.test_memoryview(a)                                                 
    else:                                                                       
      mv = memoryview(a)                                                        
    def check(mv):                                                              
      list(mv) == l                                                             
      print('PASS')                                                             
    check(mv)                                                                   
    del a                                                                       
    print('w():',  w())                                                         
    check(mv)

I ran this with the MSAN memory sanitizer.
With SWITCH 0 I got:

  PASS
  w(): array('i', [38, -84, 15])
  PASS

With SWITCH 1:

  PASS
  w(): None
  ==46858==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x5635170d106a in PyLong_FromLong third_party/python_runtime/v3_6/Objects/longobject.c:239:5
    #1 0x563517157602 in unpack_single third_party/python_runtime/v3_6/Objects/memoryobject.c:1701:12
    #2 0x563517156120 in memory_item third_party/python_runtime/v3_6/Objects/memoryobject.c:2249:16
    #3 0x563516e1cddb in PySequence_GetItem third_party/python_runtime/v3_6/Objects/abstract.c:1641:16
    #4 0x56351709a613 in iter_iternext third_party/python_runtime/v3_6/Objects/iterobject.c:63:14
    #5 0x5635170a794c in listextend third_party/python_runtime/v3_6/Objects/listobject.c:854:26
    #6 0x5635170b1a97 in list_init third_party/python_runtime/v3_6/Objects/listobject.c:2313:24
    #7 0x563517264069 in type_call third_party/python_runtime/v3_6/Objects/typeobject.c:915:19
    #8 0x563516e46e2f in _PyObject_FastCallDict third_party/python_runtime/v3_6/Objects/abstract.c:2331:18
    #9 0x563516e49e3a in _PyObject_FastCallKeywords third_party/python_runtime/v3_6/Objects/abstract.c:2496:14
    #10 0x5635177de676 in call_function third_party/python_runtime/v3_6/Python/ceval.c:4861:17
    ... truncated ...

NOTE:
To get around the issue you identified, I locally patched the code to simply leak the format string:

--- a/google3/third_party/pybind11/include/pybind11/pytypes.h
+++ b/google3/third_party/pybind11/include/pybind11/pytypes.h
@@ -1333,7 +1333,8 @@ public:
         static std::vector<Py_ssize_t> py_shape { };
         buf.buf = info.ptr;
         buf.itemsize = info.itemsize;
-        buf.format = const_cast<char *>(info.format.c_str());
+        std::string* format_leaking = new std::string{info.format};
+        buf.format = const_cast<char *>(format_leaking->c_str());
         buf.ndim = (int) info.ndim;
         buf.len = info.size;
         py_strides.clear();

I think the bigger issue needs to be fixed first. There isn't much gain from fixing just the issue with the format handling; maybe the bigger fix resolves it as a sideeffect.

@aldanor Did you have a reason for not using info.view, with Py_INCREF(info.view) before passing it to PyMemoryView_FromBuffer(info.view)?

kyamagu · 2020-05-25T00:51:21Z

Great, the fundamental issue is that the current implementation is not following the expected exporter behavior:
https://docs.python.org/3/c-api/buffer.html#buffer-protocol

I will leave this PR open at this moment

rwgk · 2020-05-25T20:39:21Z

Hi @kyamagu, thanks for pointing out the issue. I'll reach out to @aldanor to discuss how we can fix the issues.

aldanor · 2020-05-25T23:07:44Z

@rwgk Apologies, been a bit busy so only got to replying now. Tbh I don't remember much about this, I think it was needed to make something else work (I was hacking mostly on making structured numpy dtypes work in pybind at the moment) so not enough attention was paid to this.

All of the points above sound about right, all this static variable business in both memoryview and buffer_info should go away (except for maybe static format strings which can be just stack allocated), and the refcounts have to be carefully respected. It would also be nice to add a few refcount tests like the ones outlined above by @rwgk?

wjakob · 2020-05-31T04:56:11Z

related: #1501

wjakob · 2020-05-31T04:58:31Z

This may end up needing a fancier construction such as a weak reference that waits for the memoryview to expire before it free()s the Py_buffer*.

wjakob · 2020-06-30T22:17:20Z

@kyamagu, do you plan to do further work on this PR?

kyamagu · 2020-07-01T01:04:16Z

@wjakob I am not working on this issue at this moment, as the refcount issue seems to need a fix first. Should I go further and revise this PR?

wjakob · 2020-07-01T13:10:55Z

Both of these are really the same issue, as far as I can see.

kyamagu · 2020-07-02T01:09:15Z

Alright, look into this

rwgk · 2020-07-03T05:47:01Z

Hi Kota, thanks a lot for taking on this fix! I already looked at your code a little bit and at first sight it looks good. I'm planning to repeat my test from May 23. It's a holiday weekend here, it may take me a few days, but I'll try asap.

rwgk

Hi Kota, I did quite a bit of testing, including ASAN & MSAN in the Google environment (Python 2.7 & 3.6) and testing with Python 2.7 & 3.7 in a more standard Debian environment, including a low-tech leak check putting your tests into infinite loops and monitoring memory usage with top. Almost everything passes, except unfortunately for m.test_memoryview_frombuffer_new the format comes out as B instead of b in the Google environment (standard Debian is fine). I need to debug. Could you please let me know if you have a suspicion what would be wrong?

tests/test_pytypes.py

tests/test_pytypes.cpp

tests/test_pytypes.py

rwgk · 2020-07-06T00:53:19Z

Hi Kota, below are modified versions of your tests. That code passes in the Google environment with Python 2.7 & 3.7 & ASAN & MSAN, and under standard Debian with 2.7 & 3.6.
I'm still looking at a detail of your implementation.

test_pytypes.cpp:

    m.def("test_memoryview_frombuffer_new", [](bool unsigned_if_true) {         
          static const int16_t si16[] = { 3, 1, 4, 1, 5 };                        
          static const uint16_t ui16[] = { 2, 7, 1, 8 };                          
          auto smv = py::memoryview(py::buffer_info(si16, 5, true));              
          auto umv = py::memoryview(py::buffer_info(ui16, 4, true));              
          return unsigned_if_true ? umv : smv;                                    
    });

test_pytypes.py:

@pytest.mark.parametrize('method, arg, fmt, expected_view', [                   
    (m.test_memoryview_fromobject, b'red', 'B', b'red'),                        
    (m.test_memoryview_frombuffer_reference, b'green', 'B', b'green'),          
    (m.test_memoryview_frombuffer_new, False, 'h', [3, 1, 4, 1, 5]),            
    (m.test_memoryview_frombuffer_new, True, 'H', [2, 7, 1, 8]),                
])                                                                              
def test_memoryview(method, arg, fmt, expected_view):                           
    view = method(arg)                                                          
    assert isinstance(view, memoryview)                                         
    assert view.format == fmt                                                   
    if isinstance(expected_view, bytes) or sys.version_info[0] >= 3:            
        view_as_list = list(view)                                                 
    else:                                                                       
        # Using max to pick non-zero byte (big-endian vs little-endian).          
        view_as_list = [max([ord(c) for c in s]) for s in view]                   
    assert view_as_list == list(expected_view)                                  
                                                                                
                                                                                
@pytest.mark.parametrize('method', [                                            
    m.test_memoryview_fromobject,                                               
    m.test_memoryview_frombuffer_reference,                                     
])                                                                              
def test_memoryview_refcount(method):                                           
    buf = b'\x0a\x0b\x0c\x0d'                                                   
    ref_before = sys.getrefcount(buf)                                           
    view = method(buf)                                                          
    ref_after = sys.getrefcount(buf)                                            
    assert ref_before < ref_after                                               
    assert list(view) == list(buf)

rwgk · 2020-07-06T01:24:04Z

include/pybind11/pytypes.h

+
+inline memoryview::memoryview(const buffer_info& info) {
+    // TODO: two-letter formats are not supported.
+    static const char* formats[] = {


Observations, I have to defer to @wjakob judgement for what to do (if anything):

Here the compiler has to work very hard just to put together the array of char pointers:
static const char* formats[] = {"?", "b", "B", "h", "H", "i", "I", "q", "Q", "f", "d", "g"};
That's basically the same as the hard-coded string in detail/common.h:
static constexpr const char c = "?bBhHiIqQfdg"[...];

It's only a subset of the format strings accepted by the current Python implementation:

https://github.com/python/cpython/blob/e51dd9dad6590bf3a940723fbbaaf4f64a3c9228/Objects/memoryobject.c#L1150

Very unfortunately, this authoritative Python function is hidden away (static).
My own choice would be to copy the function into pybind11 (probably with modifications), with a long comment to explain why. I'd also carefully check every released Python version to see if the list evolved over time, to track that in our copy, if necessary.

Alternatively, if we decide we only want the subset of format strings, I'd do a slight refactor of detail/common.h to make the list available without the very involved detour through the C++ template engine.

kyamagu · 2020-07-06T01:26:54Z

@rwgk Oops, just working on a different part now, will take your comment above into

rwgk · 2020-07-06T17:10:35Z

Thanks Kota! Just today we started a new way to maintain pybind11, making team-based decisions. I'll consult with the other maintainers regarding my "Observations" above, to see what they think.

wjakob · 2020-07-06T19:28:09Z

I took a brief look: the array of standard formats is a bit on the verbose side — the more serious problem is that it will not work for more complex dtypes that are very important to some users. My suggestion (to be discussed/tested if that actually works) would be something like the following:

strdup the format string
create a capsule that will free the string, when it is freed
create a tuple with 2 entries: the first is the capsule, the second is the owner (Py_buffer::obj) of the provided buffer or None
specify the tuple as obj of the newly created Py_buffer.

The owner reference will then both keep the capsule and the actual owner of the data region alive until deallocation.

kyamagu · 2020-07-07T05:41:17Z

We can skip the case when the owner is not nullptr, because when owner is not null, we can use PyMemoryView_FromObject instead of PyMemoryView_FromBuffer to let the reference object manage the format string. The problem arise only when there is no owner. Maybe no tuple is necessary.

FYI, CPython internally defines ManagedBuffer Object (_PyManagedBufferObject ) to manage lifetime of format string when needed (PyMemoryView_GetContiguous implementation).
https://github.com/python/cpython/blob/master/Objects/memoryobject.c

Alternatively, changing the type of buffer_info:: format from std::string to const char* would resolve the problem, but that breaks the API backward compatibility and also ask the caller to manage the lifetime of the format string.

kyamagu · 2020-07-13T05:07:44Z

@rwgk Hi, the new commit adds tests for failure cases and an additional fix for PyPI tests.

There is still an error for docs generation; the current version of breathe + sphinx + doxygen toolchain cannot correctly handle inlined memoryview::frombuffer definition and produces warnings. It is still possible to build docs by removing -W option from sphinx, but Travis CI fails due to this warning. Do you know any workaround?

rwgk · 2020-07-13T05:11:12Z

Thanks Kota! I'll prioritize retesting with MSAN tomorrow. I'll also alert the other maintainers to the doc gen issue.

…

On Sun, Jul 12, 2020 at 10:07 PM Kota Yamaguchi ***@***.***> wrote: @rwgk <https://github.com/rwgk> Hi, the new commit adds tests for failure cases and an additional fix for PyPI tests. There is still an error for docs generation; the current version of breathe + sphinx + doxygen toolchain cannot correctly handle inlined memoryview::frombuffer definition and produces warnings. It is still possible to build docs by removing -W option from sphinx, but Travis CI fails due to this warning. Do you know any workaround? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2223 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAFUZAGIZVNRAV37CQB6NYTR3KJC3ANCNFSM4NHP6G7A> .

wjakob · 2020-07-13T19:38:36Z

Thanks Kota, I would try to see whether #if !defined(DOXYGEN) helps to get the CI to pass. Of course that means that the function won’t be documented.

Speaking of that, we don’t have any documentation for this class. Especially given the tricky template definitions, would you be able to say a few words about typical use cases somewhere in this section? -> https://pybind11.readthedocs.io/en/stable/advanced/pycpp/index.html

rwgk · 2020-07-13T23:42:54Z

Hi Kota, I re-did full MSAN/ASAN testing and didn't find any problems :-)

I made a few small suggested changes. The complete diffs as tested are here (is there a better way to share that with you)?

https://drive.google.com/file/d/1M2iABgJ-9htyaNXj2iN4wLmQ_wqnSl9v/view?usp=sharing

Please feel free to pick-and-choose.

I experimented with moving the memoryview::frombuffer outside the class scope, as a potential workaround for the doc generation issue, and it builds & runs fine. I don't know if it actually helps with the doc generation. Even if it does, the code is a little more noisy. I don't know what's the higher value, less noisy code, or complete documentation.

I also experimented allowing ndim == 0. MSAN/ASAN don't report issues with Python 2 & 3. The Python 2 behavior is a bit weird (see test), but Python 3 is completely sane. Since that's the future my vote is to allow ndim == 0. An additional thought is that a shape element could be 0, which boils down to the same situation, a pointer to a 0-size array. We're also not guarding against that, which I think is good.

For test_memoryview, I undid my change turning args into arg, to avoid the py::none unused argument for the new function you added.

kyamagu · 2020-07-14T01:08:29Z

Thanks, @rwgk 's suggestion makes sense, ndim=0 seems a valid case. Will prepare another commit.

is there a better way to share that with you?

Maybe you can directly edit in this PR? I have never used this functionality before, though.
https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork

@wjakob Alright, I'll check the doc

rwgk · 2020-07-14T01:12:06Z

Maybe you can directly edit in this PR?

Ah, that's interesting, thanks, I'll keep that in mind for next time. I think we're nearly done polishing this PR!

kyamagu · 2020-07-14T03:25:02Z

Ok, I've figured out a workaround to docs generation using a macro to skip lines: https://www.doxygen.nl/manual/faq.html#faq_code

In addition, I've added const variants of frombuffer and frommemory. Let me know if that is too much.

rwgk

Very nice! The const overloads look useful to me.

docs/advanced/pycpp/numpy.rst

include/pybind11/pytypes.h

rwgk

Perfect! I'll ask Yannick and Wenzel for their approvals.

kyamagu · 2020-07-15T02:27:22Z

Oops, py27 doesn't throw when nullptr given, will change the test

wjakob · 2020-07-15T08:24:21Z

This looks great, you have my blessing. One last minor request: in converting from SnakeCase to underscore_case, you didn't add underscores, can we have those? E.g.: FromBuffer -> from_buffer, etc.

rwgk · 2020-07-15T15:52:01Z

@kyamagu Merged! Thanks a lot for the great work!

Still applied patches: * py::dir * {object,handle}::dyn_cast Moreover: * apply pybind/pybind11#2223 to (attempt) to fix memoryview formats * use a global hash table of formats to keep format buffers alive. This is a bit better than the previous strdup hacky solutions, as memory leaks might not explode with the number of memoryviews (as long as they have the same format)

kyamagu added 4 commits May 22, 2020 14:32

Fix undefined memoryview format

4aa3de5

Add missing <algorithm> header

de4baa0

Add workaround for py27 array compatibility

2c53550

Workaround py27 memoryview behavior

9ee1514

rwgk requested changes May 22, 2020

View reviewed changes

include/pybind11/pytypes.h Outdated Show resolved Hide resolved

include/pybind11/pytypes.h Outdated Show resolved Hide resolved

include/pybind11/pytypes.h Outdated Show resolved Hide resolved

tests/test_pytypes.py Outdated Show resolved Hide resolved

aldanor reviewed May 23, 2020

View reviewed changes

include/pybind11/pytypes.h Outdated Show resolved Hide resolved

kyamagu mentioned this pull request Jun 10, 2020

Broken memoryview format kyamagu/skia-python#62

Closed

kyamagu added 2 commits July 2, 2020 15:22

Fix memoryview constructor from buffer_info

a239716

Workaround PyMemoryView_FromMemory availability in py27

5a3842f

kyamagu requested a review from wjakob July 3, 2020 03:53

rwgk reviewed Jul 5, 2020

View reviewed changes

tests/test_pytypes.py Outdated Show resolved Hide resolved

tests/test_pytypes.py Outdated Show resolved Hide resolved

tests/test_pytypes.cpp Outdated Show resolved Hide resolved

tests/test_pytypes.py Outdated Show resolved Hide resolved

rwgk reviewed Jul 6, 2020

View reviewed changes

Fix up memoryview tests

7db007d

Update memoryview test from buffer to check signedness

72e4394

kyamagu added 4 commits July 14, 2020 11:14

Allow ndim=0 memoryview and documentation fixup

db1a2f4

Use void* to align to frombuffer method signature

4c7cfef

Add const variants of frombuffer and frommemory

640b45c

Add memory view section in doc

8d508ee

rwgk approved these changes Jul 14, 2020

View reviewed changes

docs/advanced/pycpp/numpy.rst Outdated Show resolved Hide resolved

docs/advanced/pycpp/numpy.rst Outdated Show resolved Hide resolved

include/pybind11/pytypes.h Outdated Show resolved Hide resolved

kyamagu added 2 commits July 15, 2020 10:06

Fix docs

841ba25

Add test for null buffer

41cd25a

rwgk approved these changes Jul 15, 2020

View reviewed changes

Workaround py27 nullptr behavior in test

2719c94

Rename frombuffer to from_buffer

7f64f8f

wjakob approved these changes Jul 15, 2020

View reviewed changes

YannickJadoul removed their request for review July 15, 2020 11:15

rwgk merged commit e248869 into pybind:master Jul 15, 2020

YannickJadoul mentioned this pull request Sep 13, 2020

Use-after-free in object::memoryview for the format description #1501

Closed

YannickJadoul mentioned this pull request Sep 22, 2020

Fix buffer_info for ctypes buffers (pybind#2502) #2503

Merged

aguinet mentioned this pull request Sep 23, 2020

Update pybind11 to 2.5.0 aguinet/dragonffi#40

Merged

YannickJadoul mentioned this pull request Nov 15, 2020

Add memoryview constructors for PyMemoryView_FromMemory #1998

Closed

rwgk mentioned this pull request Feb 10, 2023

FWD pybind11 google/pybind11clif#2223

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix undefined memoryview format #2223

Fix undefined memoryview format #2223

kyamagu commented May 22, 2020

rwgk commented May 23, 2020

kyamagu commented May 25, 2020

rwgk commented May 25, 2020

aldanor commented May 25, 2020 •

edited

Loading

wjakob commented May 31, 2020

wjakob commented May 31, 2020

wjakob commented Jun 30, 2020

kyamagu commented Jul 1, 2020

wjakob commented Jul 1, 2020

kyamagu commented Jul 2, 2020

rwgk commented Jul 3, 2020

rwgk left a comment

rwgk commented Jul 6, 2020

rwgk Jul 6, 2020

kyamagu commented Jul 6, 2020

rwgk commented Jul 6, 2020

wjakob commented Jul 6, 2020

kyamagu commented Jul 7, 2020

kyamagu commented Jul 13, 2020

rwgk commented Jul 13, 2020 via email

wjakob commented Jul 13, 2020

rwgk commented Jul 13, 2020

kyamagu commented Jul 14, 2020

rwgk commented Jul 14, 2020

kyamagu commented Jul 14, 2020

rwgk left a comment

rwgk left a comment

kyamagu commented Jul 15, 2020

wjakob commented Jul 15, 2020

rwgk commented Jul 15, 2020

Fix undefined memoryview format #2223

Fix undefined memoryview format #2223

Conversation

kyamagu commented May 22, 2020

rwgk commented May 23, 2020

kyamagu commented May 25, 2020

rwgk commented May 25, 2020

aldanor commented May 25, 2020 • edited Loading

wjakob commented May 31, 2020

wjakob commented May 31, 2020

wjakob commented Jun 30, 2020

kyamagu commented Jul 1, 2020

wjakob commented Jul 1, 2020

kyamagu commented Jul 2, 2020

rwgk commented Jul 3, 2020

rwgk left a comment

Choose a reason for hiding this comment

rwgk commented Jul 6, 2020

rwgk Jul 6, 2020

Choose a reason for hiding this comment

kyamagu commented Jul 6, 2020

rwgk commented Jul 6, 2020

wjakob commented Jul 6, 2020

kyamagu commented Jul 7, 2020

kyamagu commented Jul 13, 2020

rwgk commented Jul 13, 2020 via email

wjakob commented Jul 13, 2020

rwgk commented Jul 13, 2020

kyamagu commented Jul 14, 2020

rwgk commented Jul 14, 2020

kyamagu commented Jul 14, 2020

rwgk left a comment

Choose a reason for hiding this comment

rwgk left a comment

Choose a reason for hiding this comment

kyamagu commented Jul 15, 2020

wjakob commented Jul 15, 2020

rwgk commented Jul 15, 2020

aldanor commented May 25, 2020 •

edited

Loading