Conversation

aldanor
Member

@aldanor aldanor commented Sep 9, 2016

Working with NumPy data structures is a very limited experience unless you can access the underlying C objects directly; this PR partially addresses that.

PyArray_Proxy and PyArrayDescr_Proxy are headers of the corresponding NumPy C objects. Those haven't changed at least since v1.5 and are unlikely to change in the future (although NumPy devs mentioned that it's possible in release 2.0 -- which is why they changed most macros to functions, to avoid potential ABI breakage; if that ever occurs, the only way for us to handle it would be to include NumPy headers at compile time).

  • A few methods on dtype now use the C API -- has_fields, kind, itemsize
  • Add various methods to array like ndim, shape, strides, itemsize, nbytes, size, all matching the existing NumPy C/Python API
  • Add data / mutable_data -- direct access without having to request a buffer
  • Add a few indexing routines (index_at, data_at) -- for array the pointers are uint8_t (the index is counted in bytes), while for array_t it's the underlying type T (the index is in elements). This could serve as a good precursor for eventually implementing tensors with a fixed number of dimensions (e.g. 1-D, 2-D) -- some checks could then be made at compile time.

Notes / a few open questions:

  • If the API is considered too redundant (e.g. all the *_at and mutable_* method variants), a few methods could probably be axed. For instance, data/data_at could be merged into a single method, etc.
  • Can we assume that Py_intptr_t is the same size as size_t? On all flat-address-space platforms (= all sane platforms) this should be true, I think.
  • Should array::data() return uint8_t* or void*? (the former makes more sense to me)
  • In check_ndim, should the check only be made when NDEBUG is not defined, like a debug assert, or should it always be there? Maybe it could always be done for the general array class, and removed for fixed-ndim arrays if we ever get to implement them.
  • Should out-of-bounds access also be checked the same way in the *_at and at methods? Maybe only when NDEBUG is not defined. It's not currently checked at all.
  • Should the writeable flag only be checked when NDEBUG is not defined, or always? Unlike bounds checking, I'm thinking it should always be checked (unless we know for sure an array is writeable because its ctor/caster ensures that, see below).
  • The reason the mutable_* methods exist is that (generally) we only know whether the array is writeable at runtime, even for array_t, so we need to ensure we can write to the buffer. The NumPy C API guide explicitly states that this flag must be respected. If we choose to split array types into a hierarchy as I suggested earlier (e.g. array_in, array_out) and the writeable flag is enforced for "out" arrays, this check can be bypassed for those types -- I'm planning to address this in another PR.

@aldanor
Member Author

aldanor commented Sep 9, 2016

// Will fix the build errors tomorrow.

set(CMAKE_BUILD_TYPE MinSizeRel CACHE STRING "Choose the type of build" FORCE)
endif()

add_definitions(-DPYBIND11_TESTS)
Member

Local is better than global:

target_compile_definitions(pybind11_tests PRIVATE -DPYBIND11_TESTS)

Member Author

Thanks, good point

@dean0x7d
Member

dean0x7d commented Sep 9, 2016

+1 for always checking writeable and ndims. The more the interface resembles its Python counterpart (rather than the C API), the more natural it is to expect the same Python-like behavior. Translating Python's runtime checks/exceptions into C++ undefined behavior would be pretty surprising, in my opinion. The runtime checks aren't a big performance penalty for casual usage (where the checks are most important). Users looking to squeeze every last drop of performance will just take the data pointer once, and the subsequent calculations will be much more expensive than the initial check.

As for array::data(), returning void* seems more natural to me since it makes no assumption about data type. Although, it may also make sense to have a T* array::data<T>() function which would do the cast internally and also check the data type is correct or throw.

@aldanor
Member Author

aldanor commented Sep 9, 2016

@dean0x7d Yea, that was my thinking too, always enforce writeable/ndims checks, just wanted to hear more opinions.

As for the T* array::data<T>(), I was planning to add array_t<T> array::astype<T>(casting policy) which is already implemented in another branch and which may be a bit more flexible, but your suggestion could work as well -- it could check that dtype::of<T>() is type-equivalent to array::dtype() (e.g. should be safe for reinterpret_cast) and return a data pointer in-place.

By the way: what about bound checks when indexing?

@dean0x7d
Member

dean0x7d commented Sep 9, 2016

Yeah, I think bound checks should be in there for the same reasons as writeable/ndims.

astype<T>() sounds great. My suggestion for data<T>() was mostly to keep reinterpret_cast out of user code, but perhaps that's already covered by astype<T>() + no cast policy?

@aldanor
Member Author

aldanor commented Sep 9, 2016

Hmm, I initially thought .astype() should always return a copy (which is also the default behaviour in NumPy). However, in the Python API, np.array.astype also accepts an optional copy argument:

copy : bool, optional
    By default, astype always returns a newly allocated array. If this
    is set to false, and the `dtype`, `order`, and `subok`
    requirements are satisfied, the input array is returned instead
    of a copy.

Maybe we could do something similar; since we're not changing order (F/C) or subok, it's actually sufficient to just set the casting policy to equiv. NumPy just increfs the original array and returns you that; we'd also have to wrap it in array_t. Then you could potentially do:

T* ptr = arr.astype<T>(casting::equiv, false).mutable_data();

@aldanor aldanor force-pushed the feature/numpy-c-api branch from 244e58d to 846f9c2 Compare September 10, 2016 00:50
@aldanor
Member Author

aldanor commented Sep 10, 2016

@dean0x7d

  • moved static_assert to the top level
  • axed data_at() and mutable_data_at() (now part of data() and mutable_data())
  • calling data() should never trigger any checks
  • ndim check is now always done
  • writeable flag check is now always done
  • index_at() and its callers now always do full bounds check
  • fixed all travis/appveyor warnings/errors, rebased on latest master

@aldanor aldanor force-pushed the feature/numpy-c-api branch from 846f9c2 to fa96054 Compare September 10, 2016 00:51
@aldanor
Member Author

aldanor commented Sep 10, 2016

The only minor question left is void* vs uint8_t* for array::data(). I initially went with uint8_t* mainly because with void* pointers, expressions like arr.data() + arr.index_at(i, j) would cause a -Wpointer-arith warning. It's still suitable for tasks like serialization, but it's not meant to be the subject of a reinterpret_cast (with array_t and a potential .astype() in the near future).

@aldanor aldanor force-pushed the feature/numpy-c-api branch 2 times, most recently from 978b33d to eb25232 Compare September 10, 2016 01:12
@wjakob
Member

wjakob commented Sep 10, 2016

Regarding ABI breakage: If NumPy ever changes the layout of PyArray, this is an even stronger reason for the current proxy approach which (although ugly) could be modified to deal with different layouts. The ability to distribute precompiled binaries that work with any version of NumPy is key from a software maintainer viewpoint. Cross-compiling bindings for any pair of Python version & NumPy version would be extremely painful.

bool load(handle src, bool) {
array_t<Scalar> buffer(src, true);
if (!buffer.check())
array_t<Scalar> buf(src, true);
Member

Wrong indent (1 more space) ;)

Member Author

Good catch, fixed

@wjakob
Member

wjakob commented Sep 10, 2016

This looks pretty great. I added a few comments, let me know when it's ready to merge.

@dean0x7d
Member

@aldanor

The only minor question left is void* vs uint8_t* for array::data(). I initially went with uint8_t* mainly because with void* pointers, expressions like arr.data() + arr.index_at(i, j) would cause a -Wpointer-arith warning. It's still suitable for tasks like serialization, but it's not meant to be the subject of a reinterpret_cast (with array_t and a potential .astype() in the near future).

A possible pitfall of uint8_t* is that arr.data() + i would work without warning and yield the i-th byte, but at first glance most would expect to get the i-th element. This is likely to be a common user mistake for 1D arrays. Using void* would result in a warning or error in such a case, which would prompt users to look at the API and see that arr.data(i) is the proper way to do it. A silent arr.data() + i may cause many headaches. The use case of arr.data() + arr.index_at(i, j) should also already be covered by arr.data(i, j).

@wjakob
Member

wjakob commented Sep 10, 2016

I also think that it would be more natural for the array base class to return void* pointers.

@aldanor aldanor force-pushed the feature/numpy-c-api branch 2 times, most recently from 3b0673e to 9ed5217 Compare September 10, 2016 15:39
@aldanor
Member Author

aldanor commented Sep 10, 2016

Ok, agreed about void*; it also lets us get rid of reinterpret casts and replace them with static casts.

I've pushed a few changes:

  • array::index_at() (and array_t::index_at()) now always return the offset in elements, even for array, because we know the itemsize
  • added array::offset_at() (and array_t::offset_at()), which always returns the offset in bytes
  • array_t::itemsize() is now constexpr (sizeof(T))
  • array_t::index_at() makes use of the constexpr itemsize for the division
  • changed uint8_t* to void* for array::data() and array::mutable_data(), removed all reinterpret casts from the code
  • out of bounds error message is now the same as in NumPy (e.g. index 4 is out of bounds for axis 1 with size 3)
  • dimension check error message is now the same as in NumPy (e.g. too many indices for an array: 3 (ndim = 2))
  • added a few tests, refactored tests a bit to avoid copy/paste, rebased on latest master

// NumPy docs section is now so outdated, may need to rewrite the whole thing soon

@aldanor aldanor force-pushed the feature/numpy-c-api branch from 9ed5217 to aca6bca Compare September 10, 2016 15:42
@wjakob
Member

wjakob commented Sep 10, 2016

Great -- this is ready to merge then? (Let's work on the docs separately.)

@aldanor
Member Author

aldanor commented Sep 10, 2016

Yea, tests look green, should be good to merge now, cheers all for feedback ;)

@wjakob
Member

wjakob commented Sep 10, 2016

great, merged.

@wjakob wjakob merged commit f217c04 into pybind:master Sep 10, 2016
@aldanor aldanor mentioned this pull request Sep 12, 2016
@nevion
Contributor

nevion commented Sep 12, 2016

Hm, on data() returning void* by default, this causes a slight annoyance - one almost always uses the returned pointer with strides, adding in byte offsets. As such, it is best if it is unsigned char* / uint8_t*.

Otherwise you will commonly write lines like reinterpret_cast<T *>(reinterpret_cast<uint8_t *>(arr.data()) + row * arr.strides[0] + col * arr.strides[1]) (along with constness mutations).

Most image processing toolkits use uint8 for this reason, it makes it much more convenient to compute the byte level offsets.

If this belongs in an issue, so be it but the conversation was here already for the return type.

@wjakob
Member

wjakob commented Sep 12, 2016

For most image file formats, pixel intensities are represented with uint8, so that return type is a reasonable default (but this is not the case here). I'd much rather have void* to force the user to make a decision on what is right. Plus, there are alternative methods with built-in offset computation.

@aldanor
Member Author

aldanor commented Sep 12, 2016

@nevion Just a few points to add to @wjakob's reply:

  • you don't need reinterpret cast from void*, static cast of (T*) will suffice

  • I'm not sure I understood what exactly you meant by

    reinterpret_cast(reinterpret_cast(arr.data()) + arr.strides[0])

    but if it's something like reinterpret_cast<const T*>(reinterpret_cast<const uint8_t*>(arr.data()) + arr.strides(0)), there are other ways to do this, for example:

    (const T*) arr.data(1)

    or if arr is array_t<T>, then just

    arr.data(1)

    As was already noted above, one of the goals is to have a flexible astype() so you could also do

    arr.astype<T>(/* optional casting policy */).data(1)
  • There's also array::offset_at which always returns byte offset

@nevion
Contributor

nevion commented Sep 13, 2016

I updated and wrapped it in a code block (the formatting was screwed up, as you saw), so please reevaluate the original and updated comment.

@aldanor @wjakob no - this is just for the simplicity of byte-offset computations, which is almost always what you're doing with .data on a matrix. When you want to jump to an item and make the "right" cast, you first need to go to that address in memory, using the information in strides (multiplying factors for row, column, depth, etc.), which are in bytes. This is why you can't have T early in data, even for typed arrays: those strides are allowed to not be evenly divisible by sizeof(T) or packed - there may be padding, or some type of view.

In my own work I've often offered a data<T>() to return typed data (T *), to deal with the frequent case of a packed array in the template specialization (array_t), but this is an exception, not the rule.

For typed arrays/array_ts, we can use T operator()(int... indices) or uint8_t* data_at(int... indices) to give an item or a memory address at the start of a region, but neither should take the place of uint8_t* data().

As for images, I did not mean to propose assuming the typical uint8_t image type; I meant that all image types, regardless of the pixel format/type, even when typed (like array_t), still use uint8_t (or equivalent-sized options) as the base type of the pointer returned by data methods (or sometimes fields). The user still has to decide what is right, but now we've optimized the extremely common cases of using the data method. Image libraries are good teachers because they all deal with non-packed arrays.

Alternatively, if you support the weakly-typed data() concept (that is, containers exposing typed pointers, like STL vector), at least add a uint8_t* p() method that performs the same behavior - though users will neglect strides if you do offer T* data(): with typed returns, common bugs open up on non-packed arrays, as people are tempted by the T* return type and stop reading there. You can see this behavior in the wild in many libraries, as people just assume packed access, not realizing, for example, that line-bytes (bytes in a C-major row) had to be divisible by 4 and in some configurations were alloc'd higher with 2 bytes of padding. PIL/Qt::Image/Gtk Image and NumPy's use in so many cases of image analysis are one reason why this can be seen and is of concern. So this still provides the safety of void* in most cases, but lets us avoid those ugly casts or multi-statement expressions all the time, since pointer arithmetic on void* is poorly defined, or at least does not work with byte offsets conveniently.

@aldanor
Member Author

aldanor commented Sep 13, 2016

@nevion Regarding your updated example, this is currently doable without any reinterpret casts either:

(const T*) arr.data(row, col)

Re: packed data, views and padding: if an array object is C/F-contiguous (which it's not by default now, but it will be - aligned + writeable + c_contiguous as the default flag setting), strides are always evenly divisible by the itemsize. Moreover, for array_t<T> (and later for .astype<T>()) it is always required that dtype.itemsize() == sizeof(T), in which case it really doesn't matter whether T is packed or not (see the test examples in test_numpy_dtypes.cpp).

So I assume the only edge case is presenting a misleading T* pointer to the user when the array is not C/F-contiguous (if they cast it to T* from void*, the blame's on them) - maybe it's a weird view that you don't want to copy but have to iterate over. But in this case any of these would work with an array_t object, whatever the strides are, assuming the dtype is aligned:

data_at(row, col) 
*data(row, col)

If you have an array_t and still want a uint8_t* to be used in a tight loop where you don't want bounds checking etc., then yea, you will have to do a reinterpret cast, or, as an alternative, a static (uint8_t*) arr.array::data(). Idk, if it's such a big deal here, we could potentially provide something like uint8_t* array::bytes().

@nevion
Contributor

nevion commented Sep 13, 2016

I kind of don't like that data is combined with data_at in your implementation, but maybe it's just too foreign to my experience. Definitely don't like the data vs mutable_data variants, but maybe that's the best option... I still figure users will often be using data() + some_offset often enough... maybe bytes is the answer, though I'd still prefer the p() shorthand considering how many times I think I'm likely to type it.

As much as I like metaprogramming and functional programming, the size() implementation would probably be best to for-loop out, same with get_byte_offset. Not just to ensure good speed for such frequently used routines, but because that code will be explosive/distracting/high-penalty in debug mode. How often do people debug indexing? :-)

Overall comments - given the nature of writing libraries in C/C++ and calling them from Python not just for wrapping but for speed, we need to offer no-runtime-check paths and/or disable checks when NDEBUG is defined, but there are cases where we want checks kept on too. Maybe that comes from operator() vs .at, maybe _nothrow-like variants. Recurring theme: preserve ndarray's generality and existing behaviors - but allow things to go faster.

Noting that the index_at implementation is faulty and assumes packed arrays; it should just use dims and itemsize - a for loop only, again.

numpy's .astype returns a copy; .view lets you change the dtype without a copy, with some checks on itemsize performed. Follow numpy semantics for the principle of least surprise.

On char as the *ptr type for the array proxy - I'm not sure this should be char; AFAICR C++11 doesn't specify char to be 1 byte, but uint8_t is guaranteed to be, and is in C++11.

Don't fall into the trap of assuming packed data everywhere - the assumption insidiously creeps about. Maybe you have an optimized specialization again (through flags or type) that has that feature and overloads the indexing methods, but in general, something like array_t::in and ::out should not assume packed data. Make those optimizations opt-in.

@nevion
Contributor

nevion commented Sep 13, 2016

@aldanor on mutable_data vs data: how about just data() plus an assert_writable/ensure_writable that throws an exception if the array isn't writable? That way it's done once early on, and then you don't see mutable vs non-mutable variants in the code below. data()/data() const will handle whether the type is const.
