Add array methods via C API #402
Will fix the build errors tomorrow.
tests/CMakeLists.txt (outdated):

```cmake
  set(CMAKE_BUILD_TYPE MinSizeRel CACHE STRING "Choose the type of build" FORCE)
endif()

add_definitions(-DPYBIND11_TESTS)
```
Local is better than global:

```cmake
target_compile_definitions(pybind11_tests PRIVATE -DPYBIND11_TESTS)
```
Thanks, good point
+1 for always checking writeable/ndims.
@dean0x7d Yeah, that was my thinking too: always enforce writeable/ndims checks; I just wanted to hear more opinions. By the way: what about bounds checks when indexing?
Yeah, I think bounds checks should be in there for the same reasons as writeable/ndims.
Hmm, I initially thought so too. Maybe we could do something similar; since we're not changing the underlying data:

```cpp
T* ptr = arr.astype<T>(casting::equiv, false).mutable_data();
```
(Force-pushed from 244e58d to 846f9c2, then from 846f9c2 to fa96054.)
The only minor question left is
(Force-pushed from 978b33d to eb25232.)
Regarding ABI breakage: if NumPy ever changes the layout of PyArray, that is an even stronger reason for the current proxy approach, which (although ugly) could be modified to deal with different layouts. The ability to distribute precompiled binaries that work with any version of NumPy is key from a software maintainer's viewpoint; cross-compiling bindings for every pair of Python version and NumPy version would be extremely painful.
include/pybind11/eigen.h (outdated):

```cpp
bool load(handle src, bool) {
    array_t<Scalar> buffer(src, true);
    if (!buffer.check())
     array_t<Scalar> buf(src, true);
```
Wrong indent (1 more space) ;)
Good catch, fixed
This looks pretty great. I added a few comments; let me know when it's ready to merge.
A possible pitfall of
I also think that it would be more natural for the array base class to return
(Force-pushed from 3b0673e to 9ed5217.)
Ok, agreed about that. I've pushed a few changes:
(The NumPy docs section is now so outdated that we may need to rewrite the whole thing soon.)
(Force-pushed from 9ed5217 to aca6bca.)
Great -- this is ready to merge then? (Let's work on the docs separately.)
Yeah, tests look green, should be good to merge now; cheers all for the feedback ;)
Great, merged.
Hm, on data() returning void* by default: this causes a slight annoyance, since one almost always uses the returned pointer together with strides, adding byte offsets to it. As such, it is best if it is unsigned char* / uint8_t*; otherwise you commonly end up writing cast-heavy lines. Most image processing toolkits use uint8 for this reason; it makes it much more convenient to compute byte-level offsets. If this belongs in an issue, so be it, but the conversation about the return type was already here.
For most image file formats, pixel intensities are represented with uint8, so that return type is a reasonable default there (but this is not the case here). I'd much rather have
@nevion Just a few points to add to @wjakob's reply:
I updated the comment and wrapped the code in a code block (the formatting was screwed up as you saw it), so please re-evaluate the original and updated comment.

@aldanor @wjakob No, this is just for the simplicity of byte-offset computations, which is almost always what you're doing with .data on a matrix. When you want to jump to an item (where you make the "right" cast), you first need to go to that address in memory, using the information in strides, multiplying factors for row, column, depth, etc., which are in bytes. This is why you can't have T early in data, even for typed arrays: those strides are allowed to not be evenly divisible by sizeof(T) or packed; there may be padding, or some type of view.

In my own work I've often offered a dataT method to return typed data (T*) to deal with the frequent case of a packed array in the template specialization (array_t), but this is an exception, not the rule. For typed arrays / array_t's, we can use

As for images, I did not mean to propose assuming the typical uint8_t image type. I mean that all image types, regardless of the pixel format/type, even when typed (like array_t), still use uint8_t (or equivalently sized options) as the pointer's base type returned by data methods (or sometimes fields). The user still has to make a decision on what is right, but now we've optimized the extremely common cases of using the data method. Image libraries are good teachers here because they all deal with non-packed arrays.

Alternatively, support the weakly-typed data() concept (that is, containers expose typed pointers, like STL vector), or at least add a
@nevion Regarding your updated example, this is currently doable without any reinterpret casts either: `(const T*) arr.data(row, col)`.

Re packed data, views and paddings: if an array object is C/F-contiguous (which it's not by default now, but it will be; aligned + writeable + c_contiguous as the default flag setting), strides are always evenly divisible by itemsize. So I assume the only edge case is presenting a misleading `data_at(row, col)` vs `*data(row, col)`. If you have an
I kind of don't like that data is combined with data_at in your implementation, but maybe it's just too foreign to my experience. I definitely don't like the data vs mutable_data variants, but maybe that's the best option... I still figure users will often be using data() + some_offset; maybe bytes() is the answer, though I'd still prefer a p() shorthand considering how many times I'm likely to type it.

As much as I like metaprogramming and functional programming, the size() implementation would probably be best written as a plain for loop, same with get_byte_offset. Not just to ensure good speed for such frequently used routines, but because that code will be explosive/distracting/high penalty in debug mode. How often do people debug indexes? :-)

Overall comments: given the nature of writing libraries in C/C++ and calling them from Python, not just for wrapping but for speed, we need to offer no-runtime-check paths and/or disable checks when NDEBUG is defined, though there are cases where we want checks on too. Maybe that comes from operator() vs .at(), maybe _nothrow-like variants. Recurring theme: preserve ndarray's generality and existing behaviors, but do allow things to go faster.

Noting that the index_at implementation is faulty and assumes packed arrays; it should just use dims and itemsize, again as a plain for loop. NumPy's .astype returns a copy; .view lets you change the dtype without a copy, with some checks on itemsize performed. Follow NumPy semantics for the principle of least surprise.

On char as the *ptr type for the array proxy: I'm not sure this should be char. AFAICR, C++11 doesn't guarantee char to be exactly 8 bits, but uint8_t is guaranteed to be, and it is in C++11.

Don't fall into the trap of handling packed data everywhere; the assumption insidiously creeps about. Maybe you have an optimized specialization again (through flags or type) that has that feature and overloads indexing methods, but in general, something like array_t::in and ::out should not assume packed data. Make those optimizations opt-in.
@aldanor On mutable_data vs data, how about just data() plus an assert_writeable / ensure_writeable that throws an exception if the array is not writeable? That way the check is done once, early on, and you don't see mutable vs non-mutable variants in the code below. data() / data() const will handle whether or not the returned type is const.
Working with NumPy data structures is a very limited experience unless you can access the underlying C objects directly; this PR partially addresses that.

- `PyArray_Proxy` and `PyArrayDescr_Proxy` are headers of the corresponding NumPy C objects. Those haven't changed at least since v1.5 and are unlikely to change in the future (although NumPy devs mentioned that it's possible in release 2.0, hence why they changed most macros to functions to avoid potential ABI breakage; if that ever occurs, the only way for us to handle it would be to include the numpy headers at compile time).
- `dtype` now uses the C API: `has_fields`, `kind`, `itemsize`.
- `ndim`, `shape`, `strides`, `itemsize`, `nbytes`, `size`: all matching the existing NumPy C/Python API.
- `data` / `mutable_data`: direct access without having to request a buffer.
- `index_at`, `data_at`: for `array` the pointers are `uint8_t` (the index is counted in bytes); for `array_t` it's the underlying type `T` (the index is in elements). This could serve as a good precursor for eventually implementing tensors with a fixed number of dimensions (e.g. 1-D, 2-D); some checks could then be made at compile time.

Notes / a few open questions:

- Re `*_at` and `mutable_*` methods: a few methods could probably be axed. For instance, `data`/`data_at` could be merged into a single method, etc.
- Is `Py_intptr_t` the same size as `size_t`? On all flat address space platforms (= all sane platforms) this would be true, I think.
- Should `array::data()` return `uint8_t*` or `void*`? (The former makes more sense to me.)
- Re `check_ndim`: should the check be made only when `NDEBUG` is not defined, like a debug assert, or should it always be there? Maybe it could always be done for the general class of arrays, and removed for fixed-ndim arrays if we ever get to implement them.
- Bounds checks in `*_at` and `at` methods? Maybe `!NDEBUG` only; it's not currently checked at all.
- Should the `writeable` flag be checked only when `NDEBUG` is not defined, or always? Unlike bounds checking, I'm thinking it should always be checked (unless we know for sure an array is writeable because its ctor/caster ensures that, see below).
- The reason the `mutable_*` methods exist is that (generally) we only know whether the array is writeable at runtime, even for `array_t`, so we need to ensure we can write to the buffer. It's explicitly stated in the NumPy C API guide that this flag must be respected. If we choose to split array types into a hierarchy as I suggested earlier (e.g. `array_in`, `array_out`) and the writeable flag is enforced for "out" arrays, this check can be bypassed for those types; I'm planning to address this in another PR.