
Changes in NumPy buffered iteration #500

@seberg

I changed the NumPy buffering setup a bit to simplify the code and make it faster in general.

This probably has little or no effect on numexpr (the core loop size may occasionally be smaller, but you don't use GROWINNER, which is where einsum noticed a change: a huge core reduced the summation precision).
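
For context, here is a minimal sketch (not from the issue; the array shape and buffersize are arbitrary) of what GROWINNER changes at the Python nditer level: with the "grow_inner" flag the iterator may hand the inner loop chunks much larger than the requested buffersize. Exact sizes depend on the NumPy version.

import numpy as np

a = np.arange(1_000_000, dtype=np.float64)  # contiguous, so no copy into buffers is needed

fixed = np.nditer(a, flags=["buffered", "external_loop"], buffersize=8192)
grown = np.nditer(a, flags=["buffered", "external_loop", "grow_inner"], buffersize=8192)

print(max(chunk.size for chunk in fixed))  # capped at the requested buffersize
print(max(chunk.size for chunk in grown))  # may be far larger, possibly the whole array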

I noticed this "fixed-size" optimization, which assumes the inner loop has a fixed size until the end of the iteration:

/*
 * First do all the blocks with a compile-time fixed size.
 * This makes a big difference (30-50% on some tests).
 */
block_size = *size_ptr;
while (block_size == BLOCK_SIZE1) {
#define REDUCTION_INNER_LOOP
#define BLOCK_SIZE BLOCK_SIZE1
#include "interp_body.cpp"
#undef BLOCK_SIZE
#undef REDUCTION_INNER_LOOP
    iternext(iter);
    block_size = *size_ptr;
}

/* Then finish off the rest */
if (block_size > 0) do {
#define REDUCTION_INNER_LOOP
#define BLOCK_SIZE block_size
#include "interp_body.cpp"
#undef BLOCK_SIZE
#undef REDUCTION_INNER_LOOP
} while (iternext(iter));

This fast path may no longer be hit, but only for non-contiguous, non-reduction use-cases (or reductions to a scalar), because:

  • NumPy may now shrink the buffersize a bit to align better with the iteration shape.
  • NumPy will more often have intermittently smaller buffers/chunks that are then grown to full size again (previously this was common only in reduction operations).

For contiguous ops without a reduction (or with a reduction along all axes), you still always get the requested buffersize (until the end of the iteration).
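
As an illustration (a sketch, not from the issue; shapes and buffersize are arbitrary), the block sizes numexpr's loop sees can be observed directly by iterating a buffered nditer and recording the chunk sizes; whether the non-contiguous case still yields the requested buffersize depends on the NumPy version.

import numpy as np

def chunk_sizes(a, buffersize=8192):
    # Collect the inner-loop sizes a buffered external_loop iteration hands out,
    # mirroring the block_size values the loop quoted above would see.
    it = np.nditer(a, flags=["buffered", "external_loop"], buffersize=buffersize)
    return [chunk.size for chunk in it]

contiguous = np.ones(100_000)
noncontig = np.ones((400, 500))[:, ::2]  # strided view, not contiguous

print(set(chunk_sizes(contiguous)[:-1]))  # every block but the last is the full buffersize
print(set(chunk_sizes(noncontig)))        # may no longer all equal the requested buffersize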

So in the end, my hope is that the fast path still kicks in for the most relevant use-cases. But if you or someone else notices a performance regression, I can take a closer look.
