Description
I changed the NumPy buffering setup a bit to simplify the code and make it faster in general.
This probably has little or no effect on numexpr (the core loop size may occasionally be smaller, but you don't use `GROWINNER`, which is the flag whose change einsum noticed: a huge core reduced the summation precision).
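For context, the difference `GROWINNER` makes can be observed from Python with `np.nditer` (a rough probe, not numexpr's code; the flag names are the real `np.nditer` ones, but the exact chunk sizes you see depend on the NumPy version):

```python
# Rough probe of the GROWINNER behaviour via np.nditer (illustrative
# only, not numexpr's code; exact sizes depend on the NumPy version).
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)  # contiguous, no cast needed

def chunk_sizes(flags, buffersize=8192):
    it = np.nditer(a, flags=flags, buffersize=buffersize)
    return sorted({chunk.size for chunk in it})

# Without growinner the inner loop is capped at the buffer size, so you
# typically see the requested size plus one smaller tail chunk.
print(chunk_sizes(['external_loop', 'buffered']))
# With growinner, a contiguous operand that needs no buffering can be
# handed out as one huge core.
print(chunk_sizes(['external_loop', 'buffered', 'growinner']))
```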
I noticed this "fixed-size" optimization, which assumes the inner loop has a fixed size until the end of the iteration:
`numexpr/numexpr/interpreter.cpp`, lines 638 to 660 at commit `2378606`
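I won't reproduce that code here, but the general shape of such a fast path looks roughly like this (a hypothetical Python sketch with made-up names, not the actual interpreter loop):

```python
# Hypothetical sketch of a fixed-size fast path (names are made up):
# specialize on one block size and assume every chunk except possibly
# the last one has exactly that size.
import numpy as np

BLOCK_SIZE = 4096

def evaluate(chunks, fast_kernel, generic_kernel):
    for chunk in chunks:
        if len(chunk) == BLOCK_SIZE:
            # Loop bounds known up front: the kernel can be specialized
            # (unrolled/vectorized) for exactly BLOCK_SIZE elements.
            fast_kernel(chunk)
        else:
            # Any other size (the final tail, or a chunk NumPy shrank
            # mid-iteration) falls back to the generic path.
            generic_kernel(chunk)

# Toy run: ten full blocks and one tail hit the two paths as expected.
chunks = [np.zeros(BLOCK_SIZE)] * 10 + [np.zeros(123)]
hits = {'fast': 0, 'generic': 0}
evaluate(chunks,
         lambda c: hits.__setitem__('fast', hits['fast'] + 1),
         lambda c: hits.__setitem__('generic', hits['generic'] + 1))
print(hits)  # {'fast': 10, 'generic': 1}
```

With the buffering changes, the variable-size branch can now also be taken mid-iteration, not just for the final tail.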
This fast path may no longer be hit, but only for non-contiguous, non-reduction use cases (reductions to a scalar excepted), because:
- NumPy may now shrink the buffersize a bit to align better with the iteration shape.
- NumPy will more often hand out intermittently smaller buffers/chunks that are then grown back to full size. (Previously this was common only in reduction operations.)
For contiguous ops without a reduction (or with a reduction along all axes), you still always get the requested buffersize (until the end of the iteration).
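If you want to check what a given NumPy version actually hands out, a small probe like the following works (the `int32` to `float64` cast is only there to force real buffering; the exact sizes printed are version-dependent):

```python
# Compare buffered chunk sizes for a contiguous operand versus a
# strided view (rough probe; output depends on the NumPy version).
import numpy as np

def observed_sizes(op, buffersize=4096):
    # The int32 -> float64 cast forces the iterator to actually buffer.
    it = np.nditer(op, flags=['external_loop', 'buffered'],
                   op_dtypes=[np.float64], buffersize=buffersize)
    return sorted({chunk.size for chunk in it})

base = np.arange(100_003, dtype=np.int32)
print(observed_sizes(base))       # contiguous: requested size plus a tail
print(observed_sizes(base[::2]))  # strided: sizes may shrink or vary
```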
So in the end, my hope is that the fast path still kicks in for the most relevant use cases. But if you or someone else notices a performance regression, I can take a closer look.