Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE REQUEST] add support to the "out=" argument #592

Open
zachmoshe opened this issue Apr 10, 2023 · 18 comments
Open

[FEATURE REQUEST] add support to the "out=" argument #592

zachmoshe opened this issue Apr 10, 2023 · 18 comments
Assignees
Labels
enhancement New feature or request

Comments

@zachmoshe
Copy link

Many numpy functions support the "out=" argument, allowing you to specify an existing array that results will be written into. This saves time and memory allocations especially if the calculation is used repeatedly.

I couldn't find support for this in ulab.numpy, any chance to add this in a future release? I believe it could be extremely useful with microcontrollers.

@zachmoshe zachmoshe added the enhancement New feature or request label Apr 10, 2023
@v923z
Copy link
Owner

v923z commented Apr 10, 2023

I see your point, and this could definitely be added. Since it's going to eat into flash space, I would probably implement this as an option.

One issue I would like you to clarify is, how alignment of dtypes is to be handled. E.g., what should happen, if you want to calculate the std of an int8 array, and you specify int8 as the output?

Also, I would like to ask you to specify a handful of methods where you want to see it first. Implementing this for everything is a relatively big undertaking, and I would like to have just a couple methods first, where we can sort out all issues.

@zachmoshe
Copy link
Author

zachmoshe commented Apr 10, 2023

A good point.. I think the best is to stick to numpy's behavior.
Here for example, it's documented as:

out: ndarray, optional
Alternative output array in which to place the result. It must have the same shape as the expected output, but the type of the output values will be cast if necessary.

Looks like they try casting to the actual output type. I guess if that complicates the implementation, it's also ok to raise an exception if they don't match.

As for methods - my current use case is mixing several audio channels (3 channels are read from files into a 3x4096 array and then mixed using np.mean() before sent to I2S), but if I try to think more globally then this is probably my prioritization:
top tier: min, max, sum, mean, std, dot
second tier: arg{max,min,sort}, median, minimum, maximum, sort (looks like sort creates an internal copy, so not sure if it's that helpful here)

@dpgeorge
Copy link
Contributor

I would second this as a good feature to have for embedded use. Being able to perform operations in-place would be a big bonus (especially if it follows how numpy does it).

@jimmo
Copy link
Contributor

jimmo commented Apr 11, 2023

On the flash-space-cost point, a configuration option that makes this the only way to do it would be a great potential tradeoff -- i.e. it would be quite useful to have a "no allocations inside ulab" mode where I always have to use out=. (There are some cases where you probably need to allocate temporary arrays for calculation, but in terms of GC and fragmentation pressure, if those temporary allocs are immediately paired in the correct order with a gc_free in the same function then the heap will be in the same state as you started. (gc_free puts the search point for the next alloc back to the free'd address).

@v923z
Copy link
Owner

v923z commented Apr 11, 2023

out: ndarray, optional
Alternative output array in which to place the result. It must have the same shape as the expected output, but the type of the output values will be cast if necessary.

Looks like they try casting to the actual output type. I guess if that complicates the implementation, it's also ok to raise an exception if they don't match.

Casting is expensive, because either you unroll all possible combinations in-place for each method (requires more flash), or you implement that as a function, which then has the function call overhead, and is significantly slower.

@v923z
Copy link
Owner

v923z commented Apr 11, 2023

On the flash-space-cost point, a configuration option that makes this the only way to do it would be a great potential tradeoff -- i.e. it would be quite useful to have a "no allocations inside ulab" mode where I always have to use out=.

Wouldn't this trip pretty much everyone?

I see two options here:

  1. Either there is allocation, but it's internal, so RAM would be free'd, when the function returns, as you explained above. Then you could use that space as a scratch pad, and do the casting only at the very end, so the cast_the_output_dtype function would be called only once, hence the expenses could be kept low.
  2. Or it's done as in the case of the binary operators, i.e., all possibly input and output dtype combinations are in a macro, like here
    #if NDARRAY_HAS_BINARY_OP_EQUAL
    if(op == MP_BINARY_OP_EQUAL) {
    if(lhs->dtype == NDARRAY_UINT8) {
    if(rhs->dtype == NDARRAY_UINT8) {
    EQUALITY_LOOP(results, array, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, ==);
    } else if(rhs->dtype == NDARRAY_INT8) {
    EQUALITY_LOOP(results, array, uint8_t, int8_t, larray, lstrides, rarray, rstrides, ==);

But I think this would be a great addition. I'm not dragging my feet, I'm just trying to explore the possibilities here.

@dpgeorge
Copy link
Contributor

I think to start with it would be OK to not support casting and require that the input array (passed to out=) be of the correct type.

@v923z
Copy link
Owner

v923z commented Apr 12, 2023

This would be relatively easy to implement, so if Damien's proposal is acceptable to everyone, I would just go ahead with that.

@v923z v923z self-assigned this Apr 12, 2023
@zachmoshe
Copy link
Author

Just wanted to add np.roll as another candidate for out=

@mcskatkat
Copy link

mcskatkat commented Jul 20, 2023

On the flash-space-cost point, a configuration option that makes this the only way to do it would be a great potential tradeoff -- i.e. it would be quite useful to have a "no allocations inside ulab" mode where I always have to use out=.

I second @jimmo 's suggestion of having a configuration option for "no !implicit! allocations inside ulab".

This is very useful for real-time system programming. Otherwise, we have to tread carefully to avoid unintentional allocation that may trigger the garbage collector, which could be disastrous when there are short deadlines to meet.

However, I would prefer this to be a runtime module attribute rather than a compile time configuration.

To implement this without code size penalties, I suggest something of this sort:

  • only one piece of code in ulab allocates a new ndarray (perhaps this is already the case)
  • this allocation point is reachable via two C entrypoints, explicit_alloc and implicit_alloc.
  • explicit_alloc will only be called from Python entrypoints dedicated to allocating new arrays, such as empty, zeros, ones, full etc.
  • All other code that makes allocations (e.g. binary addition) calls implicit_alloc unless, of course, out= is provided
  • implicit_alloc raises MemoryError if called when the "no implicit allocation" module attribute is set

Optionally there could be a second, separate attribute that also prohibits explicit_alloc, for cases when people want to protect against programmer errors or "untrusted" code.

@jornamon
Copy link

jornamon commented Mar 5, 2024

I leave the implementation details to the experts, but I'd like to reignite the discussion on introducing an out argument for operations in ulab, which I believe is crucial for embedded systems. Even the simplest implementation, under strict usage conditions (correct shape/type, no casting, restricted operations), could hugely enhance ulab's utility, especially in DSP applications and tight loops with sampled data.

We seek performance when using ulab, and indeed, ulab's performance is impressive. However, when running in tight loops with sampled data, the continuous memory allocation leads to eventual garbage collections. This is problematic in itself, but it becomes a critical issue in real-time applications, particularly on platforms with significant amounts of PSRAM, where garbage collection can take a considerable amount of time.

A real-world example: an ESP32-S3 with 8MB of PSRAM, which is not an unusual platform but rather an increasingly popular configuration, can take more than 400ms to perform a garbage collection in a worst-case scenario. Other platforms might experience a smaller penalty, but at the cost of more frequent garbage collections. Under these conditions, you can negate all performance gains of ulab, rendering it almost unusable.

I'm implementing a Kalman filter for a project and planned to use ulab for the matrix operations (already utilizing numpy in a CPython implementation). But, when porting it to Micropython/ulab, I realized I can't use it due to unpredictable memory allocation. I can't afford to have a garbage collection in the middle of the filter loop. This is just my personal experience, but I'm sure many others face similar challenges.

By the way, before finding this thread, while considering writing a small native C module to implement the basic operations in-place, I stumbled upon your excellent work at https://micropython-usermod.readthedocs.io. So thank you @v923z, because, one way or another, I will be utilizing your amazing contributions.

@v923z
Copy link
Owner

v923z commented Mar 5, 2024

There is no question about the benefit of adding the out keyword arguments, the implementation is only a question of time.

The fact of the matter is, people request all kinds of features (some of them quite exotic, and sometimes I wonder, whether the request was genuine), but when I ask for at least a reasonable test script to go with the implementation, then they tend to disappear.

Having said this, out is really on the top of my list, and as soon as I finish the PID implementation, I'll see to it. Universal functions already support it, as does the spectrogram.

Here is a list from one of the comments above

  • min
  • max
  • sum
  • mean
  • std
  • dot
  • median
  • minimum
  • maximum

Beyond these, what would you need for your filters?

@v923z
Copy link
Owner

v923z commented Mar 5, 2024

By the way, before finding this thread, while considering writing a small native C module to implement the basic operations in-place,

The easiest solution in this case is to add the function in the utils module of ulab. Then it'll be automatically compiled, and you'll have access to all internal functions. If you can make the filter somewhat generic, and if you think that it could be useful for others, we could add that to ulab here.

@jornamon
Copy link

jornamon commented Mar 5, 2024

I have also noticed that there are some random requests here and there on this repo, so I'm glad to hear that this one is already in your plans!!

To implment a standard Kalman filter you use matrix addition, subtraction, multiplication, transposition and inversion. There are several variants that may requiere more operations, but I don't remember from the top of my head.

The easiest solution in this case is to add the function in the utils module of ulab. Then it'll be automatically compiled, and you'll have access to all internal functions. If you can make the filter somewhat generic, and if you think that it could be useful for others, we could add that to ulab here.

Oh! I read about this when I first found ulab, but I had forgotten about it. Thanks for the reminder! A Kalman filter is generic and valid for a lot of use cases, so I will try to implement it this way so you can check it out and see if it's worth adding it to ulab.

@v923z
Copy link
Owner

v923z commented Mar 5, 2024

To implment a standard Kalman filter you use matrix addition, subtraction, multiplication, transposition and inversion.

What would take the out keyword argument in this case? If you use the binary operators, they'll return a new object, so it seems to me that you might want to add functions that do matrix addition etc., and accept the out keyword argument.

@jornamon
Copy link

jornamon commented Mar 6, 2024

Yes, something similar to numpy.add. I think there's no other way to do it. I'm not sure if, given that the operation itself is already defined in the binary operator, the code might be reused with just small adaptations.

@v923z
Copy link
Owner

v923z commented Mar 6, 2024

In principle, it's doable: in

if(lhs->dtype == NDARRAY_UINT8) {
if(rhs->dtype == NDARRAY_UINT8) {
results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT8);
BINARY_LOOP(results, uint8_t, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, +);
} else if(rhs->dtype == NDARRAY_INT8) {
we have to assign results to the matrix that out points to.

However, you mentioned matrix inversion, which will produce a new matrix, no matter what. If you want to save on that, then we would need to add https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.inv.html, which allows you to overwrite the data.

@jornamon
Copy link

jornamon commented Mar 6, 2024

I didn't notice the different approach in scipy.linalg.inv, I wonder why they chose to allow overwriting the input but not writing to an existing matrix. Maybe it's not usual for algorithms to keep the original matrix after it's inverted, who knows. Using overwrite_ a would maintain compatibility with scipy, but implementing it with an out argument would be more consistent with other functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants