[FEATURE REQUEST] add support to the "out=" argument #592

zachmoshe · 2023-04-10T06:54:20Z

Many numpy functions support the "out=" argument, allowing you to specify an existing array that results will be written into. This saves time and memory allocations especially if the calculation is used repeatedly.

I couldn't find support for this in ulab.numpy, any chance to add this in a future release? I believe it could be extremely useful with microcontrollers.

The text was updated successfully, but these errors were encountered:

v923z · 2023-04-10T07:00:11Z

I see your point, and this could definitely be added. Since it's going to eat into flash space, I would probably implement this as an option.

One issue I would like you to clarify is, how alignment of dtypes is to be handled. E.g., what should happen, if you want to calculate the std of an int8 array, and you specify int8 as the output?

Also, I would like to ask you to specify a handful of methods where you want to see it first. Implementing this for everything is a relatively big undertaking, and I would like to have just a couple methods first, where we can sort out all issues.

zachmoshe · 2023-04-10T07:44:18Z

A good point.. I think the best is to stick to numpy's behavior.
Here for example, it's documented as:

out: ndarray, optional
Alternative output array in which to place the result. It must have the same shape as the expected output, but the type of the output values will be cast if necessary.

Looks like they try casting to the actual output type. I guess if that complicates the implementation, it's also ok to raise an exception if they don't match.

As for methods - my current use case is mixing several audio channels (3 channels are read from files into a 3x4096 array and then mixed using np.mean() before sent to I2S), but if I try to think more globally then this is probably my prioritization:
top tier: min, max, sum, mean, std, dot
second tier: arg{max,min,sort}, median, minimum, maximum, sort (looks like sort creates an internal copy, so not sure if it's that helpful here)

dpgeorge · 2023-04-11T01:41:32Z

I would second this as a good feature to have for embedded use. Being able to perform operations in-place would be a big bonus (especially if it follows how numpy does it).

jimmo · 2023-04-11T02:24:35Z

On the flash-space-cost point, a configuration option that makes this the only way to do it would be a great potential tradeoff -- i.e. it would be quite useful to have a "no allocations inside ulab" mode where I always have to use out=. (There are some cases where you probably need to allocate temporary arrays for calculation, but in terms of GC and fragmentation pressure, if those temporary allocs are immediately paired in the correct order with a gc_free in the same function then the heap will be in the same state as you started. (gc_free puts the search point for the next alloc back to the free'd address).

v923z · 2023-04-11T15:16:10Z

out: ndarray, optional
Alternative output array in which to place the result. It must have the same shape as the expected output, but the type of the output values will be cast if necessary.
Looks like they try casting to the actual output type. I guess if that complicates the implementation, it's also ok to raise an exception if they don't match.

Casting is expensive, because either you unroll all possible combinations in-place for each method (requires more flash), or you implement that as a function, which then has the function call overhead, and is significantly slower.

v923z · 2023-04-11T15:26:51Z

On the flash-space-cost point, a configuration option that makes this the only way to do it would be a great potential tradeoff -- i.e. it would be quite useful to have a "no allocations inside ulab" mode where I always have to use out=.

Wouldn't this trip pretty much everyone?

I see two options here:

Either there is allocation, but it's internal, so RAM would be free'd, when the function returns, as you explained above. Then you could use that space as a scratch pad, and do the casting only at the very end, so the cast_the_output_dtype function would be called only once, hence the expenses could be kept low.

Or it's done as in the case of the binary operators, i.e., all possibly input and output dtype combinations are in a macro, like here

micropython-ulab/code/ndarray_operators.c

Lines 56 to 62 in 8585407

    
           #if NDARRAY_HAS_BINARY_OP_EQUAL 
        
           if(op == MP_BINARY_OP_EQUAL) { 
        
               if(lhs->dtype == NDARRAY_UINT8) { 
        
                   if(rhs->dtype == NDARRAY_UINT8) { 
        
                       EQUALITY_LOOP(results, array, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, ==); 
        
                   } else if(rhs->dtype == NDARRAY_INT8) { 
        
                       EQUALITY_LOOP(results, array, uint8_t, int8_t, larray, lstrides, rarray, rstrides, ==);

But I think this would be a great addition. I'm not dragging my feet, I'm just trying to explore the possibilities here.

dpgeorge · 2023-04-12T02:54:47Z

I think to start with it would be OK to not support casting and require that the input array (passed to out=) be of the correct type.

v923z · 2023-04-12T19:53:21Z

This would be relatively easy to implement, so if Damien's proposal is acceptable to everyone, I would just go ahead with that.

zachmoshe · 2023-04-23T20:23:57Z

Just wanted to add np.roll as another candidate for out=

mcskatkat · 2023-07-20T13:47:24Z

On the flash-space-cost point, a configuration option that makes this the only way to do it would be a great potential tradeoff -- i.e. it would be quite useful to have a "no allocations inside ulab" mode where I always have to use out=.

I second @jimmo 's suggestion of having a configuration option for "no !implicit! allocations inside ulab".

This is very useful for real-time system programming. Otherwise, we have to tread carefully to avoid unintentional allocation that may trigger the garbage collector, which could be disastrous when there are short deadlines to meet.

However, I would prefer this to be a runtime module attribute rather than a compile time configuration.

To implement this without code size penalties, I suggest something of this sort:

only one piece of code in ulab allocates a new ndarray (perhaps this is already the case)
this allocation point is reachable via two C entrypoints, explicit_alloc and implicit_alloc.
explicit_alloc will only be called from Python entrypoints dedicated to allocating new arrays, such as empty, zeros, ones, full etc.
All other code that makes allocations (e.g. binary addition) calls implicit_alloc unless, of course, out= is provided
implicit_alloc raises MemoryError if called when the "no implicit allocation" module attribute is set

Optionally there could be a second, separate attribute that also prohibits explicit_alloc, for cases when people want to protect against programmer errors or "untrusted" code.

jornamon · 2024-03-05T17:57:20Z

I leave the implementation details to the experts, but I'd like to reignite the discussion on introducing an out argument for operations in ulab, which I believe is crucial for embedded systems. Even the simplest implementation, under strict usage conditions (correct shape/type, no casting, restricted operations), could hugely enhance ulab's utility, especially in DSP applications and tight loops with sampled data.

We seek performance when using ulab, and indeed, ulab's performance is impressive. However, when running in tight loops with sampled data, the continuous memory allocation leads to eventual garbage collections. This is problematic in itself, but it becomes a critical issue in real-time applications, particularly on platforms with significant amounts of PSRAM, where garbage collection can take a considerable amount of time.

A real-world example: an ESP32-S3 with 8MB of PSRAM, which is not an unusual platform but rather an increasingly popular configuration, can take more than 400ms to perform a garbage collection in a worst-case scenario. Other platforms might experience a smaller penalty, but at the cost of more frequent garbage collections. Under these conditions, you can negate all performance gains of ulab, rendering it almost unusable.

I'm implementing a Kalman filter for a project and planned to use ulab for the matrix operations (already utilizing numpy in a CPython implementation). But, when porting it to Micropython/ulab, I realized I can't use it due to unpredictable memory allocation. I can't afford to have a garbage collection in the middle of the filter loop. This is just my personal experience, but I'm sure many others face similar challenges.

By the way, before finding this thread, while considering writing a small native C module to implement the basic operations in-place, I stumbled upon your excellent work at https://micropython-usermod.readthedocs.io. So thank you @v923z, because, one way or another, I will be utilizing your amazing contributions.

v923z · 2024-03-05T18:09:04Z

There is no question about the benefit of adding the out keyword arguments, the implementation is only a question of time.

The fact of the matter is, people request all kinds of features (some of them quite exotic, and sometimes I wonder, whether the request was genuine), but when I ask for at least a reasonable test script to go with the implementation, then they tend to disappear.

Having said this, out is really on the top of my list, and as soon as I finish the PID implementation, I'll see to it. Universal functions already support it, as does the spectrogram.

Here is a list from one of the comments above

Beyond these, what would you need for your filters?

v923z · 2024-03-05T18:12:51Z

By the way, before finding this thread, while considering writing a small native C module to implement the basic operations in-place,

The easiest solution in this case is to add the function in the utils module of ulab. Then it'll be automatically compiled, and you'll have access to all internal functions. If you can make the filter somewhat generic, and if you think that it could be useful for others, we could add that to ulab here.

jornamon · 2024-03-05T19:50:25Z

I have also noticed that there are some random requests here and there on this repo, so I'm glad to hear that this one is already in your plans!!

To implment a standard Kalman filter you use matrix addition, subtraction, multiplication, transposition and inversion. There are several variants that may requiere more operations, but I don't remember from the top of my head.

The easiest solution in this case is to add the function in the utils module of ulab. Then it'll be automatically compiled, and you'll have access to all internal functions. If you can make the filter somewhat generic, and if you think that it could be useful for others, we could add that to ulab here.

Oh! I read about this when I first found ulab, but I had forgotten about it. Thanks for the reminder! A Kalman filter is generic and valid for a lot of use cases, so I will try to implement it this way so you can check it out and see if it's worth adding it to ulab.

v923z · 2024-03-05T19:54:28Z

To implment a standard Kalman filter you use matrix addition, subtraction, multiplication, transposition and inversion.

What would take the out keyword argument in this case? If you use the binary operators, they'll return a new object, so it seems to me that you might want to add functions that do matrix addition etc., and accept the out keyword argument.

jornamon · 2024-03-06T04:52:51Z

Yes, something similar to numpy.add. I think there's no other way to do it. I'm not sure if, given that the operation itself is already defined in the binary operator, the code might be reused with just small adaptations.

v923z · 2024-03-06T05:54:53Z

In principle, it's doable: in

micropython-ulab/code/ndarray_operators.c

Lines 182 to 186 in 63dfbd1

    
           if(lhs->dtype == NDARRAY_UINT8) { 
        
               if(rhs->dtype == NDARRAY_UINT8) { 
        
                   results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT8); 
        
                   BINARY_LOOP(results, uint8_t, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, +); 
        
               } else if(rhs->dtype == NDARRAY_INT8) {

we have to assign results to the matrix that out points to.

However, you mentioned matrix inversion, which will produce a new matrix, no matter what. If you want to save on that, then we would need to add https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.inv.html, which allows you to overwrite the data.

jornamon · 2024-03-06T20:11:14Z

I didn't notice the different approach in scipy.linalg.inv, I wonder why they chose to allow overwriting the input but not writing to an existing matrix. Maybe it's not usual for algorithms to keep the original matrix after it's inverted, who knows. Using overwrite_ a would maintain compatibility with scipy, but implementing it with an out argument would be more consistent with other functions.

zachmoshe added the enhancement New feature or request label Apr 10, 2023

v923z self-assigned this Apr 12, 2023

v923z mentioned this issue Jul 20, 2023

[FEATURE REQUEST] Add np.random Module #569

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE REQUEST] add support to the "out=" argument #592

[FEATURE REQUEST] add support to the "out=" argument #592

zachmoshe commented Apr 10, 2023

v923z commented Apr 10, 2023

zachmoshe commented Apr 10, 2023 •

edited

Loading

dpgeorge commented Apr 11, 2023

jimmo commented Apr 11, 2023

v923z commented Apr 11, 2023

v923z commented Apr 11, 2023 •

edited

Loading

dpgeorge commented Apr 12, 2023

v923z commented Apr 12, 2023

zachmoshe commented Apr 23, 2023

mcskatkat commented Jul 20, 2023 •

edited

Loading

jornamon commented Mar 5, 2024

v923z commented Mar 5, 2024

v923z commented Mar 5, 2024 •

edited

Loading

jornamon commented Mar 5, 2024

v923z commented Mar 5, 2024

jornamon commented Mar 6, 2024

v923z commented Mar 6, 2024 •

edited

Loading

jornamon commented Mar 6, 2024

[FEATURE REQUEST] add support to the "out=" argument #592

[FEATURE REQUEST] add support to the "out=" argument #592

Comments

zachmoshe commented Apr 10, 2023

v923z commented Apr 10, 2023

zachmoshe commented Apr 10, 2023 • edited Loading

dpgeorge commented Apr 11, 2023

jimmo commented Apr 11, 2023

v923z commented Apr 11, 2023

v923z commented Apr 11, 2023 • edited Loading

dpgeorge commented Apr 12, 2023

v923z commented Apr 12, 2023

zachmoshe commented Apr 23, 2023

mcskatkat commented Jul 20, 2023 • edited Loading

jornamon commented Mar 5, 2024

v923z commented Mar 5, 2024

v923z commented Mar 5, 2024 • edited Loading

jornamon commented Mar 5, 2024

v923z commented Mar 5, 2024

jornamon commented Mar 6, 2024

v923z commented Mar 6, 2024 • edited Loading

jornamon commented Mar 6, 2024

zachmoshe commented Apr 10, 2023 •

edited

Loading

v923z commented Apr 11, 2023 •

edited

Loading

mcskatkat commented Jul 20, 2023 •

edited

Loading

v923z commented Mar 5, 2024 •

edited

Loading

v923z commented Mar 6, 2024 •

edited

Loading