
Conversation


@nika-nordic nika-nordic commented Apr 9, 2024

(Simplified PR description from the original proposal, since the scope has been vastly reduced.)

This PR adds cache helpers specific to nRF devices that streamline the process of allocating a DMA buffer in the correct memory region and managing the data cache on nRF54H20.

@nika-nordic nika-nordic added the DNM This PR should not be merged (Do Not Merge) label Apr 10, 2024
@nika-nordic nika-nordic marked this pull request as ready for review April 10, 2024 08:51
@zephyrbot zephyrbot added the platform: nRF Nordic nRFx label Apr 10, 2024
@nika-nordic nika-nordic requested a review from kl-cruz April 10, 2024 09:10

@hubertmis hubertmis left a comment


I suggested clarifications in the API description.

But the issue requiring changes is the algorithm description in dmm.c. The current one could fail in some scenarios (race conditions).

Contributor


Since in dmm_dma_buffer_in_release you also provide user_buffer, it seems that user_buffer is redundant here.

Contributor Author


The thing is that dmm_dma_buffer_in_prepare needs to check whether the provided buffer has the correct attributes (location, alignment) before performing dynamic allocation:

  • if user_buffer is in the correct location (reachable by the DMA of the given device) and has the correct attributes, then *buffer_in = user_buffer;
  • otherwise *buffer_in = dynamically_allocate(user_length);
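
A minimal sketch of that decision logic; the helper functions, the error handling, and the exact signature are assumptions for illustration, not the final API:

```c
/* Hypothetical sketch: use the user's buffer if the DMA of dev can reach it
 * and it satisfies the cache-alignment requirements, otherwise allocate a
 * bounce buffer dynamically. The helpers are placeholders.
 */
int dmm_dma_buffer_in_prepare(const struct device *dev, void *user_buffer,
			      size_t user_length, void **buffer_in)
{
	if (buffer_is_dma_reachable(dev, user_buffer, user_length) &&
	    buffer_is_cache_aligned(user_buffer, user_length)) {
		*buffer_in = user_buffer;
		return 0;
	}

	*buffer_in = dynamically_allocate(user_length);

	return (*buffer_in != NULL) ? 0 : -ENOMEM;
}
```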

Contributor


I wonder if this function should be split into alloc and prepare. Currently this API does not cover the case where a driver would like to allocate a buffer at compile time (e.g. UART for polling out or for RX flushing) and then use this API only for cache management (if we want to use dmm only for cache management). alloc might be common for both directions.

For a dynamic buffer it would look like:

```c
err = dmm_dma_buffer_alloc(dev, &dma_buffer, length);
if (err == 0) {
	dmm_dma_buffer_out_prepare(dev, dma_buffer, user_buffer, length);
}
```

For statically allocated buffers:

```c
uint8_t dma_buffer[length] MEM_REGION(...);

// Prepare function just copies and performs cache writeback (if needed)
dmm_dma_buffer_out_prepare(dev, dma_buffer, user_buffer, length2); // length2 <= length
```

Member


As a next step, the dmm API should be extended with macros for static buffer allocation. These macros can be used by the drivers covering the scenario you described, or by applications to improve the performance of the drivers.

If the buffers are allocated using the to-be-created dmm macros, they are guaranteed to be bypassed by dmm_dma_buffer_in/out_prepare() without dynamic buffer allocation and copying overhead.

I don't think the driver should have any indication of whether the buffer in use will be bypassed by dmm. The driver should not track the qualities of the buffers, because that is dmm's responsibility. The driver should blindly call dmm_..._prepare() on any buffer.
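
For illustration, the driver-side call under that model could look like the sketch below; the signature mirrors the in_prepare variant discussed earlier in this thread and start_transfer() is a hypothetical placeholder, not part of the proposal:

```c
/* The driver calls prepare unconditionally; dmm decides whether to bypass. */
void *dma_buf;
int err = dmm_dma_buffer_out_prepare(dev, user_buf, len, &dma_buf);

if (err == 0) {
	/* dma_buf either aliases user_buf (bypassed) or points to a copy in a
	 * DMA-able region; the driver does not need to know which.
	 */
	start_transfer(dev, dma_buf, len);
}
```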

Contributor Author


> Currently this API does not cover the case where a driver would like to allocate a buffer at compile time (e.g. UART for polling out or for RX flushing) and then use this API only for cache management (if we want to use dmm only for cache management).

Actually this API covers that - if the buffer has the correct attributes (location and cache alignment), then no dynamic allocation will happen and dmm will simply return what was provided to it.

Contributor


Is this the device that will use DMA (e.g. UARTE), or the mempool device that is associated with the DMA device through a DT reference?
Imo, it cannot be the peripheral, because there is no generic runtime device API to get the mempool reference from the device.

Contributor Author


Originally I wanted to use a reference to the device that will use DMA, but if we cannot obtain the mempool associated with it at runtime, then I guess it must be the mempool device provided directly by the context calling dmm (i.e. the UARTE driver). Is this feasible?

Member


No. The device drivers must be generic: they must work correctly with the compatible peripherals regardless of how they are integrated into the system (which memory is DMA-able by them, what the properties of this memory are). The integration information should be included in the devicetree properties of the compatible node and retrieved by the device driver. The device driver should use only the node it instantiates.

If a devicetree node is missing a property like mempool, then maybe the driver should not use dmm at all.

Member


Scratch my previous message. I misunderstood the proposal.

A device driver X uses a reference to the devicetree node it instantiates (N), with no changes compared to the current device drivers.
What we would modify is to add a property M to node N, being a reference to a mempool. The device driver X would retrieve property M from node N and use the const struct device * (D) represented by M. When X calls dmm functions, it would use D as the dev argument.

If I correctly described the proposal, then it's feasible.


@nika-nordic nika-nordic Apr 17, 2024


I see another issue here. In the dmm.c _prepare API we need to check whether the provided buffer is already in the correct memory region and, if not, use the mempool device to dynamically allocate a new buffer.

Does this mean we actually need to provide two devices to the _prepare API:

  • a reference to the RAM region node associated with the specified peripheral node (to check the buffer location),
  • a reference to the mempool device (to perform dynamic allocation if the buffer is in the wrong location)?

If yes, I am wondering if these two could somehow be linked, so we provide a single device instead of two.

> Imo, it cannot be the peripheral, because there is no generic runtime device API to get the mempool reference from the device.

@nordic-krch would it be feasible to add such a runtime device API? I believe it would improve the abstraction and also solve this issue.

Member


> Does this mean we actually need to provide two devices to the _prepare API?

Good point. I don't think a devicetree node should carry an indication of software configuration, such as whether a given memory region contains a pool for dynamically allocated memory. The devicetree node should describe hardware properties, such as which memory region is DMA-able by this peripheral. Something like:

```dts
ram30: memory@0x... {
  properties = NON_CACHEABLE;
};

uart130 {
  compatible = "nordic,...";
  memory = <&ram30 &ram31>;
};
```

And dmm should figure out internally which allocator to use to correctly allocate a buffer in the memory pointed to by &uart130.

Contributor Author


Updated the PR - dmm now accepts a memory region address pointer as its first parameter. Device drivers (and other components) can easily obtain this value by peeking at a devicetree property. dmm has an internal list of regions and finds the one matching the address provided by the user of the API.
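
A hypothetical sketch of how a driver could obtain that address from its devicetree node; the property name memory-regions and the dmm function signature are assumptions here, only the devicetree macros are standard Zephyr ones:

```c
/* Resolve the memory region address referenced by the driver's DT node and
 * pass it to dmm as the region parameter.
 */
#define UARTE_NODE   DT_NODELABEL(uart130)
#define UARTE_REGION ((void *)DT_REG_ADDR(DT_PHANDLE(UARTE_NODE, memory_regions)))

void *dma_buf;
int err = dmm_buffer_out_prepare(UARTE_REGION, user_buf, len, &dma_buf);
```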

@carlescufi carlescufi assigned nordic-krch and unassigned anangl Apr 11, 2024
@nordic-krch
Contributor

I've been experimenting with cache, MVDMA and access time to RAM (in particular RAM3, used for slow peripherals). I've noticed that access to RAM3 is very slow. On average it takes ~0.37 us to transfer a byte from global RAM (cacheable) to RAM3 and ~0.3 us to transfer from RAM3 to global RAM. The dependency is linear, so a memcpy of 100 bytes takes 30-37 us (100 cycles per byte, all with DCACHE enabled). That is a lot of time. On the other hand, setting up an MVDMA transfer takes ~4 us (it's mainly writing back data and job descriptors from DCACHE). The MVDMA transfer itself takes approximately the same time as the memcpy. Given that, I think it might be good to have an asynchronous API as well. Something like:

```c
void handler(void *dmm_buffer, size_t length, void *user_data)
{
    struct device *dev = (struct device *)user_data;

    buffer_set(dev, dmm_buffer, length);
    transfer_start(dev);
}

int dmm_buffer_prepare(void *buffer, size_t len, dmm_handler_t handler, void *user_data);
```

Depending on the threshold, the user handler would be called either directly from the context of the buffer_prepare call or from the MVDMA interrupt handler.

@nika-nordic
Contributor Author

@nordic-krch alternatively we could make the _prepare call always blocking and handle MVDMA in the dmm context, but I guess the consequence is that the _prepare API would no longer be usable in IRQ context, and it needs to be (for example for the UARTE ENDRX event), correct?

My understanding is that adding a handler parameter to both out_prepare and in_prepare will make the code more flexible (see the sketch below):

  • in the first version of the code, without MVDMA, we can simply call the handler at the end of the _prepare function execution
  • once the MVDMA handling is added, the handler could be called asynchronously from the MVDMA interrupt context
  • if the user wants to enforce in-place copying, then the handler can be NULL

Let me know if it makes sense (cc: @hubertmis as this is an extension of the original design).
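
A rough sketch of that usage, assuming the dmm_buffer_prepare() signature proposed above; rx_buf_ready() and uarte_rx_buf_set() are hypothetical placeholders for the driver's callback and buffer-set routine:

```c
/* The handler is invoked either in place (memcpy path) or later from the
 * MVDMA interrupt, so the caller can run from the UARTE ENDRX interrupt
 * without blocking.
 */
static void rx_buf_ready(void *dmm_buffer, size_t length, void *user_data)
{
	const struct device *dev = user_data;

	uarte_rx_buf_set(dev, dmm_buffer, length);
}

/* Called e.g. from the UARTE ENDRX interrupt handler: */
dmm_buffer_prepare(next_buf, next_len, rx_buf_ready, (void *)dev);
```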

Contributor Author


General comment - now I see that the _dma_ part could be removed from all of the APIs, as it feels redundant.


hubertmis commented Apr 16, 2024

> cc: @hubertmis as this is an extension of the original design

I don't know the details of the requirement to manage DMA buffers directly from ISRs. I would expect ISRs to be short and to avoid memcpy() or setting up DMA transfers in them. Why not simply defer the execution from the ISR context to a thread context (or a workqueue) and call the blocking dmm functions from there?

BTW, for MVDMA the buffers must be prepared as well, potentially copied to correctly aligned and padded cacheable buffers. If we implemented MVDMA usage instead of memcpy for dmm, we would get recursive execution: the UART driver calling dmm calling the MVDMA driver calling dmm calling the MVDMA driver calling dmm...

@nordic-krch
Contributor

We already have performance issues due to peripheral register access and buffer management, so imo having an option to use DMA and reduce CPU load is a plus, even if there will be cases where it is not used. Deferring to thread context takes an additional amount of time (posting a semaphore is a few us). We already have an asynchronous API in UART where the next buffer is requested from interrupt context, and there is RTIO, which allows asynchronous I2C and SPI transfers (not yet supported by Nordic).

> for MVDMA the buffers must be prepared as well, potentially copied to correctly aligned and padded cacheable buffers.

I think that at some point Zephyr will get a __cacheline_aligned attribute (like in Linux: https://elixir.bootlin.com/linux/latest/source/include/linux/cache.h#L53) and buffers used by drivers will be tagged with it (e.g. in sensor drivers), so dma-dmm might be used then. __cacheline_aligned would use a dedicated section, so it is possible to detect whether an address is from that section and act accordingly. Something like:

```c
int dmm_prepare(buf, len, handler, user_data)
{
	if (in_cache_section(buf) && (len > thr)) {
		dma_dmm(buf, len, handler, user_data);
	} else {
		dmm_buf = memcpy_dmm(buf, len);
		handler(dmm_buf, len, user_data);
	}

	return 0;
}
```

> we would get recursive execution

Sorry, I didn't get where the recursive execution would happen.

@nika-nordic
Contributor Author

> I think that at some point Zephyr will get a __cacheline_aligned attribute (like in Linux: https://elixir.bootlin.com/linux/latest/source/include/linux/cache.h#L53) and buffers used by drivers will be tagged with it (e.g. in sensor drivers), so dma-dmm might be used then. __cacheline_aligned would use a dedicated section, so it is possible to detect whether an address is from that section and act accordingly.

Please take a look at the pseudo-code for the _prepare API in dmm.c - when the provided buffer is in the correct memory region and is correctly aligned, it is simply returned to the user, without dynamic allocation. According to the original design, this logic would live in dmm.c, so the driver does not need to contain any additional logic.

> Sorry, I didn't get where the recursive execution would happen.

If the MVDMA driver used dmm as well, then recursive execution could happen. So the MVDMA driver in this use case would need to use buffers allocated at compile time, to avoid dynamic allocation and extra copying.

> (...) and we already have an asynchronous API in UART where the next buffer is requested from interrupt context, and there is RTIO, which allows asynchronous I2C and SPI transfers (not yet supported by Nordic).

I think simply adding a handler as an additional parameter to the _prepare API would allow us to cover this case in the future, as I described in #71278 (comment).


hubertmis commented Apr 17, 2024

> I think that at some point Zephyr will get a __cacheline_aligned attribute

I don't think this will happen due to performance reasons.

Let's discuss an example:

We can get a network frame through a Bluetooth or Wi-Fi radio. The network frame is stored in network buffers. A part of the network frame structure is the payload, which we might want to transmit using I2S, UART, or whatever. This payload in the network buffers is not aligned to anything DMA-related.

In Linux, a user-space application would take the network frame, parse it, and send it to the I2S or UART driver. This would require the data to cross the kernel/user-space boundary two times, copying the buffers, and during each copy the buffer could get aligned with __cacheline_aligned.

But in Zephyr on MCUs we prefer a zero-copy model, so if possible I2S should DMA the data out directly from the network buffers. Only on architectures where that is not possible should the data be copied.

I can imagine application code like this:

```c
net_buf_len = recv(socket, net_buffer);
result = parse_net_frame(net_buffer, net_buf_len, &payload, &payload_len); // get pointer to the middle of the network frame
if (result) {
  i2s_send(payload, payload_len);
}
free_net_buffer(net_buffer);
```

As you can see, nothing in this code gets aligned by the application. The network buffer is aligned for the Bluetooth or Wi-Fi DMA. The payload is potentially moved and aligned for the I2S DMA by the I2S driver. This data movement would happen on nRF54H, but never on nRF52 or nRF54L.

But I agree we can add the handler parameter to potentially implement an MVDMA helper in the future. Maybe something like copying the unaligned ends of the buffer with memcpy and the middle part with MVDMA.

@nordic-krch
Contributor

> I don't think this will happen due to performance reasons.

That's where it might happen, e.g. a sensor or a shell transport that has a static buffer for transferring data can be tagged.

@hubertmis
Member

> > I don't think this will happen due to performance reasons.
>
> That's where it might happen, e.g. a sensor or a shell transport that has a static buffer for transferring data can be tagged.

Sure, there are examples in which buffers can be tagged by the application. There is an idea to extend the dmm API with such tags. But we need to be prepared for tag-less buffers as well because there are also scenarios in which tags won't be used.

@nika-nordic
Contributor Author

@hubertmis @nordic-krch
Can you describe what buffer tagging is for future reference?


hubertmis commented Apr 22, 2024

> @hubertmis @nordic-krch Can you describe what buffer tagging is for future reference?

I understood tagging as adding preprocessor macros ensuring DMA-ability of the buffers, like:

```c
uint8_t untagged_buffer[BUFFER_SIZE];
uint8_t tagged_buffer[BUFFER_SIZE] DMM_MEM_DMABLE_BY(device_tree_node);
```

The tag would expand to a linker section placed in RAM reachable by DMA, and to alignment/padding attributes if caching requires them.
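
A rough sketch of what such a tag could expand to; the section name is an assumption, the node argument is unused in this simplified version, and a real implementation would also need to pad the buffer size to a full cache line:

```c
/* Hypothetical expansion: place the buffer in a DMA-able linker section and
 * align it to the data cache line size.
 */
#define DMM_MEM_DMABLE_BY(node)                                \
	__attribute__((__section__(".dma_buffers")))           \
	__attribute__((__aligned__(CONFIG_DCACHE_LINE_SIZE)))
```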

This attribute denotes that DMA operations can be performed from a given region.

Signed-off-by: Nikodem Kastelik <nikodem.kastelik@nordicsemi.no>
DMM stands for Device Memory Management and its role is to streamline the process of allocating DMA buffers in the correct memory region and managing the data cache.

Signed-off-by: Nikodem Kastelik <nikodem.kastelik@nordicsemi.no>
The added tests verify output and input buffer allocation using the dmm component.

Signed-off-by: Nikodem Kastelik <nikodem.kastelik@nordicsemi.no>
@nika-nordic nika-nordic changed the title soc: nordic: add dmm component soc: nordic: add cache helpers Jun 13, 2024
@nika-nordic nika-nordic requested a review from hubertmis June 13, 2024 11:16
@zephyrbot zephyrbot added the area: SPI SPI bus label Jun 13, 2024
@zephyrbot zephyrbot requested review from tbursztyka and teburd June 13, 2024 11:16
@nika-nordic nika-nordic removed the DNM This PR should not be merged (Do Not Merge) label Jun 13, 2024
@carlescufi carlescufi requested review from kartben and osaether June 13, 2024 11:18
gmarull previously approved these changes Jun 13, 2024
hubertmis previously approved these changes Jun 13, 2024

@hubertmis hubertmis left a comment


Spotted potential memory leaks in the SPI driver. Other than that it looks good!

@nika-nordic
Contributor Author

I have dropped the SPI commits from this PR - I will add them in a separate PR.


@nordic-krch nordic-krch left a comment


Minor comments, which are not blocking.

