
Conversation


@nika-nordic nika-nordic commented Apr 9, 2024

(Simplified PR description from the original proposal, since the scope has been vastly reduced.)

This PR adds cache helpers specific to nRF devices that streamline the process of allocating a DMA buffer in the correct memory region and managing the data cache on nRF54H20.

@nika-nordic nika-nordic added the DNM This PR should not be merged (Do Not Merge) label Apr 10, 2024
@nika-nordic nika-nordic marked this pull request as ready for review April 10, 2024 08:51
@zephyrbot zephyrbot added the platform: nRF Nordic nRFx label Apr 10, 2024
@nika-nordic nika-nordic requested a review from kl-cruz April 10, 2024 09:10

@hubertmis hubertmis left a comment


I suggested clarifications in the API description.

But the issue requiring changes is the algorithm description in dmm.c. The current one could fail in some scenarios (race conditions).

Contributor


Since in dmm_dma_buffer_in_release you also provide user_buffer, it seems that user_buffer is redundant here.

Contributor Author


The thing is that dmm_dma_buffer_in_prepare needs to check whether the provided buffer has the correct attributes (location, alignment) before performing dynamic allocation:

  • if user_buffer is in the correct location (reachable by the DMA of the given device) and has the correct attributes, then *buffer_in = user_buffer;
  • otherwise *buffer_in = dynamically_allocate(user_length);
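
A minimal sketch of that decision logic; the helper functions, the error handling, and the exact signature are assumptions for illustration, not the final API:

```c
/* Hypothetical sketch: use the user's buffer if the DMA of dev can reach it
 * and it satisfies the cache-alignment requirements, otherwise allocate a
 * bounce buffer dynamically. The helpers are placeholders.
 */
int dmm_dma_buffer_in_prepare(const struct device *dev, void *user_buffer,
			      size_t user_length, void **buffer_in)
{
	if (buffer_is_dma_reachable(dev, user_buffer, user_length) &&
	    buffer_is_cache_aligned(user_buffer, user_length)) {
		*buffer_in = user_buffer;
		return 0;
	}

	*buffer_in = dynamically_allocate(user_length);

	return (*buffer_in != NULL) ? 0 : -ENOMEM;
}
```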

Contributor


I wonder if this function should be split into alloc and prepare. Currently this API does not cover the case where a driver would like to allocate a buffer at compile time (e.g. UART for polling out or for RX flushing) and then use this API only for cache management (if we want to use dmm only for cache management). alloc might be common for both directions.

For a dynamic buffer it would look like:

```c
err = dmm_dma_buffer_alloc(dev, &dma_buffer, length);
if (err == 0) {
	dmm_dma_buffer_out_prepare(dev, dma_buffer, user_buffer, length);
}
```

For statically allocated buffers:

```c
uint8_t dma_buffer[length] MEM_REGION(...);

// Prepare function just copies and performs cache writeback (if needed)
dmm_dma_buffer_out_prepare(dev, dma_buffer, user_buffer, length2); // length2 <= length
```

Member


As a next step, the dmm API should be extended with macros for static buffer allocation. These macros can be used by the drivers covering the scenario you described, or by applications to improve the performance of the drivers.

If the buffers are allocated using the to-be-created dmm macros, they are guaranteed to be bypassed by dmm_dma_buffer_in/out_prepare() without dynamic buffer allocation and copying overhead.

I don't think the driver should have any indication of whether the buffer in use will be bypassed by dmm. The driver should not track the qualities of the buffers, because that is dmm's responsibility. The driver should blindly call dmm_..._prepare() on any buffer.
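
For illustration, the driver-side call under that model could look like the sketch below; the signature mirrors the in_prepare variant discussed earlier in this thread and start_transfer() is a hypothetical placeholder, not part of the proposal:

```c
/* The driver calls prepare unconditionally; dmm decides whether to bypass. */
void *dma_buf;
int err = dmm_dma_buffer_out_prepare(dev, user_buf, len, &dma_buf);

if (err == 0) {
	/* dma_buf either aliases user_buf (bypassed) or points to a copy in a
	 * DMA-able region; the driver does not need to know which.
	 */
	start_transfer(dev, dma_buf, len);
}
```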

Contributor Author


> Currently this API does not cover the case where a driver would like to allocate a buffer at compile time (e.g. UART for polling out or for RX flushing) and then use this API only for cache management (if we want to use dmm only for cache management).

Actually this API covers that - if the buffer has the correct attributes (location and cache alignment), then no dynamic allocation will happen and dmm will simply return what was provided to it.

Contributor


Is this the device that will use DMA (e.g. UARTE), or the mempool device that is associated with the DMA device through a DT reference?
Imo, it cannot be the peripheral, because there is no generic runtime device API to get the mempool reference from the device.

Contributor Author


Originally I wanted to use a reference to the device that will use DMA, but if we cannot obtain the mempool associated with it at runtime, then I guess it must be the mempool device provided directly by the context calling dmm (i.e. the UARTE driver). Is this feasible?

Member


No. The device drivers must be generic: they must work correctly with the compatible peripherals regardless of how they are integrated into the system (which memory is DMA-able by them, what the properties of this memory are). The integration information should be included in the devicetree properties of the compatible node and retrieved by the device driver. The device driver should use only the node it instantiates.

If a devicetree node is missing a property like mempool, then maybe the driver should not use dmm at all.

Member


Scratch my previous message. I misunderstood the proposal.

A device driver X uses a reference to the devicetree node it instantiates (N), with no changes compared to the current device drivers.
What we would modify is to add a property M to node N, being a reference to a mempool. The device driver X would retrieve property M from node N and use the const struct device * (D) represented by M. When X calls dmm functions, it would use D as the dev argument.

If I correctly described the proposal, then it's feasible.


@nika-nordic nika-nordic Apr 17, 2024


I see another issue here. In the dmm.c _prepare API we need to check whether the provided buffer is already in the correct memory region and, if not, use the mempool device to dynamically allocate a new buffer.

Does this mean we actually need to provide two devices to the _prepare API:

  • a reference to the RAM region node associated with the specified peripheral node (to check the buffer location),
  • a reference to the mempool device (to perform dynamic allocation if the buffer is in the wrong location)?

If yes, I am wondering if these two could somehow be linked, so we provide a single device instead of two.

> Imo, it cannot be the peripheral, because there is no generic runtime device API to get the mempool reference from the device.

@nordic-krch would it be feasible to add such a runtime device API? I believe it would improve the abstraction and also solve this issue.

Member


> Does this mean we actually need to provide two devices to the _prepare API?

Good point. I don't think a devicetree node should carry an indication of software configuration, such as whether a given memory region contains a pool for dynamically allocated memory. The devicetree node should describe hardware properties, such as which memory region is DMA-able by this peripheral. Something like:

```dts
ram30: memory@0x... {
  properties = NON_CACHEABLE;
};

uart130 {
  compatible = "nordic,...";
  memory = <&ram30 &ram31>;
};
```

And dmm should figure out internally which allocator to use to correctly allocate a buffer in the memory pointed to by &uart130.

Contributor Author


Updated the PR - dmm now accepts a memory region address pointer as its first parameter. Device drivers (and other components) can easily obtain this value by peeking at a devicetree property. dmm has an internal list of regions and finds the one matching the address provided by the user of the API.
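
A hypothetical sketch of how a driver could obtain that address from its devicetree node; the property name memory-regions and the dmm function signature are assumptions here, only the devicetree macros are standard Zephyr ones:

```c
/* Resolve the memory region address referenced by the driver's DT node and
 * pass it to dmm as the region parameter.
 */
#define UARTE_NODE   DT_NODELABEL(uart130)
#define UARTE_REGION ((void *)DT_REG_ADDR(DT_PHANDLE(UARTE_NODE, memory_regions)))

void *dma_buf;
int err = dmm_buffer_out_prepare(UARTE_REGION, user_buf, len, &dma_buf);
```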

@carlescufi carlescufi assigned nordic-krch and unassigned anangl Apr 11, 2024
@nordic-krch
Contributor

I've been experimenting with cache, MVDMA and access time to RAM (in particular RAM3, used for slow peripherals). I've noticed that access to RAM3 is very slow. On average it takes ~0.37 us to transfer a byte from global RAM (cacheable) to RAM3 and ~0.3 us to transfer from RAM3 to global RAM. The dependency is linear, so a memcpy of 100 bytes takes 30-37 us (100 cycles per byte, all with DCACHE enabled). That is a lot of time. On the other hand, setting up an MVDMA transfer takes ~4 us (it's mainly writing back data and job descriptors from DCACHE). The MVDMA transfer itself takes approximately the same time as the memcpy. Given that, I think it might be good to have an asynchronous API as well. Something like:

```c
void handler(void *dmm_buffer, size_t length, void *user_data)
{
    struct device *dev = (struct device *)user_data;

    buffer_set(dev, dmm_buffer, length);
    transfer_start(dev);
}

int dmm_buffer_prepare(void *buffer, size_t len, dmm_handler_t handler, void *user_data);
```

Depending on the threshold, the user handler would be called either directly from the context of the buffer_prepare call or from the MVDMA interrupt handler.

@nika-nordic
Contributor Author

@nordic-krch alternatively we could make the _prepare call always blocking and handle MVDMA in the dmm context, but I guess the consequence is that the _prepare API would no longer be usable in IRQ context, and it needs to be (for example for the UARTE ENDRX event), correct?

My understanding is that adding a handler parameter to both out_prepare and in_prepare will make the code more flexible (see the sketch below):

  • in the first version of the code, without MVDMA, we can simply call the handler at the end of the _prepare function execution
  • once the MVDMA handling is added, the handler could be called asynchronously from the MVDMA interrupt context
  • if the user wants to enforce in-place copying, then the handler can be NULL

Let me know if it makes sense (cc: @hubertmis as this is an extension of the original design).
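
A rough sketch of that usage, assuming the dmm_buffer_prepare() signature proposed above; rx_buf_ready() and uarte_rx_buf_set() are hypothetical placeholders for the driver's callback and buffer-set routine:

```c
/* The handler is invoked either in place (memcpy path) or later from the
 * MVDMA interrupt, so the caller can run from the UARTE ENDRX interrupt
 * without blocking.
 */
static void rx_buf_ready(void *dmm_buffer, size_t length, void *user_data)
{
	const struct device *dev = user_data;

	uarte_rx_buf_set(dev, dmm_buffer, length);
}

/* Called e.g. from the UARTE ENDRX interrupt handler: */
dmm_buffer_prepare(next_buf, next_len, rx_buf_ready, (void *)dev);
```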

Contributor Author


General comment - now I see that the _dma_ part could be removed from all of the APIs, as it feels redundant.


hubertmis commented Apr 16, 2024

> cc: @hubertmis as this is an extension of the original design

I don't know the details of the requirement to manage DMA buffers directly from ISRs. I would expect ISRs to be short and to avoid memcpy() or setting up DMA transfers in them. Why not simply defer the execution from the ISR context to a thread context (or a workqueue) and call the blocking dmm functions from there?

BTW, for MVDMA the buffers must be prepared as well, potentially copied to correctly aligned and padded cacheable buffers. If we implemented MVDMA usage instead of memcpy for dmm, we would get recursive execution: the UART driver calling dmm calling the MVDMA driver calling dmm calling the MVDMA driver calling dmm...

@nordic-krch
Contributor

We already have performance issues due to peripheral register access and buffer management, so imo having an option to use DMA and reduce CPU load is a plus, even if there will be cases where it is not used. Deferring to thread context takes an additional amount of time (posting a semaphore is a few us). We already have an asynchronous API in UART where the next buffer is requested from interrupt context, and there is RTIO, which allows asynchronous I2C and SPI transfers (not yet supported by Nordic).

> for MVDMA the buffers must be prepared as well, potentially copied to correctly aligned and padded cacheable buffers.

I think that at some point Zephyr will get a __cacheline_aligned attribute (like in Linux: https://elixir.bootlin.com/linux/latest/source/include/linux/cache.h#L53) and buffers used by drivers will be tagged with it (e.g. in sensor drivers), so dma-dmm might be used then. __cacheline_aligned would use a dedicated section, so it is possible to detect whether an address is from that section and act accordingly. Something like:

```c
int dmm_prepare(buf, len, handler, user_data)
{
	if (in_cache_section(buf) && (len > thr)) {
		dma_dmm(buf, len, handler, user_data);
	} else {
		dmm_buf = memcpy_dmm(buf, len);
		handler(dmm_buf, len, user_data);
	}

	return 0;
}
```

> we would get recursive execution

Sorry, I didn't get where the recursive execution would happen.

@nika-nordic
Contributor Author

> I think that at some point Zephyr will get a __cacheline_aligned attribute (like in Linux: https://elixir.bootlin.com/linux/latest/source/include/linux/cache.h#L53) and buffers used by drivers will be tagged with it (e.g. in sensor drivers), so dma-dmm might be used then. __cacheline_aligned would use a dedicated section, so it is possible to detect whether an address is from that section and act accordingly.

Please take a look at the pseudo-code for the _prepare API in dmm.c - when the provided buffer is in the correct memory region and is correctly aligned, it is simply returned to the user, without dynamic allocation. According to the original design, this logic would live in dmm.c, so the driver does not need to contain any additional logic.

> Sorry, I didn't get where the recursive execution would happen.

If the MVDMA driver used dmm as well, then recursive execution could happen. So the MVDMA driver in this use case would need to use buffers allocated at compile time, to avoid dynamic allocation and extra copying.

> (...) and we already have an asynchronous API in UART where the next buffer is requested from interrupt context, and there is RTIO, which allows asynchronous I2C and SPI transfers (not yet supported by Nordic).

I think simply adding a handler as an additional parameter to the _prepare API would allow us to cover this case in the future, as I described in #71278 (comment).


hubertmis commented Apr 17, 2024

> I think that at some point Zephyr will get a __cacheline_aligned attribute

I don't think this will happen due to performance reasons.

Let's discuss an example:

We can get a network frame through a Bluetooth or Wi-Fi radio. The network frame is stored in network buffers. A part of the network frame structure is the payload, which we might want to transmit using I2S, UART, or whatever. This payload in the network buffers is not aligned to anything DMA-related.

In Linux, a user-space application would take the network frame, parse it, and send it to the I2S or UART driver. This would require the data to cross the kernel/user-space boundary two times, copying the buffers, and during each copy the buffer could get aligned with __cacheline_aligned.

But in Zephyr on MCUs we prefer a zero-copy model, so if possible I2S should DMA the data out directly from the network buffers. Only on architectures where that is not possible should the data be copied.

I can imagine application code like this:

```c
net_buf_len = recv(socket, net_buffer);
result = parse_net_frame(net_buffer, net_buf_len, &payload, &payload_len); // get pointer to the middle of the network frame
if (result) {
  i2s_send(payload, payload_len);
}
free_net_buffer(net_buffer);
```

As you can see, nothing in this code gets aligned by the application. The network buffer is aligned for the Bluetooth or Wi-Fi DMA. The payload is potentially moved and aligned for the I2S DMA by the I2S driver. This data movement would happen on nRF54H, but never on nRF52 or nRF54L.

But I agree we can add the handler parameter to potentially implement an MVDMA helper in the future. Maybe something like copying the unaligned ends of the buffer with memcpy and the middle part with MVDMA.

@nordic-krch
Contributor

> I don't think this will happen due to performance reasons.

That's where it might happen, e.g. a sensor or a shell transport that has a static buffer for transferring data can be tagged.

@hubertmis
Member

> > I don't think this will happen due to performance reasons.
>
> That's where it might happen, e.g. a sensor or a shell transport that has a static buffer for transferring data can be tagged.

Sure, there are examples in which buffers can be tagged by the application. There is an idea to extend the dmm API with such tags. But we need to be prepared for tag-less buffers as well because there are also scenarios in which tags won't be used.

@nika-nordic
Contributor Author

@hubertmis @nordic-krch
Can you describe what buffer tagging is for future reference?


hubertmis commented Apr 22, 2024

> @hubertmis @nordic-krch Can you describe what buffer tagging is for future reference?

I understood tagging as adding preprocessor macros ensuring DMA-ability of the buffers, like:

```c
uint8_t untagged_buffer[BUFFER_SIZE];
uint8_t tagged_buffer[BUFFER_SIZE] DMM_MEM_DMABLE_BY(device_tree_node);
```

The tag would expand to a linker section placed in RAM reachable by DMA, and to alignment/padding attributes if caching requires them.
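
A rough sketch of what such a tag could expand to; the section name is an assumption, the node argument is unused in this simplified version, and a real implementation would also need to pad the buffer size to a full cache line:

```c
/* Hypothetical expansion: place the buffer in a DMA-able linker section and
 * align it to the data cache line size.
 */
#define DMM_MEM_DMABLE_BY(node)                                \
	__attribute__((__section__(".dma_buffers")))           \
	__attribute__((__aligned__(CONFIG_DCACHE_LINE_SIZE)))
```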

This attribute denotes that DMA operations can be performed from a given region.

Signed-off-by: Nikodem Kastelik <nikodem.kastelik@nordicsemi.no>
DMM stands for Device Memory Management and its role is to streamline the process of allocating DMA buffers in the correct memory region and managing the data cache.

Signed-off-by: Nikodem Kastelik <nikodem.kastelik@nordicsemi.no>
The added tests verify output and input buffer allocation using the dmm component.

Signed-off-by: Nikodem Kastelik <nikodem.kastelik@nordicsemi.no>
@nika-nordic nika-nordic changed the title soc: nordic: add dmm component soc: nordic: add cache helpers Jun 13, 2024
@nika-nordic nika-nordic requested a review from hubertmis June 13, 2024 11:16
@zephyrbot zephyrbot added the area: SPI SPI bus label Jun 13, 2024
@zephyrbot zephyrbot requested review from tbursztyka and teburd June 13, 2024 11:16
@nika-nordic nika-nordic removed the DNM This PR should not be merged (Do Not Merge) label Jun 13, 2024
@carlescufi carlescufi requested review from kartben and osaether June 13, 2024 11:18
gmarull previously approved these changes Jun 13, 2024
hubertmis previously approved these changes Jun 13, 2024

@hubertmis hubertmis left a comment


Spotted potential memory leaks in the SPI driver. Other than that it looks good!

@nika-nordic
Contributor Author

I have dropped the SPI commits from this PR - I will add them in a separate PR.


@nordic-krch nordic-krch left a comment


Minor comments, which are not blocking.

