Is it possible to allocate pages of memory in userland that bypass L2 cache? #463
Comments
What is the attraction of performing the allocation in userspace? Asking a driver (possibly a dma-buf driver) to call dma_alloc_coherent on your behalf is the obvious way to go about this.
Also, are you sure the DMA controller is coherent with the ARM L1s? It might appear to be, since things don't linger in L1 for very long, but I don't think that it actually is - VPU L1-coherent, perhaps.
On 1 May 2018 at 15:43, juj wrote:
I'm attempting to allocate memory in user space in a way that bypasses the L2 cache on writes, since the DMA peripheral is not aware of the L2 cache and is coherent with the main CPU only up to the L1 cache level.
Is this kind of feat achievable from user space in some way? (I appreciate this may look hacky, but I would still very much like to pull it off in user space.)
I've been trying this out with an example written by Wallacoloo at https://github.com/Wallacoloo/Raspberry-Pi-DMA-Example , more specifically with this code
juj/Raspberry-Pi-DMA-Example@4dd506b
where the plan of action is to first allocate a regular cached page in virtual memory, then read the pagemap device file to find the physical address that the cached page maps to, and then mmap an uncached view of the same physical address into the user space process. Attempting this, though, I find that the whole system crashes on this line, printing out
Allocating uncached memory
Created uncached memory view, memsetting it to zero
Sleep before memset done
Memset done
before my Pi 3 Model B crashes altogether. Commenting out the memset avoids the crash, as does changing this line from
return virtToPhys(virt, pagemapfd) | 0x40000000;
to
return virtToPhys(virt, pagemapfd);
(but that then naturally no longer maps to uncached memory)
Any thoughts on whether there might be a way to achieve this kind of task in user space? I.e. ultimately feeding the DMA controller directly from user space, without having to round trip through a kernel module?
You could just write yourself a kernel driver to do whatever it is you're trying to do - it's not that hard. There are some good examples around to get you started.
Or you could write a kernel driver which allocates the memory uncached and lets you mmap() it into user space. When I've seen this done before, though, it usually seems (at least to me) like a bit of a hack.
But there's also a cache flushing system call you might be able to use: http://man7.org/linux/man-pages/man2/cacheflush.2.html
You could just allocate memory with mmap() and then use that call to flush cached pages to physical memory, although you might have some annoyances if you ever have to deal with more than one page at a time. But I think just writing a kernel module would be easier!
Luke
If you're wanting a simple kernel driver, then the Android ION allocator (I'm waiting for @luked99 to shudder!) is merged to mainline and will act as an allocator giving you a dmabuf. You can then mmap the dmabuf into userspace, and use the ioctl(DMA_BUF_IOCTL_SYNC) call to control cache flushing. IIRC the simplest set of kernel options I used was ANDROID, ION, and ION_CMA_HEAP, which maps any CMA regions as Android heaps. (You may want to increase the size of the CMA heap using …)
I have previously implemented a driver that attempts to do this, but my understanding is that the memory allocation bit (be it in kernel or user land) is not the problem: even if you do allocate the memory on the kernel side and pass the physical/bus address of that memory over to user space, user space still has to create a view of that physical memory that bypasses the cache. IIUC, whether you got your allocated physical address by …
The reason I want to do this in userspace is to have a fast connection from … Bypassing the whole … Here is the code I am now using:
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#define PAGE_SIZE 4096
#define ALIGN_UP(x, alignment) (((x) + (alignment) - 1) & ~((alignment) - 1))
#define FATAL_ERROR(msg) do { fprintf(stderr, "%s\n", msg); exit(1); } while(0)
int mem_fd = -1; // opened at startup with open("/dev/mem", O_RDWR | O_SYNC)
struct GpuMemory
{
uint32_t allocationHandle;
void *virtualAddr;
uintptr_t busAddress;
uint32_t sizeBytes;
};
// Sends a pointer to the given buffer over to the VideoCore mailbox. See https://github.com/raspberrypi/firmware/wiki/Mailbox-property-interface
void SendMailbox(void *buffer)
{
int vcio = open("/dev/vcio", 0);
if (vcio < 0) FATAL_ERROR("Failed to open VideoCore kernel mailbox!");
int ret = ioctl(vcio, _IOWR(/*MAJOR_NUM=*/100, 0, char *), buffer);
close(vcio);
if (ret < 0) FATAL_ERROR("SendMailbox failed in ioctl!");
}
// Defines the structure of a Mailbox message
template<int PayloadSize>
struct MailboxMessage
{
MailboxMessage(uint32_t messageId):messageSize(sizeof(*this)), requestCode(0), messageId(messageId), messageSizeBytes(sizeof(uint32_t)*PayloadSize), dataSizeBytes(sizeof(uint32_t)*PayloadSize), messageEndSentinel(0) {}
uint32_t messageSize;
uint32_t requestCode;
uint32_t messageId;
uint32_t messageSizeBytes;
uint32_t dataSizeBytes;
union
{
uint32_t payload[PayloadSize];
uint32_t result;
};
uint32_t messageEndSentinel;
};
// Message IDs for different mailbox GPU memory allocation messages
#define MEM_ALLOC_MESSAGE 0x3000c // This message is 3 u32s: numBytes, alignment and flags
#define MEM_FREE_MESSAGE 0x3000f // This message is 1 u32: handle
#define MEM_LOCK_MESSAGE 0x3000d // 1 u32: handle
#define MEM_UNLOCK_MESSAGE 0x3000e // 1 u32: handle
// Memory allocation flags
#define MEM_ALLOC_FLAG_DIRECT (1 << 2) // Allocate uncached memory that bypasses L1 and L2 cache on loads and stores
// Sends a mailbox message with 1xuint32 payload
uint32_t Mailbox(uint32_t messageId, uint32_t payload0)
{
MailboxMessage<1> msg(messageId);
msg.payload[0] = payload0;
SendMailbox(&msg);
return msg.result;
}
// Sends a mailbox message with 3xuint32 payload
uint32_t Mailbox(uint32_t messageId, uint32_t payload0, uint32_t payload1, uint32_t payload2)
{
MailboxMessage<3> msg(messageId);
msg.payload[0] = payload0;
msg.payload[1] = payload1;
msg.payload[2] = payload2;
SendMailbox(&msg);
return msg.result;
}
#define BUS_TO_PHYS(x) ((x) & ~0xC0000000)
// Allocates the given number of bytes in GPU side memory, and returns the virtual address and physical bus address of the allocated memory block.
// The virtual address holds an uncached view to the allocated memory, so writes and reads to that memory address bypass the L1 and L2 caches. Use
// this kind of memory to pass data blocks over to the DMA controller to process.
GpuMemory AllocateUncachedGpuMemory(uint32_t numBytes)
{
GpuMemory mem;
mem.sizeBytes = ALIGN_UP(numBytes, PAGE_SIZE);
mem.allocationHandle = Mailbox(MEM_ALLOC_MESSAGE, /*size=*/mem.sizeBytes, /*alignment=*/PAGE_SIZE, /*flags=*/MEM_ALLOC_FLAG_DIRECT);
mem.busAddress = Mailbox(MEM_LOCK_MESSAGE, mem.allocationHandle);
mem.virtualAddr = mmap(0, mem.sizeBytes, PROT_READ | PROT_WRITE, MAP_SHARED, mem_fd, BUS_TO_PHYS(mem.busAddress));
if (mem.virtualAddr == MAP_FAILED) FATAL_ERROR("Failed to mmap GPU memory!");
return mem;
}
void FreeUncachedGpuMemory(GpuMemory mem)
{
munmap(mem.virtualAddr, mem.sizeBytes);
Mailbox(MEM_UNLOCK_MESSAGE, mem.allocationHandle);
Mailbox(MEM_FREE_MESSAGE, mem.allocationHandle);
}
With this, I am able to allocate memory visible to user space with an uncached view of it. Feeding the DMA peripheral then works out beautifully as well: I now have a 65 Mbps display data stream flowing from dispmanx over to an SPI-based display via DMA, which is able to saturate the bus (running at 66.67 MHz). The memory resides in GPU space, but that seems fine - I only need a ring buffer of about 300 KB.
Thanks, I was not aware this existed for ARM; I thought it was an x86 ISA-like feature. I'm happy with the current method now - being able to skip caching altogether feels preferable here.
I think I was mistaken on this aspect; I tried to retrace where I got that impression but could not find it, so I'll take your word for it. It looks like I can close this one as resolved. Thanks for all the help here.