
Is it possible to allocate pages of memory in userland that bypass L2 cache? #463

Closed
juj opened this issue May 1, 2018 · 5 comments

@juj

juj commented May 1, 2018

I'm attempting to allocate memory in user space that would bypass the L2 cache when writing to it, since the DMA peripheral is not aware of the L2 cache and is coherent with the main CPU only up to the L1 cache level.

Is this kind of feat achievable from user space in some way? (I can appreciate that this may be seen as hacky, but I would still very much like to pull it off in user space.)

I've been trying this out with an example written by Wallacoloo at https://github.com/Wallacoloo/Raspberry-Pi-DMA-Example , more specifically with this code

juj/Raspberry-Pi-DMA-Example@4dd506b

where the plan of action is to first allocate a regular cached page in virtual memory, then open the pagemap device file to figure out which physical address that cached page points to, and then mmap an uncached view of the same physical address into the user space process. Attempting this, though, I find that the whole system crashes on this line, printing out

Allocating uncached memory
Created uncached memory view, memsetting it to zero
Sleep before memset done
Memset done

before my Pi 3 Model B crashes altogether. Commenting out the memset avoids the crash, as does changing this line from

return virtToPhys(virt, pagemapfd) | 0x40000000;

to

return virtToPhys(virt, pagemapfd);

(which then, naturally, no longer maps to uncached memory).
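For reference, the pagemap lookup in that example is essentially the following (my rough paraphrase rather than a verbatim copy of the linked code; only the names virtToPhys and pagemapfd mirror the snippet above):

#include <stdint.h>
#include <unistd.h>

// Translate a virtual address of the calling process into a physical address by
// reading the corresponding 64-bit entry from /proc/self/pagemap.
// pagemapfd is an already-open read-only file descriptor to /proc/self/pagemap;
// this requires root, and the page should be present (e.g. touched and mlock()ed)
// for the result to stay valid.
uintptr_t virtToPhys(void *virt, int pagemapfd)
{
  long pageSize = sysconf(_SC_PAGESIZE);
  uint64_t entry = 0;
  // Each virtual page has one 8-byte entry in /proc/self/pagemap.
  off_t offset = ((uintptr_t)virt / pageSize) * sizeof(entry);
  if (pread(pagemapfd, &entry, sizeof(entry), offset) != sizeof(entry)) return 0;
  // Bit 63 says the page is present; bits 0-54 hold the page frame number.
  if (!(entry & (1ull << 63))) return 0;
  uintptr_t pfn = (uintptr_t)(entry & ((1ull << 55) - 1));
  return pfn * pageSize + ((uintptr_t)virt % pageSize);
}
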

Any thoughts on whether there might be a way to achieve this kind of task in user space? I.e. ultimately feeding the DMA controller directly from user space, without having to round-trip through a kernel module?

@pelwell
Contributor

pelwell commented May 1, 2018

What is the attraction of performing the allocation in userspace? Asking a driver (possibly a dma-buf driver) to call dma_alloc_coherent on your behalf is the obvious way to go about this.

@pelwell
Contributor

pelwell commented May 1, 2018

Also, are you sure the DMA controller is coherent with the ARM L1s? It might appear to be, since things don't linger in L1 for very long, but I don't think that it actually is - VPU L1-coherent, perhaps.

@luked99
Contributor

luked99 commented May 1, 2018 via email

@6by9
Contributor

6by9 commented May 2, 2018

If you're wanting a simple kernel driver, then the Android ION allocator (I'm waiting for @luked99 to shudder!) is merged to mainline and will act as an allocator giving you a dmabuf. You can then mmap the dmabuf into userspace, and use the ioctl(DMA_BUF_IOCTL_SYNC) call to control cache flushing.

IIRC the simplest set of kernel options I used was ANDROID, ION, and ION_CMA_HEAP, which maps any CMA region as an Android heap. (You may want to increase the size of the CMA heap using cma=64M in /boot/cmdline.txt.) You may choose to try one of the other allocators instead if you have no need for contiguous memory.
Looking at the history, I believe it should work. When I was trying it, raspberrypi/linux@54c153a#diff-7d2c15faf5a2c23a64c4287413e99ada hadn't been submitted, which caused entertaining allocations, or more often failures.
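Roughly, the userspace side would look something like this (a sketch only, assuming you already have a dmabuf fd from ION or another exporter; the exact allocation ioctls depend on the kernel version):

#include <linux/dma-buf.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

// Map a dmabuf into this process and fill it, bracketing the CPU access with
// DMA_BUF_IOCTL_SYNC so the caches are flushed before a DMA engine reads it.
void fill_dmabuf(int dmabuf_fd, size_t size)
{
  uint8_t *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, dmabuf_fd, 0);
  if (p == MAP_FAILED) return;

  struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE };
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync); // begin CPU access

  memset(p, 0, size); // CPU writes go through the normal cached mapping

  sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync); // end CPU access: flush for the DMA engine

  munmap(p, size);
}
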

@juj
Author

juj commented May 2, 2018

What is the attraction of performing the allocation in userspace?

You could just write yourself a kernel driver to do whatever it is you're trying to do

If you're wanting a simple kernel driver, ...

I have previously implemented a driver that attempts to do this, but my understanding is that the memory allocation bit (be it in kernel or user land) is not the problem: even if you allocate the memory on the kernel side and pass the physical/bus address of that memory over to user space, user space still has to create a view of that physical memory that bypasses the cache. IIUC, whether the allocated physical address came from /proc/self/pagemap or from kernel space, the path to map it to a virtual address would be the same.

What is the attraction of performing the allocation in userspace?

The reason I want to do this in userspace is to have a fast connection from dispmanx over to the DMA peripheral, as I don't think I'm able to call the dispmanx snapshotting API from a kernel module. I did try to write a kernel driver, but could never get such a user space -> kernel space combination to transfer a continuous 65 Mbps stream from dispmanx -> user space program -> kernel space program -> DMA peripheral. Given that there was a repository that mentioned being able to do this kind of uncached allocation in user space, I went to explore that route instead.

Bypassing the whole /proc/self/pagemap approach, I was able to get a mailbox-based method working, and am now using the following code:

struct GpuMemory
{
  uint32_t allocationHandle;
  void *virtualAddr;
  uintptr_t busAddress;
  uint32_t sizeBytes;
};

// Sends a pointer to the given buffer over to the VideoCore mailbox. See https://github.com/raspberrypi/firmware/wiki/Mailbox-property-interface
void SendMailbox(void *buffer)
{
  int vcio = open("/dev/vcio", 0);
  if (vcio < 0) FATAL_ERROR("Failed to open VideoCore kernel mailbox!");
  int ret = ioctl(vcio, _IOWR(/*MAJOR_NUM=*/100, 0, char *), buffer);
  close(vcio);
  if (ret < 0) FATAL_ERROR("SendMailbox failed in ioctl!");
}

// Defines the structure of a Mailbox message
template<int PayloadSize>
struct MailboxMessage
{
  MailboxMessage(uint32_t messageId):messageSize(sizeof(*this)), requestCode(0), messageId(messageId), messageSizeBytes(sizeof(uint32_t)*PayloadSize), dataSizeBytes(sizeof(uint32_t)*PayloadSize), messageEndSentinel(0) {}
  uint32_t messageSize;
  uint32_t requestCode;
  uint32_t messageId;
  uint32_t messageSizeBytes;
  uint32_t dataSizeBytes;
  union
  {
    uint32_t payload[PayloadSize];
    uint32_t result;
  };
  uint32_t messageEndSentinel;
};

// Message IDs for different mailbox GPU memory allocation messages
#define MEM_ALLOC_MESSAGE 0x3000c // This message is 3 u32s: numBytes, alignment and flags
#define MEM_FREE_MESSAGE 0x3000f // This message is 1 u32: handle
#define MEM_LOCK_MESSAGE 0x3000d // 1 u32: handle
#define MEM_UNLOCK_MESSAGE 0x3000e // 1 u32: handle

// Memory allocation flags
#define MEM_ALLOC_FLAG_DIRECT (1 << 2) // Allocate uncached memory that bypasses L1 and L2 cache on loads and stores

// Sends a mailbox message with 1xuint32 payload
uint32_t Mailbox(uint32_t messageId, uint32_t payload0)
{
  MailboxMessage<1> msg(messageId);
  msg.payload[0] = payload0;
  SendMailbox(&msg);
  return msg.result;
}

// Sends a mailbox message with 3xuint32 payload
uint32_t Mailbox(uint32_t messageId, uint32_t payload0, uint32_t payload1, uint32_t payload2)
{
  MailboxMessage<3> msg(messageId);
  msg.payload[0] = payload0;
  msg.payload[1] = payload1;
  msg.payload[2] = payload2;
  SendMailbox(&msg);
  return msg.result;
}

#define BUS_TO_PHYS(x) ((x) & ~0xC0000000)

// Allocates the given number of bytes in GPU side memory, and returns the virtual address and physical bus address of the allocated memory block.
// The virtual address holds an uncached view to the allocated memory, so writes and reads to that memory address bypass the L1 and L2 caches. Use
// this kind of memory to pass data blocks over to the DMA controller to process.
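// (Note: ALIGN_UP, PAGE_SIZE and mem_fd -- an open file descriptor to /dev/mem,
// presumably opened with O_RDWR | O_SYNC -- are assumed to be defined elsewhere in the program.)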
GpuMemory AllocateUncachedGpuMemory(uint32_t numBytes)
{
  GpuMemory mem;
  mem.sizeBytes = ALIGN_UP(numBytes, PAGE_SIZE);
  mem.allocationHandle = Mailbox(MEM_ALLOC_MESSAGE, /*size=*/mem.sizeBytes, /*alignment=*/PAGE_SIZE, /*flags=*/MEM_ALLOC_FLAG_DIRECT);
  mem.busAddress = Mailbox(MEM_LOCK_MESSAGE, mem.allocationHandle);
  mem.virtualAddr = mmap(0, mem.sizeBytes, PROT_READ | PROT_WRITE, MAP_SHARED, mem_fd, BUS_TO_PHYS(mem.busAddress));
  if (mem.virtualAddr == MAP_FAILED) FATAL_ERROR("Failed to mmap GPU memory!");
  return mem;
}

void FreeUncachedGpuMemory(GpuMemory mem)
{
  munmap(mem.virtualAddr, mem.sizeBytes);
  Mailbox(MEM_UNLOCK_MESSAGE, mem.allocationHandle);
  Mailbox(MEM_FREE_MESSAGE, mem.allocationHandle);
}

With this, I am able to allocate memory, map it into user space, and have an uncached view of it. Feeding the DMA peripheral works out beautifully as well, and I now have a 65 Mbps display data stream flowing from dispmanx over to an SPI-based display via DMA, saturating the bus (running at 66.67 MHz).

The memory resides on the GPU side, but that seems fine; I only need a ring buffer of about 300 KB.
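For completeness, usage then looks roughly like this (the ring buffer size and the DMA control block field below are illustrative only, not my actual code):

// Allocate an uncached ring buffer, write to it from the CPU, and hand its
// bus address to the DMA controller.
GpuMemory ring = AllocateUncachedGpuMemory(300 * 1024);
memset(ring.virtualAddr, 0, ring.sizeBytes); // CPU writes bypass L1 and L2
// dmaControlBlock->sourceAddress = ring.busAddress; // address the DMA engine should be given
// ... run the DMA transfer ...
FreeUncachedGpuMemory(ring);
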

But there's also a cache flushing system call

Thanks, I was not aware this existed for ARM; I thought it was an x86 ISA-like feature. I'm happy with the current method now; being able to skip caching altogether feels preferable here.

Also, are you sure the DMA controller is coherent with the ARM L1s? It might appear to be, since things don't linger in L1 for very long, but I don't think that it actually is - VPU L1-coherent, perhaps.

I think I was mistaken on this; I tried to retrace where I got that impression but could not find it, so I'll defer to you here. It looks like the MEM_ALLOC_FLAG_DIRECT allocation gives me memory that bypasses both L1 and L2, and it works out well for this use case.

I'll close this one as resolved. Thanks for all the help here.

@juj juj closed this as completed May 2, 2018