[HOW-TO] processing high resolution video at high framerate #740
I found a kind of "workaround": having a pool of processes copy the mapped memory in parallel makes things faster, at the expense of insane CPU usage. I don't have any background in hardware, but I find this surprising; it makes me think that there must be a better way. My testing code starts with the same "mmapped" buffer (with only the gray information), and then runs up to 4 parallel tasks, each of which copies a slice of the buffer into a pre-allocated numpy array. The resulting framerate is as follows:
Here is my testing code (ugly, I'm a beginner with Python):
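The original script isn't reproduced here. As a rough illustration of the slice-partitioning idea only, here is a self-contained sketch where a random numpy array stands in for the mmapped capture buffer; note it uses a thread pool rather than the process pool described above, so it shows the structure of the approach, not its speedup (all names and sizes are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

H, W, N_WORKERS = 1296, 2304, 4  # 2304x1296 luma plane, 4 parallel tasks

# Stand-ins for the mmapped capture buffer and the pre-allocated destination.
src = np.random.randint(0, 256, (H, W), dtype=np.uint8)
dst = np.empty_like(src)

def copy_slice(bounds):
    # Each task copies its own band of rows into the destination array.
    start, stop = bounds
    dst[start:stop] = src[start:stop]

rows = [(i * H // N_WORKERS, (i + 1) * H // N_WORKERS)
        for i in range(N_WORKERS)]
with ThreadPoolExecutor(max_workers=N_WORKERS) as ex:
    list(ex.map(copy_slice, rows))

print("copied correctly:", np.array_equal(src, dst))
```

With real uncached buffers and a process pool, each worker would attach to the shared mapping and copy its band the same way.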
Hi, you don't say what kind of Pi you're using. Trying your original script on my Pi 4 I seem to get about 37fps (rather than 31). I can get this up to 56fps by changing […]

The root cause of the difficulty with these buffers is that they're allocated by the V4L2 kernel drivers, which makes them "uncached" and therefore slow to use, because you have to go all the way out to the external memory for everything. Normally, unless you're very careful, the best way to use them is simply to […]
Hi, I don't understand the explanation about why more buffers are needed. I naively thought that either the reading of frames is quick enough to follow the framerate and release each buffer before the next one is ready for reading, or it is slower and will never wait when capping […]

I tried to understand this dma-buf thing, but I couldn't find detailed explanations of the cache aspects. Interestingly, several of the search results about this concern Raspberry Pi video buffer copy performance:
One thing that grabbed my attention is the mention of ioctl in the kernel documentation, which doesn't seem present in picamera2:
You say that the numpy copy seems slow: do you imply that numpy could be doing worse than a simple, "optimal" memcpy? (I didn't try yet.)

So, until a different dma-buf allocator is implemented, I guess I will just have to lower my resolution*framerate ambition, or buy a Pi 4 ;-)
I don't really understand why larger buffer numbers don't help. It is possible that I'm running an unreleased version of the code, which might have some effect.

I don't know how numpy copies arrays. So far as I know there's no reason for it to do anything other than the optimal memcpy, but I've never checked. numpy always "seems" slow to me, so perhaps I'm just being suspicious for no good reason.

When we get V4L2 to allocate buffers, so far as I know they are required to be uncached; perhaps this makes them "easy to use correctly" (but slow). We do have it on our list to revisit this in the (hopefully not very distant) future, but I couldn't say when.
If an unreleased version would cause more buffers to improve things, I'm looking forward to trying that...
I'm not sure what to suggest. We know these buffers are "slow", though I'm a bit surprised how slow they are. I'd have expected a memcpy in C to do better, though I don't have any figures to hand. Can you say anything about what processing you intended to do on them? You have the option of working on the uncopied buffers, though whichever way you go, I can't imagine there's much the CPU can do with a large buffer like this at these framerates.
I would like to try some motion detection of a very small object (hence the high resolution) with low latency, using basic OpenCV primitives such as absdiff. It will most probably require working on only part of each frame (where the object currently is) to make this feasible, though with 4 cores a few things might be doable on full-resolution frames. I should probably crop the frames as needed directly on the captured array before the copy, to only copy what I actually need. Still, it would be best if the copy weren't so long and CPU-intensive.
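The crop-before-copy idea can be sketched in plain numpy. Here the ROI coordinates and arrays are made up for illustration, zero-filled arrays stand in for captured luma planes, and a numpy absolute difference plays the role that cv2.absdiff would play in the real pipeline:

```python
import numpy as np

H, W = 1296, 2304
x, y, w, h = 1000, 500, 256, 256  # hypothetical ROI around the tracked object

def crop_copy(plane, x, y, w, h):
    # Copy only the ROI out of the (slow, uncached) capture buffer:
    # w*h bytes instead of the full W*H luma plane.
    return np.array(plane[y:y + h, x:x + w])

prev = np.zeros((H, W), dtype=np.uint8)
curr = np.zeros((H, W), dtype=np.uint8)
curr[600, 1100] = 255  # a simulated bright moving object inside the ROI

a = crop_copy(prev, x, y, w, h)
b = crop_copy(curr, x, y, w, h)
# Same idea as cv2.absdiff followed by a threshold, in plain numpy:
diff = np.abs(a.astype(np.int16) - b.astype(np.int16)).astype(np.uint8)
motion = bool(diff.max() > 20)
print("motion detected:", motion)
```

Because the slicing happens on the mapped array before the copy is forced, only the 256x256 region crosses the slow uncached memory, not the whole frame.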
I've tried this on a Pi 4 and it seems to take about 40ms per 18MB (a 12MP YUV420 image), which is slow. The time is the same using both Python and a regular memcpy in C++, so I don't really understand what's happening.
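For comparison, the same measurement is easy to reproduce on ordinary (cached) memory, which gives a baseline against the ~40ms figure quoted above. A minimal timing sketch, using the sizes mentioned in this thread:

```python
import time
import numpy as np

SIZE = 18 * 1024 * 1024  # ~18MB, about the size of a 12MP YUV420 frame
src = np.random.randint(0, 256, SIZE, dtype=np.uint8)
dst = np.empty_like(src)

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(dst, src)  # a straight element-for-element copy
per_copy = (time.perf_counter() - t0) / reps

print(f"{per_copy * 1e3:.1f} ms per copy, {SIZE / per_copy / 1e6:.0f} MB/s")
```

On cached RAM this should land near the machine's memcpy bandwidth; a large gap between this number and the figure measured on the V4L2 buffers is consistent with the uncached-mapping explanation.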
Hi,
I would like to do some processing of high resolution video at the highest possible framerate, but I don't need color information. The sensor modes say that a 2304x1296 resolution is supported with a maximum framerate of 56.03, but it seems that retrieving the frames from picamera2's buffers takes too much time for that, and I don't understand why. According to Linux perf profiling, most of the time is spent in __memcpy_generic (one CPU core is always at 100%), and the amount of data to copy does not match the processor's memory bandwidth, if I understand correctly.
If I naively get RGB frames from the "main" buffer with capture_array, I get 10 FPS. Such frames should be roughly 12MB with a 4-byte pixel size. That's a transfer speed of 120MB per second.
Then I tried to use the YUV format instead, which is more compact and well suited for black and white extraction. Here is a small test:
This yields 22 FPS, which is consistent with the data size, which is half the RGB size if I'm correct.
Finally, I managed to improve speed a little further by modifying request.make_array to only copy the luminance part of YUV frames:
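The modified code isn't reproduced here, but the shape of the change can be sketched with a simulated YUV420 buffer. The function name and stride handling below are illustrative only; a real version would use the stride reported in the stream configuration, which may pad each row:

```python
import numpy as np

W, H = 2304, 1296
stride = 2304  # real capture buffers may pad each row to an alignment boundary

# Simulated mapped YUV420 buffer: the Y plane (H rows of stride bytes)
# followed by the U and V planes, stride * H * 3 // 2 bytes in total.
buf = np.random.randint(0, 256, stride * H * 3 // 2, dtype=np.uint8)

def make_luma_array(buf, w, h, stride):
    # Copy only the first h*stride bytes (the Y plane), dropping any row
    # padding and skipping the chroma planes entirely: two thirds of the
    # bytes of a full YUV420 copy.
    return buf[: h * stride].reshape(h, stride)[:, :w].copy()

y = make_luma_array(buf, W, H, stride)
print(y.shape, y.nbytes, buf.nbytes)
```

Since the Y plane is two thirds of a YUV420 frame, this copies 2/3 of the bytes rather than half, which may be part of why the speedup over the full-YUV copy is less than 2x.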
This yields roughly 31 FPS, which is better, but not twice as fast; I have no idea why.
But according to "tinymembench", my Raspberry Pi 3B can move 1000MB per second in a standard memcpy, which is an order of magnitude more than what I see. And if I understand how frame handling works, the frames are supposed to be placed in memory by the libcamera framework, and request.make_buffer only copies them to free the buffers...
I'm sure that it's the call to make_buffer which is the bottleneck, because if I only do "capture_request" and "release" without doing anything, I get the expected 56FPS framerate.
So what am I missing? Is there a fundamental reason why I cannot go faster, or could it be a performance issue in the implementation?

Thanks!