
[HOW-TO] processing high resolution video at high framerate #740

tturpin opened this issue Jul 7, 2023 · 8 comments

@tturpin commented Jul 7, 2023

Hi,

I would like to do some processing of high resolution video at the highest possible framerate, but I don't need color information. The sensor modes say that a 2304x1296 resolution is supported with a maximum framerate of 56.03, but it seems that retrieving the frames from picamera2's buffers takes too much time for that, and I don't understand why. According to Linux perf profiling, most of the time is spent in __memcpy_generic (one CPU core is always at 100%), and the amount of data to copy is well below what the processor's memory should be able to handle, if I understand correctly.

If I naively get RGB frames from the "main" buffer with capture_array, I get 10 FPS. Such frames should be roughly 12MB each with a 4-byte pixel size, so that's a transfer rate of about 120MB per second.
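
For completeness, the naive version is essentially just this (a minimal sketch, with simplified FPS counting compared to the scripts below):

from picamera2 import Picamera2
import time

picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration({"size": (2304, 1296)},
                                                   controls={"FrameRate": 56.03}))
picam2.start()

frames = 0
t0 = time.time()
while time.time() - t0 < 5:
    frame = picam2.capture_array("main")  # copies the full RGB frame out of the request
    frames += 1
print(frames / (time.time() - t0), "FPS")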

Then I tried to use the YUV format instead, which is more compact and well suited for extracting a grayscale image. Here is a small test:

from picamera2 import Picamera2
import time

picam2 = Picamera2()
config = picam2.create_video_configuration({"size": (2304, 1296)},
                                           lores={"size": (2304, 1296)},
                                           controls={"FrameRate": 56.03},
                                           buffer_count=2)
picam2.align_configuration(config)
picam2.configure(config)
picam2.start()

# Print the number of frames captured during each one-second window.
last_time = time.time()
last_count = 0
count = 0
while True:
    t0 = time.time()
    count = count + 1
    if t0 >= last_time + 1:
        print(count - last_count)
        last_count = count
        last_time = t0
    frame = picam2.capture_array("lores")

This yields about 22 FPS, which is consistent with the smaller frame size (YUV420 uses 1.5 bytes per pixel versus 4 for RGB), if I'm correct.
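
Working that out explicitly (assuming the default stream formats, 4 bytes per pixel for the main RGB stream and 1.5 for YUV420):

2304 * 1296 * 4   = 11,943,936 bytes, about 12 MB per RGB frame
2304 * 1296 * 1.5 =  4,478,976 bytes, about 4.5 MB per YUV420 frame

So at 22 FPS the copy still moves roughly 100 MB per second, close to the 120 MB/s of the RGB case.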
Finally, I managed to improve speed a little further by modifying request.make_array to copy only the luminance plane of the YUV frames:

from picamera2 import Picamera2
from picamera2.request import _MappedBuffer
import mmap
import time
import numpy as np

picam2 = Picamera2()
config = picam2.create_video_configuration({"size": (2304, 1296)},
                                           lores={"size": (2304, 1296)},
                                           controls={"FrameRate": 56.03},
                                           buffer_count=2)
picam2.align_configuration(config)
picam2.configure(config)
picam2.start()

s = picam2.stream_configuration("lores")["stride"]
(w, h) = picam2.stream_configuration("lores")["size"]

class _GrayBuffer(_MappedBuffer):
    # Map only the first (Y) plane of the YUV420 buffer instead of the whole frame.
    def __enter__(self):
        fd = self._MappedBuffer__fb.planes[0].fd
        planes_metadata = self._MappedBuffer__fb.metadata.planes
        buflen = planes_metadata[0].bytes_used
        self._MappedBuffer__mm = mmap.mmap(fd, buflen, mmap.MAP_SHARED,
                                           mmap.PROT_READ | mmap.PROT_WRITE)
        return self._MappedBuffer__mm

def make_gray_buffer(request):
    # Copy just the luminance plane out of the mapped buffer.
    with _GrayBuffer(request, "lores") as b:
        return np.array(b, dtype=np.uint8)

last_time = time.time()
last_count = 0
count = 0
while True:
    t0 = time.time()
    count = count + 1
    if t0 >= last_time + 1:
        print(count - last_count)
        last_count = count
        last_time = t0
    request = picam2.capture_request()
    frame = make_gray_buffer(request)
    request.release()
    gray = frame.reshape((h, s))

This yields roughly 31 FPS, which is better, but not twice as fast; I have no idea why.
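
One caveat, in case anyone reuses this: when the stride is larger than the width (it isn't here, since 2304 is already aligned), the (h, s) array contains padding bytes at the end of each row. They can be sliced away without a copy:

gray = frame.reshape((h, s))[:, :w]  # a view; drops the stride padding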

But according to "tinymembench", my Raspberry Pi 3B can move 1000MB per second with a standard memcpy, which is an order of magnitude more than what I see. And if I understand how frame handling works, the frames are placed in memory by the libcamera framework, and request.make_buffer only copies them out in order to free the buffers...

I'm sure that the call to make_buffer is the bottleneck, because if I only do capture_request and release without doing anything else, I get the expected 56 FPS framerate.
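
(The baseline test is just the same loop with the copy removed:

    request = picam2.capture_request()
    request.release()

and nothing else in the loop body besides the FPS counting.)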

So what am I missing? Is there a fundamental reason why I cannot go faster, or could it be a performance issue in the implementation?

Thanks!

@tturpin (Author) commented Jul 8, 2023

I found a kind of "workaround": having a pool of threads copy the mapped memory in parallel makes things faster, at the expense of insane CPU usage. I don't have any background in hardware, but I find this surprising; it makes me think that there must be a better way.

My testing code starts with the same mmapped buffer (containing only the gray information), and then runs up to 4 parallel tasks, each of which copies a slice of the buffer into a pre-allocated numpy array. The resulting framerate is as follows:

  • 30 FPS with 1 thread (so there is some slight overhead w.r.t. the version without threads)
  • 43 FPS with 2 threads
  • 45 FPS with 3 threads
  • 46 FPS with 4 threads

It doesn't run 4x faster unfortunately, and processing the extracted frames will add to the load, but it definitely has some effect, and using 2 threads may be worth it. Reducing the resolution to 1920x1080, the 56 FPS sensor limit can be reached.

Here is my testing code (ugly, I'm a beginner with Python):

from picamera2 import Picamera2
from picamera2.request import _MappedBuffer
import concurrent.futures
import mmap
import time
import numpy as np

res = (2304, 1296)
framerate = 56.03

picam2 = Picamera2()
config = picam2.create_video_configuration({"size": res},
                                           lores={"size": res},
                                           controls={"FrameRate": framerate},
                                           buffer_count=2)
picam2.align_configuration(config)
picam2.configure(config)
picam2.start()

s = picam2.stream_configuration("lores")["stride"]
(w, h) = picam2.stream_configuration("lores")["size"]

slices = 3  # number of parallel copy tasks (varied from 1 to 4 in the tests above)
slice_len = int(w * h / slices)  # assumes w * h divides evenly into slices (true here)

def copy_slice(result, m, i):
    # Copy the i-th slice of the mapped buffer into the pre-allocated result array.
    start = i * slice_len
    stop = (i + 1) * slice_len
    np.copyto(result[start:stop], np.array(m, copy=False, dtype=np.uint8)[start:stop])

executor = concurrent.futures.ThreadPoolExecutor(max_workers=slices)

class _GrayBuffer(_MappedBuffer):
    def __enter__(self):
        # Map only the first (Y) plane of the YUV420 buffer.
        fd = self._MappedBuffer__fb.planes[0].fd
        planes_metadata = self._MappedBuffer__fb.metadata.planes
        buflen = planes_metadata[0].bytes_used
        self._MappedBuffer__mm = mmap.mmap(fd, buflen, mmap.MAP_SHARED,
                                           mmap.PROT_READ | mmap.PROT_WRITE)
        return self._MappedBuffer__mm

    def make_gray(self):
        # Same mapping as __enter__, but the copy out is split across parallel tasks.
        fd = self._MappedBuffer__fb.planes[0].fd
        planes_metadata = self._MappedBuffer__fb.metadata.planes
        buflen = planes_metadata[0].bytes_used
        m = mmap.mmap(fd, buflen, mmap.MAP_SHARED,
                      mmap.PROT_READ | mmap.PROT_WRITE)
        result = np.empty(buflen, dtype=np.uint8)
        futures = [executor.submit(copy_slice, result, m, i) for i in range(slices)]
        for f in futures:
            f.result()
        return result

def make_gray_buffer(request):
    return _GrayBuffer(request, "lores").make_gray()

last_time = time.time()
last_count = 0
count = 0
while True:
    t0 = time.time()
    count = count + 1
    if t0 >= last_time + 1:
        print(count - last_count)
        last_count = count
        last_time = t0
    request = picam2.capture_request()
    frame = make_gray_buffer(request)
    request.release()
    gray = frame.reshape((h, s))

@davidplowman (Collaborator) commented:

Hi, you don't say what kind of Pi you're using. Trying your original script on my Pi 4, I seem to get about 37fps (rather than 31). I can get this up to 56fps by changing buffer_count=2 to buffer_count=6 (or indeed just deleting it entirely, as it defaults to 6, I believe). The reason this helps is that with 2 buffers you're starving the camera pipeline by not getting buffers back to it fast enough, so it ends up dropping frames because it has nowhere to put them. With higher framerates, it's important to have plenty of buffers.
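
Concretely, that means changing the configuration line in your script to something like:

config = picam2.create_video_configuration({"size": (2304, 1296)},
                                           lores={"size": (2304, 1296)},
                                           controls={"FrameRate": 56.03},
                                           buffer_count=6)  # or omit buffer_count for the default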

The root cause of the difficulty with these buffers is that they're allocated by the V4L2 kernel drivers, which makes them "uncached" and therefore slow to use, because you have to go all the way out to external memory for everything. Normally, unless you're very careful, the best way to use them is simply to memcpy everything into a regular cached buffer. But even this seems to be relatively slow in numpy. Using a different allocator is on our list of things to look at, though it's non-trivial because we'd have to start looking at where the caches do and don't need to be flushed/invalidated explicitly, to say nothing of the effect that using cached image buffers would have on other code.
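
In other words, the intended usage pattern is roughly this (a sketch; process() stands for whatever you do with the frame):

request = picam2.capture_request()
frame = request.make_array("lores")  # one memcpy out of the uncached dmabuf into ordinary cached memory
request.release()                    # return the buffer to the pipeline as soon as possible
process(frame)                       # all further work now touches cached memory only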

@tturpin (Author) commented Jul 10, 2023

Hi,
Thanks for your help. Sorry, I didn't mention my configuration: it's a Raspberry Pi 3B, and I'm running the 64-bit Raspberry Pi OS, as I found that it improved things.
I just tried increasing the number of buffers, and in my case it makes things worse, whether I set it to 3, 4, 5, or 6.

I don't understand the explanation of why more buffers are needed. I naively thought that either reading the frames is quick enough to keep up with the framerate, so that each buffer is released before the next one is ready, or it is slower, in which case it caps the capture rate no matter how many buffers there are. I also tried queue=False, which also decreased the framerate. I will need to re-read the doc.

I tried to understand this dma-buf business, but I couldn't find detailed explanations of the cache aspects. Interestingly, several of the search results on the subject concern Raspberry Pi video buffer copy performance.

One thing that grabbed my attention is the mention of an ioctl in the kernel documentation which doesn't seem to be present in picamera2:

For correctness and optimal performance, it is always required to use SYNC_START and SYNC_END before and after, respectively, when accessing the mapped address.
Could it be missing, and a possible avenue for improvement?
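
For reference, here is a sketch of what calling that ioctl from Python might look like (hypothetical; not something picamera2 currently does), with the constants transcribed from the kernel's include/uapi/linux/dma-buf.h:

import fcntl
import struct

# Constants from include/uapi/linux/dma-buf.h
DMA_BUF_SYNC_READ  = 1 << 0
DMA_BUF_SYNC_WRITE = 2 << 0
DMA_BUF_SYNC_START = 0
DMA_BUF_SYNC_END   = 1 << 2
DMA_BUF_IOCTL_SYNC = 0x40086200  # _IOW('b', 0, struct dma_buf_sync { __u64 flags; })

def dma_buf_sync(fd, flags):
    fcntl.ioctl(fd, DMA_BUF_IOCTL_SYNC, struct.pack("=Q", flags))

# Bracketing CPU reads of the mapped buffer would then look like:
# dma_buf_sync(fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ)
# ... read the mmapped memory ...
# dma_buf_sync(fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ)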

You say that the numpy copy seems slow: do you mean that numpy could be doing worse than a plain, "optimal" memcpy? (I haven't tried that yet.)

So, until a different dma-buf allocator is implemented, I guess I will just have to lower my resolution × framerate ambitions, or buy a Pi 4 ;-)

@davidplowman (Collaborator) commented:

I don't really understand why larger buffer counts don't help. It is possible that I'm running an unreleased version of the code, which might have some effect.

I don't know how numpy copies arrays. So far as I know there's no reason for it to do anything other than the optimal memcpy, but I've never checked. numpy always "seems" slow to me, so perhaps I'm just being suspicious for no good reason.

When we get V4L2 to allocate buffers, so far as I know they are required to be uncached; perhaps this makes them "easy to use correctly" (but slow). So we do have it on our list to revisit this in the (hopefully not too distant) future, but I couldn't say when.

@tturpin (Author) commented Jul 10, 2023

If an unreleased version causes more buffers to improve things, I'm looking forward to trying that...
I checked that numpy is no worse than doing a simple b.read() on the mmapped buffer. I also tried the ioctl suggested in the kernel docs, but only got an "Invalid argument" OSError.

@davidplowman (Collaborator) commented:

I'm not sure what to suggest. We know these buffers are "slow", though I'm a bit surprised at how slow they are. I'd have expected a memcpy in C to do better, though I don't have any figures to hand. Can you say anything about the processing you intend to do on them? You have the option of working on the uncopied buffers, but whichever way you go, I can't imagine there's much the CPU can do with a buffer this large at these framerates.

@tturpin (Author) commented Jul 10, 2023

I would like to try motion detection of a very small object (hence the high resolution) with low latency, using basic OpenCV primitives such as absdiff. It will most probably require working on only part of each frame (where the object currently is) to make this feasible, though with 4 cores a few things might be doable on full-resolution frames.

I should probably crop the frames as needed directly on the captured array, before the copy, so that I only copy what I actually need.
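
Something like this, perhaps (a sketch reusing the _GrayBuffer class from above; x, y, roi_w, roi_h are a hypothetical region of interest, and prev_roi is the crop from the previous frame):

import cv2
import numpy as np

request = picam2.capture_request()
with _GrayBuffer(request, "lores") as b:
    # View of the Y plane in the uncached buffer; no copy yet.
    y_plane = np.frombuffer(b, dtype=np.uint8, count=s * h).reshape((h, s))
    # Copy only the region of interest into ordinary cached memory.
    roi = np.array(y_plane[y:y + roi_h, x:x + roi_w])
request.release()
diff = cv2.absdiff(roi, prev_roi)  # basic frame differencing on the cropped region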

Still, it would be best if the copy weren't so slow and CPU-intensive.

@davidplowman (Collaborator) commented:

I've tried this on a Pi 4, and it seems to take about 40ms per 18MB (a 12MP YUV420 image), which is slow. The time is the same whether I use Python or a regular memcpy in C++. So I don't really understand what's happening.
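
For reference, a way to reproduce the Python side of that measurement might look like this (a sketch, not the exact code used):

import time
import numpy as np
from picamera2.request import _MappedBuffer

request = picam2.capture_request()
with _MappedBuffer(request, "main") as b:
    t0 = time.monotonic()
    copy = np.array(b, copy=False, dtype=np.uint8).copy()  # one full copy out of the mapped buffer
    print((time.monotonic() - t0) * 1000, "ms for", copy.nbytes / 1e6, "MB")
request.release()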
