
Where does the OPAQUE mmal buffer data pointer actually point to? #691

Closed
doctorseus opened this issue Nov 30, 2016 · 21 comments


doctorseus commented Nov 30, 2016

I am trying to obtain the physical memory pointer to the buffers used to store the camera frames. These buffers should be located in the GPU memory space as far as I know. But is there a way to obtain the physical address of these buffers?

Right now, the value stored in the data field of a MMAL_BUFFER_HEADER_T returned by a (camera) component using MMAL_ENCODING_OPAQUE looks as if there is another layer running on the closed-source firmware managing these buffers, and everything I can get on the ARM side is a handle which can be associated with a certain buffer, but only by the firmware running on the VPU. Is this correct? Is there any information available about this?

Big picture: I want to run a compute kernel on the QPU which uses the raw data of the camera frame. It looks like I would be able to do that as long as I can get the physical pointer to the camera frame first. (Like the fastpath (GL_OES_EGL_image_external) for OpenGL, but without limiting myself to OpenGL shaders.)

For example this is such a buffer used later on for the OpenGL fastpath:
https://github.com/raspberrypi/userland/blob/bb15afe33b313fe045d52277a78653d288e04f67/host_applications/linux/apps/raspicam/RaspiTex.c#L447


doctorseus commented Nov 30, 2016

https://www.raspberrypi.org/forums/viewtopic.php?t=53698&p=413535

Seems like my assumption was correct. Anyway, if someone stumbles over this and has an idea how to achieve what I am planning to do, please let me know.


Seneral commented Mar 7, 2020

@doctorseus Sorry to bother you but have you ever found a solution?
I'm trying to port my OpenGL ES CV code to a QPU assembly program for improved performance and flexibility, and this step is the most unclear so far. Of course I could use the KHR extension first to interpret the buffer, but that would be a huge waste; better to avoid the opaque handles entirely than to needlessly move the buffer around. Any help is greatly appreciated.


doctorseus commented Mar 7, 2020

@Seneral that was some time ago, but yes, I actually was able to make progress on this with some help from @6by9, an engineer on the Raspberry Pi team. You can find the thread with the relevant details on the physical buffer address here: https://www.raspberrypi.org/forums/viewtopic.php?f=43&t=167652
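For anyone landing here later, the gist of that forum thread: buffers handed out via the mailbox interface (mem_lock from hello_fft's mailbox.c) come back as VideoCore bus addresses, and on BCM2835 the top two address bits only select a cache alias. A minimal sketch of the translation; the helper name is mine, and the mask is the one documented in the BCM2835 ARM Peripherals datasheet:

```c
#include <stdint.h>

/* VideoCore bus addresses on BCM2835 encode a cache alias in the top
 * two bits (0x0, 0x4, 0x8 or 0xC prefix). Clearing those bits yields
 * the ARM-visible physical address of the same memory. */
static inline uint32_t bus_to_phys(uint32_t bus_addr)
{
    return bus_addr & ~0xC0000000u;
}
```

So a buffer the firmware reports at, say, 0xC0100000 lives at physical 0x00100000 from the ARM's point of view.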

I have a repo where I documented my progress and I actually managed to run a custom shader in real-time. The repo was private but I set it to public just now in case you can get something useful out of it: https://github.com/doctorseus/rpi-qpucamera-sandbox
Look here: https://github.com/doctorseus/rpi-qpucamera-sandbox/blob/master/qpucamera/qpucamera.c#L105

Edit: All of this is quite complex; I assume you will need a good understanding of all the topics involved to be able to follow along.

Please note that writing these shaders by hand is really time-intensive. With the progress of the open-source GPU driver, which I believe is now the default on the Raspberry Pi 4, it should be possible to run OpenCL code, which would make all of this much easier. I didn't try it myself, but I would guess that nowadays it is better to explore in that direction instead of the approach I took here, or at least to combine their shader compiler with the raw execution path I used. Let me know if you are able to get something going.


Seneral commented Mar 7, 2020

Thank you so much, I just read through the thread and that was exactly the missing puzzle piece I needed. Seems I'm pretty much in the same boat as you: I want the absolute lowest latency with optimized code and don't mind diving into all this. I need it for three separate projects, so there is much to gain for me.
Thanks so much for the code as well, I will take a look tomorrow.

Oh, in regards to OpenCL: just like with OpenCV, I fear abstractions will keep me from actually making use of all the optimizations possible on the QPUs. Just from reading the VideoCore IV documentation once, there is so much I can't see a compiler making use of compared to custom assembly code.


Seneral commented Mar 8, 2020

Thanks, that code is a goldmine. Pretty much everything I had looked at from hello_fft, separated out for custom use in a nice package. It just saved me (and hopefully others) a ton of effort.

It seems your QPU program is set up as a general-purpose program iterating over the whole frame, meaning one QPU processes the whole frame. From my research I'd have imagined having to set up the V3D pipeline using control lists to do the frame tiling, so the scheduler can set the QPUs to work on each tile automatically. I have not fully researched this yet, though, having only read through the VC IV docs once. Any reason you did not follow this path, or did you simply try this first and then stop?

The thing is, even then 720p60 is an impressive result. With a two-pass blob-detection shader on OpenGL ES (admittedly doing a lot of fetching in a 5x5 kernel) I only managed 640p45/720p20, so even with this I'm already VERY happy with the results.


doctorseus commented Mar 8, 2020

At the moment I don't even know what each of the shaders in the repo was used for and which one worked (maybe you can get it working), so this is all from memory:
First, for me the QPU was just a vector processor. The documentation describes it as "16-way virtual parallelism (4-way multiplexed)", so when you run one of my shaders it will use not one but all 4 QPUs to process 4x4=16 values at once.
The next step I wanted to take was to use the TMU. I don't remember much about this anymore, but I know the general issue was that the format in which the VPU (camera) stores the image is not one the TMU supports, so I first needed (shader) code to convert it. (This is also a topic in the forum thread: the T-format. The Broadcom guys used the VPU, another SIMD processor, to convert it, but that one is not accessible to us.)
I believe this is where I ran out of free time and stopped working on it.
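To make the "16-way virtual parallelism" concrete: each QPU instruction nominally operates on a 16-element vector, executed physically as 4 lanes multiplexed over 4 clock cycles. A scalar C model of a single vector add, purely to fix the mental picture (this is of course not real QPU code, just an illustration of the lane count):

```c
#include <stdint.h>

#define QPU_LANES 16  /* one QPU instruction covers 16 elements */

/* Scalar model of one QPU vector add: a single instruction applies to
 * all 16 lanes (physically 4 ALUs cycled 4 times). */
static void qpu_vadd_model(const int32_t a[QPU_LANES],
                           const int32_t b[QPU_LANES],
                           int32_t out[QPU_LANES])
{
    for (int lane = 0; lane < QPU_LANES; lane++)
        out[lane] = a[lane] + b[lane];
}
```

This is why a single shader already processes 16 values "at once" even when only one QPU is scheduled.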

The rest of the V3D pipeline might not be of use; those are helpful mechanisms for handling and scheduling framebuffers and the like, basically everything you need when you want to implement an OpenGL driver.


Seneral commented Mar 8, 2020

OK, so after some more research: yeah, you're right, the V3D pipeline is not too useful if you don't need the geometry-processing stuff. It's way easier and faster to do the tiling of the data yourself. I'm going to use direct register access as documented instead of the mailbox interface for that, which gives much more control.
I just want to note that so far you have actually only used the vertex data pipeline to move texture data. The only TMU (texture fetching) use I've found is in the _tex version, which writes into the vertex pipeline (I assume for debugging).
Optimally, instead of the vertex pipeline it would use the TMU for fetching and the TLB (Tile Buffer) for writing. I'm not exactly sure yet how that works without the full V3D pipeline, but I'm sure it would allow much more throughput than the vertex pipe.

Also, while you are right that you process 16 values at once, you are actually only using one out of the 12 QPUs.

So all in all, if I am correct, the fact that you still got 720p60 out of it is VERY impressive.


doctorseus commented Mar 8, 2020

Yes, right, this is all correct (I also just re-read some of the documentation). And I believe that was also the goal. I also just realized that they changed the GPU implementation with the latest Raspberry Pi, which is definitely a pain for all this. But I have read articles mentioning that at least Mesa has received patches to target the new VideoCore VI (still no open documentation, though).

Well, I have no reason to believe it wasn't the case. It is true that I just moved the output buffer further down the line to the H.264 encoder for visualization, but as far as I can remember there was no backlog, so it appeared to be fast enough (even with format conversion).

Edit: Also, the parameter "1" in execute_qpu specifies how many QPUs to use. ;-)
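For reference, the mailbox call from hello_fft's mailbox.h is execute_qpu(file_desc, num_qpus, control, noflush, timeout), where control is the bus address of a table with one two-word entry per QPU: the uniforms pointer first, then the shader code pointer, both as bus addresses. A sketch of building that table (the helper name is mine; the layout is the one hello_fft uses, as far as I can tell):

```c
#include <stdint.h>
#include <stddef.h>

/* Build the EXECUTE_QPU message body: for each QPU, two 32-bit words
 * holding the bus address of its uniforms and of the shader code.
 * Typically every QPU runs the same program with per-QPU uniforms. */
static size_t build_qpu_msg(uint32_t *msg, unsigned num_qpus,
                            const uint32_t *uniforms_bus, /* one per QPU */
                            uint32_t code_bus)            /* shared code */
{
    for (unsigned q = 0; q < num_qpus; q++) {
        msg[2 * q + 0] = uniforms_bus[q];
        msg[2 * q + 1] = code_bus;
    }
    return 2 * num_qpus; /* number of words written */
}
```

The bus address of this table is then passed as the control argument, which is how using more than one QPU per dispatch works.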


Seneral commented Mar 8, 2020

Wait, so you send it directly to the H.264 encoder? A fastpath to the encoder, which was impossible with GL? Right, makes sense; not a bad idea.
I'm not sure what format conversion you did: did you gather all the YUV components separately in the QPU code and convert to RGB? I can't immediately see that in the QPU code. I plan to work on YUV data directly anyway (Y only, to be exact).
I feel much more confident starting to work on this now, seeing all this already built. I will definitely update you when I get it running with some more complex shaders!
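In case it helps anyone reading along: if RGB were actually needed, the per-pixel conversion would be the standard fixed-point BT.601 one (these are the usual 8-bit-scaled coefficients, not something taken from the repo):

```c
#include <stdint.h>

static inline uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Integer BT.601 YUV -> RGB, coefficients scaled by 256 so the
 * division becomes a right shift. Y in [16,235], U/V in [16,240]. */
static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                       uint8_t *r, uint8_t *g, uint8_t *b)
{
    int c = (int)y - 16, d = (int)u - 128, e = (int)v - 128;
    *r = clamp_u8((298 * c + 409 * e + 128) >> 8);
    *g = clamp_u8((298 * c - 100 * d - 208 * e + 128) >> 8);
    *b = clamp_u8((298 * c + 516 * d + 128) >> 8);
}
```

Working on the Y plane only sidesteps all of this, which is part of why it is attractive for CV workloads.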


Seneral commented May 25, 2020

In case you're interested, I've released my code, basically continuing where you left off.
I'm currently facing some problems with tiled rendering (all QPUs working on their part of the frame), but the rest works fine. I got a simple 0.5-threshold mask to work on the QPU, just for testing.
https://github.com/Seneral/VC4CV
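As a scalar reference for what that test shader computes (assuming 8-bit luma, so 0.5 of full range is roughly 127):

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar equivalent of the QPU test shader: mark every pixel whose
 * luma exceeds half of the 8-bit range. */
static void threshold_mask(const uint8_t *luma, uint8_t *mask, size_t n)
{
    for (size_t i = 0; i < n; i++)
        mask[i] = (luma[i] > 127) ? 255 : 0;
}
```

On the QPU the same comparison is done 16 lanes at a time, which is what makes even this trivial kernel a useful smoke test.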


doctorseus commented May 25, 2020

Hey, I just had a look at the code; it is pretty cool that you got all that working and decided to stick with it! I don't have the hardware right now to try it, but do you have more info on which platforms this can be used on at the moment? Are there breaking changes with the new VideoCore / RPi 4? I assume so, as your documentation states that it targets "VideoCore IV".

Edit: Also big thanks for sharing your progress!


pelwell commented May 25, 2020

BCM2835 is VideoCore IV, so that doesn't tell you anything about the code, but I'd be very surprised if it was Pi 4-specific.


doctorseus commented May 25, 2020

> BCM2835 is VideoCore IV, so that doesn't tell you anything about the code, but I'd be very surprised if it was Pi 4-specific.

I'm not sure what you are referencing here. The RPi 4 uses a BCM2711 with a VideoCore VI, and apparently there are additional Mesa patches, which lets me assume there are changes to the underlying GPU architecture; that makes it likely (but not certain) that the code discussed here will not work properly.
Edit: I tested my code only on VC IV, so I was wondering if @Seneral has tested on VC VI too.


pelwell commented May 25, 2020

"VideoCore IV" != "VideoCore VI". If the documentation says VCIV then it's not aimed at 2711.


doctorseus commented May 25, 2020

Ah, I get where you are coming from (after re-reading my comment). As it is a pretty big commitment to write code for a platform which is (in some way) EOL, I assumed that he had tried it on both, VC IV and VC VI, in the past two months, but concluded that it only works on VC IV and so decided to mention only that in the documentation. Either that, or he didn't try it. Hence my question.


Seneral commented May 25, 2020

Oh, I explicitly don't support VideoCore VI because I only plan to target VideoCore IV. What you call EOL is for me the only reason I even do this: the low price, power requirements and size of the RPi Zero. If you're OK with a bigger 35€-50€ SBC, there are probably easier solutions out there, even outside the Raspberry Pi family, like the Tinker Board. I tried to point that out explicitly: the only reason to put in so much effort is if you really need that extra bit of performance out of the Zero specifically.
So yeah, it supports VideoCore IV, which means RPi 1, 2, 3 and Zero, although I have only tested on the Zero.

VideoCore VI could be supported in the future, though, once there's an assembler for it. It works slightly differently and has some more options (the TMU can write now; I'm not sure how useful that is yet, but it may greatly increase write speed if all QPUs write simultaneously, who knows). There's already py-videocore6, made by the geniuses at Idein (they have done a ton of cool stuff with VideoCore IV already), but as far as I know no plain assembler yet, which I would greatly prefer.

But IMO the value you get from using the QPU on the RPi 4 is most probably lower than the value you get on the RPi Zero. RPi 2-3 offer even less value, since they barely have an advantage over the Zero in that regard (QPU only), and you could just use the more powerful RPi 4 for better results, provided VideoCore VI is well supported by then.

JamesH65 commented:

> Ah I get where you are coming from (after re-reading my comment). As it's a pretty big commitment writing code for a platform which is EOL (in some way) I assumed that he tried it on both in the past two months, the VC IV and VC VI but concluded that it only will work on VC IV so decided to only mention that in the documentation. Either that, OR he didn't try it. Hence my question.

Apart from the very first Pi model with the small GPIO header, none of the Raspberry Pi range has been EOL'ed. You can still buy all of them new.


Seneral commented May 25, 2020

Yeah, I think he meant virtually replaced by newer, more powerful models. Although unless they intend to bring out a VideoCore VI based board in the Zero form factor AND price range, the Zero will always stay relevant to some degree.

doctorseus commented:

@Seneral well, I understand what you mean; the most benefit from all this is for sure on the older platforms. I specifically targeted the Zero too when I started the project in 2016, and it still has a very attractive price point if you have a use case.
But personally I would like to have support for older and also upcoming platforms. ;-)

It's a shame that we don't have the documentation for the VC VI.

"EOL (in some way)" -> going forward we will not see a new board with a VC IV. But it is certainly nice that we still have good availability of the older models.

Anyway, nice work, I will keep an eye on the repo and will hopefully have some time to give it a try.


Seneral commented May 25, 2020

Thanks! I mean, I really would want a VideoCore VI based board in the Zero form factor.
But only if we also get better cameras (either specialized cameras that make better use of the CSI lanes, or a board with more CSI lanes exposed).
If I get tiled rendering to work, I expect the work I need to do on the QPU to finish much faster than the camera can supply frames. So at least for my use case, blob detection for a VR tracking system, that is as far as I need to go.
For now, though, I'm still fighting some sort of synchronization issue that I can't seem to figure out.


Seneral commented May 25, 2020

Btw, here's a good write-up of some early findings on the VideoCore VI (I haven't researched much further, to be honest). It seems to have a ton of new hurdles when it comes to GPGPU:
https://blog.idein.jp/post/190588113970
From that video you can also see that they achieve roughly double the performance on the RPi 4 with the full MobileNetV3 model (18 fps) compared to their implementation of MobileNetV2 on the RPi Zero (8 fps). I'm not sure that is worth the extra effort, even more so considering (don't quote me on this) that the model used on the RPi 4 is actually lighter by about a third.
