[imx] Deinterlacing rework #5805
Conversation
|
Thanks much for doing that PR. Let's see if @koying and @FernetMenta see the potential for helix. We are already late, but reviewed code will bring IMX further to feature completeness. I will take a look tomorrow night. |
|
Nice that there is progress on this platform but I don't think beta4 in transition to RC is the right time for a 1.1k loc change. |
|
Agreed that it's bad timing. Will defer to I*** and we will likely also need a builder that is compiling this code path. Of course, in the meantime this code can be reviewed so that it can go in after the release
MartijnKaijser
added Helix I***** Improvement and removed Helix
labels
Nov 25, 2014
|
Thanks guys, take your time. My concern is only about the implementation and how to improve that thing to make the IMX6 work with Kodi and get the best performance out of it. I have been running that code with Gotham for a while on my box already. But during rebasing I noticed that things changed in Helix, so I can't guarantee anything, although I tested it last night. |
|
You can easily add fixups to this PR later on. |
|
Can you please split up the PR into:
It's really hard to review those mixed changes. There is a lot of debugging code still in there; the most severe one is in Decode, which always prints a line (see also the implemented logging levels). This: xbmc-imx6/xbmc@a4346e2#diff-41e0ceb3de6f45669f617deae0dbeb0cR2988 needs discussion as it bypasses an infrastructure we implemented for Helix. |
|
@fritsch Splitting up everything into chunks as you request might take some time because I just ported the code, which was based against a completely different HEAD with the "old" deinterlacing, to the Helix master. I had to resolve lots of conflicts. Let's try ... The debugging code is still necessary but by default not compiled in unless you activate IMX_PROFILE* or TRACE_FRAMES. Where is a line always printed in Decode? Can you point me to it? Regarding the last point: that is interesting. I would like to learn about that and probably the new infrastructure makes some code in the mixer obsolete. You know, while we implemented all that we had to read the Kodi code due to the lack of documentation. Honestly, I do not expect technical documentation so no offense here, I just want to point out that it took time to make all that work and to understand the code logic (or at least a bit of it). I am pretty sure that it can be improved in terms of architecture with your help and your insight into the current code base. So please don't take the code as if it had been written by an experienced team member. I am very open to changes! Thanks |
|
@smallint: Thanks for your reply, the log output is here: xbmc-imx6/xbmc@a4346e2#diff-68cec69f8ac35fc16407c6baa250b93bR658 To use the new component logging, it's enough to put a guard like the one sketched below before the debug log.
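A guard along these lines (assuming Helix's g_advancedSettings.CanLogComponent() and the LOGVIDEO component; an illustrative sketch, not the exact snippet from the codebase):

```cpp
// Only emit the debug line when component logging for video is enabled in
// the advanced settings; illustrative sketch of the Helix pattern.
#include "settings/AdvancedSettings.h"
#include "utils/log.h"

if (g_advancedSettings.CanLogComponent(LOGVIDEO))
  CLog::Log(LOGDEBUG, "CDVDVideoCodecIMX::Decode - got frame");
```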
The Trace_frames infrastructure is probably okay as it is for debugging only. |
That code is in a comment block and not used. Github needs better code highlighting! ;) |
|
Hehe - yeah. Another point: please remove commented-out code that you don't need anymore. |
FernetMenta
commented on an outdated diff
Nov 25, 2014
@@ -2966,6 +2985,8 @@ unsigned int CLinuxRendererGLES::GetOptimalBufferSize()
       m_format == RENDER_FMT_EGLIMG ||
       m_format == RENDER_FMT_MEDIACODEC)
     return 2;
+  else if(m_format == RENDER_FMT_IMXMAP)
+    return 1;
FernetMenta
Member
|
|
I want as few buffers as possible queued in the renderer because the mixer thread of the decoder already holds up to 3 buffers for either deinterlacing or pass-through. Having additional buffers queued in the renderer means that more buffers need to be allocated in DMA, and that size is quite limited. Again, deinterlacing is processed in the decoder itself and the renderer "just" needs to render the texture without further processing. The IPU/VPU buffers are directly mapped to the GL texture. That was my intention. |
|
Again :) one is wrong anyway because you need at least as many as the GL/GLES context. While one buffer is written to, the other one is flipped to the front buffer for display. But with only 2 buffers you would synchronize the renderer with the decoder/mixer. A small delay of the video player thread will cause a missed slot for vblank and you'll end up rendering the last frame again, and as a result a future frame needs to be dropped. |
|
OK, got it. But actually this delay should be handled by the buffers in the mixer. If the video thread is slow or lagging then there are already buffers in the output queue of the mixer. But I see your point and will change that to 2, as all the other backends do, and test it. I am currently building all the stuff again and will submit the requested split-up commits. Thanks. |
|
You don't get the point. Changing it to 2 does not change anything because of the fail-safe in RenderManager. It already uses 2. |
|
@smallint: Can you perhaps help us a little bit with understanding the imx xbmc architecture? When I look into the mainline code, I see a big Lock() taken from the render thread. Doesn't this completely stall the decoder? Can you briefly describe how the parallelism (e.g. the separate mixer thread) that is now integrated works around such hard "locking" foo? Have a look also at VAAPI / VDPAU, how the surface buffers are reused with minimal locking by maintaining two vectors. Perhaps we can start discussing from an architecture point of view. |
|
@FernetMenta Currently the IMX6 with GLES rendering the VPU/IPU buffers does not manage to render overlays smoothly anyway. 3 buffers won't help here. I am currently only considering fullscreen performance without controls rendered, which is already enough on that box with HD and deinterlacing enabled. The decoder itself still returns 3 in GetAllowedReferences(), so this should be the value we are actually talking about. The 1 I return was just for flexibility, to let the decoder decide how many buffers are queued in the renderer. We are still on the safe side, aren't we? |
|
@fritsch The Lock() is gone with this PR. I have completely removed that. Before, the Lock was used when the decoder was reset or disposed to synchronize rendering and buffer disposal. But that wasn't an issue at all because the current path is fast enough without deinterlacing. All I am testing with is deinterlacing since that utilizes the memory bus much more than just decoding and rendering. I agree that an explanation of my idea of the current code would help. But I cannot just put it into two lines, I need to think about that and write some longer text later. Would you recommend doing that here in the comments? |
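For orientation, a minimal sketch of the producer/consumer pattern the mixer thread follows (an illustration with hypothetical Frame/Mixer types, not the actual DVDVideoCodecIMX code):

```cpp
// The decoder pushes VPU output buffers into an input queue, a dedicated
// mixer thread runs the IPU deinterlacer on them, and the renderer pops
// finished frames from the output queue.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Frame {};                              // placeholder for a VPU/IPU buffer

class Mixer
{
public:
  void Start() { m_thread = std::thread(&Mixer::Process, this); }

  void Push(Frame f)                          // called from the decoder
  {
    std::lock_guard<std::mutex> lock(m_mutex);
    m_input.push(f);
    m_cond.notify_one();
  }

private:
  void Process()
  {
    for (;;)
    {
      std::unique_lock<std::mutex> lock(m_mutex);
      m_cond.wait(lock, [this] { return !m_input.empty(); });
      Frame in = m_input.front();
      m_input.pop();
      lock.unlock();

      Frame out = Deinterlace(in);            // queue an IPU task, wait for completion

      lock.lock();
      m_output.push(out);                     // the renderer pops from here
    }
  }

  Frame Deinterlace(Frame in) { return in; }  // stand-in for the real IPU call

  std::queue<Frame> m_input, m_output;
  std::mutex m_mutex;
  std::condition_variable m_cond;
  std::thread m_thread;
};
```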
|
@smallint: Yeah, don't put too much stress on yourself, please. Let's concentrate on the split-up; then, in the split-out code, we can perhaps easily see the ideas and suggest improvements. Thanks much as always! |
|
I split it up into 12 commits and pushed it. Be aware that not every commit will compile since I did that as a dry run on another machine. The final commit compiles and is currently running on my box. I hope you can now check the changes more easily. |
|
Thanks for splitting! I actually tested the code right now to see how it is working. 576i is nicely deinterlaced and displayed as 25 fps. Will look at the code tomorrow evening as promised. |
|
The OSD thing is puzzling me, too. With Gotham I could see drops and skips nicely, but in Helix it always seems to be 0 although the framerate is below 25 fps. Let's see what you will find out ... |
|
@FernetMenta You are right. After checking the code again, returning 1 is not enough. The GetAllowedReferences return does not override the GetOptimalBufferSize value. I am going to correct this, sorry for the noise. |
chbmuc
commented
Nov 26, 2014
|
I built OpenELEC master with these patches and can confirm that it's working very well with SD content (576i), but 1080i is not running smoothly. Using the suggested devmem settings improves the situation noticeably, but there are still minor glitches.
I wonder if these settings should be included. Looking at the code I would vote for inclusion in Helix, because it's almost certain that it doesn't affect other platforms. I made some comments in the patches, but I couldn't find anything related to performance. |
It won't get into Helix, which is not an issue at all as the code quality is more important than the features. The background of this PR was to get the architecture discussed and optimized. We do it right once and for all. It's not a matter of getting something partly working, but of establishing a design and architecture that is usable, expandable and easy to debug in the future. If we see obvious fixes that need to go into Helix, they can be split out and cherry-picked. @chbmuc what do those devmem settings do? |
chbmuc
commented
Nov 26, 2014
|
The devmem settings are mentioned here: xbmc-imx6#70 (comment) "gives the VPU memory read and write highest priority." |
|
@chbmuc thanks for that information. @FernetMenta here is a plot of deinterlace / decoder times as you just asked for: https://cloud.githubusercontent.com/assets/6400406/4179063/d29c8412-36b5-11e4-942a-ae20b5dd3ad3.png I am quite puzzled: as the decoder speed is highly fluctuating, it seems like something parallel is getting serialized here, which looks like a stall. |
|
@fritsch Since we are using the boxes at the limit of their memory bandwidth (my conclusion from the tests so far) we need to adjust many settings carefully for different use cases. A media center like Kodi probably needs different kernel optimizations than a web server or a NAS. To reproduce the plots I made, you can activate IMX_PROFILE_BUFFERS and grep the output. I used gnuplot, as you noticed already. I can post the script I use for that later. |
|
As promised the script I use to prepare the plots:
Then I run gnuplot with a selection of those files. |
|
@smallint Thank you very much. I started to read the code in detail yesterday. From an "algorithm" point of view I really wonder where that deinterlacer spends its time; it should not need that much time at all, because it is hard-coded for one future and one past reference, so it should be optimized in hardware for exactly this, shouldn't it? Edit: Other post-processing APIs have the possibility to add more or fewer refs to adjust performance. Intel VPP for example only uses one of those, just enough for a bit of motion compensation. Here @FernetMenta is the specialist on how to manage those reference frames without having to allocate additional memory, by constantly reusing allocated buffers. Those flags / options are all we currently have, right - nothing more? So the only parameter affecting quality is ipu_motion_sel? I cannot find that documented :-( Besides that, 37.4.11.1 VDIC Features tells us that the internal deinterlacer can only cope with 980 px horizontally. I am not 100% sure they are really talking about horizontal widths; vertically would make more sense.
I am currently travelling again, it's a really rough time for me before this Christmas :-( What we could try though: what happens when we start to render? How much does this decrease the decoder throughput? Perhaps we can speed things up by somehow using 37.4.11.2.1. Out of interest: did you experiment with the frame doubling flags? Does this cost additional load (besides that it would perhaps kill the renderer)? |
|
As you could see from the plots above, the VPU spends more time on decoding than the IPU spends on deinterlacing. Currently the default HD profile uses HIGH_MOTION, which uses only two fields or one (codec) buffer. LOW_MOTION uses three fields or two (codec) buffers and is therefore much slower. Enabling that for HD at the moment is a no-go. The buffers are allocated during initialization only, and one VPU buffer points to the previous VPU buffer. Buffers are always reused and the entire codec implementation avoids dynamic memory allocations where possible. I also tried to preallocate GL textures for each allocated buffer but that did not help either.
Those are the only options we have with the current kernel API (a small sketch of how the motion mode is handed to the IPU follows below). You can check the documentation "i.MX 6Dual/6Quad Applications Processor Reference Manual" available from Freescale, but I agree that documentation about APIs, best practices, limitations and so on is hard to find. ipu_motion_sel is documented, as are all the filters used by the VDIC.
I think they are right. As I understood from @wolfgar and some documents, buffers exceeding that resolution limit (IIRC 1024x1024) need to be split up. And that also takes time. I just want to add that my Wandboard running an old kernel with an overclocked VPU handles HD deinterlacing smoothly. I don't know what magic @wolfgar applied to that kernel ;). I tried many different settings and kernel patches but nothing improved the situation significantly on the Cubox. You can read about that in the imx6 issues, we already discussed it a lot there. Thank you @fritsch. |
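A sketch of how the motion mode reaches the hardware, assuming the Freescale /dev/mxc_ipu userspace API (struct ipu_task, the IPU_QUEUE_TASK ioctl and the LOW_MOTION/HIGH_MOTION values from linux/ipu.h); field names are from memory, so treat this as illustrative rather than the exact Kodi code:

```cpp
// Queue one deinterlacing task on the IPU with the chosen motion mode.
// Physical buffer addresses come from the VPU frame buffers; fd_ipu is an
// open file descriptor on /dev/mxc_ipu.
#include <cstring>
#include <sys/ioctl.h>
#include <linux/ipu.h>

int DeinterlaceFrame(int fd_ipu, unsigned long inPhys, unsigned long inPhysNext,
                     unsigned long outPhys, int width, int height, bool highMotion)
{
  struct ipu_task task;
  memset(&task, 0, sizeof(task));

  task.input.width   = width;
  task.input.height  = height;
  task.input.format  = IPU_PIX_FMT_NV12;
  task.input.paddr   = inPhys;
  task.input.paddr_n = inPhysNext;              // second buffer, used by LOW_MOTION
  task.input.deinterlace.enable = 1;
  task.input.deinterlace.motion = highMotion ? HIGH_MOTION : LOW_MOTION;

  task.output.width  = width;
  task.output.height = height;
  task.output.format = IPU_PIX_FMT_NV12;
  task.output.paddr  = outPhys;

  return ioctl(fd_ipu, IPU_QUEUE_TASK, &task);  // hand the task to the IPU driver
}
```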
|
@wolfgar Any chance the Cubox could get a bit of magic for those boards? (Now that the Cubox TV is out and, yeah, no TV without deinterlacing.) This would give us more room for a simple but optimized architecture (like the mixer one proposed) and we would not have to sacrifice this simplicity for working around driver inabilities :-) |
|
Btw, perhaps to add: I am clearly aware that kernel debugging concerning performance is most likely the hardest job you can think of. In the beginning of the VDPAU-via-radeon OSS implementation, performance was at roughly 66% of the fglrx binary blob implementation. Christian dug into it for hours and weeks, and finally it was a simple response to a transmitted message that was being ignored (a two-liner) :-) |
|
Does that cubox TV support smooth full HD deinterlacing with any software product available? If not, why is it called cubox TV then ;) I don't think that @wolfgar is the right person (although I am confident that he would have the capabilities and knowledge) for that. I would expect more input from vendors claiming "best media center experience" for their boxes. I have never asked that question before but does Kodi/Android on the Cubox show the same performance limitations with deinterlacing? |
Exactly what I thought. It's the same chip/configuration as the 4 Pro, but without Bluetooth / wireless. There is no deinterlacing on Android at all. All those nice boxes that advertise a TV experience cannot do deinterlacing. |
Don't tell me ... I hoped there was at least a "reference" implementation showing that it is possible. So gstreamer-imx6 with vpu, ipu and eglsink could be an option to test performance against Kodi. I have never managed to get this combination running and never tried hard enough ... |
|
Well, let's say we don't have control over interlacing on android. |
chbmuc
commented
Nov 28, 2014
|
I don't know if it helps, but there is a thread in the Freescale forum about 1080i deinterlace support for the i.MX53. It also has some library code to support VDI split mode and an Android test, but I must admit that I don't understand it: |
|
Hi there, sorry for my long silence, I wish I could have joined this thread earlier. There are very interesting remarks regarding locking in this thread: they could improve behavior when we browse the GUI with video, but they might not be the root cause of our deinterlacing issues. Regarding raw performance, the VDIC engine should not be the bottleneck. But the advertised performance figures are a little confusing:
The first 2 figures might be compatible (one is only about the output rate, the other about the core deinterlacing max processing rate). What is also sure is that tweaking the VPU priority on the AXI bus improves behavior: it seems to prove that memory bandwidth is at play. Yet this bottleneck should only be limited to some time slices, as the global bandwidth of the i.MX6Q DDR is quite good: a 64-bit bus @ 1066 MHz should be able to cope well with the global data stream Mem->VPU->Mem->IPU(VDIC)->Mem->GPU->IPU(DP). Unfortunately I don't know exactly why my Wandboard kernel behaves better; otherwise, be sure I would already have shared that info, of course ;) smallint's profiling and plotting is really interesting and already gives pretty interesting data. But I guess we could get an even deeper understanding by using lttng to have a global view of thread scheduling and of the locking involved. I wanted to go that way for a long time but did not have the required time to do it. I hope it will change soon... |
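For a rough sense of scale, a quick back-of-the-envelope from those figures (my arithmetic, assuming DDR3 at 1066 MT/s on a 64-bit bus; the measured values are @rabeeh's mmdc_prof numbers later in this thread):

```
theoretical DDR peak:  8 bytes/transfer x 1066 MT/s ≈ 8.5 GB/s
measured while playing the 1080i50 sample:   ~1.7 GB/s
measured in the main menu (dirty regions off): ~3.4 GB/s
```

So the raw bus is not saturated on average; the losses come from how inefficiently small bursts use it.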
|
@rabeeh has posted a kernel patch here: xbmc-imx6#70 (comment) which increases the burst size. |
|
Rebased. I hope the merge of wolfgar's fix is correct. |
|
@fritsch @smallint @rabeeh I pushed this fix more than one year ago in my kernels (and forgot about it ;) ) |
|
@wolfgar That looks great! Let us know how it works for you, I am going to test it as well. |
|
@fritsch That's a pity. Btw: do you know how to make CLog log to stderr via an API call? I am currently building a test app. |
rabeeh
commented
Nov 30, 2014
|
@smallint see ILog.h; I cannot find an additional function call to accomplish that. If in doubt, just use printf :-). If you need it the other way round, e.g. having a lib log into xbmc.log, you could reopen stdout as xbmc.log via freopen? |
|
@fritsch I am reusing the codec components and they are using CLog, so I would like to redirect that output to stderr instead of a logfile. But I can also live with a logfile. |
|
@rabeeh Yes, the VPU is running at 352 MHz on my Wandboard. But that setting is gone in recent kernels. We discussed that already. I know that this is a difference to my Wandboard. I set up a vanilla Arch on that board and the fps are around 25, but always scratching the limit, and playback is not perfectly smooth. So, yes, increasing the VPU frequency to 352 MHz would certainly help but I don't know how to do that with kernel >= 3.10. |
|
As you asked for some numbers: I dumped the binary data that was passed to Decode into a file and played that back in a test app, currently without GPU rendering, just VPU/IPU. It is a 1080i/50 stream. The target is 25 fps or 40 ms/frame. No deinterlacing:
With deinterlacing:
That is all fine (just looking at the average). I am going to add a quick and dirty renderer and check what happens to those numbers. Numbers were gathered on a Wandboard running the latest Arch. |
|
@smallint: Can you check the 1080p50 sample with that method, please? |
|
@fritsch I am still testing with my stream and will upload the app later so that you (or someone else) can test it yourself. I will also test your stream later. |
cmichal2
commented
Nov 30, 2014
|
I've been using smallint's patch from xbmc-imx6#70 in GeeXboX/Gotham with the VPU352M patches for ages, and for 1080i OTA content in North America this combination gives smooth playback, as long as there are no menus overlaid. I ported the Freescale kernel patches that enable the VPU clock speed bump from 3.10 to 3.14. They are here: |
|
@cmichal2 Give the Helix version a try. There is improved render infrastructure, so menus should be more solid now, too. Also, your performance comparison would be nice to know (with the very same kernel config). |
|
Here is the output for VPU/IPU/GPU:
And it played smoothly. But that is of course without any sound or any other stuff in the background. So we are already at the limit with that combination. FB_MULTI_BUFFER is set to 2 and my screen runs at 50 Hz. |
We need to decode 50 progressive frames per second (!) if you just tested the 1080p50 sample. No deinterlacing should be involved. Edit: Now I got it. You tested your 1080i50 file with a "dirty" renderer. That's close to skip / drop :-). I assumed this before. I still think the bottleneck is the renderer. Therefore testing the 1080p50 file would be nice, to see the decoder throughput (frame-wise) and then the throttling of the renderer. |
|
@rabeeh: What are you sure about? What should I remember? |
|
In your case, yes. In my case no, 50 fields or 25 frames. I tested with my sample. Btw, the CPU load was almost 0 in my case whereas Helix runs about 160% when playing a stream. Is there something we can just switch off? Here goes your sample:
It plays also quite smoothly. |
|
I added a new branch perftest to xbmc-imx6. The test app is https://github.com/xbmc-imx6/xbmc/blob/perftest/perf-test.cpp. In this branch the define DUMP_STREAM is active in the IMX decoder which outputs a file stream.dump whenever you start playing a stream. Have fun compiling the test app ;) |
|
Thanks for the measurements. Not quite sure why you force deinterlacing, it makes no sense for progressive content. I see that decoding alone is fast enough; 16 ms means we can theoretically decode 1080p60. Normally the renderer in Kodi would skip and the decoder should drop at some point under such conditions. Concerning the 160%, we should find out what is causing that. |
|
@cmichal2: thanks a lot for porting the VPU@352M patch, I am currently pushing it into my 3.14 kernel |
|
@fritsch I try all options for each dump regardless of its "sense" ;) Also rendering is fast enough for your content without deinterlacing. If the decoder runs with Kodi it does not activate deinterlacing for progressive content. This is just my "dumb" test app. |
|
Keep in mind that my tests were made under perfect conditions: no demuxing, no audio stuff, nothing but bare-metal decoding and rendering. Profiling all that within Kodi to find possible app bottlenecks could cause some headaches ;) |
|
@wolfgar Is your 3.14 kernel for the Cubox? I would appreciate getting kernel sources from one place for this box with CEC working and VPU@352M enabled. Otherwise we need to sync with @pepedog to incorporate all the required fixes. Currently it lacks proper CEC support. |
|
@smallint: OK, I know the issue with the CEC stuff. I helped with sorting this out for OpenELEC last week; I will ping pepedog so that he can fix this. In the meantime I will set up a cuboxi kernel and share it with you in a few minutes... |
pepedog
commented
Dec 1, 2014
|
Someone said CEC is working in Arch 3.14 but that they had a permission problem |
|
Thanks so much for your test app, I have compiled it and run tests on the samples provided by fritsch
25 ms means 40 fps and not 50 fps, so we definitely might have an issue here... Regarding deinterlacing, we don't have exactly the same figures;
with the kernel patched both for 64-bit burst size and with the VPU@352MHz, the results are
Edit: an update. With the latest kernel (64-bit burst and VPU@352MHz), if I increase the VPU priority (devmem 0x00c49100 w 7 && devmem 0x00c49104 w 7), here are the results (still with 1080i50_h264.ts):
Visually the skipped frames are almost not noticeable... |
|
I have uploaded the samples @wolfgar is talking about here: http://solidrun.maltegrosse.de/~fritsch/ Those are all the relevant TV formats German TV is producing: 720p50, 576i50, 1080i50 - I also added the famous burosch1.mpg to check the deinterlacing quality. The 1080p50 file is for benchmarking the decoder in combination with the renderer. |
rabeeh
commented
Dec 1, 2014
|
@wolfgar - about this - Clearly my intentions are 100% technical to solve this issue and not accuse anyone of hiding anything. |
|
@fritsch Thank you, I just wanted to ask for the sample @wolfgar used. My stream is a dump from ServusTV and renders much faster than your sample on wolfgar's Cubox. Tonight I am going to test your 1080i50 streams on the Wandboard on a plain Arch kernel. Too bad that CEC isn't wired correctly on this board ...
The hw decoder never drops. It keeps decoding all the time to prevent artifacts. But dropped images bypass the deinterlacer and are not rendered (handled by Kodi).
Hehe, my fault while copying numbers ... you are right. |
|
Yes, we also throw away already-decoded images :-) which is really expensive, but it's the last chance we have in such cases when we don't get by with render skipping. That also only makes sense when the decoder is a separate thread and has some already-decoded frames in its queue. @smallint the mbaff sample is also ServusTV, the other one is Pro7 HD. |
|
OK, 1080i50_h264.ts needs 32 ms for decoding only. All the other numbers are very similar to @wolfgar's. It is not a pity not being able to watch HD+ content ;) but from a technical point of view it is a challenge. |
|
@smallint: Check the 720p50 sample, which is "Öffentlich-Rechtliches". Currently we are only able to get 44 fps out in xbmc, which is a real pity. There are other 1080i50 contents, like "nice" shopping TV channels, that have the same bitrate combination. So we should obviously be able to play them all, to be TV-safe. |
rabeeh
commented
Dec 1, 2014
|
Here is more information on 1080i50_sample.ts from @fritsch. The tool help is - I've got the following numbers on 1080i50_sample.ts with de-interlacing - Below are the results with the breakdown per internal unit (IPU/VPU/GPU/ARM). The max read/write performance I've got from the memory controller was ~1.7 GB/sec, while on the same platform, disabling dirty regions and going to the main menu with the RSS line, I'm able to get ~3.4 GB/sec. No conclusions yet, but this tool makes it possible to measure things. Following are profiling numbers from running ./mmdc_prof - Total cycles count: 528068607 VPU (./mmdc_prof -m 0x003F0013) ARM (./mmdc_prof -m 0x00060001) GPU (./mmdc_prof -m 0x003E0003) IPU (./mmdc_prof -m 0x00060004) |
rabeeh
commented
Dec 1, 2014
|
Looking more at the data, one important field to check is also the average read/write burst size per unit. This can lead to very low utilization of the DDR bus, since transactions go out with a burst of 16 bytes instead of 64 bytes. |
|
@fritsch Your ServusTV sample (mbaff):
|
|
@smallint That's nice and would be enough. Btw, the frame doubling that the API provides would kill the renderer unbelievably hard, right? |
|
@fritsch ... and 720p50_h264:
15ms is awesome and more than enough for 50fps. My Wandboard renders this stream in Kodi with 50 fps. |
That's awesome! But yeah, compare that to internal xbmc rendering, that's then something to cry about. It can barely render 44 fps. |
No, my Kodi renders this stream at 50 fps. But regardless of my or your number, do you think someone would open that box and attack this issue inside Kodi? So many systems with different configurations are out there and Kodi already does a fantastic job of running on all of them. But maybe this little platform is an opportunity to optimize further, to the benefit of all the others. |
|
I found the numbers @rabeeh gave before quite interesting, because they open up another avenue for optimizing the architecture and the kernel drivers. |
|
Hi @rabeeh, thanks a lot for this very interesting mmdc_prof tool for profiling bus usage. Regarding IPU write throughput, it is definitely true that having a look at increasing the write burst size would be very interesting. Yet, as I wrote in this thread, the deinterlacer max output rate is 100 Mpixels/sec (stated in §37.4.11.7 of the i.MX6Q RM). |
|
Sorry for the probably stupid question: is there a guide somewhere on how to cross-compile Kodi with e.g. Yocto? My old recipe does not work anymore and adjusting it to the one from Arch leads to complaints about "missing FFMPEG". I think things regarding cross compilation changed: any advice on how to proceed? I would like to test this on my "magic" kernel ;) |
rabeeh
commented
Dec 1, 2014
|
Below are observations from my hacking today. I'm not very happy with this patch since it removes the ability to lower the frequency of the processors (lately I was able to run a HummingBoard with the quad processor completely without any heatsink). The above patch fixes 720p50_h264.mp4 and 1080i50_h264_mbaff.mp4. 1080p50_recode.mkv is still not fixed - clearly a memory bandwidth issue, due to two reasons -
Notes - "the deinterlacer max output rate is 100Mpixels/sec .... write throughput of 100*6/4=150MB/s" - that should be more than enough. Actually I've measured the write throughput on the 1080i50 and it's around 100 MB/sec. The abuse comes from the fact that accessing the DDR with small bursts is extremely inefficient and wastes bus cycles. And it looks like this 16 bytes is a limitation in the implementation itself. On the table now -
|
|
Thanks a lot for your answers @rabeeh.
Even if the VPU @ 352 MHz and increasing the VPU priority have a positive effect and are enough for some streams, it is not yet enough for all streams ... I really hope @smallint can bisect the diff with his Wandboard configuration because it seems to be key... Kind regards |
|
Interesting discussion and information, thanks.
I cross-checked and it is the case, but it depends on the nAddressAlignment returned by the VPU. I haven't checked whether the return values are ever != 16. Anyhow, checking again won't hurt. Regarding GPU performance: either in my test app or in Kodi itself you can shrink the output quad to e.g. 50% and you will see that streams which did not render smoothly before now do. Reducing the GPU load helps, but I also tested drawing a full-screen quad while mapping only half of the texture (vertically) to it. The speed also improved significantly. I presumed that this is due to lower utilization of color space conversions and fewer IPU mem reads. My conclusion at that time was that the system itself needs to be balanced, just one part is not enough. That very high GPU mem bandwidth usage you were talking about: is it reading, writing or combined? The 1080i50 test you made resulted in a total GPU bandwidth of 764.52 MB/s, right? |
As I understood the GPU bandwidth is ~ 760 MB/s for 1080i50 and for 1080p50 it is much higher, am I right? In the latter case we are sending 50 decoded buffers down to the GPU instead of 25 which doesn't sound insane in terms of memory bandwidth and a factor of 2 seems to be OK. But maybe I got the above numbers wrong. Can you post a comparison of 1080i50 and 1080p50 please? |
The kernel my Wandboard Yocto build uses is https://github.com/smallint/meta-jabe/blob/master/recipes-jabe/linux/linux-wandboard_3.0.35.bbappend. I am going to check again for any local modifications later. I definitely enabled 352M which is not reflected in the committed config. |
|
Hi smallint |
|
OK, I got it. What I can imagine:
Total: 7.5 R / 6 W = 388.8 MB/s R / 311.04 MB/s W. Not sure if it works that way internally. EDIT: If the depth buffer is active it would add to those numbers. Bilinear filtering also needs more read accesses. I also assume that all internal color values are RGB and not RGBA. EDIT2: Can you check the numbers with the test app? There we have more control and knowledge of what is done at GPU level. |
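Those totals are consistent with reading the 7.5 / 6 figures as bytes per pixel at 1080p25 (my arithmetic, not confirmed by the author):

```
1920 x 1080 px/frame x 25 frames/s = 51.84 Mpx/s
reads:  51.84 Mpx/s x 7.5 B/px = 388.8  MB/s
writes: 51.84 Mpx/s x 6.0 B/px = 311.04 MB/s
```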
pepedog
commented
Dec 2, 2014
|
I have just packaged (for myself for now) against this branch featdeintrework-pr |
I checked the test app on my "magic" setup and it did not perform better than your posted numbers. I could not test graphics because I chrooted into the Arch installation with the old kernel, and even replacing the VIVANTE driver and GLES libs did not help to produce any colors on the screen. But anyhow, I don't think that kernel is so much faster than your latest one with all the optimizations you came up with. It decoded and deinterlaced the 1080i50 stream (ProSieben HD) in 40 ms (no rendering). Probably there is no key in that installation ;) I will check possible optimizations on the app level. |
|
OK @smallint, thanks a lot for your results even if they are a little disappointing: then it would be all about always being on the safe side with very little margin? I have just thought about your previous post with the GPU bandwidth explanation and maybe we could save the GPU CSC step? If it makes sense with the current figures ... Hmm, time to tweak the IPU driver to check whether it is pure fantasy or if it can work that way ;) |
|
@wolfgar that means that we need to allocate twice the buffer size for deinterlaced frames. I do not know if we have enough memory for that. I can't really follow the VDIC bypass ;) Wouldn't that just mean that we set the output format of the IPU task to RGB? EDIT: glTexDirectVIVMap does not support RGB as input buffer format, as far as I could see from the format enumeration. |
|
@smallint: well, let's say that the DMA buffer size is yet another burden that we can deal with (especially since, now that we have moved to a modern 3.14 kernel with CMA available, it should be easier to reserve memory for DMA and not be tied to the hardcoded DMA zone and its allocator). Yep, in fact I guess that if you set the output format of your IPU task to RGB it could behave exactly like I say! In fact, I have just checked and there is already a path available (MEM_VDI_PRP_VF_MEM) that could do just what I describe. Maybe not much tweaking is required ;) ... |
|
Hehe :-) I thought exactly the same while typing. Normally "double" is the default, therefore we can drop the "double" and only have VS_INTERLACEMETHOD_IMX_BOB and VS_INTERLACEMETHOD_IMX vs VS_INTERLACEMETHOD_IMX_BOB_HALF and VS_INTERLACEMETHOD_IMX_HALF, making VS_INTERLACEMETHOD_IMX_HALF the default. I thought of those when thinking about double rate:
But yeah, in the renderer it would be a parameter to the shader. Thanks again! |
|
I think the VDI splitting itself is not perfectly implemented in the kernel and that's why I bypass it in userland. We need to confirm that kernel issue and find a solution for it. Then all implementations (gstreamer, Qt5) will benefit from it. I removed that commit from my branch again since I don't want it to find its way, in its current state, into any builds. If the kernel is ever fixed in that regard it won't be needed anymore. |
|
@rabeeh Can you comment on the kernel side? What happens internally, are you doing some "temporary allocs" for saving parts of the too-wide surface? @smallint Yeah! Getting those out again will be a huge pain, because some will just pick it up and use it "as it works" without any interest in solving the real issue. I am still a bit baffled about the 34 ms the renderer currently needs. Most likely I did not fully understand what exactly you measure there. For just swapping a buffer it sounds really, really slow. Could you explain that again? |
|
The double rate thing is handled in the decoder and it just makes another buffer available to the player. An option somewhere saying that it should be used is enough. You can even use double rate for progressive content if you want - probably not with the IPU deinterlacer, but in the renderer itself. That's why I think that double rate should not be tied to deinterlacing. |
|
I think we are talking in slightly different directions. When I think of double rate, I don't think of just "doubling" an already existing progressive frame. I indeed thought of bobbing from an interlacing point of view: outputting one frame for every bottom field and another one for every top field. But most likely this named option won't do what I had in mind. Edit: And yeah, my question concerning the renderer here: I am not sure we can get that time down to 20 ms? |
|
The IMX double rate computes an interpolated frame between frame 1 and frame 2. So you will get 50 frames out of 25 frames. But actually this is what most TVs do nowadays with their motion enhancers (soap effect). So probably not necessarily very important at this stage.
No, there are no allocs. I know what the kernel is doing: it is not splitting the stripes correctly and reduces the DMA burst to 8 instead of 16 because the stripe widths are no longer a multiple of 16. The question is: why does it split that way?
Only swapping a buffer? It is decoding, deinterlacing and rendering. All of that takes 34 ms. It works like the pipeline sketched after this comment.
If it were "really, really slow" then we would not be able to render at 25 fps, don't you think? The 34 ms leave 6 ms per frame for Kodi. |
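A rough picture of that pipeline, pieced together from the surrounding comments (each stage runs in its own thread; an illustration, not exact figures):

```
VPU decode (thread)  ─►  mixer / IPU deinterlace (thread)  ─►  GL render (thread)
                 ≈ 34 ms per frame in total, against a 40 ms budget at 25 fps
                 → ~6 ms per frame left for the rest of Kodi
```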
|
Ah! Thanks for that. This makes the 34 ms much more appreciated. Because that way we can get some speed out of it when doing things in parallel, maintaining some buffers in the decoder, deinterlacer and render stages. I'd like to help more on that topic, but yeah ... my everyday job is currently too demanding to get anything done in my free time. Edit: 34 * 25 < 1000 - I was thinking of a parallel architecture. |
The 34 ms are already for the parallelized version. Decoding, deinterlacing and rendering are done in their own threads. This is already the maximum speed we get so far. The test app is using the Kodi IMX6 codec implementation and mimics the player and render threads of Kodi.
That is actually an issue for all of us. What I am missing while we are talking here a lot about possible implementations is that someone says: "yes, I can do that and will look into it" or "no, I have no time at all for that stuff and you need to figure it out yourself". Eventually we have to implement it and I would like to know if we can get support from you guys (guidance/implementations) or not. Even if not then that is a statement and I know how to plan my further activities, no problem. My main concerns are changing internals like enums or translation IDs and stuff, using shaders in a wrong way and so on. |
|
Then, to make it official: I currently don't have time to help in detail, but I hope that will change in the near future. A lot of devs, @FernetMenta for example, are looking at this PR and a ping will always give a result. Especially with the shaders: when the first patches arrive here, in small manageable parts, you will get help of course. Half of this thread went off into "internals" of the kernel, GPU, erratas and documentation that none of us had a clue about before. So if we now concentrate back on the Kodi integration it will get better. Do not hesitate to ping when parts affecting the renderer and other architecture arrive. |
|
Very good, thanks. I think we can move the kernel noise to private discussions with affected devs or back to xbmc-imx6. I will only talk about Kodi integration itself here. @wolfgar, do you agree? |
|
@smallint I totally agree that we should open another thread to talk about the kernel changes we would like to push in order to have a fix in the right place. @fritsch, would it make sense to open an issue to talk about these imx6 kernel changes? Or is there any other place that would be better suited? |
|
@wolfgar: Btw, having the discussion on an imx6 issue is not a problem at all, as the final code will go in here (xbmc github) anyway. So up to you. Just open whatever you find best and put a link here. |
|
+1 for the forums. Github is not very friendly for discussions. |
|
To be pedantic: the kernel issue is not related to Kodi at all. It also affects gstreamer and friends. Actually the Freescale forum would be the right place, wouldn't it? ... I am fine with any place. |
|
Well, as long as the goal is to have a better Kodi, I don't see a problem.
|
|
I am going to clean up the patch tonight and push it into this branch. Then you can start testing it and report back. The good thing is that it won't break things even if the kernel is fixed. I will be off for some days and will check back later wherever the discussion continues. |
|
Patch committed. If you are testing it and still have some performance issues please check your uEnv.txt and make sure that it contains:
to use a 16-bit framebuffer. Otherwise performance is worse. My Wandboard even handles 32 bpp nicely, but I am not sure about the Cubox. To be tested ... have fun |
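On i.MX6 boards a 16-bit framebuffer is typically selected with a video= argument along these lines (a hypothetical example; adjust the device, resolution and refresh rate to your setup):

```
video=mxcfb0:dev=hdmi,1920x1080M@50,if=RGB565,bpp=16
```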
cmichal2
commented
Dec 6, 2014
|
A quick comment with respect to power usage and VPU@352MHz. I plugged my cubox-i4 power supply (the 3A 5V supply sold by SolidRun) into a Kill-A-Watt style power meter. All measurements were made running xbmc Gotham. Without the VPU352 patch: With the VPU352 patch: It's true that enabling this patch means disabling frequency scaling, but that doesn't appear to mean no power savings when idle. This patch has been referred to a couple of times here as 'overclocking the VPU', but running the VPU @ 352MHz is specified in the chip datasheet and is a supported feature. The difference in power consumption seems pretty minimal to me. Freescale includes kernel patches to enable it. I don't know why it wasn't in 3.14, but I suspect it was because Freescale hadn't added the feature to their 3.10 kernel when the work was done to support these devices upstream. |
|
@cmichal2 Thank you for the figures and clarification. I will try to avoid calling the 352M setting overclocking from now on. ;) Has anyone tested the codec patch? Has the discussion continued somewhere else? Have a nice weekend ... |
|
@smallint the discussion stopped during your holidays. I made some tests with my cubox i4 pro with those patches, the 16-bit framebuffer and OpenELEC. I now get around 21 to 23 fps, so just a tiny bit too slow to not let audio / video drift apart. Nice weekend! |
|
@fritsch with which stream? With my Cubox I get smooth playback with 25fps on ServusTV (e.g. your mbaff sample). Haven't tried 32bpp yet. What kernel are you running on? I am running my stuff on Archlinux with LK 3.10 and 3.14. |
|
@smallint: sorry, I wanted to write a detailed status (including our latest private exchanges regarding the benefit of 16-byte alignment after additional tests) as a first post to initiate the thread about possible kernel improvements, but I had a terrible week |
|
@smallint This is running a 3.14 kernel with a whole lot of patches (most are not imx6 related): https://github.com/OpenELEC/OpenELEC.tv/tree/openelec-5.0/projects/imx6/patches/linux Is there something missing @sraue should add? |
|
@fritsch showing the overlay with the stats ('o') reduces the framerate significantly in particular with LiveTV. I open the overlay only from time to time to check fps quickly and trust my eyes otherwise. And I also use my test app. Having the fps in the logs would sometimes help but only during testing. Are there other options without drawing expensive overlays? |
|
Yeah. Of course, that is known. I do the same. I judge from video and audio going out of sync
|
|
The bad deinterlacing quality is due to LOW_MOTION being active for SD. With HIGH_MOTION it looks very good. Either there is still a setup problem with the buffers for LOW_MOTION or it's the algorithm itself. I will check that. |
|
The only deinterlacers that have really good quality on those samples (in OSS world) are Intel HSW with MCDI and Yadif CPU deinterlacer. I would not care about that as it is really academic. |
|
How do you create screenshots in Kodi and where are they saved? ;) |
|
Just press "printscreen" or use: xbmc-send -a "TakeScreenshot" It asks you for the saving location the first time you do that. Btw, one of the above screenshots was produced with a gstreamer chain, because the Intel guys did not believe the xbmc screenshots :-) |
|
That's great! MCDI quality. Nice. Is this with HIGH_MOTION active? What about the performance, are we fast enough to enable that for SD content? |
|
Yes, HIGH MOTION, much faster than LOW MOTION and easily doable with SD. You just need to set lowMotion = ... to false. We need settings for that ;) |
|
Yeah, +1 to that. What do you mean by "works better now"? Edit:
So yeah, just set it to false always. |
Regarding the custom splitting with respect to the initial state of this PR. You did not miss anything! ;) |
I took your sample called 1080i59.94 and played that smoothly with 30fps although the content is ... what is that actually?
Now you know why we are eager to optimize towards using that deinterlacer instead of simple bobbing. |
|
Lol, I have known that before, as LiveTV is my primary use case for xbmc and during the last 4 years I have seen many of those algorithms coming, especially in VAAPI, which developed very slowly until VPP appeared with BOB for a year, and afterwards MADI and MCDI, which are quite awesome. The idea of getting real deinterlacing working (e.g. taking more frames into account) is clearly the goal. BOB was meant as a simple alternative for "current users". You know, I have also seen a lot of good hardware dying; the biggest death was the AMD E series, very good hardware without real driver support by AMD. Now, 3 years later, we can start to really use it ... I don't want to see that happen again, therefore I am in favor of getting working code, in the sense of: does the job for current(!) users' content. After your breakthrough with the surface splitting, we are probably 10% (speed) away from getting it done really, really well, with a bit of extra power on top to make it more robust. |
|
I have been working on the double rate in the meantime and pushed that. But I needed to disable it for now because it introduces oscillating scanlines in the final image. I hoped that recent kernels had fixed that issue but it seems to still be there. Anyway, the code for double rate and low motion is there, but both are deactivated and only high motion is used for now, which seems to work fine. |
|
"Experts" or of different oppinion concerning the "double rate" features. Especially early 400hz TVs calculated insane movements as in between images. So mostly we advice our users to just disable all those Motion functionality on the TV. Where do you see the pimary usecase? |
|
In recovering 50 fps from Xi50 streams which are currently rendered at 25 fps. It is not really the motion enhancement I was referring to earlier (I don't like that either) but deinterlacing as you would do it when working on single fields instead of full frames. |
|
Ah! OK, then I was right after reading the documentation, but I let myself be convinced by your findings. Cool. Do you already have some performance figures? How many fps can we actually render? |
50 with SD (except this burosch sample, which is quite strange). HD does not work with the GPU. Raw decoding + deinterlacing + double rate deinterlacing takes around 5 ms per frame on average (SD). But there is either still a kernel issue or I am doing something wrong with the buffer setup. You can test it by enabling that in the code (doubleRate = ...) and testing it with e.g. N24 with that scroller at the bottom.
Sorry for the confusion ... I mixed things up ... too many inputs at that time. |
|
You are doing an awesome(!!) job. And the people over at SolidRun should pay you ... because you are basically fulfilling their "TV" marketing promise. |
|
guys what is the issue with moving the discussion to the forum? |
|
I have no account there ;) OK, can do that. If someone creates a topic I will jump on. |
|
Now with the xbmc-to-src renaming, this needs a major "rebase". The easiest method to do so (no kidding): use git format-patch HEAD~27 (the number of commits), then replace /xbmc with /src and use git am to get them in again. Funny, isn't it?
What is wrong with "git merge master" and "git rebase master"? I did that and pushed the changes. The old branch is backed up to featdeintrework-pr-beforerename. |
|
Would save you a step. But also fine.
|
|
@smallint Yeah - your rebase work needs to be redone - the src commit was reverted. Sorry for the trouble. |
smallint
added some commits
Nov 25, 2014
|
@fritsch Done. Any reason to revert the renaming? |
|
@smallint Yes. GitHub does not support --follow and our per-file history was broken after the merge. Additionally, we did not have to rename for legal reasons ... so losing history for a bit of cosmetics was decided against. |
wolfgar
commented on xbmc/cores/dvdplayer/DVDCodecs/Video/DVDVideoCodecIMX.cpp in f63dfb6
Dec 21, 2014
|
Are you sure this is safe? We cannot change the pointers which are returned that way, I fear |
|
Yep, only if they are not nAlign-aligned. I will check it. Shall we always align on pages and just allocate an additional 4k? |
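For reference, rounding a returned address up to nAlign (or to a 4k page) is the usual trick; a minimal sketch, assuming nAlign is a power of two and that the buffer was over-allocated by at least nAlign - 1 bytes:

```cpp
// Round a pointer up to the next nAlign boundary (nAlign must be a power of
// two). Over-allocating by nAlign - 1 (or by 4096 for page alignment)
// guarantees the aligned pointer still lies inside the buffer.
#include <cstddef>
#include <cstdint>

static uint8_t* AlignUp(uint8_t* ptr, size_t nAlign)
{
  uintptr_t addr = reinterpret_cast<uintptr_t>(ptr);
  addr = (addr + nAlign - 1) & ~static_cast<uintptr_t>(nAlign - 1);
  return reinterpret_cast<uint8_t*>(addr);
}
```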
|
I am currently working on another rework of the proposed method (forum) which integrates much more nicely into the current RenderManager workflow and will make it easier to implement a GLES fallback later. But this needs #6090. Not sure what you prefer. If this one already works for some people, go ahead and merge it. But the next PR will remove most of this code again ;) |
|
That was the question, yes - thanks for the answer. So we keep that one open here until your rework based on #6090 is ready to be merged and then we can close this one. Thanks very much for this update. |
|
I pushed another branch with a rework that I would call final, but it still needs some tests. Would you mind checking the code in smallint/xbmc@55c8fbf which implements the screenshot and the YUV422-to-RGB conversion? I don't know if I used the correct Kodi-compliant coefficients. Another issue @Memphiz was pointing us to is RenderCapture for e.g. boblight (funny thing, I did not know about it) that needs to be fixed as well. Could someone give more input on that: how often will it be called, what interface to implement ... |
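For comparison, these are the usual BT.601 limited-range coefficients for SD content (a reference sketch, not the code from that commit; BT.709 for HD uses approximately 1.793 / 0.533 / 0.213 / 2.112 instead):

```cpp
// BT.601 limited-range YUV -> RGB (Y in [16,235], U/V in [16,240]).
// Results are not clamped to [0,255] here.
#include <cstdint>

static void Yuv2RgbBt601(uint8_t y, uint8_t u, uint8_t v,
                         float& r, float& g, float& b)
{
  const float Y = 1.164f * (y - 16);
  r = Y + 1.596f * (v - 128);
  g = Y - 0.813f * (v - 128) - 0.391f * (u - 128);
  b = Y + 2.018f * (u - 128);
}
```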
|
For RenderCapture see here: https://github.com/xbmc/xbmc/blob/master/xbmc/cores/VideoRenderers/RenderCapture.h (e.g. the implementation for the rbpi). It's called for each frame in the best case ;) |
We are just happy that rendering is fast enough. Capturing the display at each output is just mad ;) I will look into this, thanks. What is RenderCapture used for otherwise or how can I activate it for testing? |
|
It's used for creating bookmarks (no performance needed here) and for boblight (boblight is the use case where we want as many frames as possible - normally, for example on AML, we use the hardware to downscale the frame to, let's say, 80x80 or 160x160 pixels and then grab this downscaled frame, which has enough information to drive the LEDs). |
|
Thanks a lot for the additional input, Memphiz. It may be possible to achieve this at the cost of one additional IPU task for each frame. When deinterlacing is not at play this additional activity will be easy to cope with (as the output frame is "small", it is intrinsically cheap compared to when we deinterlace). When deinterlacing is at play, I cannot tell for sure. |
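A sketch of what that extra IPU task could look like - scaling the decoded frame down to a small buffer that the capture path can read back - again assuming the Freescale /dev/mxc_ipu task interface (field names, the fd_ipu descriptor and the two physical addresses are illustrative, not taken from the actual code):

```cpp
// Hypothetical: downscale one decoded frame to a small RGB565 buffer for
// RenderCapture/boblight, reusing the same ipu_task interface as the
// deinterlacer sketch above.
struct ipu_task task;
memset(&task, 0, sizeof(task));
task.input.width   = 1920;
task.input.height  = 1080;
task.input.format  = IPU_PIX_FMT_NV12;
task.input.paddr   = decodedPhys;        // physical address of the VPU frame
task.output.width  = 160;                // small target, enough to drive LEDs
task.output.height = 160;
task.output.format = IPU_PIX_FMT_RGB565;
task.output.paddr  = capturePhys;        // small DMA buffer read back by the CPU
ioctl(fd_ipu, IPU_QUEUE_TASK, &task);    // hand the task to the IPU driver
```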
|
For interlaced content it might be enough to capture half a frame, if this makes it easier/faster. |
|
@wolfgar deinterlacing is actually faster than blitting progressive frames according to my tests so one could enable deinterlacing all the time ;) |
smallint
referenced
this pull request
Feb 5, 2015
Merged
i.MX6 rework which decouples GUI (fb0) and video rendering (fb1), framebuffers are composed with DP in hw #6351
|
Superseded by #6351. |
smallint commented Nov 24, 2014
This is the rework of the HW deinterlacing for IMX6 boards (see xbmc-imx6/xbmc#70). It creates a mixer thread that offloads deinterlacing to the IPU in parallel to VPU decoding and GPU rendering.
What works:
What does not work:
This PR is for review to be integrated into the current IMX6 Codec implementation.