
[imx] Deinterlacing rework #5805

Closed
wants to merge 28 commits

Conversation

Member

smallint commented Nov 24, 2014

This is the rework of the HW deinterlacing for IMX6 boards (see xbmc-imx6/xbmc#70). It creates a mixer thread that offloads deinterlacing to the IPU, in parallel to VPU decoding and GPU rendering.

What works:

  • Selectable deinterlacing modes: None, Auto, Force

What does not work:

  • Double-rate feature. It can easily be implemented but needs proper settings in the GUI
  • Smooth playback for HD streams, which still needs to be tested. The performance of my test setup has already increased compared to Gotham

This PR is for review to be integrated into the current IMX6 Codec implementation.
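The decoder -> mixer -> renderer handoff described above is essentially a pair of bounded producer/consumer queues. A minimal sketch of that pattern (class and method names are hypothetical, not the PR's actual ones):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

// Bounded FIFO handing decoded frames from the VPU decode thread to the
// mixer thread, and deinterlaced frames from the mixer to the renderer.
// Names are illustrative, not the PR's actual classes.
template <typename T>
class BoundedQueue {
public:
  explicit BoundedQueue(std::size_t capacity) : m_capacity(capacity) {}

  // Blocks while the queue is full, providing back-pressure on the producer.
  void Push(T item) {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_notFull.wait(lock, [this] { return m_queue.size() < m_capacity; });
    m_queue.push(std::move(item));
    m_notEmpty.notify_one();
  }

  // Blocks while the queue is empty.
  T Pop() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_notEmpty.wait(lock, [this] { return !m_queue.empty(); });
    T item = std::move(m_queue.front());
    m_queue.pop();
    m_notFull.notify_one();
    return item;
  }

private:
  const std::size_t m_capacity;
  std::queue<T> m_queue;
  std::mutex m_mutex;
  std::condition_variable m_notEmpty;
  std::condition_variable m_notFull;
};
```

The mixer thread would then loop Pop() on its input queue, submit the buffer to the IPU, and Push() the result toward the renderer, so that VPU decode, IPU deinterlace and GPU render can overlap.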

Member

fritsch commented Nov 25, 2014

Thanks much for doing that PR.

Let's see if @koying and @FernetMenta see the potential for helix. We are already late, but reviewed code will bring IMX further to feature completeness.

I will take a look tomorrow night.

Member

FernetMenta commented Nov 25, 2014

Nice that there is progress on this platform, but I don't think beta4, in transition to RC, is the right time for a 1.1k-LOC change.

Owner

MartijnKaijser commented Nov 25, 2014

Agreed that it's bad timing. Will defer to I*** and we will likely also need a builder that is compiling this code path. Of course, in the meantime this code can be reviewed so that it can go in after the release.

Member

smallint commented Nov 25, 2014

Thanks guys, take your time. My concern is only about the implementation and how to improve it, to make the IMX6 work with Kodi and get the best performance out of it. I have run that code along with Gotham on my box for a while already. But during rebasing I noticed that things have changed in Helix, so I don't want to guarantee anything, although I tested last night.

Member

fritsch commented Nov 25, 2014

You can easily add fixups to this PR later on. So if you start using Helix
now, that code here will automatically improve :-)


Member

fritsch commented Nov 25, 2014

@smallint:

Can you please split up the PR into:

  • Cosmetic changes
  • Architecture changes of current code
  • Deinterlacing Code (classes, thread, ...)
  • Integration of that deinterlacing code into current code

It's really hard to review those mixed changes.

There is a lot of debugging code still in there; the most severe one is in Decode, which always prints a line (see also the implemented logging levels).

This: xbmc-imx6/xbmc@a4346e2#diff-41e0ceb3de6f45669f617deae0dbeb0cR2988 needs discussion as it bypasses an infrastructure we implemented for Helix.

Member

smallint commented Nov 25, 2014

@fritsch Splitting everything up into chunks as you request might take some time, because I just ported code that was based on a completely different HEAD with the "old" deinterlacing to the Helix master. I had to resolve lots of conflicts. Let's try ...

The debugging code is still necessary but is not compiled in by default unless you activate IMX_PROFILE* or TRACE_FRAMES. Where is a line always printed in Decode? Can you point me to it?

Regarding the last point: that is interesting. I would like to learn about it, and probably the new infrastructure makes some code in the mixer obsolete. You know, while we implemented all that, we had to read the Kodi code of that time due to the lack of documentation. Honestly, I do not expect technical documentation, so no offense here; I just want to point out that it took time to make all that work and to understand the code logic (or at least a bit of it). I am pretty sure it can be improved in terms of architecture with your help and your insight into the current code base. So please don't take the code as if it had been written by an experienced team member. I am very open to changes! Thanks

Member

fritsch commented Nov 25, 2014

@smallint: Thanks for your reply, the log output is here: xbmc-imx6/xbmc@a4346e2#diff-68cec69f8ac35fc16407c6baa250b93bR658

To use the new component logging, it's enough to guard the debug log like this:

if (g_advancedSettings.CanLogComponent(LOGVIDEO))
  CLog::Log(LOGDEBUG, "here is my debug log");

The TRACE_FRAMES infrastructure is probably okay, as it is for debugging only.

Member

smallint commented Nov 25, 2014

Thanks for your reply, the log output is here: xbmc-imx6@a4346e2#diff-68cec69f8ac35fc16407c6baa250b93bR658

That code is in a comment block and not used. GitHub needs better code highlighting! ;)

Member

fritsch commented Nov 25, 2014

Hehe - yeah. Another point: please remove commented-out code that you don't need.


@FernetMenta FernetMenta commented on an outdated diff Nov 25, 2014

xbmc/cores/VideoRenderers/LinuxRendererGLES.cpp
@@ -2966,6 +2985,8 @@ unsigned int CLinuxRendererGLES::GetOptimalBufferSize()
m_format == RENDER_FMT_EGLIMG ||
m_format == RENDER_FMT_MEDIACODEC)
return 2;
+ else if(m_format == RENDER_FMT_IMXMAP)
+ return 1;

Why 1? The number of render buffers is the minimum of NUM_BUFFERS, GetOptimalBufferSize(), and Decoder::GetAllowedReferences(), so you would end up with a single buffer.
EDIT: with 2, because there is a fail-safe in renderManager.

Member

smallint commented Nov 25, 2014

I want as few buffers as possible queued in the renderer, because the mixer thread of the decoder already holds up to 3 buffers for either deinterlacing or pass-through. Having additional buffers queued in the renderer means that more buffers need to be allocated in DMA, and that size is quite limited. Again, deinterlacing is processed in the decoder itself and the renderer "just" needs to render the texture without further processing. The IPU/VPU buffers are directly mapped to the GL texture. That was my intention.

Member

FernetMenta commented Nov 25, 2014

Again :) One is wrong anyway, because you need at least as many as the GL/GLES context: while one buffer is written to, the other one is flipped to the front buffer for display. But with only 2 buffers you would synchronize the renderer with the decoder/mixer. A small delay of the video player thread will cause a missed slot for vblank, and you'll end up rendering the last frame again; as a result a future frame needs to be dropped.

Member

smallint commented Nov 25, 2014

OK, got it. But actually this delay should be handled by the buffers in the mixer: if the video thread is slow or lagging, there are already buffers in the output queue of the mixer. But I see your point and will change it to 2, as all the other backends do, and test it. I am currently building all the stuff again and will submit the requested split-up commits. Thanks.

Member

FernetMenta commented Nov 25, 2014

You don't get the point. Changing it to 2 does not change anything because of the fail-safe in renderManager; it already uses 2.
Rendering is asynchronous, and additional rendering like OSD or subtitles needs elasticity in the system. Assume that at 50Hz a single rendering cycle needs 35ms instead of 20ms: a system with 3 buffers can catch up, but unlikely with only 2.
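The rule being discussed can be sketched in a few lines: the effective number of render buffers is the minimum of the compile-time maximum, the renderer's GetOptimalBufferSize() and the decoder's GetAllowedReferences(), clamped by the renderManager fail-safe. The constants below are assumptions for illustration, not Kodi's actual values:

```cpp
#include <algorithm>

// Illustrative constants; the real values live in the Kodi render manager.
constexpr int NUM_BUFFERS = 5;    // assumed compile-time maximum
constexpr int FAILSAFE_MIN = 2;   // fail-safe mentioned above

// Minimum of the three limits, clamped to the fail-safe lower bound.
int EffectiveRenderBuffers(int optimalBufferSize, int allowedReferences) {
  int n = std::min({NUM_BUFFERS, optimalBufferSize, allowedReferences});
  return std::max(n, FAILSAFE_MIN);
}
```

Under this rule, GetOptimalBufferSize() == 1 with GetAllowedReferences() == 3 still yields 2 buffers, which is why returning 1 or 2 from the renderer makes no practical difference here.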

Member

fritsch commented Nov 25, 2014

@smallint: Can you perhaps help us a little bit in understanding the imx xbmc architecture? When I look into the mainline code, I see a big Lock() taken from the render thread. Doesn't this completely stall the decoder?

Can you briefly describe how the parallelism (e.g. the separate mixer thread) that is now integrated works around such hard "locking" foo?

Have a look also in VAAPI / VDPAU how the surface buffers are reused with minimal locking by maintaining two vectors.

Perhaps we can start discussing from an architecture point of view.

Member

smallint commented Nov 25, 2014

@FernetMenta Currently the IMX6 with GLES rendering of the VPU/IPU buffers does not manage to render overlays smoothly anyway; 3 buffers won't help here. I am currently only considering fullscreen performance without controls rendered, which is already demanding enough on that box with HD and deinterlacing enabled. The decoder itself still returns 3 in GetAllowedReferences(), so this should be the value we are actually talking about. The 1 I return was just for flexibility, to let the decoder decide how many buffers are queued in the renderer. We are still on the safe side, aren't we?

Member

smallint commented Nov 25, 2014

@fritsch The Lock() is gone with this PR; I have completely removed it. Before, the Lock was used when the decoder was reset or disposed, to synchronize rendering and buffer disposal. But that wasn't an issue at all, because the current path is fast enough without deinterlacing. All I am testing with is deinterlacing, since that utilizes the memory bus much more than just decoding and rendering. I agree that an explanation of the ideas behind the current code would help. But I cannot just put it into two lines; I need to think about it and write some longer text later. Would you recommend doing that here in the comments?

Member

fritsch commented Nov 25, 2014

@smallint: Yeah, don't put yourself under too much stress, please.

Let's concentrate on the split-up; then, in the split-out code, we can perhaps easily see the ideas and suggest improvements.

Thanks much as always!

Member

smallint commented Nov 25, 2014

I split it up into 12 commits and pushed it. Be aware that not every commit will compile, since I did that as a dry run on another machine. The final commit compiles and is currently running on my box. I hope you can now check the changes more easily.

Member

fritsch commented Nov 25, 2014

Thanks for splitting! I actually tested the code right now to see how it is working.

576i is nicely deinterlaced and displayed as 25 fps.
1080i50 only gets around 18 fps on my system with this PR picked. I think dropping in the decoder is not working yet (judging from the OSD only for now).

Will look at the code tomorrow evening as promised.

Member

smallint commented Nov 25, 2014

The OSD thing is puzzling me, too. With Gotham I could see drops and skips nicely, but in Helix it always seems to be 0 although the framerate is below 25fps. Let's see what you will find out ...
EDIT: What box are you testing with?

Member

smallint commented Nov 25, 2014

@FernetMenta You are right. After checking the code again, returning 1 is not enough. The GetAllowedReferences return value does not override the GetOptimalBufferSize value. I am going to correct this, sorry for the noise.
EDIT: I tested and committed it, but it doesn't improve the deinterlacing performance. I am also wondering about the high CPU load during playback. With deinterlacing it is 160% and without it still 130%. I am running a Cubox-i Pro.

chbmuc commented Nov 26, 2014

I built OpenELEC master with these patches and can confirm that it's working very well with SD content (576i), but 1080i is not running smoothly.

Using the suggested devmem settings improves the situation noticeably, but there are still minor glitches.

devmem 0x00c49100 w 7
devmem 0x00c49104 w 7

I wonder if these settings should be included.

Looking at the code, I would vote for inclusion in Helix, because it's almost certain that it doesn't affect other platforms.

I made some comments in the patches, but I couldn’t find anything related to performance.

Member

fritsch commented Nov 26, 2014

Looking at the code, I would vote for inclusion in Helix, because it's almost certain that it doesn't affect other platforms.

It won't get into Helix, which is not an issue at all, as code quality is more important than features. The background of this PR was to get the architecture discussed and optimized.

We do it right, once and for all. It's not a matter of getting something partly working, but of establishing a design and architecture that is usable, expandable and easy to debug in the future. If we see obvious fixes that need to go into Helix, they can be split out and cherry-picked.

@chbmuc what do those devmem settings do?

chbmuc commented Nov 26, 2014

The devmem settings are mentioned here: xbmc-imx6#70 (comment)

"gives the VPU memory read and write highest priority."

Member

fritsch commented Nov 26, 2014

@chbmuc thanks for that information.

@FernetMenta here is a plot of deinterlace / decoder times as you just asked for it

https://cloud.githubusercontent.com/assets/6400406/4179063/d29c8412-36b5-11e4-942a-ae20b5dd3ad3.png

I am quite puzzled: the decoder speed is highly fluctuating. It seems like something parallel is getting serialized here, which looks like a stall.

Member

smallint commented Nov 27, 2014

@fritsch Since we are using the boxes at the limit of their memory bandwidth (my conclusion from the tests so far), we need to adjust many settings carefully for different use cases. A media center like Kodi probably needs different kernel optimizations than a web server or a NAS. To reproduce the plots I made, you can activate IMX_PROFILE_BUFFERS and grep the output to prepare the plots. I used gnuplot, as you noticed already. I can post the script I use for that later.
This brings up another question: how do you test new codecs, or do you have test applications for it? I am thinking of an application that uses the Kodi libraries and plays a given video stream. If that application used the same settings as Kodi, one could easily hack around with XML and test different options. Compiling Kodi all the time and navigating to the playback is quite time-consuming. Is something like that available in a test folder of your devs? I did that for the task queue I use for the deinterlacer, to test various conditions.

Member

smallint commented Nov 27, 2014

As promised, the script I use to prepare the plots:

cat xbmc.log | grep "+D" | colrm 1 35 > d.log
cat xbmc.log | grep "+P" | colrm 1 35 > p.log
cat xbmc.log | grep "+V" | colrm 1 35 > v.log
cat xbmc.log | grep "+R" | colrm 1 35 > r.log

sed -i '1,10d' d.log
sed -i '1,10d' p.log
sed -i '1,10d' v.log
sed -i '1,10d' r.log

Then I run gnuplot with a selection of those files.

Member

fritsch commented Nov 28, 2014

@smallint Thank you very much. I started to read the code in detail yesterday. From an "algorithm" point of view I really wonder what that deinterlacer spends its time on; it should not need that much time at all, because it is hard-coded for one future and one past reference, so it should be optimized hardware-wise for exactly this.

Edit: Other post-processing APIs have the possibility to use more or fewer refs to adjust performance. Intel VPP for example only uses one of those, just enough for a bit of motion compensation.

Here @FernetMenta is the specialist on how to manage those reference frames without having to allocate additional memory, by constantly reusing allocated buffers.

Those flags/options are all we currently have, right - nothing more? So the only parameter affecting quality is ipu_motion_sel? I cannot find that documented :-(

Besides that, 37.4.11.1 VDIC Features tells us that the internal deinterlacer can only cope with 980px horizontally. I am not 100% sure they are really talking about horizontal width; vertical would make more sense.

struct ipu_deinterlace {
    bool    enable;
    u8  motion; /*see ipu_motion_sel*/
#define IPU_DEINTERLACE_FIELD_TOP   0
#define IPU_DEINTERLACE_FIELD_BOTTOM    1
#define IPU_DEINTERLACE_FIELD_MASK  \
        (IPU_DEINTERLACE_FIELD_TOP | IPU_DEINTERLACE_FIELD_BOTTOM)
    /* deinterlace frame rate double flags */
#define IPU_DEINTERLACE_RATE_EN     0x80
#define IPU_DEINTERLACE_RATE_FRAME1 0x40
#define IPU_DEINTERLACE_RATE_MASK   \
        (IPU_DEINTERLACE_RATE_EN | IPU_DEINTERLACE_RATE_FRAME1)
#define IPU_DEINTERLACE_MAX_FRAME   2
    u8  field_fmt;
};

I am currently travelling again, it's a real rough time for me before this christmas :-(

What we could try though, would be:
What is the performance of the decoder alone (when decoding 1080i material), and what is the throughput of the deinterlacer alone? If, e.g., the deinterlacer speed without additional pressure is already too low, we have a real problem.

What happens when we start to render? How much does this decrease the decoder throughput?

Perhaps we can speed things up by somehow using 37.4.11.2.1 Vertical Filter Block (di_vfilt), which is "just" a weighted average filter that could be reused for an "advanced" bobbing.

Out of interest: did you experiment with the frame-doubling flags? Does this cost additional load (besides that we could perhaps kill the renderer)?
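For reference, the rate-doubling flags in the struct quoted above combine like this. The usage pattern (frame 0 carries only RATE_EN, frame 1 additionally RATE_FRAME1) is an inference from the masks, not from driver documentation:

```cpp
#include <cstdint>

// Flag values copied from the ipu_deinterlace struct quoted above.
constexpr std::uint8_t RATE_EN     = 0x80;  // IPU_DEINTERLACE_RATE_EN
constexpr std::uint8_t RATE_FRAME1 = 0x40;  // IPU_DEINTERLACE_RATE_FRAME1
constexpr std::uint8_t RATE_MASK   = RATE_EN | RATE_FRAME1;

// Inferred usage (an assumption): with rate doubling enabled, output
// frame 0 carries only RATE_EN and output frame 1 additionally carries
// RATE_FRAME1 to mark the second field's output frame.
std::uint8_t RateFlags(int frame) {
  std::uint8_t flags = RATE_EN;
  if (frame == 1)
    flags |= RATE_FRAME1;
  return flags;
}
```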

Member

smallint commented Nov 28, 2014

As you could see from the plots above, the VPU spends more time on decoding than the IPU on deinterlacing.

Currently the default HD profile uses HIGH_MOTION, which uses only two fields or one (codec) buffer. LOW_MOTION uses three fields or two (codec) buffers and is therefore much slower; enabling that for HD is a no-go at the moment. The buffers are allocated during initialization only, and one VPU buffer points to the previous VPU buffer. Buffers are always reused, and the entire codec implementation avoids dynamic memory allocations where possible. I also tried to preallocate GL textures for each allocated buffer, but that did not help either.
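The trade-off described above could be captured as a resolution-based default. This is a hypothetical sketch: the enum mirrors the kernel's ipu_motion_sel only in spirit, and the height threshold is an assumption; only the field/buffer counts come from the paragraph above:

```cpp
// Hypothetical sketch of the HIGH_MOTION vs LOW_MOTION trade-off.
enum MotionMode { LOW_MOTION, MED_MOTION, HIGH_MOTION };

// HIGH_MOTION works on two fields (one codec buffer); LOW_MOTION works on
// three fields (two codec buffers) and is much slower, so HD streams fall
// back to HIGH_MOTION to stay within the memory-bandwidth budget.
MotionMode DefaultMotionMode(int height) {
  return (height > 576) ? HIGH_MOTION : LOW_MOTION;
}

int CodecBuffersNeeded(MotionMode mode) {
  return (mode == LOW_MOTION) ? 2 : 1;
}
```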

Those flags / options we currently have, right - nothing more? So the only parameter affecting quality is ipu_motion_sel? I cannot find that documented :-(

Those are the only options we have with the current kernel API. You can check the documentation, the "i.MX 6Dual/6Quad Applications Processor Reference Manual" available from Freescale, but I agree that documentation about APIs, best practices, limitations and so on is hard to find. ipu_motion_sel is documented, as are all the filters used by the VDIC.

Besides that 37.4.11.1 VDIC Features <- tells that the internal deinterlacer can only cope with 980px horizontally. I am not 100% sure they are really talking about horizontal widths, vertically would make more sense.

I think they are right. As I understood @wolfgar and some documents, buffers exceeding that resolution limit (IIRC 1024x1024) need to be split up. And that also takes time.

I just want to add that my Wandboard, running an old kernel with an overclocked VPU, handles HD deinterlacing smoothly. I don't know what magic @wolfgar applied to that kernel ;). I tried many different settings and kernel patches, but nothing improved the situation significantly on the Cubox. You can read about it in the imx6 issues; we have already discussed it a lot. Thank you @fritsch.

Member

fritsch commented Nov 28, 2014

@wolfgar Any chance the Cubox could get a bit of that magic for those boards? (Now that the Cubox TV is out, and yeah, no TV without deinterlacing.)

This would give us more room for a simple but optimized architecture (like the proposed mixer one), and we would not have to sacrifice this simplicity for working around driver shortcomings :-)

Member

fritsch commented Nov 28, 2014

Btw, perhaps to add: I am clearly aware that kernel debugging concerning performance is most likely the hardest job you can think of. In the beginning of the VDPAU-via-radeon OSS implementation, performance was at roughly 66% of the fglrx binary blob implementation. Christian dug into it for hours and weeks, and finally it was a simple response to a transmitted message that was being ignored (a two-liner) :-)

Member

smallint commented Nov 28, 2014

Does that cubox TV support smooth full HD deinterlacing with any software product available? If not, why is it called cubox TV then ;)

I don't think that @wolfgar is the right person to address that to (although I am confident he has the capabilities and knowledge). I would expect more input from vendors claiming "best media center experience" for their boxes.

I have never asked that question before but does Kodi/Android on the Cubox show the same performance limitations with deinterlacing?

Member

fritsch commented Nov 28, 2014

@smallint

Does that cubox TV support smooth full HD deinterlacing with any software product available? If not, why is it called cubox TV then ;)

Exactly what I thought. It's the same chip/configuration as the 4 Pro, but without Bluetooth/wireless.

There is no deinterlacing on Android at all. All those nice boxes that advertise a TV experience cannot do deinterlacing.

Member

smallint commented Nov 28, 2014

There is no deinterlacing on Android at all. All those nice boxes that advertise a TV experience cannot do deinterlacing.

Don't tell me ... I hoped there was at least a "reference" implementation showing that it is possible. So gstreamer-imx6 with vpu, ipu and eglsink could be an option to test performance against Kodi. I have never managed to get this combination running and never tried hard enough ...

Owner

koying commented Nov 28, 2014

Well, let's say we don't have control over interlacing on android.
Some OMX components do it automatically, but indeed most don't care ;)

chbmuc commented Nov 28, 2014

I don't know if it helps, but there is a thread in the Freescale forum about 1080i deinterlace support for the i.MX53. It also has some library code to support VDI split mode and an Android test, but I must admit that I don't understand it:

https://community.freescale.com/thread/292712

Contributor

wolfgar commented Nov 29, 2014

Hi there,

Sorry for my long silence; I wish I had been able to join this thread earlier.
First of all, I would like to thank smallint again for his constant help with deinterlacing since we switched to video rendering through the GPU and Vivante extensions.

Then, I guess there are very interesting remarks regarding locking in this thread: they could improve behavior when we browse the GUI with video playing, but that might not be the root cause of our deinterlacing issues.

Regarding raw performance, the VDIC engine should not be the bottleneck. But the advertised performance figures are a little confusing:

  • 100 Mpixels/sec output rate (stated in §37.4.11.7 of the i.MX6Q RM)
  • 75-100 MP/s in §37.4.11.1
  • 240 MP/s in §9.2.2.2

The first 2 figures might be compatible (one is only about the output rate, the other about the core deinterlacing processing max rate).
The third is strange. Anyway, §9.2.2 states 1080i compatibility without reference to the 968px horizontal resolution limitation, which is even more confusing...
What I can tell for sure is that when I did the first implementation, the picture was split into 4 parts and processed as 4 slices by the VDIC: I even had to track down a bug in the kernel which created a blank horizontal line along the slice borders ;)

What is also sure is that tweaking the VPU priority on the AXI bus improves behavior: it seems to prove that memory bandwidth is at play. Yet this bottleneck should only show up in some timeslices, as the global bandwidth of the i.MX6Q DDR is quite good: a 64-bit bus @ 1066 MHz should be able to cope well with the global datastream Mem->VPU->Mem->IPU(VDIC)->Mem->GPU->IPU(DP), even for 1080i@30fps.
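A rough back-of-the-envelope check of that claim (the frame size and the number of memory passes per frame are my assumptions, not measurements): the average demand comes out well below the theoretical peak, which supports the idea that the stalls are bursty rather than a raw average-bandwidth problem.

```cpp
// Back-of-the-envelope bandwidth estimate for the pipeline quoted above.
// All figures are rough assumptions for illustration.
long long PipelineBytesPerSecond() {
  const long long frameBytes = 1920LL * 1080 * 3 / 2;  // one NV12 1080p frame
  const int memPasses = 6;  // VPU wr, VDIC rd+wr, GPU rd+wr, DP rd (approx.)
  const int fps = 30;       // 1080i yields 30 full frames per second here
  return frameBytes * memPasses * fps;  // roughly 0.56 GB/s average
}

long long TheoreticalPeakBytesPerSecond() {
  // 64-bit DDR interface at 1066 MT/s -> 8 bytes per transfer.
  return 1066LL * 1000 * 1000 * 8;  // roughly 8.5 GB/s
}
```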

Unfortunately I don't know exactly why my Wandboard kernel behaves better; otherwise, be sure that I would already have shared this info, of course ;)

Smallint's profiling and plotting is really interesting and already gives pretty good data. But I guess we could get an even deeper understanding by using lttng to have a global view of thread scheduling and of the locking involved. I have wanted to go that way for a long time, but haven't had the time required to do it. I hope that will change soon...

Member

fritsch commented Nov 30, 2014

@rabeeh has posted a kernel patch here: xbmc-imx6#70 (comment) which affects the kernel and increases the burst size.

Member

fritsch commented Nov 30, 2014

@smallint please do a quick rebase; I have pushed the VP8 fix by @wolfgar, so this one does not apply cleanly anymore.

Member

smallint commented Nov 30, 2014

Rebased. I hope the merge of wolfgar's fix is correct.

Contributor

wolfgar commented Nov 30, 2014

@fritsch @smallint @rabeeh I pushed this fix more than one year ago in my kernels (and forgot about it ;)).
For instance, here: wolfgar/utilite@9cbcf0b
This could be the reason why my Wandboard kernel behaves better... I'll give it a try...

Member

smallint commented Nov 30, 2014

@wolfgar That looks great! Let us know how it works for you, I am going to test it as well.

Member

fritsch commented Nov 30, 2014

@smallint It did not help. @wolfgar has the same results :-(

Member

smallint commented Nov 30, 2014

@fritsch That's a pity.

Btw: do you know how to make CLog log to stderr via an API call? I am currently building a test app.

rabeeh commented Nov 30, 2014

@smallint - I think you previously mentioned that the VPU @ 352 MHz helps with deinterlacing. Is this really the story?
@wolfgar - i'm pretty sure you will remember this.

But then, we are shooting in the dark here: is the problem IPU performance? VPU performance? GPU? Or what?

Member

fritsch commented Nov 30, 2014

@smallint See ILog.h; I cannot find an additional function call to accomplish that. If in doubt, just use printf :-).

If you need it the other way round, e.g. having a lib's output logged into xbmc.log, you can reopen stdout as xbmc.log via freopen.
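That freopen trick in a minimal form (the helper name and log path are illustrative):

```cpp
#include <cstdio>

// Reopen a standard stream onto a file so that everything a library prints
// to that stream ends up in a logfile instead. Sketch of the suggestion
// above; the path is illustrative.
bool RedirectStreamToFile(std::FILE* stream, const char* path) {
  return std::freopen(path, "w", stream) != nullptr;
}
```

For example, RedirectStreamToFile(stdout, "xbmc.log") before initializing the library. Note this only covers the stdout-to-file direction; it does not make CLog write to stderr.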

Member

smallint commented Nov 30, 2014

@fritsch I am reusing the codec components, and they use CLog, so I would like to redirect that output to stderr instead of a logfile. But I can also live with a logfile.

Member

smallint commented Nov 30, 2014

@rabeeh Yes, the VPU is running at 352 MHz on my Wandboard. But that setting has gone in recent kernels; we discussed that already. I know that this is a difference to my Wandboard. I set up a vanilla Arch on that board and the fps are around 25, but always scratching the limit, and playback is not perfectly smooth. So yes, increasing the VPU frequency to 352 MHz would certainly help, but I don't know how to do that with kernel >= 3.10.

Member

smallint commented Nov 30, 2014

As you asked for some numbers: I dumped the binary data that was passed to Decode into a file and played it back in a test app, currently without GPU rendering, just VPU/IPU. It is a 1080i/50 stream. The target is 25fps, or 40ms/frame.

No deinterlacing:

Settings deinterlacing to OFF
Decoding of 960 frames took 17171 ms
Average frame decoding time: 17 ms

With deinterlacing:

Settings deinterlacing to FORCE
Decoding of 960 frames took 24682 ms
Average frame decoding time: 25 ms

That is all fine (just looking at the average). I am going to add a quick and dirty renderer and check what happens to those numbers. Numbers gathered on a Wandboard running the latest Arch.
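For comparison against the 25 fps target, the totals above translate into throughput via a trivial helper (hypothetical, just for these numbers):

```cpp
// Convert the measured totals above into achieved frames per second.
double AchievedFps(int frames, int totalMs) {
  return frames * 1000.0 / totalMs;
}
```

With the numbers quoted, that is roughly 55.9 fps without deinterlacing and 38.9 fps with it, both comfortably above the 25 fps target as long as nothing else runs.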

Member

fritsch commented Nov 30, 2014

@smallint: Can you check the 1080p50 sample with that method, please:
https://dl.dropboxusercontent.com/u/55728161/1080p50-recode.mkv
(deinterlacing off) of course.


Member

smallint commented Nov 30, 2014

@fritsch I am still testing with my stream and will upload the app later so that you (or someone else) can test it yourself. I will also test your stream later.

cmichal2 commented Nov 30, 2014

I've been using smallint's patch from xbmc-imx6#70 in geexbox/Gotham with the VPU352M patches for ages, and for 1080i OTA content in North America this combination gives smooth playback, as long as there are no menus overlaid.

I ported the freescale kernel patches that enable the VPU clock speed bump from 3.10 to 3.14. They are here:
http://phas.ubc.ca/~michal/vpu352-3.14-1.patch
http://phas.ubc.ca/~michal/vpu352-3.14-2.patch
http://phas.ubc.ca/~michal/vpu352-3.14-3.patch
http://phas.ubc.ca/~michal/vpu352-3.14-4.patch
with those applied and the CONFIG_MX6_VPU_352M option enabled in the kernel config it works quite nicely for me.

Member

fritsch commented Nov 30, 2014

@cmichal2 Give the Helix version a try. There is improved render infrastructure, so menus should be more solid now, too. Also, your performance comparison would be nice to know (with the very same kernel config).

Member

smallint commented Nov 30, 2014

Here is the output for VPU/IPU/GPU:

Set deinterlacing to FORCE and render
Decoding of 956 frames took 37237 ms
Average frame decoding time: 38 ms

And it played smoothly. But that is of course without any sound or any other stuff in the background. So we are already at the limit with that combination. FB_MULTI_BUFFER is set to 2 and my screen runs at 50Hz.

Member

fritsch commented Nov 30, 2014

Average frame decoding time: 38 ms

We need to decode 50 progressive frames per second (!) if you just tested the 1080p50 sample. No deinterlacing should be involved.

Edit: Now I get it. You tested your 1080i50 file with a "dirty" renderer. That's close to skip/drop :-).

I assumed this before. I still think the bottleneck is the renderer. Therefore testing the 1080p50 file would be nice, to see the decoder throughput (frame-wise) and then the throttling of the renderer.

Contributor

wolfgar commented Nov 30, 2014

@rabeeh: What are you sure about? What should I remember?
If you mean the burst size: no, I don't, because it is clearly a change I pushed before looking at deinterlacing, back when I combined layers in the IPU without using the GPU...
I have never hidden anything voluntarily from upstream or from smallint.
So please be clearer about what you are claiming here.

Member

smallint commented Nov 30, 2014

In your case, yes. In my case, no: 50 fields or 25 frames. I tested with my sample. Btw, the CPU load was almost 0 in my case, whereas Helix runs at about 160% when playing a stream. Is there something we can just switch off?

Here goes your sample:

Set deinterlacing to OFF
Decoding of 1567 frames took 25148 ms
Average frame decoding time: 16 ms
Set deinterlacing to FORCE
Decoding of 1566 frames took 36270 ms
Average frame decoding time: 23 ms
Set deinterlacing to OFF and render
Decoding of 1565 frames took 37591 ms
Average frame decoding time: 24 ms
Set deinterlacing to FORCE and render
Decoding of 1564 frames took 56072 ms
Average frame decoding time: 35 ms

It plays also quite smoothly.

Member

smallint commented Nov 30, 2014

I added a new branch perftest to xbmc-imx6. The test app is https://github.com/xbmc-imx6/xbmc/blob/perftest/perf-test.cpp. In this branch the define DUMP_STREAM is active in the IMX decoder which outputs a file stream.dump whenever you start playing a stream. Have fun compiling the test app ;)

Member

fritsch commented Nov 30, 2014

Thanks for those measurements. Not quite sure why you force deinterlacing, it makes no sense for progressive content. I see that decoding alone is fast enough: 16 ms means we can theoretically decode 1080p60.

Normally the render in kodi would skip and the decoder should drop at some point under such conditions.

Concerning the 160% we should find out what is causing that.

Contributor

wolfgar commented Nov 30, 2014

@cmichal2 : thanks a lot for porting the VPU@352M, I am currently pushing it into my 3.14 kernel.
I am a little confused because I thought we concluded it was not enough in our previous thread.
Anyway, improving the duration of each step can help.
@smallint : your profiling without rendering is really interesting: let's keep in mind that in this configuration we save a mem -> GPU -> mem -> IPU (for display, not for deint) transfer for each picture

Member

smallint commented Nov 30, 2014

@fritsch I try all options for each dump regardless of its "sense" ;) Also rendering is fast enough for your content without deinterlacing. If the decoder runs with Kodi it does not activate deinterlacing for progressive content. This is just my "dumb" test app.

Member

smallint commented Nov 30, 2014

Keep in mind that my tests were made under perfect conditions: no demuxing, no audio stuff, no nothing but bare metal decoding and rendering. Profiling all that within Kodi to get possible app bottlenecks could cause some headaches ;)

Member

smallint commented Nov 30, 2014

@wolfgar Is your 3.14 kernel for the Cubox? I would appreciate getting kernel sources for this box from one place, with CEC working and VPU@352M enabled. Otherwise we need to sync with @pepedog to incorporate all the required fixes. Currently it lacks proper CEC support.
@cmichal2 Thanks for the information. I had the feeling we never came to a conclusion in the other thread on whether this option would help as much as we hoped.

Contributor

wolfgar commented Nov 30, 2014

@smallint : OK, I know about the issue with the CEC stuff. I helped sort it out for OpenELEC last week; I'll ping pepedog so that he can fix this. In the meantime I'll set up a cuboxi kernel and share it with you in a few minutes...
PS : By the way, I synced with the latest CEC version last week too

pepedog commented Dec 1, 2014

Someone said CEC is working in Arch 3.14 but that they had a permission problem
http://www.solid-run.com/community/topic636-290.html#p15129
It doesn't work for me. Still haven't done libcec yet, battling with another platform

Contributor

wolfgar commented Dec 1, 2014

@smallint

thanks so much for your test app, I have compiled it and run tests on the samples provided by fritsch
First, regarding the 1080p@50fps: I fear I have been overoptimistic
When rendering is enabled I get

Set deinterlacing to OFF and render
Reading extradata with 47 bytes
[INFO]  Product Info: i.MX6Q/D/S
[INFO]  bitstreamMode 1, chromaInterleave 0, mapType 0, tiled2LinearEnable 0
Wrong chunk header
Decoding of 1960 frames took 49392 ms
Average frame decoding time: 25 ms

25ms means 40fps and not 50fps, so we definitely might have an issue here...

Regarding deinterlacing, we don't get exactly the same figures.
On a cuboxi4, with the default kernel, the file 1080i50_h264.ts gives the following:

Set deinterlacing to FORCE
Reading extradata with 48 bytes
[INFO]  Product Info: i.MX6Q/D/S
[INFO]  bitstreamMode 1, chromaInterleave 0, mapType 0, tiled2LinearEnable 0
Wrong chunk header
Decoding of 1148 frames took 47928 ms
Average frame decoding time: 41 ms

Set deinterlacing to FORCE and render
Reading extradata with 48 bytes
[INFO]  Product Info: i.MX6Q/D/S
[INFO]  bitstreamMode 1, chromaInterleave 0, mapType 0, tiled2LinearEnable 0
Wrong chunk header
Decoding of 1148 frames took 64058 ms
Average frame decoding time: 55 ms

With the kernel patched both for 64-bit burst size and VPU@352MHz, the results are:

Set deinterlacing to FORCE
Reading extradata with 48 bytes
[INFO]  Product Info: i.MX6Q/D/S
[INFO]  bitstreamMode 1, chromaInterleave 0, mapType 0, tiled2LinearEnable 0
Wrong chunk header
Decoding of 1148 frames took 44044 ms
Average frame decoding time: 38 ms

Set deinterlacing to FORCE and render
Reading extradata with 48 bytes
[INFO]  Product Info: i.MX6Q/D/S
[INFO]  bitstreamMode 1, chromaInterleave 0, mapType 0, tiled2LinearEnable 0
Wrong chunk header
Decoding of 1148 frames took 62110 ms
Average frame decoding time: 54 ms

Edit : An update

With the last kernel (64-bit burst and VPU@352MHz), if I increase the VPU priority (devmem 0x00c49100 w 7 && devmem 0x00c49104 w 7), here are the results (still with 1080i50_h264.ts):

Set deinterlacing to FORCE
Reading extradata with 48 bytes
[INFO]  Product Info: i.MX6Q/D/S
[INFO]  bitstreamMode 1, chromaInterleave 0, mapType 0, tiled2LinearEnable 0
Wrong chunk header
Decoding of 1148 frames took 41560 ms
Average frame decoding time: 36 ms

Set deinterlacing to FORCE and render
Reading extradata with 48 bytes
[INFO]  Product Info: i.MX6Q/D/S
[INFO]  bitstreamMode 1, chromaInterleave 0, mapType 0, tiled2LinearEnable 0
Decoding of 1145 frames took 57022 ms
Average frame decoding time: 49 ms

Visually, the skipped frames are barely noticeable...

Member

fritsch commented Dec 1, 2014

I have uploaded the samples @wolfgar is talking about here: http://solidrun.maltegrosse.de/~fritsch/ Those are all the relevant TV formats German TV is producing: 720p50, 576i50, 1080i50 - I also added the famous burosch1.mpg to check the deinterlacing quality. The 1080p50 file is for benchmarking the decoder in combination with the renderer.

rabeeh commented Dec 1, 2014

@wolfgar - about this -
"I have never hidden anything voluntarily from upstream or from smallint.
So please be clearer about what you are claiming here."
my bad - i meant @smallint in this discussion -
xbmc-imx6#70 (comment)

Clearly my intentions are 100% technical to solve this issue and not accuse anyone of hiding anything.

Member

smallint commented Dec 1, 2014

@fritsch Thank you, I just wanted to ask for the sample @wolfgar used. My stream is a dump from ServusTV and renders much faster than your sample on wolfgar's Cubox. Tonight I am going to test your 1080i50 streams on the Wandboard with a plain Arch kernel. Too bad that CEC isn't wired correctly on this board ...

Normally the render in kodi would skip and the decoder should drop at some point under such conditions.

The hw decoder never drops. It keeps decoding all the time to prevent artifacts. But dropped images bypass the deinterlacer and are not rendered (handled by Kodi).

@wolfgar

25ms means 40fps and not 50fps so we definitively might have an issue here...

Hehe, my fault while juggling the numbers ... you are right.

Member

fritsch commented Dec 1, 2014

Yes, we also throw away already decoded images :-) which is really expensive, but it's the last resort we have in such cases when we can't get by with render skipping. That also only makes sense when the decoder is a separate thread and has some already decoded frames in its queue.

@smallint the MBAFF sample is also from ServusTV, the other one is from Pro7 HD.

Member

smallint commented Dec 1, 2014

OK, 1080i50_h264.ts needs 32ms for decoding only. All the other numbers are very similar to @wolfgar's. It is no pity not being able to watch HD+ content ;) but from a technical point of view it is a challenge.

Member

fritsch commented Dec 1, 2014

@smallint: Check the 720p50 sample, which is "Öffentlich-Rechtliches" (German public broadcasting). Currently we are only able to get 44 fps out in xbmc, which is a real pity. There are other 1080i50 contents, like "nice" shopping TV channels, that have the same bitrate combination. So obviously we should be able to play them all, to be TV-safe.

rabeeh commented Dec 1, 2014

Here is more information on 1080i50_sample.ts from @fritsch
I found a tool that can read memory controller performance counters based on AXI bus and actual memory transactions. I don't have the sources (yet); but here is the binary -
https://dl.dropboxusercontent.com/u/72661517/tbr/mmdc_prof.zip

The tool help is -
Usage: ./mmdc_prof [-m mask] [-p period]
-m (set MADPCR1 mask value. Default: 0x0)
Example:
All: -m 0x00000000
GPU3D: -m 0x003E0003
GPU2D: -m 0x003E000B
GPUVG: -m 0x003F0022
ARM: -m 0x00060001
IPU: -m 0x00060004
VPU: -m 0x003F0013

I've got the following numbers on 1080i50_sample.ts with de-interlacing -
Kodi is able to render 17fps out of 25fps.

Below are the results with the breakdown per internal unit (IPU/VPU/GPU/ARM). The max read/write performance I've got from the memory controller was ~1.7GB/sec, while on the same platform, disabling dirty regions and going to the main menu with the RSS line, I'm able to get ~3.4GB/sec.
So clearly this is not a DDR bandwidth issue (whereas it is in the 1080p50 case).

No conclusions yet; but this tool makes it possible to measure things.

Following are profiling numbers by running ./mmdc_prof -
MMDC new Profiling results:


Total cycles count: 528068607
Busy cycles count: 510848293
Read accesses count: 30264171
Write accesses count: 15870691
Read bytes count: 1183138234
Write bytes count: 657009248
Avg. Read burst size: 39
Avg. Write burst size: 41
Read: 1128.33 MB/s / Write: 626.57 MB/s Total: 1754.90 MB/s
Utilization: 22%
Bus Load: 96%
Bytes Access: 39

VPU (./mmdc_prof -m 0x003F0013)
Read: 268.73 MB/s / Write: 81.09 MB/s Total: 349.82 MB/s

ARM (./mmdc_prof -m 0x00060001)
Read: 59.46 MB/s / Write: 19.56 MB/s Total: 79.02 MB/s

GPU (./mmdc_prof -m 0x003E0003)
Read: 378.34 MB/s / Write: 386.19 MB/s Total: 764.52 MB/s

IPU (./mmdc_prof -m 0x00060004)
Read: 431.52 MB/s / Write: 71.88 MB/s Total: 503.40 MB/s
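To make the figures above reproducible, here is a sketch of how mmdc_prof's derived values follow from the raw counters. The 528 MHz MMDC clock and the 16 bytes moved per busy cycle are assumptions on my part (the tool's sources aren't available), but they reproduce the numbers above within rounding:

```python
# Reconstruction of mmdc_prof's derived figures from the raw MMDC
# counters. Assumed (not confirmed by the tool's sources): a 528 MHz
# MMDC clock and a 64-bit DDR interface moving 16 bytes per busy cycle.
MMDC_HZ = 528_000_000
BYTES_PER_BUSY_CYCLE = 16  # 64-bit DDR3: 8 bytes x 2 transfers per cycle

total_cycles = 528_068_607
busy_cycles  = 510_848_293
read_acc,   write_acc   = 30_264_171, 15_870_691
read_bytes, write_bytes = 1_183_138_234, 657_009_248

period_s  = total_cycles / MMDC_HZ                  # ~1 s sample window
read_mbs  = read_bytes  / period_s / 2**20          # ~1128 MB/s
write_mbs = write_bytes / period_s / 2**20          # ~626 MB/s
bus_load  = 100 * busy_cycles / total_cycles        # ~96 %
utilization = 100 * (read_bytes + write_bytes) \
              / (busy_cycles * BYTES_PER_BUSY_CYCLE)  # ~22 %
avg_read_burst  = read_bytes  // read_acc           # 39 bytes
avg_write_burst = write_bytes // write_acc          # 41 bytes
```

The interesting point is that "Bus Load: 96%" with "Utilization: 22%" means the bus is busy almost all the time but moves few bytes per busy cycle, which is exactly what short bursts would cause.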

rabeeh commented Dec 1, 2014

Looking more at the data, one important field to check is also the average read/write burst size per unit.
Currently, without setting the IPU burst size for instance, I can see that it has a fixed 16-byte write burst to the memory controller (I guess those writes are the actual de-interlacing output) -
IPU (./mmdc_prof -m 0x00060004)
Avg. Read burst size: 45
Avg. Write burst size: 16
Read: 430.85 MB/s / Write: 71.31 MB/s Total: 502.16 MB/s

This can lead to very low utilization of the DDR bus, since it goes out with transactions with a burst of 16 bytes instead of 64 bytes.

Member

smallint commented Dec 1, 2014

@fritsch Your ServusTV sample (mbaff):

Set deinterlacing to OFF
Decoding of 660 frames took 11154 ms
Average frame decoding time: 16 ms

Set deinterlacing to FORCE
Decoding of 660 frames took 15918 ms
Average frame decoding time: 24 ms

Set deinterlacing to OFF and render
Decoding of 660 frames took 16782 ms
Average frame decoding time: 25 ms

Set deinterlacing to FORCE and render
Decoding of 659 frames took 24504 ms
Average frame decoding time: 37 ms

Member

fritsch commented Dec 1, 2014

@smallint That's nice and would be enough. Btw, the frame doubling the API provides would hit the renderer unbelievably hard, right?

Member

smallint commented Dec 1, 2014

@fritsch ... and 720p50_h264:

Set deinterlacing to OFF
Decoding of 1554 frames took 11279 ms
Average frame decoding time: 7 ms

Set deinterlacing to FORCE
Decoding of 1554 frames took 14316 ms
Average frame decoding time: 9 ms

Set deinterlacing to OFF and render
Decoding of 1551 frames took 24112 ms
Average frame decoding time: 15 ms

Set deinterlacing to FORCE and render
Decoding of 1551 frames took 31140 ms
Average frame decoding time: 20 ms

15ms is awesome and more than enough for 50fps. My Wandboard renders this stream in Kodi with 50 fps.

Member

fritsch commented Dec 1, 2014

Set deinterlacing to OFF and render
Decoding of 1551 frames took 24112 ms
Average frame decoding time: 15 ms

That's awesome! But yeah, compare that to internal xbmc rendering; that's something to cry about: it can barely manage 44 fps.
Edit: Using a Cubox 4Pro

Member

smallint commented Dec 1, 2014

That's awesome! But yeah, compare that to internal xbmc rendering; that's something to cry about: it can barely manage 44 fps.

No, my Kodi renders this stream at 50 fps. But regardless of my or your numbers, do you think someone would open that box and attack this issue inside Kodi? So many systems with different configurations are out there, and Kodi already does a fantastic job of running on all of them. But maybe this little platform is an opportunity to optimize further, to the benefit of all the others.

Member

fritsch commented Dec 1, 2014

I found the numbers @rabeeh gave before quite interesting, because they open up another avenue for optimizing the architecture and the kernel drivers.

Contributor

wolfgar commented Dec 1, 2014

Hi

@rabeeh thanks a lot for this very interesting mmdc_prof tool for profiling bus usage
Before discovering this tool, I used bw_mem (from lmbench) to get an idea of the memory bandwidth remaining available to the CPU (the MPCores' priority on the bus is only 2, so it gives a fair idea).
Generally speaking, read bandwidth from memory to CPU is about 1100MB/s at best, and running the benchmark during rendering + deinterlacing definitely shows that globally about 1/3 of the memory bandwidth is still available (~400MB/s).
This result seems consistent with mmdc_prof.
Yet for the 1080p50 case, the results do not line up: I guess you say there is a bandwidth issue because of the mmdc_prof figures, but conceptually there is something weird, isn't there?
Progressive decoding + rendering @ 50fps should not load the bus more than interlaced (in the interlaced case we have the extra step of sending fields to the IPU and grabbing the frame back, even if we only have one field every 20 ms and one "full" frame every 40 ms).
When I test the "remaining" bandwidth with bw_mem, I also get more than 550MB/s when decoding + rendering the 1080p50 file...
So I am dubious about the mmdc_prof figures for that one. Do you fully understand them?

Regarding IPU write throughput, it is definitely true that having a look at increasing the write burst size would be very interesting. Yet as I wrote in this thread, the deinterlacer max output rate is 100Mpixels/sec (stated in §37.4.11.7 of the i.MX6Q RM).
In YUV420 format that would mean a maximum write throughput of 100*6/4=150MB/s, wouldn't it?
So we cannot expect too high values for IPU writes, or am I misunderstanding something?
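Spelled out numerically (using the 1.5 bytes/pixel YUV 4:2:0 figure and the RM's stated 100 Mpixel/s deinterlacer output rate):

```python
# VDIC throughput ceiling vs. actual demand, per the numbers quoted
# in this thread. YUV 4:2:0 carries 6 bytes per 4 pixels = 1.5 B/px.
VDIC_MPIX_PER_S = 100e6          # RM §37.4.11.7 deinterlacer output rate
BYTES_PER_PIXEL_YUV420 = 6 / 4   # 1.5

max_write_mb_s = VDIC_MPIX_PER_S * BYTES_PER_PIXEL_YUV420 / 1e6   # 150 MB/s

# Demand for deinterlaced 1080p25 output:
demand_mb_s = 1920 * 1080 * 25 * BYTES_PER_PIXEL_YUV420 / 1e6     # ~77.8 MB/s
```

So the ceiling is about double the demand; the mean rate is not the problem, only the burst size is.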

Member

smallint commented Dec 1, 2014

Sorry for the probably stupid question: is there a guide somewhere on how to cross-compile Kodi with e.g. Yocto? My old recipe does not work anymore, and adjusting it to the one from Arch leads to complaints about "missing FFMPEG". I think things regarding cross compilation changed: any advice on how to proceed? I would like to test this on my "magic" kernel ;)

rabeeh commented Dec 1, 2014

Below are observations for my hacking today.
The following is a patch for the kernel to add vpu@352mhz (thanks @cmichal2 )
https://dl.dropboxusercontent.com/u/72661517/tbr/0001-VPU-at-352mhz-support-on-LK-3.14.x.patch

I'm not very happy with this patch since it removes the ability to lower the frequency of the processors (lately i was able to run HummingBoard with the quad processor completely without any heatsink).

The above patch fixes 720p50_h264.mp4 and 1080i50_h264_mbaff.mp4.

1080p50_recode.mkv is still not fixed; clearly a memory bandwidth issue, due to two reasons -

  1. One suspected item (jnettlet pointed this out) is that we are hitting the ERR004300 GPU3D L1 cache performance drop errata, since the VPU frame is 1920x1088, which makes GPU accesses to it not 16-byte aligned, so each transaction is broken into two.
  2. The GPU takes 1.5GB/sec out of the 2.4GB/sec total memory bandwidth. Unclear if it's shader optimization or blending or something else. But the 1080p50 is being rendered at 37fps and consumes 1.5GB/sec from the GPU, which is insane.

Notes -
@wolfgar - "Progressive decoding + rendering @ 50fps should not load the bus more than interlaced" - I think the GPU side is really not optimized at all. What I'm seeing is that when the GPU is used more (i.e. 50fps progressive vs. 25fps de-interlaced), much more memory bandwidth is required.

"the deinterlacer max output rate is 100Mpixels/sec .... write throughput of 100*6/4=150MB/s" - that should be more than enough. Actually I've measured the write throughput on the 1080i50 and it's around 100MB/sec.
I'm not worried about the 100MP/sec, since the output of the VDI would be the 25fps progressive content, which is 1920x1080x25fps = ~51MP/sec. What I'm really worried about is that the VDI out channel is limited to 16-byte memory bursts, and this totally abuses the memory controller -
https://github.com/SolidRun/linux-imx6-3.14/blob/linux-linaro-lsk-v3.14-mx6/drivers/mxc/ipu3/ipu_common.c#L1292

The abuse comes from the fact that accessing the DDR with small bursts is extremely inefficient and wastes bus cycles. And looks like this 16 bytes is a limitation in the implementation itself.

On the table now -

  1. a way to provide higher QoS to the IPU - I need to integrate it either in u-boot or in the kernel.
  2. check how to fix the 1920x1088 issue with the VPU frame output. Blending from that frame has a penalty for sure (split DDR transactions).
Contributor

wolfgar commented Dec 2, 2014

Thanks a lot for your answers @rabeeh.

  • Your point regarding insane GPU bandwidth @50p is very true.
    Sorry, I am not perfectly clear on why it is obvious that our picture out of the VPU involves accesses which are not aligned on 16 bytes. At least our Kodi code takes care of providing 16-byte-aligned buffers (maybe we forgot something?).
    You speak about split DDR accesses, but why wouldn't it be accessed the nice way (from a 16-byte boundary and then in multiples of 16 bytes), which would not incur the penalty of the cache errata?
  • Regarding the IPU, I wrote "it is definitely true that having a look at increasing the write burst size would be very interesting", so we are in line, no problem.
    I only stated that we cannot expect more than 150MB/s, that's it...
    My underlying message was not that the DDR access is optimal but only that the mean value seems fine, as we require about 74MiB/s for 1920x1080x25fps (we have 1.5 bytes per pixel in our format).
    Could we bypass the VDIC out channel's limitation to 16-byte burst memory accesses by sending the VDIC output to the IC (image converter), issuing a dummy transformation (no change), and writing back to memory from the IC, which I hope has no such limitation? It is cumbersome, but it could be a path that works around this annoying limitation, don't you think? Of course it only makes sense if the output of the VDIC can be sent to the IC through IPU internal memory; I am unsure about that.

Even if the VPU @352MHz and the increased VPU priority have a positive effect and are enough for some streams, it is not yet enough for all streams ...
But in the end, thinking about it, I wonder whether the biggest issue is not the same as the one at play for streams @50p: I mean abnormal bandwidth usage by the GPU. Don't you think?
(even with 25fps it consumes 1GB/s ...)

I really hope @smallint can bisect the diff with his wandboard configuration because it seems to be a key...

Kind regards

Member

smallint commented Dec 2, 2014

Interesting discussion and information, thanks.

At least our kodi code takes care of providing 16 bytes aligned buffers (maybe we forgot something ?).

I cross-checked and it is the case, but it depends on nAddressAlignment returned by the VPU. I haven't checked whether the returned values are ever != 16. Anyhow, checking again won't hurt.

Regarding GPU performance: either in my test app or in Kodi itself you can shrink the output quad to e.g. 50% and you will see that streams render smoothly which did not before. Reducing the GPU load helps, but I also tested drawing a full-screen quad while mapping only half of the texture (vertically) to it. The speed also improved significantly. I presumed that this is due to lower utilization of color space conversion and fewer IPU memory reads. My conclusion at that time was that the system itself needs to be balanced; fixing just one part is not enough. That very high GPU memory bandwidth usage you were talking about: is it reads, writes, or combined? The 1080i50 test you made resulted in a total GPU bandwidth of 764.52 MB/s, right?

Member

smallint commented Dec 2, 2014

@wolfgar

is at play for stream @50p : I mean an abnormal bandwidth usage from GPU : Don't you think ?

As I understood the GPU bandwidth is ~ 760 MB/s for 1080i50 and for 1080p50 it is much higher, am I right? In the latter case we are sending 50 decoded buffers down to the GPU instead of 25 which doesn't sound insane in terms of memory bandwidth and a factor of 2 seems to be OK. But maybe I got the above numbers wrong.

Can you post a comparison of 1080i50 and 1080p50 please?

Member

smallint commented Dec 2, 2014

I really hope @smallint can bisect the diff with his wandboard configuration because it seems to be a key...

The kernel my Wandboard Yocto build uses is https://github.com/smallint/meta-jabe/blob/master/recipes-jabe/linux/linux-wandboard_3.0.35.bbappend. I am going to check again for any local modifications later. I definitely enabled 352M which is not reflected in the committed config.

Contributor

wolfgar commented Dec 2, 2014

Hi smallint
Just a quick answer: indeed, I mean the ~800 MB/s of GPU bandwidth (balanced between 400 read and 400 write); it is more than 5 times the bandwidth required to just read and write our texture. Does this factor seem fine?


Member

smallint commented Dec 2, 2014

OK, I got it. What I can imagine:

  • IPU NV12 -> GPU RGB 1.5 R, 3 W
  • fragment shader RGB 3 R 3 W
  • GPU -> DP 3 R 3 W

Total: 7.5 R + 6 W bytes per pixel = 388.8 MB/s read + 311.04 MB/s write

Not sure if it works that way internally.

EDIT: If depth buffer is active it would add to those numbers. Bilinear filtering also needs more read access. I also assume that all internal color values are RGB and not RGBA.

EDIT2: Can you check the numbers with the test app? Here we have more control and knowledge of what is done at GPU level.
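The per-stage byte counts above are smallint's assumptions, but the arithmetic can be checked mechanically. One caveat: the listed stage writes only sum to the stated 6 W total if the final scan-out stage is taken as read-only, which is the assumption made here:

```python
# Back-of-envelope GPU traffic for 1080p25, following the per-stage
# (read, write) bytes-per-pixel estimates from the thread. Assumption:
# the display scan-out stage only reads, which reconciles the stage
# list with the stated 6 W/px total.
PIX_PER_S = 1920 * 1080 * 25   # 51.84 Mpixel/s

stages = {
    "IPU NV12 -> GPU RGB": (1.5, 3),  # (read, write) bytes per pixel
    "fragment shader RGB": (3, 3),
    "GPU -> display":      (3, 0),    # scan-out assumed read-only
}
reads  = sum(r for r, _ in stages.values())   # 7.5 B/px
writes = sum(w for _, w in stages.values())   # 6 B/px

read_mb_s  = PIX_PER_S * reads  / 1e6   # 388.8 MB/s
write_mb_s = PIX_PER_S * writes / 1e6   # 311.04 MB/s
```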

pepedog commented Dec 2, 2014

I have just packaged (for myself for now) against this branch featdeintrework-pr
Things still to test, but I had to (re)introduce --disable-openmax
Is this the best branch to use?

Member

smallint commented Dec 2, 2014

@wolfgar

I really hope @smallint can bisect the diff with his wandboard configuration because it seems to be a key...

I checked the test app on my "magic" setup and it did not perform better than your posted numbers. I could not test graphics because I chrooted into the Arch installation with the old kernel, and even replacing the VIVANTE driver and GLES libs did not help to produce some colors on the screen. But anyhow, I don't think that kernel is so much faster than your latest one with any of the optimizations you came up with. It decoded and deinterlaced the 1080i50 stream (ProSieben HD) in 40ms (no rendering). Probably there is no key in that installation ;) I will check possible optimizations at the app level.

Contributor

wolfgar commented Dec 2, 2014

OK @smallint, thanks a lot for your results, even if they are a little disappointing: then it would be all about always staying on the safe side with very little margin?

I have just thought about your previous post about GPU bandwidth, and maybe we could save the GPU CSC step?
Indeed, if we confirm that we can output the VDIC directly into the IC (instead of using VDIC output IDMAC channel 5 with its limitation to memory) and then use IC output channel 21 to write to memory without the drawback of 16-byte-only bursts, we could also use the IC to do the CSC on the fly without additional delay (as the IC is faster than the VDIC) and get an RGB deinterlaced frame for the GPU.

If it makes sense, the current figure, which is
out of IPU 1.5 W, then NV12 -> GPU RGB 1.5 R, 3 W by the GPU,
would become:
out of IPU 3 W, then ready for the GPU shader step.
So we would save 1.5 W + 1.5 R per frame on the bus and would get rid of the short bursts out of the IPU.

Hmm, time to tweak the IPU driver to check whether this is pure fantasy or if it can work that way.. ;)
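The savings estimated above can be sketched numerically (the per-stage bytes/pixel at the deinterlacer/GPU boundary are wolfgar's estimates, not measurements):

```python
# Bus traffic at the deinterlacer/GPU boundary, in bytes per pixel,
# comparing GPU-side CSC (current) with IPU/IC-side CSC (proposed).
# Per-stage costs are the estimates quoted in the thread.
current  = {"IPU writes NV12": 1.5,
            "GPU reads NV12":  1.5,
            "GPU writes RGB":  3.0}
proposed = {"IPU writes RGB":  3.0}   # GPU gets RGB directly, no CSC pass

saved_b_per_px = sum(current.values()) - sum(proposed.values())  # 3.0 (1.5 R + 1.5 W)

# At 1080p25 this amounts to:
saved_mb_s = 1920 * 1080 * 25 * saved_b_per_px / 1e6             # ~155.5 MB/s
```

On top of the raw MB/s saved, the short 16-byte VDIC bursts would disappear, which is arguably the bigger win given the bus-load figures earlier in the thread.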

Member

smallint commented Dec 2, 2014

@wolfgar that means we need to allocate twice the buffer size for deinterlaced frames. I do not know if we have enough memory for that. I can't really follow the VDIC bypass ;) Wouldn't that mean that we just set the output format of the IPU task to RGB?

EDIT: glTexDirectVIVMap does not support RGB as input buffer format as far as I could see from the format enumeration.
EDIT2: Forget about the above, we can use standard enums ...

Contributor

wolfgar commented Dec 2, 2014

@smallint : well, let's say that DMA buffer size is yet another burden we can deal with (especially since, now that we moved to the modern 3.14 kernel with CMA available, it should be easier to reserve memory for DMA and not be tied down by the hardcoded DMA zone and its allocator).

Yep, in fact I guess that if you set the output format of your IPU task to RGB, it could behave exactly as I describe! In fact, I have just checked and there is already a path available (MEM_VDI_PRP_VF_MEM) that could do just what I describe. Maybe not much tweaking is required ;) ...

Member

fritsch commented Dec 5, 2014

Hehe :-) I thought exactly the same while typing. Normally "Double" is the default, therefore we can drop the "double" suffix and just have VS_INTERLACEMETHOD_IMX_BOB and VS_INTERLACEMETHOD_IMX vs VS_INTERLACEMETHOD_IMX_BOB_HALF and VS_INTERLACEMETHOD_IMX_HALF,

and make VS_INTERLACEMETHOD_IMX_HALF the default.

I thought of those when thinking about double rate:

    /* deinterlace frame rate double flags */
#define IPU_DEINTERLACE_RATE_EN     0x80
#define IPU_DEINTERLACE_RATE_FRAME1 0x40
#define IPU_DEINTERLACE_RATE_MASK   \
        (IPU_DEINTERLACE_RATE_EN | IPU_DEINTERLACE_RATE_FRAME1)
#define IPU_DEINTERLACE_MAX_FRAME   2

But yeah in the render it would be a param to the shader.

Thanks again!

Member

smallint commented Dec 5, 2014

I think the VDI splitting itself is not perfectly implemented in the kernel, and that's why I bypass it in userland. We need to confirm that kernel issue and find a solution for it. Then all implementations (gstreamer, Qt5) will benefit from it. I removed that commit from my branch again since I don't want it to find its way, in its current state, into any builds. If the kernel is ever fixed in that regard, it won't be needed anymore.

Member

fritsch commented Dec 5, 2014

@rabeeh Can you comment on the kernel side? What happens internally, are you doing some "temporary allocs" for saving parts of the too-wide surface?

@smallint Yeah! Getting those out again would be a huge pain, because some will just pick it up and use it "as it works" without any interest in solving the real issue.

I am still a bit baffled about the 34 ms the renderer currently needs. Most likely I did not fully understand what exactly you measure there. For just swapping a buffer it sounds really, really slow. Could you explain that again?

Member

smallint commented Dec 5, 2014

The double rate thing is handled in the decoder; it just makes another buffer available to the player. An option somewhere saying that it should be used is enough. You can even use double rate for progressive content if you want; probably not with the IPU deinterlacer, but in the renderer itself. That's why I think double rate should not be tied to deinterlacing.

Member

fritsch commented Dec 5, 2014

I think we are talking in slightly different directions. When I think of double rate, I don't think of just doubling an already existing progressive frame. I indeed thought of bobbing from an interlacing point of view: outputting one frame for every bottom field and another one for every top field.

But most likely this named option won't do what I had in mind.

Edit: And yeah, my question concerning the renderer here: I am not sure we can get that time down to 20 ms?

Member

smallint commented Dec 5, 2014

The IMX double rate computes an interpolated frame between frame1 and frame2. So you will get 50 frames out of 25 frames. But actually this is what most TVs do nowadays with their motion enhancers (soap effect). So probably not necessarily very important at this stage.

Can you comment about the kernel side? What happens internally, are you doing some "temporary allocs" for saving parts of the too width surface?

No, there are no allocs. I know what the kernel is doing: it is not splitting the stripes correctly and reduces the DMA burst to 8 instead of 16, because the widths of the stripes are no longer multiples of 16. The question is: why does it split that way?

I am still a bit buffled about the 34 ms the render currently needs. Most likely I did not fully understand what you exactly measure there. For just swapping a buffer it sounds really, really slow. Could you explain that again?

Only swapping a buffer? It is decoding, deinterlacing and rendering. All that takes 34 ms. It is like:

time = get_time()
decode(buffer)
deinterlace(buffer)
render(buffer)
print get_time()-time

If it were "really, really slow" then we would not be able to render at 25fps, don't you think? The 34 ms leaves 6 ms per frame for Kodi.

Member

fritsch commented Dec 5, 2014

Ah! Thanks for that. This makes the 34 ms much easier to appreciate, because that way we can get some speed out of it by doing things in parallel, maintaining some buffers in the decoder, deinterlacer and render stages.

I'd like to help more on that topic, but yeah ... everyday's job is currently too demanding to get anything done in my freetime.

Edit: 34 * 25 < 1000 - I thought of a parallel architecture.

Member

smallint commented Dec 5, 2014

Cause that way, we can get some speed out of it, when doing something in parallel, maintaining some buffers in decoder, deinterlacer, render stage.

The 34 ms is already for the parallelized version. Decoding, deinterlacing and rendering are done in their own threads. This is already the maximum speed we get so far. The test app uses the Kodi IMX6 codec implementation and mimics the player and render threads of Kodi.
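The structure described here (decode, deinterlace and render each in their own thread, connected by small buffer queues) can be sketched like this. It is a toy illustration of the pipeline shape, not the actual codec code; the stage bodies and queue depths are placeholders:

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker, passed down the pipeline

def stage(work, inq, outq):
    """Pull items from inq, transform them, push to outq; stop on sentinel."""
    while True:
        item = inq.get()
        if item is SENTINEL:
            outq.put(SENTINEL)
            return
        outq.put(work(item))

frames  = queue.Queue(4)   # compressed input
vpu_out = queue.Queue(4)   # decoded fields
ipu_out = queue.Queue(4)   # deinterlaced frames
screen  = queue.Queue()    # "rendered" output, drained by the main thread

stages = [
    threading.Thread(target=stage, args=(lambda f: f + "|vpu", frames,  vpu_out)),
    threading.Thread(target=stage, args=(lambda f: f + "|ipu", vpu_out, ipu_out)),
    threading.Thread(target=stage, args=(lambda f: f + "|gpu", ipu_out, screen)),
]
for t in stages:
    t.start()

for i in range(5):
    frames.put(f"frame{i}")
frames.put(SENTINEL)

out = []
while (item := screen.get()) is not SENTINEL:
    out.append(item)
for t in stages:
    t.join()
```

The bounded queues are what let the three stages overlap in time: each stage can run up to 4 items ahead of the next, so the per-frame wall time approaches the slowest stage rather than the sum of all three.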

Member

smallint commented Dec 5, 2014

I'd like to help more on that topic, but yeah ... everyday's job is currently too demanding to get anything done in my freetime.

That is actually an issue for all of us. What I am missing, while we talk here a lot about possible implementations, is someone saying: "yes, I can do that and will look into it" or "no, I have no time at all for that stuff and you need to figure it out yourself". Eventually we have to implement it, and I would like to know if we can get support from you guys (guidance/implementations) or not. Even if not, that is a statement and I know how to plan my further activities, no problem. My main concerns are about changing internals like enums or translation IDs, using shaders in a wrong way, and so on.

Member

fritsch commented Dec 5, 2014

Then here to make it official: I currently don't have time to help in detail, but I hope that will change in the near future.

A lot of devs, @FernetMenta for example, are looking at this PR, and a ping will always get a response. Especially with the shaders: once the first patches arrive here in small, manageable parts, you will get help of course.

Half of this thread went off into "internals" of kernel, GPU, erratas and documentation none of us had a clue about before. So if we now concentrate back on the Kodi integration it will get better. Do not hesitate to ping when parts affecting the renderer or other architecture arrive.

Member

smallint commented Dec 5, 2014

Very good, thanks. I think we can move the kernel noise to private discussions with affected devs or back to xbmc-imx6. I will only talk about Kodi integration itself here. @wolfgar, do you agree?

Contributor

wolfgar commented Dec 5, 2014

@smallint I totally agree that we should open another thread to talk about the kernel changes we would like to push in order to get a fix in the right place.
Yet I would not recommend xbmc-imx6, as development has to happen upstream.

@fritsch would it make sense that we open an issue to speak about these imx6 kernel changes ? Or is there any other place that would be better fitted ?

Member

fritsch commented Dec 5, 2014

@wolfgar:
Here on GitHub we only use the issues for pull requests. What about opening a forum thread in the public development section of the forum? Though we will most likely need a moderator who directly deletes off-topic messages, or we will be spammed with "how to compile" questions and bug reports.

Btw. having the discussion on an xbmc-imx6 issue is not a problem at all, as the final code will go in here (xbmc github) anyway. So up to you. Just open whatever you find best and put a link here.

Owner

koying commented Dec 5, 2014

+1 for the forums. Github is not very friendly for discussions.

Member

smallint commented Dec 5, 2014

To be pedantic: the kernel issue is not related to Kodi at all. It also affects gstreamer and friends. Actually the Freescale forum would be the right place, wouldn't it? ... I am fine with any place.

Owner

koying commented Dec 5, 2014

Well, as long as the goal is to have a better Kodi, I don't see a problem in having kernel discussions in the Kodi forums. But yeah, wherever getting enlightened help is the most probable...


Member

smallint commented Dec 5, 2014

I am going to clean up the patch tonight and push it into this branch. Then you can start testing it and report back. The good thing is that it won't break things even if the kernel is fixed. I will be off for some days and will check back later wherever the discussion continues.

Member

smallint commented Dec 5, 2014

Patch committed. If you are testing it and still have some performance issues, please check your uEnv.txt and make sure that it contains:

if=RGB24,bpp=16

to use a 16-bit framebuffer. Otherwise the performance is worse. My Wandboard even handles 32 bpp nicely, but I am not sure about the Cubox. To be tested ... have fun

cmichal2 commented Dec 6, 2014

A quick comment with respect to power usage and VPU@352MHz. I plugged my cubox-i4 power supply (the 3A 5V supply sold by Solid Run) into a kill-a-watt style power meter. All measurements made running xbmc gotham.

without the VPU352 patch:
idle: 2W, when decoding HD video (1080i60): 3.5 W (flickers between 3 and 4W).

with the VPU352 patch:
idle: 2W, decoding HD video: 4W.

It's true that enabling this patch means disabling frequency scaling, but that doesn't appear to mean no power savings when idle. This patch has been referred to a couple of times here as 'overclocking the VPU', but running the VPU at 352 MHz is specified in the chip datasheet and is a supported feature. The difference in power consumption seems pretty minimal to me. Freescale includes kernel patches to enable it. I don't know why it wasn't in 3.14, but I suspect Freescale hadn't yet added the feature to their 3.10 kernel when the work was done to support these devices upstream.

Member

smallint commented Dec 12, 2014

@cmichal2 Thank you for the figures and clarification. I will try to avoid calling the 352 MHz mode overclocking from now on. ;)

Has anyone tested the codec patch? Has the discussion continued somewhere else? Have a nice weekend ...

Member

fritsch commented Dec 12, 2014

@smallint discussion stopped during your holidays. I made some tests with my Cubox i4 Pro with those patches, the 16-bit framebuffer and OpenELEC. I now get around 21 to 23 fps, so just a tiny bit too slow to keep audio and video from drifting apart.

Nice weekend!

Member

smallint commented Dec 12, 2014

@fritsch with which stream? With my Cubox I get smooth playback with 25fps on ServusTV (e.g. your mbaff sample). Haven't tried 32bpp yet. What kernel are you running on? I am running my stuff on Archlinux with LK 3.10 and 3.14.

Contributor

wolfgar commented Dec 13, 2014

@smallint: sorry, I wanted to write a detailed status (including our latest private exchanges regarding the benefit of 16-byte alignment after additional tests) as a first post to initiate the thread about possible kernel improvements, but I had a terrible week.

Member

fritsch commented Dec 13, 2014

@smallint This is running a 3.14 kernel with a whole lot of patches (most are not imx6 related): https://github.com/OpenELEC/OpenELEC.tv/tree/openelec-5.0/projects/imx6/patches/linux
with this kernel config: https://github.com/OpenELEC/OpenELEC.tv/blob/openelec-5.0/projects/imx6/linux/linux.arm.conf

Is there something missing @sraue should add?

Member

smallint commented Dec 13, 2014

@fritsch showing the overlay with the stats ('o') reduces the framerate significantly in particular with LiveTV. I open the overlay only from time to time to check fps quickly and trust my eyes otherwise. And I also use my test app. Having the fps in the logs would sometimes help but only during testing. Are there other options without drawing expensive overlays?

Member

fritsch commented Dec 13, 2014

Yeah, that's known of course. I do the same. I judge from video and audio going out of sync and dvdplayer doing resyncs.

Member

fritsch commented Dec 13, 2014

Here are my detailed tests with OpenELEC RC2, which includes this PR for testing.

  • 576i50_mpeg2_samples.ts: Rock solid 25 fps (+)
  • 720p50_h264: Really nice; briefly pressing "o" reveals that 50 fps are working fine (+)
  • 1080i50_h264.ts: Running between 20 and 23 fps. Not watchable, as audio and video slowly drift apart. (-)
  • 1080i50_h264_mbaff.mp4: All fine (25 fps) when pressing "o" from time to time. (+)
  • 1080i50_mpeg2 (it's really 29.97 fps): 24 fps, a lot of skipping (-)
  • 1080p50_h264.mkv: Getting around 40 fps. Audio going out of sync (which is quite expected). (- not really relevant)
  • burosch1.mpg: Rock solid fps-wise, but the quality looks quite bad (we cannot do anything about it) (+-)

Common issue:
when "Adjust refresh rate to match video" is enabled, the screen is likely to stop updating; audio continues, but video is not visible. This is not related to this PR, it happens with mainline also. So I set the refresh rate to a fixed 50 Hz to test the above samples. It seems like something in the renderer is not torn down correctly while the switching is going on.
artifacts-burosch

Member

smallint commented Dec 13, 2014

The bad deinterlacing quality is due to LOW MOTION being active for SD. With HIGH MOTION it looks very good. It is either still a setup problem with the buffers for LOW MOTION or the algorithm itself. I will check that.

Member

fritsch commented Dec 13, 2014

That sample is very demanding. Intel on SNB and IVB is not able to do it correctly (yet), so I don't think it is a programming error here; it is algorithm dependent.

AMD Temporal:
artifacts_temporal

Intel on IVB / SNB:
burosch1-madi

Member

fritsch commented Dec 13, 2014

The only deinterlacers that have really good quality on those samples (in OSS world) are Intel HSW with MCDI and Yadif CPU deinterlacer. I would not care about that as it is really academic.

Member

smallint commented Dec 13, 2014

How do you create screenshots in Kodi and where are they saved? ;)

Member

fritsch commented Dec 13, 2014

Just press "printscreen" or use: xbmc-send -a "TakeScreenshot"

It asks you for the saving location the first time you do it. Btw. one of the above screenshots was produced with a gstreamer chain, because the Intel guys did not believe the xbmc screenshots :-)

Member

smallint commented Dec 13, 2014

Check this out, looks good, doesn't it?

screenshot000

Member

fritsch commented Dec 13, 2014

That's great! MCDI quality. Nice. Is this with HIGH MOTION active? What about the performance, are we fast enough to enable that for SD content?

Member

smallint commented Dec 13, 2014

Yes, HIGH MOTION, much faster than LOW MOTION and easily doable with SD. You just need to set lowMotion = ... to false. We need settings for that ;)
The same deinterlacer is used for HD, which now works better. I propose to always use HIGH MOTION in automatic mode for the time being.

Member

fritsch commented Dec 13, 2014

Yeah, +1 to that. What do you mean with "works now better"?

Edit:
Found it:

bool lowMotion = (inputBuffer->iWidth < 1024) && (inputBuffer->iHeight < 1024);

So yeah, just set it to false always.

Member

smallint commented Dec 13, 2014

What do you mean with "works now better"?.

Regarding the custom splitting with respect to the initial state of this PR. You did not miss anything! ;)

Member

smallint commented Dec 13, 2014

1080i50_mpeg2

I took your sample called 1080i59.94 and played it smoothly at 30 fps, although the content is ... what is it actually? 😕

That's great! MCDI quality

Now you know why we are eager to optimize towards using that deinterlacer instead of simple bobbing.

Member

fritsch commented Dec 13, 2014

Lol, I have known that for a while, as LiveTV is my primary use case for xbmc; during the last 4 years I have seen many of those algorithms coming, especially in VAAPI, which developed very slowly until VPP appeared with BOB for a year, then MADI and MCDI, which are quite awesome.

The idea to get real deinterlacing working (e.g. taking more frames into account) is clearly the goal. BOB was meant as a simple alternative for "current users".

You know, I have also seen a lot of good hardware dying; the biggest death was the AMD E series, very good hardware without real driver support by AMD. Now, 3 years later, we can start to really use it ... I don't want to see that happen again, therefore I am in favor of getting working code, in the sense of: gets the job done for current(!) users' content.

After your breakthrough with the surface splitting, we are probably 10% (speed) away from getting it done really, really well, with a bit of extra power on top to make it more robust.

Member

smallint commented Dec 13, 2014

I have been working on double rate in the meantime and pushed that. But I needed to disable it for now because it introduces oscillating scanlines into the final image. I hoped that recent kernels had fixed that issue, but it seems to still be there. Anyway, the code for double rate and low motion is there, but both are deactivated and only high motion is used for now, which seems to work fine.

Member

fritsch commented Dec 13, 2014

"Experts" are of different opinion concerning the "double rate" feature. Especially early 400 Hz TVs calculated insane movements as in-between images. So mostly we advise our users to just disable all that motion functionality on the TV.

Where do you see the primary use case?

Member

smallint commented Dec 13, 2014

In recovering 50 fps from Xi50 streams which are currently rendered at 25 fps. It is not really the motion enhancement I was referring to earlier (I don't like that either) but deinterlacing as you would do it when working on single fields instead of full frames.

Member

fritsch commented Dec 13, 2014

Ah! Oki, then I was right after reading the documentation, but I let myself be convinced by your findings. Cool. Do you already have some performance figures?

How much fps can we actually render?

Member

smallint commented Dec 13, 2014

How much fps can we actually render?

50 with SD (except this burosch sample, which is quite strange). HD does not work with the GPU. Raw decoding + deinterlacing + double rate deinterlacing takes around 5 ms per frame on average (SD). But there is either still a kernel issue or I am doing something wrong with the buffer setup. You can test it by enabling it in the code (doubleRate = ...) and trying it with e.g. N24 with the scroller at the bottom.

Member

smallint commented Dec 13, 2014

but I let myself be convinced by your findings.

Sorry for the confusion ... I mixed things up ... too much input at the time.

Member

fritsch commented Dec 13, 2014

You are doing an awesome(!!) job. And the people over at SolidRun should pay you ... cause you are basically fulfilling their "TV" marketing promise.

Member

smallint commented Dec 13, 2014

@fritsch thank you, but I am just having some fun coding on this box. ;) But yes, I need to ask them for another Cubox to test with; the current one is already in "operation" and it is hard to hijack it for development. And all the real work on this has been done by @wolfgar. He deserves "payment".

Owner

Memphiz commented Dec 13, 2014

guys what is the issue with moving the discussion to the forum?

Member

smallint commented Dec 13, 2014

I have no account there ;)

OK, can do that. If someone creates a topic I will jump on.

Member

fritsch commented Dec 16, 2014

Now with the xbmc to src renaming, this needs a major "rebase". The easiest method (no kidding): use git format-patch HEAD~27 (the number of commits).

Afterwards replace /xbmc with /src in the patches and use git am to apply them again. Funny, isn't it?

Member

smallint commented Dec 16, 2014

Now with the xbmc to src renaming, this needs major "rebase". The easiest method to do so (no kidding), use git format-patch HEAD~27 (number of commits)

What is wrong with "git merge master" and "git rebase master"? I did that and pushed the changes. The old branch is backed up as featdeintrework-pr-beforerename.

Member

fritsch commented Dec 16, 2014

Would save you a step. But also fine.

Member

fritsch commented Dec 18, 2014

@smallint Yeah - your rebase work needs to be redone - the src commit was reverted. Sorry for the trouble.

smallint added some commits Nov 25, 2014

@smallint [imx] @wolfgar Renamed CDVDVideoCodecIMXBuffer to CDVDVideoCodecIMXVPUBuffer and added typedef for Renderer (3997417)
@smallint [imx] @wolfgar Indentation (d9371ae)
@smallint [imx] @wolfgar Added base class CDVDVideoCodecIMXBuffer that CDVDVideoCodecIMXVPUBuffer derives from (dd7c589)
@smallint [imx] @wolfgar Added CDVDVideoCodecIMXIPUBuffer class that wraps IPU allocated picture buffers (f63dfb6)
@smallint [imx] @wolfgar Added CDVDVideoCodecIMXIPUBuffers that manages a pool of IPU buffers and implements deinterlacing of VPU buffers to IPU buffers (8b515e4)
@smallint [imx] @wolfgar Added mixer thread implementation with capacity limited input/output queue (26c0505)
@smallint [imx] @wolfgar Removed friend declaration (54b042d)
@smallint [imx] @wolfgar Added missing attribute to CDVDVideoCodecIMXVPUBuffer, cosmetics (indentation) and allocate buffer according to input format at compile time (bdc52f1)
@smallint [imx] @wolfgar Added more VPU decoder configuration and fixed nChromaInterleave depending on input format (49b99f2)
@smallint [imx] @wolfgar Added deinterlacer initialization and mixer thread but still unused (63c5f7d)
@smallint [imx] @wolfgar Cosmetics and removed duplicate class declaration (f40a3af)
@smallint [imx6] @wolfgar Activated deinterlacer and mixer thread (897084b)
@smallint [imx] @wolfgar return 3 in LinuxRendererGLES::GetOptimalBufferSize for IMX render path (c9b70cb)
@smallint [imx] @wolfgar Corrected some comments, fixed iFormat along with IMX_OUTPUT_FORMAT_I420 (4c49642)
@smallint [imx] @wolfgar Enable LOW_MOTION for SD content (8b66932)
@smallint [imx] @wolfgar Removed commented lines and removed WRITE flag for read only mapped buffers (31a1713)
@smallint [imx] @wolfgar Added VDI buffer splitting to bypass kernel splitting algorithms which introduce some DMA memory burst penalties (74daccb)
@smallint [imx] @wolfgar Removed IPU task priority setting (1b868e7)
@smallint [imx] @wolfgar Cleaned up methods and removed unnecessary parameters (0b505c7)
@smallint [imx] @wolfgar Typos (d0e8d0c)
@smallint [imx] @wolfgar Implemented IPU double rate task but disabled it for now, as well as low motion even for SD (47fceda)
@smallint [imx] @wolfgar Fixed double rate buffer output order and changed profile log to output pts instead of buffer addresses (bbf9311)
@smallint [imx] @wolfgar Removed unnecessary code (bcfceb8)
@smallint [imx] @wolfgar Setup 3-field de-interlacing only if low motion is activated (5b0079a)
@smallint [imx] @wolfgar Fixed setting low motion even if there is no previous buffer and process buffers also with invalid pts (7b1e93c)
@smallint [imx] @wolfgar Flush buffers if dropstate is active (6f67765)
@smallint [imx] @wolfgar Disabled double rate for SD for now because handling of invalid pts needs to be sorted out (ab25567)
Member

smallint commented Dec 18, 2014

@fritsch Done. Any reason to revert the renaming?

Member

fritsch commented Dec 18, 2014

@smallint Yes. GitHub does not support --follow and our "per file" history was broken after the merge. Additionally we did not have to do that renaming for legal reasons ... so losing history for a bit of cosmetics was decided against.

Are you sure this is safe? We cannot change the pointers which are returned that way, I fear.
I am unsure about the guarantee of alignment out of IPU_ALLOC (which is a simple wrapper around dma_alloc_coherent), but my guess is it is page aligned, so it is fine.
If that was not the case, we would have to over-allocate to be able to align, wouldn't we?

Owner

smallint replied Dec 21, 2014

Yep, if it is not nAlign-aligned. I will check it. Shall we always align on pages and just allocate an additional 4k?

Member

fritsch commented Jan 19, 2015

@smallint @wolfgar

Do we want to get this in? Or is the rewrite you two are doing superseding it, and will it include some of this code?

E.g. The question is, if you rebase your rewrite on this or not :-)

Member

smallint commented Jan 19, 2015

I am currently working on another rework of the proposed method (forum) which integrates much more nicely into the current RenderManager workflow and will later make it easier to implement a GLES fallback. But this needs #6090. Not sure what you prefer. If this one already works for some people, go ahead and merge it. But the next PR will remove most of this code again ;)

Member

fritsch commented Jan 20, 2015

That was the question, yes - thanks for the answer. So we keep that one open here until your rework based on #6090 is ready to be merged and then we can close this one.

Thanks very much for this update.

Member

smallint commented Jan 21, 2015

I pushed another branch with a rework that I would call final but which still needs some tests. Would you mind checking the code in smallint/xbmc@55c8fbf, which implements the screenshot and the YUV422 to RGB conversion? I don't know if I used the correct Kodi-compliant coefficients. Another issue @Memphiz pointed us to is RenderCapture for e.g. boblight (funny thing, I did not know about it), which needs to be fixed as well. Could someone give more input on that: how often will it be called, what interface to implement ...

Owner

Memphiz commented Jan 21, 2015

For rendercapture see here:

https://github.com/xbmc/xbmc/blob/master/xbmc/cores/VideoRenderers/RenderCapture.h (e.g. the impl for rbpi)

Its called for each frame in the best case ;)

Member

smallint commented Jan 21, 2015

Its called for each frame in the best case

We are just happy that rendering is fast enough; capturing the display on every output is just mad ;) I will look into this, thanks. What else is RenderCapture used for, and how can I activate it for testing?

Owner

Memphiz commented Jan 21, 2015

It's used for creating bookmarks eventually (no performance needed there) and for boblight (boblight is the use case where we want as many frames as possible; normally, for example on AML, we use the hardware to downscale the frame to, say, 80x80 or 160x160 pixels and then grab this downscaled frame, which has enough information to drive the LEDs).

Contributor

wolfgar commented Jan 22, 2015

Thanks a lot for the additional input, Memphiz. It may be possible to achieve this at the cost of one additional IPU task per frame. When deinterlacing is not in play, this additional activity will be easy to cope with (as the output frame is "small", it is intrinsically cheaper than when we deinterlace). When deinterlacing is in play, I cannot tell for sure.
Another way to do it in an efficient way would be to use the 2D engine too. I will have a look

Owner

Memphiz commented Jan 22, 2015

for interlaced content it might be enough to capture a half frame if this makes it easier/faster

Member

smallint commented Jan 22, 2015

@wolfgar deinterlacing is actually faster than blitting progressive frames according to my tests so one could enable deinterlacing all the time ;)

Member

smallint commented Feb 5, 2015

Superseded by #6351.

smallint closed this Feb 5, 2015

smallint referenced this pull request in xbmc-imx6/xbmc Feb 16, 2015

Open

1080i deinterlace issue #70

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment