Deep PLC performance on Android #332
Some build output which may be helpful (Note I have added the
I also just updated to latest NDK (26.2) with clang-17 without it making a significant difference. |
There must be something in the build causing the non-Neon version of the DNN code to be used. On a recent phone like the Pixel 7, PLC should be much faster than that. Can you profile and see what are the top functions and what their disassembled inner loop looks like? Alternatively, you can track how compute_linear() gets called. It can either be a direct call, or through a function pointer table (rtcd), in which case the return value of opus_select_arch() would be relevant. |
@jmvalin is DOTPROD missing from CMake? (not sure if it is relevant for PLC) |
DOTPROD would definitely help speed things up and indeed is currently missing from CMake. That being said, I suspect that right now even Neon may not be used and the problem may be caused by scalar code being used. That or maybe optimizations being disabled? (or both?) |
Now, knowing that this is not behaving as expected, I'll dig deeper into it. Thanks, will keep you informed here. |
Current state of investigation:
I didn't manage to set up NDK profiling yet; it seems to be a bit of a pain on Android devices. |
I have never used Meson to compile for Android, but it might be worth testing to see whether it works there. There is also CTest support for Android that I added a long time ago (it might be broken), but if you don't already have a script set up, it might help you. |
Something that came up during analysis: when decoding many packets at once (e.g. 500 packets), or generating that many PLC frames, decoding and PLC run about five to ten times faster than when decoding individual packets (which is what a live application usually does). Perhaps cache misses? This is also the case on my PC with an Intel Core i7-6700K. Average decoding or PLC duration per 10ms frame:
|
So for the PLC, it's normal for the first frame to have about double the complexity of the subsequent ones, because the work that would need to be done on received packets gets deferred to the first lost packet. But as I said, I'd expect about 2x, not 10x. A few things stand out from what you provided above:
|
Regarding kernel times: these made up a large part of my first tests, which still included true audio playback. Those are waits etc., so they aren't related to the problem, and they disappeared in later tests where I also removed actual playback. Regarding the one-packet problem: I'm also sure that this is some kind of setup problem, and I'll cut that down further or restart testing with opus_demo.c; I was just posting this in case there's a known setup/compilation problem that might be causing excessive cache misses. |
OK it seems that I can reproduce the problem with a standard autoconfigure build on Linux (only the The modifications I did for
The discrepancy between sleeping and not sleeping also applies to the non-PLC case, but in the PLC case it's quite significant: even on my desktop computer, PLC frames take around 2ms to decode, and this seems to become a real problem on some Android devices (I still suspect cache stalls and, depending on the device, memory bandwidth issues). These are my tests: Regular decoding, no sleeping
As you can see, decoding only takes 51 microseconds per frame. Regular decoding, sleeping 10ms between frames
Decoding rose from 51us to 311us! 50% loss with deep PLC, no sleeping
Of course, deep PLC takes somewhat longer than regular decoding. 50% loss with deep PLC, sleeping 10ms between frames
Per-frame decoding now takes ~1.3ms; since only every other frame is a PLC frame, PLC seems to take around 2.5ms, which is huge for a desktop machine. I've attached the test file I used: derkleineprinz.bit.zip You can find the modified |
I also tried some Without sleeps:
With sleeps (slow PLC):
Perf reports (simply using
Sleep / slow PLC
Please let me know if anything is unclear or if I can analyse something or provide more info somehow. |
There's nothing that really strikes me as broken. Maybe it's the timing itself that gives incorrect results? |
Well,
As I demonstrated in this comment (perhaps you have overlooked it?), even with a standard autoconfigure build and a slightly modified And since this seems to be more or less the case on any system, this seems to lead to the original problem that on the Pixel 7 pro, deep PLC decoding sometimes takes longer than playback. |
I think I got it. The reason seems to be DVFS tuning via the cpufreq governor, which lowers the CPU frequency; it takes some time to ramp up again, which is why decoding individual frames takes much longer. I need to think about whether there's a viable solution for that on Android; as you suspected, it's not an Opus issue per se. Thanks a lot! |
I'm currently testing deep PLC on Android. Opus 1.5.1 has been compiled as Release with NEON support on. Target device: Pixel 7 Pro. OPUS_SET_COMPLEXITY(5) has been set (it doesn't make any noticeable difference whether I set it to 5 or 10). The app is in the foreground and the device is connected to power.
Generating a 10ms PLC frame often takes more than 10ms, even 15 or 20ms. Is that something I should consider as expected at the current state of implementation? If not, what could I look into?