Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deep PLC performance on Android #332

Closed
dmpriso opened this issue Apr 1, 2024 · 16 comments
Closed

Deep PLC performance on Android #332

dmpriso opened this issue Apr 1, 2024 · 16 comments

Comments

@dmpriso
Copy link

dmpriso commented Apr 1, 2024

I'm currently testing deep PLC on Android. Opus 1.5.1 has been compiled as Release with NEON support on. Target device Pixel 7 Pro. OPUS_SET_COMPLEXITY(5) has been set. (Doesn't make any noticeable if I set that to 5 or 10). App is in foreground, device is connected to power.

Generating a 10ms PLC frame often takes more than 10ms, even 15 or 20ms. Is that something I should consider as expected at the current state of implementation? If not, what could I look into?

@dmpriso
Copy link
Author

dmpriso commented Apr 1, 2024

Some build output which may be helpful (Note I have added the OPUS_PRESUME_NEON output)

-- Found Git: /usr/bin/git (found version "2.34.1") 
-- Opus package version from package_version file: 1.5.1
-- Opus project version: 1.5.1
-- Using Conan toolchain: /home/daniel/.conan/data/opus/1.5.1/_/_/build/e85f5e2ffef26b6c35c5eab5f680cdc34fb2a4fd/build/RelWithDebInfo/generators/conan_toolchain.cmake
-- Conan toolchain: Setting CMAKE_POSITION_INDEPENDENT_CODE=ON (options.fPIC)
-- Conan toolchain: Setting BUILD_SHARED_LIBS = OFF
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /home/daniel/Android/Sdk/ndk/21.1.6352462/toolchains/llvm/prebuilt/linux-x86_64/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Performing Test VLA_SUPPORTED
-- Performing Test VLA_SUPPORTED -- success
-- Looking for alloca.h
-- Looking for alloca.h - found
-- Looking for alloca
-- Looking for alloca - found
-- Looking for floor in m
-- Looking for floor in m - found
-- Looking for lrintf
-- Looking for lrintf - found
-- Looking for lrint
-- Looking for lrint - found
-- Check NEON support by compiler
-- Looking for arm_neon.h
-- Looking for arm_neon.h - found
-- Performing Test FAST_MATH_SUPPORTED
-- Performing Test FAST_MATH_SUPPORTED - Success
-- Performing Test STACK_PROTECTOR_SUPPORTED
-- Performing Test STACK_PROTECTOR_SUPPORTED - Success
-- Performing Test HIDDEN_VISIBILITY_SUPPORTED
-- Performing Test HIDDEN_VISIBILITY_SUPPORTED - Success
-- Performing Test Wall_SUPPORTED
-- Performing Test Wall_SUPPORTED - Success
-- Performing Test W_SUPPORTED
-- Performing Test W_SUPPORTED - Success
-- Performing Test Wstrictprototypes_SUPPORTED
-- Performing Test Wstrictprototypes_SUPPORTED - Success
-- Performing Test Wextra_SUPPORTED
-- Performing Test Wextra_SUPPORTED - Success
-- Performing Test Wcastalign_SUPPORTED
-- Performing Test Wcastalign_SUPPORTED - Success
-- Performing Test Wnestedexterns_SUPPORTED
-- Performing Test Wnestedexterns_SUPPORTED - Success
-- Performing Test Wshadow_SUPPORTED
-- Performing Test Wshadow_SUPPORTED - Success
-- The following features have been enabled:

 * OPUS_ENABLE_FLOAT_API, compile with the floating point API (for machines with float library).
 * OPUS_HARDENING, run-time checks that are cheap and safe for use in production.
 * OPUS_INSTALL_PKG_CONFIG_MODULE, install pkg-config module.
 * OPUS_INSTALL_CMAKE_CONFIG_MODULE, install CMake package config module.
 * OPUS_DRED, enable DRED.
 * OPUS_VAR_ARRAYS, use variable length arrays for stack arrays.
 * OPUS_STACK_PROTECTOR, use stack protection.
 * OPUS_FORTIFY_SOURCE, add protection against buffer overflows.

-- The following OPTIONAL packages have been found:

 * Git

-- The following features have been disabled:

 * OPUS_BUILD_SHARED_LIBRARY, build shared library.
 * OPUS_BUILD_TESTING, build tests.
 * OPUS_CUSTOM_MODES, enable non-Opus modes, e.g. 44.1 kHz & 2^n frames.
 * OPUS_BUILD_PROGRAMS, build programs.
 * OPUS_DISABLE_INTRINSICS, disable all intrinsics optimizations.
 * OPUS_FIXED_POINT, compile as fixed-point (for machines without a fast enough FPU).
 * OPUS_FLOAT_APPROX, enable floating point approximations (Ensure your platform supports IEEE 754 before enabling).
 * OPUS_ASSERTIONS, additional software error checking.
 * OPUS_FUZZING, causes the encoder to make random decisions (do not use in production).
 * OPUS_CHECK_ASM, enable bit-exactness checks between optimized and c implementations.
 * OPUS_DNN_FLOAT_DEBUG, Run DNN computations as float for debugging purposes.
 * OPUS_OSCE, enable OSCE.
 * OPUS_FIXED_POINT_DEBUG, debug fixed-point implementation.
 * OPUS_USE_ALLOCA, use alloca for stack arrays (on non-C99 compilers).
 * OPUS_NONTHREADSAFE_PSEUDOSTACK, use a non threadsafe pseudostack when neither variable length arrays or alloca is supported.
 * OPUS_FAST_MATH, enable fast math (unsupported and discouraged use, as code is not well tested with this build option).

-- OPUS_MAY_HAVE_NEON enabling runtime detection
-- OPUS_PRESUME_NEON ON

I also just updated to latest NDK (26.2) with clang-17 without it making a significant difference.

@jmvalin
Copy link
Member

jmvalin commented Apr 2, 2024

There must be something in the build causing the non-Neon version of the DNN code to be used. On a recent phone like the Pixel 7, PLC should be much faster than that. Can you profile and see what are the top functions and what their disassembled inner loop looks like? Alternatively, you can track how compute_linear() gets called. It can either be a direct call, or through a function pointer table (rtcd), in which case the return value of opus_select_arch() would be relevant.

@xnorpx
Copy link
Contributor

xnorpx commented Apr 2, 2024

@jmvalin is DOTPROD missing from CMake? (not sure if it is relevant for PLC)

@jmvalin
Copy link
Member

jmvalin commented Apr 2, 2024

DOTPROD would definitely help speed things up and indeed is currently missing from CMake. That being said, I suspect that right now even Neon may not be used and the problem may be caused by scalar code being used. That or maybe optimizations being disabled? (or both?)

@dmpriso
Copy link
Author

dmpriso commented Apr 2, 2024

Now, knowing that this is not behaving as expected, I'll dig deeper into it. Thanks, will keep you informed here.

@dmpriso
Copy link
Author

dmpriso commented Apr 2, 2024

Current state of investigation:

  • Optimization is turned on, -O2 or -O3 are passed to clang (depending on whether RelWithDebInfo or Release is used with cmake)
  • opus_select_arch_impl() is being called and returns a value of "3"
  • I tried to set OPUS_ARM_PRESUME_DOTPROD but this leads to an unresolved symbol compute_linear_dotprod (which I can't find anywhere)

Didn't manage to set up NDK profiling yet, seems to be bit of a pita on Android devices.

@xnorpx
Copy link
Contributor

xnorpx commented Apr 3, 2024

I never used Meson to compile for Android but might be worth testing and see if it works there.

There is also ctest support added for android in ctest that I added a long time ago. (might be broken) but if you don't already have script setup it might help you.

https://github.com/xiph/opus/blob/main/cmake/RunTest.cmake

@dmpriso
Copy link
Author

dmpriso commented Apr 8, 2024

OK I finally found some time for that problem again. I created a smaller test application out of my app which plays back a fixed prerecorded opus file, dropping every second frame and replacing it with PLC, and compiled the C++ part with RelWithDebInfo and did some profiling using Android Studio.

These are the top functions according to Android Profiler (forget about the first line; it's just the audio thread sleeping)
image

This is part of the flame chart:
image

Further, I also did some "poor man's profiling" without Android Profiler, just by pressing pause from time to time, which may possibly show some helpful details.
image

image

image

Any ideas on that?

@dmpriso
Copy link
Author

dmpriso commented Apr 9, 2024

Something that came up during analysing ...

When decoding (or generating PLC frames, so when doing decoding work) many packets at once (e.g. 500 packets), the decoding an PLC speed are about five-to-ten-fold in comparison to decoding individual packets (which is usually done in a live application). Perhaps cache misses?

On my PC with an Intel Core i7-6700k, this is also the case.

Average 10ms decoding duration or PLC duration:

CPU Frame type Decoding duration Decoded many packets at once?
Intel Regular 30us Yes
Intel PLC 250us Yes
Intel Regular 177us No
Intel PLC 1263us No
Armv8 Regular 26us Yes
Armv8 PLC 520us Yes
Armv8 Regular 390us No
Armv8 PLC 7795us No

@jmvalin
Copy link
Member

jmvalin commented Apr 10, 2024

So for the PLC, it's normal for the first frame to have about double the complexity of the subsequent ones because the work that would need to be done on received packets gets deferred to the first lost packet. But as I said, I'd expect about 2x, not 10x. But a few things stand from what you provided above:

  1. It looks like 2/3 of the time is being spent in the kernel, which is very odd considering Opus does will not do any system call.
  2. Even regular decoding is more than 10x more costly in your "one packet" case. The cost of that should be the same, so something else is going wrong with your setup.

@dmpriso
Copy link
Author

dmpriso commented Apr 10, 2024

Regarding kernel times: These are a large part of my first tests which still included true audio playback. Those are waits, etc. so they aren't related to the problem, and they disappeared in later tests where I also threw out actual playback.

Regarding one-packet problem: I'm also sure that this is some kind of setup problem, and I'll cut that down further or restart testing with opus_demo.c; just was posting this for the case that there's a known setup/compilation problem causing excessive cache misses perhaps.

@dmpriso
Copy link
Author

dmpriso commented Apr 12, 2024

OK it seems that I can reproduce the problem with a standard autoconfigure build on Linux (only the --enable-deep-plc option has been set), and just with a modified opus-demo executable.

The modifications I did for opus-demo are as follows:

  • Measuring pure decoding time
  • Added -sleep_ms option to sleep between decoding individual frames (which is realistic for true realtime audio playback).

The discrepancy between sleeping and not sleeping also applies to the non-PLC case, but in the PLC case, it's quite significant, even on my desktop computer PLC frames take around ~2ms to decode, and this seems to get a real problem on some Android devices (I still suspect cache stalls and, depending on device, memory bandwith issues).

These are my tests:

Regular decoding, no sleeping

daniel@daniel-mint:~/development/opus$ ./opus_demo -d 48000 2 ~/derkleineprinz.bit ~/derkleineprinz_decoded.wav
libopus 1.5.2-dirty
Decoding with 48000 Hz output (2 channels)
average bitrate:              97.835 kb/s
maximum bitrate:             192.800 kb/s
bitrate standard deviation:   26.852 kb/s
elapsed_time: 159648 us per-frame: 51 us count: 3103

As you can see, decoding only takes 51 microseconds per frame.

Regular decoding, sleeping 10ms between frames

daniel@daniel-mint:~/development/opus$ ./opus_demo -d 48000 2 -sleep_ms 10 ~/derkleineprinz.bit ~/derkleineprinz_decoded.wav
libopus 1.5.2-dirty
Decoding with 48000 Hz output (2 channels)
average bitrate:              97.835 kb/s
maximum bitrate:             192.800 kb/s
bitrate standard deviation:   26.852 kb/s
elapsed_time: 967670 us per-frame: 311 us count: 3103

Decoding raised from 51us to 311us!

50% loss with deep PLC, no sleeping

daniel@daniel-mint:~/development/opus$ ./opus_demo -d 48000 2 -dec_complexity 5 -loss 50 ~/derkleineprinz.bit ~/derkleineprinz_decoded.wav
libopus 1.5.2-dirty
Decoding with 48000 Hz output (2 channels)
average bitrate:              97.835 kb/s
maximum bitrate:             192.800 kb/s
bitrate standard deviation:   26.852 kb/s
elapsed_time: 785844 us per-frame: 253 us count: 3103

Of course, deep PLC takes some longer time than regular decoding

50% loss with deep PLC, sleeping 10ms between frames

daniel@daniel-mint:~/development/opus$ ./opus_demo -d 48000 2 -dec_complexity 5 -loss 50 -sleep_ms 10 ~/derkleineprinz.bit ~/derkleineprinz_decoded.wav
libopus 1.5.2-dirty
Decoding with 48000 Hz output (2 channels)
average bitrate:              97.835 kb/s
maximum bitrate:             192.800 kb/s
bitrate standard deviation:   26.852 kb/s
elapsed_time: 3965575 us per-frame: 1277 us count: 3103

Per-frame decoding now takes ~1.3ms; since only each other frame is a PLC frame, PLC seems to take around 2.5ms, which is somewhat huge on a desktop machine.

I've attached the testfile I used: derkleineprinz.bit.zip
Please note that a test run with that file and -sleep_ms 10 takes just over 30 seconds.

You can find the modified opus_demo.c at https://github.com/dmpriso/opus/tree/add-sleep-ms-and-time-measuring

@dmpriso
Copy link
Author

dmpriso commented Apr 17, 2024

I also tried some perf stats with cache misses/hits:

Without sleeps:

daniel@daniel-mint:~/development/opus$ perf stat -B -e cache-misses,cache-references,l2_rqsts.code_rd_hit,l2_rqsts.code_rd_miss,l2_rqsts.demand_data_rd_hit,l2_rqsts.demand_data_rd_miss ./opus_demo -d 48000 2 -dec_complexity 5 -loss 50 ~/derkleineprinz.bit ~/derkleineprinz_decoded.wav
libopus 1.5.2-dirty
Decoding with 48000 Hz output (2 channels)
average bitrate:              97.835 kb/s
maximum bitrate:             192.800 kb/s
bitrate standard deviation:   26.852 kb/s
elapsed_time: 783853 us per-frame: 252 us count: 3103

 Performance counter stats for './opus_demo -d 48000 2 -dec_complexity 5 -loss 50 /home/daniel/derkleineprinz.bit /home/daniel/derkleineprinz_decoded.wav':

           735.147      cache-misses              #    0,126 % of all cache refs      (66,88%)
       584.424.102      cache-references                                              (66,92%)
         2.128.665      l2_rqsts.code_rd_hit                                          (66,92%)
         5.404.887      l2_rqsts.code_rd_miss                                         (67,25%)
        72.108.884      l2_rqsts.demand_data_rd_hit                                     (66,52%)
        34.977.346      l2_rqsts.demand_data_rd_miss                                     (66,32%)

       0,797169087 seconds time elapsed

       0,792949000 seconds user
       0,005934000 seconds sys

With sleeps (slow PLC):

daniel@daniel-mint:~/development/opus$ perf stat -B -e cache-misses,cache-references,l2_rqsts.code_rd_hit,l2_rqsts.code_rd_miss,l2_rqsts.demand_data_rd_hit,l2_rqsts.demand_data_rd_miss ./opus_demo -d 48000 2 -dec_complexity 5 -loss 50 -sleep_ms 10 ~/derkleineprinz.bit ~/derkleineprinz_decoded.wav
libopus 1.5.2-dirty
Decoding with 48000 Hz output (2 channels)
average bitrate:              97.835 kb/s
maximum bitrate:             192.800 kb/s
bitrate standard deviation:   26.852 kb/s
elapsed_time: 4093392 us per-frame: 1319 us count: 3103

 Performance counter stats for './opus_demo -d 48000 2 -dec_complexity 5 -loss 50 -sleep_ms 10 /home/daniel/derkleineprinz.bit /home/daniel/derkleineprinz_decoded.wav':

         2.790.887      cache-misses              #    0,464 % of all cache refs      (67,59%)
       601.107.060      cache-references                                              (65,80%)
         3.344.614      l2_rqsts.code_rd_hit                                          (65,88%)
        15.248.092      l2_rqsts.code_rd_miss                                         (65,86%)
       166.490.110      l2_rqsts.demand_data_rd_hit                                     (66,61%)
        54.267.692      l2_rqsts.demand_data_rd_miss                                     (68,45%)

      35,605237948 seconds time elapsed

       4,008319000 seconds user
       0,256996000 seconds sys

Perf reports (simply using perf record and perf report) look roughly the same for both cases:
No sleep

Samples: 3K of event 'cycles', Event count (approx.): 3290043523
Overhead  Command    Shared Object      Symbol
  41,81%  opus_demo  libopus.so.0.10.1  [.] cgemv8x4
   7,32%  opus_demo  libopus.so.0.10.1  [.] opus_fft_impl
   3,86%  opus_demo  libopus.so.0.10.1  [.] xcorr_kernel_sse
   3,57%  opus_demo  libopus.so.0.10.1  [.] celt_decode_lost
   3,51%  opus_demo  libopus.so.0.10.1  [.] compute_linear_avx2
   3,41%  opus_demo  libopus.so.0.10.1  [.] compute_generic_gru
   2,84%  opus_demo  libopus.so.0.10.1  [.] update_plc_state
   2,36%  opus_demo  libopus.so.0.10.1  [.] silk_burg_analysis
   1,93%  opus_demo  libopus.so.0.10.1  [.] compute_activation_avx2
   1,63%  opus_demo  libopus.so.0.10.1  [.] celt_iir
   1,31%  opus_demo  libopus.so.0.10.1  [.] decode_pulses
   1,15%  opus_demo  libopus.so.0.10.1  [.] compute_conv2d_avx2
   1,15%  opus_demo  libopus.so.0.10.1  [.] celt_pitch_xcorr_avx2
   1,11%  opus_demo  libopus.so.0.10.1  [.] compute_frame_features
   1,05%  opus_demo  libopus.so.0.10.1  [.] opus_fft_c
   1,02%  opus_demo  libopus.so.0.10.1  [.] run_fargan_subframe
   1,02%  opus_demo  libopus.so.0.10.1  [.] exp_rotation1
   0,83%  opus_demo  libc.so.6          [.] __memmove_avx_unaligned_erms
   0,71%  opus_demo  libopus.so.0.10.1  [.] compute_burg_cepstrum.part.0.constprop.0
   0,70%  opus_demo  libopus.so.0.10.1  [.] lpc_from_bands
   0,70%  opus_demo  libm.so.6          [.] __ieee754_pow_fma
   0,67%  opus_demo  libopus.so.0.10.1  [.] compute_glu
   0,67%  opus_demo  libopus.so.0.10.1  [.] quant_partition
   0,64%  opus_demo  opus_demo          [.] main
   0,60%  opus_demo  libopus.so.0.10.1  [.] opus_pcm_soft_clip
   0,57%  opus_demo  libopus.so.0.10.1  [.] dct
   0,57%  opus_demo  libopus.so.0.10.1  [.] deemphasis
   0,51%  opus_demo  libopus.so.0.10.1  [.] forward_transform
   0,48%  opus_demo  libopus.so.0.10.1  [.] pitch_downsample
   0,41%  opus_demo  libopus.so.0.10.1  [.] clt_mdct_backward_c

Sleep / slow PLC

Samples: 16K of event 'cycles', Event count (approx.): 3609950273
Overhead  Command    Shared Object      Symbol
  42,34%  opus_demo  libopus.so.0.10.1  [.] cgemv8x4
   6,20%  opus_demo  libopus.so.0.10.1  [.] opus_fft_impl
   3,49%  opus_demo  libopus.so.0.10.1  [.] celt_decode_lost
   3,39%  opus_demo  libopus.so.0.10.1  [.] xcorr_kernel_sse
   3,25%  opus_demo  libopus.so.0.10.1  [.] compute_linear_avx2
   2,98%  opus_demo  libopus.so.0.10.1  [.] compute_generic_gru
   2,45%  opus_demo  libopus.so.0.10.1  [.] silk_burg_analysis
   2,41%  opus_demo  libopus.so.0.10.1  [.] update_plc_state
   1,86%  opus_demo  libopus.so.0.10.1  [.] compute_activation_avx2
   1,36%  opus_demo  libopus.so.0.10.1  [.] decode_pulses
   1,27%  opus_demo  libopus.so.0.10.1  [.] compute_frame_features
   1,19%  opus_demo  libopus.so.0.10.1  [.] celt_iir
   1,11%  opus_demo  libopus.so.0.10.1  [.] celt_pitch_xcorr_avx2
   0,94%  opus_demo  libopus.so.0.10.1  [.] compute_burg_cepstrum.part.0.constprop.0
   0,86%  opus_demo  libopus.so.0.10.1  [.] opus_fft_c
   0,83%  opus_demo  libc.so.6          [.] __memmove_avx_unaligned_erms
   0,80%  opus_demo  libopus.so.0.10.1  [.] run_fargan_subframe
   0,80%  opus_demo  libopus.so.0.10.1  [.] compute_conv2d_avx2
   0,73%  opus_demo  libopus.so.0.10.1  [.] quant_partition
   0,71%  opus_demo  libopus.so.0.10.1  [.] exp_rotation1
   0,71%  opus_demo  libopus.so.0.10.1  [.] lpc_from_bands
   0,70%  opus_demo  libopus.so.0.10.1  [.] compute_glu
   0,69%  opus_demo  [unknown]          [.] 0000000000000000
   0,65%  opus_demo  libm.so.6          [.] __ieee754_pow_fma
   0,58%  opus_demo  libopus.so.0.10.1  [.] opus_pcm_soft_clip
   0,57%  opus_demo  libopus.so.0.10.1  [.] deemphasis
   0,53%  opus_demo  libopus.so.0.10.1  [.] clt_mdct_backward_c
   0,53%  opus_demo  libopus.so.0.10.1  [.] opus_decode
   0,48%  opus_demo  opus_demo          [.] main
   0,48%  opus_demo  libopus.so.0.10.1  [.] celt_inner_prod_sse
   0,48%  opus_demo  libopus.so.0.10.1  [.] dct
   0,37%  opus_demo  libopus.so.0.10.1  [.] forward_transform
   0,36%  opus_demo  libopus.so.0.10.1  [.] compute_pitchdnn
   0,34%  opus_demo  libopus.so.0.10.1  [.] alg_unquant
   0,31%  opus_demo  libopus.so.0.10.1  [.] compute_theta
   0,29%  opus_demo  libopus.so.0.10.1  [.] quant_band

Please let me know if anything is unclear or if I can analyse something or provide more info somehow.

@jmvalin
Copy link
Member

jmvalin commented Apr 18, 2024

There's nothing that really strikes me as broken. Maybe it's the timing itself that gives incorrect results?

@dmpriso
Copy link
Author

dmpriso commented Apr 18, 2024

Well,

  1. Even regular decoding is more than 10x more costly in your "one packet" case. The cost of that should be the same, so something else is going wrong with your setup.

As I demonstrated in this comment (perhaps you have overlooked it?), even with a standard autoconfigure build and a slightly modified opus_demo, I can show that regular as well as deep PLC decoding is many times more costly when decoding packet-by-packet with pauses in between (as with regular audio playback), when compared to decoding all packets without pauses in between.

And since this seems to be more or less the case on any system, this seems to lead to the original problem that on the Pixel 7 pro, deep PLC decoding sometimes takes longer than playback.

@dmpriso
Copy link
Author

dmpriso commented Apr 20, 2024

I think I got it. The reason seems to be DVFS tuning via cpufreq govenor which lowers the CPU freqency; it takes some time to ramp up, which is why individual decoding takes much longer. Need to think of if there's a viable solution for that on Android; as you suspected it's not an Opus issue per-se. Thanks a lot!

@dmpriso dmpriso closed this as completed Apr 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants