ARM performance #17

Closed
awesie opened this issue Jun 27, 2017 · 8 comments

Comments

@awesie
Contributor

awesie commented Jun 27, 2017

I added a couple of patches to the experimental branch that should reduce CPU usage on ARM. They may degrade receiver performance, however.

@mrbubble62 In #15 you mentioned that your ARM R8 platform was not fast enough. Could you test a new build using these options: cmake -DUSE_THREADS=ON -DUSE_NEON=ON -DUSE_FAST_MATH=ON ..

On a Raspberry Pi 3, using only 1 CPU core, the average CPU usage is 60~70%.

@mrbubble62

On the Allwinner R8 (C.H.I.P) there is still more work to do, but it is a definite improvement: playing back sample.xz, audio now drops out <50% of the time vs >90% before. Built with:
CMAKE_C_FLAGS "-mcpu=cortex-a8 -mfloat-abi=hard -mfpu=neon"
-DUSE_THREADS=ON -DUSE_NEON=ON -DUSE_FAST_MATH=ON

Have to say works brilliantly on i386 :) TY

@awesie
Contributor Author

awesie commented Jun 28, 2017

If you are testing with sample.xz, make sure that you decompress it first, and then test the performance. The xz tool itself will use quite a bit of CPU.
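
One way to take xz out of the measurement is to decompress the capture once, ahead of time, and benchmark against the plain file. A minimal sketch using Python's standard lzma module; the helper name decompress_xz and the ../support/sample.xz path are assumptions for illustration, not part of nrsc5:

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src: Path, dst: Path, chunk_size: int = 1 << 20) -> int:
    """Stream-decompress src (.xz) into dst and return the bytes written."""
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Copy in chunks so a large capture never has to fit in memory.
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst.stat().st_size


if __name__ == "__main__":
    # Assumed location; adjust to wherever sample.xz lives in your checkout.
    src = Path("../support/sample.xz")
    if src.exists():
        size = decompress_xz(src, src.with_suffix(""))
        print(f"wrote {size} bytes")
```

After this runs, the benchmark reads the uncompressed file directly, so only nrsc5 itself shows up in the CPU numbers.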

@mrbubble62

Decompressed the sample, but no detectable difference with nrsc5 -r ../support/sample 0

@awesie
Contributor Author

awesie commented Jun 28, 2017

Great to know, thanks!

@awesie
Contributor Author

awesie commented Jun 29, 2017

I decreased the number of taps in the filters when USE_FAST_MATH is set. This should shave off another 10~20% of CPU usage. I would be curious if this makes things any better.
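
To illustrate why fewer taps means less CPU: a direct-form FIR filter does one multiply-accumulate per tap for every output sample, so its cost scales linearly with the tap count. The sketch below is not the nrsc5 filter code (which works on complex samples in C); the tap counts 32 and 16 are hypothetical, and the sample rate is the one reported in the receiver log in this thread.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter.

    Returns (outputs, macs), where macs counts the multiply-accumulate
    operations performed: exactly len(taps) per output sample.
    """
    n_taps = len(taps)
    out = []
    macs = 0
    # Start once a full window of input is available.
    for i in range(n_taps - 1, len(samples)):
        acc = 0.0
        for k in range(n_taps):
            acc += taps[k] * samples[i - k]
            macs += 1
        out.append(acc)
    return out, macs


def macs_per_second(n_taps, sample_rate_hz):
    """Multiply-accumulates per second for a filter running at the input rate."""
    return n_taps * sample_rate_hz


if __name__ == "__main__":
    rate = 1_488_375  # Hz, rounded from the log; tap counts below are hypothetical
    for n in (32, 16):
        print(n, "taps:", macs_per_second(n, rate), "MAC/s")
```

Halving the taps halves the work in that filter, but filtering is only one stage of the receiver, so the overall saving is smaller than the per-filter saving.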

Useful metrics for performance would be:

time src/nrsc5 -r sample -o /dev/null -f wav -q 0
time src/nrsc5 -r sample -o /dev/null -f adts -q 0

These show how much time is required to process the data alone (adts) and how much is required to process the data and decode it to audio (wav).
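
To judge whether a platform keeps up, the elapsed time can be compared with the length of the capture. A rough sketch, assuming the file holds raw 8-bit I/Q samples (2 bytes per sample, rtl_sdr style) at the 1488375 Hz rate shown in the receiver log; the helper names are hypothetical:

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def capture_duration_s(file_size_bytes, sample_rate_hz=1_488_375,
                       bytes_per_sample=2):
    """Length of a raw I/Q capture in seconds.

    Assumes one byte each for I and Q per sample; change bytes_per_sample
    if the capture uses a different format.
    """
    return file_size_bytes / (sample_rate_hz * bytes_per_sample)


def realtime_factor(elapsed_s, capture_s):
    """Processing time divided by capture length; below 1.0 keeps up."""
    return elapsed_s / capture_s


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; substitute the nrsc5
    # invocation from the post above when measuring for real.
    print(f"{time_command([sys.executable, '-c', 'pass']):.3f} s")
```

A factor well below 1.0 leaves headroom for live reception; a factor near or above 1.0 means the input buffer will eventually overflow.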

@mrbubble62

mrbubble62 commented Jul 1, 2017

Results:

chip@chip:~/nrsc5/build$ time src/nrsc5 -r sample -o /dev/null -f adts -q 0
real    0m0.238s
user    0m0.215s
sys     0m0.020s
chip@chip:~/nrsc5/build$ time src/nrsc5 -r sample -o /dev/null -f wav -q 0
real    0m0.218s
user    0m0.205s
sys     0m0.015s

@mrbubble62

mrbubble62 commented Jul 1, 2017

Performance has definitely improved: with a strong signal, audio now decodes occasionally.

chip@chip:~/nrsc5/build$ nrsc5 -p 12  88500000 0
12:10:30 INFO  main.c:176: [0] Generic RTL2832U OEM
Found Rafael Micro R820T tuner
Exact sample rate is: 1488375.071248 Hz
12:10:31 INFO  main.c:63: Gain: 0.0 dB, CNR: 13.824152 dB
12:10:31 INFO  main.c:63: Gain: 0.9 dB, CNR: 14.034353 dB
12:10:32 INFO  main.c:63: Gain: 1.4 dB, CNR: 14.064837 dB
12:10:32 INFO  main.c:63: Gain: 2.7 dB, CNR: 14.218107 dB
12:10:32 INFO  main.c:63: Gain: 3.7 dB, CNR: 14.165344 dB
12:10:33 INFO  main.c:63: Gain: 7.7 dB, CNR: 13.962760 dB
12:10:33 INFO  main.c:63: Gain: 8.7 dB, CNR: 13.858078 dB
12:10:33 INFO  main.c:63: Gain: 12.5 dB, CNR: 13.359507 dB
12:10:34 INFO  main.c:63: Gain: 14.4 dB, CNR: 13.144488 dB
12:10:34 INFO  main.c:63: Gain: 15.7 dB, CNR: 12.828616 dB
12:10:35 INFO  main.c:63: Gain: 16.6 dB, CNR: 12.347807 dB
12:10:35 INFO  main.c:63: Gain: 19.7 dB, CNR: 10.950316 dB
12:10:35 DEBUG main.c:67: Best gain: 27
12:10:38 INFO  input.c:154: CFO: 1090.118408 Hz (12 ppm)
12:10:38 DEBUG sync.c:244: First block @ 15
12:10:39 INFO  sync.c:222: Synchronized!
12:10:41 INFO  sync.c:298: MER: 7.237570 dB (lower), 7.242609 dB (upper)
12:10:41 INFO  decode.c:74: BER: 0.000027, avg: 0.000027, min: 0.000027, max: 0.000027
12:10:41 DEBUG frame.c:168: pdu_seq: 1, seq: 32, nop: 33
12:10:41 DEBUG frame.c:197: ignoring partial pdu
12:10:43 INFO  sync.c:298: MER: 7.404940 dB (lower), 7.376904 dB (upper)
12:10:44 INFO  decode.c:74: BER: 0.000022, avg: 0.000025, min: 0.000022, max: 0.000027
12:10:44 DEBUG frame.c:168: pdu_seq: 0, seq: 0, nop: 33
12:10:46 INFO  sync.c:298: MER: -1.457466 dB (lower), -3.960696 dB (upper)
12:10:47 INFO  decode.c:74: BER: 0.062330, avg: 0.020793, min: 0.000022, max: 0.062330
12:10:47 DEBUG frame.c:168: pdu_seq: 1, seq: 32, nop: 33
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 DEBUG sync.c:199: lost sync (-1, -1)!
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 DEBUG sync.c:244: First block @ 11
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:49 DEBUG sync.c:244: First block @ 30
12:10:50 DEBUG sync.c:244: First block @ 3
12:10:50 DEBUG sync.c:244: First block @ 1
12:10:51 DEBUG sync.c:244: First block @ 0
12:10:51 INFO  sync.c:222: Synchronized!
12:10:52 INFO  acquire.c:98: Timing offset: 642.187500, slope: -4.199219 (adjust)
12:10:52 INFO  sync.c:298: MER: 6.963532 dB (lower), 6.934787 dB (upper)
12:10:53 INFO  decode.c:74: BER: 0.000022, avg: 0.015600, min: 0.000022, max: 0.062330
12:10:53 DEBUG frame.c:168: pdu_seq: 0, seq: 0, nop: 33
12:10:53 ERROR output.c:125: Decode error: Array index out of range
12:10:54 ERROR input.c:265: input buffer overflow!
12:10:54 ERROR input.c:265: input buffer overflow!
12:10:54 ERROR input.c:265: input buffer overflow!
12:10:54 ERROR input.c:265: input buffer overflow!
12:10:54 ERROR input.c:265: input buffer overflow!
12:10:54 ERROR input.c:265: input buffer overflow!
12:10:56 INFO  sync.c:298: MER: 0.483052 dB (lower), -0.037717 dB (upper)
12:10:56 INFO  decode.c:74: BER: 0.207394, avg: 0.053959, min: 0.000022, max: 0.207394
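
A log like the one above can be summarized with a short script. This is a sketch that relies only on the line formats visible in this log; the function names are made up for illustration, and the excerpt is copied from the output above.

```python
import re

# Matches lines such as "Gain: 2.7 dB, CNR: 14.218107 dB".
GAIN_RE = re.compile(r"Gain: (-?[\d.]+) dB, CNR: (-?[\d.]+) dB")


def parse_gain_sweep(lines):
    """Return a list of (gain_db, cnr_db) pairs found in the log lines."""
    pairs = []
    for line in lines:
        m = GAIN_RE.search(line)
        if m:
            pairs.append((float(m.group(1)), float(m.group(2))))
    return pairs


def best_gain(pairs):
    """Return the (gain_db, cnr_db) pair with the highest CNR."""
    return max(pairs, key=lambda p: p[1])


def count_overflows(lines):
    """Count 'input buffer overflow' errors in the log lines."""
    return sum("input buffer overflow" in line for line in lines)


LOG_EXCERPT = """\
12:10:32 INFO  main.c:63: Gain: 1.4 dB, CNR: 14.064837 dB
12:10:32 INFO  main.c:63: Gain: 2.7 dB, CNR: 14.218107 dB
12:10:32 INFO  main.c:63: Gain: 3.7 dB, CNR: 14.165344 dB
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 ERROR input.c:265: input buffer overflow!
12:10:48 DEBUG sync.c:199: lost sync (-1, -1)!
"""

if __name__ == "__main__":
    lines = LOG_EXCERPT.splitlines()
    print(best_gain(parse_gain_sweep(lines)), count_overflows(lines))
```

In the full sweep the highest CNR is at 2.7 dB, which would match the "Best gain: 27" line if that value is reported in tenths of a dB (an assumption, not confirmed in this thread). The overflow count is the more useful number here: each overflow means the decoder fell behind the incoming samples.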

@argilo
Collaborator

argilo commented Nov 14, 2017

#95, #106 and #107 have made significant improvements in ARM performance, and it looks like USE_FAST_MATH is no longer required. The 15-minute load average is around 0.55 on a Raspberry Pi 3 with USE_NEON. I have some further improvements in mind, but I think I'll close this issue for now, as ARM performance already seems adequate.
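
For anyone comparing boards, the load average is easier to interpret when divided by the core count. A small sketch (Unix only, since it uses os.getloadavg); load_per_core is a hypothetical helper, and the 4 cores in the comment refer to the quad-core Raspberry Pi 3:

```python
import os


def load_per_core(load_avg, n_cores):
    """Normalize a load average by the number of CPU cores."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return load_avg / n_cores


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
    _one, _five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"15-minute load per core: {load_per_core(fifteen, cores):.3f}")
    # A load of 0.55 on 4 cores corresponds to load_per_core(0.55, 4).
```

Note that the load average covers everything running on the system, not just nrsc5, so it is only a rough indicator.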

argilo closed this as completed Nov 14, 2017