Bufferless True Peak-analysis #36

rawler · 2020-11-04T20:49:25Z

This is the second of the two TruePeak analysis optimizations. The key optimization here is avoiding extra memory copying by not keeping the input and output of the upsampling. Every new input frame is fed immediately to the interpolator, generating 2 or 4 new frames, which are immediately checked for a new maximum before being discarded.
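As a rough illustration of the bufferless idea, here is a minimal sketch. It assumes a hypothetical 2x linear upsampler (`LinearUpsampler2x`) as a stand-in for the crate's real FIR interpolator, and a hypothetical `true_peak` helper for stereo frames; neither is the actual code from this PR. The point is only that each upsampled frame is folded into the running peak and then dropped, so no input or output buffer is kept.

```rust
/// Hypothetical stand-in for the real interpolator: a 2x linear upsampler
/// for stereo frames. (The crate itself uses a FIR filter; this is only
/// here to keep the sketch self-contained.)
struct LinearUpsampler2x {
    prev: [f32; 2],
}

impl LinearUpsampler2x {
    fn new() -> Self {
        Self { prev: [0.0; 2] }
    }

    /// Consume one input frame and return the two upsampled frames.
    fn interpolate(&mut self, frame: [f32; 2]) -> [[f32; 2]; 2] {
        let mid = [
            0.5 * (self.prev[0] + frame[0]),
            0.5 * (self.prev[1] + frame[1]),
        ];
        self.prev = frame;
        [mid, frame]
    }
}

/// Track the per-channel peak without storing any upsampled audio.
fn true_peak(frames: &[[f32; 2]]) -> [f32; 2] {
    let mut interp = LinearUpsampler2x::new();
    let mut peak = [0.0f32; 2];
    for &frame in frames {
        // Each input frame yields a small fixed number of output frames ...
        let outs = interp.interpolate(frame);
        for out in outs.iter() {
            // ... which are checked for a new maximum and then discarded.
            for (p, s) in peak.iter_mut().zip(out.iter()) {
                *p = p.max(s.abs());
            }
        }
    }
    peak
}
```

Because the interpolator returns a fixed-size array per input frame, nothing is heap-allocated in the loop.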

The net gain according to my benchmark:

true_peak: 48kHz 2ch f32/Rust
/Interleaved: 545.50 -> 395.99 (-27.4%)
/Planar: 556.07 -> 424.97 (-23.6%)

true_peak: 48kHz 2ch i16/Rust
/Interleaved: 550.95 -> 436.63 (-20.7%)
/Planar: 579.58 -> 476.53 (-17.8%)

As a nice bonus, it also cleans up a lot of code from the previous step of optimization, causing a significant net reduction of code.

8 files changed, 462 insertions(+), 363 deletions(-)

Some long-running quickcheck-runs showed that the difference between f64 and f32 calculations can be as high as 0.00000386. Increase the allowed error-margin to avoid spurious failures.

Allows rapidly iterating the sample-buffers, one dasp::Frame at a time

sdroege · 2020-11-05T07:25:34Z

Thanks, this seems great. I'll take a proper look this weekend :)

sdroege · 2020-11-07T11:42:15Z

Is it ready for review now? I saw you fixed up/improved various things in the meantime :)

rawler · 2020-11-07T17:29:04Z

Fair point. :) I'll look into it.

On Sat, 7 Nov 2020 at 15:56, Sebastian Dröge <notifications@github.com> wrote:

> In src/interp.rs:
>
> -        let imp: Box<dyn Interpolator> = match (taps, factor, channels) {
> -            (49, 2, 1) => Box::new(specialized::Interp2F::<[f32; 1]>::new()),
> -            (49, 2, 2) => Box::new(specialized::Interp2F::<[f32; 2]>::new()),
> -            (49, 2, 4) => Box::new(specialized::Interp2F::<[f32; 4]>::new()),
> -            (49, 2, 6) => Box::new(specialized::Interp2F::<[f32; 6]>::new()),
> -            (49, 2, 8) => Box::new(specialized::Interp2F::<[f32; 8]>::new()),
> -            (49, 4, 1) => Box::new(specialized::Interp4F::<[f32; 1]>::new()),
> -            (49, 4, 2) => Box::new(specialized::Interp4F::<[f32; 2]>::new()),
> -            (49, 4, 4) => Box::new(specialized::Interp4F::<[f32; 4]>::new()),
> -            (49, 4, 6) => Box::new(specialized::Interp4F::<[f32; 6]>::new()),
> -            (49, 4, 8) => Box::new(specialized::Interp4F::<[f32; 8]>::new()),
> -            (taps, factor, channels) => Box::new(generic::Interp::new(taps, factor, channels)),
> -        };
> -        Self(imp)
> +    pub fn new(_taps: usize, _factor: usize, _channels: u32) -> Self {
> +        unimplemented!()
>
> This suggests that this commit should be squashed with another one :) This alone doesn't seem runnable as-is.

rawler · 2020-11-07T17:29:56Z

I'd say it's ready for review. I'm still looking for ways to improve performance further, but this is good to merge as-is (I'll look into the squashing-topic though)

sdroege

Generally looks good to me, thanks a lot :)

Can you add some details to the last commit with the rolling buffer about what kind of optimizations this allows, i.e. what you saw happening in practice here? I assume it simply allows auto-vectorization to kick in at all for this code or is there more to it?

rawler · 2020-11-09T21:53:58Z

You're welcome. Thanks for all the other code I did not have to write :) I think all the feedback is addressed now. Please have a look again.

- Split interp::Frame into utils::FrameAcc based on dasp::Frame and utils::Samples::foreach_frame
- Push incoming frames directly onto the interpolator, one at a time, and check sample-max on the resulting frames immediately. This removes the need for input and output buffering.
- Clean up the unused parts of interp.rs

Save samples with shadow-buffering to enable a continuous fixed-length view into the buffer. For any offset, there will be a correct continuous view of the entire circular buffer. This turns the inner loop of filter application from N*4 + M*4 into a predictable 12*4 operation. This avoids some branching and gives the LLVM optimizer better information to work with (for example, allowing 512-bit operations).
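To make the shadow-buffering idea concrete, here is a minimal sketch. The names `ShadowBuffer`, `LEN`, `push`, `window` and `apply` are hypothetical and not taken from the PR; the length of 12 is borrowed from the "12*4" figure above. Every sample is written twice, at `pos` and at `pos + LEN`, so a contiguous window of exactly `LEN` samples exists at any offset and the filter loop never has to handle wrap-around in two pieces.

```rust
use std::convert::TryInto;

/// Window length (assumed; taken from the "12*4" figure in the text).
const LEN: usize = 12;

/// Circular buffer holding the last LEN samples in 2 * LEN slots.
struct ShadowBuffer {
    data: [f32; 2 * LEN],
    pos: usize,
}

impl ShadowBuffer {
    fn new() -> Self {
        Self { data: [0.0; 2 * LEN], pos: 0 }
    }

    /// Store the sample in its slot and in the "shadow" slot LEN further on.
    fn push(&mut self, sample: f32) {
        self.data[self.pos] = sample;
        self.data[self.pos + LEN] = sample;
        self.pos = (self.pos + 1) % LEN;
    }

    /// The last LEN samples, oldest first, as one contiguous fixed-size array.
    fn window(&self) -> &[f32; LEN] {
        self.data[self.pos..self.pos + LEN].try_into().unwrap()
    }

    /// Dot product with filter coefficients: a single loop with a trip
    /// count known at compile time, instead of two variable-length halves.
    fn apply(&self, coeffs: &[f32; LEN]) -> f32 {
        self.window()
            .iter()
            .zip(coeffs.iter())
            .map(|(s, c)| s * c)
            .sum()
    }
}
```

The cost is one extra store per sample and twice the memory; in exchange, the read side sees a fixed-length array, which is the kind of shape an optimizer can unroll or vectorize more readily.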

sdroege · 2020-11-10T07:50:48Z

You forgot to update benches/interp.rs for push() -> interpolate(). I've updated that now, seems all good to me otherwise and I'll merge once the CI is also happy :)

rawler added 2 commits November 4, 2020 20:58

Increase error-margin due to f32 interpolation

d683311

Some long-running quickcheck-runs showed that the difference between f64 and f32 calculations can be as high as 0.00000386. Increase the allowed error-margin to avoid spurious failures.
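A tolerance-based comparison of this kind might look like the sketch below. The helper name `approx_eq` and the tolerance of 0.000004 used in the example are assumptions for illustration; the actual margin chosen in the commit is not stated here.

```rust
/// Hypothetical helper: true if `a` and `b` differ by at most `tolerance`.
fn approx_eq(a: f64, b: f64, tolerance: f64) -> bool {
    (a - b).abs() <= tolerance
}

/// Example check: an f32-based result compared against an f64 reference,
/// using an assumed margin slightly above the observed 0.00000386.
fn within_margin(f32_result: f32, f64_reference: f64) -> bool {
    approx_eq(f64::from(f32_result), f64_reference, 0.000004)
}
```

With a margin like this, a deviation of 0.00000386 passes while a deviation of 0.00001 would still be reported as a failure.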

src/utils: Replace AsF32/AsF64 with dasp::Sample-based traits

10f2792

rawler force-pushed the bufferless-true-peak branch from fa7cc07 to 134751c November 4, 2020 20:54

src/util: Add Samples::foreach_frame

7969ce4

Allows rapidly iterating the sample-buffers, one dasp::Frame at a time
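The following is only a sketch of what frame-wise iteration over the two buffer layouts could look like. It uses plain `[f32; 2]` arrays instead of `dasp::Frame`, and the function names are hypothetical rather than the actual `Samples::foreach_frame` signature.

```rust
/// Call `f` once per stereo frame of an interleaved buffer (L R L R ...).
/// A trailing incomplete frame, if any, is ignored.
fn foreach_frame_interleaved(data: &[f32], mut f: impl FnMut([f32; 2])) {
    for chunk in data.chunks_exact(2) {
        f([chunk[0], chunk[1]]);
    }
}

/// Call `f` once per stereo frame of a planar buffer (one slice per channel).
/// Iteration stops at the end of the shorter channel.
fn foreach_frame_planar(left: &[f32], right: &[f32], mut f: impl FnMut([f32; 2])) {
    for (&l, &r) in left.iter().zip(right.iter()) {
        f([l, r]);
    }
}
```

Both variants hand the callback the same frame type, so the per-frame work (for example, pushing a frame into the interpolator) can be written once and reused for either layout.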

rawler force-pushed the bufferless-true-peak branch 3 times, most recently from 739007a to 21a1ed6 November 4, 2020 22:59

rawler force-pushed the bufferless-true-peak branch from 3a3fc20 to 7ed48ec November 6, 2020 14:30

sdroege reviewed Nov 7, 2020