CPP Implementation of lfilter for CPU #290

engineerchuan · 2019-09-18T20:50:08Z

To compare with the current Python implementation of lfilter (implementation 0), we developed two C++ implementations:
1.) an element wise implementation using accessors (implementation 1)
2.) a matrix based implementation (implementation 2)
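All three implementations compute the same difference equation with zero initial conditions, a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]. The pure-Python sketch below contrasts the two strategies. It is not the PR's C++ code, and the function names are hypothetical. `lfilter_elementwise` walks sample by sample. `lfilter_matrix` is an assumed matrix-style variant: it computes the feedforward part as one (toy, dense) Toeplitz matrix-vector product and loops only for the feedback part, which is inherently sequential.

```python
from typing import List, Sequence


def lfilter_elementwise(x: Sequence[float], b: Sequence[float], a: Sequence[float]) -> List[float]:
    """Direct form I IIR filter, computed one output sample at a time (zero initial state)."""
    if len(a) != len(b) or a[0] == 0:
        raise ValueError("expected len(a) == len(b) and a[0] != 0")
    y: List[float] = []
    for n in range(len(x)):
        acc = 0.0
        # Only taps that reach back to n - k >= 0 contribute (zero initial conditions).
        for k in range(min(len(b), n + 1)):
            acc += b[k] * x[n - k]
            if k > 0:
                acc -= a[k] * y[n - k]
        y.append(acc / a[0])
    return y


def lfilter_matrix(x: Sequence[float], b: Sequence[float], a: Sequence[float]) -> List[float]:
    """Same filter, but the feedforward part is a single Toeplitz matrix-vector product."""
    if len(a) != len(b) or a[0] == 0:
        raise ValueError("expected len(a) == len(b) and a[0] != 0")
    n_frames, n_taps = len(x), len(b)
    # T[n][m] = b[n - m] on the band 0 <= n - m < n_taps, zero elsewhere.
    toeplitz = [
        [b[n - m] if 0 <= n - m < n_taps else 0.0 for m in range(n_frames)]
        for n in range(n_frames)
    ]
    feedforward = [sum(t * xm for t, xm in zip(row, x)) for row in toeplitz]
    # The feedback part stays sequential: y[n] depends on earlier outputs.
    y: List[float] = []
    for n in range(n_frames):
        acc = feedforward[n]
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y.append(acc / a[0])
    return y


# Impulse response of a one-pole filter y[n] = x[n] + 0.5 * y[n-1].
print(lfilter_elementwise([1.0, 0.0, 0.0], [1.0, 0.0], [1.0, -0.5]))  # [1.0, 0.5, 0.25]
```

This toy matrix version materializes an N x N matrix, so it only suits small inputs. It also sums terms in a different order from the element wise loop, which is one way results can differ by floating-point round-off.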

We test these implementations against each other using random inputs, and a variety of signal lengths and filter orders:

Performance

The element wise C++ implementation (implementation 1) is much faster than the current Python implementation: roughly 650x to 1300x for lower-order filters (orders 5 and 8) and about 420x to 440x for higher-order filters (order 18). This is an argument for including it in the codebase. In the future, we should also compile it for GPU.

Correctness

For all data sizes and filter orders, both matrix based implementations (0 and 2) return the same results within a 3e-4 tolerance.
For higher-order filters and longer inputs, the element wise implementation can diverge from the matrix based ones. This is expected: floating-point round-off errors accumulate through the recursive feedback, and since the matrix and element wise methods do not perform their operations in the same order, their results should not be expected to match exactly. In general, the user is expected to supply stable filter coefficients.
This has also been tested by comparing a 2nd-order filter against SoX biquad results.
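A biquad is simply a 2nd-order lfilter with b = [b0, b1, b2] and a = [a0, a1, a2], matching the `biquad(waveform, b0, b1, b2, a0, a1, a2)` signature in torchaudio/functional.py. The sketch below is illustrative and does not reproduce the PR's tests. It compares a generic element wise filter against the explicitly written biquad equation with the same 3e-4 tolerance. It also checks stability with the standard second-order condition on the normalized denominator: both poles lie inside the unit circle iff |a2| < 1 and |a1| < 1 + a2. The coefficients and input signal are arbitrary examples, and all helper names are hypothetical.

```python
import math
from typing import List, Sequence


def lfilter(x: Sequence[float], b: Sequence[float], a: Sequence[float]) -> List[float]:
    """Generic element wise IIR filter with zero initial conditions."""
    y: List[float] = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(len(b), n + 1)):
            acc += b[k] * x[n - k]
            if k > 0:
                acc -= a[k] * y[n - k]
        y.append(acc / a[0])
    return y


def biquad_direct(x, b0, b1, b2, a0, a1, a2) -> List[float]:
    """The biquad difference equation written out with explicit delay registers."""
    y: List[float] = []
    x1 = x2 = y1 = y2 = 0.0  # zero initial conditions
    for xn in x:
        yn = (b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        y.append(yn)
        x1, x2 = xn, x1  # shift input delay line
        y1, y2 = yn, y1  # shift output delay line
    return y


def is_stable_biquad(a0: float, a1: float, a2: float) -> bool:
    """True iff both poles are strictly inside the unit circle (coefficients normalized by a0)."""
    a1n, a2n = a1 / a0, a2 / a0
    return abs(a2n) < 1.0 and abs(a1n) < 1.0 + a2n


def max_abs_diff(p: Sequence[float], q: Sequence[float]) -> float:
    if len(p) != len(q):
        raise ValueError("length mismatch")
    return max((abs(u - v) for u, v in zip(p, q)), default=0.0)


coeffs = dict(b0=0.2, b1=0.4, b2=0.2, a0=1.0, a1=-0.5, a2=0.1)  # arbitrary stable example
signal = [math.sin(0.05 * n) + 0.3 * math.cos(0.7 * n) for n in range(2000)]
generic = lfilter(signal, [coeffs["b0"], coeffs["b1"], coeffs["b2"]],
                  [coeffs["a0"], coeffs["a1"], coeffs["a2"]])
explicit = biquad_direct(signal, **coeffs)
print(is_stable_biquad(coeffs["a0"], coeffs["a1"], coeffs["a2"]),
      max_abs_diff(generic, explicit) < 3e-4)  # True True
```

The two functions add the same terms in a different order, so any mismatch here is pure round-off. With a stable, low-order filter in double precision it stays far below the tolerance.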

Questions / Comments

Is there a better way to "slice" that is more efficient? @cpuhrsch
@yf225 Thank you for your suggestion on a previous PR to parallelize across channels. I think that will generally reduce time by at most 4x, because audio waveforms usually have 2 channels. So I don't think it'd be meaningful in choosing between matrix
@vincentqb Any thoughts?
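The channel-parallel idea discussed above rests on each channel's recursion depending only on that channel's own past outputs, so channels can be filtered as independent tasks. A minimal sketch of that decomposition follows; the function names are hypothetical. In CPython, threads will not actually speed up pure-Python arithmetic because of the GIL, so this only illustrates the split. A C++ version could hand a channel loop to something like ATen's `at::parallel_for`. Either way, the ideal speedup from this scheme alone is bounded by the number of channels.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence


def lfilter_1d(x: Sequence[float], b: Sequence[float], a: Sequence[float]) -> List[float]:
    """Element wise IIR filter for a single channel, zero initial conditions."""
    y: List[float] = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(len(b), n + 1)):
            acc += b[k] * x[n - k]
            if k > 0:
                acc -= a[k] * y[n - k]
        y.append(acc / a[0])
    return y


def lfilter_per_channel(
    waveform: Sequence[Sequence[float]],
    b: Sequence[float],
    a: Sequence[float],
    max_workers: Optional[int] = None,
) -> List[List[float]]:
    """Filter each channel as an independent task; pool.map keeps output order = input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda channel: lfilter_1d(channel, b, a), waveform))


stereo = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(lfilter_per_channel(stereo, [1.0, 0.0], [1.0, -0.5]))  # [[1.0, 0.5, 0.25], [0.0, 1.0, 0.5]]
```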

engineerchuan · 2019-09-18T20:50:35Z

Raw performance test results:

--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 8000], Filter Order: 5
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   0.413754 s
CPP Element Wise Runtime       :   0.000630 s
CPP Matrix Runtime             :   0.273378 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     657.10 x
✓ - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 80000], Filter Order: 5
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   5.420492 s
CPP Element Wise Runtime       :   0.006082 s
CPP Matrix Runtime             :   3.899572 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     891.26 x
✓ - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 800000], Filter Order: 5
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:  58.134758 s
CPP Element Wise Runtime       :   0.044133 s
CPP Matrix Runtime             :  34.690434 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :    1317.27 x
✓ - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 8000], Filter Order: 8
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   0.633596 s
CPP Element Wise Runtime       :   0.000647 s
CPP Matrix Runtime             :   0.390352 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     979.18 x
✓ - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 80000], Filter Order: 8
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   6.352927 s
CPP Element Wise Runtime       :   0.005172 s
CPP Matrix Runtime             :   3.909453 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :    1228.44 x
✓ - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 800000], Filter Order: 8
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:  63.547796 s
CPP Element Wise Runtime       :   0.050219 s
CPP Matrix Runtime             :  28.999176 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :    1265.42 x
✓ - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 8000], Filter Order: 18
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   0.416699 s
CPP Element Wise Runtime       :   0.000992 s
CPP Matrix Runtime             :   0.281268 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     420.03 x
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 80000], Filter Order: 18
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   4.140644 s
CPP Element Wise Runtime       :   0.009811 s
CPP Matrix Runtime             :   2.813545 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     422.05 x
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 800000], Filter Order: 18
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:  41.416457 s
CPP Element Wise Runtime       :   0.093863 s
CPP Matrix Runtime             :  28.061018 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     441.25 x
.
----------------------------------------------------------------------
Ran 1 test in 284.240s
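Each "Ratio" line divides the Python runtime by the C++ element wise runtime. A minimal best-of-N timing harness in that spirit is sketched below. `best_runtime` and `speedup` are hypothetical helpers, not the PR's test code. Recomputing a ratio from the rounded runtimes printed above gives slightly different digits (0.413754 / 0.000630 is about 656.75, versus the logged 657.10), presumably because the test divides unrounded timings.

```python
import time
from typing import Any, Callable


def best_runtime(fn: Callable[..., Any], *args: Any, repeats: int = 3) -> float:
    """Best-of-N wall-clock runtime in seconds, measured with a monotonic high-resolution clock."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Recompute the first "Ratio Python / CPP ElementWise" line from its rounded runtimes.
print(f"{speedup(0.413754, 0.000630):.2f} x")  # 656.75 x
print(f"sum over 1e5 ints: {best_runtime(sum, range(100_000)):.6f} s")
```

Taking the minimum over repeats filters out noise from other processes. For the small [2 x 8000] cases, fixed per-call overhead is a larger share of the measured time, so those ratios are the least stable.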

vincentqb · 2019-09-18T21:17:46Z

We talked about order 40. Is this too high to test? :)

engineerchuan · 2019-09-18T21:42:42Z

@vincentqb, here are the results for a 40th-order filter (the code is in the latest commit). The Python implementation is now about 200x to 225x slower than the element wise one, roughly half the speedup seen at order 18.

--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 8000], Filter Order: 40
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   0.417130 s
CPP Element Wise Runtime       :   0.002043 s
CPP Matrix Runtime             :   0.292215 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     204.20 x
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 80000], Filter Order: 40
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   4.260019 s
CPP Element Wise Runtime       :   0.019245 s
CPP Matrix Runtime             :   2.902780 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     221.35 x
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 800000], Filter Order: 40
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:  41.547153 s
CPP Element Wise Runtime       :   0.183620 s
CPP Matrix Runtime             :  29.410577 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     226.27 x
.
----------------------------------------------------------------------

engineerchuan · 2019-09-19T19:00:52Z

Hi,

I refactored the CPP implementations to use a Python wrapper. Basically I tried to write the bare minimum necessary in CPP.

- Asserts are compiled out of the C++ code in production mode.
- The Pythonic interface is a little more mature and expressive.
- I tried to keep all memory allocation outside of the C++ code for cleanliness.
- The matrix and element wise implementations can share error-checking code, etc.
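The split described in this list can be sketched as follows, assuming the compiled routine receives a preallocated output buffer. `_reference_kernel` is a pure-Python stand-in for the C++ code, and `lfilter_wrapped` is a hypothetical name. Neither is the PR's actual binding. Because asserts are compiled out of the C++ side in production, checks like these would need to live in the wrapper to stay active.

```python
from typing import Callable, List, Sequence

# Signature of the (hypothetical) compiled kernel: reads x, b, a and writes into `out`.
Kernel = Callable[[Sequence[float], Sequence[float], Sequence[float], List[float]], None]


def _reference_kernel(x: Sequence[float], b: Sequence[float], a: Sequence[float], out: List[float]) -> None:
    """Stand-in for the C++ routine: no checks, no allocation, fills `out` in place."""
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(len(b), n + 1)):
            acc += b[k] * x[n - k]
            if k > 0:
                acc -= a[k] * out[n - k]  # earlier outputs are already in the buffer
        out[n] = acc / a[0]


def lfilter_wrapped(
    waveform: Sequence[Sequence[float]],
    b: Sequence[float],
    a: Sequence[float],
    kernel: Kernel = _reference_kernel,
) -> List[List[float]]:
    """Validate inputs and allocate outputs in Python; delegate only the arithmetic to `kernel`."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of coefficients")
    if a[0] == 0:
        raise ValueError("a[0] must be non-zero")
    outputs: List[List[float]] = []
    for channel in waveform:
        out = [0.0] * len(channel)  # preallocated buffer the kernel writes into
        kernel(channel, b, a, out)
        outputs.append(out)
    return outputs


print(lfilter_wrapped([[1.0, 0.0, 0.0]], [1.0, 0.0], [1.0, -0.5]))  # [[1.0, 0.5, 0.25]]
```

Keeping the kernel free of allocation and validation also lets several kernels (for example, a matrix and an element wise one) share the same wrapper.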

Performance results are similar to before. For small inputs, timings may be affected by the added Python wrapper, but that is a fixed cost that does not scale with data size.

Chuan

--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 8000], Filter Order: 5
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   0.433291 s
CPP Element Wise Runtime       :   0.000642 s
CPP Matrix Runtime             :   0.283377 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     674.59 x
PASS - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 80000], Filter Order: 5
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   4.302623 s
CPP Element Wise Runtime       :   0.006038 s
CPP Matrix Runtime             :   2.805403 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     712.54 x
PASS - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 800000], Filter Order: 5
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:  57.354847 s
CPP Element Wise Runtime       :   0.044153 s
CPP Matrix Runtime             :  29.991314 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :    1299.01 x
PASS - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 8000], Filter Order: 8
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   0.438865 s
CPP Element Wise Runtime       :   0.000667 s
CPP Matrix Runtime             :   0.282346 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     657.64 x
PASS - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 80000], Filter Order: 8
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   4.281600 s
CPP Element Wise Runtime       :   0.006768 s
CPP Matrix Runtime             :   2.810288 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     632.65 x
PASS - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 800000], Filter Order: 8
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:  45.460900 s
CPP Element Wise Runtime       :   0.048582 s
CPP Matrix Runtime             :  28.323625 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     935.76 x
PASS - all outputs are identical
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 8000], Filter Order: 18
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   0.433779 s
CPP Element Wise Runtime       :   0.001148 s
CPP Matrix Runtime             :   0.283811 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     377.78 x
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 80000], Filter Order: 18
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   4.324719 s
CPP Element Wise Runtime       :   0.010699 s
CPP Matrix Runtime             :   2.854490 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     404.21 x
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 800000], Filter Order: 18
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:  43.343172 s
CPP Element Wise Runtime       :   0.085304 s
CPP Matrix Runtime             :  28.245840 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     508.10 x
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 8000], Filter Order: 40
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   0.435738 s
CPP Element Wise Runtime       :   0.002127 s
CPP Matrix Runtime             :   0.291140 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     204.82 x
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 80000], Filter Order: 40
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:   4.386663 s
CPP Element Wise Runtime       :   0.018836 s
CPP Matrix Runtime             :   2.883111 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     232.89 x
--------------------------------------------------------------------------------
lfilter perf - Data Size: [2 x 800000], Filter Order: 40
--------------------------------------------------------------------------------
Python Matrix Runtime [current]:  43.455556 s
CPP Element Wise Runtime       :   0.174953 s
CPP Matrix Runtime             :  28.797021 s
--------------------------------------------------------------------------------
Ratio Python / CPP ElementWise :     248.38 x


vincentqb

Thanks for looking into this! The performance gain may be worth offering this C++ implementation to the user, so I'm OK with offering it. I'd like to have a simple mechanism to expose or remove the interface to this implementation.

What's the status with GPU? Is it CPU-only in its current form? Should we have a forking mechanism that uses the Python implementation instead when on CUDA?

This PR has a lot of commits from other PRs (e.g. deltas, SpecAugment) that are tangled with it, so it needs to be rebased to isolate what is meant to be part of this PR.
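One possible shape for the "simple mechanism to offer/remove" the C++ path and the CUDA fork asked about in this review is sketched below. Everything here is hypothetical, not torchaudio API: the module flag, `set_cpp_lfilter_enabled`, `lfilter_dispatch`, and the two implementation callables. The only real PyTorch detail relied on is that tensors expose an `is_cuda` attribute.

```python
from typing import Any, Callable, Optional, Sequence

Impl = Callable[[Any, Sequence[float], Sequence[float]], Any]

# Hypothetical module-level switch for enabling or disabling the C++ path.
_USE_CPP_LFILTER = True


def set_cpp_lfilter_enabled(enabled: bool) -> None:
    global _USE_CPP_LFILTER
    _USE_CPP_LFILTER = enabled


def lfilter_dispatch(waveform: Any, b: Sequence[float], a: Sequence[float],
                     cpp_impl: Optional[Impl], python_impl: Impl) -> Any:
    """Use the CPU-only C++ implementation when enabled and available, else fall back to Python."""
    on_cuda = bool(getattr(waveform, "is_cuda", False))  # torch.Tensor exposes .is_cuda
    if _USE_CPP_LFILTER and cpp_impl is not None and not on_cuda:
        return cpp_impl(waveform, b, a)
    return python_impl(waveform, b, a)


print(lfilter_dispatch([0.0], [1.0], [1.0],
                       cpp_impl=lambda w, b, a: "cpp", python_impl=lambda w, b, a: "python"))  # cpp
```

Passing `cpp_impl=None` (for example, when the extension failed to build) degrades to the Python path without any other code changes.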


vincentqb · 2019-10-15T19:35:54Z

torchaudio/functional.py

+    # Perform sanity checks, input check, and memory allocation in python
+
+    # Current limitations to be removed in future
+    assert(waveform.dtype == torch.float32 or waveform.dtype == torch.float64)


Does this C++ implementation restrict functionality with respect to the Python one?

vincentqb · 2019-10-15T19:36:17Z

torchaudio/functional.py

 def biquad(waveform, b0, b1, b2, a0, a1, a2):
    # type: (Tensor, float, float, float, float, float, float) -> Tensor
    r"""Performs a biquad filter of input tensor.  Initial conditions set to 0.
    https://en.wikipedia.org/wiki/Digital_biquad_filter

    Args:
-        waveform (torch.Tensor): audio waveform of dimension of `(n_channel, n_frames)`
+        waveform (torch.Tensor): Audio waveform of dimension of `(n_channel, n_frames)`.
+                                 Currently only supports float32. Normalized [-1, 1]


What about other float types?


vincentqb · 2019-11-01T19:03:03Z

By the way, it'd be nice to compare against JIT-compiled Python code (#326).

vincentqb · 2021-02-04T17:28:24Z

cc #1238

mthrok · 2021-03-06T05:53:34Z

Superseded by #1238, #1318, #1319 and #1310

engineerchuan added 4 commits September 18, 2019 12:25

adding test

a6703e0

added cpp lfilter implementation

16b7cb2

slight refactoring of cpp lfilter implementations

0115ec2

Tweaked docs for lowpass_biquad, highpass_biquad, biquad

6b872e9

engineerchuan added 3 commits September 18, 2019 14:43

Adding 40th order filter

f8efc98

adding 40th order filter

35fcf4a

Applied cpplint to torchaudio/filtering.cpp

e5fb427

engineerchuan added 11 commits September 19, 2019 12:01

refactor to use python wrapper

01a52ee

adding test

62a2dfd

added cpp lfilter implementation

7d08183

slight refactoring of cpp lfilter implementations

a6fefbe

Tweaked docs for lowpass_biquad, highpass_biquad, biquad

d0e694a

Adding 40th order filter

4127096

adding 40th order filter

6437091

Applied cpplint to torchaudio/filtering.cpp

a056409

refactor to use python wrapper

bc64940

Merge branch 'lfilter_cpp_try_2' of github.com:engineerchuan/audio into lfilter_cpp_try_2

8ac85ee

rebased and added template for element wise lfilter

a982c5f

vincentqb suggested changes Oct 15, 2019


Merge branch 'master' into lfilter_cpp_try_2

6fb500a

vincentqb mentioned this pull request Mar 17, 2020

torch transforms inspired by sox effects #260

Closed


mthrok mentioned this pull request May 16, 2020

Add windows binary jobs #642

Merged

mthrok closed this Mar 6, 2021

engineerchuan commented Sep 18, 2019 •

edited by vincentqb
