
Conversation

yoyolicoris
Contributor

@yoyolicoris yoyolicoris commented Feb 24, 2021

This PR resolves issue #704.

It moves the original Python implementation of lfilter into the C++ backend and registers a custom autograd kernel to support TorchScript, as @vincentqb mentioned in #704.

A simple test case is added to check that the gradients are valid.

Notes

Some differences from the old lfilter:

  • The old implementation uses direct-form I; the new one uses direct-form II.
  • The mix of indexing and matmul operations at
    window_idxs = torch.arange(n_sample, device=device).unsqueeze(0) + torch.arange(

is replaced by a single conv1d function call.
https://github.com/yoyololicon/audio/blob/4e2ff32b50d56ce168fcee872c95ffc6cde82eaa/torchaudio/csrc/lfilter.cpp#L123
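To illustrate the equivalence, here is a minimal sketch (made-up shapes, not the actual torchaudio source): the windowed indexing + matmul computes y[c, t] = Σₖ b[k]·x[c, t + k], which is exactly what a single conv1d call (a cross-correlation in PyTorch) computes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_channel, n_sample, n_order = 2, 16, 3
x = torch.randn(n_channel, n_sample)
b = torch.rand(n_order)

# old style: gather sliding windows, then contract with the coefficients
windows = x.unfold(1, n_order, 1)      # (n_channel, n_sample - n_order + 1, n_order)
y_matmul = windows.matmul(b)

# new style: one conv1d call over a singleton "input channel" dimension
y_conv = F.conv1d(x.unsqueeze(1), b.view(1, 1, n_order)).squeeze(1)

assert torch.allclose(y_matmul, y_conv, atol=1e-6)
```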

@facebook-github-bot
Contributor

Hi @yoyololicon!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@yoyolicoris yoyolicoris changed the title Lfilter autograd support Suuport backprop through lfilter parameters Feb 24, 2021
@yoyolicoris yoyolicoris changed the title Suuport backprop through lfilter parameters Support backprop through lfilter parameters Feb 24, 2021
@mthrok
Contributor

mthrok commented Feb 24, 2021

Hi @yoyololicon

Thanks for the contribution. This looks like a great improvement.

Questions:

  • Does the new implementation yield the same result as the previous one?
  • What do you think of the coverage of the input values in the autograd check test? Is it enough, or should we test more in different regions?
  • Would it be possible to provide a Python version (ideally as a separate PR)? The thing is that we currently do not build the C++ extension on Windows, so this approach would break the Windows package. We are planning to resolve this in the coming weeks (say, two weeks); until then we cannot merge this PR. Meanwhile, if you can make the original Python version differentiable, we can merge that until the Windows support situation is resolved. After that we can discuss moving the rest of the lfilter implementation into C++. (If you are fine being blocked on the Windows support, that's okay too.)
  • What is the performance implication of the conv1d replacement?

@yoyolicoris
Contributor Author

yoyolicoris commented Feb 24, 2021

Hi @yoyololicon

Thanks for the contribution. This looks like a great improvement.

Questions:

  • Does the new implementation yield the same result as the previous one?

It passes the tests listed in test/torchaudio_unittest/sox_compatibility_test.py, so I think the output should be the same.

  • What do you think of the coverage of the input values in the autograd check test? Is it enough, or should we test more in different regions?

I think a gradient test on a second-order filter is currently enough, because it is very common and is also the basis of biquad filters.
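As a hedged sketch of the kind of check involved (not the actual test code), torch.autograd.gradcheck compares analytical gradients against finite differences; it needs double-precision inputs. Here it is applied to the FIR stage expressed as a conv1d call:

```python
import torch
import torch.nn.functional as F

def fir(x, b):
    # FIR stage of the filter as a single conv1d call
    return F.conv1d(x.unsqueeze(1), b.view(1, 1, -1)).squeeze(1)

# gradcheck requires float64 tensors with requires_grad=True
x = torch.randn(2, 16, dtype=torch.double, requires_grad=True)
b = torch.rand(3, dtype=torch.double, requires_grad=True)  # 3 taps, i.e. second order
ok = torch.autograd.gradcheck(fir, (x, b))
print(ok)
```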

  • Would it be possible to provide a Python version (ideally as a separate PR)? The thing is that we currently do not build the C++ extension on Windows, so this approach would break the Windows package. We are planning to resolve this in the coming weeks (say, two weeks); until then we cannot merge this PR. Meanwhile, if you can make the original Python version differentiable, we can merge that until the Windows support situation is resolved. After that we can discuss moving the rest of the lfilter implementation into C++. (If you are fine being blocked on the Windows support, that's okay too.)

A custom autograd function in the Python frontend would break TorchScript support.
I suggest falling back to the original version (which does not fully support autograd) when the C++ extension is not available on Windows. If that is not an option, I'm fine with waiting a few weeks.
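The fallback could look something like the following sketch (all names here are illustrative, not torchaudio's actual module layout; the pure-Python loop is a naive direct-form I filter, differentiable but slow):

```python
import torch

# Hypothetical dispatch: use the compiled kernel when the C++ extension
# imported successfully, otherwise fall back to pure Python.
try:
    import _torchaudio_cpp_ext  # placeholder name for the compiled extension
    _HAS_CPP = True
except ImportError:
    _HAS_CPP = False

def _lfilter_py(x, a, b):
    # Naive differentiable direct-form I IIR filter; a[0] is assumed to be 1.
    y = []
    for t in range(x.shape[0]):
        acc = b[0] * x[t]
        for k in range(1, len(b)):
            if t - k >= 0:
                acc = acc + b[k] * x[t - k]
        for k in range(1, len(a)):
            if t - k >= 0:
                acc = acc - a[k] * y[t - k]
        y.append(acc)
    return torch.stack(y)

def lfilter(x, a, b):
    if _HAS_CPP:
        return _torchaudio_cpp_ext.lfilter(x, a, b)
    return _lfilter_py(x, a, b)

# usage: one-pole filter y[t] = x[t] + 0.5 * y[t-1] applied to a unit impulse
x = torch.tensor([1.0, 0.0, 0.0, 0.0])
a = torch.tensor([1.0, -0.5], requires_grad=True)
b = torch.tensor([1.0])
y = lfilter(x, a, b)          # impulse response: [1.0, 0.5, 0.25, 0.125]
y.sum().backward()            # gradients flow back to the coefficients
```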

@yoyolicoris
Contributor Author

yoyolicoris commented Feb 25, 2021

What is the performance implication of the conv1d replacement?

My initial goal was to make the source code more readable, but it turns out the change also brings some performance improvement.

Below are the profiling results for this specific part of lfilter:

indexing + matmul
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
           aten::copy_        39.23%     236.839ms        39.23%     236.839ms       1.184ms           0 b           0 b           0 b           0 b           200  
            aten::take        29.64%     178.937ms        29.64%     178.937ms       1.789ms     807.46 Mb     807.46 Mb           0 b           0 b           100  
          aten::matmul         0.39%       2.377ms        28.79%     173.818ms       1.738ms     269.15 Mb    -807.46 Mb           0 b           0 b           100  
      aten::contiguous         0.08%     481.909us        20.59%     124.307ms     621.537us     807.46 Mb           0 b           0 b           0 b           200  
          aten::repeat         0.20%       1.214ms        19.47%     117.533ms       1.175ms       1.58 Gb           0 b           0 b           0 b           100  
            aten::add_        19.26%     116.257ms        19.26%     116.257ms       1.163ms           0 b           0 b           0 b           0 b           100  
              aten::mm         7.26%      43.848ms         7.30%      44.086ms     440.863us     269.15 Mb           0 b           0 b           0 b           100  
          aten::arange         1.27%       7.642ms         2.62%      15.809ms      26.349us      67.32 Mb           0 b           0 b           0 b           600  
             aten::add         0.97%       5.873ms         1.00%       6.010ms      60.097us     100.93 Mb           0 b           0 b           0 b           100  
           aten::empty         0.32%       1.934ms         0.32%       1.934ms       2.417us       2.73 Gb       2.73 Gb           0 b           0 b           800  
       aten::unsqueeze         0.24%       1.450ms         0.28%       1.698ms       4.246us           0 b           0 b           0 b           0 b           400  
          aten::unfold         0.16%     948.707us         0.25%       1.481ms       4.937us           0 b           0 b           0 b           0 b           300  
            aten::view         0.18%       1.087ms         0.18%       1.087ms       2.718us           0 b           0 b           0 b           0 b           400  
       aten::transpose         0.12%     738.172us         0.17%       1.007ms       5.036us           0 b           0 b           0 b           0 b           200  
         aten::reshape         0.06%     373.146us         0.16%     979.680us       4.898us           0 b           0 b           0 b           0 b           200  
      aten::empty_like         0.06%     343.304us         0.16%     961.195us       9.612us     807.46 Mb           0 b           0 b           0 b           100  
      aten::as_strided         0.15%     894.105us         0.15%     894.105us       0.813us           0 b           0 b           0 b           0 b          1100  
             aten::mul         0.10%     615.297us         0.12%     701.129us       7.011us      12.50 Kb           0 b           0 b           0 b           100  
         aten::resize_         0.08%     458.102us         0.08%     458.102us       1.527us      33.66 Mb      33.66 Mb           0 b           0 b           300  
    aten::_unsafe_view         0.05%     298.902us         0.07%     422.488us       4.225us           0 b           0 b           0 b           0 b           100  
               aten::t         0.05%     281.180us         0.06%     384.573us       3.846us           0 b           0 b           0 b           0 b           100  
          aten::expand         0.04%     237.944us         0.06%     354.969us       1.775us           0 b           0 b           0 b           0 b           200  
          aten::stride         0.05%     272.478us         0.05%     272.478us       0.182us           0 b           0 b           0 b           0 b          1500  
       aten::expand_as         0.01%      84.950us         0.04%     230.088us       2.301us           0 b           0 b           0 b           0 b           100  
              aten::to         0.02%     132.702us         0.02%     132.702us       1.327us           0 b           0 b           0 b           0 b           100  
           aten::alias         0.02%      92.570us         0.02%      92.570us       0.926us           0 b           0 b           0 b           0 b           100  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 603.711ms

conv1d
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                aten::conv1d         0.06%     257.202us        99.72%     441.788ms       4.418ms     269.15 Mb           0 b           0 b           0 b           100  
           aten::convolution         0.07%     312.007us        99.66%     441.530ms       4.415ms     269.15 Mb           0 b           0 b           0 b           100  
          aten::_convolution         0.33%       1.447ms        99.59%     441.218ms       4.412ms     269.15 Mb           0 b           0 b           0 b           100  
    aten::mkldnn_convolution        98.70%     437.288ms        98.87%     438.045ms       4.380ms     269.15 Mb           0 b           0 b           0 b           100  
               aten::squeeze         0.24%       1.063ms         0.28%       1.237ms       6.186us           0 b           0 b           0 b           0 b           200  
             aten::unsqueeze         0.23%       1.006ms         0.26%       1.147ms       3.824us           0 b           0 b           0 b           0 b           300  
                 aten::empty         0.12%     551.754us         0.12%     551.754us       5.518us     269.15 Mb     269.15 Mb           0 b           0 b           100  
                  aten::view         0.11%     475.994us         0.11%     475.994us       4.760us           0 b           0 b           0 b           0 b           100  
            aten::as_strided         0.07%     315.852us         0.07%     315.852us       0.632us           0 b           0 b           0 b           0 b           500  
           aten::as_strided_         0.05%     204.563us         0.05%     204.563us       2.046us           0 b           0 b           0 b           0 b           100  
            aten::contiguous         0.03%     117.147us         0.03%     117.147us       0.390us           0 b           0 b           0 b           0 b           300  
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 443.038ms

The new implementation reduces the number of function calls and makes the majority of the computation rely on a single mkldnn_convolution call; it also greatly reduces memory usage.

The profiling script:

import torch
import torch.nn.functional as F
import torch.autograd.profiler as profiler


def index_matmul(x, b):
    # Old approach: build an index tensor of sliding windows, gather with
    # torch.take, then contract with the coefficients via matmul.
    n_channel, n_sample = x.shape
    device = x.device
    n_order = b.shape[0]

    window_idxs = torch.arange(n_sample - n_order + 1, device=device).unsqueeze(
        0) + torch.arange(n_order, device=device).unsqueeze(1)
    window_idxs = window_idxs.repeat(n_channel, 1, 1)
    window_idxs += (
        torch.arange(
            n_channel, device=device).unsqueeze(-1).unsqueeze(-1) * n_sample
    )
    window_idxs = window_idxs.long()
    return torch.matmul(b, torch.take(x, window_idxs))


def conv1d(x, b):
    # New approach: a single conv1d call over the channel dimension.
    n_order = b.shape[0]
    return F.conv1d(x.unsqueeze(1), b.view(1, 1, n_order)).squeeze(1)


if __name__ == '__main__':
    torch.random.manual_seed(2434)
    b = torch.rand(3)
    x = torch.randn(16, 44100)
    x /= x.abs().max()

    with profiler.profile(profile_memory=True) as prof:
        for _ in range(100):
            index_matmul(x, b)

    print("indexing + matmul")
    print(prof.key_averages().table(sort_by="cpu_time_total"))

    with profiler.profile(profile_memory=True) as prof:
        for _ in range(100):
            conv1d(x, b)

    print("conv1d")
    print(prof.key_averages().table(sort_by="cpu_time_total"))

    # sanity check: both implementations agree
    y1 = index_matmul(x, b)
    y2 = conv1d(x, b)
    assert torch.allclose(y1, y2, atol=1e-7)

@cpuhrsch
Contributor

Thank you for this great contribution @yoyololicon! I'm taking this on to take some workload off of Moto :)

There are so many good things in here that I'd suggest we split this into three pieces so we can land them more quickly:
a) Landing the conv1d optimization within Python only
b) Landing a full C++ forward implementation
c) Landing the C++ backward pass, which also adds autograd support

a) and b) are fairly straightforward and should be quick to land; c) will need a bit more discussion due to numerical stability details etc.

While you're doing this we might already have solved Windows support for C++ extension, but your work is not blocked on it.

Thanks!

  • Christian

@cpuhrsch cpuhrsch self-requested a review February 25, 2021 17:38
Contributor

@cpuhrsch cpuhrsch left a comment


See my comment; please add me as the reviewer :)

Contributor Author

@yoyolicoris yoyolicoris left a comment


Agreed, but how do I add reviewers?

@yoyolicoris yoyolicoris requested a review from cpuhrsch February 26, 2021 00:14
@yoyolicoris
Contributor Author

a) and b) are fairly straightforward and should be quick to land; c) will need a bit more discussion due to numerical stability details etc.

Should we create a separate PR for each one?

@yoyolicoris yoyolicoris requested a review from mthrok February 26, 2021 00:33
@mthrok
Contributor

mthrok commented Feb 26, 2021

a) and b) are fairly straightforward and should be quick to land; c) will need a bit more discussion due to numerical stability details etc.

Should we create a separate PR for each one?

I recommend opening separate PRs for a) and b). Then, after a) and b) are done, you can either close this one or repurpose it for c). For now this one has good discussion and benchmark numbers, so let's leave it open.

@yoyolicoris
Contributor Author

yoyolicoris commented Feb 26, 2021

Profiling results from a run on a P620 GPU with the same parameters:

indexing + matmul
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
          aten::matmul        10.66%       2.024ms        39.36%       7.470ms      74.696us           0 b           0 b     149.41 Mb    -403.76 Mb           100  
          aten::arange        10.72%       2.034ms        24.99%       4.743ms       7.906us           0 b           0 b      67.58 Mb           0 b           600  
          aten::repeat         3.82%     725.022us        18.14%       3.443ms      34.431us           0 b           0 b     807.47 Mb           0 b           100  
           aten::copy_         8.69%       1.650ms         8.69%       1.650ms       8.248us           0 b           0 b           0 b           0 b           200  
      aten::contiguous         1.34%     254.661us         8.33%       1.581ms       7.903us           0 b           0 b     403.76 Mb           0 b           200  
              aten::mm         6.44%       1.223ms         7.65%       1.451ms      14.512us           0 b           0 b     149.41 Mb           0 b           100  
          aten::unfold         4.67%     886.657us         7.16%       1.359ms       4.528us           0 b           0 b           0 b           0 b           300  
            aten::view         7.09%       1.346ms         7.09%       1.346ms       3.365us           0 b           0 b           0 b           0 b           400  
       aten::unsqueeze         5.72%       1.086ms         6.77%       1.285ms       3.212us           0 b           0 b           0 b           0 b           400  
           aten::empty         6.39%       1.213ms         6.39%       1.213ms       1.516us           0 b           0 b       1.43 Gb       1.43 Gb           800  
             aten::add         4.93%     935.917us         5.88%       1.116ms      11.162us           0 b           0 b     100.98 Mb           0 b           100  
            aten::take         5.57%       1.058ms         5.57%       1.058ms      10.578us           0 b           0 b     403.76 Mb     403.76 Mb           100  
             aten::mul         4.59%     870.274us         5.55%       1.054ms      10.539us           0 b           0 b      50.00 Kb           0 b           100  
    aten::_unsafe_view         0.98%     185.492us         4.24%     805.323us       8.053us           0 b           0 b           0 b           0 b           100  
         aten::reshape         1.01%     191.859us         3.65%     692.589us       3.463us           0 b           0 b           0 b           0 b           200  
            aten::add_         3.61%     685.445us         3.61%     685.445us       6.854us           0 b           0 b           0 b           0 b           100  
      aten::as_strided         3.05%     578.604us         3.05%     578.604us       0.526us           0 b           0 b           0 b           0 b          1100  
         aten::resize_         2.73%     518.079us         2.73%     518.079us       1.727us           0 b           0 b      33.79 Mb      33.79 Mb           300  
       aten::transpose         1.85%     350.566us         2.47%     467.854us       2.339us           0 b           0 b           0 b           0 b           200  
      aten::empty_like         0.88%     167.664us         2.38%     452.493us       4.525us           0 b           0 b     403.76 Mb           0 b           100  
          aten::stride         1.96%     371.699us         1.96%     371.699us       0.206us           0 b           0 b           0 b           0 b          1800  
               aten::t         1.17%     222.400us         1.72%     326.600us       3.266us           0 b           0 b           0 b           0 b           100  
          aten::expand         1.01%     191.119us         1.53%     289.546us       1.448us           0 b           0 b           0 b           0 b           200  
       aten::expand_as         0.35%      67.201us         1.02%     193.950us       1.939us           0 b           0 b           0 b           0 b           100  
           aten::alias         0.45%      84.716us         0.45%      84.716us       0.847us           0 b           0 b           0 b           0 b           100  
              aten::to         0.30%      57.575us         0.30%      57.575us       0.576us           0 b           0 b           0 b           0 b           100  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 18.980ms

conv1d
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
               aten::conv1d         2.09%     115.218us        87.63%       4.837ms      48.367us           0 b           0 b     134.62 Mb           0 b           100  
          aten::convolution         2.07%     114.072us        85.55%       4.721ms      47.214us           0 b           0 b     134.62 Mb           0 b           100  
         aten::_convolution        11.17%     616.444us        83.48%       4.607ms      46.074us           0 b           0 b     134.62 Mb           0 b           100  
    aten::cudnn_convolution        47.62%       2.628ms        56.20%       3.102ms      31.017us           0 b           0 b     134.62 Mb           0 b           100  
            aten::unsqueeze        10.80%     596.204us        13.80%     761.660us       2.539us           0 b           0 b           0 b           0 b           300  
              aten::squeeze         8.38%     462.517us         9.91%     546.892us       2.734us           0 b           0 b           0 b           0 b           200  
                aten::empty         5.03%     277.499us         5.03%     277.499us       1.387us           0 b           0 b     134.62 Mb     134.62 Mb           200  
           aten::as_strided         4.53%     249.831us         4.53%     249.831us       0.500us           0 b           0 b           0 b           0 b           500  
                 aten::view         3.45%     190.142us         3.45%     190.142us       1.901us           0 b           0 b           0 b           0 b           100  
           aten::contiguous         2.38%     131.520us         2.38%     131.520us       0.329us           0 b           0 b           0 b           0 b           400  
               aten::stride         1.32%      72.752us         1.32%      72.752us       0.182us           0 b           0 b           0 b           0 b           400  
              aten::resize_         1.18%      64.994us         1.18%      64.994us       0.325us           0 b           0 b           0 b           0 b           200  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.519ms

The speed improvement is much more obvious.

update to latest version
@yoyolicoris
Contributor Author

yoyolicoris commented Mar 9, 2021

@cpuhrsch A demonstration test has been added.
I use a 9th-order Butterworth high-pass filter to filter a unit impulse, so the output should be the impulse response of the filter.
With enough precision, the impulse response should not have an absolute value greater than 1.
The pass-band frequency was adjusted so that with double precision the response is accurate enough, but with single precision the accumulated errors are large enough to push the response above 1.
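The idea can be sketched with scipy (the cutoff and signal length below are illustrative assumptions, not the actual test asset): filter a unit impulse with a 9th-order Butterworth high-pass filter and compare double vs. single precision; rounding errors accumulate through the IIR recursion in float32.

```python
import numpy as np
from scipy import signal

# 9th-order Butterworth high-pass filter; 0.1 is an assumed normalized cutoff
b, a = signal.butter(9, 0.1, btype="highpass")

impulse = np.zeros(1024)
impulse[0] = 1.0

h64 = signal.lfilter(b, a, impulse)                     # double precision
h32 = signal.lfilter(b.astype(np.float32),
                     a.astype(np.float32),
                     impulse.astype(np.float32))        # single precision

print("max |h| (float64):", np.abs(h64).max())
print("max |h| (float32):", np.abs(h32).max())
```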

Contributor

@cpuhrsch cpuhrsch left a comment


Aside from the scipy dependency question for generating the test asset (@mthrok), this looks good to go.

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!


@yoyolicoris yoyolicoris requested a review from mthrok March 11, 2021 04:17
@yoyolicoris
Contributor Author

yoyolicoris commented Mar 14, 2021

Hi @mthrok, the stability test case has been updated. Could you help me review it?

Contributor

@mthrok mthrok left a comment


Hi @yoyololicon

Sorry for my late response; I was away from my laptop. Please refer to my comments regarding test readability and maintainability.

@mthrok mthrok merged commit 2a3d52f into pytorch:master Mar 15, 2021
@mthrok
Contributor

mthrok commented Mar 15, 2021

@yoyololicon Thanks for this great contribution!

@mthrok mthrok changed the title Support backprop through lfilter parameters Add backprop support to lfilter Mar 15, 2021
@yoyolicoris
Contributor Author

@mthrok @cpuhrsch Thank you both for helping me land this PR!
I really appreciate your help and the time you took to review it. It was a wonderful experience.

@yoyolicoris yoyolicoris deleted the lfilter_autograd_support branch March 15, 2021 17:29
@mthrok mthrok mentioned this pull request Apr 1, 2021