CPP extension for overdrive effect in functional #580

Closed

Conversation

bhargavkathivarapu
Contributor

Hi,

I have tried implementing the existing functional overdrive using a C++ extension, as the Python version is slow compared to the SoX version (see #260, reducing the SoX dependency).
Though the C++ version is slightly slower than the SoX version, it is much faster than the Python version.

  • comparing SoX with the new overdrive (with the C++ extension)

[screenshot: SoX vs C++ extension timing comparison]

  • the old Python overdrive effect took 80000 ms

[screenshot: Python overdrive timing]

That is a more than 700x speedup compared to the Python implementation.

  • SoX compatibility - passed
  • batch test - passed
  • TorchScript - not passed. I think a C++ extension exposed this way cannot be converted to TorchScript (see the sketch after the log below).
$ python test/test_torchscript_consistency.py
.............................E.....sssssssssssssssssssssssssssssssssss................ssssssssssssssss
======================================================================
ERROR: test_overdrive (__main__.TestFunctionalCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_torchscript_consistency.py", line 474, in test_overdrive
    self._assert_consistency(func, waveform)
  File "test/test_torchscript_consistency.py", line 38, in _assert_consistency
    return _assert_functional_consistency(func, tensor, self.device, shape_only=shape_only)
  File "test/test_torchscript_consistency.py", line 14, in _assert_functional_consistency
    ts_func = torch.jit.script(func)
  File "/Users/ka387861/pytorch/torch/jit/__init__.py", line 1296, in script
    fn = torch._C._jit_script_compile(qualified_name, ast, _rcb, get_default_args(obj))
  File "/Users/ka387861/pytorch/torch/jit/_recursive.py", line 559, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/Users/ka387861/pytorch/torch/jit/__init__.py", line 1296, in script
    fn = torch._C._jit_script_compile(qualified_name, ast, _rcb, get_default_args(obj))
RuntimeError: 
Python builtin <built-in method _overdrive_helper of PyCapsule object at 0x143fe2270> is currently not supported in Torchscript:
  File "/Users/ka387861/audio/torchaudio/functional.py", line 1291

    # TODO: Implement a torch CPP extension
    _overdrive_helper(waveform, temp, last_in, last_out, output_waveform)
    ~~~~~~~~~~~~~~~~~ <--- HERE

    return output_waveform.clamp(min=-1, max=1).view(actual_shape)
'overdrive' is being compiled since it was called from 'func'
  File "test/test_torchscript_consistency.py", line 472
            gain = 30.
            colour = 50.
            return F.overdrive(tensor, gain, colour)
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE


----------------------------------------------------------------------
Ran 102 tests in 1839.777s

FAILED (errors=1, skipped=51)
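
For reference, one way around this limitation (a hedged sketch only, not part of this PR) is to register the helper with the PyTorch dispatcher as a custom operator instead of exposing it through pybind11; registered operators are visible to the TorchScript compiler. The sketch below uses the TORCH_LIBRARY macro from more recent PyTorch releases and illustrative names; at the time of this PR the equivalent mechanism was torch::RegisterOperators.

#include <torch/script.h>

// Illustrative only: same signature idea as the helper in this PR.
torch::Tensor overdrive_helper(
    torch::Tensor waveform,
    torch::Tensor temp,
    torch::Tensor last_in,
    torch::Tensor last_out,
    torch::Tensor output_waveform) {
  // ... the per-channel / per-frame loop from the extension ...
  return output_waveform;
}

// Scripted Python code can then call
// torch.ops.torchaudio.overdrive_helper(...), which TorchScript can compile.
TORCH_LIBRARY(torchaudio, m) {
  m.def("overdrive_helper", &overdrive_helper);
}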
  • Should we create a new folder to organize C++ code in torchaudio?
    • CUDA kernels and C++ files for other functions might clutter the torchaudio folder.
  • I am planning to write a CUDA kernel, but I am getting some odd errors when running the existing TorchScript tests on a remote GPU Docker container.
    ENV: PyTorch 1.4, CUDA 10, multi-GPU

Below is part of the log:

======================================================================
ERROR: test_Spectrogram (__main__.TestTransformsCUDA)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_torchscript_consistency.py", line 486, in test_Spectrogram
    self._assert_consistency(T.Spectrogram(), tensor)
  File "test/test_torchscript_consistency.py", line 482, in _assert_consistency
    _assert_transforms_consistency(transform, tensor, self.device)
  File "test/test_torchscript_consistency.py", line 27, in _assert_transforms_consistency
    ts_transform = torch.jit.script(transform)
  File "/miniconda/envs/python36/lib/python3.6/site-packages/torch/jit/__init__.py", line 1255, in script
    return torch.jit._recursive.recursive_script(obj)
  File "/miniconda/envs/python36/lib/python3.6/site-packages/torch/jit/_recursive.py", line 534, in recursive_script
    return create_script_module(nn_module, infer_methods_to_compile(nn_module))
  File "/miniconda/envs/python36/lib/python3.6/site-packages/torch/jit/_recursive.py", line 296, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, cpp_module, stubs)
  File "/miniconda/envs/python36/lib/python3.6/site-packages/torch/jit/_recursive.py", line 340, in create_script_module_impl
    create_methods_from_stubs(concrete_type, stubs)
  File "/miniconda/envs/python36/lib/python3.6/site-packages/torch/jit/_recursive.py", line 259, in create_methods_from_stubs
    concrete_type._create_methods(defs, rcbs, defaults)
RuntimeError: Can't redefine method: forward on class: __torch__.torchaudio.transforms.Spectrogram (addMethod at /opt/conda/conda-bld/pytorch_1579027003190/work/torch/csrc/jit/script/class_type.cpp:73)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f55078ed627 in /miniconda/envs/python36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::ClassType::addMethod(torch::jit::Function*) + 0x1d9 (0x7f550d426f69 in /miniconda/envs/python36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: torch::jit::script::CompilationUnit::define(c10::optional<c10::QualifiedName> const&, torch::jit::script::Def const&, std::shared_ptr<torch::jit::script::Resolver> const&, torch::jit::script::Self const*, std::unordered_map<std::string, torch::jit::Function*, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, torch::jit::Function*> > > const&, bool) const + 0x6d7 (0x7f550d3b3f47 in /miniconda/envs/python36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: torch::jit::script::CompilationUnit::define(c10::optional<c10::QualifiedName> const&, std::vector<torch::jit::script::Def, std::allocator<torch::jit::script::Def> > const&, std::vector<std::shared_ptr<torch::jit::script::Resolver>, std::allocator<std::shared_ptr<torch::jit::script::Resolver> > > const&, torch::jit::script::Self const*, bool) + 0x17d (0x7f550d3b468d in /miniconda/envs/python36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x7897af (0x7f5538dd57af in /miniconda/envs/python36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x28c076 (0x7f55388d8076 in /miniconda/envs/python36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>


======================================================================
FAIL: test_TimeStretch (__main__.TestTransformsCUDA)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_torchscript_consistency.py", line 533, in test_TimeStretch
    tensor,
  File "test/test_torchscript_consistency.py", line 482, in _assert_consistency
    _assert_transforms_consistency(transform, tensor, self.device)
  File "test/test_torchscript_consistency.py", line 30, in _assert_transforms_consistency
    torch.testing.assert_allclose(ts_output, output)
  File "/miniconda/envs/python36/lib/python3.6/site-packages/torch/testing/__init__.py", line 59, in assert_allclose
    count - 1, 100 * count / actual.numel()))
AssertionError: Not within tolerance rtol=0.0001 atol=1e-05 at input[7, 0, 254, 5, 0] (-0.030245909467339516 vs. -0.029713068157434464) and 101 other locations (0.00%)

----------------------------------------------------------------------
Ran 102 tests in 254.452s

FAILED (failures=1, errors=14, skipped=2)

Any idea about the error above, "Can't redefine method: forward on class"?

@mthrok or @vincentqb, could you review these changes?

Signed-off-by: Bhargav Kathivarapu <bhargavkathivarapu31@gmail.com>
@mthrok
Collaborator

mthrok commented Apr 24, 2020

Hi @bhargavkathivarapu

Thanks for the PR. This is exciting. I will take a look at it.

  1. You can run only the related test with `pytest test -v -k overdrive`.
  2. Can you run `python -m torch.utils.collect_env` and paste the result here?
  3. In my environment, both the TimeStretch and Spectrogram tests run fine.
Environment (Docker `nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04`)
$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.6.0a0+de4d2e9
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 418.116.00
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.6.0a0+de4d2e9
[pip] torchaudio==0.6.0a0+954d512
[conda] blas                      1.0                         mkl
[conda] magma-cuda101             2.5.2                         1    pytorch
[conda] mkl                       2020.0                      166
[conda] mkl-include               2020.0                      166
[conda] mkl-service               2.3.0            py38he904b0f_0
[conda] mkl_fft                   1.0.15           py38ha843d7b_0
[conda] mkl_random                1.1.0            py38h962f231_0
[conda] numpy                     1.18.1           py38h4f9e942_0
[conda] numpy-base                1.18.1           py38hde5b4d6_1
[conda] torch                     1.6.0a0+de4d2e9           dev_0    <develop>
[conda] torchaudio                0.6.0a0+954d512           dev_0    <develop>
TimeStretch
$ pytest test -v -k TimeStretch
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.8.2, pytest-5.4.1, py-1.8.1, pluggy-0.13.1 -- /home/moto/conda/envs/PY3.8-cuda101/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/scratch/moto/torchaudio/.hypothesis/examples')
rootdir: /scratch/moto/torchaudio
plugins: hypothesis-5.8.3
collected 253 items / 250 deselected / 3 selected

test/test_batch_consistency.py::TestTransforms::test_batch_TimeStretch PASSED                                                                                                                                                          [ 33%]
test/test_torchscript_consistency.py::TestTransformsCPU::test_TimeStretch PASSED                                                                                                                                                       [ 66%]
test/test_torchscript_consistency.py::TestTransformsCUDA::test_TimeStretch PASSED                                                                                                                                                      [100%]

===================================================================================================== 3 passed, 250 deselected in 5.91s ======================================================================================================
Spectrogram
$ pytest test -v -k Spectrogram
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.8.2, pytest-5.4.1, py-1.8.1, pluggy-0.13.1 -- /home/moto/conda/envs/PY3.8-cuda101/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/scratch/moto/torchaudio/.hypothesis/examples')
rootdir: /scratch/moto/torchaudio
plugins: hypothesis-5.8.3
collected 253 items / 243 deselected / 10 selected

test/test_batch_consistency.py::TestTransforms::test_batch_melspectrogram PASSED                                                                                                                                                       [ 10%]
test/test_batch_consistency.py::TestTransforms::test_batch_spectrogram PASSED                                                                                                                                                          [ 20%]
test/test_compliance_kaldi.py::Test_Kaldi::test_spectrogram PASSED                                                                                                                                                                     [ 30%]
test/test_torchscript_consistency.py::TestFunctionalCPU::test_spectrogram PASSED                                                                                                                                                       [ 40%]
test/test_torchscript_consistency.py::TestFunctionalCUDA::test_spectrogram PASSED                                                                                                                                                      [ 50%]
test/test_torchscript_consistency.py::TestTransformsCPU::test_MelSpectrogram PASSED                                                                                                                                                    [ 60%]
test/test_torchscript_consistency.py::TestTransformsCPU::test_Spectrogram PASSED                                                                                                                                                       [ 70%]
test/test_torchscript_consistency.py::TestTransformsCUDA::test_MelSpectrogram PASSED                                                                                                                                                   [ 80%]
test/test_torchscript_consistency.py::TestTransformsCUDA::test_Spectrogram PASSED                                                                                                                                                      [ 90%]
test/test_transforms.py::Tester::test_melspectrogram_load_save PASSED                                                                                                                                                                  [100%]

============================================================================================================== warnings summary ==============================================================================================================
test/test_compliance_kaldi.py::Test_Kaldi::test_spectrogram
  ../torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.

-- Docs: https://docs.pytest.org/en/latest/warnings.html
=============================================================================================== 10 passed, 243 deselected, 1 warning in 7.01s ================================================================================================

@mthrok mthrok requested review from vincentqb and mthrok and removed request for vincentqb April 24, 2020 15:38
@bhargavkathivarapu
Contributor Author

2. Can you run `python -m torch.utils.collect_env` and paste the result here.
My remote Docker GPU environment:

Collecting environment information...

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.0

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB

Nvidia driver version: 410.79
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.2
[pip] torch==1.4.0
[pip] torchaudio==0.6.0a0+fddbded
[pip] torchvision==0.2.2
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.0.130 0
[conda] mkl 2019.4 243
[conda] mkl-service 2.3.0 py36he904b0f_0
[conda] mkl_fft 1.0.14 py36ha843d7b_0
[conda] mkl_random 1.1.0 py36hd6b4f25_0
[conda] numpy 1.17.2 py36haad9e8e_0
[conda] numpy-base 1.17.2 py36hde5b4d6_0
[conda] pytorch 1.4.0 py3.6_cuda10.0.130_cudnn7.6.3_0 pytorch
[conda] torchaudio 0.6.0a0+fddbded dev_0
[conda] torchvision 0.2.2 py_3 pytorch

Contributor

@vincentqb vincentqb left a comment


Adding C++ extensions is an ongoing topic of discussion for torchaudio/torchtext/torchvision (cc @fmassa), and requires some alignment, since we need to do so in a PyTorch way to maintain jitability and GPU support. For instance:

  • We would love to see `torchaudio.load` linked in a way that maintains jitability.
  • In CPP Implementation of lfilter for CPU (#290), we decided to delay merging a C++ implementation of lfilter for these reasons, even though we know that the PyTorch implementation is slower.

Since this is of interest to you, and we are indeed interested in offering something like this, we can provide guidance to align this with our plans, though it will take a bit more time to get this merged properly.

@vincentqb
Contributor

vincentqb commented Apr 24, 2020

For completeness, in terms of performance, it'd be nice to see a comparison with the jitted version available in PyTorch. :)

@bhargavkathivarapu
Contributor Author

For completeness, in terms of performance, it'd be nice to see a comparison with the jitted version available in PyTorch. I don't expect a significant difference though. :)

Here is a comparison between the JIT and Python versions of overdrive, both implemented in Python:

[screenshot: JIT vs Python overdrive timing comparison]

@bhargavkathivarapu
Contributor Author

Adding C++ extensions is an ongoing topic of discussion for torchaudio/torchtext/torchvision (cc @fmassa), and requires some alignment, since we need to do so in a PyTorch way to maintain jitability and GPU support. For instance:

* We would love to see `torchaudio.load` linked in a way that maintains jitability.

* In #290, we decided to delay merging a C++ implementation of `lfilter` for these reasons, even though we know that the PyTorch implementation is slower.

Since this is of interest to you, and we are indeed interested in offering something like this, we can provide guidance to align this with our plans, though it will take a bit more time to get this merged properly.

OK. Once the approach for integrating C++ and CUDA extensions is finalized, maybe you can open a GitHub issue for optimizing the existing code; I would like to contribute and learn some Torch C++ and CUDA internals in the process.
Meanwhile, I will try implementing other SoX effects in Python to reduce the SoX dependency.

int64_t n_frames = waveform_accessor.size(1);
int64_t n_channels = waveform_accessor.size(0);

for (int64_t i_channel = 0; i_channel < n_channels; ++i_channel) {
Contributor


Depending on the amount of work, you might benefit from using parallel_for.

Most PyTorch CPU operators are parallelized, unless there's no obvious need due to memory-boundedness.

Another issue with pure C code in C++ extensions, for now, is autovectorization. We can't ship AVX2 code without CPU-capability-based dispatch. That means that for C code in extensions like this we are, for now, restricted to SSE and related instruction sets.

Of course this is taken care of when you call into at:: operations directly, since they each take advantage of being part of libtorch.
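
For illustration, a minimal sketch of what that could look like here, assuming the per-channel bodies are independent (names follow the snippet above):

#include <ATen/Parallel.h>

// Split the [0, n_channels) range across threads; each thread runs the
// sequential per-frame recursion only for the channels it owns.
at::parallel_for(0, n_channels, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
  for (int64_t i_channel = begin; i_channel < end; ++i_channel) {
    for (int64_t i_frame = 0; i_frame < n_frames; ++i_frame) {
      // per-channel sequential body, unchanged from the original loop
    }
  }
});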

Contributor Author


@cpuhrsch, yeah, parallelization can be applied only to the channels loop. I was not sure how parallel_for treats the inner sequential loop, so I kept it without parallel_for. A parallel thread won't interfere with another parallel thread's inner loop, right?

Contributor

@cpuhrsch cpuhrsch May 12, 2020


@bhargavkathivarapu - I'm not sure what you mean by "interfere with" exactly. As in, shared variables or creating integers, etc.? In this particular case, the inner loops are independent of each other, given that they differ in i_channel. The pointers and such will still be picked up as shared variables, but as long as you don't write to a single memory location from multiple threads concurrently, there's no issue.

By default PyTorch uses OpenMP, which yields this implementation. Look into OpenMP's omp parallel (here is what looks like a good explanation) for more detail on what that means; see also the sketch below.
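
As a rough OpenMP analogue of the same pattern (illustration only):

// The accessors and pointers are shared across threads while i_channel is
// private to each thread; since every thread writes only to its own
// i_channel rows, no memory location is written from two threads at once.
#pragma omp parallel for
for (int64_t i_channel = 0; i_channel < n_channels; ++i_channel) {
  for (int64_t i_frame = 0; i_frame < n_frames; ++i_frame) {
    // sequential per-frame work touching only row i_channel
  }
}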

for (int64_t i_frame = 0; i_frame < n_frames; ++i_frame) {
  last_out_accessor[i_channel] = temp_accessor[i_channel][i_frame] -
      last_in_accessor[i_channel] + 0.995 * last_out_accessor[i_channel];
  last_in_accessor[i_channel] = temp_accessor[i_channel][i_frame];
Contributor


You're setting the value of last_in to the value of temp for the current iteration so that those values can be used in the next iteration. But instead you could just read from temp directly all the time (except for the first iteration), right? I added a similar comment for the Python code above.

Contributor Author


But instead you could just read from temp all the time (except for the first iteration), right?

Yes, once the first iteration is handled separately, we can remove the last_in variable (a sketch follows below).
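
A hedged sketch of that refactor, reusing the accessor names from the snippet above and assuming last_in is initialized to zero (scalar_t stands for the element type provided by the surrounding dispatch macro):

for (int64_t i_frame = 0; i_frame < n_frames; ++i_frame) {
  // Read the previous input directly from temp instead of keeping a
  // separate last_in buffer; the first frame has no predecessor, so it
  // falls back to the assumed initial value of zero.
  scalar_t prev_in =
      (i_frame == 0) ? scalar_t(0) : temp_accessor[i_channel][i_frame - 1];
  last_out_accessor[i_channel] = temp_accessor[i_channel][i_frame] -
      prev_in + 0.995 * last_out_accessor[i_channel];
}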

@mthrok
Collaborator

mthrok commented Feb 14, 2021

Hi @bhargavkathivarapu

Sorry for taking such a long time to get back to this, but we have finally cleaned up the build process and how we write C++ extensions. Recently, @parmeet added a C++ loop for lfilter in #1244. This PR can follow the exact same pattern. If you are still around and interested, would you like to give it another shot? If not, we can take your commits and update them while keeping your credit. Let me know what you think.

@bhargavkathivarapu
Contributor Author

Hi @mthrok, thanks for sharing the update on this. I will try to implement the overdrive C++ extension similarly to lfilter.

@bhargavkathivarapu
Contributor Author

Closing this PR, as there is a new version of it: #1299.
