v0.9.0

@mthrok mthrok released this 15 Jun 15:32
· 4 commits to release/0.9 since this release

torchaudio 0.9.0 Release Note

Highlights

torchaudio 0.9.0 release includes:

  • Performance improvements in filtering, resampling, and spectral operations.
  • Popular wav2vec2.0 model architecture.
  • Improved autograd support.

[Beta] Wav2Vec2.0 Model

This release includes model architectures from the wav2vec2.0 paper, with utility functions that allow importing pretrained model parameters published on fairseq and Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android and iOS. Please check out our C++, Android and iOS examples. The following snippets illustrate how to create a deployable model.

# Import fine-tuned model from Hugging Face Hub
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model

original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)

# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model

# The first returned value is the list of loaded models
original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)

# Build uninitialized model and load state dict
from torchaudio.models import wav2vec2_base

model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())

# Quantize / script / optimize for mobile
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")

Filtering Improvement

The internal implementation of lfilter has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad variants.

The following table illustrates the performance improvements compared with the previous releases. lfilter was applied to float32 tensors with one channel and different numbers of frames.

torchaudio version    256 frames    512 frames    1024 frames
0.9                   0.282         0.381         0.564
0.8                   0.493         0.780         1.37
0.7                   5.42          10.8          22.3

Unit: msec

Complex Tensor Migration

torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent the real and imaginary parts. In PyTorch 1.6, new dtypes such as torch.cfloat and torch.cdouble were introduced to represent complex values natively. (In the following, we refer to torchaudio’s original convention as pseudo complex types, and PyTorch’s native dtypes as native complex types.)

As the native complex types have become mature and stable, torchaudio has started to migrate complex functions to use the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing and receiving the native complex type directly. Users can choose to keep using the pseudo complex type or opt in to the native complex type. However, please note that the use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For details of this migration plan, please refer to #1337.

Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.
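The relationship between the two representations can be sketched with plain PyTorch. The example below is an illustration only, with a hypothetical tensor shape, and uses torch.view_as_complex and torch.view_as_real rather than any torchaudio helper.

```python
import torch

torch.manual_seed(0)

# Pseudo complex: a real tensor whose last dimension holds (real, imaginary).
# The shape (channel, freq, time, 2) is a hypothetical example.
pseudo = torch.randn(2, 129, 10, 2)

# Native complex: the same memory viewed as torch.cfloat.
native = torch.view_as_complex(pseudo)      # shape (2, 129, 10)

# Convert back for code that still expects the extra dimension.
roundtrip = torch.view_as_real(native)      # shape (2, 129, 10, 2)

# The magnitude computed in either representation agrees.
mag_pseudo = pseudo.pow(2).sum(-1).sqrt()
mag_native = native.abs()

print(native.dtype, native.shape, roundtrip.shape)
```

Code written against the pseudo complex layout can usually be migrated by replacing manual indexing of the last dimension with the native tensor's .real, .imag, .abs() and .angle().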

The following table illustrates the performance improvements over the previous release by comparing the time it takes for complex transforms to operate on a float32 tensor with two channels and 256 frames.

CPU
torchaudio version    Spectrogram    TimeStretch    GriffinLim
0.9                   0.229          12.6           3320
0.8                   0.283          126            5320

Unit: msec

CUDA
torchaudio version    Spectrogram    TimeStretch    GriffinLim
0.9                   0.195          0.599          36
0.8                   0.219          0.687          60.2

Unit: msec

Improved Autograd Support

Along with the Complex Tensor Migration and Filtering Improvement work described above, more tests were added to verify autograd support. The following operations are now guaranteed to support autograd up to second order.

Functionals
  • lfilter
  • allpass_biquad
  • biquad
  • band_biquad
  • bandpass_biquad
  • bandreject_biquad
  • bass_biquad
  • equalizer_biquad
  • treble_biquad
  • highpass_biquad
  • lowpass_biquad
Transforms
  • AmplitudeToDB
  • ComputeDeltas
  • Fade
  • GriffinLim
  • TimeMasking
  • FrequencyMasking
  • MFCC
  • MelScale
  • MelSpectrogram
  • Resample
  • SpectralCentroid
  • Spectrogram
  • SlidingWindowCmn
  • TimeStretch*
  • Vol

NOTE:

  1. Autograd test for transforms also covers the following functionals.
    • amplitude_to_DB
    • spectrogram
    • griffinlim
    • resample
    • phase_vocoder*
    • mask_along_axis_iid
    • mask_along_axis
    • gain
    • spectral_centroid
  2. torchaudio.transforms.TimeStretch and torchaudio.functional.phase_vocoder call atan2, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.

[Beta] Resampling Improvement

In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised.

  • Kaiser window has been added for a wider range of resampling quality.
  • rolloff parameter has been added for anti-aliasing control.
  • torchaudio.transforms.Resample precomputes the kernel using float64 precision and caches it for even faster operation.
  • A new entry point, torchaudio.functional.resample, has been added, and the original entry point, torchaudio.compliance.kaldi.resample_waveform, is deprecated.

The following table illustrates the performance improvements over the previous release by comparing the time it takes for torchaudio.transforms.Resample to process a float32 tensor with two channels and one-second duration.

CPU
torchaudio version    8k → 16k [Hz]    16k → 8k [Hz]    16k → 44.1k [Hz]    44.1k → 16k [Hz]
0.9                   0.192            0.559            0.478               0.467
0.8                   0.537            0.753            43.9                17.6

Unit: msec

CUDA
torchaudio version    8k → 16k [Hz]    16k → 8k [Hz]    16k → 44.1k [Hz]    44.1k → 16k [Hz]
0.9                   0.203            0.172            0.213               0.212
0.8                   0.860            0.559            116                 46.7

Unit: msec

Improved Windows Support

torchaudio implements some operations in C++ for reasons such as performance and integration with third-party libraries. Previously, this C++ module was only available on Linux and macOS. In this release, Windows packages also come with the C++ module.

The C++ module in the Windows package includes the efficient filtering implementation mentioned above; however, the “sox_io” backend and torchaudio.functional.compute_kaldi_pitch are not included.

I/O Functions Migration

Since the 0.6 release, we have continuously improved I/O functionality. Specifically, in 0.8 the default backend was changed from “sox” to “sox_io”, and a similar API change was applied to the “soundfile” backend. The 0.9 release concludes this migration by removing the deprecated backends. For details, please refer to #903.

Backward Incompatible Changes

I/O

  • Deprecated backends and functions were removed (#1311, #1329, #1362)
    • Please see #903 for the migration.
  • Added validation of the number of channels when saving GSM (#1384)
    • Please make sure that the signal has only one channel when saving as GSM.

Ops

  • Removed deprecated normalized argument from torchaudio.functional.griffinlim (#1369)
    • This argument was never used. Please remove the argument from your call.
  • Renamed torchaudio.functional.sliding_window_cmn arg for correctness (#1347)
    • The first argument is supposed to be a spectrogram. If you have used the keyword argument waveform=..., please change it to specgram=...
  • Changed torchaudio.transforms.Resample to precompute and cache the resampling kernel. (#1499, #1514)
    • To use the transform on devices other than CPU, please move the instantiated object to the target device.
      resampler = torchaudio.transforms.Resample(orig_freq=8000, new_freq=44100)
      resampler.to(torch.device("cuda"))

Dataset

  • Removed deprecated arguments from CommonVoice (#1534)
    • torchaudio no longer supports programmatic download of Common Voice dataset. Please remove the arguments from your code.

Deprecations

  • Deprecated the use of pseudo complex type (#1445, #1492)
    • torchaudio is adopting the native complex type; the pseudo complex type and the related utility functions are now deprecated. Please refer to #1337 for the migration process.
  • Deprecated torchaudio.compliance.kaldi.resample_waveform (#1533)
    • Please use torchaudio.functional.resample.
  • torchaudio.transforms.MelScale now expects valid n_stft value (#1515)
    • Please provide a valid value to n_stft.

New Features

[Beta] Wav2Vec2.0

  • Added wav2vec2.0 model (#1529)
  • Added wav2vec2.0 HuggingFace importer (#1530)
  • Added wav2vec2.0 fairseq importer (#1531)
  • Added speech recognition C++ example (#1538)

Filtering

  • Added C++ implementation of torchaudio.functional.lfilter (#1319)
  • Added autograd support to torchaudio.functional.lfilter (#1310, #1441)

[Beta] Resampling

  • Added torchaudio.functional.resample (#1402)
  • Added rolloff parameter (#1488)
  • Added kaiser window support to resampling (#1509)
  • Added kernel caching mechanism in torchaudio.transforms.Resample (#1499, #1514, #1556)
  • Skipped resampling when the sampling rate is not changed (#1537)

Native Complex Tensor

  • Added complex tensor support to torchaudio.functional.phase_vocoder and torchaudio.transforms.TimeStretch (#1410)
  • Added return_complex to torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram (#1366, #1551)

Improvements

I/O

  • Added file path to I/O error messages (#1523)
  • Added __str__ override to AudioMetaData for easy print (#1339)
  • Fixed uninitialized variable in sox/utils.cpp (#1306)
  • Replaced UB sox conversion macros with tensor op (#1370)
  • Removed check_length from validate_input_file (#1312)

Ops

  • Added warning for non-integer resampling frequencies (#1490)
  • Adopted native complex tensors in torchaudio.functional.griffinlim (#1368)
  • Prohibited scripting torchaudio.transforms.MelScale when n_stft is invalid (#1505)
  • Added input dimension check to VAD (#1513)
  • Added HTK-compatible option to Mel-scale conversion (#593)

Models

  • Added vanilla DeepSpeech model (#1399)

Datasets

  • Fixed checksum for the YESNO dataset (#1405)

Misc

  • Added missing transforms to __all__ (#1458)
  • Removed reference_cast in make_boxed_from_unboxed_functor (#1300)
  • Removed unused normalized constant from torchaudio.transforms.GriffinLim (#1433)
  • Removed unused helper function (#1396)

Examples

  • Added libtorchaudio C++ example (#1349)
  • Refactored libtorchaudio example (#1486)
  • Replaced librosa's Mel scale conversion with torchaudio’s in WaveRNN example (#1444)

Build

  • Updated config.guess to support source build in recent architectures (#1484)
  • Explicitly disabled wavpack when building SoX (#1462)
  • Added ROCm support to source build (#1411)
  • Added Windows C++ binary build (#1345, #1371)
  • Made kaldi selective in build (#1342)
  • Made sox selective (#1338)

Testing

  • Added autograd test for torchaudio.functional.lfilter and biquad variants (#1400, #1438)
  • Added autograd test for transforms (overview: #1414)
    • torchaudio.transforms.FrequencyMasking (#1498)
    • torchaudio.transforms.SlidingWindowCmn (#1482)
    • torchaudio.transforms.MelScale (#1467)
    • torchaudio.transforms.Vol (#1460)
    • torchaudio.transforms.TimeStretch (#1420)
    • torchaudio.transforms.AmplitudeToDB (#1447)
    • torchaudio.transforms.GriffinLim (#1421)
    • torchaudio.transforms.SpectralCentroid (#1425)
    • torchaudio.transforms.ComputeDeltas (#1422)
    • torchaudio.transforms.Fade (#1424)
    • torchaudio.transforms.Resample (#1416)
    • torchaudio.transforms.MFCC (#1415)
    • torchaudio.transforms.Spectrogram / MelSpectrogram (#1340)
  • Added test for a batch of different items in the functional batch consistency test. (#1315)
  • Added test for validating torchaudio.functional.lfilter shape (#1360)
  • Added TorchScript test for torchaudio.functional.resample (#1516)
  • Added TorchScript test for torchaudio.functional.phase_vocoder (#1379)
  • Added steps to save and load the scripted object in TorchScript (#1446)
  • Added GPU support to functional tests (#1475)
  • Added GPU support to transform librosa compatibility test (#1439)
  • Added GPU support to functional librosa compatibility test (#1436)
  • Improved HTTP fetch test reliability (#1512)
  • Refactored functional batch consistency test (#1341)
  • Refactored test classes for complex (#1491)
  • Refactored sox_io load test (#1394)
  • Refactored Kaldi compatibility tests (#1359)
  • Refactored functional test (#1435, #1463)
  • Refactored transform tests (#1356)
  • Refactored librosa compatibility test (#1350)
  • Refactored sox compatibility test (#1344)
  • Refactored librosa compatibility test (#1259)
  • Removed the use of I/O functions in batch consistency test (#1521)
  • Removed skipIfNoSoxBackend (#1390)
  • Removed VAD from batch consistency tests (#1451)
  • Replaced deprecated floor_divide with div (#1455)
  • Replaced torch.assert_allclose with assertEqual (#1387)
  • Shortened torchaudio.functional.lfilter autograd tests input size (#1443)
  • Updated torchaudio.transforms.InverseMelScale comparison test (#1437)

Bug Fixes

  • Updated torchaudio.transforms.TimeMasking and torchaudio.transforms.FrequencyMasking to perform out-of-place masking (#1481)
  • Annotated power of torchaudio.transforms.MelSpectrogram as float only (#1572)

Performance

  • Adopted torch.nn.functional.conv1d in torchaudio.functional.lfilter (#1318)
  • Added C++ implementation of torchaudio.functional.overdrive (#1299)

Documentation

  • Updated docs (#1550)
  • Reformatted resample docs (#1548)
  • Updated resampling documentation (#1519)
  • Added the clarification that sox_effects.apply_effects_tensor is CPU-only (#1459)
  • Removed instructions on using external sox (#1365, #1281)
  • Added navigation with left/right arrow keys (#1336)
  • Fixed docstring of sliding_window_cmn (#1383)
  • Updated contributing guide (#1372)
  • Fixed broken links in contribution guide (#1361)
  • Added Windows build instructions (#1440)
  • Fixed typo (#1471, #1397, #1293)
  • Added WER to readme in wav2letter pipeline (#1470)
  • Fixed wav2letter usage example (#1060)
  • Added Google Analytics support (#1466)