v0.9.0
torchaudio 0.9.0 Release Note
Highlights
The torchaudio 0.9.0 release includes:
- Many performance improvements (filtering, resampling, spectral operations)
- The popular wav2vec2.0 model architecture
- Improved autograd support
[Beta] Wav2Vec2.0 Model
This release includes model architectures from the wav2vec2.0 paper, along with utility functions that allow importing pretrained model parameters published on fairseq and the Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android, and iOS. Please check out our C++, Android, and iOS examples. The following snippets illustrate how to create a deployable model.
# Import fine-tuned model from Hugging Face Hub
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model
original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)
# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model
original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)
# Build uninitialized model and load state dict
from torchaudio.models import wav2vec2_base
model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())
# Quantize / script / optimize for mobile
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")
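For reference, here is a minimal inference sketch with the imported model (before quantization and scripting). It assumes a 16 kHz mono recording with a hypothetical file name, and that the model returns per-frame label scores as its first output; it only performs a greedy argmax over those scores rather than a full CTC decode.
# Minimal inference sketch (assumptions: 16 kHz mono input named "speech.wav",
# model returning per-frame label scores as its first output).
import torch
import torchaudio
waveform, sample_rate = torchaudio.load("speech.wav")
with torch.no_grad():
    emission, _ = imported(waveform)        # shape: (batch, frame, num_labels)
predicted_ids = emission[0].argmax(dim=-1)  # greedy choice per frame; map to characters with your label set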
Filtering Improvement
The internal implementation of lfilter
has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad
variants.
The following table illustrates the performance improvements compared against previous releases. lfilter was applied to float32 tensors with one channel and varying numbers of frames.
torchaudio version | 256 frames | 512 frames | 1024 frames
0.9 | 0.282 | 0.381 | 0.564
0.8 | 0.493 | 0.780 | 1.37
0.7 | 5.42 | 10.8 | 22.3
Unit: msec
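As a quick illustration of the autograd support described above, the following sketch backpropagates through lfilter on CPU; the first-order filter coefficients are arbitrary values chosen for demonstration, and the same code can be moved to CUDA by sending the tensors to a CUDA device.
# Sketch: gradients flow through lfilter to both the signal and the coefficients.
import torch
import torchaudio.functional as F
waveform = torch.randn(1, 1024, requires_grad=True)
a_coeffs = torch.tensor([1.0, -0.5], requires_grad=True)  # arbitrary denominator coefficients
b_coeffs = torch.tensor([0.4, 0.2], requires_grad=True)   # arbitrary numerator coefficients
filtered = F.lfilter(waveform, a_coeffs, b_coeffs)
filtered.sum().backward()  # populates waveform.grad, a_coeffs.grad, and b_coeffs.grad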
Complex Tensor Migration
torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent the real and imaginary parts. In PyTorch 1.6, new dtypes such as torch.cfloat and torch.cdouble were introduced to represent complex values natively. (In the following, we refer to torchaudio's original convention as pseudo complex types, and PyTorch's native dtypes as native complex types.)
As the native complex types have become mature and stable, torchaudio has started migrating complex functions to the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing and receiving native complex types directly. Users can choose to keep using the pseudo complex type or opt in to the native complex type; however, please note that use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For details of the migration plan, please refer to #1337.
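As a rough sketch of the opt-in, the snippet below uses the return_complex flag listed under New Features to obtain a native complex spectrogram, and torch.view_as_real to bridge to code that still expects the deprecated pseudo complex layout; please verify the exact argument names against the documentation of your installed version.
# Sketch: opting in to the native complex dtype (power=None keeps raw STFT values).
import torch
import torchaudio.transforms as T
waveform = torch.randn(2, 16000)
spectrogram = T.Spectrogram(power=None, return_complex=True)
spec = spectrogram(waveform)       # native complex dtype (torch.cfloat)
pseudo = torch.view_as_real(spec)  # pseudo complex layout, if legacy code still needs it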
Additionally, switching the internal implementation to the native complex types improved performance. Since the internal implementation uses the native complex type regardless of which complex type is passed or returned, users automatically benefit from this improvement.
The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to operate on a float32 tensor with two channels and 256 frames.
CPU
torchaudio version | Spectrogram | TimeStretch | GriffinLim
0.9 | 0.229 | 12.6 | 3320
0.8 | 0.283 | 126 | 5320
Unit: msec
CUDA
torchaudio version | Spectrogram | TimeStretch | GriffinLim
0.9 | 0.195 | 0.599 | 36
0.8 | 0.219 | 0.687 | 60.2
Unit: msec
Improved Autograd Support
Along with the Complex Tensor Migration and Filtering Improvement work mentioned above, more tests were added to ensure autograd support. The following operations are now guaranteed to support autograd up to second order (a short example follows the note at the end of this section).
Functionals
lfilter
allpass_biquad
biquad
band_biquad
bandpass_biquad
bandreject_biquad
bass_biquad
equalizer_biquad
treble_biquad
highpass_biquad
lowpass_biquad
Transforms
AmplitudeToDB
ComputeDeltas
Fade
GriffinLim
TimeMasking
FrequencyMasking
MFCC
MelScale
MelSpectrogram
Resample
SpectralCentroid
Spectrogram
SlidingWindowCmn
TimeStretch*
Vol
NOTE:
- Autograd tests for transforms also cover the following functionals:
amplitude_to_DB
spectrogram
griffinlim
resample
phase_vocoder*
mask_along_axis_iid
mask_along_axis
gain
spectral_centroid
- torchaudio.transforms.TimeStretch and torchaudio.functional.phase_vocoder call atan2, which is not differentiable around zero. Therefore, these functions are differentiable only when the input spectrogram does not contain values around zero.
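As a minimal sketch of what this guarantee means in practice, the example below runs torch.autograd.gradcheck and gradgradcheck on a small Spectrogram transform; this is only an illustration under stated assumptions, not the actual test suite, and the transform is converted to double precision because gradcheck requires it.
# Sketch: checking first- and second-order gradients of a transform.
import torch
import torchaudio.transforms as T
transform = T.Spectrogram(n_fft=64, power=2.0).double()  # gradcheck needs float64 inputs and buffers
waveform = torch.randn(1, 256, dtype=torch.double, requires_grad=True)
torch.autograd.gradcheck(transform, (waveform,))      # first order
torch.autograd.gradgradcheck(transform, (waveform,))  # second order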
[Beta] Resampling Improvement
In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised (a short usage sketch follows the list).
- Kaiser window support has been added for a wider range of resampling quality.
- A rolloff parameter has been added for anti-aliasing control.
- torchaudio.transforms.Resample precomputes the kernel using float64 precision and caches it for even faster operation.
- A new entry point, torchaudio.functional.resample, has been added, and the original entry point, torchaudio.compliance.kaldi.resample_waveform, is deprecated.
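The snippet below sketches the two entry points. The parameter names shown here (rolloff and resampling_method="kaiser_window") reflect the features listed above but are assumptions on my part; please verify them against the documentation of your installed version.
# Sketch: functional and transform-based resampling (parameter names are assumptions).
import torch
import torchaudio
import torchaudio.functional as F
waveform = torch.randn(2, 16000)  # two channels, one second at 16 kHz
resampled = F.resample(waveform, orig_freq=16000, new_freq=8000,
                       rolloff=0.99, resampling_method="kaiser_window")
# Transform entry point: the kernel is precomputed in float64 and cached;
# move the instantiated transform to the target device before use.
resampler = torchaudio.transforms.Resample(orig_freq=16000, new_freq=8000)
resampled = resampler(waveform)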
The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on a float32 tensor with two channels and a one-second duration.
CPU
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k
0.9 | 0.192 | 0.559 | 0.478 | 0.467
0.8 | 0.537 | 0.753 | 43.9 | 17.6
Unit: msec
CUDA
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k
0.9 | 0.203 | 0.172 | 0.213 | 0.212
0.8 | 0.860 | 0.559 | 116 | 46.7
Unit: msec
Improved Windows Support
torchaudio implements some operations in C++ for reasons such as performance and integration with third-party libraries. This C++ module was previously only available on Linux and macOS. In this release, Windows packages also come with the C++ module.
The C++ module in the Windows package includes the efficient filtering implementation mentioned above; however, the “sox_io” backend and torchaudio.functional.compute_kaldi_pitch are not included.
I/O Functions Migration
Since the 0.6 release, we have continuously improved I/O functionality. Specifically, in 0.8 the default backend was changed from “sox” to “sox_io”, and a similar API change was applied to the “soundfile” backend. The 0.9 release concludes this migration by removing the deprecated backends. For details, please refer to #903.
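For reference, here is a minimal sketch of the I/O API that remains after the migration; the file names are hypothetical.
# Sketch: info / load / save with the default backend (file names are hypothetical).
import torchaudio
metadata = torchaudio.info("input.wav")               # AudioMetaData: sample_rate, num_frames, num_channels, ...
waveform, sample_rate = torchaudio.load("input.wav")  # Tensor of shape (channel, time) and the sample rate
torchaudio.save("output.wav", waveform, sample_rate)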
Backward Incompatible Changes
I/O
- Deprecated backends and functions were removed (#1311, #1329, #1362)
  - Please see #903 for the migration.
- Added validation of the number of channels when saving GSM (#1384)
  - Please make sure that the signal has only one channel when saving into GSM.
Ops
- Removed deprecated normalized argument from torchaudio.functional.griffinlim (#1369)
  - This argument was never used. Please remove the argument from your call.
- Renamed torchaudio.functional.sliding_window_cmn arg for correctness (#1347)
  - The first argument is supposed to be a spectrogram. If you have used the keyword argument waveform=..., please change it to specgram=...
- Changed torchaudio.transforms.Resample to precompute and cache the resampling kernel (#1499, #1514)
  - To use the transform on devices other than CPU, please move the instantiated object to the target device.
    resampler = torchaudio.transforms.Resample(orig_freq=8000, new_freq=44100)
    resampler.to(torch.device("cuda"))
Dataset
- Removed deprecated arguments from CommonVoice (#1534)
  - torchaudio no longer supports programmatic download of the Common Voice dataset. Please remove the arguments from your code.
Deprecations
- Deprecated the use of the pseudo complex type (#1445, #1492)
  - torchaudio is adopting the native complex type, and the use of the pseudo complex type and the related utility functions is now deprecated. Please refer to #1337 for the migration process.
- Deprecated torchaudio.compliance.kaldi.resample_waveform (#1533)
  - Please use torchaudio.functional.resample.
- torchaudio.transforms.MelScale now expects a valid n_stft value (#1515)
  - Please provide a valid value to n_stft.
New Features
[Beta] Wav2Vec2.0
- Added wav2vec2.0 model (#1529)
- Added wav2vec2.0 HuggingFace importer (#1530)
- Added wav2vec2.0 fairseq importer (#1531)
- Added speech recognition C++ example (#1538)
  - Please refer to the C++ example for details.
Filtering
- Added C++ implementation of torchaudio.functional.lfilter (#1319)
- Added autograd support to torchaudio.functional.lfilter (#1310, #1441)
[Beta] Resampling
- Added torchaudio.functional.resample (#1402)
- Added rolloff parameter (#1488)
- Added kaiser window support to resampling (#1509)
- Added kernel caching mechanism in torchaudio.transforms.Resample (#1499, #1514, #1556)
- Skip resampling when sampling rate is not changed (#1537)
Native Complex Tensor
- Added complex tensor support to torchaudio.functional.phase_vocoder and torchaudio.transforms.TimeStretch (#1410)
- Added return_complex to torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram (#1366, #1551)
Improvements
I/O
- Added file path to I/O error messages (#1523)
- Added __str__ override to AudioMetaData for easy printing (#1339)
- Fixed uninitialized variable in sox/utils.cpp (#1306)
- Replaced UB sox conversion macros with tensor op (#1370)
- Removed check_length from validate_input_file (#1312)
Ops
- Added warning for non-integer resampling frequencies (#1490)
- Adopted native complex tensors in torchaudio.functional.griffinlim (#1368)
- Prohibited scripting torchaudio.transforms.MelScale when n_stft is invalid (#1505)
- Added input dimension check to VAD (#1513)
- Added HTK-compatible option to Mel-scale conversion (#593)
Models
- Added vanilla DeepSpeech model (#1399)
Datasets
- Fixed checksum for the YESNO dataset (#1405)
Misc
- Added missing transforms to __all__ (#1458)
- Removed reference_cast in make_boxed_from_unboxed_functor (#1300)
- Removed unused normalized constant from torchaudio.transforms.GriffinLim (#1433)
- Removed unused helper function (#1396)
Examples
- Added libtorchaudio C++ example (#1349)
- Refactored libtorchaudio example (#1486)
- Replaced librosa's Mel scale conversion with torchaudio's in WaveRNN example (#1444)
Build
- Updated config.guess to support source build on recent architectures (#1484)
- Explicitly disabled wavpack when building SoX (#1462)
- Added ROCm support to source build (#1411)
- Added Windows C++ binary build (#1345, #1371)
- Made kaldi selective in build (#1342)
- Made sox selective (#1338)
Testing
- Added autograd test for torchaudio.functional.lfilter and biquad variants (#1400, #1438)
- Added autograd test for transforms (overview: #1414)
  - torchaudio.transforms.FrequencyMasking (#1498)
  - torchaudio.transforms.SlidingWindowCmn (#1482)
  - torchaudio.transforms.MelScale (#1467)
  - torchaudio.transforms.Vol (#1460)
  - torchaudio.transforms.TimeStretch (#1420)
  - torchaudio.transforms.AmplitudeToDB (#1447)
  - torchaudio.transforms.GriffinLim (#1421)
  - torchaudio.transforms.SpectralCentroid (#1425)
  - torchaudio.transforms.ComputeDeltas (#1422)
  - torchaudio.transforms.Fade (#1424)
  - torchaudio.transforms.Resample (#1416)
  - torchaudio.transforms.MFCC (#1415)
  - torchaudio.transforms.Spectrogram / MelSpectrogram (#1340)
- Added test for a batch of different items in the functional batch consistency test. (#1315)
- Added test for validating torchaudio.functional.lfilter shape (#1360)
- Added TorchScript test for torchaudio.functional.resample (#1516)
- Added TorchScript test for torchaudio.functional.phase_vocoder (#1379)
- Added steps to save and load the scripted object in TorchScript (#1446)
- Added GPU support to functional tests (#1475)
- Added GPU support to transform librosa compatibility test (#1439)
- Added GPU support to functional librosa compatibility test (#1436)
- Improved HTTP fetch test reliability (#1512)
- Refactored functional batch consistency test (#1341)
- Refactored test classes for complex (#1491)
- Refactored sox_io load test (#1394)
- Refactored Kaldi compatibility tests (#1359)
- Refactored functional test (#1435, #1463)
- Refactored transform tests (#1356)
- Refactored librosa compatibility test (#1350)
- Refactored sox compatibility test (#1344)
- Refactored librosa compatibility test (#1259)
- Removed the use of I/O functions in batch consistency test (#1521)
- Removed skipIfNoSoxBackend (#1390)
- Removed VAD from batch consistency tests (#1451)
- Replaced deprecated floor_divide with div (#1455)
- Replaced torch.assert_allclose with assertEqual (#1387)
- Shortened torchaudio.functional.lfilter autograd tests input size (#1443)
- Updated torchaudio.transforms.InverseMelScale comparison test (#1437)
Bug Fixes
- Updated torchaudio.transforms.TimeMasking and torchaudio.transforms.FrequencyMasking to perform out-of-place masking (#1481)
- Annotate power of torchaudio.transforms.MelSpectrogram as float only (#1572)
Performance
- Adopted torch.nn.functional.conv1d in torchaudio.functional.lfilter (#1318)
- Added C++ implementation of torchaudio.functional.overdrive (#1299)
Documentation
- Update docs (#1550)
- Reformat resample docs (#1548)
- Updated resampling documentation (#1519)
- Added the clarification that sox_effects.apply_effects_tensor is CPU-only (#1459)
- Removed instructions on using external sox (#1365, #1281)
- Added navigation with left/right arrow keys (#1336)
- Fixed docstring of sliding_window_cmn (#1383)
- Update contributing guide (#1372)
- Fix broken links in contribution guide (#1361)
- Added Windows build instructions (#1440)
- Fixed typo (#1471, #1397, #1293)
- Added WER to readme in wav2letter pipeline (#1470)
- Fixed wav2letter usage example (#1060)
- Added Google Analytics support (#1466)