Improve inference interfaces audio loading #2360

asumagic · 2023-01-16T15:04:02Z

asumagic
Jan 16, 2023
Maintainer

🚀The feature

At the moment, most inference interface methods (as in interfaces.py) require passing an audio_file argument, which is then passed to the generic Pretrained.load_audio method, which fetches the path, downloads or symlinks the file locally, and applies audio normalization (if present).

1. Interfaces generally only accept (local or remote) paths, not tensors

The problem is that it is rather common to have an audio tensor instead of an actual file on disk.
The hacky in-memory workaround of passing a BytesIO object does not work either, because load_audio expects a path to a local or remote file.
Unfortunately, this makes the inference interfaces very clumsy to use in this case, and you often have to dig through the source code and copy a patched version into your codebase. For some interfaces this is trivial; for others (like VAD) it's a mess.
Some of the classes provide methods that accept tensors directly. However, those are usually lower level, and often you need to replicate a bunch of code from the higher level methods anyway, which should not be necessary.

I believe this problem is rather common, because it has appeared at least once in discussions, several times in my personal use of SB, and several times by colleagues. Solving this would probably be beneficial to user friendliness.

2. Some methods use `torchaudio.load` instead of `Pretrained.load_audio`

In all cases but one, this is because these methods want to load specific chunks of the file. It would not be a problem in itself but:

- It means solving issue 1 in load_audio will not suffice to fix all interfaces.
- This bypasses audio_normalization and isn't documented. I have not checked if the affected interfaces actually make use of these (it might not even make sense when loading parts of the file), but either way, that behavior is not documented.
- This is inconsistent with the split_path logic in load_audio which is unused in all of these methods.

Solution outline

How exactly to fix this is up to whoever implements the PR.

For the first issue, I believe it could make sense to allow passing either a path or a simple (data)class representing the audio and its metadata, i.e. the tensor itself + a sample rate.
If doing so, the implementation should be careful to still allow str (local or remote) paths as well as pathlib Paths for compatibility purposes.
Generally speaking it should also be careful about how multi-channel audio is handled, and should ensure that this is properly documented. SB docs do mention the audio shape convention in use somewhere, but interfaces are so high-level and user-facing that it should be obvious for documentation purposes.
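
To make the idea concrete, here is a minimal sketch of what "path or (data)class" could look like. The names `AudioInput`, `AudioSource` and `resolve_audio` are hypothetical and do not exist in SB; `load_from_path` is a stand-in for whatever path-based loader an interface already uses. The waveform is typed loosely on purpose so the sketch stays independent of any tensor library.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Tuple, Union


@dataclass(frozen=True)
class AudioInput:
    """Hypothetical container for in-memory audio (not an existing SB class)."""

    waveform: Any  # e.g. a torch.Tensor; the expected shape should be documented
    sample_rate: int


# Hypothetical alias for what an interface method could accept.
AudioSource = Union[str, Path, AudioInput]


def resolve_audio(
    source: AudioSource,
    load_from_path: Callable[[str], Tuple[Any, int]],
) -> AudioInput:
    """Return in-memory audio, loading from a path only when necessary.

    `load_from_path` is an assumed callable returning (waveform, sample_rate).
    """
    if isinstance(source, AudioInput):
        # Already in memory: nothing to fetch, download or symlink.
        return source
    if isinstance(source, (str, Path)):
        # Keep accepting str and pathlib paths for compatibility.
        waveform, sample_rate = load_from_path(str(source))
        return AudioInput(waveform, sample_rate)
    raise TypeError(f"Unsupported audio source: {type(source).__name__}")
```

With something along these lines, an interface method would call `resolve_audio` once at the top and work on the result, so callers holding a tensor would not need to write it to disk first. Whether normalization should also apply to in-memory input is left open here.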

For the second issue, it should be checked how much of a problem it is in practice. It is however related to the first issue, because using torchaudio.load also prevents using an existing tensor.
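
As a rough illustration of how chunked loading could support both cases, the sketch below slices an in-memory waveform directly and only defers to a path-based reader otherwise. `read_chunk` and `load_chunk_from_path` are hypothetical names; the in-memory branch assumes time is the first dimension of the waveform, which would need to match whatever shape convention the interfaces document.

```python
from pathlib import Path
from typing import Any, Callable, Union


def read_chunk(
    source: Union[str, Path, Any],
    start: int,
    num_frames: int,
    load_chunk_from_path: Callable[[str, int, int], Any],
) -> Any:
    """Return `num_frames` frames starting at frame `start`.

    `load_chunk_from_path(path, start, num_frames)` is an assumed callable.
    With torchaudio it could wrap
    `torchaudio.load(path, frame_offset=start, num_frames=num_frames)`.
    """
    if isinstance(source, (str, Path)):
        # Path input: read only the requested part of the file.
        return load_chunk_from_path(str(source), start, num_frames)
    # In-memory input. Assumption: time is the first dimension, so plain
    # slicing selects frames for both [time] and [time, channels] layouts.
    return source[start : start + num_frames]
```

Routing every chunked read through one helper like this would also give a single place to decide, and document, whether normalization applies to partial loads.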

Additional context

No response

anautsch · 2023-01-16T15:28:03Z

anautsch
Jan 16, 2023
Collaborator

@asumagic on it... but no PR yet
https://github.com/anautsch/speechbrain/tree/pretrained-ddp/tests/templates/fetching_ddp_dynbatch_finetuning

