Improve inference interfaces audio loading #2360
asumagic started this conversation in Feature Request
Replies: 1 comment
@asumagic on it... but no PR yet
🚀 The feature
At the moment, most inference interface methods (as in `interfaces.py`) require passing an `audio_file` argument, which is then passed to the generic `Pretrained.load_audio` method. That method fetches the path, downloads or symlinks the file locally, and applies audio normalization (if present).

1. Interfaces generally only accept (local or remote) paths, not tensors
The problem is that it is rather common to have an audio tensor instead of an actual file on disk.
The hacky, in-memory workaround of passing a `BytesIO` does not work, because a path to a physical or remote file is expected.

Unfortunately, this makes the inference interfaces very clumsy to use in this case, and you often have to navigate through the source code to copy a patched version into your codebase. For some interfaces this is trivial; for others (like VAD) it's a mess.
Some of the classes provide methods that accept tensors directly. However, those are usually lower level, and often you need to replicate a bunch of code from the higher level methods anyway, which should not be necessary.
I believe this problem is rather common: it has come up at least once in discussions, several times in my personal use of SB, and several times for colleagues. Solving it would probably make the interfaces more user-friendly.
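To make the cost of a path-only API concrete, here is a minimal sketch of the round trip through disk that it pushes callers toward. It uses only the Python standard library, and a plain list of floats stands in for a 1-D audio tensor. The helper name `samples_to_temp_wav` is hypothetical and not part of SpeechBrain.

```python
import struct
import tempfile
import wave
from pathlib import Path
from typing import Sequence


def samples_to_temp_wav(samples: Sequence[float], sample_rate: int) -> Path:
    """Write mono float samples in [-1, 1] to a temporary 16-bit PCM WAV file.

    `samples` is a plain-Python stand-in for a 1-D audio tensor.
    """
    # Clip to [-1, 1] and scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]

    # delete=False so the file outlives this function; the caller must remove it.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        path = Path(tmp.name)

    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return path
```

The returned path could then be passed wherever an `audio_file` argument is expected. Audio that was already in memory gets quantized to 16 bits, written to disk, and decoded again, and the caller has to clean up the temporary file afterwards.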
2. Some methods use `torchaudio.load` instead of `Pretrained.load_audio`
In all cases but one, this is because these methods want to load specific chunks of the file. It would not be a problem in itself but:
- Changing `load_audio` will not suffice to fix all interfaces.
- It bypasses `audio_normalization` and isn't documented. I have not checked if the affected interfaces actually make use of these (it might not even make sense when loading parts of the file), but either way, that behavior is not documented.
- The `split_path` logic in `load_audio` is unused in all of these methods.

Solution outline
How exactly to fix this is up to whoever implements the PR.
For the first issue, I believe it could make sense to allow passing either a path or a simple (data)class representing an audio and metadata, i.e. the tensor itself + a sample rate.
If doing so, the implementation should be careful to still allow `str` (local or remote) paths as well as pathlib `Path`s for compatibility purposes.

Generally speaking, it should also be careful about how multi-channel audio is handled, and should ensure that this is properly documented. SB docs do mention the audio shape convention in use somewhere, but interfaces are so high-level and user-facing that the convention should be obvious from their own documentation.
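As one possible shape for the "path or (data)class" idea, here is a minimal sketch. Every name in it is hypothetical: `AudioInput`, `AudioSource` and `resolve_audio` do not exist in SpeechBrain, and `load_from_path` is an injected stand-in for whatever path-based loader an interface already uses.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Union


@dataclass(frozen=True)
class AudioInput:
    """Hypothetical container for in-memory audio (not an existing SpeechBrain class)."""

    samples: Any  # e.g. a torch.Tensor; the shape convention must be documented
    sample_rate: int  # in Hz


AudioSource = Union[str, Path, AudioInput]


def resolve_audio(source: AudioSource, load_from_path: Callable[[str], Any]) -> Any:
    """Return audio samples for a path-like or in-memory source.

    `load_from_path` stands in for the existing path-based loader.
    """
    if isinstance(source, AudioInput):
        # Already in memory: skip fetching and file decoding entirely.
        return source.samples
    if isinstance(source, (str, Path)):
        # Existing behavior: str (local or remote) and pathlib.Path keep working.
        return load_from_path(str(source))
    raise TypeError(f"Unsupported audio source type: {type(source).__name__}")
```

This sketch returns in-memory samples untouched. A real implementation would still need to decide whether to resample or normalize them using `sample_rate`, and to state which tensor layout it expects for multi-channel input.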
For the second issue, it should be checked how much of an issue it is in practice. It is however related to the first issue, because using `torchaudio.load` also prevents using an existing tensor.

Additional context
No response