bug fix Pretrained.load_audio #1303
base: develop
79f8dbb
to
625cda7
Compare
Thank you for this fix! It was incredibly frustrating having the "transcribe_file" function of my pretrained model spit out hundreds of sym-links. |
@anautsch, any update on this PR? |
Audio & model loading is a larger complex - related PRs & issues are shown in CI/CD -> Related PRs: #1254 #1268 #1303. This discussion touches on how we want to handle pretrained models, mini-datasets, etc. as well. Right now, this topic complex is triaged for pushing v0.6 PRs (resolving the 1-year old ctc PR). |
@anautsch Thanks for looking into this merge request. As I described above, there is a bug in the function load_audio. Let's assume you have two files with the same name but located in different folders. |
Hi @goexle yes, from what I understand, your point is:
This effect occurs when local paths are corrupted - which is an improper way to handle invalid local paths. Your PR proposes to drop this call
savedir = pathlib.Path(savedir)
savedir.mkdir(parents=True, exist_ok=True)
sourcefile = f"{source}/{filename}"
destination = savedir / save_filename
if destination.exists() and not overwrite:
    MSG = f"Fetch {filename}: Using existing file/symlink in {str(destination)}."
    logger.info(MSG)
    return destination
if str(source).startswith("http:") or str(source).startswith("https:"):
    [...]
elif pathlib.Path(source).is_dir():
    # Interpret source as local directory path
    # Just symlink
    sourcepath = pathlib.Path(sourcefile).absolute()
    MSG = f"Fetch {filename}: Linking to local file in {str(sourcepath)}."
    logger.info(MSG)
    _missing_ok_unlink(destination)
    destination.symlink_to(sourcepath)
else:
    [...]
return destination
The elided parts create symlinks to downloaded HF data. If one gives a local path, a symlink to that local path is created. If the local path is not valid from the perspective of the Python program and from where it is run, HF downloads are triggered instead. Another form of warning message would be more appropriate, indeed. Yet, since this fetch does what it should do for local files, but also checks on HF downloads ("if the local path is invalid, then it must be a download" strategy), simply removing it deactivates other functions that are desired in SpeechBrain. It's a bigger topic because the reason for this particular issue to occur is part of a larger whole - and how this is approached is on our todo list (after the ctc PR, which has been open for about a year by now). |
Your argument on the two files is about the filename itself, right?
Then, the symlink would be simply This fetch logic only handles well if source=destination through @Gastron @TParcollet why |
@anautsch thanks for explaining, I understand it's more complex. So you want to be able to fetch audio files from huggingface with the same method load_audio?
Yes exactly, that is the problem. The symlink would just be the filename, which is not a unique name. |
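The collision described here can be reproduced without SpeechBrain. The sketch below is a simplified stand-in for the local-directory branch of fetch quoted earlier in the thread; `fetch_sketch` and `demo_collision` are hypothetical names, and plain text files stand in for audio files.

```python
import pathlib
import tempfile


def fetch_sketch(filename, source, savedir, overwrite=False):
    """Simplified stand-in for the local-directory branch of fetch (assumption)."""
    savedir = pathlib.Path(savedir)
    savedir.mkdir(parents=True, exist_ok=True)
    # The destination is keyed on the bare filename only, not on the source folder.
    destination = savedir / filename
    if destination.exists() and not overwrite:
        # Early return: the existing symlink is reused without checking its target.
        return destination
    sourcepath = (pathlib.Path(source) / filename).absolute()
    destination.unlink(missing_ok=True)
    destination.symlink_to(sourcepath)
    return destination


def demo_collision():
    """Fetch two same-named files from different folders and read both back."""
    with tempfile.TemporaryDirectory() as root:
        root = pathlib.Path(root)
        for speaker, content in (("id10001", "first"), ("id10002", "second")):
            (root / speaker).mkdir()
            (root / speaker / "00001.wav").write_text(content)
        savedir = root / "links"
        first = fetch_sketch("00001.wav", root / "id10001", savedir).read_text()
        second = fetch_sketch("00001.wav", root / "id10002", savedir).read_text()
        return first, second
```

Under these assumptions the second call returns the content of the first file, because the symlink created by the first call already exists under the same name.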
Trying to understand now, what |
@goexle are you using Windows by any chance? |
@anautsch no, I am using Ubuntu |
Then, the previously created symlink should be simply dropped:
no idea though how this plays out on parallel processing (thinking of possible sources of error) |
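On the parallel-processing concern: unlinking and then re-creating a symlink leaves a short window in which the destination does not exist. One common way to avoid that window on POSIX systems is to create the new link under a temporary name and rename it over the old one. This is only a sketch; `atomic_symlink` is a hypothetical helper, not part of SpeechBrain.

```python
import os
import pathlib
import tempfile
import uuid


def atomic_symlink(target, destination):
    """Point destination at target without a moment where destination is missing.

    Assumes a POSIX filesystem, where renaming over an existing path is atomic.
    """
    destination = pathlib.Path(destination)
    # Temporary link name in the same directory, so the rename stays on one filesystem.
    tmp = destination.with_name(f".{destination.name}.{uuid.uuid4().hex}.tmp")
    os.symlink(target, tmp)
    try:
        # Replaces the old symlink itself, not the file it points to.
        os.replace(tmp, destination)
    except OSError:
        os.unlink(tmp)
        raise
    return destination
```

This does not resolve the underlying naming conflict: two processes that want different targets under the same link name would still overwrite each other, just without a missing-file window.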
Good point, thanks for bringing it up. |
@anautsch I debugged into it.
which causes the symlink to be created. The method path.unlink() you mentioned is called before, so it does not unlink before it exits.
I added a unit test. With this bugfix, it is passing. Without, it will fail. |
audio2 = verification.load_audio(
    "samples/voxceleb_samples/wav/id10001/1zcIwhmdeo4/00001.wav"
)
assert not torch.equal(audio1, audio2) |
please add a final empty line
run ./tests/.run-linters.sh
to see if there's more to be done regarding formatting
""" | ||
source, fl = split_path(path)
path = fetch(fl, source=source, savedir=savedir) |
removing it will avoid this problem but also deactivate other features ;-)
the bug fix needs to happen elsewhere, as this fix will break other things (e.g. resolving HuggingFace URLs)
Thanks! Then, we need to fix that line.
Wondering if
As mentioned, there are other issues created through this fix, e.g. resolving other links from HuggingFace. It's better to target the symlink issue directly. |
tests/unittests/test_interfaces.py
Outdated
def test_load_audio(tmpdir):
    from speechbrain.pretrained.interfaces import Pretrained

    verification = Pretrained.from_hparams( |
Still wondering about loading a 100+MB pretrained model here. I deleted my earlier comments on using a shorter pretrained model. All that is relevant here are these lines
source, fl = split_path(path)
path = fetch(fl, source=source, savedir=savedir)
signal, sr = torchaudio.load(path, channels_first=False)
That's what needs to be tested for the fetch in pretrained interfaces.
Sorry for the back & forth - I see that you are updating while I'm changing my mind... |
@goexle trying to make up for my flipsy mind...
import torch, torchaudio
from speechbrain.pretrained.fetching import fetch
from speechbrain.utils.data_utils import split_path

def test_load_audio(tmpdir):
    def load_audio(path):
        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=tmpdir)
        signal, sr = torchaudio.load(path, channels_first=False)
        return signal
    audio1 = load_audio(
        "samples/voxceleb_samples/wav/id10002/xTV-jFAUKcw/00001.wav"
    )
    audio2 = load_audio(
        "samples/voxceleb_samples/wav/id10001/1zcIwhmdeo4/00001.wav"
    )
    assert not torch.equal(audio1, audio2)
Please let me know if this test gets to the issue. Trying to reproduce the error on my end. |
The calling paths might be off ("samples" -> "../../samples"), depending on where the test is called from. The second audio never reaches the branch that actually unlinks the symlink. At
if destination.exists() and not overwrite:
    MSG = f"Fetch {filename}: Using existing file/symlink in {str(destination)}."
it jumps right there, without validating whether the symlink points to the same target. |
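One way to add the missing validation would be to compare the resolved target of an existing symlink with the requested source before reusing it. The following is a minimal sketch under that assumption; `fetch_local_checked` is a hypothetical name, it covers local paths only and leaves the HTTP and HuggingFace branches of fetch out.

```python
import pathlib
import tempfile


def fetch_local_checked(filename, source, savedir):
    """Symlink a local file into savedir, relinking if a stale link exists."""
    savedir = pathlib.Path(savedir)
    savedir.mkdir(parents=True, exist_ok=True)
    destination = savedir / filename
    sourcepath = (pathlib.Path(source) / filename).absolute()
    if destination.is_symlink():
        if destination.resolve() == sourcepath.resolve():
            # The existing link already points at the requested file.
            return destination
        # Stale link from an earlier call with a different source folder.
        destination.unlink()
    elif destination.exists():
        # A regular file (e.g. a downloaded copy) is left untouched.
        return destination
    destination.symlink_to(sourcepath)
    return destination
```

With this check, two same-named files from different folders no longer shadow each other, although each call still rewrites the single link in savedir.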
The pretrained fetch function has an
which is not accessed through the pretrained interface...
So the test I wrote above would not be able to capture what your valid complaint is either... There needs to be some extension from
def load_audio(self, path, savedir="."):
    """Load an audio file with this model's input spec
    When using a speech model, it is important to use the same type of data,
    as was used to train the model. This means for example using the same
    sampling rate and number of channels. It is, however, possible to
    convert a file from a higher sampling rate to a lower one (downsampling).
    Similarly, it is simple to downmix a stereo file to mono.
    The path can be a local path, a web url, or a link to a huggingface repo.
    """
    source, fl = split_path(path)
    path = fetch(fl, source=source, savedir=savedir)
    signal, sr = torchaudio.load(path, channels_first=False)
    return self.audio_normalizer(signal, sr)
to
def load_audio(self, path, savedir=".", overwrite=False):
    """Load an audio file with this model's input spec
    When using a speech model, it is important to use the same type of data,
    as was used to train the model. This means for example using the same
    sampling rate and number of channels. It is, however, possible to
    convert a file from a higher sampling rate to a lower one (downsampling).
    Similarly, it is simple to downmix a stereo file to mono.
    The path can be a local path, a web url, or a link to a huggingface repo.
    """
    source, fl = split_path(path)
    path = fetch(fl, source=source, savedir=savedir, overwrite=overwrite)
    signal, sr = torchaudio.load(path, channels_first=False)
    return self.audio_normalizer(signal, sr)
alike with the other fetch features: Yet, all derived interfaces will not bother that change, since they only do a
What would be a good way to fix this adequately... |
@Gastron I've no clue how to approach this PR the best way. The pretrained interface needs to change - but how? Aside from the bigger refactoring, what would be a working solution for @goexle ? |
I would be happy to remove the magic audio fetching in However, I am sure we have a bunch of example code where this convenience is used, and that change will break those examples. The modification that we should do in the examples is of course minimal, fetching the example file, making sure we import fetch, something like: from speechbrain.pretrained.interfaces import EncoderDecoderASR, fetch
...
path = fetch("the.examples.file.name", source="the/example's/source", savedir=".")
asr.transcribe_file(path) |
@goexle How about the following? Let's say we would move over the next months to a v0.6 major revision and through that we w/could refactor pretrained interfaces as well, especially the fetch function. As mentioned above, this is not the only PR/issue criticising this. We might revise how/where YAML files are used. In YAML files, one can instantiate objects - see the tokenizer templater for comparison.
tokenizer: !name:speechbrain.tokenizers.SentencePiece.SentencePiece
    model_dir: !ref <output_folder>
    vocab_size: !ref <token_output>
    annotation_train: !ref <train_annotation>
    annotation_read: !ref <annotation_read>
    model_type: !ref <token_type> # ["unigram", "bpe", "char"]
    character_coverage: !ref <character_coverage>
    annotation_list_to_check: [!ref <train_annotation>, !ref <valid_annotation>]
    annotation_format: json
How about a partial function definition through a pretrained model yaml file? I'm thinking about some
pretrained_asr: !new: speechbrain.pretrained.Interfaces.Pretrained.EncoderASR
    [...] # some other parameters
    # custom data I/O
    load_audio: !partial: speechbrain.pretrained.Interfaces.Pretrained.load_audio
        savedir: "./symlink_stash" # we need to refactor how we handle symlinks, too
        overwrite: False
        save_filename: None
        use_auth_token: False
        no_symlinks: True # a suggestion # really wondering why we need symlinks at all, third-party?
        local_paths_only: True # a suggestion
which would pre-define most parameters of load_audio but the one that is mandatory (the audio path). I just copied some parameters from the fetch function that could be put to the load_audio interface with default value. This could be one way; another could be to also partially outline the Yet, I'd be in favor of partially defining @pplantinga is it possible to make an extension to HyperPyYAML that borrows from Eventually, users control how the SB interfaces are used through the YAML files (core functions should be flexible enough to facilitate a plethora of use cases). YAML files could be a better way to feature both. This way could be better than running into needing to maintain lots of different interfaces. (Would just add a flag for whether/not trying to fetch from online at all.) @Gastron why do we need symlinks in the first place? Are there some third-party constraints? (Being verbose on a particular problem to better see the implications of possible solutions to find a solid re-design.) |
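In plain Python, the idea of pre-defining every parameter of load_audio except the audio path corresponds to functools.partial. The sketch below only illustrates that mechanism: `load_audio` here is a hypothetical stand-in that echoes its arguments, and `no_symlinks` and `local_paths_only` are the suggested parameters from the comment above, not existing SpeechBrain options.

```python
from functools import partial


def load_audio(path, savedir=".", overwrite=False, save_filename=None,
               use_auth_token=False, no_symlinks=False, local_paths_only=False):
    """Hypothetical stand-in: returns the arguments it would load with."""
    return {
        "path": path,
        "savedir": savedir,
        "overwrite": overwrite,
        "save_filename": save_filename,
        "use_auth_token": use_auth_token,
        "no_symlinks": no_symlinks,
        "local_paths_only": local_paths_only,
    }


# Everything but the mandatory audio path is fixed up front,
# mirroring the YAML suggestion above.
load_audio_local = partial(
    load_audio,
    savedir="./symlink_stash",
    overwrite=False,
    no_symlinks=True,
    local_paths_only=True,
)
```

A caller then passes only the path, e.g. `load_audio_local("samples/00001.wav")`, and can still override a pre-set keyword for a single call.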
I believe what you are looking for is the |
@anautsch let me try to argue from a different perspective why the current load_audio method should be changed: From a user perspective, I just want to load_audio in the way the pretrained model interface expects it. So I don't want to know that I need to call torchaudio.load with the parameter From a user perspective, I expect a simple behaviour of such a method. And as long as the unit test in this PR is failing, the load_audio has a major bug. And it needs to be fixed, this weird behaviour is not what you expect from this method. Now to the remote fetching part: This could look like that
or a separate method
What do you think? |
Hi @goexle thank you for getting back on this PR! The remote fetch is critical to our HuggingFace examples.
from speechbrain.pretrained import EncoderASR
asr_model = EncoderASR.from_hparams(source="speechbrain/asr-wav2vec2-librispeech", savedir="pretrained_models/asr-wav2vec2-librispeech")
asr_model.transcribe_file("speechbrain/asr-wav2vec2-commonvoice-en/example.wav")
see: https://huggingface.co/speechbrain/asr-wav2vec2-librispeech The file There are some 40+ SpeechBrain repos on HuggingFace. Your second proposal reads cleaner, as functions are separated by purpose. When it comes to integrating them, however, the first one might be more suitable - as the The issue that is now in my head:
The next issue is the general handling of symlinks and a re-iteration of how HuggingFace is integrated (will be a bigger next-next topic over the next months). There, I simply lack information to make a good call here. A solution might be reverted later on. But that's maybe just how it is with OpenSource & continuous integration/development. How would you like to move forward? |
I am pretty sure that PR #1817 did not add the |
There is a bug in the function load_audio:
Given the following setup:
There are two files in two different folders which have the same name but different content.
If the function load_audio is executed with these two files, it will always return the content of the file which was loaded first, no matter what's inside the second file.
This is not the expected behavior. Also, I do not see the reason for the symbolic link. It will create a lot of symbolic links in the folder from which the script was executed. This is unexpected as well.
Therefore, I propose to remove the symbolic link.