
bug fix Pretrained.load_audio #1303

Open
wants to merge 7 commits into base: develop

Conversation

@goexle commented Feb 16, 2022

There is a bug in the function load_audio:

Given the following setup: there are two files in two different folders that have the same name but different content.

If load_audio is called with each of these two files, it always returns the content of the file that was loaded first, no matter what is inside the second file.

This is not the expected behavior. I also do not see the reason for the symbolic link: it creates a lot of symbolic links in the folder from which the script was executed, which is unexpected as well.

Therefore, I propose to remove the symbolic link.

@MClarkTurner

Thank you for this fix! It was incredibly frustrating having the "transcribe_file" function of my pretrained model spit out hundreds of symlinks.

@anautsch (Collaborator) commented Apr 5, 2022

There are similarities with #1352 #1268 #1254 - I'm on it :)

@anautsch anautsch self-requested a review April 5, 2022 08:56
@anautsch anautsch added this to In progress in CI/CD via automation Apr 21, 2022
@anautsch anautsch moved this from In progress to Performance & housekeeping in CI/CD Apr 21, 2022
@mravanelli (Collaborator)

@anautsch, any update on this PR?

@anautsch (Collaborator)

Audio & model loading is a larger complex of issues - related PRs & issues are collected in CI/CD -> Performance & housekeeping, among others.

Related PRs: #1254 #1268 #1303
Related issues: #1070 #1177 #1253 #1278 #1291 #1341 #1352 #1358

This discussion also touches on how we want to handle pretrained models, mini-datasets, etc.
Data housekeeping and a recipe testing strategy need to be considered here as well. This is more of a systemic question than a bunch of single special cases.

Right now, this topic complex is triaged behind pushing the v0.6 PRs (resolving the 1-year-old CTC PR).

@goexle (Author) commented May 19, 2022

@anautsch Thanks for looking into this merge request.
But I don't understand why this MR is about model loading etc.

As I described above, there is a bug in the function load_audio.

Let's assume you have two files with the same name, located in different folders.
First, you call load_audio with the first file. Then, you call load_audio with the second file. The second call returns the content of the first file because of the symbolic link, so you get the wrong content for the second file.

@anautsch (Collaborator)

Hi @goexle yes, from what I understand, your point is:

  • using local audios only causes undesired effects, because HuggingFace (HF) should not matter then
  • yet, HF downloads are triggered, so remove them

This effect occurs when local paths are corrupted - and handling invalid local paths this way is improper.
The core point of your remark is absolutely valid!
Yet, there is also the need to provide HF download functionality - which is attempted here through the same function interface of SpeechBrain.

Your PR proposes to drop this call: `path = fetch(fl, source=source, savedir=savedir)` - let's take a look at this fetch:

savedir = pathlib.Path(savedir)
savedir.mkdir(parents=True, exist_ok=True)
sourcefile = f"{source}/{filename}"
destination = savedir / save_filename
if destination.exists() and not overwrite:
    MSG = f"Fetch {filename}: Using existing file/symlink in {str(destination)}."
    logger.info(MSG)
    return destination
if str(source).startswith("http:") or str(source).startswith("https:"):
    [...]
elif pathlib.Path(source).is_dir():
    # Interpret source as local directory path
    # Just symlink
    sourcepath = pathlib.Path(sourcefile).absolute()
    MSG = f"Fetch {filename}: Linking to local file in {str(sourcepath)}."
    logger.info(MSG)
    _missing_ok_unlink(destination)
    destination.symlink_to(sourcepath)
else:
    [...]
return destination

The elided parts create symlinks to downloaded HF data. If one gives a local path, a symlink to that local path is created. If the local path is not valid from the perspective of the Python program and from where it is run, an HF download is triggered instead. Another form of warning message would be more appropriate, indeed. Yet, since this fetch does what it should do for local files, but also covers HF downloads (an "if the local path is invalid, then it must be a download" strategy), simply removing it deactivates other functions that are desired in SpeechBrain.

It's a bigger topic because the reason for this particular issue is part of a larger whole - and how to approach that is on our todo list (after the CTC PR, which has been open for about a year by now).

@anautsch (Collaborator) commented May 19, 2022

Your argument on the two files is about the filename itself, right?

  • first: abc/audio.wav
  • second: def/audio.wav

Then, the symlink would be simply audio.wav, right?
Which always references the first audio, never the second. That only works if all audios have unique filenames - across all datasets. A pretty strong assumption...

This fetch logic only behaves well if source=destination (via destination.exists() -> return destination) - otherwise, it creates symlinks that are unique by filename only, not by path.

@Gastron @TParcollet why destination.symlink_to(sourcepath) and not simply destination = sourcepath ?
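The collision can be reproduced with a minimal standalone sketch (hypothetical file names and helper; this mimics only the local-directory branch of fetch, not SpeechBrain itself):

```python
# Minimal reproduction of the symlink collision (hypothetical file names;
# fetch_like mimics only the local-directory branch of fetch).
import pathlib
import tempfile

def fetch_like(filename, source, savedir):
    savedir = pathlib.Path(savedir)
    savedir.mkdir(parents=True, exist_ok=True)
    destination = savedir / filename      # keyed by filename only!
    if destination.exists():              # second call short-circuits here
        return destination
    sourcepath = (pathlib.Path(source) / filename).absolute()
    destination.symlink_to(sourcepath)
    return destination

root = pathlib.Path(tempfile.mkdtemp())
for sub, content in [("abc", "first"), ("def", "second")]:
    (root / sub).mkdir()
    (root / sub / "audio.wav").write_text(content)

text_a = fetch_like("audio.wav", root / "abc", root / "save").read_text()
text_b = fetch_like("audio.wav", root / "def", root / "save").read_text()
print(text_a, text_b)  # first first  <- both resolve to abc/audio.wav
```

On the second call, destination.exists() is already true, so the stale symlink to abc/audio.wav is returned unchanged.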

@goexle (Author) commented May 19, 2022

@anautsch thanks for explaining, I understand it's more complex. So you want to be able to fetch audio files from HuggingFace with the same load_audio method?

> Your argument on the two files is about the filename itself, right?
>
> * first: `abc/audio.wav`
> * second: `def/audio.wav`
>
> Then, the symlink would be simply audio.wav, right? Which always references the first audio, never the second. That only works if all audios are unique in their filename - across all datasets. Pretty strong assumption...
>
> This fetch logic only handles well if source=destination through destination.exists() -> return destination - otherwise, it attempts to create symlinks which are not unique identifiers by path - but by filename.
>
> @Gastron @TParcollet why destination.symlink_to(sourcepath) and not simply destination = sourcepath ?

Yes, exactly, that is the problem. The symlink name is just the filename, which is not unique.

@anautsch (Collaborator)

Trying to understand now what _missing_ok_unlink(destination) is doing. My gut feeling is still that this is about invalid filenames in the first place.

@anautsch (Collaborator)

@goexle are you using Windows by any chance?

@goexle (Author) commented May 19, 2022

@anautsch no, I am using Ubuntu

@anautsch (Collaborator)

Then, the previously created symlink should be simply dropped:


No idea, though, how this plays out with parallel processing (thinking of possible sources of error).

@goexle (Author) commented May 19, 2022

Good point, thanks for bringing it up.
Strange that this does not seem to work.
I will try to debug it once I find time.

@goexle (Author) commented May 24, 2022

@anautsch I debugged into it.
It is line

destination.symlink_to(sourcepath)

which causes the symlink to be created. The method path.unlink() you mentioned is called before, so it does not unlink before it exits.

I added a unit test. With this bugfix, it is passing. Without, it will fail.

audio2 = verification.load_audio(
    "samples/voxceleb_samples/wav/id10001/1zcIwhmdeo4/00001.wav"
)
assert not torch.equal(audio1, audio2)
Review comment (Collaborator):

Please add a final empty line.
Run ./tests/.run-linters.sh to see if there's more to be done regarding formatting.

    """
    source, fl = split_path(path)
    path = fetch(fl, source=source, savedir=savedir)
Review comment (Collaborator):

Removing it will avoid this problem, but it also deactivates other features ;-)
The bug fix needs to happen elsewhere, as this fix will break other things (e.g. resolving HuggingFace URLs).

@anautsch (Collaborator)

> I debugged into it. It is line
>
> destination.symlink_to(sourcepath)

Thanks! Then, we need to fix that line.

> which causes the symlink to be created. The method path.unlink() you mentioned is called before, so it does not unlink before it exits.

Wondering if path.unlink() is buggy, which I can't believe - could something else be going on? Like permissions, or the file still being in use (from the machine's point of view)?

> I added a unit test. With this bugfix, it is passing. Without, it will fail.

As mentioned, this fix creates other issues, e.g. resolving other links from HuggingFace. It's better to target the symlink issue directly.

def test_load_audio(tmpdir):
    from speechbrain.pretrained.interfaces import Pretrained

    verification = Pretrained.from_hparams(
Review comment (Collaborator):

Still wondering about loading a 100+ MB pretrained model here. I deleted my earlier comments about using a smaller pretrained model. All that is relevant here are these lines:

        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=savedir)
        signal, sr = torchaudio.load(path, channels_first=False)

That's what needs to be tested for the fetch in pretrained interfaces.

@anautsch (Collaborator)

Sorry for the back & forth - I see that you are updating while I'm changing my mind...

@anautsch (Collaborator)

@goexle trying to make up for my flip-flopping mind...

import torch, torchaudio
from speechbrain.pretrained.fetching import fetch
from speechbrain.utils.data_utils import split_path


def test_load_audio(tmpdir):
    def load_audio(path):
        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=tmpdir)
        signal, sr = torchaudio.load(path, channels_first=False)
        return signal

    audio1 = load_audio(
        "samples/voxceleb_samples/wav/id10002/xTV-jFAUKcw/00001.wav"
    )
    audio2 = load_audio(
        "samples/voxceleb_samples/wav/id10001/1zcIwhmdeo4/00001.wav"
    )
    assert not torch.equal(audio1, audio2)

Please let me know if this test gets to the issue. Trying to reproduce the error on my end.

@anautsch (Collaborator)

The calling paths might be off ("samples" -> "../../samples"), depending on from where the test is called.

The second audio never reaches the branch that actually unlinks the symlink:

    if destination.exists() and not overwrite:
        MSG = f"Fetch {filename}: Using existing file/symlink in {str(destination)}."

it returns right there, without validating whether the existing symlink points to the same target or not.
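One possible direction (a sketch only, not SpeechBrain's actual code; fetch_checked is a hypothetical helper): before reusing an existing destination, check whether it is a symlink that resolves to a different source, and relink in that case.

```python
# Sketch of a guarded fetch: reuse an existing destination only if it is
# a real file or a symlink that still resolves to the requested source.
# Illustrative only - not SpeechBrain's actual implementation.
import pathlib
import tempfile

def fetch_checked(filename, source, savedir):
    savedir = pathlib.Path(savedir)
    savedir.mkdir(parents=True, exist_ok=True)
    destination = savedir / filename
    sourcepath = (pathlib.Path(source) / filename).absolute()
    if destination.exists():
        if not destination.is_symlink() or destination.resolve() == sourcepath.resolve():
            return destination        # safe to reuse
        destination.unlink()          # stale symlink: drop and relink below
    destination.symlink_to(sourcepath)
    return destination

root = pathlib.Path(tempfile.mkdtemp())
for sub, content in [("abc", "first"), ("def", "second")]:
    (root / sub).mkdir()
    (root / sub / "audio.wav").write_text(content)

text_a = fetch_checked("audio.wav", root / "abc", root / "save").read_text()
text_b = fetch_checked("audio.wav", root / "def", root / "save").read_text()
print(text_a, text_b)  # first second
```

The open question from above remains: with distributed processing, two nodes could race on the unlink/relink, so this guard alone is not the full answer.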

@anautsch (Collaborator)

The pretrained fetch function has an overwrite parameter,

which is not exposed through the pretrained interface:

path = fetch(fl, source=source, savedir=savedir)

So the test I wrote above would not capture your valid complaint either...

There needs to be some extension from

    def load_audio(self, path, savedir="."):
        """Load an audio file with this model's input spec

        When using a speech model, it is important to use the same type of data,
        as was used to train the model. This means for example using the same
        sampling rate and number of channels. It is, however, possible to
        convert a file from a higher sampling rate to a lower one (downsampling).
        Similarly, it is simple to downmix a stereo file to mono.
        The path can be a local path, a web url, or a link to a huggingface repo.
        """
        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=savedir)
        signal, sr = torchaudio.load(path, channels_first=False)
        return self.audio_normalizer(signal, sr)

to

    def load_audio(self, path, savedir=".", overwrite=False):
        """Load an audio file with this model's input spec

        When using a speech model, it is important to use the same type of data,
        as was used to train the model. This means for example using the same
        sampling rate and number of channels. It is, however, possible to
        convert a file from a higher sampling rate to a lower one (downsampling).
        Similarly, it is simple to downmix a stereo file to mono.
        The path can be a local path, a web url, or a link to a huggingface repo.
        """
        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=savedir, overwrite=overwrite)
        signal, sr = torchaudio.load(path, channels_first=False)
        return self.audio_normalizer(signal, sr)

likewise for the other fetch parameters: save_filename=None, use_auth_token=False

Yet, none of the derived interfaces would pick up that change, since they only do a

waveform = self.load_audio(path)

What would be a good way to fix this adequately...
One way is to extend the function parameters and make that change everywhere; another could be to put it into the hparams file, so it becomes a field of any pretrained interface instead of a function argument. I don't yet see a clear picture of which would be the better design choice. What's your take?

@anautsch (Collaborator)

@Gastron I've no clue how to best approach this PR. The pretrained interface needs to change - but how?
Some users will want to reuse the same symlink over and over - others will want to overwrite it. The rationale for keeping them is to keep an active symlink intact. Should the overwrite parameter be checked before the if-else block, when there is a symlink whose origin differs from the new source path? And how do we handle distributed learning, where one node might not yet have processed a symlinked audio while another node already wants to change the symlink?

Aside from the bigger refactoring, what would be a working solution for @goexle?

@Gastron (Collaborator) commented May 25, 2022

I would be happy to remove the magic audio fetching in load_audio altogether. I think the convenience is only there for example code - not really for actual usage.

However, I am sure we have a bunch of example code where this convenience is used, and the change will break those examples. The modification we would need in the examples is of course minimal - fetching the example file and making sure we import fetch, something like:

from speechbrain.pretrained.interfaces import EncoderDecoderASR, fetch
...
path = fetch("the.examples.file.name", source="the/example's/source", savedir=".")
asr.transcribe_file(path)

@anautsch anautsch changed the base branch from develop to develop-v2 June 1, 2022 15:41
@anautsch (Collaborator) commented Jun 3, 2022

@goexle How about the following?

Let's say we move over the next months to a v0.6 major revision, and through that we would/could refactor the pretrained interfaces as well, especially the fetch function. As mentioned above, this is not the only PR/issue criticising this. We might also revise how/where YAML files are used. In YAML files, one can instantiate objects - see the tokenizer template for comparison.

tokenizer: !name:speechbrain.tokenizers.SentencePiece.SentencePiece
   model_dir: !ref <output_folder>
   vocab_size: !ref <token_output>
   annotation_train: !ref <train_annotation>
   annotation_read: !ref <annotation_read>
   model_type: !ref <token_type> # ["unigram", "bpe", "char"]
   character_coverage: !ref <character_coverage>
   annotation_list_to_check: [!ref <train_annotation>, !ref <valid_annotation>]
   annotation_format: json

How about a partial function definition through a pretrained model yaml file?

I'm thinking about some

pretrained_asr: !new: speechbrain.pretrained.Interfaces.Pretrained.EncoderASR
   [...]  # some other parameters
   # custom data I/O
   load_audio: !partial: speechbrain.pretrained.Interfaces.Pretrained.load_audio
      savedir: "./symlink_stash"  # we need to refactor how we handle symlinks, too
      overwrite: False
      save_filename: None
      use_auth_token: False
      no_symlinks: True  # a suggestion  # really wondering why we need symlinks at all, third-party?
      local_paths_only: True  # a suggestion

which would pre-define most parameters of load_audio but the one that is mandatory (the audio path).

I just copied some parameters from the fetch function that could be put into the load_audio interface with default values. This could be one way; another would be to also partially outline the speechbrain.pretrained.fetching.fetch function in the YAML file.

Yet, I'd be in favor of partially defining load_audio, which is a member of the Pretrained class from which the other pretrained interfaces inherit, so it can be addressed with the getattr and setattr builtins; the fetch function, by contrast, is an import (it should be possible to play with it too, yet I doubt that would yield an elegant solution).

@pplantinga is it possible to make an extension to HyperPyYAML that borrows from functools and makes partial function initialisations available to classes which inherit this function from a parent interface? This could be done with wrappers if needs be. Just wondering if this is a road to go over the next weeks. Does the !new already provide what I suggest with !partial?

Ultimately, users control how the SB interfaces are used through the YAML files (core functions should be flexible enough to facilitate a plethora of use cases). YAML files could be a better way to support both. That would be better than ending up having to maintain lots of different interfaces. (I would just add a flag for whether or not to try fetching from online at all.)

@Gastron why do we need symlinks in the first place? Are there some third-party constraints?

(Being verbose on a particular problem to better see the implications of possible solutions to find a solid re-design.)

@pplantinga (Collaborator)

@pplantinga is it possible to make an extension to HyperPyYAML that borrows from functools and makes partial function initialisations available to classes which inherit this function from a parent interface? This could be done with wrappers if needs be. Just wondering if this is a road to go over the next weeks. Does the !new already provide what I suggest with !partial?

I believe what you are looking for is the !name: tag, which uses functools.partial on the backend
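For illustration, a plain-Python sketch of what the !name: tag does under the hood (load_audio here is a hypothetical stand-in function, not the actual SpeechBrain method):

```python
import functools

# Hypothetical stand-in for Pretrained.load_audio; only `path` stays mandatory.
def load_audio(path, savedir=".", overwrite=False, use_auth_token=False):
    return f"load {path} (savedir={savedir}, overwrite={overwrite})"

# Roughly what `!name:` produces: a partial with keyword arguments pre-bound,
# mirroring the YAML sketch above.
load_audio_partial = functools.partial(
    load_audio, savedir="./symlink_stash", overwrite=False
)

result = load_audio_partial("audio.wav")
print(result)  # load audio.wav (savedir=./symlink_stash, overwrite=False)
```

The caller then supplies only the mandatory audio path; everything else comes from the YAML-side defaults.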

@goexle (Author) commented Jun 24, 2022

@anautsch let me try to argue from a different perspective why the current load_audio method should be changed:

From a user perspective, I just want to load_audio the way the pretrained model interface expects it. I don't want to have to know that I must call torchaudio.load with channels_first=False, or that I must call the normalizer afterwards. So it's perfect that there is a method for that; as a user, I just need to provide the path to the audio.

From a user perspective, I expect simple behaviour from such a method. As long as the unit test in this PR fails, load_audio has a major bug, and it needs to be fixed; this weird behaviour is not what you expect from this method.

Now to the remote fetching part:
Again, as a user, I would consider this an extra feature. It is nothing an everyday method responsible for loading audio should do. So from a user perspective, I would expect to call a different method for that, or to set an extra parameter.

This could look like this:
Proposal 1

def load_audio(self, path, fetch_from_remote=False, savedir="."):
    if fetch_from_remote:
        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=savedir)
    signal, sr = torchaudio.load(str(path), channels_first=False)
    return self.audio_normalizer(signal, sr)

or a separate method
Proposal 2

def load_audio(self, path):
    signal, sr = torchaudio.load(str(path), channels_first=False)
    return self.audio_normalizer(signal, sr)

def load_remote_audio(self, path, savedir="."):
    source, fl = split_path(path)
    path = fetch(fl, source=source, savedir=savedir)
    return self.load_audio(path)

What do you think?

@anautsch (Collaborator)

Hi @goexle thank you for getting back on this PR!
We are on the same page that it needs to be changed - how is the issue.

The remote fetch is critical to our HuggingFace examples.

from speechbrain.pretrained import EncoderASR

asr_model = EncoderASR.from_hparams(source="speechbrain/asr-wav2vec2-librispeech", savedir="pretrained_models/asr-wav2vec2-librispeech")
asr_model.transcribe_file("speechbrain/asr-wav2vec2-commonvoice-en/example.wav")

see: https://huggingface.co/speechbrain/asr-wav2vec2-librispeech

The file speechbrain/asr-wav2vec2-commonvoice-en/example.wav points to:
https://huggingface.co/speechbrain/asr-wav2vec2-commonvoice-en/blob/main/example.wav

There are some 40+ SpeechBrain repos on HuggingFace.

Your second proposal reads cleaner, as the functions are separated by purpose. When it comes to integrating them, however, the first one might be more suitable, as fetch_from_remote addresses this directly. In the end, redundant boilerplate code is OK when it improves functional safety - which would also speak for your second proposal.

The issues that are now in my head:

  • one design paradigm for SpeechBrain is to have all necessary definitions in one place, i.e., in a big YAML file (or two: one for the recipe & one for the pretrained model)
  • compared to recipes, the pretrained interface is at a rudimentary level and needs more redesign (in general) - it sits at a crossroads between different user categories, incl. industry (and this is a part where it would be great to see more contributions from industry, too)
  • users should interface with the same functions using differently specified YAML files and/or function calls
    -> see Peter's feedback
  • how does it impact the integration throughout all pretrained interfaces? Derived classes of Pretrained might need to follow a specific signature (to be able to make that call between local/remote fetching)

The next issues are the general handling of symlinks and a re-iteration of how HuggingFace is integrated (a bigger next-next topic over the coming months). There, I simply lack the information to make a good call. A solution might be reverted later on. But that's maybe just how it is with open source & continuous integration/development.

How would you like to move forward?

@Adel-Moumen (Collaborator)

Hello @Gastron,

Wondering if you know whether #1817 covers this PR as well? Thanks.

Best,
Adel

@Gastron (Collaborator) commented Jan 18, 2024

> Hello @Gastron,
>
> Wondering if you know if #1817 is covering this PR as well ? Thanks.
>
> Best, Adel

I am pretty sure that PR #1817 did not add explicit FetchSource capability to Pretrained.load_audio. I suggest again that we remove the magic from load_audio (no fetch) - instead, the audio should first be fetched with the fetch function.

Projects: CI/CD (Performance & housekeeping)
7 participants