
bug fix Pretrained.load_audio #1303

Open
wants to merge 7 commits into base: develop

Conversation

@goexle commented Feb 16, 2022

There is a bug in the function load_audio:

Given the following setup: there are two files in two different folders that have the same name but different content.

If load_audio is called with each of these two files, it always returns the content of the file that was loaded first, no matter what is inside the second file.

This is not the expected behavior. I also do not see the reason for the symbolic link: it creates a lot of symbolic links in the folder from which the script was executed, which is unexpected as well.

Therefore, I propose to remove the symbolic link.

@MClarkTurner

Thank you for this fix! It was incredibly frustrating having the "transcribe_file" function of my pretrained model spit out hundreds of symlinks.

@anautsch (Collaborator) commented Apr 5, 2022

There are similarities with #1352 #1268 #1254 - I'm on it :)

@anautsch anautsch self-requested a review April 5, 2022 08:56
@anautsch anautsch added this to In progress in CI/CD via automation Apr 21, 2022
@anautsch anautsch moved this from In progress to Performance & housekeeping in CI/CD Apr 21, 2022
@mravanelli (Collaborator)

@anautsch, any update on this PR?

@anautsch (Collaborator)

Audio & model loading is a larger complex of issues - related PRs & issues are collected in CI/CD -> Performance & housekeeping, among others.

Related PRs: #1254 #1268 #1303
Related issues: #1070 #1177 #1253 #1278 #1291 #1341 #1352 #1358

This discussion also touches on how we want to handle pretrained models, mini-datasets, etc.
Data housekeeping and a recipe testing strategy need to be considered here as well. This is more of a systemic question than a bunch of single special cases.

Right now, this topic complex is triaged behind pushing the v0.6 PRs (resolving the 1-year-old CTC PR).

@goexle (Author) commented May 19, 2022

@anautsch Thanks for looking into this merge request.
But I don't understand why this MR is about model loading etc.

As I described above, there is a bug in the function load_audio.

Let's assume you have two files with the same name, located in different folders.
First, you call load_audio with the first file. Then, you call load_audio with the second file. The second call returns the content of the first file because of the symbolic link, so you get the wrong content for the second file.

@anautsch (Collaborator)

Hi @goexle yes, from what I understand, your point is:

  • using local audios only causes undesired effects, because HuggingFace (HF) should not matter then
  • yet, HF downloads are triggered, so remove them

This effect occurs when local paths are corrupted - and handling invalid local paths this way is improper.
The core point of your remark is absolutely valid!
Yet, there is also the need to provide HF download functionality - which is attempted here through the same function interface of SpeechBrain.

Your PR proposes to drop this call: `path = fetch(fl, source=source, savedir=savedir)` - let's take a look at this fetch:

savedir = pathlib.Path(savedir)
savedir.mkdir(parents=True, exist_ok=True)
sourcefile = f"{source}/{filename}"
destination = savedir / save_filename
if destination.exists() and not overwrite:
    MSG = f"Fetch {filename}: Using existing file/symlink in {str(destination)}."
    logger.info(MSG)
    return destination
if str(source).startswith("http:") or str(source).startswith("https:"):
    [...]
elif pathlib.Path(source).is_dir():
    # Interpret source as local directory path
    # Just symlink
    sourcepath = pathlib.Path(sourcefile).absolute()
    MSG = f"Fetch {filename}: Linking to local file in {str(sourcepath)}."
    logger.info(MSG)
    _missing_ok_unlink(destination)
    destination.symlink_to(sourcepath)
else:
    [...]
return destination

The elided parts create symlinks to downloaded HF data. If one gives a local path, a symlink to that local path is created. If the local path is not valid from the perspective of the Python program and from where it is run, an HF download is triggered instead. Another form of warning message would be more appropriate, indeed. Yet, since this fetch does what it should do for local files, but also covers HF downloads (an "if the local path is invalid, then it must be a download" strategy), simply removing it deactivates other functions that are desired in SpeechBrain.

It's a bigger topic because the reason for this particular issue is part of a larger whole - and how to approach that is on our todo list (after the CTC PR, which has been open for about a year by now).

@anautsch (Collaborator) commented May 19, 2022

Your argument on the two files is about the filename itself, right?

  • first: abc/audio.wav
  • second: def/audio.wav

Then, the symlink would be simply audio.wav, right?
Which always references the first audio, never the second. That only works if all audios have unique filenames - across all datasets. A pretty strong assumption...

This fetch logic only behaves well if source=destination (via destination.exists() -> return destination) - otherwise, it creates symlinks that are unique by filename only, not by path.

@Gastron @TParcollet why destination.symlink_to(sourcepath) and not simply destination = sourcepath ?
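The collision can be reproduced with a minimal standalone sketch (hypothetical file names and helper; this mimics only the local-directory branch of fetch, not SpeechBrain itself):

```python
# Minimal reproduction of the symlink collision (hypothetical file names;
# fetch_like mimics only the local-directory branch of fetch).
import pathlib
import tempfile

def fetch_like(filename, source, savedir):
    savedir = pathlib.Path(savedir)
    savedir.mkdir(parents=True, exist_ok=True)
    destination = savedir / filename      # keyed by filename only!
    if destination.exists():              # second call short-circuits here
        return destination
    sourcepath = (pathlib.Path(source) / filename).absolute()
    destination.symlink_to(sourcepath)
    return destination

root = pathlib.Path(tempfile.mkdtemp())
for sub, content in [("abc", "first"), ("def", "second")]:
    (root / sub).mkdir()
    (root / sub / "audio.wav").write_text(content)

text_a = fetch_like("audio.wav", root / "abc", root / "save").read_text()
text_b = fetch_like("audio.wav", root / "def", root / "save").read_text()
print(text_a, text_b)  # first first  <- both resolve to abc/audio.wav
```

On the second call, destination.exists() is already true, so the stale symlink to abc/audio.wav is returned unchanged.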

@goexle (Author) commented May 19, 2022

@anautsch thanks for explaining, I understand it's more complex. So you want to be able to fetch audio files from HuggingFace with the same load_audio method?

> Your argument on the two files is about the filename itself, right?
>
> * first: `abc/audio.wav`
> * second: `def/audio.wav`
>
> Then, the symlink would be simply audio.wav, right? Which always references the first audio, never the second. That only works if all audios are unique in their filename - across all datasets. Pretty strong assumption...
>
> This fetch logic only handles well if source=destination through destination.exists() -> return destination - otherwise, it attempts to create symlinks which are not unique identifiers by path - but by filename.
>
> @Gastron @TParcollet why destination.symlink_to(sourcepath) and not simply destination = sourcepath ?

Yes, exactly, that is the problem. The symlink name is just the filename, which is not unique.

@anautsch (Collaborator)

Trying to understand now what _missing_ok_unlink(destination) is doing. My gut feeling is still that this is about invalid filenames in the first place.

@anautsch (Collaborator)

@goexle are you using Windows by any chance?

@goexle (Author) commented May 19, 2022

@anautsch no, I am using Ubuntu

@anautsch (Collaborator)

Then, the previously created symlink should be simply dropped:


No idea, though, how this plays out with parallel processing (thinking of possible sources of error).

@goexle (Author) commented May 19, 2022

Good point, thanks for bringing it up.
Strange that this does not seem to work.
I will try to debug it once I find time.

@goexle (Author) commented May 24, 2022

@anautsch I debugged into it.
It is line

destination.symlink_to(sourcepath)

which causes the symlink to be created. The method path.unlink() you mentioned is called before, so it does not unlink before it exits.

I added a unit test. With this bugfix, it is passing. Without, it will fail.

audio2 = verification.load_audio(
    "samples/voxceleb_samples/wav/id10001/1zcIwhmdeo4/00001.wav"
)
assert not torch.equal(audio1, audio2)
Review comment (Collaborator):

Please add a final empty line.
Run ./tests/.run-linters.sh to see if there's more to be done regarding formatting.

    """
    source, fl = split_path(path)
    path = fetch(fl, source=source, savedir=savedir)
Review comment (Collaborator):

Removing it will avoid this problem, but it also deactivates other features ;-)
The bug fix needs to happen elsewhere, as this fix will break other things (e.g. resolving HuggingFace URLs).

@anautsch (Collaborator)

> I debugged into it. It is line
>
> destination.symlink_to(sourcepath)

Thanks! Then, we need to fix that line.

> which causes the symlink to be created. The method path.unlink() you mentioned is called before, so it does not unlink before it exits.

Wondering if path.unlink() is buggy, which I can't believe - could something else be going on? Like permissions, or the file still being in use (from the machine's point of view)?

> I added a unit test. With this bugfix, it is passing. Without, it will fail.

As mentioned, this fix creates other issues, e.g. resolving other links from HuggingFace. It's better to target the symlink issue directly.

def test_load_audio(tmpdir):
    from speechbrain.pretrained.interfaces import Pretrained

    verification = Pretrained.from_hparams(
Review comment (Collaborator):

Still wondering about loading a 100+ MB pretrained model here. I deleted my earlier comments about using a smaller pretrained model. All that is relevant here are these lines:

        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=savedir)
        signal, sr = torchaudio.load(path, channels_first=False)

That's what needs to be tested for the fetch in pretrained interfaces.

@anautsch (Collaborator)

Sorry for the back & forth - I see that you are updating while I'm changing my mind...

@anautsch (Collaborator)

@goexle trying to make up for my flip-flopping mind...

import torch, torchaudio
from speechbrain.pretrained.fetching import fetch
from speechbrain.utils.data_utils import split_path


def test_load_audio(tmpdir):
    def load_audio(path):
        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=tmpdir)
        signal, sr = torchaudio.load(path, channels_first=False)
        return signal

    audio1 = load_audio(
        "samples/voxceleb_samples/wav/id10002/xTV-jFAUKcw/00001.wav"
    )
    audio2 = load_audio(
        "samples/voxceleb_samples/wav/id10001/1zcIwhmdeo4/00001.wav"
    )
    assert not torch.equal(audio1, audio2)

Please let me know if this test gets to the issue. Trying to reproduce the error on my end.

@anautsch (Collaborator)

The calling paths might be off ("samples" -> "../../samples"), depending on from where the test is called.

The second audio never reaches the branch that actually unlinks the symlink:

    if destination.exists() and not overwrite:
        MSG = f"Fetch {filename}: Using existing file/symlink in {str(destination)}."

it returns right there, without validating whether the existing symlink points to the same target or not.
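One possible direction (a sketch only, not SpeechBrain's actual code; fetch_checked is a hypothetical helper): before reusing an existing destination, check whether it is a symlink that resolves to a different source, and relink in that case.

```python
# Sketch of a guarded fetch: reuse an existing destination only if it is
# a real file or a symlink that still resolves to the requested source.
# Illustrative only - not SpeechBrain's actual implementation.
import pathlib
import tempfile

def fetch_checked(filename, source, savedir):
    savedir = pathlib.Path(savedir)
    savedir.mkdir(parents=True, exist_ok=True)
    destination = savedir / filename
    sourcepath = (pathlib.Path(source) / filename).absolute()
    if destination.exists():
        if not destination.is_symlink() or destination.resolve() == sourcepath.resolve():
            return destination        # safe to reuse
        destination.unlink()          # stale symlink: drop and relink below
    destination.symlink_to(sourcepath)
    return destination

root = pathlib.Path(tempfile.mkdtemp())
for sub, content in [("abc", "first"), ("def", "second")]:
    (root / sub).mkdir()
    (root / sub / "audio.wav").write_text(content)

text_a = fetch_checked("audio.wav", root / "abc", root / "save").read_text()
text_b = fetch_checked("audio.wav", root / "def", root / "save").read_text()
print(text_a, text_b)  # first second
```

The open question from above remains: with distributed processing, two nodes could race on the unlink/relink, so this guard alone is not the full answer.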

@anautsch (Collaborator)

The pretrained fetch function has an overwrite parameter,

which is not exposed through the pretrained interface:

path = fetch(fl, source=source, savedir=savedir)

So the test I wrote above would not capture your valid complaint either...

There needs to be some extension from

    def load_audio(self, path, savedir="."):
        """Load an audio file with this model's input spec

        When using a speech model, it is important to use the same type of data,
        as was used to train the model. This means for example using the same
        sampling rate and number of channels. It is, however, possible to
        convert a file from a higher sampling rate to a lower one (downsampling).
        Similarly, it is simple to downmix a stereo file to mono.
        The path can be a local path, a web url, or a link to a huggingface repo.
        """
        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=savedir)
        signal, sr = torchaudio.load(path, channels_first=False)
        return self.audio_normalizer(signal, sr)

to

    def load_audio(self, path, savedir=".", overwrite=False):
        """Load an audio file with this model's input spec

        When using a speech model, it is important to use the same type of data,
        as was used to train the model. This means for example using the same
        sampling rate and number of channels. It is, however, possible to
        convert a file from a higher sampling rate to a lower one (downsampling).
        Similarly, it is simple to downmix a stereo file to mono.
        The path can be a local path, a web url, or a link to a huggingface repo.
        """
        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=savedir, overwrite=overwrite)
        signal, sr = torchaudio.load(path, channels_first=False)
        return self.audio_normalizer(signal, sr)

likewise for the other fetch parameters: save_filename=None, use_auth_token=False

Yet, none of the derived interfaces would pick up that change, since they only do a

waveform = self.load_audio(path)

What would be a good way to fix this adequately...
One way is to extend the function parameters and make that change everywhere; another could be to put it into the hparams file, so it becomes a field of any pretrained interface instead of a function argument. I don't yet see a clear picture of which would be the better design choice. What's your take?

@anautsch (Collaborator)

@Gastron I've no clue how to best approach this PR. The pretrained interface needs to change - but how?
Some users will want to reuse the same symlink over and over - others will want to overwrite it. The rationale for keeping them is to keep an active symlink intact. Should the overwrite parameter be checked before the if-else block, when there is a symlink whose origin differs from the new source path? And how do we handle distributed learning, where one node might not yet have processed a symlinked audio while another node already wants to change the symlink?

Aside from the bigger refactoring, what would be a working solution for @goexle?

@Gastron (Collaborator) commented May 25, 2022

I would be happy to remove the magic audio fetching in load_audio altogether. I think the convenience is only there for example code - not really for actual usage.

However, I am sure we have a bunch of example code where this convenience is used, and the change will break those examples. The modification we would need in the examples is of course minimal - fetching the example file and making sure we import fetch, something like:

from speechbrain.pretrained.interfaces import EncoderDecoderASR, fetch
...
path = fetch("the.examples.file.name", source="the/example's/source", savedir=".")
asr.transcribe_file(path)

@anautsch anautsch changed the base branch from develop to develop-v2 June 1, 2022 15:41
@anautsch (Collaborator) commented Jun 3, 2022

@goexle How about the following?

Let's say we move over the next months to a v0.6 major revision, and through that we would/could refactor the pretrained interfaces as well, especially the fetch function. As mentioned above, this is not the only PR/issue criticising this. We might also revise how/where YAML files are used. In YAML files, one can instantiate objects - see the tokenizer template for comparison.

tokenizer: !name:speechbrain.tokenizers.SentencePiece.SentencePiece
   model_dir: !ref <output_folder>
   vocab_size: !ref <token_output>
   annotation_train: !ref <train_annotation>
   annotation_read: !ref <annotation_read>
   model_type: !ref <token_type> # ["unigram", "bpe", "char"]
   character_coverage: !ref <character_coverage>
   annotation_list_to_check: [!ref <train_annotation>, !ref <valid_annotation>]
   annotation_format: json

How about a partial function definition through a pretrained model yaml file?

I'm thinking about some

pretrained_asr: !new: speechbrain.pretrained.Interfaces.Pretrained.EncoderASR
   [...]  # some other parameters
   # custom data I/O
   load_audio: !partial: speechbrain.pretrained.Interfaces.Pretrained.load_audio
      savedir: "./symlink_stash"  # we need to refactor how we handle symlinks, too
      overwrite: False
      save_filename: None
      use_auth_token: False
      no_symlinks: True  # a suggestion  # really wondering why we need symlinks at all, third-party?
      local_paths_only: True  # a suggestion

which would pre-define most parameters of load_audio but the one that is mandatory (the audio path).

I just copied some parameters from the fetch function that could be put into the load_audio interface with default values. This could be one way; another would be to also partially outline the speechbrain.pretrained.fetching.fetch function in the YAML file.

Yet, I'd be in favor of partially defining load_audio, which is a member of the Pretrained class from which the other pretrained interfaces inherit, so it can be addressed with the getattr and setattr builtins; the fetch function, by contrast, is an import (it should be possible to play with it too, yet I doubt that would yield an elegant solution).

@pplantinga is it possible to make an extension to HyperPyYAML that borrows from functools and makes partial function initialisations available to classes which inherit this function from a parent interface? This could be done with wrappers if needs be. Just wondering if this is a road to go over the next weeks. Does the !new already provide what I suggest with !partial?

Ultimately, users control how the SB interfaces are used through the YAML files (core functions should be flexible enough to facilitate a plethora of use cases). YAML files could be a better way to support both. That would be better than ending up having to maintain lots of different interfaces. (I would just add a flag for whether or not to try fetching from online at all.)

@Gastron why do we need symlinks in the first place? Are there some third-party constraints?

(Being verbose on a particular problem to better see the implications of possible solutions to find a solid re-design.)

@pplantinga (Collaborator)

@pplantinga is it possible to make an extension to HyperPyYAML that borrows from functools and makes partial function initialisations available to classes which inherit this function from a parent interface? This could be done with wrappers if needs be. Just wondering if this is a road to go over the next weeks. Does the !new already provide what I suggest with !partial?

I believe what you are looking for is the !name: tag, which uses functools.partial on the backend
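For illustration, a plain-Python sketch of what the !name: tag does under the hood (load_audio here is a hypothetical stand-in function, not the actual SpeechBrain method):

```python
import functools

# Hypothetical stand-in for Pretrained.load_audio; only `path` stays mandatory.
def load_audio(path, savedir=".", overwrite=False, use_auth_token=False):
    return f"load {path} (savedir={savedir}, overwrite={overwrite})"

# Roughly what `!name:` produces: a partial with keyword arguments pre-bound,
# mirroring the YAML sketch above.
load_audio_partial = functools.partial(
    load_audio, savedir="./symlink_stash", overwrite=False
)

result = load_audio_partial("audio.wav")
print(result)  # load audio.wav (savedir=./symlink_stash, overwrite=False)
```

The caller then supplies only the mandatory audio path; everything else comes from the YAML-side defaults.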

@goexle (Author) commented Jun 24, 2022

@anautsch let me try to argue from a different perspective why the current load_audio method should be changed:

From a user perspective, I just want to load_audio the way the pretrained model interface expects it. I don't want to have to know that I must call torchaudio.load with channels_first=False, or that I must call the normalizer afterwards. So it's perfect that there is a method for that; as a user, I just need to provide the path to the audio.

From a user perspective, I expect simple behaviour from such a method. As long as the unit test in this PR fails, load_audio has a major bug, and it needs to be fixed; this weird behaviour is not what you expect from this method.

Now to the remote fetching part:
Again, as a user, I would consider this an extra feature. It is nothing an everyday method responsible for loading audio should do. So from a user perspective, I would expect to call a different method for that, or to set an extra parameter.

This could look like this:
Proposal 1

def load_audio(self, path, fetch_from_remote=False, savedir="."):
    if fetch_from_remote:
        source, fl = split_path(path)
        path = fetch(fl, source=source, savedir=savedir)
    signal, sr = torchaudio.load(str(path), channels_first=False)
    return self.audio_normalizer(signal, sr)

or a separate method
Proposal 2

def load_audio(self, path):
    signal, sr = torchaudio.load(str(path), channels_first=False)
    return self.audio_normalizer(signal, sr)

def load_remote_audio(self, path, savedir="."):
    source, fl = split_path(path)
    path = fetch(fl, source=source, savedir=savedir)
    return self.load_audio(path)

What do you think?

@anautsch (Collaborator)

Hi @goexle thank you for getting back on this PR!
We are on the same page that it needs to be changed - how is the issue.

The remote fetch is critical to our HuggingFace examples.

from speechbrain.pretrained import EncoderASR

asr_model = EncoderASR.from_hparams(source="speechbrain/asr-wav2vec2-librispeech", savedir="pretrained_models/asr-wav2vec2-librispeech")
asr_model.transcribe_file("speechbrain/asr-wav2vec2-commonvoice-en/example.wav")

see: https://huggingface.co/speechbrain/asr-wav2vec2-librispeech

The file speechbrain/asr-wav2vec2-commonvoice-en/example.wav points to:
https://huggingface.co/speechbrain/asr-wav2vec2-commonvoice-en/blob/main/example.wav

There are some 40+ SpeechBrain repos on HuggingFace.

Your second proposal reads cleaner, as the functions are separated by purpose. When it comes to integrating them, however, the first one might be more suitable, as fetch_from_remote addresses this directly. In the end, redundant boilerplate code is OK when it improves functional safety - which would also speak for your second proposal.

The issues that are now in my head:

  • one design paradigm for SpeechBrain is to have all necessary definitions in one place, i.e., in a big YAML file (or two: one for the recipe & one for the pretrained model)
  • compared to recipes, the pretrained interface is at a rudimentary level and needs more redesign (in general) - it sits at a crossroads between different user categories, incl. industry (and this is a part where it would be great to see more contributions from industry, too)
  • users should interface with the same functions using differently specified YAML files and/or function calls
    -> see Peter's feedback
  • how does it impact the integration throughout all pretrained interfaces? Derived classes of Pretrained might need to follow a specific signature (to be able to make that call between local/remote fetching)

The next issues are the general handling of symlinks and a re-iteration of how HuggingFace is integrated (a bigger next-next topic over the coming months). There, I simply lack the information to make a good call. A solution might be reverted later on. But that's maybe just how it is with open source & continuous integration/development.

How would you like to move forward?

@Adel-Moumen (Collaborator)

Hello @Gastron,

Wondering if you know whether #1817 covers this PR as well? Thanks.

Best,
Adel

@Gastron (Collaborator) commented Jan 18, 2024

> Hello @Gastron,
>
> Wondering if you know if #1817 is covering this PR as well ? Thanks.
>
> Best, Adel

I am pretty sure that PR #1817 did not add explicit FetchSource capability to Pretrained.load_audio. I suggest again that we remove the magic from load_audio (no fetch) - instead, the audio should first be fetched with the fetch function.

Projects: CI/CD (Performance & housekeeping)
7 participants