Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support using same file path for input & output #27

Closed
justinsalamon opened this issue Oct 27, 2016 · 11 comments
Closed

Support using same file path for input & output #27

justinsalamon opened this issue Oct 27, 2016 · 11 comments

Comments

@justinsalamon
Copy link
Contributor

Sox doesn't support using the same file as both input and output - doing this will result in an empty, invalid audio file. While this is sox behavior and not pysox, it would be nice if pysox took care of this behind the scenes. Right now the user needs to worry about this logic themselves, e.g. like this:

import tempfile
import shutil
from scaper.util import _close_temp_files

audio_infile = '/Users/justin/Downloads/trimtest.wav'
audio_outfile = '/Users/justin/Downloads/trimtest.wav'
start_time = 2
end_time = 3

tfm = sox.Transformer()
tfm.trim(start_time, end_time)
if audio_outfile != audio_infile:
    tfm.build(audio_infile, audio_outfile)
else:
    # must use temp file in order to save to same file
    tmpfiles = []
    with _close_temp_files(tmpfiles):
        # Create tmp file
        tmpfiles.append(
            tempfile.NamedTemporaryFile(
                suffix='.wav', delete=True))
        # Save trimmed result to temp file
        tfm.build(audio_infile, tmpfiles[-1].name)
        # Copy result back to original file
        shutil.copyfile(tmpfiles[-1].name, audio_outfile)

Pysox does issue a warning when a file is about to be overwritten, which is even more confusing under this scenario since the user (who might be unfamiliar with the quirks of sox) has no reason to think that the overwritten file will be invalid.

@rabitt
Copy link
Collaborator

rabitt commented Oct 27, 2016

I see your point, but my personal preference is to avoid having input_file = output_file, because it's harder to find and reproduce bugs. Why not just do:

audio_infile = '/Users/justin/Downloads/trimtest.wav'
audio_outfile = '/Users/justin/Downloads/trimtest_out.wav'

and if you no longer care about audio_infile after processing, you can os.remove(audio_infile).

@rabitt
Copy link
Collaborator

rabitt commented Oct 27, 2016

The other option is to put all your input files and output files in separate directories to allow them to have the same basename. This is what I usually do, fwiw.

@justinsalamon
Copy link
Contributor Author

I see your point, but my personal preference is to avoid having input_file = output_file, because it's harder to find and reproduce bugs. Why not just do:

audio_infile = '/Users/justin/Downloads/trimtest.wav'
audio_outfile = '/Users/justin/Downloads/trimtest_out.wav'

and if you no longer care about audio_infile after processing, you can os.remove(audio_infile).

Ehm, well, I'm of course aware that I can do this, but that's not my use case ;) My use case requires overwriting the same file. Saving the output file into a separate directory is exactly what the snippet I included in my first post does (and then copies it over and overwrites the input file). But I feel like that's besides the point - regardless of whether one prefers to keep their input/output files separate or not, right now the user can provide the same filepath for input/output and pysox will spit out an invalid audio file without any warning, which is undocumented behavior and, I think, undesirable.

@ejhumphrey
Copy link
Contributor

this is an easy enough thing to do with tempfiles: https://github.com/ejhumphrey/claudio/blob/master/claudio/sox.py#L131

just make sure if you use mkstmp, you explicitly close the file ID. this is one of those cases where python3 + context managers is way shinier than py2: https://docs.python.org/3/library/tempfile.html#examples

@ejhumphrey
Copy link
Contributor

forgot to mention ... would be an easy enough PR 😉

@rabitt
Copy link
Collaborator

rabitt commented Oct 27, 2016

@justinsalamon totally agree that pysox should throw an error if input_filepath = output_filepath, rather than creating an empty file. Didn't realize that was the main issue you were bringing up - I thought you wanted to be able to allow the paths to be the same.

@ejhumphrey
Copy link
Contributor

fwiw, I am in favor of a call signature like do_a_thing(input_file="my_file.wav", output_file=None) for over-writing a file in place.

@justinsalamon
Copy link
Contributor Author

justinsalamon commented Oct 27, 2016

@ejhumphrey I know I can use tempfiles, it's in my first comment on this issue ;)

@justinsalamon totally agree that pysox should throw an error if input_filepath = output_filepath, rather than creating an empty file. Didn't realize that was the main issue you were bringing up - I thought you wanted to be able to allow the paths to be the same.

@rabitt yes, the main issue I was bringing up is that the creation of an empty file is bad. However, my preferred solution would be for pysox to use tempfiles (as per my example) to basically yes support using the same path for input/output. An alternative solution is what you propose (raise an error), which would be better than the current behavior, though I don't see why you wouldn't want to support overwriting the input file - there are plenty of tools out there (e.g. shutil, librosa) that happily overwrite files and it's a fairly common use case.

@ejhumphrey
Copy link
Contributor

@justinsalamon you know i don't know how to read

@rabitt
Copy link
Collaborator

rabitt commented Oct 27, 2016

@justinsalamon

the creation of an empty file is bad

agreed

However, my preferred solution would be for pysox to use tempfiles (as per my example) to basically yes support using the same path for input/output. An alternative solution is what you propose (raise an error), which would be better than the current behavior, though I don't see why you wouldn't want to support overwriting the input file - there are plenty of tools out there (e.g. shutil, librosa) that happily overwrite files and it's a fairly common use case.

In a library like this the main paradigm is creating outputs by modifying inputs. Other types of libraries might do something like "run process, create a new text file" and allow you to overwrite said text file. In this case, you can still "reproduce" the results because the original information is preserved. However in this kind of library, if you allow the input file to be overwritten, if you run the same process twice you get different results. That's bad for reproducibility, which is why I'm in favor of the "put input and output files in different directories" approach.

Because of ^^rant, I prefer to throw an error, rather than use the tempfile approach.

@justinsalamon
Copy link
Contributor Author

However in this kind of library, if you allow the input file to be overwritten, if you run the same process twice you get different results. That's bad for reproducibility, which is why I'm in favor of the "put input and output files in different directories" approach.

Well, you could argue that reproducibility means being able to repeat an experiment as performed and obtain the same results. If you start with the same dataset, then as long as the codebase is the same, you will obtain the same results, even if part of the process means overwriting some of the data. So I don't necessarily see this an an issue for reproducibility.

The bigger point though, I think, is that you're enforcing a certain behavior on the user when I think that decision should be left to the user. Of course you don't want to support the user doing something that's blatantly wrong, but I'm not sure overwriting files falls into that bucket. As a concrete example, for scaper I'm generating new audio files, but because pysox always involves writing to disk, I end up performing multiple steps of file io all to produce a single final result. There's no real reason for me to save each intermediate step to a separate file nor keep it on the file system once the process is over. So basically if pysox doesn't support using the same file for input/output, I'll just have to wrap it in code that implements this logic.

Anyway, it's your call and I won't insist futher, just wanted to put my 2c out there :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants