-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
ENH: io.wavfile: read unseekable files #22208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@hpx7 would you be able to have a look, does this fix your issue? |
Thanks again @nickodell for reviewing #22215, would you have time to have a look at this too? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the general concept of this change, but there are a few details that need to be addressed.
scipy/io/wavfile.py
Outdated
|
||
def flush(self): | ||
# make np.fromfile fail intelligibly | ||
raise io.UnsupportedOperation() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# make np.fromfile fail intelligibly
Can you elaborate? What is this protecting against?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this should have an exception message to clarify why it's being thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've investigated this in a little more detail.
The reason why np.fromfile()
calls flush()
is that it could be reading from a read/write file handle, where changes were just made. While reading, it flushes changes to the file, duplicates the file handle, seeks that handle to the offset, then starts reading. It closes the duplicated file handle. See numpy/numpy#7831 for detail.
This has two implications.
-
First, reading from unseekable streams will never use the
np.fromfile()
fast path, because it callsflush()
. -
Second, even if we added flush support, e.g.
def flush(self): return self.reader.flush()
this would still not allow use of the fast path, because NumPy wants to seek the stream, which this stream does not support. We cannot intercept this seek, because it is done in C. NumPy does this seek even if offset=0, so we cannot avoid it. Link to code
In summary, throwing an exception is the best we can do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for chasing that down. And there are additional seeks that can't be avoided without changing numpy's API, e.g. array_fromfile_binary
in ctors.c
.
if not hasattr(filename, 'read'): | ||
fid.close() | ||
else: | ||
fid.seek(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am frankly mystified as to why this final seek operation was originally here. Most other libraries will not seek to the beginning of a file if you hand them an open file to read - e.g. if you use pd.read_csv()
to read a file, the file position after the function call will be where it stopped reading.
I would suggest removing fid.seek(0)
, and not seeking even if the file supports it.
Thoughts, @j-bowhay ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was added in 47b9012, @nils-werner do you happen to remember?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly it was there before, I just proposed the if
clause. Maybe follow the git blame
to the changes before?
I don't think it needs to be the at all tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seek(0)
doesn't seem to appear prior to 47b9012, maybe it had something to do with StringIO compatibility?
Removing it makes sense to me, but I worry it could break user code that depends on that behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, IIRC that was to fill the buffer with RIFF data, and then rewind it so that the buffer can immediately be used by something that wants to read a WAV file (which contains RIFF data).
The use case was the Audio
tag in Jupiter, where you'd want to export an array to a WAV file-like buffer, and then send it over to the frontend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, that makes sense. I've added a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit dd3568a looks like a nice compromise: it does not swallow exceptions if the underlying stream claims to support seeking, but actually can't. This was my primary concern. It also doesn't change functionality in case applications are relying on the seek(0)
behavior.
I wrote a test script for this PR to check that it works when reading wav data from a pipe, and it does. Example: import scipy
#import gzip
#import io
import subprocess
def checksum_first_channel(res):
sr, a = res
if a.ndim == 2:
chk = a[:, 0].astype('float').sum()
elif a.ndim == 1:
chk = a.astype('float').sum()
else:
raise Exception()
print("chk", chk)
return chk
def read_from_subproc(filename):
sp = subprocess.Popen(
['cat', filename],
stdout=subprocess.PIPE
)
return sp.stdout
a = scipy.io.wavfile.read('../wavtest1-rf64.wav')
assert checksum_first_channel(a) == -122419968.0
a = scipy.io.wavfile.read('../wavtest2.wav')
assert checksum_first_channel(a) == -82594741.0
a = scipy.io.wavfile.read(read_from_subproc('../wavtest1-rf64.wav'))
assert checksum_first_channel(a) == -122419968.0
a = scipy.io.wavfile.read(read_from_subproc('../wavtest2.wav'))
assert checksum_first_channel(a) == -82594741.0 I checked this with PCM and RF64 wav files. |
Thanks for the review! I've pushed changes, and can squash them if you like. |
Yes, please do. |
dd3568a
to
674602a
Compare
Squashed, and split a long string to make the linter happy. |
674602a
to
c41adc2
Compare
Allow scipy.io.wavfile.read() to read non-seekable files. * If passed stream is not seekable, wrap it in a SeekEmulatingReader, which tracks current position and implements tell() to report it. Most seek()s in wavfile.read() simply seek forward from the current position by a fixed amount. To address these: * Support limited SeekEmulatingReader.seek() that emulates seek()s that can be supported by read()ing and discarding the result. Befor returning, read() attempts to seek back to the start of the file. This is nice when possible, but if it isn't we shouldn't blow up. So: * Avoid final fid.seek(0) if underlying stream isn't seekable. The remaining seek()s are in RF64 support, where chunk size is global and stored in the RIFF header rather than in the chunk header. _read_data_chunk() currently seeks back to a fixed offset from the file origin to grab the chunk size. To address this: * Modify RF64 support to parse and store the chunk size in _read_riff_chunk() so that _read_data_chunk() doesn't have to seek() back for it. This applies whether or not reading from a seekable stream. To test this, I added test_streams(), which ensures that wavfile reads the same data no matter whether a path, seekable stream, or unseekable stream is passed to read(). Fixes scipy#11328
c41adc2
to
96d162d
Compare
(Ugh, sorry for the multiple force-pushes, there were stray whitespace changes that I fumbled. The change for the linter looks like this.) |
scipy/io/tests/test_wavfile.py
Outdated
rate1, data1 = wavfile.read(fp1) | ||
rate2, data2 = wavfile.read(Nonseekable(fp2)) | ||
rate3, data3 = wavfile.read(dfname, mmap=False) | ||
assert_array_equal(rate1, rate3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, could you change the new tests to use numpy.testing.assert_equal
as per the recommendations in the numpy docs
Thanks for a great first contribution to SciPy @smammy, please feel free to open prs for anything else you might be interested in working on:) Thanks as ever for the thorough review @nickodell |
Thanks everyone for your help and review! Glad to learn something and get this working. |
Reference issue
Closes gh-11328
What does this implement/fix?
Allow scipy.io.wavfile.read() to read non-seekable files.
Additional information
Allow scipy.io.wavfile.read() to read non-seekable files.
Most seek()s in wavfile.read() simply seek forward from the current position by a fixed amount. To address these:
Befor returning, read() attempts to seek back to the start of the file. This is nice when possible, but if it isn't we shouldn't blow up. So:
The remaining seek()s are in RF64 support, where chunk size is global and stored in the RIFF header rather than in the chunk header. _read_data_chunk() currently seeks back to a fixed offset from the file origin to grab the chunk size. To address this:
To test this, I added test_streams(), which ensures that wavfile reads the same data no matter whether a path, seekable stream, or unseekable stream is passed to read().