ENH: io.wavfile: read unseekable files #22208

smammy · 2024-12-28T16:13:51Z

Reference issue

Closes gh-11328

What does this implement/fix?

Allow scipy.io.wavfile.read() to read non-seekable files.

Additional information

Allow scipy.io.wavfile.read() to read non-seekable files.

If passed stream is not seekable, wrap it in a SeekEmulatingReader, which tracks current position and implements tell() to report it.

Most seek()s in wavfile.read() simply seek forward from the current position by a fixed amount. To address these:

Support limited SeekEmulatingReader.seek() that emulates seek()s that can be supported by read()ing and discarding the result.
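To make the idea concrete, here is a minimal sketch of a wrapper that tracks position and emulates forward seeks by reading and discarding. It is not the code from this PR: the class name `ForwardSeekReader`, the 64 KiB discard chunk size, and the choice to stop silently at EOF are all assumptions made for illustration.

```python
import io
import os


class ForwardSeekReader:
    """Hypothetical stand-in for SeekEmulatingReader (not SciPy's code)."""

    def __init__(self, reader):
        self.reader = reader
        self.pos = 0  # bytes consumed so far; this is what tell() reports

    def read(self, size=-1):
        data = self.reader.read(size)
        self.pos += len(data)
        return data

    def tell(self):
        return self.pos

    def seek(self, offset, whence=os.SEEK_SET):
        # Translate the request into an absolute target position.
        if whence == os.SEEK_SET:
            target = offset
        elif whence == os.SEEK_CUR:
            target = self.pos + offset
        else:
            raise io.UnsupportedOperation("cannot seek relative to end of stream")
        if target < self.pos:
            raise io.UnsupportedOperation("cannot seek backwards on this stream")
        # Emulate the forward seek by reading and discarding bytes.
        remaining = target - self.pos
        while remaining:
            chunk = self.read(min(remaining, 65536))
            if not chunk:
                break  # reached EOF before the target (assumed behaviour)
            remaining -= len(chunk)
        return self.pos
```

Only `read()` is ever called on the wrapped object, so the same wrapper works for pipes, sockets, or decompression streams.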

Before returning, read() attempts to seek back to the start of the file. This is nice when possible, but if it isn't we shouldn't blow up. So:

Avoid final fid.seek(0) if underlying stream isn't seekable.

The remaining seek()s are in RF64 support, where chunk size is global and stored in the RIFF header rather than in the chunk header. _read_data_chunk() currently seeks back to a fixed offset from the file origin to grab the chunk size. To address this:

Modify RF64 support to parse and store the chunk size in _read_riff_chunk() so that _read_data_chunk() doesn't have to seek() back for it. This applies whether or not reading from a seekable stream.
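As a rough illustration of reading the size in order rather than seeking back for it, the sketch below parses an RF64 header sequentially. The function name `read_rf64_header` is hypothetical, and the layout used (a `ds64` chunk immediately after the `RF64`/`WAVE` header, holding three little-endian 64-bit sizes) is my understanding of the RF64 format, not something taken from this PR's diff.

```python
import io
import struct


def read_rf64_header(fid):
    """Hypothetical helper: return (riff_size, data_size) from an RF64 header."""
    # 'RF64', a 32-bit placeholder size, then the 'WAVE' form type.
    chunk_id, _placeholder, form = struct.unpack("<4sI4s", fid.read(12))
    if chunk_id != b"RF64" or form != b"WAVE":
        raise ValueError("not an RF64 WAVE stream")
    # The 'ds64' chunk carries the real 64-bit sizes.
    ds64_id, ds64_size = struct.unpack("<4sI", fid.read(8))
    if ds64_id != b"ds64":
        raise ValueError("expected a ds64 chunk right after the RF64 header")
    riff_size, data_size, _sample_count = struct.unpack("<QQQ", fid.read(24))
    # Consume the rest of the ds64 chunk by reading, so no seek() is needed.
    fid.read(max(ds64_size - 24, 0))
    return riff_size, data_size
```

With the data size captured here, a later data-chunk reader can take it as an argument instead of seeking to a fixed offset from the file origin.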

To test this, I added test_streams(), which ensures that wavfile reads the same data no matter whether a path, seekable stream, or unseekable stream is passed to read().

j-bowhay · 2024-12-28T22:22:22Z

@hpx7 would you be able to have a look, does this fix your issue?

j-bowhay · 2024-12-31T17:45:20Z

Thanks again @nickodell for reviewing #22215, would you have time to have a look at this too?

nickodell

I like the general concept of this change, but there are a few details that need to be addressed.

scipy/io/tests/test_wavfile.py

scipy/io/wavfile.py

nickodell · 2024-12-31T19:15:26Z

scipy/io/wavfile.py

+
+    def flush(self):
+        # make np.fromfile fail intelligibly
+        raise io.UnsupportedOperation()


# make np.fromfile fail intelligibly

Can you elaborate? What is this protecting against?

Also, this should have an exception message to clarify why it's being thrown.

I've investigated this in a little more detail.

The reason why np.fromfile() calls flush() is that it could be reading from a read/write file handle, where changes were just made. While reading, it flushes changes to the file, duplicates the file handle, seeks that handle to the offset, then starts reading. It closes the duplicated file handle. See numpy/numpy#7831 for detail.

This has two implications.

First, reading from unseekable streams will never use the np.fromfile() fast path, because it calls flush().

Second, even if we added flush support, e.g.

def flush(self):
    return self.reader.flush()

this would still not allow use of the fast path, because NumPy wants to seek the stream, which this stream does not support. We cannot intercept this seek, because it is done in C. NumPy does this seek even if offset=0, so we cannot avoid it. Link to code

In summary, throwing an exception is the best we can do here.

Thanks for chasing that down. And there are additional seeks that can't be avoided without changing numpy's API, e.g. array_fromfile_binary in ctors.c.

nickodell · 2024-12-31T19:16:25Z

scipy/io/wavfile.py

        if not hasattr(filename, 'read'):
            fid.close()
        else:
-            fid.seek(0)


I am frankly mystified as to why this final seek operation was originally here. Most other libraries will not seek to the beginning of a file if you hand them an open file to read - e.g. if you use pd.read_csv() to read a file, the file position after the function call will be where it stopped reading.

I would suggest removing fid.seek(0), and not seeking even if the file supports it.

Thoughts, @j-bowhay ?

It was added in 47b9012, @nils-werner do you happen to remember?

If I remember correctly it was there before, I just proposed the if clause. Maybe follow the git blame to the changes before?

I don't think it needs to be there at all tbh.

seek(0) doesn't seem to appear prior to 47b9012, maybe it had something to do with StringIO compatibility?

Removing it makes sense to me, but I worry it could break user code that depends on that behavior.

Ah right, IIRC that was to fill the buffer with RIFF data, and then rewind it so that the buffer can immediately be used by something that wants to read a WAV file (which contains RIFF data).

The use case was the Audio tag in Jupyter, where you'd want to export an array to a WAV file-like buffer, and then send it over to the frontend.

Ah ok, that makes sense. I've added a comment.

Commit dd3568a looks like a nice compromise: it does not swallow exceptions if the underlying stream claims to support seeking, but actually can't. This was my primary concern. It also doesn't change functionality in case applications are relying on the seek(0) behavior.
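The compromise described here can be sketched roughly as follows. This is an approximation for illustration, not the code in dd3568a; the helper name `rewind_if_seekable` is hypothetical.

```python
import io


def rewind_if_seekable(fid):
    """Hypothetical helper: rewind only when the stream says it can seek."""
    if fid.seekable():
        # Deliberately no try/except: if a stream claims to be seekable but
        # seek() fails, the error propagates to the caller.
        fid.seek(0)
```

Seekable buffers such as `io.BytesIO` are still rewound, which keeps the existing behaviour for callers that rely on it, while pipes are simply left where reading stopped.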

nickodell · 2024-12-31T19:23:39Z

I wrote a test script for this PR to check that it works when reading wav data from a pipe, and it does.

Example:

import subprocess

import scipy.io.wavfile

def checksum_first_channel(res):
    sr, a = res
    if a.ndim == 2:
        chk = a[:, 0].astype('float').sum()
    elif a.ndim == 1:
        chk = a.astype('float').sum()
    else:
        raise ValueError(f"expected 1-D or 2-D sample data, got {a.ndim}-D")
    print("chk", chk)
    return chk

def read_from_subproc(filename):
    sp = subprocess.Popen(
        ['cat', filename],
        stdout=subprocess.PIPE
    )
    return sp.stdout

a = scipy.io.wavfile.read('../wavtest1-rf64.wav')
assert checksum_first_channel(a) == -122419968.0
a = scipy.io.wavfile.read('../wavtest2.wav')
assert checksum_first_channel(a) == -82594741.0

a = scipy.io.wavfile.read(read_from_subproc('../wavtest1-rf64.wav'))
assert checksum_first_channel(a) == -122419968.0

a = scipy.io.wavfile.read(read_from_subproc('../wavtest2.wav'))
assert checksum_first_channel(a) == -82594741.0

I checked this with PCM and RF64 wav files.

smammy · 2025-01-01T20:12:09Z

Thanks for the review! I've pushed changes, and can squash them if you like.

nickodell · 2025-01-01T23:51:51Z

Thanks for the review! I've pushed changes, and can squash them if you like.

Yes, please do.

smammy · 2025-01-02T18:11:21Z

Squashed, and split a long string to make the linter happy.

Allow scipy.io.wavfile.read() to read non-seekable files.

* If passed stream is not seekable, wrap it in a SeekEmulatingReader, which tracks current position and implements tell() to report it.

Most seek()s in wavfile.read() simply seek forward from the current position by a fixed amount. To address these:

* Support limited SeekEmulatingReader.seek() that emulates seek()s that can be supported by read()ing and discarding the result.

Before returning, read() attempts to seek back to the start of the file. This is nice when possible, but if it isn't we shouldn't blow up. So:

* Avoid final fid.seek(0) if underlying stream isn't seekable.

The remaining seek()s are in RF64 support, where chunk size is global and stored in the RIFF header rather than in the chunk header. _read_data_chunk() currently seeks back to a fixed offset from the file origin to grab the chunk size. To address this:

* Modify RF64 support to parse and store the chunk size in _read_riff_chunk() so that _read_data_chunk() doesn't have to seek() back for it. This applies whether or not reading from a seekable stream.

To test this, I added test_streams(), which ensures that wavfile reads the same data no matter whether a path, seekable stream, or unseekable stream is passed to read().

Fixes scipy#11328

smammy · 2025-01-02T18:22:20Z

(Ugh, sorry for the multiple force-pushes, there were stray whitespace changes that I fumbled. The change for the linter looks like this.)

j-bowhay · 2025-01-02T19:20:34Z

scipy/io/tests/test_wavfile.py

+            rate1, data1 = wavfile.read(fp1)
+            rate2, data2 = wavfile.read(Nonseekable(fp2))
+            rate3, data3 = wavfile.read(dfname, mmap=False)
+            assert_array_equal(rate1, rate3)


Nit: could you change the new tests to use numpy.testing.assert_equal, as per the recommendations in the numpy docs?
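The `Nonseekable` wrapper used in the test lines above is not shown in this thread. A minimal sketch of such a test double, assuming it only needs to expose `read()` the way a pipe or socket does, could look like this (the class name `NonseekableStream` and its implementation are hypothetical):

```python
import io


class NonseekableStream(io.RawIOBase):
    """Hypothetical test double: readable but not seekable, like a pipe."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        # RawIOBase.read() is built on readinto(); copy what the source gives us.
        data = self._raw.read(len(buffer))
        buffer[:len(data)] = data
        return len(data)
```

Because `seek()` is not overridden, the `io.RawIOBase` default applies and raises `io.UnsupportedOperation`, which is what a reader handed a real pipe would see.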

scipy/io/tests/test_wavfile.py

j-bowhay · 2025-01-06T15:41:01Z

Thanks for a great first contribution to SciPy @smammy, please feel free to open PRs for anything else you might be interested in working on :)

Thanks as ever for the thorough review @nickodell

smammy · 2025-01-06T15:53:24Z

Thanks everyone for your help and review! Glad to learn something and get this working.

github-actions bot added the scipy.io and enhancement labels Dec 28, 2024

j-bowhay added this to the 1.16.0 milestone Dec 28, 2024

lucascolley changed the title ~~ENH: read unseekable files in scipy.io.wavfile.~~ ENH: io.wavfile: read unseekable files Dec 28, 2024

nickodell self-requested a review December 31, 2024 18:11

nickodell requested changes Dec 31, 2024


nickodell self-requested a review January 1, 2025 23:58

nickodell approved these changes Jan 1, 2025


smammy force-pushed the 11328-wavfile-support-unseekable branch from dd3568a to 674602a Compare January 2, 2025 18:09

smammy force-pushed the 11328-wavfile-support-unseekable branch from 674602a to c41adc2 Compare January 2, 2025 18:16

smammy force-pushed the 11328-wavfile-support-unseekable branch from c41adc2 to 96d162d Compare January 2, 2025 18:19

j-bowhay reviewed Jan 2, 2025


j-bowhay reviewed Jan 6, 2025



TST: assert_array_equal -> assert_equal

4f9f3a0

j-bowhay merged commit f9749da into scipy:main Jan 6, 2025
34 of 37 checks passed

smammy deleted the 11328-wavfile-support-unseekable branch January 6, 2025 15:53

lucascolley added the needs-release-note label Jan 31, 2025

lucascolley removed the needs-release-note label Jun 9, 2025

hippowm mentioned this pull request Jul 11, 2025

TST: io.wavfile: add test for SeekEmulatingReader.seek #23319

Merged
