Contributor

@smammy smammy commented Dec 28, 2024

Reference issue

Closes gh-11328

What does this implement/fix?

Allow scipy.io.wavfile.read() to read non-seekable files.

Additional information

Allow scipy.io.wavfile.read() to read non-seekable files.

  • If passed stream is not seekable, wrap it in a SeekEmulatingReader, which tracks current position and implements tell() to report it.

Most seek()s in wavfile.read() simply seek forward from the current position by a fixed amount. To address these:

  • Support limited SeekEmulatingReader.seek() that emulates seek()s that can be supported by read()ing and discarding the result.
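To make the read-and-discard idea concrete, here is a minimal sketch of such a wrapper. It is not the code from this PR: the class name `ForwardSeekReader`, the 64 KiB discard chunk size, and the error messages are assumptions made for illustration.

```python
import io


class ForwardSeekReader:
    """Hypothetical wrapper that tracks position and emulates forward seeks."""

    def __init__(self, raw):
        self.raw = raw  # underlying stream; only read() is ever called on it
        self.pos = 0    # number of bytes consumed so far

    def read(self, size=-1):
        data = self.raw.read(size)
        self.pos += len(data)
        return data

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self.pos + offset
        else:
            raise io.UnsupportedOperation("cannot seek relative to end of stream")
        if target < self.pos:
            raise io.UnsupportedOperation("cannot seek backwards on this stream")
        # Emulate the forward seek by reading and discarding bytes.
        remaining = target - self.pos
        while remaining > 0:
            chunk = self.read(min(remaining, 65536))
            if not chunk:  # hit end of stream early
                break
            remaining -= len(chunk)
        return self.pos
```

The loop matters for pipes, where a single read() may return fewer bytes than requested.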

Before returning, read() attempts to seek back to the start of the file. This is nice when possible, but if it isn't, we shouldn't blow up. So:

  • Avoid final fid.seek(0) if underlying stream isn't seekable.

The remaining seek()s are in RF64 support, where chunk size is global and stored in the RIFF header rather than in the chunk header. _read_data_chunk() currently seeks back to a fixed offset from the file origin to grab the chunk size. To address this:

  • Modify RF64 support to parse and store the chunk size in _read_riff_chunk() so that _read_data_chunk() doesn't have to seek() back for it. This applies whether or not reading from a seekable stream.
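As an illustration of parsing the size up front, the sketch below reads a RIFF or RF64 header strictly forward and returns the 64-bit data size. It is not the PR's `_read_riff_chunk()`: the function name `read_riff_header` is hypothetical, and the `ds64` layout (three little-endian 64-bit fields for RIFF size, data size, and sample count, then a table) is an assumption based on my understanding of the RF64 format.

```python
import io
import struct


def read_riff_header(fid):
    """Hypothetical helper: return (is_rf64, data_size) without seeking.

    data_size is None for plain RIFF, where each chunk header carries its size.
    """
    magic = fid.read(4)
    if magic not in (b"RIFF", b"RF64"):
        raise ValueError(f"not a RIFF/RF64 file: {magic!r}")
    fid.read(4)  # 32-bit RIFF size; a placeholder (0xFFFFFFFF) in RF64
    if fid.read(4) != b"WAVE":
        raise ValueError("not a WAVE file")
    if magic == b"RIFF":
        return False, None
    # Assumed layout: the 'ds64' chunk comes immediately after 'WAVE'.
    if fid.read(4) != b"ds64":
        raise ValueError("RF64 file is missing the ds64 chunk")
    (ds64_size,) = struct.unpack("<I", fid.read(4))
    _riff_size, data_size, _sample_count = struct.unpack("<QQQ", fid.read(24))
    # Skip the rest of ds64 (table length and table) by reading and discarding.
    fid.read(max(ds64_size - 24, 0))
    return True, data_size
```

Because the size is captured on the first pass, a later data-chunk reader can use the stored value instead of seeking back to a fixed file offset.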

To test this, I added test_streams(), which ensures that wavfile reads the same data no matter whether a path, seekable stream, or unseekable stream is passed to read().

@github-actions github-actions bot added scipy.io enhancement A new feature or improvement labels Dec 28, 2024
@j-bowhay j-bowhay added this to the 1.16.0 milestone Dec 28, 2024
@lucascolley lucascolley changed the title ENH: read unseekable files in scipy.io.wavfile. ENH: io.wavfile: read unseekable files Dec 28, 2024
@j-bowhay
Member

@hpx7 would you be able to have a look, does this fix your issue?

@j-bowhay
Member

Thanks again @nickodell for reviewing #22215, would you have time to have a look at this too?

@nickodell nickodell self-requested a review December 31, 2024 18:11
Member

@nickodell nickodell left a comment

I like the general concept of this change, but there are a few details that need to be addressed.


def flush(self):
    # make np.fromfile fail intelligibly
    raise io.UnsupportedOperation()
Member

> # make np.fromfile fail intelligibly

Can you elaborate? What is this protecting against?

Member

Also, this should have an exception message to clarify why it's being thrown.

Member

I've investigated this in a little more detail.

The reason why np.fromfile() calls flush() is that it could be reading from a read/write file handle, where changes were just made. Before reading, it flushes changes to the file, duplicates the file handle, seeks the duplicate to the current offset, then starts reading. When it is done, it closes the duplicated file handle. See numpy/numpy#7831 for detail.

This has two implications.

  1. First, reading from unseekable streams will never use the np.fromfile() fast path, because np.fromfile() calls flush(), which the wrapper deliberately rejects.

  2. Second, even if we added flush support, e.g.

    def flush(self):
        return self.reader.flush()

    this would still not allow use of the fast path, because NumPy wants to seek the stream, which this stream does not support. We cannot intercept this seek, because it is done in C. NumPy does this seek even if offset=0, so we cannot avoid it. Link to code

In summary, throwing an exception is the best we can do here.
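The fallback this relies on can be sketched roughly as follows. This is a simplified illustration, not SciPy's actual code: the helper name `read_samples` is hypothetical, and the demo uses io.BytesIO, which makes np.fromfile() raise io.UnsupportedOperation because it has no file descriptor.

```python
import io

import numpy as np


def read_samples(fid, dtype, count):
    """Hypothetical helper: try the np.fromfile() fast path, else read bytes."""
    dtype = np.dtype(dtype)
    try:
        # Fast path: only works for real, seekable OS-level files.
        return np.fromfile(fid, dtype=dtype, count=count)
    except io.UnsupportedOperation:
        # Slow path: read the raw bytes through Python and reinterpret them.
        raw = fid.read(count * dtype.itemsize)
        return np.frombuffer(raw, dtype=dtype)


buf = io.BytesIO(np.arange(4, dtype="<i2").tobytes())
samples = read_samples(buf, "<i2", 4)
```

A wrapper whose flush() raises io.UnsupportedOperation lands in the same except branch, which is why raising there leads to a clean fallback rather than a confusing failure inside NumPy's C code.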

Contributor Author

@smammy smammy Jan 2, 2025

Thanks for chasing that down. And there are additional seeks that can't be avoided without changing numpy's API, e.g. array_fromfile_binary in ctors.c.

if not hasattr(filename, 'read'):
    fid.close()
else:
    fid.seek(0)
Member

I am frankly mystified as to why this final seek operation was originally here. Most other libraries will not seek to the beginning of a file if you hand them an open file to read - e.g. if you use pd.read_csv() to read a file, the file position after the function call will be where it stopped reading.

I would suggest removing fid.seek(0), and not seeking even if the file supports it.

Thoughts, @j-bowhay ?

Member

It was added in 47b9012, @nils-werner do you happen to remember?

Contributor

If I remember correctly it was there before, I just proposed the if clause. Maybe follow the git blame to the changes before?

I don't think it needs to be there at all tbh.

Contributor Author

seek(0) doesn't seem to appear prior to 47b9012, maybe it had something to do with StringIO compatibility?

Removing it makes sense to me, but I worry it could break user code that depends on that behavior.

Contributor

Ah right, IIRC that was to fill the buffer with RIFF data, and then rewind it so that the buffer can immediately be used by something that wants to read a WAV file (which contains RIFF data).

The use case was the Audio tag in Jupyter, where you'd want to export an array to a WAV file-like buffer, and then send it over to the frontend.

Contributor Author

Ah ok, that makes sense. I've added a comment.

Member

Commit dd3568a looks like a nice compromise: it does not swallow exceptions if the underlying stream claims to support seeking, but actually can't. This was my primary concern. It also doesn't change functionality in case applications are relying on the seek(0) behavior.
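The shape of that compromise can be sketched as below. The contents of dd3568a are not shown in this thread, so this is an assumption about the approach rather than the committed code, and the helper name `rewind_if_possible` is hypothetical.

```python
import io


def rewind_if_possible(fid):
    """Hypothetical helper: rewind only streams that claim to be seekable."""
    if fid.seekable():
        # No try/except here: if the stream claims to support seeking but
        # seek() fails, the error propagates instead of being swallowed.
        fid.seek(0)
        return True
    return False
```

Callers that pass an io.BytesIO still get it back rewound, while pipes and sockets are simply left at the position where reading stopped.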

@nickodell
Member

I wrote a test script for this PR to check that it works when reading wav data from a pipe, and it does.

Example:

import subprocess

import scipy.io.wavfile

def checksum_first_channel(res):
    sr, a = res
    # Sum as float so that integer sample types cannot overflow.
    if a.ndim == 2:
        chk = a[:, 0].astype('float').sum()
    elif a.ndim == 1:
        chk = a.astype('float').sum()
    else:
        raise ValueError(f"unexpected array shape: {a.shape}")
    print("chk", chk)
    return chk

def read_from_subproc(filename):
    sp = subprocess.Popen(
        ['cat', filename],
        stdout=subprocess.PIPE
    )
    return sp.stdout

a = scipy.io.wavfile.read('../wavtest1-rf64.wav')
assert checksum_first_channel(a) == -122419968.0
a = scipy.io.wavfile.read('../wavtest2.wav')
assert checksum_first_channel(a) == -82594741.0

a = scipy.io.wavfile.read(read_from_subproc('../wavtest1-rf64.wav'))
assert checksum_first_channel(a) == -122419968.0

a = scipy.io.wavfile.read(read_from_subproc('../wavtest2.wav'))
assert checksum_first_channel(a) == -82594741.0

I checked this with PCM and RF64 wav files.

@smammy
Contributor Author

smammy commented Jan 1, 2025

Thanks for the review! I've pushed changes, and can squash them if you like.

@nickodell
Member

> Thanks for the review! I've pushed changes, and can squash them if you like.

Yes, please do.

@nickodell nickodell self-requested a review January 1, 2025 23:58
@smammy smammy force-pushed the 11328-wavfile-support-unseekable branch from dd3568a to 674602a Compare January 2, 2025 18:09
@smammy
Contributor Author

smammy commented Jan 2, 2025

Squashed, and split a long string to make the linter happy.

@smammy smammy force-pushed the 11328-wavfile-support-unseekable branch from 674602a to c41adc2 Compare January 2, 2025 18:16
Allow scipy.io.wavfile.read() to read non-seekable files.

  * If passed stream is not seekable, wrap it in a SeekEmulatingReader,
    which tracks current position and implements tell() to report it.

Most seek()s in wavfile.read() simply seek forward from the current
position by a fixed amount. To address these:

  * Support limited SeekEmulatingReader.seek() that emulates seek()s
    that can be supported by read()ing and discarding the result.

Before returning, read() attempts to seek back to the start of the file.
This is nice when possible, but if it isn't we shouldn't blow up. So:

  * Avoid final fid.seek(0) if underlying stream isn't seekable.

The remaining seek()s are in RF64 support, where chunk size is global
and stored in the RIFF header rather than in the chunk header.
_read_data_chunk() currently seeks back to a fixed offset from the file
origin to grab the chunk size. To address this:

  * Modify RF64 support to parse and store the chunk size in
    _read_riff_chunk() so that _read_data_chunk() doesn't have to seek()
    back for it. This applies whether or not reading from a seekable
    stream.

To test this, I added test_streams(), which ensures that wavfile reads
the same data no matter whether a path, seekable stream, or unseekable
stream is passed to read().

Fixes scipy#11328
@smammy smammy force-pushed the 11328-wavfile-support-unseekable branch from c41adc2 to 96d162d Compare January 2, 2025 18:19
@smammy
Contributor Author

smammy commented Jan 2, 2025

(Ugh, sorry for the multiple force-pushes, there were stray whitespace changes that I fumbled. The change for the linter looks like this.)

rate1, data1 = wavfile.read(fp1)
rate2, data2 = wavfile.read(Nonseekable(fp2))
rate3, data3 = wavfile.read(dfname, mmap=False)
assert_array_equal(rate1, rate3)
Member

Nit: could you change the new tests to use numpy.testing.assert_equal, as per the recommendations in the NumPy docs?
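For readers without the diff at hand, a test helper in the spirit of the Nonseekable wrapper used above might look like the following. The PR's own definition is not shown in this conversation, so the class below (named `NonseekableStub` to avoid confusion) is a hypothetical stand-in.

```python
import io


class NonseekableStub:
    """Hypothetical test helper: exposes read() but refuses seek() and tell()."""

    def __init__(self, fp):
        self.fp = fp

    def seekable(self):
        return False

    def read(self, size=-1):
        return self.fp.read(size)

    def seek(self, offset, whence=io.SEEK_SET):
        raise io.UnsupportedOperation("seek")

    def tell(self):
        raise io.UnsupportedOperation("tell")

    def close(self):
        self.fp.close()
```

Wrapping an ordinary seekable file this way gives a test a deterministic unseekable stream without having to spawn a subprocess or open a pipe.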

@j-bowhay j-bowhay merged commit f9749da into scipy:main Jan 6, 2025
34 of 37 checks passed
@j-bowhay
Member

j-bowhay commented Jan 6, 2025

Thanks for a great first contribution to SciPy @smammy, please feel free to open PRs for anything else you might be interested in working on :)

Thanks as ever for the thorough review @nickodell

@smammy
Contributor Author

smammy commented Jan 6, 2025

Thanks everyone for your help and review! Glad to learn something and get this working.

@smammy smammy deleted the 11328-wavfile-support-unseekable branch January 6, 2025 15:53
@lucascolley lucascolley added the needs-release-note a maintainer should add a release note written by a reviewer/author to the wiki label Jan 31, 2025
@lucascolley lucascolley removed the needs-release-note a maintainer should add a release note written by a reviewer/author to the wiki label Jun 9, 2025
Successfully merging this pull request may close these issues.

Scipy unable to read piped wav file
5 participants