
basic support for arbitrary chunks #25

Closed
wants to merge 8 commits

Conversation

@kali (Contributor) commented Aug 28, 2018

Hello there, this is an extension to support writing and reading arbitrary chunks between "fmt " and "data".

Let me know what you think.

(By the way, I think there is a tiny bug in the skip_bytes call while looking for the data chunk. As far as I understand it, RIFF chunks must be aligned to even offsets, so there might be a padding zero byte... This patch actually fixes it.)
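
To illustrate the alignment rule, roughly (just a sketch with made-up helper names, not the actual patch):

use std::io::{self, Read};

/// Skip a chunk body of `len` bytes, plus one padding byte when the length
/// is odd, so the reader ends up at the start of the next chunk header.
fn skip_chunk<R: Read>(reader: &mut R, len: u32) -> io::Result<()> {
    // RIFF chunks are word-aligned: an odd-sized body is followed by a zero
    // byte that is not counted in the chunk size field.
    let to_skip = len as u64 + (len as u64 & 1);
    let skipped = io::copy(&mut reader.take(to_skip), &mut io::sink())?;
    if skipped != to_skip {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "chunk truncated"));
    }
    Ok(())
}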

@ruuda (Owner) commented Aug 28, 2018

Thanks a lot for taking the time to implement this! I agree that it would be a useful thing to have.

I think the API can still be improved. I struggled with a similar thing a lot in Claxon. I considered a callback-based API there too, but then I realised that an iterator-based API is more flexible, more convenient, and more performant:

  • You can store an iterator in a struct, or move it into a function. A callback forces you to do everything in one place, at once.
  • With an iterator the caller is in control of when to read each chunk, and it is easy to stop early if desired (for example, if you're only interested in particular metadata, you can stop right after reading the chunk of interest). With a callback, the callback could return a boolean that indicates whether to continue, but not when to continue. For example, something like reading the first chunk of a set of files, then reading the second chunk of every file, etc. is straightforward with iterators, but not possible with callbacks.

Also, rather than buffering up the entire chunk in memory, it is advantageous to expose a bounded io::Read instead. The caller can still easily read the entire chunk into memory if desired, but io::Read also enables streaming. Furthermore, the particular implementation could be one that does a seek on drop if not all bytes are read (if the underlying reader is io::Seek). That makes it possible to quickly skip over uninteresting chunks. For Claxon I started working on an EmbeddedReader for this purpose. It is similar to io::Take in that it enforces an upper limit, so a consumer cannot read past the chunk, but it additionally ensures that the underlying reader is advanced by exactly the chunk size on drop, so if the caller does not consume all bytes, the EmbeddedReader will skip over the remainder on drop.
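
Roughly, I have something like this in mind (names and details are not final; this is just a sketch assuming the underlying reader is io::Read + io::Seek):

use std::io::{self, Read, Seek, SeekFrom};

/// Exposes at most `remaining` bytes of the inner reader and, on drop,
/// seeks past whatever the caller did not consume.
struct BoundedReader<'a, R: Read + Seek + 'a> {
    inner: &'a mut R,
    remaining: u64,
}

impl<'a, R: Read + Seek> Read for BoundedReader<'a, R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let max = buf.len().min(self.remaining as usize);
        let n = self.inner.read(&mut buf[..max])?;
        self.remaining -= n as u64;
        Ok(n)
    }
}

impl<'a, R: Read + Seek> Drop for BoundedReader<'a, R> {
    fn drop(&mut self) {
        if self.remaining > 0 {
            // Best effort: errors cannot be reported from drop.
            let _ = self.inner.seek(SeekFrom::Current(self.remaining as i64));
        }
    }
}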

We could adapt WavReader as follows:

  • Don't read until the data chunk upon construction, only read the fmt chunk.
  • On samples, into_samples, or seek, read up to the data chunk if necessary.
  • Expose an additional method chunks that returns a Chunks: Iterator<Item = Result<Chunk<'a>>> iterator, where struct Chunk { name: [u8; 4], reader: &'a mut EmbeddedReader<'a> }.

I had not fully implemented this for Claxon yet, so sorry if the details are a bit vague, but I think this is the way to go. The analogous parts there are MetadataReader which would be like Chunks, and ApplicationBlock which would be like Chunk.

For writing, providing slices at construction time limits the caller; it prevents streaming, or may force unnecessary buffering. (For example, if you would want to write a cover art image, and the writer takes an io::Write, then you would first need to write to an io::Cursor<Vec<u8>> and then provide the vec, while it could have been written at once.) This can be addressed in a similar way, at very little runtime cost:

  • Add a flag to WavWriter that specifies whether chunks can still be written, which is true at construction.
  • Set it to false in write_sample and get_i16_writer.
  • Add a method get_chunk_writer(&mut self, name: [u8; 4]) -> ChunkWriter<'s>, similar to get_i16_writer(). ChunkWriter would implement io::Write and defer writes to the underlying writer, while additionally keeping track of the write count.
  • At ChunkWriter::drop(), it would seek back, overwrite the data chunk header with the correct chunk header and length (similar to what update_header does), then seek forward to the end again and write a new data chunk header there, which will be updated by update_header when the file is finalized.
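
A simplified sketch of just the length-patching part (it leaves out the interaction with the provisional data chunk header, and the names are made up):

use std::io::{self, Seek, SeekFrom, Write};

/// Writes a single chunk: emits a placeholder header up front, counts the
/// bytes written to the body, and patches the size field on drop.
struct ChunkWriter<'a, W: Write + Seek + 'a> {
    inner: &'a mut W,
    header_pos: u64,
    written: u32,
}

impl<'a, W: Write + Seek> ChunkWriter<'a, W> {
    fn new(inner: &'a mut W, name: [u8; 4]) -> io::Result<ChunkWriter<'a, W>> {
        let header_pos = inner.seek(SeekFrom::Current(0))?;
        inner.write_all(&name)?;
        inner.write_all(&[0, 0, 0, 0])?; // size field, patched on drop
        Ok(ChunkWriter { inner: inner, header_pos: header_pos, written: 0 })
    }
}

impl<'a, W: Write + Seek> Write for ChunkWriter<'a, W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.written += n as u32;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

impl<'a, W: Write + Seek> Drop for ChunkWriter<'a, W> {
    fn drop(&mut self) {
        // Pad an odd-sized body, patch the size field (little endian, body
        // size excluding padding), then return to the end of the chunk.
        // Errors cannot be reported from drop, so a real implementation
        // would likely offer an explicit finalize as well.
        let _ = (|| -> io::Result<()> {
            if self.written % 2 == 1 {
                self.inner.write_all(&[0])?;
            }
            let end = self.inner.seek(SeekFrom::Current(0))?;
            self.inner.seek(SeekFrom::Start(self.header_pos + 4))?;
            let s = self.written;
            self.inner.write_all(&[s as u8, (s >> 8) as u8, (s >> 16) as u8, (s >> 24) as u8])?;
            self.inner.seek(SeekFrom::Start(end))?;
            Ok(())
        })();
    }
}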

What do you think?

By the way, I think there is a tiny bug in the skip_bytes call while looking for the data chunk. As far as I understand it, RIFF chunks must be aligned to even offsets, so there might be a padding zero byte... This patch actually fixes it.

Thanks for pointing this out, looks like a bug indeed. Do you happen to have a small (< 100 byte) test sample that currently fails to read because of this? It would be nice to add that as a regression test.

@kali (Contributor, Author) commented Aug 29, 2018

  1. I like your suggestions, and actually gave a ChunksIterator/ChunkReader impl a shot this morning. The reader looks straightforward, but the Iterator is a bit tricky: the ChunkReader needs to access the original Read, which requires borrowing the WavReader, either directly or through the iterator.

Here is a gist of the issue https://play.rust-lang.org/?gist=861c43a989f7b01ebef5aafcf466d274&version=stable&mode=debug&edition=2015, and a conversation that I think relates to the problem.

I think an alternative is to drop the Iterator trait and have a next_chunk() method instead... Something like this:

https://play.rust-lang.org/?gist=5630fe68b24d77f8c836fde21665f550&version=stable&mode=debug&edition=2015

  2. If we go for this more extensive support, it is a bit sad that chunks before "fmt " cannot be read. We could fix this by making spec() and all derived methods lazy (the WavReader.spec field becoming an Option), at the price of an API change: these methods could now trigger the spec reading and decoding, and could therefore fail (so they would need to return a Result<_>). The same remark applies to the trailing chunks.

  3. I don't foresee any issue of the same kind with the plan for writing.

  4. For the odd-length decoding issue, I can artificially craft a wav with a custom chunk, but custom chunks are not that common, so I don't happen to have one that organically shows the issue... Would a synthetic wave be satisfactory for a unit test?

@ruuda (Owner) commented Aug 30, 2018

Here is a gist of the issue https://play.rust-lang.org/?gist=861c43a989f7b01ebef5aafcf466d274&version=stable&mode=debug&edition=2015, and a conversation that I think relates to the problem.

Aaah, that’s right. Now that you mention it, I remember hitting the same issue in Claxon. The root cause is that in std::Iterator, the return type of next(&'a mut self) cannot refer to 'a. I think your solution of not implementing Iterator but having a next method nonetheless is the best we can do. I would keep the interface as similar to that of std::Iterator as possible.
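
Condensed, the problem is this (type names are just for illustration):

use std::io::Read;

struct Chunk<'a, R: Read + 'a> {
    reader: &'a mut R,
}

struct ChunksReader<R: Read> {
    reader: R,
}

impl<R: Read> ChunksReader<R> {
    // This works: the returned Chunk borrows self for as long as it lives.
    fn next_chunk<'a>(&'a mut self) -> Option<Chunk<'a, R>> {
        // ... parse the next chunk header from self.reader here ...
        Some(Chunk { reader: &mut self.reader })
    }
}

// This cannot be written: Iterator::next(&mut self) -> Option<Self::Item>
// gives Item no way to name the lifetime of that particular &mut self
// borrow, so Item cannot be Chunk<'_, R>.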

If we go for this more extensive support, it is a bit sad that chunks before "fmt " could not be read.

That's a shame indeed. I somehow thought that the fmt chunk must be the first one, but according to this document the only requirement is that it must precede the DATA chunk. And in fact there are test samples in the repository where there are other chunks before the fmt chunk. Making spec() lazy would be inconvenient because of the Result.

An alternative would be to expose ChunkReader directly and to put it on equal footing with WavReader. WavReader could internally use the ChunkReader, but a user could construct one manually too. To read both unknown chunks and the audio data it might be necessary to read a file twice. For now I don't think that is a big problem, and in the future it could be addressed with an extra WavReader constructor that can take over the reader mid-stream. What do you think? Would this be acceptable for your use case?

Would a synthetic wave be satisfactory for a unit test?

I would prefer to have one that occurred in the wild; we’ve observed files before that should be invalid according to Microsoft’s documentation, but some software wrote them nonetheless. I can imagine that the padding byte should be there, but de facto it is not. All test samples in the repository happen to have only even-sized chunks. According to this document (chapter 2) you are right about the padding byte though; if an artificial test file is the best we can do, then let's go with that.
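
For reference, an artificial file could be built along these lines (hypothetical helper; only the odd-sized chunk part is shown):

/// Build the start of an in-memory WAV with one odd-sized custom chunk, to
/// exercise the padding byte handling; the fmt and data chunks still need
/// to be appended before this is a valid file.
fn wav_prefix_with_odd_chunk() -> Vec<u8> {
    let mut v = Vec::new();
    v.extend_from_slice(b"RIFF");
    v.extend_from_slice(&[0, 0, 0, 0]); // RIFF size, to be patched at the end
    v.extend_from_slice(b"WAVE");
    // Custom chunk with a 3-byte body; the size field says 3, but the body
    // is followed by one zero padding byte to keep the next chunk aligned.
    v.extend_from_slice(b"test");
    v.extend_from_slice(&[3, 0, 0, 0]);
    v.extend_from_slice(b"abc\0");
    v
}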

@kali (Contributor, Author) commented Sep 3, 2018

Hey @ruuda, here is a WIP for reading arbitrary chunks, please have a look and tell me what you think. The idea is to cut WavReader in half, so there is now a ChunksReader that does most of the heavy lifting while the WavReader keeps the same easy-to-use interface.

I'm working on the writing side too, but once again, our nice big plan will not work as is. The problem here is that Drop::drop has no way to report a problem... So I'm considering other alternatives.

@kali (Contributor, Author) commented Sep 3, 2018

Here we go, @ruuda, a tentative writer in the same spirit. Tell me if it's worth finalizing or if it is too far from what you would like to see.

@ruuda (Owner) commented Sep 3, 2018

The problem here is that Drop::drop has no way to report a problem...

We already deal with this: there is WavWriter::finalize() that consumes the writer and reports any errors; a kind of explicit destructor. If it is not called, drop will still do the finalization, but drop will silently discard any errors.
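
In short, the pattern is (illustrative, not the exact WavWriter code):

use std::io;

struct Writer {
    finalized: bool,
}

impl Writer {
    /// Explicit destructor: consumes the writer and reports any error.
    fn finalize(mut self) -> io::Result<()> {
        self.finalized = true;
        self.update_header()
    }

    fn update_header(&mut self) -> io::Result<()> {
        // ... seek back, fix up sizes, flush ...
        Ok(())
    }
}

impl Drop for Writer {
    fn drop(&mut self) {
        if !self.finalized {
            // Fallback when finalize() was not called: still finalize, but
            // any error is silently discarded.
            let _ = self.update_header();
        }
    }
}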

@ruuda (Owner) left a comment

I looked at the reader part, and I like the way it is going, but I think we can make the API easier and safer to use.

I did not look at the writer part yet ... this PR is getting really big, could you maybe split the writer part into a separate PR?

src/lib.rs Outdated
@@ -327,6 +327,8 @@ pub enum Error {
IoError(io::Error),
/// Ill-formed WAVE data was encountered.
FormatError(&'static str),
/// An inconsistency in parsing chunks has been detected.
InvalidState(&'static str),
@ruuda (Owner) commented:

It looks like this error represents a programming error. I would use an assertion instead in these cases. An Error is for errors that can arise at runtime even when the library is used correctly. Users of the library should expect these errors and be prepared to deal with them. An invalid state on the other hand, is a bug. As a user of the library, there is no good way to recover from it.
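
For example (hypothetical function, not an actual hound signature):

fn read_fmt_chunk_example(at_fmt_chunk: bool, chunk_len: u32) -> Result<(), &'static str> {
    // A violated precondition is a bug in the calling code: assert on it.
    assert!(at_fmt_chunk, "read_fmt_chunk called while not positioned at an fmt chunk");

    // A malformed file is a runtime condition the caller must be able to
    // handle: return an error.
    if chunk_len < 16 {
        return Err("fmt chunk is too short");
    }
    Ok(())
}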

src/lib.rs Outdated
@@ -779,6 +787,7 @@ fn append_works_on_files() {
let len = fs::metadata("append.wav").unwrap().len();

let mut appender = WavWriter::append("append.wav").unwrap();
println!("appender: {:?} len:{}", appender.spec(), len);
@ruuda (Owner) commented:

Is this a leftover debug print?

src/lib.rs Outdated
}

#[test]
fn write_read_chunks_is_lossless_with_odd_size() {
@ruuda (Owner) commented:

Thanks for covering this case too!

Maybe we can avoid some duplication by putting the test in a loop, where the chunk body size is the loop variable. Something like

let full_contents = b"12345678";
for body_len in 0..full_contents.len() {
  let contents = &full_contents[..body_len];
  // Write and read here, using `contents` as chunk body.
}

src/read.rs Outdated
}
// If no data chunk is ever encountered, the function will return
// via one of the try! macros that return an Err on end of file.
pub fn next_chunk(&mut self) -> Result<Option<ChunkHeader>> {
@ruuda (Owner) commented:

I would call this next() to stay as close to std::iter::Iterator as possible. Also, I think we can make the API safer by returning something like

enum Chunk<'r> {
    Fmt(WavSpecEx),
    Fact(Fact),
    Data(EmbeddedReader<'r>),
    Unknown([u8; 4], EmbeddedReader<'r>),
}

This would make the API impossible to misuse. It would prevent the case of calling read_fmt_chunk while not at an fmt chunk, and it would ensure that once the borrow of self ends, the reader is positioned at the start of a chunk (if EmbeddedReader would skip to the end of the chunk on drop).

src/read.rs Outdated
self.reader
}

pub fn into_wav_reader(mut self) -> Result<WavReader<R>> {
@ruuda (Owner) commented:

I’m not sure how useful this is, because of the concerns you raised before about accessing chunks after the DATA chunk. What we could do instead is allow constructing the samples iterator from a WavSpecEx and a Reader that is assumed to be at the start of a DATA chunk. The reader could be an EmbeddedReader from Chunk::Data. Then if you want access to all chunks and still read the audio data in one pass, you can do so, albeit with a bit more manual work. And if you just want the audio, WavReader still works; if you just want the raw data of a specific chunk, ChunkReader works. What do you think?

@kali (Contributor, Author) commented Sep 4, 2018

Hey @ruuda, I moved the writer part away for now and found out we had one sample with a non-standard chunk to test with. The lossless read/write tests will come back with the writer later.

There was still one thing I could not do as you expected. I cannot use the EmbeddedReader for the Data chunk, as the next() caller (WavReader::open) would have to keep both the ChunksReader and the EmbeddedReader around, making it a self-referencing struct. So the EmbeddedReader is perfect for the Unknown case, but I had to put some state inside the ChunksReader to manage the Data state...

src/lib.rs Outdated
@@ -790,3 +790,38 @@ fn append_works_on_files() {

assert_contents("append.wav", &[11, 13, 17, 19, 23]);
}

#[cfg(test)]
macro_rules! gard {
@ruuda (Owner) commented:

Neat, I’ve wanted such a thing in a few other places too. What does gard mean though, did you mean guard?

@kali (Contributor, Author) replied:

Aha, indeed, this is a French-ism.

src/read.rs Outdated
/// how long the chunk is, in bytes
pub len: i64,
/// how much bytes remains to be read
pub remaining: i64,
@ruuda (Owner) commented:

Is there a reason these are i64 rather than u64? If avoiding a few casts is the reason I would go for u64 instead; that makes it impossible to represent invalid states.

src/read.rs Outdated
self.reader.seek(io::SeekFrom::Current(offset))
}
} else {
Err(io::Error::new(io::ErrorKind::Other, "Only relative seek is supported."))
@ruuda (Owner) commented:

I think we can also handle SeekFrom::Start and SeekFrom::End; we know the current position from the start of the chunk, so we can translate a seek to an absolute position into a relative seek on the underlying reader.
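
Something along these lines (sketch; field names are illustrative):

use std::io::{self, SeekFrom};

/// Translate a seek within the chunk into a relative offset for the
/// underlying reader, given the chunk length and the unread byte count.
fn to_relative_offset(pos: SeekFrom, len: u64, remaining: u64) -> io::Result<i64> {
    let current = (len - remaining) as i64;
    let target = match pos {
        SeekFrom::Start(offset) => offset as i64,
        SeekFrom::Current(offset) => current + offset,
        SeekFrom::End(offset) => len as i64 + offset,
    };
    if target < 0 || target > len as i64 {
        Err(io::Error::new(io::ErrorKind::Other, "seek outside of chunk"))
    } else {
        Ok(target - current)
    }
}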

src/read.rs Outdated
let current_in_chunk = self.len as i64 - self.remaining as i64;
let wanted_in_chunk = current_in_chunk as i64 + offset;
if wanted_in_chunk < 0 {
Err(io::Error::new(io::ErrorKind::Other, "Seeking befoer begin of chunk"))
@ruuda (Owner) commented:

Typo: “befoer”

///
/// This function will only read the Riff header from the file
/// in order to position the stream to the first chunk.
pub fn new(mut reader: R) -> Result<ChunksReader<R>> {
@ruuda (Owner) commented:

👍 this is a nice way to handle the header.

src/read.rs Outdated
///
/// This function will panic if it is called while the reader is not in
/// the data chunk, or if the format has not been parsed.
pub fn len(&self) -> u32 {
@ruuda (Owner) commented:

This is a bit confusing on the chunks iterator; if I saw

let mut chunks_reader = ChunkReader::new(file)?;
chunks_reader.len()

I would think len would return the number of chunks, not the number of samples in the data chunk. The best solution I can think of would be to leave off len() and duration() entirely; a user who reads chunks manually can compute them from the fmt chunk data instead.
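
For example, roughly (field names are illustrative):

/// Compute what len() and duration() would have returned, from the fmt
/// chunk fields and the size in bytes of the data chunk.
fn data_chunk_stats(data_len: u32, channels: u16, bits_per_sample: u16) -> (u32, u32) {
    let bytes_per_sample = (bits_per_sample as u32 + 7) / 8;
    let num_samples = data_len / bytes_per_sample; // across all channels
    let duration = num_samples / channels as u32;  // per channel, in samples
    (num_samples, duration)
}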

src/read.rs Outdated
}
let mut kind_str = [0; 4];
if let Err(_) = self.reader.read_into(&mut kind_str) {
// assumes EOF
@ruuda (Owner) commented:

This is not correct; it might stop iteration on any kind of IO error. Unfortunately read_into does not return an io::ErrorKind::UnexpectedEof because it did not yet exist in Rust 1.4.0. I can fix it later if you like, but could you please put a TODO here?

@ruuda (Owner) commented Sep 4, 2018

So the EmbeddedReader is perfect for the Unknown case, but I had to put some state inside the ChunksReader to manage the Data state...

Hmm, I can see why now. Making DATA a special case feels a bit awkward, and it also makes me less confident that EmbeddedReader is suitable for building other things on top of. But at this point I can't think of a nicer way to make it work than your solution either.

@ruuda (Owner) commented Sep 5, 2018

Looks good now! I think I’ll go ahead and merge this to master, and I’ll make another pass over it once I get around to fixing up metadata reading in Claxon, to make them consistent, and to get some confidence that the API is right, if it works for both flac and wav. Is that okay with you?

@kali (Contributor, Author) commented Sep 5, 2018

Happy to hear that. I'm in no rush at all, but do you want me to give a second go at the writer while I'm on the topic?

@ruuda (Owner) commented Sep 10, 2018

do you want me to give a second go at the writer while I'm on the topic?

Sure, go ahead.

@ruuda (Owner) commented Sep 10, 2018

Merged as 0db8127. Thanks a lot!

@ruuda closed this Sep 10, 2018
@ruuda mentioned this pull request Sep 11, 2018