
basic support for arbitrary chunks #25

Closed
wants to merge 8 commits

Conversation

@kali (Contributor) commented Aug 28, 2018

Hello there, this is an extension to support writing and reading arbitrary chunks between "fmt " and "data".

Let me know what you think.

(By the way, I think there is a tiny bug in the skip_bytes call while looking for the data chunk. As far as I understand it, RIFF chunks must be aligned to even offsets, so there might be a padding zero byte... This patch actually fixes it.)
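
To illustrate the alignment rule, roughly (just a sketch with made-up helper names, not the actual patch):

use std::io::{self, Read};

/// Skip a chunk body of `len` bytes, plus one padding byte when the length
/// is odd, so the reader ends up at the start of the next chunk header.
fn skip_chunk<R: Read>(reader: &mut R, len: u32) -> io::Result<()> {
    // RIFF chunks are word-aligned: an odd-sized body is followed by a zero
    // byte that is not counted in the chunk size field.
    let to_skip = len as u64 + (len as u64 & 1);
    let skipped = io::copy(&mut reader.take(to_skip), &mut io::sink())?;
    if skipped != to_skip {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "chunk truncated"));
    }
    Ok(())
}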

@ruuda (Owner) commented Aug 28, 2018

Thanks a lot for taking the time to implement this! I agree that it would be a useful thing to have.

I think the API can still be improved. I struggled with a similar thing a lot in Claxon. I considered a callback-based API there too, but then I realised that an iterator-based API is more flexible, more convenient, and more performant:

  • You can store an iterator in a struct, or move it into a function. A callback forces you to do everything in one place, at once.
  • With an iterator the caller is in control of when to read each chunk, and it is easy to stop early if desired (for example, if you're only interested in particular metadata, you can stop right after reading the chunk of interest). With a callback, the callback could return a boolean that indicates whether to continue, but not when to continue. For example, something like reading the first chunk of a set of files, then reading the second chunk of every file, etc. is straightforward with iterators, but not possible with callbacks.

Also, rather than buffering up the entire chunk in memory, it is advantageous to expose a bounded io::Read instead. The caller can still easily read the entire chunk into memory if desired, but io::Read also enables streaming. Furthermore, the particular implementation could be one that does a seek on drop if not all bytes are read (if the underlying reader is io::Seek). That makes it possible to quickly skip over uninteresting chunks. For Claxon I started working on an EmbeddedReader for this purpose. It is similar to io::Take in that it enforces an upper limit, so a consumer cannot read past the chunk, but it additionally ensures that the underlying reader is advanced by exactly the chunk size on drop, so if the caller does not consume all bytes, the EmbeddedReader will skip over the remainder on drop.
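
Roughly, I have something like this in mind (names and details are not final; this is just a sketch assuming the underlying reader is io::Read + io::Seek):

use std::io::{self, Read, Seek, SeekFrom};

/// Exposes at most `remaining` bytes of the inner reader and, on drop,
/// seeks past whatever the caller did not consume.
struct BoundedReader<'a, R: Read + Seek + 'a> {
    inner: &'a mut R,
    remaining: u64,
}

impl<'a, R: Read + Seek> Read for BoundedReader<'a, R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let max = buf.len().min(self.remaining as usize);
        let n = self.inner.read(&mut buf[..max])?;
        self.remaining -= n as u64;
        Ok(n)
    }
}

impl<'a, R: Read + Seek> Drop for BoundedReader<'a, R> {
    fn drop(&mut self) {
        if self.remaining > 0 {
            // Best effort: errors cannot be reported from drop.
            let _ = self.inner.seek(SeekFrom::Current(self.remaining as i64));
        }
    }
}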

We could adapt WavReader as follows:

  • Don't read until the data chunk upon construction, only read the fmt chunk.
  • On samples, into_samples, or seek, read up to the data chunk if necessary.
  • Expose an additional method chunks that returns a Chunks: Iterator<Item = Result<Chunk<'a>>> iterator, where struct Chunk { name: [u8; 4], reader: &'a mut EmbeddedReader<'a> }.

I had not fully implemented this for Claxon yet, so sorry if the details are a bit vague, but I think this is the way to go. The analogous parts there are MetadataReader which would be like Chunks, and ApplicationBlock which would be like Chunk.

For writing, providing slices at construction time limits the caller; it prevents streaming, or may force unnecessary buffering. (For example, if you would want to write a cover art image, and the writer takes an io::Write, then you would first need to write to an io::Cursor<Vec<u8>> and then provide the vec, while it could have been written at once.) This can be addressed in a similar way, at very little runtime cost:

  • Add a flag to WavWriter that specifies whether chunks can still be written, which is true at construction.
  • Set it to false in write_sample and get_i16_writer.
  • Add a method get_chunk_writer(&mut self, name: [u8; 4]) -> ChunkWriter<'s>, similar to get_i16_writer(). ChunkWriter would implement io::Write and defer writes to the underlying writer, while additionally keeping track of the write count.
  • At ChunkWriter::drop(), it would seek back, overwrite the data chunk header with the correct chunk header and length (similar to what update_header does), then seek forward to the end again and write a new data chunk header there, which will be updated by update_header when the file is finalized.
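
A simplified sketch of just the length-patching part (it leaves out the interaction with the provisional data chunk header, and the names are made up):

use std::io::{self, Seek, SeekFrom, Write};

/// Writes a single chunk: emits a placeholder header up front, counts the
/// bytes written to the body, and patches the size field on drop.
struct ChunkWriter<'a, W: Write + Seek + 'a> {
    inner: &'a mut W,
    header_pos: u64,
    written: u32,
}

impl<'a, W: Write + Seek> ChunkWriter<'a, W> {
    fn new(inner: &'a mut W, name: [u8; 4]) -> io::Result<ChunkWriter<'a, W>> {
        let header_pos = inner.seek(SeekFrom::Current(0))?;
        inner.write_all(&name)?;
        inner.write_all(&[0, 0, 0, 0])?; // size field, patched on drop
        Ok(ChunkWriter { inner: inner, header_pos: header_pos, written: 0 })
    }
}

impl<'a, W: Write + Seek> Write for ChunkWriter<'a, W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.written += n as u32;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

impl<'a, W: Write + Seek> Drop for ChunkWriter<'a, W> {
    fn drop(&mut self) {
        // Pad an odd-sized body, patch the size field (little endian, body
        // size excluding padding), then return to the end of the chunk.
        // Errors cannot be reported from drop, so a real implementation
        // would likely offer an explicit finalize as well.
        let _ = (|| -> io::Result<()> {
            if self.written % 2 == 1 {
                self.inner.write_all(&[0])?;
            }
            let end = self.inner.seek(SeekFrom::Current(0))?;
            self.inner.seek(SeekFrom::Start(self.header_pos + 4))?;
            let s = self.written;
            self.inner.write_all(&[s as u8, (s >> 8) as u8, (s >> 16) as u8, (s >> 24) as u8])?;
            self.inner.seek(SeekFrom::Start(end))?;
            Ok(())
        })();
    }
}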

What do you think?

By the way, I think there is a tiny bug in the skip_bytes call while looking for the data chunk. As far as I understand it, RIFF chunks must be aligned to even offsets, so there might be a padding zero byte... This patch actually fixes it.

Thanks for pointing this out, looks like a bug indeed. Do you happen to have a small (< 100 byte) test sample that currently fails to read because of this? It would be nice to add that as a regression test.

@kali (Contributor, Author) commented Aug 29, 2018

  1. I like your suggestions, and actually gave a ChunksIterator/ChunkReader impl a shot this morning. The reader looks straightforward, but the Iterator is a bit tricky: the ChunkReader needs to access the original Read, which requires borrowing the WavReader, either directly or through the iterator.

Here is a gist of the issue https://play.rust-lang.org/?gist=861c43a989f7b01ebef5aafcf466d274&version=stable&mode=debug&edition=2015, and a conversation that I think relates to the problem.

I think an alternative is to drop the Iterator trait and have a next_chunk() method instead... Something like this:

https://play.rust-lang.org/?gist=5630fe68b24d77f8c836fde21665f550&version=stable&mode=debug&edition=2015

  2. If we go for this more extensive support, it is a bit sad that chunks before "fmt " cannot be read. We could fix this by making spec() and all derived methods lazy (the WavReader.spec field becoming an Option), at the price of an API change: these methods could now trigger the spec reading and decoding, and could therefore fail (so they would need to return a Result<_>). The same remark applies to the trailing chunks.

  3. I don't foresee any issue of the same kind with the plan for writing.

  4. For the odd-length decoding issue, I can artificially craft a wav with a custom chunk, but custom chunks are not that common, so I don't happen to have one that organically shows the issue... Would a synthetic wave be satisfactory for a unit test?

@ruuda (Owner) commented Aug 30, 2018

Here is a gist of the issue https://play.rust-lang.org/?gist=861c43a989f7b01ebef5aafcf466d274&version=stable&mode=debug&edition=2015, and a conversation that I think relates to the problem.

Aaah, that’s right. Now that you mention it, I remember hitting the same issue in Claxon. The root cause is that in std::Iterator, the return type of next(&'a mut self) cannot refer to 'a. I think your solution of not implementing Iterator but having a next method nonetheless is the best we can do. I would keep the interface as similar to that of std::Iterator as possible.
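
Condensed, the problem is this (type names are just for illustration):

use std::io::Read;

struct Chunk<'a, R: Read + 'a> {
    reader: &'a mut R,
}

struct ChunksReader<R: Read> {
    reader: R,
}

impl<R: Read> ChunksReader<R> {
    // This works: the returned Chunk borrows self for as long as it lives.
    fn next_chunk<'a>(&'a mut self) -> Option<Chunk<'a, R>> {
        // ... parse the next chunk header from self.reader here ...
        Some(Chunk { reader: &mut self.reader })
    }
}

// This cannot be written: Iterator::next(&mut self) -> Option<Self::Item>
// gives Item no way to name the lifetime of that particular &mut self
// borrow, so Item cannot be Chunk<'_, R>.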

If we go for this more extensive support, it is a bit sad that chunks before "fmt " could not be read.

That's a shame indeed. I somehow thought that the fmt chunk must be the first one, but according to this document the only requirement is that it must precede the DATA chunk. And in fact there are test samples in the repository where there are other chunks before the fmt chunk. Making spec() lazy would be inconvenient because of the Result.

An alternative would be to expose ChunkReader directly and to put it on equal footing with WavReader. WavReader could internally use the ChunkReader, but a user could construct one manually too. To read both unknown chunks and the audio data it might be necessary to read a file twice. For now I don't think that is a big problem, and in the future it could be addressed with an extra WavReader constructor that can take over the reader mid-stream. What do you think? Would this be acceptable for your use case?

Would a synthetic wave be satisfactory for a unit test?

I would prefer to have one that occurred in the wild; we’ve observed files before that should be invalid according to Microsoft’s documentation, but some software wrote them nonetheless. I can imagine that the padding byte should be there, but de facto it is not. All test samples in the repository happen to have only even-sized chunks. According to this document (chapter 2) you are right about the padding byte though; if an artificial test file is the best we can do, then let's go with that.
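
For reference, an artificial file could be built along these lines (hypothetical helper; only the odd-sized chunk part is shown):

/// Build the start of an in-memory WAV with one odd-sized custom chunk, to
/// exercise the padding byte handling; the fmt and data chunks still need
/// to be appended before this is a valid file.
fn wav_prefix_with_odd_chunk() -> Vec<u8> {
    let mut v = Vec::new();
    v.extend_from_slice(b"RIFF");
    v.extend_from_slice(&[0, 0, 0, 0]); // RIFF size, to be patched at the end
    v.extend_from_slice(b"WAVE");
    // Custom chunk with a 3-byte body; the size field says 3, but the body
    // is followed by one zero padding byte to keep the next chunk aligned.
    v.extend_from_slice(b"test");
    v.extend_from_slice(&[3, 0, 0, 0]);
    v.extend_from_slice(b"abc\0");
    v
}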

@kali (Contributor, Author) commented Sep 3, 2018

Hey @ruuda, here is a WIP for reading arbitrary chunks, please have a look and tell me what you think. The idea is to cut WavReader in half, so there is now a ChunksReader that does most of the heavy lifting while the WavReader keeps the same easy-to-use interface.

I'm working on the writing side too, but once again, our nice big plan will not work as is. The problem here is that Drop::drop has no way to report a problem... So I'm considering other alternatives.

@kali (Contributor, Author) commented Sep 3, 2018

Here we go, @ruuda, a tentative writer in the same spirit. Tell me if it's worth finalizing or if it is too far from what you would like to see.

@ruuda (Owner) commented Sep 3, 2018

The problem here is that Drop::drop has no way to report a problem...

We already deal with this: there is WavWriter::finalize() that consumes the writer and reports any errors; a kind of explicit destructor. If it is not called, drop will still do the finalization, but drop will silently discard any errors.
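
In short, the pattern is (illustrative, not the exact WavWriter code):

use std::io;

struct Writer {
    finalized: bool,
}

impl Writer {
    /// Explicit destructor: consumes the writer and reports any error.
    fn finalize(mut self) -> io::Result<()> {
        self.finalized = true;
        self.update_header()
    }

    fn update_header(&mut self) -> io::Result<()> {
        // ... seek back, fix up sizes, flush ...
        Ok(())
    }
}

impl Drop for Writer {
    fn drop(&mut self) {
        if !self.finalized {
            // Fallback when finalize() was not called: still finalize, but
            // any error is silently discarded.
            let _ = self.update_header();
        }
    }
}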

@ruuda (Owner) left a comment

I looked at the reader part, and I like the way it is going, but I think we can make the API easier and safer to use.

I did not look at the writer part yet ... this PR is getting really big, could you maybe split the writer part into a separate PR?

src/lib.rs Outdated
@@ -327,6 +327,8 @@ pub enum Error {
IoError(io::Error),
/// Ill-formed WAVE data was encountered.
FormatError(&'static str),
/// An inconsistency in parsing chunks has been detected.
InvalidState(&'static str),
@ruuda (Owner) commented:

It looks like this error represents a programming error. I would use an assertion instead in these cases. An Error is for errors that can arise at runtime even when the library is used correctly. Users of the library should expect these errors and be prepared to deal with them. An invalid state on the other hand, is a bug. As a user of the library, there is no good way to recover from it.
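
For example (hypothetical function, not an actual hound signature):

fn read_fmt_chunk_example(at_fmt_chunk: bool, chunk_len: u32) -> Result<(), &'static str> {
    // A violated precondition is a bug in the calling code: assert on it.
    assert!(at_fmt_chunk, "read_fmt_chunk called while not positioned at an fmt chunk");

    // A malformed file is a runtime condition the caller must be able to
    // handle: return an error.
    if chunk_len < 16 {
        return Err("fmt chunk is too short");
    }
    Ok(())
}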

src/lib.rs Outdated
@@ -779,6 +787,7 @@ fn append_works_on_files() {
let len = fs::metadata("append.wav").unwrap().len();

let mut appender = WavWriter::append("append.wav").unwrap();
println!("appender: {:?} len:{}", appender.spec(), len);
@ruuda (Owner) commented:

Is this a leftover debug print?

src/lib.rs Outdated
}

#[test]
fn write_read_chunks_is_lossless_with_odd_size() {
@ruuda (Owner) commented:

Thanks for covering this case too!

Maybe we can avoid some duplication by putting the test in a loop, where the chunk body size is the loop variable. Something like

let full_contents = b"12345678";
for body_len in 0..full_contents.len() {
  let contents = &full_contents[..body_len];
  // Write and read here, using `contents` as chunk body.
}

src/read.rs Outdated
}
// If no data chunk is ever encountered, the function will return
// via one of the try! macros that return an Err on end of file.
pub fn next_chunk(&mut self) -> Result<Option<ChunkHeader>> {
@ruuda (Owner) commented:

I would call this next() to stay as close to std::iter::Iterator as possible. Also, I think we can make the API safer by returning something like

enum Chunk<'r> {
    Fmt(WavSpecEx),
    Fact(Fact),
    Data(EmbeddedReader<'r>),
    Unknown([u8; 4], EmbeddedReader<'r>),
}

This would make the API impossible to misuse. It would prevent the case of calling read_fmt_chunk while not at an fmt chunk, and it would ensure that once the borrow of self ends, the reader is positioned at the start of a chunk (if EmbeddedReader would skip to the end of the chunk on drop).

src/read.rs Outdated
self.reader
}

pub fn into_wav_reader(mut self) -> Result<WavReader<R>> {
@ruuda (Owner) commented:

I’m not sure how useful this is, because of the concerns you raised before about accessing chunks after the DATA chunk. What we could do instead is allow constructing the samples iterator from a WavSpecEx and a Reader that is assumed to be at the start of a DATA chunk. The reader could be an EmbeddedReader from Chunk::Data. Then if you want access to all chunks and still read the audio data in one pass, you can do so, albeit with a bit more manual work. And if you just want the audio, WavReader still works; if you just want the raw data of a specific chunk, ChunkReader works. What do you think?

@kali (Contributor, Author) commented Sep 4, 2018

Hey @ruuda, I moved the writer part away for now and found out we had one sample with a non-standard chunk to test with. The lossless read/write tests will come back with the writer later.

There was still one thing I could not do as you expected. I cannot use the EmbeddedReader for the Data chunk, as the next() caller (WavReader::open) would have to keep both the ChunksReader and the EmbeddedReader around, making it a self-referencing struct. So the EmbeddedReader is perfect for the Unknown case, but I had to put some state inside the ChunksReader to manage the Data state...

src/lib.rs Outdated
@@ -790,3 +790,38 @@ fn append_works_on_files() {

assert_contents("append.wav", &[11, 13, 17, 19, 23]);
}

#[cfg(test)]
macro_rules! gard {
@ruuda (Owner) commented:

Neat, I’ve wanted such a thing in a few other places too. What does gard mean though, did you mean guard?

@kali (Contributor, Author) replied:

Aha, indeed, this is a French-ism.

src/read.rs Outdated
/// how long the chunk is, in bytes
pub len: i64,
/// how much bytes remains to be read
pub remaining: i64,
@ruuda (Owner) commented:

Is there a reason these are i64 rather than u64? If avoiding a few casts is the reason I would go for u64 instead; that makes it impossible to represent invalid states.

src/read.rs Outdated
self.reader.seek(io::SeekFrom::Current(offset))
}
} else {
Err(io::Error::new(io::ErrorKind::Other, "Only relative seek is supported."))
@ruuda (Owner) commented:

I think we can also handle SeekFrom::Start and SeekFrom::End; we know the current position from the start of the chunk, so we can translate a seek to an absolute position into a relative seek on the underlying reader.
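
Something along these lines (sketch; field names are illustrative):

use std::io::{self, SeekFrom};

/// Translate a seek within the chunk into a relative offset for the
/// underlying reader, given the chunk length and the unread byte count.
fn to_relative_offset(pos: SeekFrom, len: u64, remaining: u64) -> io::Result<i64> {
    let current = (len - remaining) as i64;
    let target = match pos {
        SeekFrom::Start(offset) => offset as i64,
        SeekFrom::Current(offset) => current + offset,
        SeekFrom::End(offset) => len as i64 + offset,
    };
    if target < 0 || target > len as i64 {
        Err(io::Error::new(io::ErrorKind::Other, "seek outside of chunk"))
    } else {
        Ok(target - current)
    }
}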

src/read.rs Outdated
let current_in_chunk = self.len as i64 - self.remaining as i64;
let wanted_in_chunk = current_in_chunk as i64 + offset;
if wanted_in_chunk < 0 {
Err(io::Error::new(io::ErrorKind::Other, "Seeking befoer begin of chunk"))
@ruuda (Owner) commented:

Typo: “befoer”

///
/// This function will only read the Riff header from the file
/// in order to position the stream to the first chunk.
pub fn new(mut reader: R) -> Result<ChunksReader<R>> {
@ruuda (Owner) commented:

👍 this is a nice way to handle the header.

src/read.rs Outdated
///
/// This function will panic if it is called while the reader is not in
/// the data chunk, or if the format has not been parsed.
pub fn len(&self) -> u32 {
@ruuda (Owner) commented:

This is a bit confusing on the chunks iterator; if I saw

let mut chunks_reader = ChunkReader::new(file)?;
chunks_reader.len()

I would think len would return the number of chunks, not the number of samples in the data chunk. The best solution I can think of would be to leave off len() and duration() entirely; a user who reads chunks manually can compute them from the fmt chunk data instead.
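
For example, roughly (field names are illustrative):

/// Compute what len() and duration() would have returned, from the fmt
/// chunk fields and the size in bytes of the data chunk.
fn data_chunk_stats(data_len: u32, channels: u16, bits_per_sample: u16) -> (u32, u32) {
    let bytes_per_sample = (bits_per_sample as u32 + 7) / 8;
    let num_samples = data_len / bytes_per_sample; // across all channels
    let duration = num_samples / channels as u32;  // per channel, in samples
    (num_samples, duration)
}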

src/read.rs Outdated
}
let mut kind_str = [0; 4];
if let Err(_) = self.reader.read_into(&mut kind_str) {
// assumes EOF
@ruuda (Owner) commented:

This is not correct; it might stop iteration on any kind of IO error. Unfortunately read_into does not return an io::ErrorKind::UnexpectedEof because it did not yet exist in Rust 1.4.0. I can fix it later if you like, but could you please put a TODO here?

@ruuda (Owner) commented Sep 4, 2018

So the EmbeddedReader is perfect for the Unknown case, but I had to put some state inside the ChunksReader to manage the Data state...

Hmm, I can see why now. Making DATA a special case feels a bit awkward, and it also makes me less confident that EmbeddedReader is suitable for building other things on top of. But at this point I can't think of a nicer way to make it work than your solution either.

@ruuda (Owner) commented Sep 5, 2018

Looks good now! I think I’ll go ahead and merge this to master, and I’ll make another pass over it once I get around to fixing up metadata reading in Claxon, to make them consistent, and to get some confidence that the API is right, if it works for both flac and wav. Is that okay with you?

@kali (Contributor, Author) commented Sep 5, 2018

Happy to hear that. I'm in no rush at all, but do you want me to give a second go at the writer while I'm on the topic?

@ruuda (Owner) commented Sep 10, 2018

do you want me to give a second go at the writer while I'm on the topic?

Sure, go ahead.

@ruuda (Owner) commented Sep 10, 2018

Merged as 0db8127. Thanks a lot!

@ruuda closed this Sep 10, 2018
@ruuda mentioned this pull request Sep 11, 2018