Premature EOF on WARC files #41
Is there an easy workaround? I don't mind writing extra code on my end, but I don't know any other way to reliably check for EOF, since the `GzDecoder` takes ownership of its underlying reader, so I can't keep a borrowed reference to the `File` to check `eof()`. I suppose for my use-case I could get rid of flate entirely and pipe the output of …
You may be able to do something like:
I haven't tested this out yet, though, so it may not work :(
To clarify, this is what I would expect:

```rust
extern crate flate2;

use std::io::prelude::*;
use std::io;

fn main() {
    let mut v = Vec::new();
    flate2::write::GzEncoder::new(&mut v, flate2::Compression::Best)
        .write_all(b"foo")
        .unwrap();
    flate2::write::GzEncoder::new(&mut v, flate2::Compression::Best)
        .write_all(b"bar")
        .unwrap();

    let mut data = &v[..];
    io::copy(&mut flate2::bufread::GzDecoder::new(&mut data).unwrap(),
             &mut io::stdout()).unwrap();
    io::copy(&mut flate2::bufread::GzDecoder::new(&mut data).unwrap(),
             &mut io::stdout()).unwrap();
}
```

It's crucial that you use `flate2::bufread::GzDecoder`, since the `bufread` decoder only consumes the bytes of the member it is decoding and leaves the rest in the buffer.
I can confirm that this works. Concatenated gzip members are quite common in bioinformatics as well. The BGZF standard uses this in combination with an index to allow random access into files and concurrent processing. The members are quite small (at most 64 KiB of uncompressed data). Performance seems fine; time to process a file is comparable. Partial example that uses the fastq reader:

```rust
let mut reader = BufReader::new(file);
let mut r = fastq::Record::new();
loop {
    // Loop over all possible gzip members.
    match reader.fill_buf() {
        Ok(b) => if b.is_empty() { break },
        Err(e) => panic!("{}", e),
    }
    // Decode the next member.
    let gz = flate2::bufread::GzDecoder::new(&mut reader).unwrap();
    let mut fqreader = fastq::Reader::new(gz);
    // Loop over all records in this member.
    loop {
        match fqreader.read(&mut r) {
            Ok(()) => {
                if r.is_empty() {
                    // Current gz member finished; more to decode?
                    break;
                }
            }
            Err(err) => panic!("{}", err),
        }
        // Do stuff with the record here.
    }
}
```
Thanks for looking into this. I've been working on other parts of my system lately, but I'll implement this solution when I return to the Rust code.
When parsing part of the CommonCrawl corpus (which consists of ~1G WARC files where each record is individually compressed), flate2 will return EOF after the first chunk has been decompressed rather than continuing to read the rest of the file. Sample data:
s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701166570.91/warc/CC-MAIN-20160205193926-00310-ip-10-236-182-209.ec2.internal.warc.gz
(Downloadable with 'aws s3 cp s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701166570.91/warc/CC-MAIN-20160205193926-00310-ip-10-236-182-209.ec2.internal.warc.gz' if you have the AWS CLI installed. 837M, no fees.)
Sample code:
Arguably the files shouldn't do this, but the WARC spec recommends record-at-a-time compression, and it's pretty common practice in the Hadoop world to operate on big files that are the concatenation of individually-gzipped records so that Hadoop can split the input without reading it. gunzip/gzcat can read it, and re-compressing it with gzip allows flate2 to as well. Given that these files exist, maybe flate2 could avoid returning EOF until the underlying stream does, instead returning the stream of decompressed bytes from the next record?
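The gunzip behaviour mentioned above is easy to reproduce: RFC 1952 explicitly allows a gzip file to contain multiple members, and gzip's own tooling decompresses them all. A quick sketch at the shell (`multi.gz` is a throwaway file name):

```shell
# Build a file of two independently-gzipped members, then
# decompress it in one go.
printf 'foo' | gzip >  multi.gz
printf 'bar' | gzip >> multi.gz
gunzip -c multi.gz   # prints "foobar"
```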