non-streaming interface would be useful #64
Comments
The tool evolved from the need to support the streaming use case specifically, but I definitely see how this can be useful for certain situations. Then again, it might make it too easy to just store a multi-GB record in memory, which right now you'd have to do manually.
I actually made a mistake in `warcio check` with the streaming interface! Triggering a digest check requires reading all of the payload, and I did that with
Two reviewers didn't spot it :-) (Yes, I'll fix it.)
I just got bitten by this as well. Another related issue is that it isn't possible to read both the raw stream and the decoded/content stream for a record. Here's how I worked around it for my use case:

```python
import io
import warcio.recordloader

rawPayload = io.BytesIO(record.raw_stream.read())
if record.http_headers:
    recordCopy = warcio.recordloader.ArcWarcRecord(
        record.format, record.rec_type, record.rec_headers, rawPayload,
        record.http_headers, record.content_type, record.length)
    decodedPayload = io.BytesIO(recordCopy.content_stream().read())
    rawPayload.seek(0)
else:
    decodedPayload = rawPayload
```

With the obvious caveats that the two streams might be the same object (i.e. share the stream position) and that you need enough RAM to keep everything in memory a couple of times if it gets decoded.
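For what it's worth, the workaround above can be packaged into a helper. This is only a sketch: `buffered_payloads` and its `record_cls` parameter are made-up names, the record is assumed to expose the warcio-style attributes used above, and it deliberately avoids the shared-position caveat by always handing out two separate buffers:

```python
import io

def buffered_payloads(record, record_cls):
    """Read a record's raw payload once, then derive the decoded payload.

    `record` is assumed to expose warcio-style attributes (format,
    rec_type, rec_headers, raw_stream, http_headers, content_type,
    length); `record_cls` is the class used to rebuild a copy
    (ArcWarcRecord in warcio). Returns (raw, decoded) as independent,
    rewound BytesIO objects.
    """
    raw = io.BytesIO(record.raw_stream.read())
    if record.http_headers:
        copy = record_cls(record.format, record.rec_type,
                          record.rec_headers, raw,
                          record.http_headers, record.content_type,
                          record.length)
        decoded = io.BytesIO(copy.content_stream().read())
        raw.seek(0)
    else:
        # No HTTP headers: the raw bytes are the content, but hand out
        # a second buffer so the two read positions stay independent.
        decoded = io.BytesIO(raw.getvalue())
    return raw, decoded
```

Returning two distinct buffers means consumers can read either stream without disturbing the other, at the cost of up to two in-memory copies.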
Can we expect any improvement on this topic in the future? This is still a major headache. The worst form is when I want to pass around multiple selected records which are stored at distant points in the WARC file. I'm getting strange zlib errors. My workaround is the following:

```python
stream = open('my_stuff.warc.gz', 'rb')
archive_it = ArchiveIterator(stream)
my_rec = next(archive_it)

# Encode
my_rec_pos = archive_it.get_record_offset()
pass_this_around = (stream, my_rec_pos)
# ...
# Decode
stream, my_rec_pos = pass_this_around
stream.seek(my_rec_pos)
my_rec = next(iter(ArchiveIterator(stream)))
```

IMO this could be a quick and dirty internal workaround as well, as it could solve most of the issues with the streaming API.
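The pass-the-offset pattern above generalizes to a tiny helper. A sketch only: `record_at` and `make_iterator` are made-up names, and it assumes, as the workaround does, that the stream is seekable and that a fresh iterator started at a record's offset yields that record. With warcio, `make_iterator` would be `ArchiveIterator`:

```python
def record_at(stream, offset, make_iterator):
    """Seek a shared, seekable stream to `offset` and parse one record.

    `make_iterator` builds a fresh record iterator over the stream
    (with warcio you'd pass ArchiveIterator); a new iterator is needed
    per lookup because each one keeps its own parse state.
    """
    stream.seek(offset)
    return next(iter(make_iterator(stream)))
```

This keeps only `(stream, offset)` pairs alive between lookups, so nothing is buffered until a record is actually needed again.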
@dlazesz your solution does not work in a streaming environment. The solution at the top works for both streaming and non-streaming. I think your solution also breaks the checksum code. That said, if it works for you, do use it! @ikreymer reasonably suggested the interface I describe at the top, with a new `.content()` method. I think there's a way to preserve checksums after a seek: if it's a seek back to the start of the content, the checksums can be reset. Anyway, the main reason why this isn't already in the code is lack of time on my part! I am glad to hear that more people are using the code in interesting ways, which is how we find these problems.
The checksum is fine after the seeking and copying, at least for the request and response records which I've tested. If the stream is seekable -- which IMO is the case most of the time -- my solution works. In the remaining cases you must fiddle with caching the content with e.g.
I think many issues arose from the odd behaviour of the streaming interface, which IMO should be optional and turned off by default, as @ikreymer said. Even if you decide the other way around, I would be happy to be able to enable a seeking mechanism for the non-streaming use cases instead of creating workarounds because of strange zlib errors. Hope you'll have some time for this in the future! 🤞
Right now the only interface for getting at the record content is `record.content_stream().read()`, which is streaming. I can't do that twice. So if I'm passing a record around in a program and want to access the record content in multiple places, I've ended up wrapping warcio's record with a class that has a `.content()` method. That seems odd. Other packages like Requests offer both streaming and non-streaming interfaces.
Obviously we'd want to preserve streaming behavior -- pure streaming code should continue to not buffer all of the content in memory. One way to do that would be to save all of the content in memory only if `.content()` is called before `.content_stream().read()`, and make calling `.content()` after calling `.content_stream().read()` raise an exception.
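Those semantics can be sketched as a wrapper around any record-like object. This is a hypothetical illustration, not warcio's API: `ContentRecord` is a made-up name, and the only method assumed on the wrapped record is `content_stream()`:

```python
import io

class ContentRecord:
    """Hypothetical wrapper adding a non-streaming .content() to a record.

    Proposed rule: .content() buffers the whole payload in memory, but
    only if the streaming interface hasn't been used yet; calling
    .content() after .content_stream() raises, because the stream may
    already be partially consumed and can't be replayed.
    """

    def __init__(self, record):
        self._record = record
        self._content = None      # buffered payload, once read
        self._streamed = False    # has content_stream() been handed out?

    def content_stream(self):
        if self._content is not None:
            # Already buffered: serve a replayable stream over the buffer.
            return io.BytesIO(self._content)
        self._streamed = True
        return self._record.content_stream()

    def content(self):
        if self._content is None:
            if self._streamed:
                raise RuntimeError(
                    "content() called after content_stream(); "
                    "payload may be partially consumed")
            self._content = self._record.content_stream().read()
        return self._content
```

With this rule, pure streaming code never buffers anything extra, repeated `.content()` calls return the same cached bytes, and mixing the two interfaces in the unsafe order fails loudly instead of returning truncated data.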