non-streaming interface would be useful #64

Open · wumpus opened this issue Jan 25, 2019 · 6 comments

@wumpus (Collaborator) commented Jan 25, 2019

Right now the only interface for getting at the record content is record.content_stream().read(), which is streaming. I can't do that twice. So if I'm passing a record around in a program and want to access the record content in multiple places, I've ended up wrapping warcio's record with a class that has a .content() method.

That seems odd. Other packages like Requests offer both streaming and non-streaming interfaces.

Obviously we'd want to preserve streaming behavior -- pure streaming code should continue not to buffer all of the content in memory. One way to do that would be to save all of the content in memory only if .content() is called before .content_stream().read(), and to make calling .content() after content_stream().read() raise an exception.
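
A minimal sketch of those semantics, assuming the cache-before-stream / raise-after-stream rule described above (CachingRecord and its internals are illustrative names, not warcio API):

    import io

    class CachingRecord:
        # Illustrative wrapper around a warcio record; not part of warcio.
        def __init__(self, record):
            self._record = record
            self._content = None
            self._streamed = False

        def content(self):
            if self._content is None:
                if self._streamed:
                    # content_stream() was already consumed; buffering now would
                    # return truncated data, so fail loudly instead.
                    raise RuntimeError('content() called after content_stream()')
                self._content = self._record.content_stream().read()
            return self._content

        def content_stream(self):
            if self._content is not None:
                return io.BytesIO(self._content)  # serve from the cache
            self._streamed = True
            return self._record.content_stream()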

@ikreymer (Member) commented:

The tool evolved from the need to support the streaming use case specifically, but I definitely see how this can be useful for certain situations. Then again, it might make it too easy to just store a multi-GB record in memory, which right now you'd have to do manually.
Since it would be explicit with a .content() method, I guess that can work as it only happens when requested. Sure, if you'd like to make a PR for this, we can add it!

@wumpus (Collaborator, Author) commented Jan 30, 2019

I actually made a mistake in "warcio check" with the streaming interface! Triggering a digest check requires reading all of the payload, and I did that with

    record.content_stream().read()

Two reviewers didn't spot it :-)

(Yes, I'll fix it.)
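
For reference, one way to trigger the digest check without buffering the whole payload is to drain the stream in chunks; a minimal sketch (the chunk size is an arbitrary choice):

    # Drain the decoded payload in fixed-size chunks; digest checking
    # happens as the stream is consumed, so no full buffer is needed.
    stream = record.content_stream()
    while stream.read(65536):
        pass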

@JustAnotherArchivist (Contributor) commented Jan 5, 2021

I just got bitten by this as well.

Another related issue is that it isn't possible to read both the raw stream and the decoded/content stream for a record.

Here's how I worked around it for my use case, starting with a record returned from iterating over a WARCIterator:

    import io
    import warcio.recordloader

    # Buffer the raw (still encoded) payload so it can be re-read.
    rawPayload = io.BytesIO(record.raw_stream.read())
    if record.http_headers:
        # Rebuild a record around the buffered payload and decode it.
        recordCopy = warcio.recordloader.ArcWarcRecord(
            record.format, record.rec_type, record.rec_headers, rawPayload,
            record.http_headers, record.content_type, record.length)
        decodedPayload = io.BytesIO(recordCopy.content_stream().read())
        rawPayload.seek(0)
    else:
        decodedPayload = rawPayload

With the obvious caveats that the two streams might be the same object (i.e. share a stream position) and that you need enough RAM to keep everything in memory a couple of times over if it gets decoded.
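
For example, with the two buffers above you can digest the raw bytes and still read the decoded body (hashlib.sha1 is just an illustrative choice here):

    import hashlib

    raw_digest = hashlib.sha1(rawPayload.read()).hexdigest()
    rawPayload.seek(0)           # rewind so the raw bytes stay reusable
    body = decodedPayload.read()
    decodedPayload.seek(0)       # same for the decoded payload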

@dlazesz commented Dec 9, 2021

Can we expect any improvement on this topic in the future?

This is still a major headache. The worst case is when I want to pass around several selected records that are stored at distant points in the WARC file: I get zlib.error: Error -3 while decompressing data: incorrect data check errors when the payload slips out of the cache.

My workaround is the following:

    from warcio.archiveiterator import ArchiveIterator

    stream = open('my_stuff.warc.gz', 'rb')
    archive_it = ArchiveIterator(stream)
    my_rec = next(archive_it)
    # Encode: keep only the stream and the record's offset.
    my_rec_pos = archive_it.get_record_offset()
    pass_this_around = (stream, my_rec_pos)
    # ...
    # Decode: seek back and re-parse the record from the saved offset.
    stream, my_rec_pos = pass_this_around
    stream.seek(my_rec_pos)
    my_rec = next(iter(ArchiveIterator(stream)))

IMO this could be a quick-and-dirty internal workaround as well, as it could solve most of the issues with the streaming API.
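
If it were made internal, it might look something like this (load_record_at is a hypothetical helper name, not existing warcio API):

    from warcio.archiveiterator import ArchiveIterator

    def load_record_at(stream, offset):
        # Hypothetical helper: re-parse a record from a seekable stream
        # at a previously saved offset.
        stream.seek(offset)
        return next(iter(ArchiveIterator(stream)))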

@wumpus (Collaborator, Author) commented Dec 9, 2021

@dlazesz your solution does not work in a streaming environment. The solution at the top works for both streaming and non-streaming. I think your solution also breaks the checksum code. That said, if it works for you, do use it!

@ikreymer reasonably suggested that the interface I describe at the top, with a new record.content interface, be even more explicit, such as having to pass a kwarg to ArchiveIterator(), something like cache_content_in_memory with a default of False. And it could only cache in memory if the stream is unseekable.
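
Usage might look roughly like this (both the cache_content_in_memory kwarg and record.content() are proposals from this thread, not existing warcio API):

    from warcio.archiveiterator import ArchiveIterator

    # Hypothetical -- neither the kwarg nor .content() exists in warcio today.
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream, cache_content_in_memory=True):
            body = record.content()        # buffered on first call
            body_again = record.content()  # safe to call repeatedly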

I think there's a way to preserve checksums after a seek, if it's a seek back to the start of the content -- the checksums can be reset.

Anyway, the main reason this isn't already in the code is lack of time on my part! I am glad to hear that more people are using the code in interesting ways, which is how we find these problems.

@dlazesz commented Dec 10, 2021

> @dlazesz your solution does not work in a streaming environment. The solution at the top works for both streaming and non-streaming. I think your solution also breaks the checksum code. That said, if it works for you, do use it!

The checksum is fine after the seeking and copying, at least for the request and response records I've tested.

If the stream is seekable -- which IMO is the case most of the time -- my solution works. In the remaining cases you have to fall back to caching the content, e.g. with a .content() method. Anyhow, these two cases should be handled separately.
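
A sketch of that separation, reusing the names from my earlier snippet (seekable() is standard Python io API):

    if stream.seekable():
        # Cheap handle: just the stream and the record's offset.
        handle = (stream, archive_it.get_record_offset())
    else:
        # Unseekable stream: no choice but to buffer the payload in memory.
        handle = record.content_stream().read()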

> @ikreymer reasonably suggested that the interface I describe at the top, with a new record.content interface, be even more explicit, such as having to pass a kwarg to ArchiveIterator(), something like cache_content_in_memory with a default of False. And it could only cache in memory if the stream is unseekable.

I think many issues arose from the odd behaviour of the streaming interface, which IMO should be optional and turned off by default. As ikreymer said: "The tool evolved from the need to support the streaming use case specifically."

Even if you decide the other way around, I would be happy to be able to enable a seeking mechanism for the non-streaming use cases instead of creating workarounds because of strange zlib errors.

Hope you'll have some time for this in the future! 🤞
