gzip detected in WARCs when server sends gzip-encoded responses? #57
Hey Dragan
Indeed, you are right, the "unpacked" data is following the gzipped ones, like this:

```yaml
---
filename : 'despens-mbcbftworig-20150912122950303-00000-8-0431e4b3640d.warc.gz/20150912123009/http:/www.teleportacia.org/war/103.htm'
filesize : 174
modified : 2015-09-12T12:30:09Z
errors   :
sha1     : 282242ead3258b719610e575cedb74118cda538c
matches  :
  - id      : pronom
    puid    : x-fmt/266
    format  : 'GZIP Format'
    version :
    mime    : 'application/x-gzip'
    basis   : 'byte match at 0, 3'
    warning : 'extension mismatch; MIME mismatch'
---
filename : 'despens-mbcbftworig-20150912122950303-00000-8-0431e4b3640d.warc.gz/20150912123009/http:/www.teleportacia.org/war/103.htm/103'
filesize : 237
modified : 1970-01-01T01:00:00+01:00
errors   :
sha1     : 9c5f3015c081ff5feb94aee0fdd50f77fe3a1628
matches  :
  - id      : pronom
    puid    : fmt/96
    format  : 'Hypertext Markup Language'
    version :
    mime    : 'text/html'
    basis   : 'byte match at [[[10 5]] [[230 7]]] (signature 1/2)'
    warning : 'extension mismatch'
```
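The "byte match at 0, 3" basis in the GZIP match above is just the three gzip magic bytes at the start of the stored payload, which is why a content-encoded response identifies as GZIP before anything else is considered. A minimal Python sketch of what sf is seeing (the HTML body here is made up for illustration, not taken from the WARC):

```python
import gzip

# Hypothetical payload: an HTML body stored gzip-encoded, as a server
# responding with "Content-Encoding: gzip" would have sent it.
html = b"<html><head><title>war</title></head><body>103</body></html>"
payload = gzip.compress(html)

# The gzip magic: 0x1f 0x8b, then 0x08 (deflate) -- three bytes at
# offset 0, matching the "byte match at 0, 3" basis reported by sf.
assert payload[:3] == b"\x1f\x8b\x08"

# Transparently decoding the encoding recovers the real format:
decoded = gzip.decompress(payload)
print(decoded[:6])  # b'<html>'
```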
Thanks Dragan. The "/103" business reveals a separate bug in the gzip decompressor routine that I can look into: it is trying to construct a child path by just stripping the trailing extension and appending it, but that is clearly the wrong behaviour here. Re. the main issue, there seem to be two options: 1) leave things as they are and report the gzip as the format; or 2) transparently decode encoded responses and identify the decoded payload.
Re. 2): transparently decoding Transfer-Encoding (chunked or compression) is probably the right thing to do anyway. For Content-Encoding it is perhaps a bit more debatable (you might argue that gzip is a file's true format when it is content-encoded, since that is how it is theoretically stored and not just an ephemeral transfer thing), but in practice decoding is probably what users would expect. I'm happy to do either but am leaning towards option (2). What do you think?
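Option (2) can be pictured as a small pre-identification pass that strips the transfer and content encodings first. This is only a sketch under simplified assumptions, not sf's actual code: `decode_http_body` is a hypothetical helper, headers are reduced to a plain dict, and only chunked framing and gzip are handled.

```python
import gzip

def decode_http_body(headers: dict, body: bytes) -> bytes:
    """Undo transfer/content encodings before format identification.
    Sketch only: real WARC records need proper HTTP header parsing."""
    if headers.get("Transfer-Encoding", "").lower() == "chunked":
        out, rest = b"", body
        while rest:
            size_line, _, rest = rest.partition(b"\r\n")
            size = int(size_line, 16)      # chunk size is hexadecimal
            if size == 0:                  # terminating zero-length chunk
                break
            out += rest[:size]
            rest = rest[size + 2:]         # skip the chunk's trailing CRLF
        body = out
    if headers.get("Content-Encoding", "").lower() == "gzip":
        body = gzip.decompress(body)
    return body

# Chunked framing example in the style of RFC 2616 §3.6.1:
chunked = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
print(decode_http_body({"Transfer-Encoding": "chunked"}, chunked))  # b'Wikipedia'
```

The decoded bytes, rather than the raw record payload, would then be what the signature matching runs against.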
I also think 2) is the better solution. I rather see GZIP as a feature of the transfer protocol that was used. Making this a top-level item would be akin to noting that a file was read from, for example, an NTFS volume, since you need a file system driver that can actually handle NTFS, and only then going deeper into what the file is actually made of. It would be correct, but not practical, creating lots of redundant information that other points in the toolchain would need to detect and skip. If I want to get the data out of the WARC, I will need a tool that can handle WARCs. That should be a given.
I was hoping I'd wake up and this would be cleanly settled! There was some useful commentary on twitter, and Andy Jackson linked to this gist: https://gist.github.com/anjackson/48308ecab5f954218d4b which is an option 3): ID both the GZIP and the decoded payload, but also ID the parent WARC record and, if there were any transfer encodings, ID those too (i.e. not just jump to the payload as sf does presently).
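Option 3) amounts to a nested identification result rather than a flat one. A toy sketch of that shape, with everything hypothetical: `identify` is not sf's API, and only gzip plus a crude HTML sniff are modelled (a real tool would also wrap the WARC record and any transfer encodings as further layers).

```python
import gzip

def identify(data: bytes) -> dict:
    """Toy nested identification: report the container *and* recurse
    into its decoded payload, in the spirit of option (3)."""
    if data[:2] == b"\x1f\x8b":            # gzip magic bytes
        return {"format": "GZIP Format (x-fmt/266)",
                "payload": identify(gzip.decompress(data))}
    if data.lstrip().lower().startswith((b"<html", b"<!doctype html")):
        return {"format": "Hypertext Markup Language (fmt/96)"}
    return {"format": "unknown"}

result = identify(gzip.compress(b"<html><body>103</body></html>"))
print(result["format"])              # GZIP Format (x-fmt/266)
print(result["payload"]["format"])   # Hypertext Markup Language (fmt/96)
```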
This is getting very interesting 😺

My take on this is not what the "ideal WARC ID result" is, as mentioned by Andy Jackson above, but rather what would enable certain useful activities. RFC 2616 §3.5 and §14.11 are, I guess, enough information on what can happen with content codings.

My idea, when requesting the analysis of WARCs in the beginning, was to enable for web archives what is already possible with CD-ROMs: selecting a suitable emulation environment for re-enactment, based on the types of data contained on an ISO. Now, WARC analysis could provide the insight to select emulation environments sporting the right browsers and plugins best suited for a web archive. If there were browsers actually in use that support all kinds of HTML5 trickery but no gzip encoding, storing that information would be meaningful to me. Since this isn't the case, I regard this information as rather academic and even a burden, since I will need to weed it out before getting to the data that is actually interesting, again, to me.

I believe that tools to get data out of WARCs (as in, making files from WARCs) will not be built based on the information of a data type analysis, but rather on the standards defining HTTP and WARC. Tools for the re-enactment of WARCs already exist; these are browsers and, for example, pywb.

Maybe applying PRONOM, which is about "file formats", to WARCs is a bit of a stretch, since WARCs aren't file containers but rather a record of an activity that was performed on the web. Anyway, PRONOM is delivering applicable results in this case, so why not do it? Also, "files" look like they are developing in this direction on a basic technical level, with btrfs or zfs recording time-based snapshots, or services like Dropbox or ownCloud keeping multiple versions of the "same" file; time- and performance-based analysis is coming to your plain old files soon 😀, so WARCs seem like a good case to think this through.
So the twitter vote closed and it favoured option (2); I think we'll go with that. Thanks to everyone who gave input on this one. This will be in the next point release, which I hope will be out in a fortnight or so.
Hi Dragan. If this is too hard, and you can share a small WARC that has some interesting encodings within, I can do the testing myself and send you the results. Once this is tested I'll merge to the master branch and issue the next sf release.
I believe this is fixed with the 1.4.2 release. Please reopen if necessary.
When analysing WARC files, Siegfried often returns the GZIP Format for HTML responses. Indeed this file is served with gzip compression by the server and was recorded that way. But the information that the server response was gzipped is, I think, not accurate, since it is not the actual data format, it is just the encoding. The `Content-Type` correctly points out this is `text/html`. I guess Siegfried would need to un-gzip any response that is encoded via gzip and do an analysis on that. Otherwise any `text/*` response, which could be JSON, HTML, XML, etc, could be showing up as GZIP.