
gzip detected in WARCs when server sends gzip-encoded responses? #57

Closed
despens opened this issue Nov 8, 2015 · 9 comments

@despens
despens commented Nov 8, 2015

When analysing WARC files, Siegfried often returns the GZIP Format for HTML responses:

---
filename : 'despens-mbcbftworig-20150912122950303-00000-8-0431e4b3640d.warc.gz/20150912123008/http:/www.teleportacia.org/war/warb.htm'
filesize : 100
modified : 2015-09-12T12:30:08Z
errors   : 
sha1     : e5f8bcea337f2e66904e88db6a22ff7b1943700e
matches  :
  - id      : pronom
    puid    : x-fmt/266
    format  : 'GZIP Format'
    version : 
    mime    : 'application/x-gzip'
    basis   : 'byte match at 0, 3'
    warning : 'extension mismatch; MIME mismatch'

Indeed this file is served with gzip compression by the server and was recorded that way:

$ curl -I -H 'Accept-Encoding: gzip,deflate' http://www.teleportacia.org/war/warb.htm
HTTP/1.1 200 OK
Date: Sun, 08 Nov 2015 07:30:43 GMT
Server: Apache/2.2.22 (Ubuntu)
Last-Modified: Tue, 20 Jan 2015 06:04:19 GMT
ETag: "81061-83-50d0f35a17ec0"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 100
Content-Type: text/html

But the information that the server response was gzipped is, I think, not accurate: gzip is not the actual data format here, just the encoding. The Content-Type correctly points out that this is text/html. I guess Siegfried would need to un-gzip any response that is encoded via gzip and run its analysis on that. Otherwise any text/* response (which could be JSON, HTML, XML, etc.) could show up as GZIP.
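To illustrate why this happens: a signature-based identifier inspects the raw bytes, and a response served with Content-Encoding: gzip begins with the gzip magic number no matter what it contains. A minimal Python sketch (the sample HTML is invented, and this is not how Siegfried itself is implemented):

```python
import gzip

html = b"<!DOCTYPE html><html><body>hello</body></html>"
payload = gzip.compress(html)  # what a server sends with Content-Encoding: gzip

# A signature-based identifier sees only the gzip magic bytes, not the HTML:
assert payload[:2] == b"\x1f\x8b"  # gzip magic number
assert payload[2] == 8             # compression method: deflate

# Decoding the Content-Encoding first exposes the real format:
decoded = gzip.decompress(payload)
assert decoded.startswith(b"<!DOCTYPE html")
```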

@richardlehane
Owner

Hey Dragan
when sf is running in -z mode it should just keep recursively descending whenever it encounters an archive format... so I would have expected that if a gzip (or zip or tar) was detected within a warc, it would already try to unpack it and give you the results for the unpacked stream (although the MIME wouldn't get passed along, and this might be something to tweak). Is the result immediately following the one in your report for the ungzipped stream?

@despens
Author

despens commented Nov 8, 2015

Indeed, you are right: the "unpacked" data follows the gzipped one, like this:

---
filename : 'despens-mbcbftworig-20150912122950303-00000-8-0431e4b3640d.warc.gz/20150912123009/http:/www.teleportacia.org/war/103.htm'
filesize : 174
modified : 2015-09-12T12:30:09Z
errors   : 
sha1     : 282242ead3258b719610e575cedb74118cda538c
matches  :
  - id      : pronom
    puid    : x-fmt/266
    format  : 'GZIP Format'
    version : 
    mime    : 'application/x-gzip'
    basis   : 'byte match at 0, 3'
    warning : 'extension mismatch; MIME mismatch'
---
filename : 'despens-mbcbftworig-20150912122950303-00000-8-0431e4b3640d.warc.gz/20150912123009/http:/www.teleportacia.org/war/103.htm/103'
filesize : 237
modified : 1970-01-01T01:00:00+01:00
errors   : 
sha1     : 9c5f3015c081ff5feb94aee0fdd50f77fe3a1628
matches  :
  - id      : pronom
    puid    : fmt/96
    format  : 'Hypertext Markup Language'
    version : 
    mime    : 'text/html'
    basis   : 'byte match at [[[10 5]] [[230 7]]] (signature 1/2)'
    warning : 'extension mismatch'

However, the URL http:/www.teleportacia.org/war/103.htm/103 doesn't exist in the WARC; it seems to be a stand-in for the decompressed data. This means that the "extension mismatch" is not actually happening, and the data, should I want to extract it from the WARC, would not be accessible via the filename property; I'd have to guess the actual URL.

I think that if the Content-Encoding header is set, Siegfried should perhaps note that in a separate property field and otherwise ignore it?

Here is the relevant excerpt from the WARC:

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:278abefe-7ed6-46ef-a135-97744e94fc82>
WARC-Date: 2015-09-12T12:30:09Z
WARC-Target-URI: http://www.teleportacia.org/war/103.htm
WARC-IP-Address: 185.21.100.12
Content-Type: application/http;msgtype=response
Content-Length: 455
WARC-Block-Digest: sha1:efcb633605b42fe39c50265ecb393efc92999048
WARC-Payload-Digest: sha1:282242ead3258b719610e575cedb74118cda538c

HTTP/1.1 200 OK
Date: Sat, 12 Sep 2015 12:52:30 GMT
Server: Apache/2.2.22 (Ubuntu)
Last-Modified: Tue, 20 Jan 2015 06:04:19 GMT
ETag: "81063-ed-50d0f35a17ec0"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 174
Content-Type: text/html

The Content-Type is clearly present, but that gets lost, along with the original URL, when descending into the gzipped blob.
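The fix under discussion could, in outline, work like this: split the stored HTTP message into head and payload, read the declared headers, and decompress when Content-Encoding: gzip is present, so identification runs on the real payload while the declared Content-Type travels along with it. A rough Python sketch with a fabricated response (Siegfried's actual implementation lives in the Go webarchive package):

```python
import gzip

# Hypothetical raw HTTP response block as stored in a WARC payload
body = gzip.compress(b"<html><body>war</body></html>")
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Encoding: gzip\r\n"
       b"Content-Type: text/html\r\n"
       b"\r\n") + body

# Split the message head from the payload at the blank line
head, _, payload = raw.partition(b"\r\n\r\n")
headers = dict(line.split(b": ", 1) for line in head.split(b"\r\n")[1:])

# Option (2) from the discussion: decode before identification,
# keeping the declared Content-Type alongside the decoded payload.
if headers.get(b"Content-Encoding") == b"gzip":
    payload = gzip.decompress(payload)

assert headers[b"Content-Type"] == b"text/html"
assert payload.startswith(b"<html>")
```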

@richardlehane
Owner

Thanks Dragan.

The "/103" business reveals a separate bug in the gzip decompressor routine that I can look into: it constructs a child path by taking the file name, stripping the trailing extension, and appending the result as a child of the original path, which is clearly the wrong behaviour here.

Re. the main issue, there seem to be two options:

  1. present two results: one for the gzip-encoded response, one for the decoded response. Both would have the same "filename", and the Content-Type MIME would get passed on to the decoded response.
  2. present a single result just for the decoded response. This would mean changing the "github.com/richardlehane/webarchive" package so that the NextPayload() method checks Content-Encoding and Transfer-Encoding fields and, if either are present, decodes the response before passing it on.

Re. 2): transparently decoding Transfer-Encoding (chunked or compression) is probably the right thing to do anyway. For Content-Encoding, it is perhaps a bit more debatable (you might argue that gzip is a file's true format if it is content-encoded since that is how it is theoretically stored and not just an ephemeral transfer thing) but in practice it is probably what users would expect.
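For reference, transparently decoding Transfer-Encoding: chunked amounts to reading hex-length-prefixed chunks until a zero-length chunk appears. A minimal sketch (no trailer support; the example data is invented, and this is not the webarchive package's code):

```python
def dechunk(data: bytes) -> bytes:
    """Minimal decoder for Transfer-Encoding: chunked (ignores trailers)."""
    out = bytearray()
    while True:
        size_line, _, data = data.partition(b"\r\n")
        size = int(size_line.split(b";")[0], 16)  # chunk size is hex, may carry extensions
        if size == 0:
            return bytes(out)          # zero-length chunk terminates the body
        out += data[:size]
        data = data[size + 2:]         # skip chunk body plus its trailing CRLF

assert dechunk(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n") == b"Wikipedia"
```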

I'm happy to do either but am leaning towards option (2). What do you think?

@richardlehane richardlehane self-assigned this Nov 8, 2015
@richardlehane richardlehane added this to the v1.4.2 milestone Nov 8, 2015
@despens
Author

despens commented Nov 8, 2015

I also think 2) is the better solution. I rather see GZIP as a feature of the transfer protocol that was used. Making it a top-level item would be akin to noting that a file was read from, for example, an NTFS volume, since you need a file system driver that can actually handle NTFS before you can go deeper into what the file is actually made of. It would be correct, but not practical, creating lots of redundant information that other points in the toolchain would need to detect and skip.

If I want to get the data out of the WARC, I will need a tool that can handle WARCs. That should be a given.

@richardlehane
Owner

I was hoping I'd wake up and find this cleanly settled!

There has been some useful commentary on twitter, and Andy Jackson linked to this: https://gist.github.com/anjackson/48308ecab5f954218d4b

which is an option 3): ID both the GZIP and the decoded payload, but also ID the parent WARC record and, if there was any transfer encoding, ID that too (i.e. not just jump to the payload as sf does presently).

@despens
Author

despens commented Nov 9, 2015

(screenshot: selection_176)

This is getting very interesting 😺

My take on this is not about what the "ideal WARC ID result" is, as mentioned by Andy Jackson above, but rather about what would enable certain useful activities. RFC 2616 §3.5 and §14.11 are, I guess, enough information on what can happen with Content-Encoding. Including that info in the Siegfried analysis would make Siegfried into another tool, one concerned with the structures of data and anything they can be contained in. This can go from complicated TIFF files to tarballs that contain ZIPs that contain an ISO in Joliet format, and so forth. The question is rather at what level it is meaningful.

My idea in requesting WARC analysis in the beginning was to enable for web archives what is already possible with CD-ROMs: selecting a suitable emulation environment for re-enactment, based on the types of data contained on an ISO. Now, WARC analysis could provide the insight needed to select emulation environments sporting the right browsers and plugins best suited for a web archive.

If there were browsers actually in use that support all kinds of HTML5 trickery but no gzip encoding, storing that information would be meaningful to me. Since this isn't the case, I regard this information as rather academic and even a burden, since I would need to weed it out before getting to the data that is actually interesting, again, to me.

I believe that tools to get data out of WARCs (as in, making files from WARCs) will be built not on the results of a data type analysis, but on the standards defining HTTP and WARC. Tools for the re-enactment of WARCs already exist; these are browsers and, for example, pywb.


Maybe applying PRONOM, which is about "file formats", to WARCs is a bit of a stretch, since WARCs aren't file containers but rather a record of an activity that was performed on the web. Anyway, PRONOM delivers applicable results in this case, so why not do it? Also, "files" look like they are developing in this direction on a basic technical level, with btrfs or zfs recording time-based snapshots, or services like Dropbox or ownCloud keeping multiple versions of the "same" file; time- and performance-based analysis is coming to your plain old files soon 😀. WARCs seem like a good case to think this through.

@richardlehane
Owner

The twitter vote has closed and it favoured option (2), so I think we'll go with that. Thanks to everyone who gave input on this one. This will be in the next point release, which I hope will be out in a fortnight or so.

@richardlehane
Owner

Hi Dragan
I've had a go at implementing a fix for this. Could you please test it when you get a chance? (I'm having a hard time finding WARCs with examples of transfer and content encoding.)
The fix is on the "develop" branch. Build steps are:

git clone the repository
git checkout develop
go install -a github.com/richardlehane/siegfried/cmd/sf

(You need the -a flag to force a refetch and rebuild of the webarchive repository, where most of the changes were made.)

If this is too hard and you can share a small WARC that has some interesting encodings within it, I can do the testing myself and send you the results.

Once this is tested I'll merge to the master branch and issue the next sf release.
thanks!
Richard

@richardlehane
Owner

I believe this is fixed with the 1.4.2 release. Please reopen if necessary.
