Proposed: support reading and sanitizing WARCs with spaces in WARC-Target-URI #80

Merged (1 commit), Jun 21, 2019

rebeccacremona

Motivation and Context

Due to an accident of history, we have an unknown number of WARCs with resource records whose WARC-Target-URI headers contain spaces (e.g., WARC-Target-URI: file:///088s3gtTLhg/source/Markey Cover_sm.jpg). We may be the only archive in the world with this problem: we produced the infelicitous WARCs ourselves when implementing initial Pywb support in Perma, back in the summer of 2015.

(Fun fact: if you use wget to retrieve a website like https://www.yellowstonenationalpark.com/wolves.htm, which includes assets like https://www.yellowstonenationalpark.com/Terraces%20- Yellowstone%20National%20Park.jpg, wget will save those assets to disk with spaces in the filename... not %20s. Who knew?)

As is the case with ARCs with space-containing URLs, the presence of these unexpected spaces causes problems in downstream projects that compile CDX/CDXJ for the WARCs. For instance, when such a WARC is uploaded to Webrecorder and played back, pywb cannot successfully create CDXObjects, because it splits on spaces.
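To illustrate the failure mode, here is a minimal sketch of a space-delimited index line breaking when the URL itself contains a space. The four-field layout and the names `FIELDS`, `build_cdx_line`, and `parse_cdx_line` are hypothetical simplifications for illustration; this is not pywb's actual parser, and real CDX lines carry more fields.

```python
# Hypothetical, simplified field layout; real CDX lines have more fields.
FIELDS = ["urlkey", "timestamp", "url", "mime"]


def build_cdx_line(urlkey, timestamp, url, mime):
    """Join the fields with single spaces, as a space-delimited index does."""
    return " ".join([urlkey, timestamp, url, mime])


def parse_cdx_line(line):
    """Split a line on spaces and map the pieces onto FIELDS.

    Raises ValueError when the number of pieces does not match, which is
    what happens when a field (here, the URL) contains a raw space.
    """
    parts = line.split(" ")
    if len(parts) != len(FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELDS), len(parts))
        )
    return dict(zip(FIELDS, parts))


raw_url = "file:///088s3gtTLhg/source/Markey Cover_sm.jpg"
encoded_url = raw_url.replace(" ", "%20")

# The percent-encoded URL survives the round trip.
ok = parse_cdx_line(
    build_cdx_line("example-key", "20150701000000", encoded_url, "image/jpeg")
)

# The raw URL yields one field too many, so parsing fails.
try:
    parse_cdx_line(
        build_cdx_line("example-key", "20150701000000", raw_url, "image/jpeg")
    )
    raw_parsed = True
except ValueError:
    raw_parsed = False
```

The key and timestamp values above are placeholders, not data from a real index.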

Description

This PR adds a second sanitization step to ArcWarcRecordLoader._ensure_target_uri_format, percent-encoding any spaces that are present. It also updates test_archiveiterator.py to include a test for this situation, and adds a small WARC (example-space-in-target-uri.warc.gz) affected by the problem.
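As a rough illustration of the space-encoding step, the sketch below percent-encodes spaces in a target URI string. It is a standalone approximation, not the code merged into `ArcWarcRecordLoader._ensure_target_uri_format`; the function name `encode_spaces_in_target_uri` is hypothetical, and it assumes only literal space characters need handling.

```python
def encode_spaces_in_target_uri(uri):
    """Return the URI with each literal space replaced by %20.

    Hypothetical helper for illustration. Other characters, including
    existing percent-escapes, are left untouched, so an already-clean
    URI passes through unchanged.
    """
    if uri is None:
        return None
    return uri.replace(" ", "%20")


example = "file:///088s3gtTLhg/source/Markey Cover_sm.jpg"
sanitized = encode_spaces_in_target_uri(example)
```

Using a plain replacement rather than a general-purpose quoting function keeps the step idempotent: running it a second time on `sanitized` changes nothing, and existing `%20` sequences are not double-encoded.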

We have no idea if this is something warcio wants to support. Our problem is likely pretty niche. We are 100% content to continue patching in perpetuity if you aren't interested in this change.

@ikreymer ikreymer merged commit c755f4f into webrecorder:develop Jun 21, 2019