Proposed: support reading and sanitizing WARCs with spaces in WARC-Target-URI #80
Motivation and Context
Due to an accident of history, we have an unknown number of WARCs with resource records whose WARC-Target-URI headers contain spaces (e.g. WARC-Target-URI: file:///088s3gtTLhg/source/Markey Cover_sm.jpg). We may be the only archive in the world with this problem: we produced the infelicitous WARCs ourselves when implementing initial pywb support in Perma, back in the summer of 2015.
(Fun fact: if you use wget to retrieve a website like https://www.yellowstonenationalpark.com/wolves.htm, which includes assets like https://www.yellowstonenationalpark.com/Terraces%20- Yellowstone%20National%20Park.jpg, wget will save those assets to disk with spaces in the filename... not %20s. Who knew?)
As is the case with ARCs with space-containing URLs, the presence of these unexpected spaces causes problems in downstream projects when attempting to compile CDX/CDXJ for the WARCs. For instance, if uploading the WARC to Webrecorder and playing it back, pywb cannot successfully create CDXObjects, because it splits on spaces.
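
To illustrate the failure mode described above, here is a minimal sketch of a space-delimited index parser. The five-field layout and the `parse_index_line` helper are assumptions made up for this example; they are not pywb's actual CDX format or code. The point is only that an unencoded space in the URL shifts every field after it.

```python
# Hypothetical, simplified index layout (NOT pywb's real CDX field list).
FIELDS = ["urlkey", "timestamp", "url", "mime", "status"]


def parse_index_line(line):
    """Split a space-delimited index line into named fields.

    Raises ValueError when the number of space-separated parts does not
    match the expected field count, which is what happens when the URL
    itself contains a raw space.
    """
    parts = line.split(" ")
    if len(parts) != len(FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELDS), len(parts))
        )
    return dict(zip(FIELDS, parts))


# A URL with a percent-encoded space parses cleanly.
good = "key 20150701000000 file:///088s3gtTLhg/source/Markey%20Cover_sm.jpg image/jpeg 200"

# The same URL with a raw space yields one field too many.
bad = "key 20150701000000 file:///088s3gtTLhg/source/Markey Cover_sm.jpg image/jpeg 200"
```

Calling `parse_index_line(good)` returns a five-key dictionary, while `parse_index_line(bad)` raises `ValueError` because the URL has been split in two.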
Description
This PR adds a second sanitization step to ArcWarcRecordLoader._ensure_target_uri_format, percent-encoding any spaces that are present. It also updates test_archiveiterator.py to include a test for this situation, and adds a small WARC (example-space-in-target-uri.warc.gz) affected by the problem.
We have no idea if this is something warcio wants to support. Our problem is likely pretty niche. We are 100% content to continue patching in perpetuity if you aren't interested in this change.