Some URLs not playing back in pywb #29

Closed
anjackson opened this issue Oct 19, 2018 · 9 comments
Labels
bug Something isn't working

Comments

@anjackson
Contributor

We have an oddity, in that some URLs, like this one:

http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png

Play back fine in OpenWayback: https://www.webarchive.org.uk/wayback/archive/20130307094428im_/http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png

But do not play back in pywb: https://alpha.webarchive.org.uk/wayback/en/archive/20130307094428im_/http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png

In the back-end OutbackCDX service (which is an old version, which might matter here), the lookup returns the expected record:

> http://bigcdx.n45.wa.bl.uk:8080/data-heritrix?url=http%3A%2F%2F3.bp.blogspot.com%2F-W8IWj9tFz-I%2FUTCS2D5Pt-I%2FAAAAAAAAAI4%2F8BCbTLsJ3tI%2Fs320%2FAfrican%2Bwomen4.png
< com,blogspot,bp,3)/-w8iwj9tfz-i/utcs2d5pt-i/aaaaaaaaai4/8bcbtlsj3ti/s320/african+women4.png 20130307094428 http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png image/png 200 GRDKSKQHAA72NSYKH6UUAOAELGHBKGPW 0
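
For reference, a minimal sketch of pulling the fields out of that response line. The field names are an assumption based on the usual CDX column order (URL key, timestamp, original URL, MIME type, status, digest); the output itself does not label them.

```python
# The CDX line returned by OutbackCDX above, copied verbatim.
cdx_line = (
    "com,blogspot,bp,3)/-w8iwj9tfz-i/utcs2d5pt-i/aaaaaaaaai4/8bcbtlsj3ti/s320/african+women4.png "
    "20130307094428 "
    "http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png "
    "image/png 200 GRDKSKQHAA72NSYKH6UUAOAELGHBKGPW 0"
)

# Assumed column order: urlkey, timestamp, original, mimetype, status, digest.
fields = cdx_line.split()
urlkey, timestamp, original, mimetype, status, digest = fields[:6]

print(timestamp)        # capture time requested in the replay URL
print(original)         # the '+' is stored literally in the index
print(mimetype, status)
```

So the record exists with the literal `+` in both the key and the original URL, and the timestamp matches the one in the replay request.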

From the pywb logs we see:

127.0.0.1 - - [2018-10-19 10:12:54] "POST /archive/resource/postreq?url=http%3A%2F%2F3.bp.blogspot.com%2F-W8IWj9tFz-I%2FUTCS2D5Pt-I%2FAAAAAAAAAI4%2F8BCbTLsJ3tI%2Fs320%2FAfrican%2Bwomen4.png&matchType=exact&closest=20130307094428 HTTP/1.1" 404 536 0.060883

So perhaps the issue is the + getting escaped?
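
A quick sketch of how the `+` could go missing, using only Python's `urllib.parse`. This only illustrates the encoding behaviour; it makes no claim about where pywb or OutbackCDX actually decodes the value.

```python
from urllib.parse import quote_plus, unquote_plus

url = ("http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/"
       "AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png")

# One round of escaping turns the literal '+' into '%2B',
# which matches what appears in the pywb log line.
encoded = quote_plus(url)

# Decoding once restores the original URL.
decoded_once = unquote_plus(encoded)

# Decoding a second time treats the restored '+' as a space,
# so the lookup key no longer matches the indexed URL.
decoded_twice = unquote_plus(decoded_once)

print(encoded)
print(decoded_once == url)
print(decoded_twice)
```

If anything between pywb and the index decodes the value one more time than it was encoded, the lookup would be for `African women4.png` rather than `African+women4.png`.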

@N0taN3rd
Contributor

@anjackson are you able to share the page that this URL came from?
I suspect that the issue is either the URL encoding issue described above or a CDX matching issue.
I would like to determine whether the URL encoding/decoding and CDX fuzzy matching improvements made after the 2.0.4 release address this issue.

@N0taN3rd
Contributor

Thanks for the context.
Using both webrecorder/pywb@dev and ukwa/pywb@master, a capture of the page(s) from today, and a local CDX, I was able to replay the images with no issue.
It appears that it might be an issue with OutbackCDX.

@anjackson added the bug label Feb 6, 2019
@anjackson
Contributor Author

Thanks @N0taN3rd - I've asked OutbackCDX about it.

@anjackson
Contributor Author

anjackson commented Feb 27, 2019

Having discussed nla/outbackcdx#44 with @ato, we think the OpenSearch API implementation (here) is the issue (your test probably didn't use that, @N0taN3rd?)

To quote @ato

that's missing the outer escape
EXACT_SUFFIX = '?q=type:urlquery+url:{url}'
should be more like
'?q=' + quote_plus('type:urlquery+url:' + quote_plus(url))
it's got the inner one query_url.format(url=quote_plus(url)) but not the outer one of the entire value of the q= parameter

i.e. the problem is the OpenSearch API expects everything to be doubly-escaped.
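
To illustrate the difference, here is a minimal sketch. `q_single`, `q_double`, and `decode_q` are hypothetical helpers, not pywb or OutbackCDX functions, and `decode_q` assumes the server undoes the outer escape of `q=` first and the inner escape of the `url:` term second.

```python
from urllib.parse import quote_plus, unquote_plus

URL = ("http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/"
       "AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png")


def q_single(url):
    """Value of q= with only the inner escape (the behaviour quoted above)."""
    return 'type:urlquery+url:{url}'.format(url=quote_plus(url))


def q_double(url):
    """Value of q= with the inner and the outer escape (the suggested fix)."""
    return quote_plus('type:urlquery+url:' + quote_plus(url))


def decode_q(q_value):
    """Assumed server side: undo the outer escape, then the inner one."""
    outer = unquote_plus(q_value)
    _, _, inner = outer.partition('url:')
    return unquote_plus(inner)


print(decode_q(q_single(URL)))         # '+' in the file name is lost
print(decode_q(q_double(URL)) == URL)  # round-trips intact
```

With only the inner escape, the server's two rounds of decoding turn `%2B` into `+` and then into a space; with both escapes the URL survives.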

@ato

ato commented Feb 27, 2019

The original Java Wayback code, which is the reference implementation of the protocol that OutbackCDX is trying to be compatible with, is here:
https://github.com/iipc/openwayback/blob/a9152ce/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java#L1210-L1216

You can see the double escaping with URLEncoder (once inside the loop, once outside it).

@N0taN3rd
Contributor

@anjackson yes my test did not use that.

@anjackson
Contributor Author

Thanks all. I'll tag and roll out ASAP.


3 participants