TimeGate missing/broken #37

Closed
anjackson opened this issue Jan 16, 2019 · 16 comments · Fixed by ukwa/pywb#8
Labels
bug Something isn't working

Comments

@anjackson
Contributor

anjackson commented Jan 16, 2019

As per this thread, we have a report that the Memento TimeGate is not where it is expected according to the documentation.

Specifically it seems the URI of the UKWA TimeGate has changed from:

https://www.webarchive.org.uk/wayback/archive/

to:

https://www.webarchive.org.uk/wayback/archive/mp_/

This is likely a bug, probably introduced by our theming/overlay changes. We should add an integration test to check that the TimeGate (TG) is where the documentation says it is.

@anjackson
Contributor Author

Looks like the TimeMaps are missing too (although they do appear in the headers of the TG response):

https://www.webarchive.org.uk/wayback/archive/timemap/link/https://www.bl.uk

@anjackson
Contributor Author

Note that these links should behave as expected by the Memento standard, even in the absence of e.g. an Accept-Datetime header.
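
For context, a client that does want a particular point in time sends an Accept-Datetime request header carrying an HTTP-date in GMT. The following is a minimal sketch of building that header value, not something taken from the UKWA or pywb code; the helper name `accept_datetime_header` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build an Accept-Datetime request header (hypothetical helper).

    `when` should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    # Normalise to UTC so the formatted value always ends in "GMT".
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


# Example: ask for the state of a page as of 1 January 2019, 23:17:15 UTC.
headers = accept_datetime_header(
    datetime(2019, 1, 1, 23, 17, 15, tzinfo=timezone.utc)
)
print(headers)
```

A TimeGate that receives no such header is still expected to respond, which is the behaviour this comment asks for.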

@ibnesayeed

I was looking at the MemGator logs and see some successful responses until October 30, but after that all responses are 404s. Does this align with the timeline of any changes made on the server?

@anjackson
Contributor Author

@ibnesayeed Yes, it does.

Curiously, the service says the TimeMap is where I expect:

$ curl -I https://www.webarchive.org.uk/wayback/archive/http://www.bl.uk/
HTTP/1.1 200 OK
Server: nginx/1.14.0
Date: Thu, 14 Feb 2019 22:42:27 GMT
Content-Type: text/html
Content-Length: 3000
Connection: keep-alive
Link: <http://www.bl.uk/>; rel="original", <https://www.webarchive.org.uk/wayback/archive/mp_/http://www.bl.uk/>; rel="timegate", <https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/>; rel="timemap"; type="application/link-format"
Vary: accept-datetime
Accept-Ranges: bytes

But when you go there it doesn't work...

$ curl -I https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/
HTTP/1.1 404 Not Found
Server: nginx/1.14.0
Date: Thu, 14 Feb 2019 22:42:56 GMT
Content-Type: application/link-format
Content-Length: 0
Connection: keep-alive
Link: <http://timemap/link/http://www.bl.uk/>; rel="original", <https://www.webarchive.org.uk/wayback/archive/mp_/http://timemap/link/http://www.bl.uk/>; rel="timegate", <https://www.webarchive.org.uk/wayback/archive/timemap/link/http://timemap/link/http://www.bl.uk/>; rel="timemap"; type="application/link-format"
Vary: accept-datetime
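
The Link header of the 404 response hints at what went wrong: its rel="original" target is `http://timemap/link/http://www.bl.uk/`, so the `timemap/link/` path segment appears to have been treated as part of the URL to replay. Below is a rough sketch of the kind of check an integration test could make against such a header. It is not the project's actual test; the function names are hypothetical, and the regex assumes `rel` is the first attribute after each URI, as it is in the responses shown here.

```python
import re

# Link header from the 404 response above, trimmed to the relevant entries.
BROKEN_LINK_HEADER = (
    '<http://timemap/link/http://www.bl.uk/>; rel="original", '
    '<https://www.webarchive.org.uk/wayback/archive/mp_/'
    'http://timemap/link/http://www.bl.uk/>; rel="timegate"'
)


def link_target(link_header, rel):
    """Return the URI whose first attribute is rel="<rel>", or None."""
    pattern = r'<([^>]*)>\s*;\s*rel="%s"' % re.escape(rel)
    match = re.search(pattern, link_header)
    return match.group(1) if match else None


def check_links(link_header, prefix, uri_r):
    """List mismatches between the header and the expected Memento links."""
    problems = []
    original = link_target(link_header, "original")
    if original != uri_r:
        problems.append("original is %r, expected %r" % (original, uri_r))
    timegate = link_target(link_header, "timegate")
    if timegate != prefix + uri_r:
        problems.append("timegate is %r, expected %r" % (timegate, prefix + uri_r))
    return problems


for problem in check_links(
    BROKEN_LINK_HEADER,
    prefix="https://www.webarchive.org.uk/wayback/archive/",
    uri_r="http://www.bl.uk/",
):
    print(problem)
```

In a real test the header would come from a live HEAD request rather than a pasted string.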

ikreymer added a commit to ukwa/pywb that referenced this issue Feb 14, 2019
- support memento timegate on top-frame (when no timestamp is provided)
- treat top-frame no-timestamp url as canonical timegate
- update tests
ikreymer added a commit to ukwa/pywb that referenced this issue Feb 14, 2019
- support memento timegate on top-frame (when no timestamp is provided)
- treat top-frame no-timestamp url as canonical timegate
- tests: update tests, add memento redirect mode tests for timegate, timegate with accept-dt header
@anjackson
Contributor Author

Hm, so the TimeGate seems to work now:

$ curl -I https://beta.webarchive.org.uk/wayback/archive/http://www.bl.uk/
HTTP/1.1 307 Temporary Redirect
Server: nginx/1.14.2
Date: Fri, 15 Feb 2019 10:49:00 GMT
Content-Length: 0
Connection: keep-alive
Location: https://beta.webarchive.org.uk/wayback/archive/20190101231715/https://www.bl.uk/
Link: <https://www.bl.uk/>; rel="original", <https://beta.webarchive.org.uk/wayback/archive/https://www.bl.uk/>; rel="timegate", <https://beta.webarchive.org.uk/wayback/archive/timemap/link/https://www.bl.uk/>; rel="timemap"; type="application/link-format", <https://beta.webarchive.org.uk/wayback/archive/20190101231715mp_/https://www.bl.uk/>; rel="memento"; datetime="Tue, 01 Jan 2019 23:17:15 GMT"
Preference-Applied: rewritten
Vary: accept-datetime, Prefer
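
One way to sanity-check a redirect like this is to confirm that the timestamp embedded in the Location URI agrees with the `datetime` attribute of the rel="memento" link. The sketch below assumes, based only on the URLs in this thread, that memento URIs carry a 14-digit `YYYYMMDDhhmmss` timestamp optionally followed by a modifier such as `mp_`; the function name is hypothetical.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Values copied from the 307 response above.
LOCATION = "https://beta.webarchive.org.uk/wayback/archive/20190101231715/https://www.bl.uk/"
MEMENTO_DATETIME = "Tue, 01 Jan 2019 23:17:15 GMT"


def timestamp_from_uri_m(uri_m):
    """Extract the capture time from a memento URI (assumed URL layout)."""
    # 14 digits, then an optional two-letter modifier such as "mp_".
    match = re.search(r"/(\d{14})(?:[a-z]{2}_)?/", uri_m)
    if match is None:
        raise ValueError("no 14-digit timestamp found in %r" % uri_m)
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


from_location = timestamp_from_uri_m(LOCATION)
from_link = parsedate_to_datetime(MEMENTO_DATETIME)
print(from_location == from_link)
```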

$ curl -I https://beta.webarchive.org.uk/wayback/archive/http://www.bbc.co.uk/
HTTP/1.1 451 Unavailable For Legal Reasons
Server: nginx/1.14.2
Date: Fri, 15 Feb 2019 10:49:08 GMT
Content-Type: text/html
Content-Length: 3314
Connection: keep-alive

But the TimeMaps not so much?

$ curl https://beta.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/
$

ikreymer added a commit to ukwa/pywb that referenced this issue Feb 15, 2019
- fix timemap in 'redirect-to-exact' mode, (ensure timegate redirect condition applies only to top-frame)
- tests: add additional timemap tests, with and without exact redirect
@ibnesayeed

@anjackson is this change deployed yet?

@anjackson
Contributor Author

@ibnesayeed Easy tiger! 🐅 😄

Just building the image at the moment. Will roll out to beta.webarchive.org.uk later on.

@anjackson
Contributor Author

Woo! Looks good!

$ curl -s https://beta.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/ | head
<https://beta.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/>; rel="self"; type="application/link-format"; from="Tue, 30 Oct 2001 00:00:19 GMT",
<https://beta.webarchive.org.uk/wayback/archive/http://www.bl.uk/>; rel="timegate",
<http://www.bl.uk/>; rel="original",
<http://www.bl.uk/>; rel="memento"; datetime="Tue, 30 Oct 2001 00:00:19 GMT"; collection="archive",
<http://www.bl.uk/>; rel="memento"; datetime="Tue, 13 Nov 2001 00:00:00 GMT"; collection="archive",
<http://www.bl.uk/>; rel="memento"; datetime="Tue, 30 Apr 2002 23:00:24 GMT"; collection="archive",
<http://www.bl.uk/>; rel="memento"; datetime="Tue, 14 May 2002 23:00:01 GMT"; collection="archive",

Queries are still a little slow in some 451 cases, as per #38, but those TimeMaps come back. You can't tell it's a 451 from the TimeMap, but that's the expected behaviour.

I'm going to tag a 1.0.6 release and roll it out.

@ikreymer
Contributor

Do you have examples of queries that are still slow? There should be a big improvement with the access check optimizations in #39, and the example from #38 loads quickly.

@anjackson
Contributor Author

Oh it's much better in general! This might just be larger sites, e.g.

https://www.webarchive.org.uk/wayback/archive/*/http://www.bbc.co.uk/

has quite a long pause while it assembles the data.

@ibnesayeed

TimeMaps are still not in good shape. Each memento entry lists the URI-R (the original resource) rather than the URI-M (the archived memento).
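
The problem is visible in the TimeMap excerpt posted earlier in this thread, where every rel="memento" line points at `http://www.bl.uk/`. A small sketch of how a client might detect this is shown below. It is an illustration only, not MemGator's or pywb's parser; the function names are hypothetical, and the regex only handles quoted attribute values as they appear in the excerpt.

```python
import re

# First few lines of the TimeMap quoted earlier in the thread.
TIMEMAP = """\
<https://beta.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/>; rel="self"; type="application/link-format"; from="Tue, 30 Oct 2001 00:00:19 GMT",
<https://beta.webarchive.org.uk/wayback/archive/http://www.bl.uk/>; rel="timegate",
<http://www.bl.uk/>; rel="original",
<http://www.bl.uk/>; rel="memento"; datetime="Tue, 30 Oct 2001 00:00:19 GMT"; collection="archive",
<http://www.bl.uk/>; rel="memento"; datetime="Tue, 13 Nov 2001 00:00:00 GMT"; collection="archive",
"""

# Match "<uri>" plus its ';name="value"' attributes. Splitting on commas
# would be wrong because the datetime values contain commas themselves.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of (uri, attributes) pairs from a link-format body."""
    return [(uri, dict(ATTR.findall(raw))) for uri, raw in ENTRY.findall(text)]


def mementos_pointing_at_original(text):
    """Return datetimes of memento entries whose URI is the URI-R itself."""
    entries = parse_timemap(text)
    originals = {uri for uri, attrs in entries if attrs.get("rel") == "original"}
    return [
        attrs.get("datetime")
        for uri, attrs in entries
        # rel may hold several space-separated values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split() and uri in originals
    ]


print(mementos_pointing_at_original(TIMEMAP))
```

An empty list would mean that no memento entry reuses the original URI.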

@anjackson
Contributor Author

anjackson commented Feb 15, 2019

Hm, and 'misses' (should be 404s) throw errors... https://www.webarchive.org.uk/wayback/archive/http://www.no.domain/

NOTE TO SELF: Disable debug mode before rolling out next time! :-)

@anjackson
Contributor Author

Okay, I think we're good. Need to disable DEBUG mode, but functionality should be there.

@ibnesayeed

Looks much better. However, I am wondering if we should return a Link header in case of 404s?

Also, 404 responses are taking somewhere between 10 and 20 seconds to arrive, which can perhaps be improved.

$ curl -i https://www.webarchive.org.uk/wayback/en/archive/timemap/link/http://www.no.domain/
HTTP/1.1 404 Not Found
Server: nginx/1.14.0
Date: Sat, 16 Feb 2019 16:28:03 GMT
Content-Type: application/link-format
Content-Length: 0
Connection: keep-alive
Link: <http://www.no.domain/>; rel="original", <https://www.webarchive.org.uk/wayback/en/archive/http://www.no.domain/>; rel="timegate", <https://www.webarchive.org.uk/wayback/en/archive/timemap/link/http://www.no.domain/>; rel="timemap"; type="application/link-format"
Vary: accept-datetime

@ikreymer
Contributor

I'm not seeing that type of delay, no more than ~1 sec here

$ time curl -I "https://www.webarchive.org.uk/wayback/en/archive/timemap/link/http://www.no-such-invalid.domain/"
HTTP/1.1 404 Not Found
Server: nginx/1.14.0
Date: Sat, 16 Feb 2019 23:29:55 GMT
Content-Type: application/link-format
Content-Length: 0
Connection: keep-alive
Link: <http://www.no-such-invalid.domain/>; rel="original", <https://www.webarchive.org.uk/wayback/en/archive/http://www.no-such-invalid.domain/>; rel="timegate", <https://www.webarchive.org.uk/wayback/en/archive/timemap/link/http://www.no-such-invalid.domain/>; rel="timemap"; type="application/link-format"
Vary: accept-datetime


real	0m0.866s
user	0m0.030s
sys	0m0.008s

@ibnesayeed

Now, I don't see it either. However, inclusion of Link header in 404s remains something to wonder about.

N0taN3rd pushed a commit to webrecorder/pywb that referenced this issue Sep 3, 2019
- support memento timegate on top-frame (when no timestamp is provided)
- treat top-frame no-timestamp url as canonical timegate
- tests: update tests, add memento redirect mode tests for timegate, timegate with accept-dt header
N0taN3rd pushed a commit to webrecorder/pywb that referenced this issue Sep 3, 2019
- fix timemap in 'redirect-to-exact' mode, (ensure timegate redirect condition applies only to top-frame)
- tests: add additional timemap tests, with and without exact redirect