Version 1.2 and some guidance on setup/making the embed plugin and make it work consistently #26

Closed
DiegoPino opened this issue Nov 17, 2020 · 5 comments

Comments

@DiegoPino

DiegoPino commented Nov 17, 2020

Hi @ikreymer @emmadickson

I know you guys are busy with WACZ, but I wanted to catch up on some issues we have been having with the embed version of ReplayWeb.page on Archipelago, using version 1.2.

I suspect a lot of this is because we are loading the JS from a CDN, but also because the files we are testing against are "largish" and pure WARC. But we may have other issues, so I am open to suggestions.

  1. Safari (Version 14.0 (15610.1.28.1.9, 15610)) vs. Chrome (Version 86.0.4240.198 (Official Build) (x86_64)).
    This URL, a 1 GByte WARC file, eventually loads on Safari (slowly). I see small pauses every 1000 records, and I get a lot of restarts and failed attempts (including the client reloading the whole page). Surprisingly, it loads faster (still a few minutes) and more resiliently on Chrome, but without any CSS/images/assets.


  2. I updated the way we deliver the file to use streaming, and was using a GET argument (?stream=true) to enable it during testing. This basically reads 1024 bytes from S3 and passes 1024 bytes back to the HTTP response (I could make the chunks larger?). Sadly, I cannot use that GET argument in production, because the embed tag fails if the "source" property does not end in a valid file extension (not a big issue; I modified it to always stream for WARC and WACZ files in that first URL I shared). But now I have second thoughts. Is streaming what you need, or what works better? Or is chunking better for your JS? Also, does the stream need to be seekable?

To test a direct download of the stream, please use this URL (it reuses our IIIF endpoint; please disregard the odd semantics here):
https://webarchive.archipelago.nyc//do/4/iiif/51b281b4-093e-494c-9820-9eeeb03a4c6e/full/full/0/default.warc
For example, wget takes 39 s (942.50M at 26.7MB/s), while the replay embed takes 5 minutes or more on Chrome.

  3. JS errors in every browser. We are getting quite a few. E.g. Firefox:
failed to load ‘https://webarchive.archipelago.nyc/do/4/iiif/51b281b4-093e-494c-9820-9eeeb03a4c6e/full/full/0/default.warc’. A ServiceWorker intercepted the request and encountered an unexpected error. sw.js:33:126116
Read 1000 records sw.js:9:52746
AbortError: A request was aborted, for example through a call to IDBTransaction.abort. sw.js:9:179926
AbortError: AbortError

And it restarts. Sometimes it works, sometimes it does not.

E.g. Safari:

(anonymous function)
rejectPromise
rejectPromiseWithFirstResolvingFunctionCallCheck
s — sw.js:9:159182
s — sw.js:9:159182

Also, should we be worried about the initial message (since we are running from the CDN)?

done
webarchive.archipelago.nyc/:8 GET https://webarchive.archipelago.nyc/replay/ui.js net::ERR_ABORTED 404 (Not Found)
ui.js:661 GET https://webarchive.archipelago.nyc/replay/wabac/api/id-a5343ec7bd53?all=1 404

We are loading ui.js via CDN and it works.

FYI we are running NGINX, and file delivery is not directly from S3 (access control, plus some other users may be using Azure or the filesystem directly, so we wrap things). Maybe we need to tune our binary responses?

Sorry for the "cover it all" issue, but I feel it's more like sharing a use case, and for sure using just WACZ should solve all the issues. For now, though, I want to be sure it's not us or something we could do better.

Thanks for the great work!!
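
On the chunk-size question raised above (reading 1024 bytes at a time and passing them on), here is a minimal sketch of a chunked reader with a configurable chunk size. It is written in Python purely for illustration; the actual Archipelago delivery code is PHP and is not shown in this thread, and the names `iter_chunks` and `count_reads` are hypothetical. It only shows how the number of read/write round trips scales with chunk size; whether a larger chunk actually helps in this setup is not established here.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(source: BinaryIO, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes from a binary stream."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes object means end of stream
            break
        yield chunk


def count_reads(total_bytes: int, chunk_size: int) -> int:
    """Number of reads needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_size)


if __name__ == "__main__":
    one_gib = 1024 ** 3  # assumed size, close to the "1 GByte" WARC in the report
    print(count_reads(one_gib, 1024))         # reads with 1 KiB chunks
    print(count_reads(one_gib, 1024 * 1024))  # reads with 1 MiB chunks

    # Tiny in-memory demonstration that the chunks reassemble to the input.
    payload = bytes(range(256)) * 40
    assert b"".join(iter_chunks(io.BytesIO(payload), 1024)) == payload
```

With these assumed sizes, 1 KiB chunks mean roughly a million read/write iterations for a 1 GiB file, while 1 MiB chunks need about a thousand.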

@ikreymer
Member

ikreymer commented Nov 18, 2020

Thanks for the detailed report. There are a couple of issues, some on my end, some on the Archipelago end :)

  1. The AbortError is the same issue, and appears to be related to ref counting related to large WARCs. It probably should not be enabled for WARC loading at all, and I'll disable it for now. I believe this should fix all of the AbortErrors.

  2. Looking at https://webarchive.archipelago.nyc//do/4/iiif/51b281b4-093e-494c-9820-9eeeb03a4c6e/full/full/0/default.warc, it appears that the current nginx setup does not support byte range requests.
    This will be needed for WACZ support as well. ReplayWeb.page checks whether it can make range requests and, if so, optimizes by reading data on demand later. If it cannot, it will try to store everything, which exacerbates the AbortError in this case (but the errors shouldn't happen either way).

    nginx should handle range requests automatically for static files, and there is also this module: https://www.nginx.com/blog/smart-efficient-byte-range-caching-nginx/

    However, if you're proxying from S3, it should be possible to just get the ranges from S3 directly.

    If you can get range requests working, that should address a lot of the issues, and I will also add an additional fix to the 1.3.0 release. Let me know if you run into any questions, happy to look at the nginx config.

  3. The initial "ui.js not found" error is fine: the page then loads it from the CDN as a fallback. This is expected.
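
To make the range-request point in item 2 concrete, here is a minimal sketch of how a wrapper that serves files itself could answer a single-range `Range: bytes=...` header. It is a Python illustration under stated assumptions, not the nginx, PHP, or ReplayWeb.page implementation: the function name `range_response` is hypothetical, multi-range requests are simply ignored (a server is allowed to ignore `Range`), and the file size is assumed to be known and greater than zero.

```python
import re
from typing import Dict, Optional, Tuple

# Matches a single byte range such as "bytes=0-1023", "bytes=500-" or "bytes=-500".
_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")

Span = Optional[Tuple[int, int]]


def range_response(header: Optional[str], size: int) -> Tuple[int, Dict[str, str], Span]:
    """Return (status, headers, inclusive byte span) for a Range header value."""
    base = {"Accept-Ranges": "bytes"}  # advertises range support to clients
    full = (200, {**base, "Content-Length": str(size)}, (0, size - 1))
    if not header:
        return full

    match = _RANGE_RE.match(header.strip())
    if not match or (match.group(1) == "" and match.group(2) == ""):
        return full  # malformed or multi-range: ignore the header in this sketch

    first, last = match.group(1), match.group(2)
    if first == "":
        # Suffix form: the last N bytes of the file.
        length = int(last)
        if length == 0:
            return 416, {**base, "Content-Range": f"bytes */{size}"}, None
        start, end = max(size - length, 0), size - 1
    else:
        start = int(first)
        if last and int(last) < start:
            return full  # syntactically invalid range: ignore the header
        if start >= size:
            return 416, {**base, "Content-Range": f"bytes */{size}"}, None
        end = min(int(last), size - 1) if last else size - 1

    headers = {
        **base,
        "Content-Range": f"bytes {start}-{end}/{size}",
        "Content-Length": str(end - start + 1),
    }
    return 206, headers, (start, end)


if __name__ == "__main__":
    print(range_response("bytes=0-1023", 10_000))
    print(range_response("bytes=-500", 10_000))
```

The returned span is what the wrapper would then fetch from the backing store, for example by forwarding the same byte range to S3 rather than reading the whole object.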

ikreymer added a commit to webrecorder/wabac.js that referenced this issue Nov 18, 2020
- update to warcio 1.3.2, better handling of WARCs with incorrect content-length, should fix webrecorder/replayweb.page#23
- loading: dedupResource() uses put() instead of add() to avoid aborting transaction with duplicate add, fixes part of webrecorder/replayweb.page#26
- statuscodes: update to latest http-status-codes, wrap getStatusText() to ensure never throws, returns 'Unknown Status' if unknown
ikreymer added a commit to webrecorder/wabac.js that referenced this issue Nov 18, 2020
- update to warcio 1.3.2, better handling of WARCs with incorrect content-length, should fix webrecorder/replayweb.page#23
- loading: dedupResource() uses put() instead of add() to avoid aborting transaction with duplicate add, fixes part of webrecorder/replayweb.page#26
- statuscodes: update to latest http-status-codes, wrap getStatusText() to ensure never throws, returns 'Unknown Status' if unknown
bump version to 2.2.2
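
The `put()` versus `add()` change mentioned in the commit message can be illustrated with a toy model. In IndexedDB, `add()` fails with a ConstraintError when the key already exists, and an unhandled request error aborts the surrounding transaction, whereas `put()` overwrites the existing record. The sketch below is a simplified Python model of that behaviour, not wabac.js code; `ToyTransaction` and its methods are hypothetical names, and the real API is asynchronous.

```python
class ConstraintError(Exception):
    """Raised by add() when the key already exists (models IndexedDB's ConstraintError)."""


class ToyTransaction:
    """Buffers writes and applies them to the store on commit, unless aborted."""

    def __init__(self, store: dict):
        self._store = store
        self._pending: dict = {}
        self.aborted = False

    def _check_active(self) -> None:
        if self.aborted:
            raise RuntimeError("transaction was aborted")

    def add(self, key, value) -> None:
        """Insert only; a duplicate key aborts the whole transaction."""
        self._check_active()
        if key in self._store or key in self._pending:
            self.aborted = True
            self._pending.clear()  # all buffered writes are lost
            raise ConstraintError(f"key already exists: {key!r}")
        self._pending[key] = value

    def put(self, key, value) -> None:
        """Insert or overwrite; duplicates are not an error."""
        self._check_active()
        self._pending[key] = value

    def commit(self) -> None:
        self._check_active()
        self._store.update(self._pending)
        self._pending = {}


if __name__ == "__main__":
    store: dict = {}
    tx = ToyTransaction(store)
    tx.put("https://example.com/style.css", "record 1")
    tx.put("https://example.com/style.css", "record 2")  # duplicate is fine with put()
    tx.commit()
    print(store)
```

In this model, one duplicate `add()` discards every other write buffered in the same transaction, which is consistent with the report that resources went missing when the AbortError occurred.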
@ikreymer
Member

@DiegoPino Released v1.3.0, which should fix the AbortError issue, even without range request support -- it will load slowly, but it should load w/o errors now.

@DiegoPino
Author

@ikreymer thanks for your detailed response. We are having some trouble finding the right balance between "security", flexibility and speed right now. I managed to get HEAD requests for the WACZ implementation working, but I am hitting some resource limits when trying to seek/deliver the range request afterwards (PHP and the AWS S3 SDK are testing my patience on streams and memory usage).
I have some options (like delivering a presigned URL) that would alleviate this in the short term, but they may get me in trouble with caching (the local one, since presigned URLs are made to expire sooner than the cached HTML does), but I will get there! Some explanation (probably out of context) here: esmero/archipelago-deployment#75

I will test V1.3.0

There is still one issue: some CSS/images seem to be handled differently (no clue why) on WARC files and end up not being served. E.g. here:

https://webarchive.archipelago.nyc/do/db17b0d6-886b-4ee4-bfb9-0edf9ce404b5

In this case, the missing CSS is consistent for the landing page.

But for the first example I shared (the 1 GByte WARC), Safari can load it and Chrome cannot.

Will do my homework first and get Ranges working without killing the server!

Thanks!

@ikreymer
Member

There is still one issue: some CSS/images seem to be handled differently (no clue why) on WARC files and end up not being served. E.g. here:

https://webarchive.archipelago.nyc/do/db17b0d6-886b-4ee4-bfb9-0edf9ce404b5

Try updating to 1.3.0. The AbortError would cause certain resources not to be loaded; hopefully this is fixed now.

Sorry to hear about the difficulties with streaming!

I would definitely recommend using the S3 presigned URLs with a reasonable duration (a day?); then you do not need to worry about local caching at all! That should work pretty well. You'll just need to configure CORS settings on the bucket, which I can also help you with.

Here's the WARC you shared loading from DigitalOcean CDN, it takes some time, but does load:
https://replayweb.page/?source=https%3A%2F%2Fdh-preserve.sfo2.cdn.digitaloceanspaces.com%2Fmisc%2Fnyarc.warc
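
As a rough illustration of the presigned-URL suggestion, the sketch below builds a bucket CORS configuration and checks that a presigned URL lifetime covers the HTML cache lifetime. Everything here is an assumption for illustration: the function names are hypothetical, the origin and lifetimes are placeholders, and the exact headers ReplayWeb.page needs are not stated in this thread. The dictionary follows the `CORSRules` shape that S3 bucket CORS configuration uses (for example as the `CORSConfiguration` argument of boto3's `put_bucket_cors`), but no AWS call is made.

```python
import json
from typing import Iterable


def build_cors_configuration(allowed_origins: Iterable[str]) -> dict:
    """Build an S3-style CORS configuration allowing ranged GET/HEAD requests."""
    return {
        "CORSRules": [
            {
                "AllowedOrigins": list(allowed_origins),
                "AllowedMethods": ["GET", "HEAD"],
                # Assumption: the replay client sends Range headers cross-origin.
                "AllowedHeaders": ["Range"],
                # Assumption: the client wants to read these response headers.
                "ExposeHeaders": ["Content-Range", "Content-Length", "Accept-Ranges"],
                "MaxAgeSeconds": 3600,
            }
        ]
    }


def presigned_url_outlives_cache(presign_ttl_s: int, html_cache_ttl_s: int) -> bool:
    """True if a URL signed when the page is rendered is still valid while the page is cached."""
    return presign_ttl_s >= html_cache_ttl_s


if __name__ == "__main__":
    config = build_cors_configuration(["https://webarchive.archipelago.nyc"])
    print(json.dumps(config, indent=2))

    one_day = 24 * 60 * 60
    # Placeholder cache lifetime of one hour; a one-day presigned URL covers it.
    print(presigned_url_outlives_cache(one_day, 60 * 60))
```

The lifetime check reflects the caching concern raised earlier in the thread: if the cached HTML can outlive the signed URL embedded in it, visitors may be served an expired link.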

@ikreymer
Member

I think all the issues mentioned here have now been resolved and the embedding is working!
Closing for now. Please re-open if anything is unresolved, or open a new issue for any new errors.
