Unable to download non-HTML content if served with HTTP cookies #129
Thanks for taking a look at this! The direct fetch logic was originally changed to handle redirects, which were not being fetched. The cookie check was added 'just in case', since the direct fetch did not use the cookies from the browser (which is also possible to do) and might therefore result in a 'soft 404' or an invalid capture. Erring on the side of caution here may have introduced the unnecessary timeouts. In simple testing, I was getting the PDF pages to load without timeouts. I'd like to take a look at the example site with PDFs to see what can be done.
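For context, one way a direct fetch could reuse the browser's cookies is to read them from the Puppeteer page and send them as a `Cookie` header. This is only a minimal sketch of that idea, not code from the crawler; the function name and the use of node-fetch are assumptions:

```js
const fetch = require("node-fetch");

// Minimal sketch: reuse the browser's cookies for a direct (non-browser)
// fetch. `page` is a Puppeteer Page; everything else here is illustrative.
async function directFetchWithBrowserCookies(page, url) {
  const cookies = await page.cookies(url);
  const cookieHeader = cookies
    .map(({ name, value }) => `${name}=${value}`)
    .join("; ");

  return fetch(url, {
    headers: cookieHeader ? { Cookie: cookieHeader } : {},
  });
}
```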
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
Can confirm that:
non-html page load improvements: fix for #129
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
- don't do text extraction for non-HTML pages (will need to handle pdf separately)

bump to 0.5.0-beta.8
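The "only wait for dom load and don't run behaviors" part of that change can be pictured roughly as below. This is a sketch of the idea, not the crawler's actual code; `runBehaviors`, `isHTML`, and the option values are stand-ins:

```js
// Sketch only: use a lighter wait condition for non-HTML resources and skip
// in-page behaviors, which only make sense for HTML documents. `isHTML`
// would come from e.g. a content-type check; `runBehaviors` is a stand-in
// for the crawler's behavior runner.
async function loadNonHtmlAware(page, url, isHTML, runBehaviors) {
  const waitUntil = isHTML ? "networkidle2" : "domcontentloaded";
  await page.goto(url, { waitUntil, timeout: 90000 });

  if (isHTML) {
    await runBehaviors(page);
  }
}
```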
Was fixed in an earlier release!
I'm trying to crawl a site with many PDFs that are served with HTTP cookies. Since the introduction of the cookie check (browsertrix-crawler/crawler.js, line 758 in 5e5efda), `Crawler.directFetchCapture()` returns `false` despite the fetching and recording having completed correctly and the `.warc.gz` having been written; this causes `Crawler.loadPage()` to attempt a browser-based capture, and puppeteer's `page.goto()` times out at browsertrix-crawler/crawler.js, lines 310 to 312 in 09082e8.
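That failure path, as described, looks roughly like this; a sketch following the identifiers named above, not the crawler's actual code, with illustrative option values:

```js
// Sketch of the failure path described above; `crawler` stands in for the
// Crawler instance and the option values are illustrative.
async function loadPage(crawler, page, url) {
  // directFetchCapture() reports failure whenever the response sets cookies,
  // even though the fetch completed and the .warc.gz was written...
  if (await crawler.directFetchCapture(url)) {
    return;
  }

  // ...so the crawler falls back to a browser-based capture, where
  // page.goto() times out on non-HTML resources such as PDFs.
  await page.goto(url, { waitUntil: "load", timeout: 90000 });
}
```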
I was able to work around it by adding an additional check like this:
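A sketch of the kind of check described, assuming it tests the direct-fetch response's content type; the function name and details are my assumptions, not the reporter's actual patch:

```js
// Illustrative sketch, not the reporter's actual patch: decide whether a
// completed direct fetch should still count as a capture when the response
// set cookies. A non-HTML content-type means the browser isn't needed.
function acceptDirectCapture(contentTypeHeader) {
  const contentType = (contentTypeHeader || "")
    .split(";")[0]
    .trim()
    .toLowerCase();
  return contentType !== "" && contentType !== "text/html";
}

// e.g. acceptDirectCapture("application/pdf") === true
// e.g. acceptDirectCapture("text/html; charset=utf-8") === false
```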
I also passed `--wait-until domcontentloaded`, but this is obviously not ideal.

What's the logic behind checking for HTTP cookies and forcing the use of the browser for non-HTML resources? I've taken that restriction out in a local version to fetch the ~20k PDFs that are my immediate concern, but I don't understand everything that's going on here.
Thanks!