
Unable to download non-HTML content if served with HTTP cookies #129

Closed
simonwiles opened this issue Mar 20, 2022 · 3 comments

Comments

@simonwiles
Contributor

I'm trying to crawl a site with many PDFs that are served with HTTP cookies. Since the introduction of

return resp.status === 200 && !resp.headers.get("set-cookie");
in ab096cd, this is not possible.

Crawler.directFetchCapture() returns false, despite the fetching and recording having completed correctly and the .warc.gz having been written; this causes Crawler.loadPage() to attempt a browser-based capture, and puppeteer's page.goto() times out at

if (this.params.behaviorOpts) {
await Promise.allSettled(page.frames().map(frame => evaluateWithCLI(frame, "self.__bx_behaviors.run();")));
}
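
For context, a minimal sketch of how that success check interacts with cookie-served PDFs (this is an illustration, not the actual crawler code, and the function name is invented; it assumes a global fetch is available, e.g. Node 18+ or node-fetch):

// Hypothetical sketch of the success check from ab096cd.
async function directFetchLooksOk(url) {
  const resp = await fetch(url);
  // A Set-Cookie header alone is enough to reject the direct fetch,
  // regardless of whether the body is HTML or a PDF.
  return resp.status === 200 && !resp.headers.get("set-cookie");
}

// A PDF served with "Set-Cookie: session=..." returns false here, so
// Crawler.loadPage() falls through to page.goto() and the behaviors step above.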

I was able to work around it by adding an additional check like this:

if (await this.isHTML(data.url) && this.params.behaviorOpts) {
  await Promise.allSettled(page.frames().map(frame => evaluateWithCLI(frame, "self.__bx_behaviors.run();")));
}

and also passing --wait-until domcontentloaded, but this is obviously not ideal.
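
For reference, a minimal sketch of the kind of content-type check this workaround relies on (an assumption for illustration only, not the crawler's actual isHTML() implementation):

// Hypothetical content-type check: issue a HEAD request and treat only
// HTML/XHTML responses as pages that should run behaviors.
async function looksLikeHTML(url) {
  const resp = await fetch(url, { method: "HEAD" });
  const ct = (resp.headers.get("content-type") || "").toLowerCase();
  return ct.startsWith("text/html") || ct.startsWith("application/xhtml+xml");
}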

What's the logic behind checking for HTTP cookies and forcing the use of the browser for non-HTML resources? I've taken that restriction out in a local version to fetch the ~20k PDFs that are my immediate concern, but I don't understand everything that's going on here.

Thanks!

@simonwiles simonwiles changed the title Unable to download PDFs if served with HTTP cookies Unable to download non-HTML content if served with HTTP cookies Mar 21, 2022
@ikreymer
Member

Thanks for taking a look at this! The direct fetch logic was initially altered to deal with redirects, which were not being fetched. The cookie check was added 'just in case', since the direct fetch does not use the browser's cookies (which would also be possible to do) and might therefore result in a 'soft 404' or otherwise invalid capture. Erring on the side of caution here may have introduced these unnecessary timeouts.

In simple testing, I was getting the PDF pages to load without timeouts. I'd like to take a look at the example site with PDFs to see what can be done.
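
A rough sketch of the cookie-reuse idea mentioned above, i.e. passing the browser's cookies along with the direct fetch so a cookie-gated PDF is fetched in the same session (the helper name is made up; this is not the crawler's implementation):

// Hypothetical helper: fetch a URL directly, but with the cookies the
// puppeteer page currently holds for that URL.
async function directFetchWithBrowserCookies(page, url) {
  const cookies = await page.cookies(url);
  const cookieHeader = cookies.map(c => `${c.name}=${c.value}`).join("; ");
  return fetch(url, { headers: { cookie: cookieHeader } });
}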

ikreymer added a commit that referenced this issue Mar 22, 2022
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
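
In rough terms, the change described in that commit amounts to something like the following (an assumed shape for illustration, not the exact code from the branch):

// Sketch: only use the configured waitUntil and run behaviors for HTML pages;
// for everything else, wait for domcontentloaded and skip behaviors.
const isHTMLPage = await this.isHTML(data.url);
const waitUntil = isHTMLPage ? this.params.waitUntil : "domcontentloaded";

await page.goto(data.url, { waitUntil });

if (isHTMLPage && this.params.behaviorOpts) {
  await Promise.allSettled(page.frames().map(frame => evaluateWithCLI(frame, "self.__bx_behaviors.run();")));
}
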
@simonwiles
Contributor Author

Can confirm that the non-html-page-load branch fixes the problem in the examples I have available. Thanks!

ikreymer added a commit that referenced this issue Mar 23, 2022
* non-html page load improvements: fix for #129
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
- don't do text extraction for non-HTML pages (will need to handle pdf separately)
bump to 0.5.0-beta.8
@ikreymer
Member

ikreymer commented Jul 1, 2022

Was fixed in an earlier release!

@ikreymer ikreymer closed this as completed Jul 1, 2022