
Unable to download non-HTML content if served with HTTP cookies #129

Closed
simonwiles opened this issue Mar 20, 2022 · 3 comments

Comments

@simonwiles
Contributor

I'm trying to crawl a site with many PDFs that are served with HTTP cookies. Since the introduction of

return resp.status === 200 && !resp.headers.get("set-cookie");
in ab096cd, this is not possible.

Crawler.directFetchCapture() returns false, despite the fetching and recording having completed correctly and the .warc.gz having been written; this causes Crawler.loadPage() to attempt a browser-based capture, and puppeteer's page.goto() times out at

if (this.params.behaviorOpts) {
await Promise.allSettled(page.frames().map(frame => evaluateWithCLI(frame, "self.__bx_behaviors.run();")));
}
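
For context, a minimal sketch of how that success check interacts with cookie-served PDFs (this is an illustration, not the actual crawler code, and the function name is invented; it assumes a global fetch is available, e.g. Node 18+ or node-fetch):

// Hypothetical sketch of the success check from ab096cd.
async function directFetchLooksOk(url) {
  const resp = await fetch(url);
  // A Set-Cookie header alone is enough to reject the direct fetch,
  // regardless of whether the body is HTML or a PDF.
  return resp.status === 200 && !resp.headers.get("set-cookie");
}

// A PDF served with "Set-Cookie: session=..." returns false here, so
// Crawler.loadPage() falls through to page.goto() and the behaviors step above.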

I was able to work around it by adding an additional check like this:

if (await this.isHTML(data.url) && this.params.behaviorOpts) {
  await Promise.allSettled(page.frames().map(frame => evaluateWithCLI(frame, "self.__bx_behaviors.run();")));
}

and also passing --wait-until domcontentloaded, but this is obviously not ideal.
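
For reference, a minimal sketch of the kind of content-type check this workaround relies on (an assumption for illustration only, not the crawler's actual isHTML() implementation):

// Hypothetical content-type check: issue a HEAD request and treat only
// HTML/XHTML responses as pages that should run behaviors.
async function looksLikeHTML(url) {
  const resp = await fetch(url, { method: "HEAD" });
  const ct = (resp.headers.get("content-type") || "").toLowerCase();
  return ct.startsWith("text/html") || ct.startsWith("application/xhtml+xml");
}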

What's the logic behind checking for HTTP cookies and forcing the use of the browser for non-HTML resources? I've taken that restriction out in a local version to fetch the ~20k PDFs that are my immediate concern, but I don't understand everything that's going on here.

Thanks!

@simonwiles simonwiles changed the title Unable to download PDFs if served with HTTP cookies Unable to download non-HTML content if served with HTTP cookies Mar 21, 2022
@ikreymer
Member

Thanks for taking a look at this! The direct fetch logic was initially altered to deal with redirects, which were not being fetched. The cookie check was added 'just in case', since the direct fetch does not use the browser's cookies (which would also be possible to do) and might therefore result in a 'soft 404' or otherwise invalid capture. Erring on the side of caution here may have introduced these unnecessary timeouts.

In simple testing, I was getting the PDF pages to load without timeouts. I'd like to take a look at the example site with PDFs to see what can be done.
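
A rough sketch of the cookie-reuse idea mentioned above, i.e. passing the browser's cookies along with the direct fetch so a cookie-gated PDF is fetched in the same session (the helper name is made up; this is not the crawler's implementation):

// Hypothetical helper: fetch a URL directly, but with the cookies the
// puppeteer page currently holds for that URL.
async function directFetchWithBrowserCookies(page, url) {
  const cookies = await page.cookies(url);
  const cookieHeader = cookies.map(c => `${c.name}=${c.value}`).join("; ");
  return fetch(url, { headers: { cookie: cookieHeader } });
}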

ikreymer added a commit that referenced this issue Mar 22, 2022
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
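
In rough terms, the change described in that commit amounts to something like the following (an assumed shape for illustration, not the exact code from the branch):

// Sketch: only use the configured waitUntil and run behaviors for HTML pages;
// for everything else, wait for domcontentloaded and skip behaviors.
const isHTMLPage = await this.isHTML(data.url);
const waitUntil = isHTMLPage ? this.params.waitUntil : "domcontentloaded";

await page.goto(data.url, { waitUntil });

if (isHTMLPage && this.params.behaviorOpts) {
  await Promise.allSettled(page.frames().map(frame => evaluateWithCLI(frame, "self.__bx_behaviors.run();")));
}
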
@simonwiles
Contributor Author

Can confirm that the non-html-page-load branch fixes the problem in the examples I have available. Thanks!

ikreymer added a commit that referenced this issue Mar 23, 2022
* non-html page load improvements: fix for #129
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
- don't do text extraction for non-HTML pages (will need to handle pdf separately)
bump to 0.5.0-beta.8
@ikreymer
Member

ikreymer commented Jul 1, 2022

Was fixed in an earlier release!

@ikreymer ikreymer closed this as completed Jul 1, 2022