Crawl nested html documents loaded dynamically #185

ab14p · 2018-03-29T00:41:29Z

What is the current behavior?

With the robots.txt option set to false:

The crawler waits using waitFor with a specified timeout of 10 seconds, but it is still not able to crawl nested documents. For example, I want to extract the src of an <iframe> that contains an ad (display advertisement).

I enabled the screenshot option to check whether the ad iframe had loaded before the evaluatePage function was executed. I can see the ad in the screenshot, but the function does not return the src of the <iframe>.
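
For illustration, an evaluatePage function of the kind described here might look like the sketch below. The original code was never posted in this thread, so the name `collectIframeSrcs` and the filtering step are assumptions. Such a function only sees iframes in the top-level document; it cannot reach iframes nested inside another frame.

```javascript
// Hypothetical evaluatePage body. It runs in the page context, so it may only
// rely on browser globals such as `document`.
function collectIframeSrcs() {
  return Array.from(document.querySelectorAll('iframe'))
    .map((frame) => frame.src)
    // Ad iframes are often injected without a src attribute; drop empty values.
    .filter((src) => typeof src === 'string' && src.length > 0);
}

// Assumed usage: HCCrawler.launch({ evaluatePage: collectIframeSrcs, ... })
```

If this returns an empty array even though the screenshot shows the ad, the ad iframe probably has no src of its own or sits inside another frame.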

What is the expected behavior?

Be able to crawl nested HTML documents, such as ads, that are loaded dynamically by JavaScript.

Can you please provide an example or a solution for this?

Please tell us about your environment:

Version: 1.5.0
Platform / OS version:
Node.js version: 6.11.2

BubuAnabelas · 2018-03-29T01:08:55Z

Maybe you could use the waitUntil option in the crawler.queue() method in combination with waitFor, or pass a function to waitFor that checks whether the ad has loaded.
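
A minimal sketch of this suggestion, assuming the `waitUntil` and `waitFor.selectorOrFunctionOrTimeout` options behave as described in the headless-chrome-crawler README. The helper names `adIframeLoaded`, `buildQueueOptions`, and `crawl` are hypothetical, and so is the bare `iframe` selector, which would need to be narrowed to the actual ad container.

```javascript
// Predicate evaluated in the page: true once some iframe has a non-empty src.
// It is serialized and sent to the browser, so it must not use outer variables.
function adIframeLoaded() {
  const frame = document.querySelector('iframe');
  return Boolean(frame && frame.src);
}

// Builds the per-request options passed to crawler.queue().
function buildQueueOptions(url) {
  return {
    url,
    waitUntil: 'networkidle0',
    waitFor: {
      selectorOrFunctionOrTimeout: adIframeLoaded,
      options: { timeout: 10000 }, // give up after 10 seconds
    },
  };
}

async function crawl(url) {
  // Required lazily so the helpers above can be used without the package installed.
  const HCCrawler = require('headless-chrome-crawler');
  const crawler = await HCCrawler.launch({
    // Runs in the page after waitFor has resolved.
    evaluatePage: () => document.querySelector('iframe').src,
    onSuccess: (result) => console.log(result.result),
  });
  await crawler.queue(buildQueueOptions(url));
  await crawler.onIdle();
  await crawler.close();
}
```

Waiting on a predicate instead of a fixed timeout lets the crawl continue as soon as the iframe appears, and fails explicitly if it never does.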

yujiosaka · 2018-03-30T04:35:55Z

@ab14p
Could it be that the iframe is blocked by the cross-origin policy?
Please try passing the --disable-web-security argument. See https://github.com/yujiosaka/headless-chrome-crawler#launch-options for more details.
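
As a sketch of that suggestion: launch options are forwarded to Puppeteer, so Chromium flags go in the `args` array. The helper names `buildLaunchOptions` and `launchCrawler` are hypothetical and only serve the illustration.

```javascript
// Builds launch options with Chromium's same-origin checks disabled.
function buildLaunchOptions(extraArgs = []) {
  return {
    args: ['--disable-web-security', ...extraArgs],
    onSuccess: (result) => console.log(result.response.url, result.result),
  };
}

async function launchCrawler() {
  // Required lazily so buildLaunchOptions can be used without the package installed.
  const HCCrawler = require('headless-chrome-crawler');
  return HCCrawler.launch(buildLaunchOptions());
}
```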

ab14p · 2018-04-02T23:53:40Z

@yujiosaka @BubuAnabelas
I used waitUntil: 'networkidle0' and waitFor: { selectorOrFunctionOrTimeout: 20000 }
I also tried disabling the cross-origin policy in the launch options, as shown below, but still no luck.
args: ['--disable-web-security']

yujiosaka · 2018-04-20T18:55:20Z

@ab14p
I still have not encountered the same situation.
Can you provide the code?

yujiosaka · 2018-06-10T06:02:34Z

Closing this issue because no further information has been provided for over a month.
Please feel free to reopen it if you have further questions.

yujiosaka added the question label Apr 20, 2018

yujiosaka closed this as completed Jun 10, 2018
