Arguments of the chrome headless helper #22

Closed
RLesur opened this issue Oct 30, 2018 · 6 comments

@RLesur
Collaborator

RLesur commented Oct 30, 2018

The chrome_print function is great! Thanks!

When I use headless Chrome on HTML pages with paged.js, I always need the --virtual-time-budget CLI argument. That seems logical, because paged.js is launched when the page is loaded. If I understand headless Chrome correctly, the PDF is built at the same time, so without a budget the DOM has not yet been processed by paged.js when the PDF is generated.
For average documents (paged.js plus MathJax, about 30 pages), I often need to allow a budget of 5e+06 to 10e+06 virtual milliseconds (5-6 seconds of real time).
BTW, with puppeteer, it is easier to check that paged.js has finished its job.

What do you think of adding an argument to chrome_print() for the virtual time budget?

@yihui
Member

yihui commented Oct 31, 2018

I have only spent about an hour on this function. There is certainly a lot of room for improvement. We should definitely add such an argument to allow some waiting time before printing. I just didn't know the name of the argument, so thanks for the tip! Currently you can do this:

chrome_print('https://pagedown.rbind.io', extra_args = '--virtual-time-budget=5000000')

but it makes more sense to promote it to an argument of chrome_print() (and perhaps we should make it default to 5-6 seconds, too).
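One way the promotion could look, as a minimal sketch only: `virtual_time_arg()` and `chrome_print_wait()` are hypothetical names, not part of pagedown, and the default of 5e6 simply mirrors the value used in the example above. The wrapper does nothing more than forward the flag through `extra_args`.

```r
# Hypothetical helper (not part of pagedown): build the CLI flag from a
# budget given in virtual milliseconds.
virtual_time_arg <- function(budget_ms = 5e6) {
  stopifnot(is.numeric(budget_ms), length(budget_ms) == 1, budget_ms > 0)
  # Write the value in plain digits, as in the example above,
  # rather than in scientific notation
  paste0(
    '--virtual-time-budget=',
    format(budget_ms, scientific = FALSE, trim = TRUE)
  )
}

# Hypothetical wrapper: forwards the flag to chrome_print() via extra_args.
# Assumes the pagedown package is installed.
chrome_print_wait <- function(url, budget_ms = 5e6, ...) {
  pagedown::chrome_print(url, extra_args = virtual_time_arg(budget_ms), ...)
}
```

With this sketch, `chrome_print_wait('https://pagedown.rbind.io')` would be equivalent to the `extra_args` call shown above.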

The only issue with puppeteer is the dependency on Node. I'm not sure if an average R user is willing to install it. That said, we could certainly provide a wrapper function for puppeteer for those who don't mind installing Node.

@yihui
Member

yihui commented Dec 4, 2018

@RLesur I wonder if you could clear up my confusion about the --virtual-time-budget argument: https://stackoverflow.com/q/53548438/559676

@RLesur
Collaborator Author

RLesur commented Dec 4, 2018

I've had a quick look (mainly reading the sources). My guess is that the virtual time advances once the network stack becomes empty. In other words, I wouldn't be surprised if it took a similar amount of real time whether you test 1,000, 10,000 or 100,000 virtual seconds... (I'm not totally sure)

Details
From headless_shell.cc, we can see that headless Chrome uses the Emulation.setVirtualTimePolicy command with the policy parameter set to pauseIfNetworkFetchesPending.
The DevTools Protocol documentation states that "the virtual time base may not advance if there are any pending resource fetches" (that's why I think that the virtual time advances when the network stack is empty).

I think I could make some experiments with the DevTools Protocol in order to mimic the --print-to-pdf option. There are some interesting events that could help us to understand when the virtual time budget advances: virtualTimeAdvanced, virtualTimeBudgetExpired and virtualTimePaused.
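For reference, here is a rough sketch of the message such an experiment would send over the remote debugging websocket. The helper name `virtual_time_message()` is hypothetical; the method, the policy value and the `budget` parameter are the ones discussed above, and the envelope (`id`, `method`, `params`) is the usual DevTools Protocol command format. The sketch only builds the JSON string and does not open a connection.

```r
# Hypothetical helper: build the JSON text of an
# Emulation.setVirtualTimePolicy command.
# budget_ms is the budget in virtual milliseconds.
virtual_time_message <- function(id, budget_ms,
                                 policy = 'pauseIfNetworkFetchesPending') {
  sprintf(
    '{"id":%d,"method":"Emulation.setVirtualTimePolicy","params":{"policy":"%s","budget":%d}}',
    as.integer(id), policy, as.integer(budget_ms)
  )
}

# Example: a 100000 ms (100 virtual seconds) budget
cat(virtual_time_message(1, 100000), '\n')
```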

@yihui
Member

yihui commented Dec 4, 2018

So when the page is still being loaded, the virtual time won't advance. After it is fully loaded, the virtual time will start to advance. My confusion is why it doesn't make much difference whether I want it to advance for 10 seconds or 1000 seconds.

@RLesur
Collaborator Author

RLesur commented Dec 4, 2018

The virtual time budget
I did some tests. I am not sure I fully understand the behavior of the virtual time budget, but there seems to be a kind of "fast-forward" mechanism.

Here's one test with Chrome in remote debugging mode (note: I think that this script does not replicate the behavior of the --print-to-pdf option of the Chrome CLI):

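# Install crrri and write its debug log to a file
remotes::install_github('rlesur/crrri')
Sys.setenv(DEBUGME='crrri')
Sys.setenv(DEBUGME_OUTPUT_FILE='log1e5.txt')

library(crrri)

# Connect to Chrome in remote debugging mode
chrome <- chr_connect()

chrome %>%
  Network.enable() %>%
  Page.enable() %>%
  # budget is in virtual milliseconds: 100000L = 100 virtual seconds
  Emulation.setVirtualTimePolicy(policy = 'pauseIfNetworkFetchesPending', budget = 100000L, waitForNavigation = TRUE) %>%
  Page.navigate(url = 'https://www.chromestatus.com/') %>%
  # continue once the virtualTimeBudgetExpired event has fired
  Emulation.virtualTimeBudgetExpired() %>%
  chr_disconnect()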

log file: log1e5.txt

If you inspect the log, you will see that it takes about 5 seconds for 100 virtual seconds.
I also tested with 1000 virtual seconds; it took 47 seconds.
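As a quick check on these two measurements, the ratio of virtual to real time can be computed directly. This is only arithmetic on the two data points above (variable names are illustrative); it is not a general conversion rule.

```r
# The two measurements reported above
virtual_s <- c(100, 1000)  # virtual seconds requested
real_s    <- c(5, 47)      # real seconds observed

# Virtual seconds elapsed per real second
speedup <- virtual_s / real_s
print(round(speedup, 1))
# [1] 20.0 21.3
```

In both runs, virtual time flowed roughly 20 times faster than real time.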

These results were obtained with Chrome in remote debugging mode: Chrome surely takes extra time to send the messages through its websocket server. I suspect that the virtual time flows faster with the Chrome CLI.

There's a document referenced in the Chromium issue opened for Emulation.setVirtualTimePolicy: it describes the concept of virtual time.
I have to confess that I did not fully understand it but it mentions a fast forward mechanism.

As an intermediate conclusion, I think that we cannot easily establish a rule that converts virtual time into real time.

@yihui
Member

yihui commented Dec 4, 2018

Okay. That is very helpful! Thanks!
