Skip to content

Wait in click() #413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from
Draft

Conversation

jonthegeek
Copy link

Closes #405.

I tried to make this automatic, and got it working some of the time, but I couldn't find a way to detect that something might change, and thus wait in that situation. I tried to make it as clear as possible for users to be able to fix this. Once a strategy is agreed on, the same strategy should probably be applied to other methods.

An approach I worked on but failed to finalize would be to either wait for a loadEventFired() event or time out (probably after a relatively brief time, and only if no events occur in that window, or something along those lines). I couldn't find a clean way to check if anything is in progress and thus wait if so; if we're waiting for a loadEventFired that never fires, we'll introduced a new error for many click events.

I also tried adding private$refresh_root() as a callback on loadEventFired(), which might be a useful approach but it still doesn't solve the issue of users not waiting for that to occur before executing their next step.

I like this better than status quo (since it makes it possible to deal with these situations), but I'm not sure that it's quite there yet.

You work at posit now, not rstudio.
Closes tidyverse#405.

I tried to make this automatic, and got it working some of the time, but I couldn't find a way to detect that something might change, and thus wait in that situation. I tried to make it as clear as possible for users to be able to fix this. Once a strategy is agreed on, the same strategy should probably be applied to other methods.
R/live.R Outdated
if (grepl("-32000", cnd_message(cnd))) {
cli::cli_abort(
c(
"Can't find root node.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather make refresh_root() public and then suggest that the user call it. But I'd also want to be 100% sure we shouldn't just call refresh_root() on their behalf.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is it might not be ready yet. I feel like there has to be a way to essentially wait for the loadEventFired only if the page is navigating or reloading, but I couldn't get there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should always be waiting on the loadEventFired so we can update the root node.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did put refresh_root() into loadEventFired as a callback at one point, and that's probably worth re-implementing. The problem occurs when the user tries to do something that results in a find_nodes() after a click, but before everything has processed. If the click results in a loadEventFired() and we don't wait before find_nodes(), the root will be wrong (which triggered a thought, see below). But if we do wait when they click somewhere that doesn't navigate/reload, the loadEventFired() never occurs and times out, so we make them wait for nothing (and we throw a confusing warning or error).

But... something knows that the page is in an invalid state, because there's an error when you try to find_nodes() too soon. So in theory at THAT point waiting on loadEventFired() should work. I feel like that's getting into the issues that are discussed in https://rstudio.github.io/chromote/#loading-a-page-reliably though, so that might not solve it.

Perhaps create the promise right away on click (and anything else that might change the page), but only wait for it to resolve when there's an indication that it hasn't resolved? Ie, inside click we'd do something like this:

private$load_promise <- self$session$Page$loadEventFired(wait_ = FALSE)

And then in find_nodes() we could wait for that promise to resolve and try again (inside this try_fetch(), probably). Async programming still always feels weird to me, so I can't decide if this is hacky, correct, or somewhere in-between.

We might also have to capture the timeout message if the promise never resolves, to avoid random confusing warnings showing up for the user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation! I think that approach sounds like a way forward, and then I can double check it with Joe/Winston.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's closer now, I think, but it fails ungracefully for "normal" clicks, and I haven't yet found a way to clean this up. To see what I mean, run the "can click a button" test in test-live.R a few times. Or, more specifically:

sess <- read_html_live(html_test_path("click"))
sess$click("button")
Sys.sleep(10)
rm(sess)  

I can see where that happens in {chromote}, and it makes sense that it does so, but the only way I can think of to avoid the error message would be to set the timeout to Inf, and that just means we could stack up promises that never resolve. Or maybe line 149 of live.R should do something fancier when it sets up private$loadEvent_promise, to capture (and silently discard) the promise if it times out.

@jonthegeek
Copy link
Author

Another summer trip = another look at scraping fancier websites! I think I understand promises much better now than I did a year ago, so I'm seeing if I can do this a bit more "right". The thing that led me back was a page that never fires its load event, so there might end up being a second wave of updates after the first wave (mostly to export more to make it more possible for users to deal with advanced topics).

If you already have a separate effort to make LiveHTML behave better, let me know, and I'll hold off.

Update a snapshot that I shouldn't have changed. I'm assuming this is from an underlying change on the page, but we can revert this if I broke GHA by accepting it.
I used httr2 as an example of current best practice.
@jonthegeek
Copy link
Author

This is getting closer to a "best of both worlds" solution. Users can set up their own promise to wait for, or use the loadEventFired when that's relevant. It should work by default in most cases, but is fixable (and gives tips) if the page is slow or anything like that.

I can definitely see ways to make this more robust (eg, waiting for an actual timeout in the try_fetch() calls and implementing an async option throughout), but this is much more solid than the status quo.

I implemented the same fix in all the "event" methods, so users can set up promises to wait for if the thing they're doing results in a navigation. If a user hits the error message that led to this PR (which should be rarer now to begin with), it now tells them how to fix the issue.

Once this is sorted out, I could write a vignette showing some of the trickier options. I have a TidyTuesday in mind that will build off of this (and use a custom promise). I'll see if I can get it working with this and sort out a vignette if so.

@jonthegeek
Copy link
Author

I think I just figured out a simpler API (no need for wait_for args, but options for more explicit async for advanced users) that will also be more robust. I'm on the road, but will update the PR when I stop.

@jonthegeek jonthegeek marked this pull request as draft July 6, 2025 16:08
@jonthegeek
Copy link
Author

@hadley is like to add a wait argument to this method (and the other "action" methods) as part of this update, to allow advanced users to return a promise rather than always waiting for a result. If I do that, is wait OK (rather than the underlying wait_ from chromote)? The underscore makes it scary and weird 🙃

My current plan is to add a done arg (in place of the wait_for arg in the most recent commit), to set up a promise that signifies that this action has resolved. That would default to "auto", which will briefly wait for a documentUpdated event and assume navigation on the current page if none occurs. That should work in most cases, but I plan to also accept "load" to explicitly wait for loadEventFired, and then use an updated wait_for_selector for any other strings. We could also accept numbers for a simple delay. We'd also accept a promise object in case the user wants to get more fancy.

That would fix #405 + work for my next case (a page with a sort of "infinite scroll" that causes it to never fire the load event) + I think work for any other arbitrarily weir page... And those are precisely the sorts of pages that require this LiveHTML stuff in the first place.

I have most of the code theoretically worked out in conversations with Google Gemini during a road trip, I just need to sit at my computer and stitch snatches of ideas together and clean out the hallucinations I've presumably missed, etc. I'll probably get some of it done tonight but I'm not sure whether I'll finish).

Let me know what you think!

@hadley
Copy link
Member

hadley commented Jul 7, 2025

Given that this would change the return type, I think I'd prefer (e.g.) $click_async() like ellmer does.

@hadley
Copy link
Member

hadley commented Jul 7, 2025

And I've asked @shikokuchuo to take a look general; he is way better at async than me 😄.

@jonthegeek
Copy link
Author

Don't review the code as it is now in any case. I'm a lot better at async than I was 2 days ago, at least it sure seems to make more sense now!

Alternate methods for async makes sense to me! I'll take a closer look at ellmer before I implement these changes, too!

@jonthegeek
Copy link
Author

OK, I don't think my implementation will be quite like ellmer under the hood, but I can definitely split the methods that way. I'll almost definitely call the async versions from the normal versions, then self$session$wait_for() the resulting p (then return invisible(self) in most cases, I think; whatever it's doing now). That will make sure they don't drift apart in functionality. The biggest change will be around smarter ways to figure out when the action has completed, which I'm PRETTY sure I have all ready to go!

@shikokuchuo
Copy link
Member

I agree, splitting the methods for the different return types is the way to go.

I much prefer being explicit (even if it requires the user to know a bit more) rather than implement heuristics, as timing in this context is just so uncontrollable. Something that generally 'just works' and then breaks on the 'n'th try is to be avoided in my view. But sounds good that you have something in mind for getting the event completion.

Don't review the code as it is now in any case.

Just ping me once you're ready (no time pressure!)

@jonthegeek
Copy link
Author

I much prefer being explicit (even if it requires the user to know a bit more) rather than implement heuristics, as timing in this context is just so uncontrollable. Something that generally 'just works' and then breaks on the 'n'th try is to be avoided in my view. But sounds good that you have something in mind for getting the event completion.

I generally agree, but I'm trying to make it less likely to throw an error, and to make it throw something informative when it has to throw an error. The status quo throws an ugly, confusing, unfixable error any time an action causes navigation to a new page, and my initial fix throws either a warning or an error (I'd think an error but it doesn't halt anything so I can't remember for sure) long after the event (when the chromote session ends), about a timeout that didn't matter.

If the user is working interactively, these errors and warnings often don't matter anyway (the user can view() and continue when they see the page load, and just waiting a second to type the next bit of code will likely give the browser time to get things ready even without a view()), and I want the warnings and errors to be informative in cases where they do matter.

So the thing I plan to submit later does include a bit of a heuristic that I expect to "just work" in almost all cases (I'll do some testing to validate that, but obviously I can't test every scenario / hardware config / internet connection). To me it feels like rvest's LiveHTML is aiming to be the "usually just works" solution for people who haven't dug deep into chromote and aren't comfortable with promises. But then when it fails it will explain why, and give the user options to fix their code without having to learn all of the intricacies of chromote and chrome dev tools in most cases. I poured over that documentation and poked and prodded at things, to figure out what users probably want. And then, of course, if their situation is stranger, we'll give them the ability to implement a custom solution while keeping rvest's friendly wrapper.

That is, assuming it actually works how I think it will! I still haven't had a chance yet to sit down and actually type and test the latest iteration of what I think should work, but I'll get back to Austin sometime today and take some time to work things out!

… way things work (other than the fact that I added some pipeable wrappers for the methods), but it made it much easier for me to reason about. The next update will pretty much completely change the way waiting works, but it should be much easier to see it at that point!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LiveHTML object corrupted after $click()
3 participants