Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selecting elements for screenshotting based on tag content #43

Closed
psychemedia opened this issue Mar 14, 2022 · 21 comments
Closed

Selecting elements for screenshotting based on tag content #43

psychemedia opened this issue Mar 14, 2022 · 21 comments
Labels
enhancement New feature or request

Comments

@psychemedia
Copy link

Selecting elements as the target for screenshots based on CSS selectors does not (currently) allow for the selection of elements based on tag content, or relative to DOM elements selected based on tag content.

However, elements can be selected based on tag content using Javascript. It would be useful to allow for the selection of elements via Javascript as well as CSS.

Alternatively, support a method that can be called from a javascript scraper call that will apply a screen shot to a Javascript selected element.

At the moment, selector based screenshots seem to be focused in _selector_javascript(selectors) by:

let els = %s.map(s => document.querySelector(s));

As well as passing --selector(s), s, one approach might be to pass element(s) el returned from a --js-selector script?

@simonw simonw added the enhancement New feature or request label Mar 20, 2022
@simonw
Copy link
Owner

simonw commented Mar 20, 2022

This is an interesting idea. Right now there is a workaround for this, but it's pretty messy. You can use JavaScript to assign an additional class to an element, then take a screenshot of that:

shot-scraper https://simonwillison.net -s '.screenshot-me' --javascript '
var el = Array.from(
  document.querySelectorAll("p")
).filter(el => el.innerText.includes("search"))[0];
el.classList.add("screenshot-me");
'

Running that now gets me this:

simonwillison-net

@simonw
Copy link
Owner

simonw commented Mar 20, 2022

I'm trying to think of a syntactically elegant way to support this.

One way would be to support a fragment of JavaScript which is executed against every element on the page - so you would do this:

shot-scraper ... --js-selector 'el.innerText.includes("search")'

And it would effectively run the equivalent of my code above, except filtering against the list of elements returned by document.querySelectorAll('*')

That could actually work pretty well.

@simonw
Copy link
Owner

simonw commented Mar 20, 2022

Would --js-selector X return just the first element matching that filter, or would it select ALL elements matching that selector? Maybe a --js-selector-all option could select all of them?

@simonw
Copy link
Owner

simonw commented Mar 20, 2022

Alternative names for this option:

  • --js-select X and --js-select-all X
  • --js-filter X and --js-filter-all X
  • --filter X and --filter-all X

I'm not completely convinced by any of those yet.

@simonw
Copy link
Owner

simonw commented Mar 20, 2022

As a reminder, the current options for the shot-selector shot command are:

Options:
  -a, --auth FILENAME    Path to JSON authentication context file
  -w, --width INTEGER    Width of browser window, defaults to 1280
  -h, --height INTEGER   Height of browser window and shot - defaults to the
                         full height of the page
  -o, --output FILE
  -s, --selector TEXT    Take shot of first element matching this CSS selector
  -p, --padding INTEGER  When using selectors, add this much padding in pixels
  -j, --javascript TEXT  Execute this JS prior to taking the shot
  --retina               Use device scale factor of 2
  --quality INTEGER      Save as JPEG with this quality, e.g. 80
  --wait INTEGER         Wait this many milliseconds before taking the
                         screenshot
  -i, --interactive      Interact with the page in a browser before taking the
                         shot
  --devtools             Interact mode with developer tools
  --help                 Show this message and exit.

@simonw
Copy link
Owner

simonw commented Mar 20, 2022

I don't like --filter X meaning "first of the things that match this filter" because I would assume that --filter meant "ALL things that match this filter".

@simonw
Copy link
Owner

simonw commented Mar 20, 2022

Oh hang on...

  -s, --selector TEXT    Take shot of first element matching this CSS selector

We already have a precedent for --selector meaning "first thing that matches" rather than "all things that match".

@simonw
Copy link
Owner

simonw commented Mar 20, 2022

--js-selector is the most consistent with --selector, maybe that's the best option.

Or --js-filter since it's more of a filter than a selector operation. But again, that implies match all, not match first to me.

@simonw
Copy link
Owner

simonw commented Mar 20, 2022

I think I've talked myself around to your original suggestion of --js-selector here!

@simonw
Copy link
Owner

simonw commented Mar 20, 2022

If it's returning the first match it can use Array.find() instead of Array.filter(): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/find

The find() method returns the first element in the provided array that satisfies the provided testing function. If no values satisfy the testing function, undefined is returned.

@psychemedia
Copy link
Author

TBH, your original work-around isn't that clunky.. it's really elegant in how it riffs on what's currently available, feature wise

@rdmurphy
Copy link
Contributor

rdmurphy commented Mar 29, 2022

It'd require some reworking of a couple pieces of shot-scraper but I think what you're after @psychemedia could probably be accomplished using XPath via document.evaluate. Playwright supports XPath out of the box but as you mentioned the hard-coded querySelector is the limiting factor.

There could of course be an --xpath-selector or --xpath option, but it might be interesting to match Playwright's logic on this:

Selector starting with // or .. is assumed to be an xpath selector. For example, Playwright converts '//html/body' to 'xpath=//html/body'.

Then you could just use --selector (or add --multi-selector or something) and have it swap into XPath mode when it picks up on a XPath signal. (xpath= or // or ..)

Relevant Playwright client code:

https://github.com/microsoft/playwright/blob/1961959dcb726143586a384799f25f149b694214/packages/playwright-core/src/server/injected/xpathSelectorEngine.ts

@simonw
Copy link
Owner

simonw commented Apr 9, 2022

OK, I built a protoype. @psychemedia you can try it out by running:

pip install https://github.com/simonw/shot-scraper/archive/8a61c7e1ba90f1210868cf7f84ece78e5f8300db.zip

Here's a demo:

shot-scraper https://simonwillison.net/2022/Apr/8/weeknotes/ \
  --js-selector 'el.tagName == "P" && el.innerText.includes("shot-scraper")' \
  --padding 15 --retina

Which produced this image (of the first paragraph on the page to contain the term shot-scraper):

image

Note that I had to include el.tagName == "P" in the expression - because it scans through every tag on the page, so without that it returns the html tag since that includes the shot-scraper text somewhere in its innerText!

The demo also currently outputs debug info showing the JavaScript it generated - I'd remove that before shipping the feature:

% shot-scraper https://simonwillison.net/2022/Apr/8/weeknotes/ \
      --js-selector 'el.tagName == "P" && el.innerText.includes("shot-scraper")' \
      --padding 15 --retina
() => {
Array.from(
  document.getElementsByTagName('*')
).find(el => el.tagName == "P" && el.innerText.includes("shot-scraper")).classList.add("js-selector-a1f5ba0fc4a4317e58a3bd11a0f16b96");
} ['.js-selector-a1f5ba0fc4a4317e58a3bd11a0f16b96']
Screenshot of '.js-selector-a1f5ba0fc4a4317e58a3bd11a0f16b96' on 'https://simonwillison.net/2022/Apr/8/weeknotes/' written to 'simonwillison-net-2022-Apr-8-weeknotes.1.png'

@psychemedia
Copy link
Author

Ah, wonderful.. will look for something to try it out on...

@psychemedia
Copy link
Author

I notice that you need to uppercase the tagname?

@psychemedia
Copy link
Author

Ooh... nifty...

shot-scraper https://rallydatajunkie.com/sweden_2022_timing/overall-standings.html#overall-standings --js-selector 'el.tagName == "TR" && el.innerText.includes("ROV")' --padding 0 --retina

image

@psychemedia
Copy link
Author

Sort of thing you could easily weave into a twitterbot..

@simonw
Copy link
Owner

simonw commented Apr 11, 2022

The thing that's holding me off from merging this right now is that it's got me thinking about the multiple element case.

In trying this out I found myself wanting the ability to say "take a screenshot that incorporates all of the matching elements on the page" - where this mechanism currently filters for just the first match (using .find()):

Array.from(
document.getElementsByTagName('*')
).find(el => {}).classList.add("{}");

This then made me think that I actually want the same ability for just raw CSS selectors. But what should that look like?

@simonw
Copy link
Owner

simonw commented Apr 11, 2022

I implemented --selector-all. The next challenge is how the JavaScript equivalent of that should work.

--js-selector currently takes a JavaScript expression, e.g. --js-selector 'el.tagName == "TR" && el.innerText.includes("ROV")'

The obvious complement to the new --selector-all option would be --js-selector-all - does this make intuitive sense?

shot-scraper $URL --js-selector-all 'el.tagName == "TR" && el.innerText.includes("ROV")'

I was worried that it wouldn't make sense because there's no equivalent to the querySelector/querySelectorAll methods used by the CSS selector version - but having written it down I think it's OK. And the consistence between --selector-all and --js-selector-all feels right to me.

So I think the answer here is a --js-selector-all option, in addition to --js-selector.

@simonw
Copy link
Owner

simonw commented Apr 11, 2022

Manual testing (with debug output):

% shot-scraper https://simonwillison.net/ --js-selector-all 'Array.from(el.classList).includes("day")' --padding 10
() => {
Array.from(
  document.getElementsByTagName('*')
).filter(el => Array.from(el.classList).includes("day")).forEach(el => el.classList.add("js-selector-all-d43c969104a6001c9d50b56c59430864"));
} []
Screenshot of '.js-selector-all-d43c969104a6001c9d50b56c59430864' on 'https://simonwillison.net/' written to 'simonwillison-net.2.png'

simonw added a commit that referenced this issue Apr 11, 2022
Imitates --selector-all from #64
@simonw simonw closed this as completed in 518b6cc Apr 11, 2022
simonw added a commit that referenced this issue Apr 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants