shot-scraper html command #96

simonw · 2022-10-15T18:41:06Z

Using https://playwright.dev/python/docs/api/class-page#page-content

page.content()

Added in: v1.8

Returns: <str>

Gets the full HTML contents of the page, including the doctype.

shot-scraper html URL would output the full HTML for that page.

Where it gets fun is that this can still support various options such as --javascript for modifying the page with JavaScript before grabbing that HTML snapshot.
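A rough sketch of how that could fit together with Playwright's sync API, not necessarily how shot-scraper implements it. The names `snapshot_html` and `fetch_html` are made up for illustration, and the ordering (wait, then run JavaScript, then grab the HTML) is an assumption. The Playwright calls used here are `page.wait_for_timeout()`, `page.evaluate()`, `page.query_selector()` and `page.content()`.

```python
def snapshot_html(page, javascript=None, selector=None, wait=None):
    """Return HTML from an already-loaded Playwright page.

    Hypothetical helper: `page` only needs the four Playwright methods used below.
    """
    if wait:
        # Pause for the given number of milliseconds before the snapshot
        page.wait_for_timeout(wait)
    if javascript:
        # Let the caller modify the page before the HTML is captured
        page.evaluate(javascript)
    if selector:
        element = page.query_selector(selector)
        if element is None:
            raise ValueError(f"No element matches selector: {selector}")
        # outerHTML of the first matching element only
        return element.evaluate("el => el.outerHTML")
    # Full document, including the doctype
    return page.content()


def fetch_html(url, browser="chromium", **options):
    """Load `url` in a headless browser and return its HTML (hypothetical wrapper)."""
    # Imported lazily so snapshot_html() can be used without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # `browser` is assumed to be one of "chromium", "firefox" or "webkit"
        browser_obj = getattr(p, browser).launch()
        try:
            page = browser_obj.new_page()
            page.goto(url)
            return snapshot_html(page, **options)
        finally:
            browser_obj.close()
```

With Playwright and its browsers installed, `fetch_html("https://datasette.io/", javascript="document.title = 'changed'")` should return the page HTML as it stands after the script has run.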

Originally posted by @simonw in #92 (comment)

Suggested by @honzajde


simonw · 2022-10-15T18:58:37Z

Prototype, help-driven development:

% shot-scraper html --help
Usage: shot-scraper html [OPTIONS] URL

  Save the HTML of the specified page

  Usage:

      shot-scraper html https://datasette.io/

  Use -o to specify a filename:

      shot-scraper html https://datasette.io/ -o index.html

Options:
  -a, --auth FILENAME    Path to JSON authentication context file
  -o, --output FILE
  -j, --javascript TEXT  Execute this JS prior to saving the HTML
  -s, --selector TEXT    Return outerHTML of first element matching this CSS
                         selector
  --wait INTEGER         Wait this many milliseconds before taking the
                         snapshot
  --width TEXT           Browser window width
  --height TEXT          Browser window height
  --help                 Show this message and exit.

simonw · 2022-10-15T18:59:30Z

Question: should this default to saving to a file even if one is not specified (like shot-scraper shot and shot-scraper pdf) or should it default to outputting to stdout like shot-scraper javascript does?

simonw · 2022-10-15T19:08:40Z

I added --browser and tested it like this:

% shot-scraper html 'https://www.whatismybrowser.com/detect/what-is-my-user-agent/' -s '.detected_result a' -o - --browser firefox
<a href="https://developers.whatismybrowser.com/useragents/parse/?analyse-my-user-agent=yes">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0</a>
% shot-scraper html 'https://www.whatismybrowser.com/detect/what-is-my-user-agent/' -s '.detected_result a' -o -
<a href="https://developers.whatismybrowser.com/useragents/parse/?analyse-my-user-agent=yes">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/100.0.4863.0 Safari/537.36</a>

simonw · 2022-10-15T19:10:04Z

I know it's inconsistent with pdf and shot but it turns out every time I use this I want it to default to outputting to the console.

So I'm only going to save to a file if -o filename.html is used.
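That decision can be sketched roughly like this. `write_html` is a hypothetical helper, not the actual shot-scraper code, and treating `-` as an explicit alias for stdout is an assumption based on the `-o -` usage in the earlier test commands.

```python
import sys
from pathlib import Path


def write_html(html, output=None, stdout=None):
    """Write HTML to the console by default, or to a file if one is named.

    Hypothetical helper: `output` stands in for the value passed to -o.
    """
    stream = stdout if stdout is not None else sys.stdout
    if output is None or output == "-":
        # Default: print to the console, like shot-scraper javascript
        stream.write(html)
        if not html.endswith("\n"):
            stream.write("\n")
        return None
    # Only save to disk when a filename was given explicitly
    path = Path(output)
    path.write_text(html, encoding="utf-8")
    return path
```

Calling `write_html(html)` prints to the console, while `write_html(html, "index.html")` saves the file and returns its path.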

simonw · 2022-10-15T19:28:11Z

Documentation for this feature: https://shot-scraper.datasette.io/en/latest/html.html

simonw · 2022-10-15T19:30:03Z

I dropped --width and --height - the idea was that they could set the viewport for websites that run JavaScript that changes the HTML based on the browser width, but that feels like an obscure enough case that it's not worth having confusing extra options to cover it.

Refs #95, #96

simonw added the enhancement label Oct 15, 2022

simonw closed this as completed in 5048e21 Oct 15, 2022

simonw added a commit that referenced this issue Oct 15, 2022

Better documentation for shot-scraper html, refs #96

dbb9acd

simonw added a commit that referenced this issue Oct 15, 2022

Release 1.0

ab6c4d2

Refs #95, #96

simonw added a commit that referenced this issue Oct 23, 2022

Test for shot-scraper html -s, refs #96

5ba2383
