shot-scraper html command #96

Closed
simonw opened this issue Oct 15, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

simonw commented Oct 15, 2022

Using https://playwright.dev/python/docs/api/class-page#page-content

page.content()

Added in: v1.8

Gets the full HTML contents of the page, including the doctype.

shot-scraper html URL would output the full HTML for that page.

Where it gets fun is that this can still support various options such as --javascript for modifying the page with JavaScript before grabbing that HTML snapshot.
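A minimal sketch of that idea, assuming Playwright's sync API. The helper names `snapshot_html` and `fetch_html` are hypothetical and are not taken from shot-scraper itself; only `page.evaluate()`, `page.content()`, `page.goto()` and the `sync_playwright()` launcher are Playwright calls.

```python
from typing import Optional


def snapshot_html(page, javascript: Optional[str] = None) -> str:
    """Return the page's full HTML, optionally running JavaScript first.

    `page` is expected to behave like a Playwright sync-API Page.
    """
    if javascript:
        # Let the script modify the DOM before the snapshot is taken.
        page.evaluate(javascript)
    # page.content() returns the full HTML of the page, including the doctype.
    return page.content()


def fetch_html(url: str, javascript: Optional[str] = None) -> str:
    """Open `url` in headless Chromium and return its HTML snapshot."""
    # Imported lazily so snapshot_html() can be exercised without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            return snapshot_html(page, javascript)
        finally:
            browser.close()
```

Calling `fetch_html("https://datasette.io/", "document.title = 'changed'")` would, under these assumptions, return the HTML as it stands after the script has run.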

Originally posted by @simonw in #92 (comment)

Suggested by @honzajde

simonw added the enhancement (New feature or request) label Oct 15, 2022

simonw commented Oct 15, 2022

Prototype, help-driven development:

% shot-scraper html --help
Usage: shot-scraper html [OPTIONS] URL

  Save the HTML of the specified page

  Usage:

      shot-scraper html https://datasette.io/

  Use -o to specify a filename:

      shot-scraper html https://datasette.io/ -o index.html

Options:
  -a, --auth FILENAME    Path to JSON authentication context file
  -o, --output FILE
  -j, --javascript TEXT  Execute this JS prior to saving the HTML
  -s, --selector TEXT    Return outerHTML of first element matching this CSS
                         selector
  --wait INTEGER         Wait this many milliseconds before taking the
                         snapshot
  --width TEXT           Browser window width
  --height TEXT          Browser window height
  --help                 Show this message and exit.
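The prototype's option set can be sketched as a parser. The help text above is in Click's format; the version below uses the standard library's `argparse` purely as a stand-in so that it runs without extra dependencies, and `build_parser` is a hypothetical name. `--width` and `--height` are left out because they were dropped later in this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an argparse stand-in for the prototype `html` command."""
    parser = argparse.ArgumentParser(
        prog="shot-scraper html",
        description="Save the HTML of the specified page",
    )
    parser.add_argument("url", metavar="URL")
    parser.add_argument(
        "-a", "--auth", metavar="FILENAME",
        help="Path to JSON authentication context file",
    )
    # The prototype help text gives no description for --output.
    parser.add_argument("-o", "--output", metavar="FILE")
    parser.add_argument(
        "-j", "--javascript", metavar="TEXT",
        help="Execute this JS prior to saving the HTML",
    )
    parser.add_argument(
        "-s", "--selector", metavar="TEXT",
        help="Return outerHTML of first element matching this CSS selector",
    )
    parser.add_argument(
        "--wait", type=int, metavar="INTEGER",
        help="Wait this many milliseconds before taking the snapshot",
    )
    return parser
```

For example, `build_parser().parse_args(["https://datasette.io/", "-o", "index.html"])` yields a namespace whose `output` attribute is `"index.html"`.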

simonw commented Oct 15, 2022

Question: should this default to saving to a file even if one is not specified (like shot-scraper shot and shot-scraper pdf) or should it default to outputting to stdout like shot-scraper javascript does?

simonw commented Oct 15, 2022

I added --browser and tested it like this:

% shot-scraper html 'https://www.whatismybrowser.com/detect/what-is-my-user-agent/' -s '.detected_result a' -o - --browser firefox
<a href="https://developers.whatismybrowser.com/useragents/parse/?analyse-my-user-agent=yes">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0</a>
% shot-scraper html 'https://www.whatismybrowser.com/detect/what-is-my-user-agent/' -s '.detected_result a' -o -
<a href="https://developers.whatismybrowser.com/useragents/parse/?analyse-my-user-agent=yes">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/100.0.4863.0 Safari/537.36</a>
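The two runs above differ only in the browser engine. A rough sketch of how an engine name and a selector might be handled with Playwright's sync API follows. `pick_browser_type` and `selected_outer_html` are hypothetical helpers, and restricting the names to Playwright's three engine attributes (`chromium`, `firefox`, `webkit`) is an assumption, not necessarily the set of values that `--browser` accepts.

```python
# Engine attributes exposed by the Playwright object (an assumed whitelist).
SUPPORTED_BROWSERS = ("chromium", "firefox", "webkit")


def pick_browser_type(playwright, name: str = "chromium"):
    """Return the browser type (e.g. playwright.firefox) for an engine name."""
    if name not in SUPPORTED_BROWSERS:
        raise ValueError(f"Unsupported browser: {name}")
    return getattr(playwright, name)


def selected_outer_html(page, selector: str) -> str:
    """Return the outerHTML of the first element matching `selector`."""
    # query_selector() returns None when nothing matches.
    element = page.query_selector(selector)
    if element is None:
        raise LookupError(f"No element matches selector: {selector}")
    # Evaluate in the browser, with the matched element as the argument.
    return element.evaluate("el => el.outerHTML")
```

With a real Playwright instance `p`, `pick_browser_type(p, "firefox").launch()` would start Firefox, which matches the Firefox user agent reported in the first run.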

simonw commented Oct 15, 2022

I know it's inconsistent with pdf and shot, but it turns out that every time I use this I want it to output to the console by default.

So I'm only going to save to a file if -o filename.html is used.
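That decision might look roughly like the following. `write_html` is a hypothetical helper, and treating `-` as an alias for standard output is inferred from the `-o -` usage earlier in this thread rather than from the shot-scraper source.

```python
import sys
from pathlib import Path
from typing import Optional, TextIO


def write_html(
    html: str,
    output: Optional[str] = None,
    stdout: Optional[TextIO] = None,
) -> None:
    """Write HTML to the console unless an output filename is given."""
    stream = stdout if stdout is not None else sys.stdout
    if output is None or output == "-":
        # Default: print to the console, like `shot-scraper javascript`.
        stream.write(html)
    else:
        # Only touch the filesystem when -o filename.html was supplied.
        Path(output).write_text(html, encoding="utf-8")
```

The `stdout` parameter exists only so the console branch can be exercised with an in-memory stream.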

simonw closed this as completed in 5048e21 Oct 15, 2022

simonw commented Oct 15, 2022

Documentation for this feature: https://shot-scraper.datasette.io/en/latest/html.html

simonw commented Oct 15, 2022

I dropped --width and --height. The idea was that they could set the viewport for sites whose JavaScript changes the HTML based on the browser width, but that case feels obscure enough that it isn't worth adding confusing extra options to cover it.

simonw added a commit that referenced this issue Oct 15, 2022
simonw added a commit that referenced this issue Oct 23, 2022