Support returning output from evaluated JavaScript, including as status code #38

simonw · 2022-03-13T17:57:15Z

This is a bit of an out-there idea: what if you could execute custom JavaScript that returned a result, and then write that result to disk?

You could even skip the screenshot entirely and use this as a generic scraping tool at that point.

Bonus: if it can affect the exit code in some way it could be used as part of a CI flow to test something.

simonw · 2022-03-13T18:12:51Z

This could open up some truly crazy hacks... like running the JavaScript version of tesseract OCR against images inside a Playwright browser and returning the extracted text.

Maybe I should have called this project browser-scraper instead...

simonw · 2022-03-13T18:17:39Z

Maybe this:

shot-scraper javascript simonwillison.net \
  "return {'title': document.title}"

simonw · 2022-03-13T18:23:24Z

Or... this could be part of the existing shot command when you output to JSON rather than PNG or JPG.

That's slightly inconsistent with the pdf command though.

But it relates to the challenge of making the pdf and accessibility commands usable from the multi command:

YAML configuration for PDF shots #27

simonw · 2022-03-13T18:26:12Z

... and if the JSON returned is compatible with sqlite-utils there could be an option to write it straight to a SQLite database file!
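
A rough sketch of that idea, assuming the scraped result is a JSON array of flat objects. It uses Python's standard-library sqlite3 module as a stand-in for sqlite-utils, and `save_rows` and the `pages` table name are made up for illustration:

```python
import json
import sqlite3


def save_rows(conn, table, rows):
    """Insert a list of flat dicts into `table`, creating it if needed."""
    if not rows:
        return 0
    # Assumption: every row shares the keys of the first row.
    columns = list(rows[0].keys())
    quoted = ", ".join('"{}"'.format(c.replace('"', '""')) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    table_name = '"{}"'.format(table.replace('"', '""'))
    conn.execute("CREATE TABLE IF NOT EXISTS {} ({})".format(table_name, quoted))
    conn.executemany(
        "INSERT INTO {} ({}) VALUES ({})".format(table_name, quoted, placeholders),
        [tuple(row.get(c) for c in columns) for row in rows],
    )
    conn.commit()
    return len(rows)


# Pretend this JSON came from the JavaScript that ran in the browser.
scraped = json.loads('[{"title": "Datasette", "url": "https://datasette.io/"}]')
conn = sqlite3.connect(":memory:")
save_rows(conn, "pages", scraped)
print(conn.execute("SELECT title, url FROM pages").fetchall())
```

Nested objects or rows with differing keys would need extra handling that this sketch leaves out.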

(I like that because it helps justify me adding this project to https://datasette.io/tools/shot-scraper)

simonw · 2022-03-13T18:31:11Z

If I'm turning this into a general scraping tool I'll need a mechanism by which scrapes can happen across multiple pages.

That could be as simple as documenting and encouraging people to fetch subsequent pages using fetch(), but ideally there would be an easy way to trigger a full browser navigation without terminating the script early.

simonw · 2022-03-13T18:38:41Z

One way to handle multiple pages would be for the returned JSON to include a key that specifies more URLs that the code should execute on, effectively providing a miniature recurring crawl mechanism.
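
As a sketch of how such a crawl loop might work: the `next_urls` key, the `crawl` function and the `evaluate` callable below are all invented for illustration, and a dictionary stands in for the real browser.

```python
from collections import deque


def crawl(start_url, evaluate, max_pages=50):
    """Breadth-first crawl driven by a hypothetical "next_urls" key."""
    queue = deque([start_url])
    seen = {start_url}
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        result = evaluate(url)  # e.g. run the JavaScript against this page
        results.append({"url": url, "result": result})
        # "next_urls" is an invented key name, not something shot-scraper defines.
        next_urls = result.get("next_urls", []) if isinstance(result, dict) else []
        for next_url in next_urls:
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return results


# Stand-in for a real browser: maps each URL to what the JavaScript would return.
fake_site = {
    "/": {"title": "Home", "next_urls": ["/a", "/b"]},
    "/a": {"title": "A", "next_urls": ["/b", "/"]},
    "/b": {"title": "B"},
}
pages = crawl("/", fake_site.__getitem__)
print([page["url"] for page in pages])
```

The `seen` set and the `max_pages` cap keep pages that link back to each other from looping forever.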

simonw · 2022-03-13T18:54:04Z

Being able to pipe JavaScript directly to the command and have it execute in the browser and return the result would be very cool too.
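
One possible shape for this, borrowing the `-` convention that `shot-scraper multi -` already uses for standard input. The `read_javascript` helper is hypothetical:

```python
import io
import sys


def read_javascript(argument, stdin=None):
    """Return JavaScript from the argument, or from stdin when it is "-" or missing."""
    if argument is None or argument == "-":
        stream = stdin if stdin is not None else sys.stdin
        return stream.read()
    return argument


# Passed directly as an argument:
print(read_javascript("document.title"))
# Piped in, simulated here with an in-memory stream:
print(read_javascript("-", stdin=io.StringIO("document.title")))
```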

simonw · 2022-03-13T19:22:35Z

Prototype:

import json

import click
from playwright.sync_api import sync_playwright

# cli (the Click group) and _browser_context() are defined elsewhere in shot_scraper/cli.py


@cli.command()
@click.argument("url")
@click.argument("javascript")
@click.option(
    "-a",
    "--auth",
    type=click.File("r"),
    help="Path to JSON authentication context file",
)
@click.option(
    "-o",
    "--output",
    type=click.File("w"),
    default="-",  # "-" means standard output
)
def javascript(url, javascript, auth, output):
    """
    Execute JavaScript against the page and return the result

    Usage:

        shot-scraper javascript https://datasette.io/ "document.title"
    """
    with sync_playwright() as p:
        context, browser = _browser_context(p, auth)
        page = context.new_page()
        page.goto(url)
        # Run the JavaScript in the page and bring the result back to Python
        result = page.evaluate(javascript)
        browser.close()
    # default=str falls back to str() for values json cannot serialize
    output.write(json.dumps(result, indent=4, default=str))
    output.write("\n")
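
The `default=str` argument in the prototype is presumably a safety net for values that `json.dumps` cannot serialize on its own. A small illustration using a date value, which is an assumed example and not something the prototype is known to encounter:

```python
import datetime
import json

result = {"title": "Datasette", "fetched": datetime.date(2022, 3, 13)}

# Without default=str this raises TypeError, because date is not JSON serializable.
try:
    json.dumps(result)
    raised = False
except TypeError:
    raised = True
print("TypeError raised:", raised)

# With default=str, unsupported values are passed through str() first.
encoded = json.dumps(result, indent=4, default=str)
print(encoded)
```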

% shot-scraper javascript https://datasette.io "document.title"
"Datasette: An open source multi-tool for exploring and publishing data"

simonw · 2022-03-13T19:24:05Z

% shot-scraper javascript https://datasette.io "({title: document.title, h1: document.querySelector('h2').innerHTML})"
{
    "title": "Datasette: An open source multi-tool for exploring and publishing data",
    "h1": "<a href=\"/for/exploratory-analysis\">Exploratory data analysis</a>"
}

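Had to wrap the {...} in () for this to work. Without the parentheses, JavaScript parses a leading { as the start of a block statement rather than an object literal.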

simonw · 2022-03-13T19:26:30Z

This works too:

% shot-scraper javascript https://datasette.io "
new Promise(done => done({
    title: document.title,
    h2: document.querySelector('h2').innerHTML
}));"
{
    "title": "Datasette: An open source multi-tool for exploring and publishing data",
    "h2": "<a href=\"/for/exploratory-analysis\">Exploratory data analysis</a>"
}

simonw · 2022-03-13T19:39:04Z

This is fun:

% shot-scraper javascript simonwillison.net "({title: document.title, location: document.location})"
{
    "title": "Simon Willison\u2019s Weblog",
    "location": {
        "ancestorOrigins": {},
        "href": "https://simonwillison.net/",
        "origin": "https://simonwillison.net",
        "protocol": "https:",
        "host": "simonwillison.net",
        "hostname": "simonwillison.net",
        "port": "",
        "pathname": "/",
        "search": "",
        "hash": "",
        "assign": null,
        "reload": null,
        "replace": null,
        "toString": null
    }
}

simonw · 2022-03-13T19:58:48Z

Documentation: https://github.com/simonw/shot-scraper/blob/2cd70b021e28ffe8c5287ac8e62c221dadbc7746/README.md#scraping-pages-using-javascript

simonw · 2022-03-13T21:45:57Z

Not yet solved: how to raise an error that results in a non-zero exit code.

A couple of options:

1. Raise an exception from JavaScript and test that the Python script errors in a pleasing fashion (maybe with a way to hide the stack trace).
2. JavaScript can return {"error": "message"} and Python turns that into a non-zero exit code.

I like the first option best provided I can get the aesthetics to work well.
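
For comparison, the second option could be sketched roughly like this. The `exit_code_for` helper and the choice of 1 as the failure code are assumptions for illustration:

```python
def exit_code_for(result):
    """Return (exit_code, message) for a value returned by the JavaScript."""
    # Only a dict with an "error" key counts as a failure in this sketch.
    if isinstance(result, dict) and "error" in result:
        return 1, "Error: {}".format(result["error"])
    return 0, None


print(exit_code_for({"title": "Datasette"}))
print(exit_code_for({"error": "Fail!"}))
```

A drawback of this approach is that a page which legitimately returns an object with an `error` key would be treated as a failure.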

simonw · 2022-03-13T23:01:54Z

Reminder that in zsh the way to see the last exit code is echo $?:

% echo '34' | shot-scraper multi -
Error: YAML file must contain a list
% echo $?
1

simonw · 2022-03-13T23:03:00Z

Errors do indeed currently return a 1 exit code:

% shot-scraper javascript simonwillison.net 'foo('
Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/bin/shot-scraper", line 33, in <module>
    sys.exit(load_entry_point('shot-scraper', 'console_scripts', 'shot-scraper')())
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/shot-scraper/shot_scraper/cli.py", line 292, in javascript
    result = page.evaluate(javascript)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 6821, in evaluate
    self._sync(
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
    return task.result()
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_page.py", line 403, in evaluate
    return await self._main_frame.evaluate(expression, arg)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 248, in evaluate
    await self._channel.send(
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: SyntaxError: Unexpected end of input
    at eval (<anonymous>)
    at t.default.evaluate (<anonymous>:3:2389)
    at t.default.<anonymous> (<anonymous>:1:44)
% echo $?
1

simonw · 2022-03-13T23:09:13Z

Figured out how to make that nicer:

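        try:
            result = page.evaluate(javascript)
        except Error as error:  # Error is imported from playwright.sync_api
            # Re-raise as a ClickException so Click prints "Error: ..." and exits with code 1
            raise click.ClickException(error.message)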

Which results in this:

% shot-scraper javascript simonwillison.net 'document.querySelector(".faeou").innerHTML'
Error: TypeError: Cannot read properties of null (reading 'innerHTML')
    at eval (eval at evaluate (:3:2389), <anonymous>:1:33)
    at eval (<anonymous>)
    at t.default.evaluate (<anonymous>:3:2389)
    at t.default.<anonymous> (<anonymous>:1:44)
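
The message above still includes the JavaScript stack frames. If hiding them is wanted, one option is to keep only the first line of the message before handing it to Click. The `short_message` helper is hypothetical:

```python
def short_message(message):
    """Keep only the first line of a JavaScript error, dropping the stack frames."""
    lines = message.strip().splitlines()
    return lines[0] if lines else ""


message = """TypeError: Cannot read properties of null (reading 'innerHTML')
    at eval (eval at evaluate (:3:2389), <anonymous>:1:33)
    at eval (<anonymous>)"""
print(short_message(message))
```

This assumes the first line always carries the useful part of the error, which may not hold for every exception.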

simonw · 2022-03-13T23:12:23Z

Here's how to do custom errors. This can go in the README:

% shot-scraper javascript simonwillison.net "throw 'Fail!';"
Error: Fail!
% echo $?
1
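
In a CI script the same exit code can be checked from Python. This is a minimal sketch in which `run_check` is a made-up helper, and a small Python one-liner stands in for the real `shot-scraper javascript` call so the example runs without a browser:

```python
import subprocess
import sys


def run_check(command):
    """Run a command and return (passed, stderr_text) based on its exit code."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, completed.stderr.strip()


# Stand-in for: ["shot-scraper", "javascript", "simonwillison.net", "throw 'Fail!';"]
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('Error: Fail!\\n'); sys.exit(1)",
]
passed, message = run_check(failing)
print(passed, message)
```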

Refs #37, #38, #39, #40, #41

simonw added the enhancement and research labels Mar 13, 2022

simonw added a commit that referenced this issue Mar 13, 2022

shot-scraper javascript command, refs #38

2cd70b0

simonw mentioned this issue Mar 13, 2022

Mechanism to accept JavaScript from a file or standard input #39

Closed

simonw added a commit that referenced this issue Mar 13, 2022

Turn JavaScript errors into nicer Click errors, refs #38

f732db6

simonw closed this as completed in cac2250 Mar 13, 2022

simonw added a commit that referenced this issue Mar 13, 2022

Release 0.9

9c73ed6

