Support returning output from evaluated JavaScript, including as status code #38

Closed
simonw opened this issue Mar 13, 2022 · 17 comments
Labels
enhancement (New feature or request), research

Comments

@simonw
Owner

simonw commented Mar 13, 2022

This is a bit of an out-there idea: what if you could execute custom JavaScript that returned a result, and then write that result to disk?

You could even skip the screenshot entirely and use this as a generic scraping tool at that point.

Bonus: if it can affect the exit code in some way it could be used as part of a CI flow to test something.

simonw added the enhancement and research labels Mar 13, 2022

simonw commented Mar 13, 2022

This could open up some truly crazy hacks... like running the JavaScript version of tesseract OCR against images inside a Playwright browser and returning the extracted text.

Maybe I should have called this project browser-scraper instead...


simonw commented Mar 13, 2022

Maybe this:

shot-scraper javascript simonwillison.net \
  "return {'title': document.title}"


simonw commented Mar 13, 2022

Or... this could be part of the existing shot command when you output to JSON rather than PNG or JPG.

That's slightly inconsistent with the pdf command though.

But relates to the challenge of making PDF and accessibility actionable using the multi command:


simonw commented Mar 13, 2022

... and if the JSON returned is compatible with sqlite-utils there could be an option to write it straight to a SQLite database file!

(I like that because it helps justify me adding this project to https://datasette.io/tools/shot-scraper)
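
As a rough illustration of the "write it straight to SQLite" idea, the sketch below stores a list of flat JSON objects in a table using only Python's standard `sqlite3` module. It is a stand-in, not how sqlite-utils or shot-scraper does it; the helper names `quote` and `write_rows`, the `pages` table, and the sample row are all made up for the example.

```python
import json
import sqlite3


def quote(name):
    """Quote an identifier for SQLite (doubles any embedded quotes)."""
    return '"{}"'.format(name.replace('"', '""'))


def write_rows(db_path, table, rows):
    """Insert a list of flat dicts into `table`, creating it if needed.

    Hypothetical helper: column names come from the keys of the first row
    and every column is left untyped.
    """
    columns = list(rows[0].keys())
    column_sql = ", ".join(quote(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS {} ({})".format(quote(table), column_sql)
        )
        conn.executemany(
            "INSERT INTO {} ({}) VALUES ({})".format(
                quote(table), column_sql, placeholders
            ),
            [tuple(row.get(c) for c in columns) for row in rows],
        )
    return conn


# Example: JSON as it might come back from the evaluated JavaScript
scraped = json.loads('[{"title": "Datasette", "url": "https://datasette.io/"}]')
conn = write_rows(":memory:", "pages", scraped)
print(conn.execute("SELECT title, url FROM pages").fetchall())
```

Nested objects or lists would need to be flattened or stored as JSON text first; this sketch only handles flat rows.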


simonw commented Mar 13, 2022

If I'm turning this into a general scraping tool I'll need a mechanism by which scrapes can happen across multiple pages.

That could be as simple as documenting and encouraging people to fetch subsequent pages using fetch(), but ideally there would be an easy way to trigger a full browser navigation without terminating the script early.


simonw commented Mar 13, 2022

One way to handle multiple pages would be for the returned JSON to include a key that specifies more URLs that the code should execute on, effectively providing a miniature recursive crawl mechanism.
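
A minimal sketch of that idea, under stated assumptions: `evaluate` is a hypothetical callable standing in for "load the page and run the JavaScript", and `next_urls` is a key name invented here for the list of further URLs. A fake in-memory "site" is used so the sketch runs without a browser.

```python
from collections import deque


def crawl(start_url, evaluate, max_pages=50):
    """Run `evaluate(url)` on each page, following any URLs it returns.

    `evaluate` must return a dict. The "next_urls" key name is an
    assumption made for this sketch.
    """
    queue = deque([start_url])
    seen = {start_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        result = evaluate(url)
        results[url] = result
        for next_url in result.get("next_urls", []):
            if next_url not in seen:  # avoid re-visiting pages
                seen.add(next_url)
                queue.append(next_url)
    return results


# Fake "site": each entry is what the JavaScript would return for that URL
fake_pages = {
    "/": {"title": "Home", "next_urls": ["/a", "/b"]},
    "/a": {"title": "A", "next_urls": ["/", "/b"]},
    "/b": {"title": "B"},
}
results = crawl("/", fake_pages.__getitem__)
print(list(results))
```

The `seen` set and the `max_pages` cap are there so that pages linking back to each other cannot make the crawl loop forever.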


simonw commented Mar 13, 2022

Being able to pipe JavaScript directly to the command and have it execute in the browser and return the result would be very cool too.
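
One way this could work, sketched with the standard library only: treat a missing argument or `-` as "read the script from standard input". That convention, the helper name `read_javascript`, and the invocation in the comment are assumptions for illustration, not behaviour confirmed anywhere in this thread.

```python
import io
import sys


def read_javascript(argument=None, stdin=None):
    """Return the JavaScript to run.

    Assumed convention for this sketch: a missing argument or "-" means
    "read the script from standard input".
    """
    stdin = stdin if stdin is not None else sys.stdin
    if argument is None or argument == "-":
        return stdin.read()
    return argument


# Simulates a hypothetical: echo "document.title" | shot-scraper javascript <url> -
piped = io.StringIO("document.title\n")
print(read_javascript("-", stdin=piped))
```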


simonw commented Mar 13, 2022

Prototype:

@cli.command()
@click.argument("url")
@click.argument("javascript")
@click.option(
    "-a",
    "--auth",
    type=click.File("r"),
    help="Path to JSON authentication context file",
)
@click.option(
    "-o",
    "--output",
    type=click.File("w"),
    default="-",
)
def javascript(url, javascript, auth, output):
    """
    Execute JavaScript against the page and return the result

    Usage:

        shot-scraper javascript https://datasette.io/ "document.title"
    """
    with sync_playwright() as p:
        context, browser = _browser_context(p, auth)
        page = context.new_page()
        page.goto(url)
        result = page.evaluate(javascript)
        browser.close()
    output.write(json.dumps(result, indent=4, default=str))
    output.write("\n")

% shot-scraper javascript https://datasette.io "document.title"
"Datasette: An open source multi-tool for exploring and publishing data"


simonw commented Mar 13, 2022

% shot-scraper javascript https://datasette.io "({title: document.title, h1: document.querySelector('h2').innerHTML})"
{
    "title": "Datasette: An open source multi-tool for exploring and publishing data",
    "h1": "<a href=\"/for/exploratory-analysis\">Exploratory data analysis</a>"
}

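Had to wrap the { in () for this to work. Without the parentheses, a leading { is parsed as the start of a block statement rather than an object literal.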


simonw commented Mar 13, 2022

This works too:

% shot-scraper javascript https://datasette.io "
new Promise(done => done({
    title: document.title,
    h2: document.querySelector('h2').innerHTML
}));"
{
    "title": "Datasette: An open source multi-tool for exploring and publishing data",
    "h2": "<a href=\"/for/exploratory-analysis\">Exploratory data analysis</a>"
}


simonw commented Mar 13, 2022

This is fun:

% shot-scraper javascript simonwillison.net "({title: document.title, location: document.location})"
{
    "title": "Simon Willison\u2019s Weblog",
    "location": {
        "ancestorOrigins": {},
        "href": "https://simonwillison.net/",
        "origin": "https://simonwillison.net",
        "protocol": "https:",
        "host": "simonwillison.net",
        "hostname": "simonwillison.net",
        "port": "",
        "pathname": "/",
        "search": "",
        "hash": "",
        "assign": null,
        "reload": null,
        "replace": null,
        "toString": null
    }
}
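
The `null` entries above mean those values reached Python as `None`, since that is the only value `json.dumps` writes as `null`. Separately, the prototype passes `default=str` to `json.dumps`. A small standalone sketch of what that argument does: any value the `json` module cannot encode natively is converted with `str()` instead of raising `TypeError`. The `datetime` value is only an illustrative stand-in, not something this thread shows the command returning.

```python
import json
from datetime import datetime

result = {"title": "Example", "fetched": datetime(2022, 3, 13, 12, 0, 0)}

# Without default=, json.dumps raises TypeError for the datetime
try:
    json.dumps(result)
except TypeError as error:
    print("TypeError:", error)

# With default=str the datetime is written as its str() form
encoded = json.dumps(result, indent=4, default=str)
print(encoded)
```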


simonw commented Mar 13, 2022

Not yet solved: how to raise an error that results in a status code failure.

A couple of options:

  • Raise an exception from JavaScript and test that the Python script errors in a pleasing fashion (maybe with a way to hide the stack trace)
  • JavaScript can return {"error": "message"} and Python turns that into a non-zero code.

I like the first option best provided I can get the aesthetics to work well.
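
For comparison, the second option could be sketched roughly as follows. This is only an illustration of the `{"error": "message"}` convention described above; the helper name `result_to_exit` and the exact output format are assumptions.

```python
import json


def result_to_exit(result):
    """Map an evaluated result to (exit_code, text).

    Assumed convention for this sketch: a dict with an "error" key signals
    failure; anything else is printed as JSON with exit code 0.
    """
    if isinstance(result, dict) and "error" in result:
        return 1, "Error: {}".format(result["error"])
    return 0, json.dumps(result, indent=4, default=str)


print(result_to_exit({"error": "No <h1> on page"}))
print(result_to_exit({"title": "Datasette"})[0])
```

A drawback of this approach is that a page legitimately returning an object with an `error` key would be treated as a failure, which is one argument for the exception-based option.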


simonw commented Mar 13, 2022

Reminder that in zsh the way to see the last exit code is echo $?:

% echo '34' | shot-scraper multi -
Error: YAML file must contain a list
% echo $?                         
1


simonw commented Mar 13, 2022

Errors do indeed currently return a 1 exit code:

% shot-scraper javascript simonwillison.net 'foo(' 
Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/bin/shot-scraper", line 33, in <module>
    sys.exit(load_entry_point('shot-scraper', 'console_scripts', 'shot-scraper')())
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/shot-scraper/shot_scraper/cli.py", line 292, in javascript
    result = page.evaluate(javascript)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 6821, in evaluate
    self._sync(
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
    return task.result()
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_page.py", line 403, in evaluate
    return await self._main_frame.evaluate(expression, arg)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 248, in evaluate
    await self._channel.send(
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: SyntaxError: Unexpected end of input
    at eval (<anonymous>)
    at t.default.evaluate (<anonymous>:3:2389)
    at t.default.<anonymous> (<anonymous>:1:44)
% echo $?                                         
1


simonw commented Mar 13, 2022

Figured out how to make that nicer:

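        # Error is Playwright's exception class:
        #   from playwright.sync_api import sync_playwright, Error
        try:
            result = page.evaluate(javascript)
        except Error as error:
            # Re-raise as a Click error: prints "Error: ..." with no Python
            # traceback and exits with status 1
            raise click.ClickException(error.message)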

Which results in this:

% shot-scraper javascript simonwillison.net 'document.querySelector(".faeou").innerHTML'
Error: TypeError: Cannot read properties of null (reading 'innerHTML')
    at eval (eval at evaluate (:3:2389), <anonymous>:1:33)
    at eval (<anonymous>)
    at t.default.evaluate (<anonymous>:3:2389)
    at t.default.<anonymous> (<anonymous>:1:44)
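
The output above still includes the JavaScript stack. If hiding it turned out to be desirable, one possible approach (an assumption, not something this thread implements) is to keep only the first line of the message. `short_message` is a hypothetical helper name.

```python
def short_message(message):
    """Return only the first line of a multi-line error message."""
    lines = message.strip().splitlines()
    return lines[0] if lines else ""


message = (
    "TypeError: Cannot read properties of null (reading 'innerHTML')\n"
    "    at eval (<anonymous>)\n"
    "    at t.default.evaluate (<anonymous>:3:2389)"
)
print(short_message(message))
```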


simonw commented Mar 13, 2022

Here's how to do custom errors. This can go in the README:

% shot-scraper javascript simonwillison.net "throw 'Fail!';"                    
Error: Fail!
% echo $?                                                                               
1
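
A sketch of how a CI step written in Python might act on that exit code. Because running `shot-scraper` needs a browser, a child Python process that prints `Error: Fail!` to stderr and exits with status 1 is used here as a stand-in for the failing command; the helper name `check` is made up for the example.

```python
import subprocess
import sys


def check(command):
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, completed.stderr.strip()


# Stand-in for a failing `shot-scraper javascript ...` call
failing = [
    sys.executable,
    "-c",
    "import sys; print('Error: Fail!', file=sys.stderr); sys.exit(1)",
]
ok, stderr = check(failing)
print(ok, stderr)
```

In a real pipeline the list passed to `check` would be the actual `shot-scraper javascript` invocation, and a false `ok` would fail the build.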

simonw closed this as completed in cac2250 Mar 13, 2022
simonw added a commit that referenced this issue Mar 13, 2022