New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support returning output from evaluated JavaScript, including as status code #38
Comments
This could open up some truly crazy hacks... like running the JavaScript version of tesseract OCR against images inside a Playwright browser and returning the extracted text. Maybe I should have called this project |
Maybe this:
|
Or... this could be part of the existing That's slightly inconsistent with the But relates to the challenge of making PDF and accessibility actionable using the |
... and if the JSON returned is compatible with (I like that because it helps justify me adding this project to https://datasette.io/tools/shot-scraper) |
If I'm turning this into a general scraping tool I'll need a mechanism by which scrapes can happen across multiple pages. That could be as simple as documenting and encouraging people to fetch subsequent pages using |
One way to handle multiple pages would be if the returned JSON could include a key that specifies more URLs that the code should execute on. Effectively providing a miniature recurring crawl mechanism. |
Being able to pipe JavaScript directly to the command and have it execute in the browser and return the result would be very cool too. |
Prototype: @cli.command()
@click.argument("url")
@click.argument("javascript")
@click.option(
"-a",
"--auth",
type=click.File("r"),
help="Path to JSON authentication context file",
)
@click.option(
"-o",
"--output",
type=click.File("w"),
default="-",
)
def javascript(url, javascript, auth, output):
"""
Execute JavaScript against the page and return the result
Usage:
shot-scraper javascript https://datasette.io/ "document.title"
"""
with sync_playwright() as p:
context, browser = _browser_context(p, auth)
page = context.new_page()
page.goto(url)
result = page.evaluate(javascript)
browser.close()
output.write(json.dumps(result, indent=4, default=str))
output.write("\n")
|
Had to wrap the |
This works too:
|
This is fun:
|
Not yet solved: how to raise an error that results in a status code failure. A couple of options:
I like the first option best provided I can get the aesthetics to work well. |
Reminder that in
|
Errors do indeed currently return a 1 exit code:
|
Figured out how to make that nicer: try:
result = page.evaluate(javascript)
except Error as error:
raise click.ClickException(error.message) Which results in this:
|
Here's how to do custom errors. This can go in the README:
|
This is a bit of an out-there idea: what if you could execute custom JavaScript that returned a result, and then write that result to disk?
You could even skip the screenshot entirely and use this as a generic scraping tool at that point.
Bonus: if it can affect the exit code in some way it could be used as part of a CI flow to test something.
The text was updated successfully, but these errors were encountered: