Feature Request: Add shell function to scrape websites #10

Closed
David-Else opened this issue Jun 1, 2024 · 4 comments · Fixed by #11
Comments

David-Else (Contributor) commented Jun 1, 2024

Thanks for the amazing functions feature! In addition to searching the web, it would be great to be able to query actual web pages or sites for data. Maybe this is possible in some way with the duckduckgo function, although I suspect either BeautifulSoup or a combination of curl and pandoc to convert the HTML would be needed.

EDIT:

My first attempt fails with:

Call get_webpage '{}'
Call get_webpage '{}':
    error: the following required arguments were not provided:
      --query <QUERY>

#!/usr/bin/env bash
set -e

# @describe Takes in a URL for a webpage and returns the HTML as markdown.
# Use it to answer user questions that require access to web pages such as creating a summary.
# @meta require-tools curl pandoc
# @option --query! The URL to scrape.

main() {
  curl "$argc_query" | pandoc -f html-native_divs-native_spans -t gfm-raw_html | sed -E 's/!\[.*?\]\((data:image\/svg\+xml[^)]+)\)//g'
}

eval "$(argc --argc-eval "$0" "$@")"

sigoden (Owner) commented Jun 1, 2024

It seems the LLM doesn't understand the function parameters. Did you generate the functions.json file according to the readme instructions?

David-Else (Contributor, Author) commented

Thanks, I forgot to build after the last update. After removing # @meta require-tools curl pandoc, it works!

I am unsure whether what I made is correct and worth turning into a pull request. If you could suggest any improvements, that would be great :) Also, why does # @meta require-tools curl pandoc not work when I have both tools installed?

sigoden (Owner) commented Jun 1, 2024

The @meta require-tools line has a syntax error: the tool names must be separated by commas.

- # @meta require-tools curl pandoc
+ # @meta require-tools curl,pandoc

See https://github.com/sigoden/argc/blob/main/docs/specification.md#meta

You should submit a PR.

Two tips:

  • Rename --query to --url.
  • Add the -fsSL options to curl.
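
For readers following along, here is a minimal sketch of how the script from the first comment might look with the comma fix and both tips applied. It is not the version merged in #11. One change goes beyond the thread and is an assumption on my part: the sed pattern uses [^]]* instead of .*?, because sed -E has no lazy quantifiers and, in GNU sed as far as I know, .*? matches greedily and could swallow several images on one line.

```shell
#!/usr/bin/env bash
set -e

# @describe Takes in a URL for a webpage and returns the HTML as markdown.
# Use it to answer user questions that require access to web pages such as creating a summary.
# @meta require-tools curl,pandoc
# @option --url! The URL to scrape.

main() {
  # curl flags: -f fail on HTTP errors, -s silent, -S still show errors, -L follow redirects.
  # The sed step drops markdown images whose source is an inline SVG data URI.
  # Assumption (not from the thread): [^]]* replaces the original .*? so the
  # match cannot run past the closing bracket of a single image.
  curl -fsSL "$argc_url" |
    pandoc -f html-native_divs-native_spans -t gfm-raw_html |
    sed -E 's/!\[[^]]*\]\((data:image\/svg\+xml[^)]+)\)//g'
}

eval "$(argc --argc-eval "$0" "$@")"
```

As in the original attempt, argc exposes the --url option to main as $argc_url, and the functions.json file has to be regenerated after the rename so the LLM sees the new parameter.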

David-Else (Contributor, Author) commented

#11
