Scrape a web page using requests and get back either the raw HTML or a summary generated with html2text or lxml. Wikipedia pages are automatically rerouted to the Wikipedia API for much better results. Multiple URLs can be scraped in a single call.
JSON and XML responses are parsed and returned as Python data structures (JSON -> dict/list; XML -> xml.etree.ElementTree.Element).
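In rough terms the flow looks like the sketch below. This is illustrative only: the function name, header handling, and exact detection logic are assumptions rather than the tool's actual code.

```python
# Minimal sketch of the fetch-and-detect flow; not the tool's real implementation.
import xml.etree.ElementTree as ET

import requests


def fetch(url: str, user_agent: str = "Mozilla/5.0 (compatible; scraper)"):
    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    resp.raise_for_status()
    ctype = resp.headers.get("Content-Type", "").lower()

    if "json" in ctype:
        return resp.json()                      # dict or list
    if "xml" in ctype and "html" not in ctype:
        return ET.fromstring(resp.content)      # xml.etree.ElementTree.Element

    try:
        import html2text                        # optional dependency
        return html2text.html2text(resp.text)   # markdown-style summary
    except ImportError:
        from lxml import html as lxml_html      # fallback when html2text is absent
        return lxml_html.fromstring(resp.text).text_content()
```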
- Runs locally.
- Uses standard, well-known libraries.
- Detects JSON and XML and returns them directly.
- No API keys or subscriptions.
- No other services required.
- Plenty of aliases to help even dumb models find the tools.
- Best of all: it actually works!
- Your model may not like "scrape", so try "fetch", "get", or another alias if it refuses.
- Sometimes blocked by "anti-scrape" mechanisms; see the retry/user-agent sketch after this list.
- Try the "Fine Tuning" section for an example system instruction.
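This is where the `retries` and `user_agent` settings listed further down come in. A rough sketch of retry plus User-Agent fallback, with hypothetical helper names; fake_useragent is only used when it is installed:

```python
# Hypothetical helpers illustrating retries and User-Agent fallback; not the tool's code.
import time

import requests

STATIC_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"


def pick_user_agent() -> str:
    try:
        from fake_useragent import UserAgent   # optional dependency
        return UserAgent().random
    except Exception:                           # not installed or UA data unavailable
        return STATIC_UA


def get_with_retries(url: str, retries: int = 3) -> requests.Response:
    last_exc: Exception = RuntimeError(f"no attempts made for {url}")
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers={"User-Agent": pick_user_agent()}, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            last_exc = exc
            time.sleep(2 ** attempt)            # simple backoff between attempts
    raise last_exc
```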
- Scrape function now allows multiple URLs to be given.
- Bugfixes, better aliases.
- Added Wikipedia helpers.
- Auto-redirect Wikipedia pages to the Wikipedia API (see the sketch after this list).
- Removed the BeautifulSoup dependency.
- Made html2text optional, with an lxml fallback function.
- Caching changed to the built-in lru_cache.
- Uses fake_useragent if available, falls back to a static string.
- Automatically detect JSON and XML data and return it directly (rather than parsing it as HTML).
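The Wikipedia rerouting and caching entries above translate to something like this sketch. The REST `page/summary` endpoint is an assumption about which Wikipedia API is targeted, and `wiki_lang` here mirrors the valve of the same name:

```python
# Sketch of Wikipedia rerouting plus built-in caching; the endpoint choice is an
# assumption, not necessarily what the tool uses.
from functools import lru_cache
from urllib.parse import urlparse

import requests


def reroute_wikipedia(url: str, wiki_lang: str = "en") -> str:
    parts = urlparse(url)
    if parts.hostname and parts.hostname.endswith("wikipedia.org") and parts.path.startswith("/wiki/"):
        title = parts.path[len("/wiki/"):]
        # Wikipedia's REST summary endpoint returns clean JSON for a page title.
        return f"https://{wiki_lang}.wikipedia.org/api/rest_v1/page/summary/{title}"
    return url


@lru_cache(maxsize=128)
def cached_get(url: str) -> str:
    return requests.get(url, timeout=30).text
```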
- scrape https://www.reddit.com/r/openwebui.rss and explain the results
- fetch the page at https://webscraper.io/test-sites/e-commerce/allinone and explain the html structure
- explain the contents of https://www.cs.utexas.edu/~mitra/csFall2010/cs329/lectures/xml/xslplanes.1.xml.txt
- https://openwebui.com/robots.txt is scraping allowed?
- get https://www.web-scraping.dev/product/2 and give me a summary
- get the wikipedia page for "Beer" and explain it to me
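Prompts like these can also hand over several URLs at once. A sketch of how concurrency-limited multi-URL fetching might be wired up, assuming asyncio on top of blocking requests calls; the `concurrency` argument here corresponds to the valve listed below:

```python
# Sketch of concurrency-limited multi-URL fetching; an assumption about the
# implementation, shown only to illustrate the idea.
import asyncio

import requests


async def scrape_many(urls: list[str], concurrency: int = 4) -> dict[str, str]:
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch_one(url: str) -> tuple[str, str]:
        async with semaphore:
            # requests is blocking, so run each call in a worker thread.
            resp = await asyncio.to_thread(requests.get, url, timeout=30)
            return url, resp.text

    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(results)


# Usage: asyncio.run(scrape_many(["https://example.com", "https://example.org"]))
```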
- user_agent: The User-Agent to pretend to be.
- retries: Number of times to attempt the scrape.
- min_summary_size: The minimum page size before a summary is allowed.
- concurrency: Max parallel fetches for multi-URL scraping.
- allow_hosts: Optional host allowlist (exact hostnames).
- deny_hosts: Optional host blocklist (allowlist entries take precedence; see the host check sketched after this list).
- wiki_lang: Language code for the Wikipedia API (e.g., 'en', 'de').
- max_body_bytes: Truncate large bodies to this many bytes.
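In an Open WebUI tool these settings typically live on a pydantic `Valves` model. A sketch with the fields above plus the allow/deny host check; the defaults shown are placeholders, not the tool's actual defaults:

```python
# Illustrative Valves model and host check; defaults and anything beyond the
# listed field names are assumptions.
from urllib.parse import urlparse

from pydantic import BaseModel, Field


class Valves(BaseModel):
    user_agent: str = Field("", description="User-Agent to pretend to be (blank = auto)")
    retries: int = 3
    min_summary_size: int = 1000
    concurrency: int = 4
    allow_hosts: list[str] = Field(default_factory=list)
    deny_hosts: list[str] = Field(default_factory=list)
    wiki_lang: str = "en"
    max_body_bytes: int = 1_000_000


def host_allowed(url: str, valves: Valves) -> bool:
    host = urlparse(url).hostname or ""
    if host in valves.allow_hosts:      # allowlist entries take precedence
        return True
    if host in valves.deny_hosts:
        return False
    return not valves.allow_hosts       # empty allowlist means "allow by default"
```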
You have the async_webscraper/scrape tool.
It allows you to retrieve either the HTML or an auto-generated summary.
The summary is much shorter and useful for quick overviews; the HTML is longer and better for deeper dives.
The Rules:
- Do not make anything up if the scrape fails.
- When calling this tool, make sure to send only properly formatted, complete URLs.
- As we only make a single request per user input, robots.txt conventions allow us to fetch from any site that doesn't explicitly disallow it.
Feedback is more than welcome. Author: openwebui@zackallison.com