zallison/async_webscraper_openwebui

About

Scrape a web page using requests and get back either the raw HTML or a summary generated with html2text or lxml. Wikipedia pages are automatically rerouted to the Wikipedia API for much better results. Multiple URLs can be scraped in a single call.

Data of type XML or JSON will be parsed and returned as Python data structures (JSON -> dict/list; XML -> xml.etree.ElementTree.Element).
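A minimal sketch of how that detection and parsing could work (the helper name and detection heuristics are assumptions for illustration, not the tool's actual code):

```python
import json
import xml.etree.ElementTree as ET

def parse_structured(body: str, content_type: str = ""):
    """Return parsed JSON or XML if the body looks structured, else the raw text.

    Hypothetical helper; the real tool may detect content types differently.
    """
    stripped = body.lstrip()
    if "json" in content_type or stripped.startswith(("{", "[")):
        try:
            return json.loads(body)      # JSON -> dict/list
        except json.JSONDecodeError:
            pass
    if "xml" in content_type or stripped.startswith("<?xml"):
        try:
            return ET.fromstring(body)   # XML -> xml.etree.ElementTree.Element
        except ET.ParseError:
            pass
    return body                          # not structured data: return the text as-is
```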


Features:

  • Runs locally
  • Uses standard, well known libraries
  • Detects JSON and XML and returns it directly
  • No API keys or subscriptions
  • No other services required
  • Plenty of aliases to help even dumb models find the tools
  • Best of all: it actually works!

Notes:

  • Your model may not like "scrape", so try "fetch", "get", or other phrasings if it refuses.
  • Requests are sometimes blocked by "anti-scrape" mechanisms
  • Try the "Fine Tuning" section for an example system instruction.

New in v0.1.4:

  • The scrape function now accepts multiple URLs per call (see the fetch sketch after this list)
  • Bugfixes, better aliases
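A rough sketch of how multiple URLs could be fetched concurrently with requests and asyncio (the function name, signature, and defaults are assumptions, not the tool's actual implementation):

```python
import asyncio
import requests

async def scrape_many(urls: list[str], concurrency: int = 4) -> dict[str, str]:
    """Fetch several URLs in parallel, bounded by a semaphore.

    Illustrative only; the real tool's multi-URL handling may differ.
    """
    sem = asyncio.Semaphore(concurrency)

    async def fetch(url: str) -> str:
        async with sem:
            # requests is blocking, so run it in a worker thread
            resp = await asyncio.to_thread(requests.get, url, timeout=15)
            resp.raise_for_status()
            return resp.text

    bodies = await asyncio.gather(*(fetch(u) for u in urls))
    return dict(zip(urls, bodies))
```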

New in v0.1.3:

  • Added Wikipedia helpers
  • Auto-redirect to the Wikipedia API (a sketch of the URL rewrite follows below)
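One way such a redirect could be implemented, assuming the REST summary endpoint is the target (the actual route the tool uses may differ):

```python
from urllib.parse import quote, urlparse

def wikipedia_api_url(url: str) -> str:
    """Rewrite a regular Wikipedia article URL to the REST summary endpoint.

    Hypothetical helper; shown only to illustrate the idea of the redirect.
    """
    parts = urlparse(url)
    host, path = parts.hostname or "", parts.path
    if host.endswith("wikipedia.org") and path.startswith("/wiki/"):
        lang = host.split(".")[0]   # e.g. "en" from en.wikipedia.org
        title = path[len("/wiki/"):]
        return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{quote(title)}"
    return url                      # not a Wikipedia article: leave untouched
```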

New in v0.1.2:

  • Removed the Beautiful Soup dependency
  • Made html2text optional, with an lxml fallback function
  • Caching changed to the built-in lru_cache
  • Uses fake_useragent if available, falls back to a static string (see the sketch below)
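The optional-dependency pattern could look roughly like this (the fallback string, cache size, and function names are assumptions, not the tool's actual code):

```python
from functools import lru_cache

import requests

# Optional dependency: fake_useragent for a rotating user agent.
try:
    from fake_useragent import UserAgent
    _user_agent = UserAgent().random
except ImportError:
    _user_agent = "Mozilla/5.0 (compatible; async-webscraper)"  # assumed static fallback

# Optional dependency: html2text for summaries, with an lxml fallback.
try:
    import html2text

    def to_text(html: str) -> str:
        return html2text.html2text(html)
except ImportError:
    from lxml import html as lxml_html

    def to_text(html: str) -> str:
        # lxml fallback: strip markup and keep only the visible text
        return lxml_html.fromstring(html).text_content()

@lru_cache(maxsize=128)
def cached_fetch(url: str) -> str:
    """Fetch a page once and reuse the body on repeated requests."""
    resp = requests.get(url, headers={"User-Agent": _user_agent}, timeout=15)
    resp.raise_for_status()
    return resp.text
```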

New in v0.1.1:

  • Automatically detect JSON and XML data and return it without parsing

Example:
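A chat request that should trigger the tool might look like "Fetch https://example.com and give me a summary" (illustrative prompt only; the exact wording your model responds to will vary).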


Valves:

  • user_agent: The agent to pretend to be.
  • retries: Number of times to attempt the scrape.
  • min_summary_size: The minimum size of a page before a summary is allowed.
  • concurrency: Max parallel fetches for multi-URL scraping.
  • allow_hosts: Optional host allowlist (exact hostnames).
  • deny_hosts: Optional host blocklist (allowlist entries take precedence).
  • wiki_lang: Language code for the Wikipedia API (e.g., 'en', 'de').
  • max_body_bytes: Truncate large bodies to this many bytes.
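In an Open WebUI tool these valves would typically be declared as a Pydantic model; a sketch of what that might look like here (the default values below are assumptions, not the tool's real defaults):

```python
from pydantic import BaseModel, Field

class Valves(BaseModel):
    # Defaults are illustrative guesses; check the tool's source for the real ones.
    user_agent: str = Field("", description="The agent to pretend to be")
    retries: int = Field(3, description="Number of times to attempt the scrape")
    min_summary_size: int = Field(2048, description="Minimum page size before a summary is allowed")
    concurrency: int = Field(4, description="Max parallel fetches for multi-URL scraping")
    allow_hosts: str = Field("", description="Comma-separated host allowlist (exact hostnames)")
    deny_hosts: str = Field("", description="Comma-separated host blocklist")
    wiki_lang: str = Field("en", description="Language code for the Wikipedia API")
    max_body_bytes: int = Field(100_000, description="Truncate large bodies to this many bytes")
```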

Fine Tuning

You have the async_webscraper/scrape tool.
It allows you to retrieve either the html or an auto-generated summary.
The summary is much shorter and useful for quick overviews; the html is longer and better for deeper dives.

The Rules:

- Do not make anything up if the scrape fails.
- When calling this tool, make sure to send only properly formatted, complete URLs.
- As we’re only making a single request per user input, the standards of robots.txt allow us to fetch from every site that doesn’t explicitly disallow it.


Feedback is more than welcome. Author: openwebui@zackallison.com
