zallison/async_webscraper_openwebui

About

Scrape a web page using requests and get back either the raw HTML or a summary generated with html2text or lxml. Wikipedia pages are automatically rerouted to the Wikipedia API for much better results. Multiple URLs can be scraped in a single call.

Data of type XML or JSON will be parsed and returned as Python data structures (JSON -> dict/list; XML -> xml.etree.ElementTree.Element).
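A minimal sketch of how that detection and parsing could work (the helper name and detection heuristics are assumptions for illustration, not the tool's actual code):

```python
import json
import xml.etree.ElementTree as ET

def parse_structured(body: str, content_type: str = ""):
    """Return parsed JSON or XML if the body looks structured, else the raw text.

    Hypothetical helper; the real tool may detect content types differently.
    """
    stripped = body.lstrip()
    if "json" in content_type or stripped.startswith(("{", "[")):
        try:
            return json.loads(body)      # JSON -> dict/list
        except json.JSONDecodeError:
            pass
    if "xml" in content_type or stripped.startswith("<?xml"):
        try:
            return ET.fromstring(body)   # XML -> xml.etree.ElementTree.Element
        except ET.ParseError:
            pass
    return body                          # not structured data: return the text as-is
```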


Features:

  • Runs locally
  • Uses standard, well known libraries
  • Detects JSON and XML and returns it directly
  • No API keys or subscriptions
  • No other services required
  • Plenty of aliases to help even dumb models find the tools
  • Best of all: it actually works!

Notes:

  • Your model may not like "scrape", so try "fetch", "get", or other phrasings if it refuses.
  • Requests are sometimes blocked by "anti-scrape" mechanisms
  • Try the "Fine Tuning" section for an example system instruction.

New in v0.1.4:

  • The scrape function now accepts multiple URLs per call (see the fetch sketch after this list)
  • Bugfixes, better aliases
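A rough sketch of how multiple URLs could be fetched concurrently with requests and asyncio (the function name, signature, and defaults are assumptions, not the tool's actual implementation):

```python
import asyncio
import requests

async def scrape_many(urls: list[str], concurrency: int = 4) -> dict[str, str]:
    """Fetch several URLs in parallel, bounded by a semaphore.

    Illustrative only; the real tool's multi-URL handling may differ.
    """
    sem = asyncio.Semaphore(concurrency)

    async def fetch(url: str) -> str:
        async with sem:
            # requests is blocking, so run it in a worker thread
            resp = await asyncio.to_thread(requests.get, url, timeout=15)
            resp.raise_for_status()
            return resp.text

    bodies = await asyncio.gather(*(fetch(u) for u in urls))
    return dict(zip(urls, bodies))
```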

New in v0.1.3:

  • Added Wikipedia helpers
  • Auto-redirect to the Wikipedia API (a sketch of the URL rewrite follows below)
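One way such a redirect could be implemented, assuming the REST summary endpoint is the target (the actual route the tool uses may differ):

```python
from urllib.parse import quote, urlparse

def wikipedia_api_url(url: str) -> str:
    """Rewrite a regular Wikipedia article URL to the REST summary endpoint.

    Hypothetical helper; shown only to illustrate the idea of the redirect.
    """
    parts = urlparse(url)
    host, path = parts.hostname or "", parts.path
    if host.endswith("wikipedia.org") and path.startswith("/wiki/"):
        lang = host.split(".")[0]   # e.g. "en" from en.wikipedia.org
        title = path[len("/wiki/"):]
        return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{quote(title)}"
    return url                      # not a Wikipedia article: leave untouched
```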

New in v0.1.2:

  • Removed the Beautiful Soup dependency
  • Made html2text optional, with an lxml fallback function
  • Caching changed to the built-in lru_cache
  • Uses fake_useragent if available, falls back to a static string (see the sketch below)
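The optional-dependency pattern could look roughly like this (the fallback string, cache size, and function names are assumptions, not the tool's actual code):

```python
from functools import lru_cache

import requests

# Optional dependency: fake_useragent for a rotating user agent.
try:
    from fake_useragent import UserAgent
    _user_agent = UserAgent().random
except ImportError:
    _user_agent = "Mozilla/5.0 (compatible; async-webscraper)"  # assumed static fallback

# Optional dependency: html2text for summaries, with an lxml fallback.
try:
    import html2text

    def to_text(html: str) -> str:
        return html2text.html2text(html)
except ImportError:
    from lxml import html as lxml_html

    def to_text(html: str) -> str:
        # lxml fallback: strip markup and keep only the visible text
        return lxml_html.fromstring(html).text_content()

@lru_cache(maxsize=128)
def cached_fetch(url: str) -> str:
    """Fetch a page once and reuse the body on repeated requests."""
    resp = requests.get(url, headers={"User-Agent": _user_agent}, timeout=15)
    resp.raise_for_status()
    return resp.text
```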

New in v0.1.1:

  • Automatically detect JSON and XML data and return it without parsing

Example:
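A chat request that should trigger the tool might look like "Fetch https://example.com and give me a summary" (illustrative prompt only; the exact wording your model responds to will vary).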


Valves:

  • user_agent: The agent to pretend to be.
  • retries: Number of times to attempt the scrape.
  • min_summary_size: The minimum size of a page before a summary is allowed.
  • concurrency: Max parallel fetches for multi-URL scraping.
  • allow_hosts: Optional host allowlist (exact hostnames).
  • deny_hosts: Optional host blocklist (allowlist entries take precedence).
  • wiki_lang: Language code for the Wikipedia API (e.g., 'en', 'de').
  • max_body_bytes: Truncate large bodies to this many bytes.
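In an Open WebUI tool these valves would typically be declared as a Pydantic model; a sketch of what that might look like here (the default values below are assumptions, not the tool's real defaults):

```python
from pydantic import BaseModel, Field

class Valves(BaseModel):
    # Defaults are illustrative guesses; check the tool's source for the real ones.
    user_agent: str = Field("", description="The agent to pretend to be")
    retries: int = Field(3, description="Number of times to attempt the scrape")
    min_summary_size: int = Field(2048, description="Minimum page size before a summary is allowed")
    concurrency: int = Field(4, description="Max parallel fetches for multi-URL scraping")
    allow_hosts: str = Field("", description="Comma-separated host allowlist (exact hostnames)")
    deny_hosts: str = Field("", description="Comma-separated host blocklist")
    wiki_lang: str = Field("en", description="Language code for the Wikipedia API")
    max_body_bytes: int = Field(100_000, description="Truncate large bodies to this many bytes")
```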

Fine Tuning

You have the async_webscraper/scrape tool.
It allows you to retrieve either the html or an auto-generated summary.
The summary is much shorter and useful for quick overviews; the html is longer and better for deeper dives.

The Rules:

- Do not make anything up if the scrape fails.
- When calling this tool, make sure to send only properly formatted, complete URLs.
- As we’re only making a single request per user input, the standards of robots.txt allow us to fetch from every site that doesn’t explicitly disallow it.


Feedback is more than welcome. Author: openwebui@zackallison.com
