Convert webpages into feeds (via ruleset or heuristic)


web2feed: turn webpages into feeds

web2feed is a script that turns any webpage into a feed. It first tries site-specific rules (for popular sites and detectable software packages), then falls back to heuristics (TODO).
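The rules-first, heuristics-second order described above could be sketched roughly as follows. This is only an illustration: `SITE_RULES`, `rule`, `scrape_example`, `scrape_heuristic`, and `web2feed_sketch` are hypothetical names, not the script's actual API, and keying rules by hostname is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping a hostname to a site-specific scraper.
SITE_RULES = {}


def rule(hostname):
    """Register a scraper function for one hostname (illustrative only)."""
    def register(func):
        SITE_RULES[hostname] = func
        return func
    return register


@rule("example.com")
def scrape_example(html):
    # Placeholder: a real rule would parse the HTML for entries.
    return [{"uri": "http://example.com/post/1", "title": "Example post"}]


def scrape_heuristic(html):
    # Stand-in for the heuristic fallback, which is still TODO.
    return []


def web2feed_sketch(url, html):
    """Use a site-specific rule if one matches, else the heuristic."""
    host = urlparse(url).hostname or ""
    scraper = SITE_RULES.get(host, scrape_heuristic)
    return scraper(html)
```

Keeping the rules in a lookup table means a new site can be supported by adding one function, without touching the dispatch logic.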

The premise is that RSS/Atom feeds often don't tell the whole story. We should be able to scrape any webpage for content regardless of what the author wants to make easily available.

Output is currently a list of dictionaries; serializations (JSON, XML, RDF) are planned. Other options to be developed include advertisement/JavaScript removal, link/image/media isolation, etc.
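Since serialization is not implemented yet, here is a minimal sketch of what JSON output might look like. It assumes the standard-library `json` module (simplejson's `dumps` accepts the same `default` argument) and assumes dates are written as ISO 8601 strings; `encode_date` and the sample entry are hypothetical.

```python
import json
from datetime import date, datetime


def encode_date(value):
    """Fallback encoder: datetime objects are not JSON-serializable as-is."""
    if isinstance(value, date):  # datetime is a subclass of date
        return value.isoformat()
    raise TypeError("cannot serialize %r" % (value,))


# Hypothetical sample output: a list of dictionaries.
entries = [
    {
        "uri": "http://example.com/post/1",
        "title": "Example post",
        "date": datetime(2010, 1, 2, 15, 30),
    }
]

serialized = json.dumps(entries, default=encode_date, sort_keys=True)
```

Reading the string back with `json.loads` yields the date as a plain string, so a consumer would need to parse it again if it wants a datetime.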

Ultimately, this was written as a support package for Sylph, which aims to completely decentralize the web and take bootstrapped content with it. (Please read more about that and consider contributing.)

The following libraries are used:

  • BeautifulSoup
  • html5lib (for BeautifulSoup parser fixes)
  • simplejson

(A complete client would have no non-standard library dependencies.)



None yet

(Also, heuristic scraping hasn't even been started yet.)

Output format:

Output is a list of dictionaries, with the following keys:

  • uri
  • title
  • date, a Python datetime object; it may not include a time component if the website didn't list the time
  • author, name of the author
  • contents and/or summary, which probably contain minor HTML such as <p> and <img>
  • contents_format and/or summary_format, which are either 'text/plain' or 'text/html'
  • contents_markdown and/or summary_markdown, which contain a Markdown-formatted version of the respective text
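For illustration, a single entry with the keys listed above might look like the sketch below. All values are made up. `check_entry` is a hypothetical helper, not part of the script, and it assumes that at least one of contents or summary is present and that a missing time is represented as midnight.

```python
from datetime import datetime

ALLOWED_FORMATS = {"text/plain", "text/html"}

# Made-up example entry; the page listed no time, so only the date is set
# (representing that as midnight is an assumption).
entry = {
    "uri": "http://example.com/post/1",
    "title": "Example post",
    "date": datetime(2010, 1, 2),
    "author": "Jane Doe",
    "summary": "<p>A short <em>summary</em>.</p>",
    "summary_format": "text/html",
    "summary_markdown": "A short _summary_.",
}


def check_entry(entry):
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    if "contents" not in entry and "summary" not in entry:
        problems.append("missing contents or summary")
    for field in ("contents", "summary"):
        fmt = entry.get(field + "_format")
        if field in entry and fmt not in ALLOWED_FORMATS:
            problems.append("bad %s_format: %r" % (field, fmt))
    return problems


feed = [entry]  # the overall output is a list of such dictionaries
```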

More keys will be added as I write code to support comments, etc. The output will also include the type of page, the heuristics used, etc.


My code is licensed under the MIT and BSD licenses; however, I have included Aaron Swartz's GPL-licensed html2text, which generates structured Markdown from HTML. Thus, to redistribute the code as-is, it must be under the GPL. (That said, removing this feature should be simple and straightforward.)