web2feed: turn webpages into feeds
This is a script to turn any webpage into a feed. It first applies site-specific rules (for popular sites and detectable software packages), then falls back on heuristics (TODO).
The premise is that RSS/Atom feeds often don't tell the whole story. We should be able to scrape any webpage for content regardless of what the author wants to make easily available.
Ultimately, this was written as a support package for Sylph, which aims to completely decentralize the web and take bootstrapped content with it. (Please read more about that and consider contributing.)
The following libraries are used:
- html5lib (for Beautiful Soup parser fixes)
(A complete client would have no non-standard library dependencies.)
- Less Wrong
- Biology News
- Data Portability Blog
- Marc Canter's blog
- Robert Scoble's blog
- Aaron Swartz' blog
(Also, heuristic scraping hasn't even been started yet.)
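The "site-specific rules first, then heuristics" flow could be sketched roughly as follows. This is only an illustration, not the actual web2feed code: the names `SITE_RULES`, `site_rule`, `scrape_example`, `scrape_heuristically`, and `scrape`, and the `example.org` hostname, are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping a hostname to its scraper function.
SITE_RULES = {}


def site_rule(hostname):
    """Decorator that registers a scraper for one hostname."""
    def register(func):
        SITE_RULES[hostname] = func
        return func
    return register


@site_rule("example.org")
def scrape_example(html):
    # Placeholder for a real site-specific parser.
    return [{"author": "Example Author",
             "contents": html,
             "contents_format": "text/html"}]


def scrape_heuristically(html):
    # Stand-in for the heuristic scraper, which does not exist yet.
    return []


def scrape(url, html):
    """Use a site-specific rule if one matches, else fall back to heuristics."""
    hostname = urlparse(url).hostname
    rule = SITE_RULES.get(hostname, scrape_heuristically)
    return rule(html)
```

Keeping the rules in a dictionary keyed by hostname means supporting a new site is a matter of adding one decorated function, and unknown sites fall through to the heuristic path without any special casing.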
Output is a list of dictionaries, with the following keys:
- date, a Python datetime object, which may lack a time component if the website didn't list the time
- author, name of the author
- contents and/or summary, which probably contain minor HTML such as <p> and <img>
- contents_format and/or summary_format, which are either 'text/plain' or 'text/html'
- contents_markdown and/or summary_markdown, which contain a Markdown-formatted version of the respective text
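As an illustration of the keys listed above, a single entry might look something like this. All values are made up, and the helper `best_text` is a hypothetical convenience for consumers, not part of web2feed. The sketch also assumes that a missing time shows up as midnight in the datetime, which the description above does not confirm.

```python
from datetime import datetime

# Illustrative entry only; every value here is invented.
entry = {
    # Assumption: no time was listed on the page, so the time is midnight.
    "date": datetime(2009, 3, 14),
    "author": "Example Author",
    "summary": "<p>A short post.</p>",
    "summary_format": "text/html",
    "summary_markdown": "A short post.",
}


def best_text(entry, prefer_markdown=True):
    """Return the fullest text available, preferring contents over summary."""
    for field in ("contents", "summary"):
        if prefer_markdown and field + "_markdown" in entry:
            return entry[field + "_markdown"]
        if field in entry:
            return entry[field]
    return None


# The output is a list of such dictionaries, so consumers iterate over it.
for e in [entry]:
    print(e["date"].date(), e["author"], best_text(e))
```

Because each entry may carry contents, summary, or both, consumers should check which keys are present rather than assume either one exists.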
More keys will be added as I write code to support comments, etc. The output will also eventually include the type of page, the heuristics used, etc.
My code is licensed under the MIT and BSD licenses. However, I have included Aaron Swartz' GPL-licensed html2text, which generates structured Markdown from HTML. Thus, to redistribute the code as-is, it must be under the GPL. (Removing this feature should be simple and straightforward, however.)