This script turns any webpage into a feed. It relies first on site-specific rules (for popular sites and detectable software packages), then falls back on heuristics (TODO).
The premise is that RSS/Atom feeds often don't tell the whole story. We should be able to scrape any webpage for content regardless of what the author wants to make easily available.
Output is currently a list of dictionaries; serializations (JSON, XML, RDF) are planned. Additional options to be developed include advertisement/JavaScript removal, link/image/media isolation, etc.
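A JSON serialization of that list could look roughly like the sketch below. It is not the script's actual serializer: `to_json` is a hypothetical name, the sample entry is made up, and writing dates as ISO 8601 strings is an assumed convention. It uses the standard-library `json` module; simplejson offers the same `dumps` call with a `default` hook.

```python
import json
from datetime import date, datetime


def to_json(entries):
    """Serialize a list of entry dictionaries to a JSON string."""
    def encode(value):
        # datetime is a subclass of date, so one check covers both.
        if isinstance(value, date):
            return value.isoformat()
        raise TypeError("cannot serialize %r" % (value,))

    return json.dumps(entries, default=encode, indent=2, sort_keys=True)


# Made-up entry, for illustration only.
entries = [{
    "uri": "http://example.com/post/1",
    "title": "Example post",
    "date": datetime(2010, 1, 2, 15, 30),
    "author": "Jane Doe",
}]
serialized = to_json(entries)
print(serialized)
```

Anything the `default` hook does not recognize raises `TypeError`, so an unexpected value type fails loudly instead of being silently dropped.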
Ultimately, this was written as a support package for Sylph, which aims to completely decentralize the web and take bootstrapped content with it. (Please read more about that and consider contributing.)
The following libraries are used:
- BeautifulSoup
- html5lib (for BeautifulSoup parser fixes)
- simplejson
(A complete client would have no non-standard library dependencies.)
The following sites are supported:
- Slashdot
- TechCrunch
- Techdirt
- Less Wrong
- Biology News
- Data Portability Blog
- Marc Canter's blog
- Robert Scoble's blog
- Aaron Swartz' blog
Supported software packages: none yet.
(Also, heuristic scraping hasn't even been started yet.)
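The rules-first, heuristics-second order could be dispatched along the lines of the sketch below. Every name here (`SITE_RULES`, `pick_scraper`, `scrape_slashdot`, `scrape_heuristic`) is hypothetical and the scraper bodies are placeholders; it also assumes Python 3 for `urllib.parse`.

```python
from urllib.parse import urlparse


def scrape_slashdot(html):
    # Placeholder: a real rule would parse the site's markup.
    return []


def scrape_heuristic(html):
    # Fallback for unknown sites; not implemented, as noted above.
    raise NotImplementedError("heuristic scraping is not written yet")


# Hypothetical registry mapping a domain to its site-specific rule.
SITE_RULES = {
    "slashdot.org": scrape_slashdot,
}


def pick_scraper(uri):
    """Return the site-specific scraper for uri, else the heuristic one."""
    host = (urlparse(uri).hostname or "").lower()
    for domain, scraper in SITE_RULES.items():
        # Match the bare domain and any subdomain such as "www.".
        if host == domain or host.endswith("." + domain):
            return scraper
    return scrape_heuristic
```

Adding a site then means writing one function and registering it under its domain; detection of software packages would need a different check (page content rather than hostname) and is left out here.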
Output is a list of dictionaries, with the following keys:
- uri
- title
- date, a Python datetime object; the time component may be missing if the website didn't list the time
- author, name of the author
- contents and/or summary, which probably contain minor HTML such as <p> and <img>
- contents_format and/or summary_format, which are either 'text/plain' or 'text/html'
- contents_markdown and/or summary_markdown, which contain a Markdown-formatted version of the respective text
More keys will be added as I write code to support comments, etc. The type of page, the heuristics used, etc. will also be output.
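For illustration, one entry with the keys listed above might look like the sketch below, along with a small consistency check. The values are made up, `check_entry` is a hypothetical helper that is not part of the script, and treating uri, title, date and author as always present is an assumption.

```python
from datetime import datetime

# Assumed to be present on every entry; the script may not guarantee this.
REQUIRED_KEYS = ("uri", "title", "date", "author")
BODY_KEYS = ("contents", "summary")
FORMATS = ("text/plain", "text/html")


def check_entry(entry):
    """Return a list of problems found in one entry (empty if none)."""
    problems = ["missing key: " + k for k in REQUIRED_KEYS if k not in entry]
    bodies = [k for k in BODY_KEYS if k in entry]
    if not bodies:
        problems.append("need contents and/or summary")
    for k in bodies:
        # Each body key should come with a matching *_format key.
        if entry.get(k + "_format") not in FORMATS:
            problems.append("bad or missing " + k + "_format")
    return problems


# Made-up entry, for illustration only.
example = {
    "uri": "http://example.com/post/1",
    "title": "Example post",
    "date": datetime(2010, 1, 2, 15, 30),
    "author": "Jane Doe",
    "summary": "<p>Hello, <b>world</b>.</p>",
    "summary_format": "text/html",
    "summary_markdown": "Hello, **world**.",
}
print(check_entry(example))
```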
My code is licensed under the MIT and BSD licenses; however, I have included Aaron Swartz' GPL-licensed html2text, which generates structured Markdown from HTML. Thus, the code as-is must be redistributed under the GPL. (Removing this feature should be simple and straightforward, however.)