Convert webpages into feeds (via ruleset or heuristic)
Python
libs
sites
.gitignore
README.mkd
mapper.py
web2feed.py

web2feed: turn webpages into feeds

This is a script to turn any webpage into a feed. The program relies first on site-specific rules (for popular sites and detectable software packages), then on heuristics (TODO).
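The rules-first, heuristics-second order described above could be sketched roughly as follows. This is not the code in web2feed.py: the names `SITE_RULES`, `rule`, `scrape`, and `scrape_heuristic`, and the hostname `blog.example.com`, are all hypothetical and chosen only for illustration. The sketch assumes Python 3.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping a hostname to its site-specific scraper.
SITE_RULES = {}


def rule(hostname):
    """Register the decorated function as the scraper for one hostname."""
    def register(func):
        SITE_RULES[hostname] = func
        return func
    return register


@rule("blog.example.com")
def scrape_example_blog(html):
    # Placeholder: a real rule would parse the HTML with site-specific logic.
    return [{"uri": "http://blog.example.com/post/1", "title": "Example post"}]


def scrape_heuristic(html):
    # Heuristic scraping is still TODO, so this stand-in only signals that.
    raise NotImplementedError("heuristic scraping is not implemented")


def scrape(uri, html):
    """Use a site-specific rule if one matches, else fall back to heuristics."""
    hostname = urlparse(uri).hostname
    scraper = SITE_RULES.get(hostname, scrape_heuristic)
    return scraper(html)
```

Keying the registry on hostname is the simplest possible match; detecting a software package (for example, a blog engine) would need a rule that inspects the page itself.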

The premise is that RSS/Atom feeds often don't tell the whole story. We should be able to scrape any webpage for content regardless of what the author wants to make easily available.

Output is currently a list of dictionaries; serializations (JSON, XML, RDF) are planned. Other options still to be developed include advertisement/JavaScript removal and link/image/media isolation.
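Since serialization is not implemented yet, here is one way a JSON serialization might look. It is a sketch under assumptions: `to_json` is a hypothetical helper, the standard-library `json` module stands in for simplejson, and dates are written as ISO 8601 strings because JSON has no date type.

```python
import json
from datetime import date, datetime


def to_json(entries):
    """Serialize a list of entry dictionaries to a JSON string."""
    def encode_date(value):
        # json.dumps calls this only for objects it cannot encode itself.
        if isinstance(value, (datetime, date)):
            return value.isoformat()
        raise TypeError("cannot serialize %r" % (value,))

    return json.dumps(entries, default=encode_date, indent=2)


if __name__ == "__main__":
    print(to_json([{"title": "Example post", "date": datetime(2010, 1, 2, 3, 4, 5)}]))
```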

Ultimately, this was written as a support package for Sylph, which aims to completely decentralize the web and take bootstrapped content with it. (Please read more about that and consider contributing.)

The following libraries are used:

  • BeautifulSoup
  • html5lib (for BeautifulSoup parser fixes)
  • simplejson

(A complete client would depend only on the standard library.)

Sites/Blogs

Software

None yet

(Also, heuristic scraping hasn't even been started yet.)

Output format:

Output is a list of dictionaries, with the following keys:

  • uri
  • title
  • date, a Python datetime object, which may lack a time component if the website didn't list the time
  • author, name of the author
  • contents and/or summary, which probably contain minor HTML such as <p> and <img>
  • contents_format and/or summary_format, which are either 'text/plain' or 'text/html'
  • contents_markdown and/or summary_markdown, which contain a Markdown-formatted version of the respective text
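As an illustration of the keys listed above, one entry might look like the dictionary below. All values are made up, and `check_entry` is a hypothetical helper, not part of web2feed. The sketch also assumes that a missing time is stored as midnight; how the script actually represents that case is not stated here.

```python
from datetime import datetime

# Made-up example entry using the documented keys.
entry = {
    "uri": "http://blog.example.com/post/1",
    "title": "Example post",
    "date": datetime(2010, 1, 2),  # assumption: no time listed, so midnight
    "author": "Jane Doe",
    "summary": "<p>First paragraph of the post.</p>",
    "summary_format": "text/html",
    "summary_markdown": "First paragraph of the post.",
}

ALLOWED_FORMATS = {"text/plain", "text/html"}


def check_entry(entry):
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    if "contents" not in entry and "summary" not in entry:
        problems.append("entry has neither contents nor summary")
    for field in ("contents", "summary"):
        # Each body field that is present needs a recognized format.
        if field in entry and entry.get(field + "_format") not in ALLOWED_FORMATS:
            problems.append("%s_format is missing or not recognized" % field)
    return problems
```

A check like this is handy when writing a new site rule, since it catches an entry that has a body but no declared format.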

More keys will be added as I write code to support comments, etc. The output will also eventually include the type of page, the heuristics used, and so on.

License

My code is licensed under the MIT and BSD licenses. However, I have included Aaron Swartz's GPL-licensed html2text, which generates structured Markdown from HTML. Thus, to redistribute the code as-is, it must be under the GPL. (Removing this feature should be simple and straightforward, however.)