Skip to content

Latest commit

 

History

History
68 lines (51 loc) · 2.61 KB

README.mkd

File metadata and controls

68 lines (51 loc) · 2.61 KB

web2feed: turn webpages into feeds

This is a script to turn any webpage into a feed. The program relies first on site-specific rules (for popular sites and detectable software packages), then on heuristics (TODO).

The premise is that RSS/Atom feeds often don't tell the whole story. We should be able to scrape any webpage for content regardless of what the author wants to make easily available.

Output is a list of dictionaries, but serializations (JSON, XML, RDF) will be supported. Additional options to be developed include advertisement/javascript removal, link/image/media isolation, etc.

Ultimately, this was written as a support package for Sylph, which aims to completely decentralize the web and take bootstrapped content with it. (Please read more about that and consider contributing.)

The following libraries are used:

  • BeautifulSoup
  • html5lib (for beautiful soup parser fixes)
  • simplejson

(A complete client would have no non-standard library dependencies.)

Sites/Blogs

Software

None yet

(Also, heuristic scraping hasn't even been started yet.)

Output format:

Output is a list of dictionaries, with the following keys:

  • uri
  • title
  • date, a python datetime object, but may not include a time component if the website didn't list the time
  • author, name of the author
  • contents and/or summary, which probably contain minor HTML such as <p> and <img>
  • contents_format and/or summary_format, which are either 'text/plain' or 'text/html'
  • contents_markdown and/or summary_markdown, which contains markdown-formatted version of the respective text

More will be added as I write code to support comments, etc. Also to be output is the type of page, the heuristics used, etc.

License

My code is licensed under the MIT and BSD licenses, however I have included Aaron Swartz' GPL-licensed html2text, which generates structured Markdown from HTML. Thus, to redistribute the code as-is, it must be under the GPL. (Removal of this feature should be very simple and straightforward, however.)