Convert webpages into feeds (via ruleset or heuristic)

spsu/web2feed

web2feed: turn webpages into feeds

This is a script to turn any webpage into a feed. The program relies first on site-specific rules (for popular sites and detectable software packages), then on heuristics (TODO).
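The "rules first, heuristics second" order can be sketched roughly as below. This is only an illustration, not web2feed's actual code: the `SITE_RULES` registry, the `rule` decorator, and the function names are all hypothetical, and the heuristic step is a stub because that part is still TODO.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping a hostname to a site-specific scraper.
# Neither the registry nor these function names come from web2feed itself.
SITE_RULES = {}


def rule(hostname):
    """Register a scraper function for one hostname."""
    def decorator(func):
        SITE_RULES[hostname] = func
        return func
    return decorator


@rule("example.com")
def scrape_example(html):
    # Placeholder rule: a real one would parse the markup.
    return [{"uri": "http://example.com/post/1", "title": "Example post"}]


def scrape_heuristic(html):
    # Heuristic scraping is still TODO in the project, so this stub finds nothing.
    return []


def page_to_entries(uri, html):
    """Try a site-specific rule first, then fall back to heuristics."""
    hostname = urlparse(uri).hostname or ""
    scraper = SITE_RULES.get(hostname, scrape_heuristic)
    return scraper(html)
```

A registry keyed by hostname keeps each site's rule isolated, so adding a site should not require touching the dispatch logic.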

The premise is that RSS/Atom feeds often don't tell the whole story. We should be able to scrape any webpage for content regardless of what the author wants to make easily available.

Output is currently a list of dictionaries; serializations (JSON, XML, RDF) are planned. Further options to be developed include advertisement/JavaScript removal, link/image/media isolation, etc.
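Until a serializer exists, a JSON dump of the list could look something like this minimal sketch. It uses the standard-library `json` module rather than simplejson, and `entries_to_json` and `encode_extra` are hypothetical names, not part of web2feed. The one wrinkle is that `datetime` values are not JSON-serializable by default, so they are converted to ISO 8601 strings here (an assumption about the desired format).

```python
import json
from datetime import date, datetime


def encode_extra(value):
    """Convert values the json module cannot handle on its own."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def entries_to_json(entries):
    """Serialize a list of entry dictionaries to a JSON string."""
    return json.dumps(entries, default=encode_extra, indent=2)
```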

Ultimately, this was written as a support package for Sylph, which aims to completely decentralize the web and take bootstrapped content with it. (Please read more about that and consider contributing.)

The following libraries are used:

  • BeautifulSoup
  • html5lib (for BeautifulSoup parser fixes)
  • simplejson

(A complete client would have no non-standard library dependencies.)

Sites/Blogs

Software

None yet

(Also, heuristic scraping hasn't even been started yet.)

Output format:

Output is a list of dictionaries, with the following keys:

  • uri
  • title
  • date, a Python datetime object; it may lack a time component if the website didn't list the time
  • author, name of the author
  • contents and/or summary, which probably contain minor HTML such as <p> and <img>
  • contents_format and/or summary_format, which are either 'text/plain' or 'text/html'
  • contents_markdown and/or summary_markdown, which contain a Markdown-formatted version of the respective text
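
For illustration, a single entry with the keys above might look like the following. Every value is invented, the midnight timestamp standing in for "no time listed" is an assumption, and `check_entry` is a hypothetical helper (not part of web2feed) that reads "contents and/or summary" as "at least one of the two".

```python
from datetime import datetime

# Every value below is invented for illustration.
entry = {
    "uri": "http://example.com/2010/01/hello",
    "title": "Hello, world",
    # Midnight stands in for "no time listed" here; this is an assumption.
    "date": datetime(2010, 1, 15),
    "author": "Jane Doe",
    "summary": "<p>A short teaser.</p>",
    "summary_format": "text/html",
    "summary_markdown": "A short teaser.",
}

ALLOWED_FORMATS = {"text/plain", "text/html"}


def check_entry(entry):
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    if not ("contents" in entry or "summary" in entry):
        problems.append("needs contents and/or summary")
    for field in ("contents", "summary"):
        fmt = entry.get(field + "_format")
        if field in entry and fmt not in ALLOWED_FORMATS:
            problems.append(f"{field}_format must be text/plain or text/html")
    return problems
```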

More will be added as I write code to support comments, etc. Future output will also include the type of page, the heuristics used, etc.

License

My code is licensed under the MIT and BSD licenses. However, I have included Aaron Swartz's GPL-licensed html2text, which generates structured Markdown from HTML. Thus, to redistribute the code as-is, it must be under the GPL. (Removing this feature should be simple and straightforward, however.)
