Web Page Parser¶ ↑

Web Page Parser is a Ruby library to parse the content out of web pages, such as BBC News pages. It strips all non-textual stuff out, leaving the title, publication date and an array of paragraphs. It makes heavy use of regular expressions, rather than actually parsing the HTML. This may sound a bit whacky, but BBC News html in particular has semantic markup *within comments*, which cannot easily be referenced with standard HTML parsing. Regular expressions are much faster than full HTML parsing too.

Web Page Parser currently supports BBC News pages and Guardian news articles but new parsers are planned and can be added easily.

It is used by the News Sniffer project, which parses and archives news articles to keep track of how they change.

Example usage¶ ↑

require 'web-page-parser'
require 'open-uri'

url = "http://news.bbc.co.uk/1/hi/uk/8041972.stm"
page_data = open(url).read

page = WebPageParser::ParserFactory.parser_for(:url => url, :page => page_data)

puts page.title # MPs hit back over expenses claims
puts page.date # 2009-05-09T18:58:59+00:00
puts page.content.first # The wife of author Ken Follett and ...

Ruby 1.8 support¶ ↑

Installing the Oniguruma gem on Ruby 1.8 will make Web Page Parser run faster, it’s highly recommended but not required.

More Info¶ ↑

Web Page Parser was written by John Leach and is released under the MIT License.

The code is available on github.

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
lib		lib
spec		spec
.gitignore		.gitignore
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
LICENSE		LICENSE
README.rdoc		README.rdoc
Rakefile		Rakefile
web-page-parser.gemspec		web-page-parser.gemspec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Web Page Parser¶ ↑

Example usage¶ ↑

Ruby 1.8 support¶ ↑

More Info¶ ↑

About

Uh oh!

Releases

Packages

License

simondodson/web-page-parser

Folders and files

Latest commit

History

Repository files navigation

Web Page Parser¶ ↑

Example usage¶ ↑

Ruby 1.8 support¶ ↑

More Info¶ ↑

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages