A website crawler / validator, designed to be integrated into an automated website integrity test. Inspired by the Python "Webcheck" project (http://arthurdejong.org/webcheck/)
Pull request Compare This branch is even with fixermark:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
test
.gitignore
LICENSE
README.rdoc
TODO
consistencychecker.rb
linkfinder.rb
webcheck.rb
webcrawler.rb

README.rdoc

Webcheck-ruby

Webcheck-ruby is a simple web site crawler / checker, inspired by the Python program Webcheck, currently maintained by Arthur de Jong

Features

  • fully automated

  • identifies broken links, missing images, missing linked resources, and missing scripts

  • modular design allows for substitution of HTML link extractor, crawler, and page-validation rules

Usage

To crawl a web page:

require 'webcheck'
checker = Webcheck.new
results = checker.check("http://example.com")
print results[:uris404]
print results[:urisUnknown]
checker.prettyPrint(results)

The default format of the results is a hash with the following keys

uris404

Links that returned a 404 error

uris200

Links that were followed successfully

urisUnknown

Links that returned an unknown HTTP code. These are in the form of hashes with the keys :code,:uri

checked

A hash of all the URIs checked. The keys are the URIs

urisNonHTTP

Links on the page that led to resources not accessed via the HTTP protocol (such as e-mail and FTP)

Components

Webcrawler

A recursive crawling tool. At each step, it retrieves a resource from a URL and feeds it to a block. The block yields links to retrieve on subsequent steps.

ConsistencyChecker

The validation engine. Takes the HTML bodies retrieved by the crawler and determines how to handle each one.

Linkfinder

A link extraction tool, which takes HTML bodies and extracts URLs from them.

Dependencies

The Linkfinder uses Nokogiri to extract links from web pages.

Author

Copyright © 2010 by Mark T. Tomczak, released under the MIT license.