Webcheck-ruby is a simple web site crawler / checker, inspired by the Python program Webcheck, which is currently maintained by Arthur de Jong.
* Identifies broken links, missing images, missing linked resources, and missing scripts
* Modular design allows substitution of the HTML link extractor, the crawler, and the page-validation rules
To crawl a web page:
```ruby
require 'webcheck'

checker = Webcheck.new
results = checker.check("http://example.com")
print results[:uris404]
print results[:urisUnknown]
checker.prettyPrint(results)
```
The default format of the results is a hash with the following keys:

* `:uris404`: links that returned a 404 error
* Links that were followed successfully
* `:urisUnknown`: links that returned an unknown HTTP code; these are hashes with the keys `:code` and `:uri`
* A hash of all the URIs checked, keyed by URI
* Links on the page that led to resources not accessed via the HTTP protocol (such as e-mail and FTP links)
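As a sketch of how the results hash might be inspected after a crawl, using only the two keys that appear in the usage example above (the sample data here is illustrative, not real crawler output):

```ruby
# Illustrative results hash; only :uris404 and :urisUnknown are taken
# from the usage example, and the values are made up for the sketch.
results = {
  uris404:     ["http://example.com/missing.png"],
  urisUnknown: [{ code: "503", uri: "http://example.com/flaky" }]
}

# Report each broken link.
results[:uris404].each do |uri|
  puts "404: #{uri}"
end

# Each unknown-status entry is a hash with :code and :uri.
results[:urisUnknown].each do |entry|
  puts "#{entry[:code]}: #{entry[:uri]}"
end
```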
The modular design breaks down into three parts:

* A recursive crawling tool: at each step it retrieves a resource from a URL and feeds it to a block, and the block yields links to retrieve on subsequent steps.
* The validation engine: takes the HTML bodies retrieved by the crawler and determines how to handle each one.
* A link-extraction tool: takes HTML bodies and extracts URLs from them.
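The crawl loop described by the first component might be sketched as follows. This is an illustrative stand-in, not the library's actual implementation: `fetcher` is a hypothetical retrieval callable standing in for the real HTTP fetch, and the block plays the role Webcheck-ruby gives it, deciding which links found in each body are followed next.

```ruby
require 'set'

# Minimal sketch of the recursive crawl: at each step a resource is
# fetched and handed to the block, and the block returns the links to
# visit on subsequent steps. `fetcher` is a stand-in for real HTTP
# retrieval (hypothetical; the library's internals may differ).
def crawl(start_uri, fetcher)
  visited = Set.new
  queue   = [start_uri]
  until queue.empty?
    uri = queue.shift
    next if visited.include?(uri)
    visited << uri
    body = fetcher.call(uri)
    # The block yields the links to retrieve in later steps.
    queue.concat(yield(uri, body))
  end
  visited
end
```

Because both the fetcher and the link-selection block are supplied by the caller, this shape matches the modular design above: either piece can be swapped out independently.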
The Linkfinder uses Nokogiri to extract links from web pages.
Copyright © 2010 by Mark T. Tomczak, released under the MIT license.