Rcrawl web crawler for Ruby

Rcrawl is a web crawler written entirely in ruby. It's limited right now by the fact that it will stay on the original domain provided.

Rcrawl is available as a gem here: https://rubygems.org/gems/rcrawl/versions/0.5.1

Rcrawl uses a modular approach to processing the HTML it receives. The link exraction portion of Rcrawl depends on the scrAPI toolkit by Assaf Arkin (http://labnotes.org). gem install scrapi

The structure of the crawling process was inspired by the specs of the Mercator crawler (http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf). A (somewhat) brief overview of the design philosophy follows.

The Rcrawl process

Remove an absolute URL from the URL Server.
Download corresponding document from the internet, grabbing and processing robots.txt first, if available.
Feed the document into a rewind input stream(ris) to be read/re-read as needed. Based on MIME type, invoke the process method of the
processing module associated with that MIME type. For example, a link extractor or tag counter module for text/html MIME types, or a gif stats module for image/gif. By default, all text/html MIME types will pass through the link extractor. Each link will be converted to an absolute URL and tested against a (ideally user-supplied) URL filter to determine if it should be downloaded.
If the URL passes the filter (currently hard coded as Same Domain?), then call the URL-seen? test.
Has the URL been seen before? Namely, is it in the URL Server or has it been downloaded already? If the URL is new, it is added to the URL Server.
Back to step 1, repeat until the URL Server is empty.

Examples

Instantiate a new Rcrawl object

`crawler = Rcrawl::Crawler.new(url)`

Begin the crawl process

`crawler.crawl`

After the crawler is done crawling

Returns an array of visited links

`crawler.visited_links`

Returns a hash where the key is a url and the value is the raw html from that url

`crawler.raw_html`

Returns a hash where the key is a URL and the value is

the error message from stderr

`crawler.errors`

Returns an array of external links

`crawler.external_links`

Set user agent

`crawler.user_agent = "Your fancy crawler name here"`

License

Developed for Digital Duckies by Shawn Hansen and Jason Burgett

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
lib		lib
LICENSE		LICENSE
README.md		README.md
Rakefile		Rakefile
TODO		TODO

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Rcrawl web crawler for Ruby

The Rcrawl process

Examples

Instantiate a new Rcrawl object

Begin the crawl process

After the crawler is done crawling

Returns an array of visited links

Returns a hash where the key is a url and the value is the raw html from that url

Returns a hash where the key is a URL and the value is

the error message from stderr

Returns an array of external links

Set user agent

License

About

Releases

Packages

Languages

License

shawnhansen/rcrawl

Folders and files

Latest commit

History

Repository files navigation

Rcrawl web crawler for Ruby

The Rcrawl process

Examples

Instantiate a new Rcrawl object

Begin the crawl process

After the crawler is done crawling

Returns an array of visited links

Returns a hash where the key is a url and the value is the raw html from that url

Returns a hash where the key is a URL and the value is

the error message from stderr

Returns an array of external links

Set user agent

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages