web_crawler

Ruby web crawler to collect unique hrefs.

Starts from a root URL and filters all hrefs found from there. The root URL is the page where you want the crawl to begin.
Formats URLs using Ruby's uri library.
An exclusion list can be passed as an array to skip any URL that contains certain patterns, for example sign_in?.
Traversed URLs are stored in a set and are never visited more than once.
Traversable URLs are stored in a set.
Invalid URLs are stored in a set.
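
The bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in crawler.rb: the method names format_url, classify_hrefs and crawl are hypothetical, exclusion patterns are assumed to be plain substrings, and a hash of page URL to hrefs stands in for fetching and parsing real pages.

```ruby
require "set"
require "uri"

# Hypothetical helper: resolve an href against the page it was found on.
# Returns an absolute http(s) URL without its fragment, or nil when the
# href cannot be parsed or is not an http(s) link.
def format_url(base_url, href)
  uri = URI.join(base_url, href.strip)
  return nil unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
  uri.fragment = nil
  uri.to_s
rescue URI::InvalidURIError
  nil
end

# Split the hrefs of one page into traversable and invalid sets,
# dropping any URL that contains a pattern from the exclusion list.
def classify_hrefs(page_url, hrefs, exclusion_list: [])
  traversable = Set.new
  invalid = Set.new
  hrefs.each do |href|
    url = format_url(page_url, href)
    if url.nil?
      invalid << href
    elsif exclusion_list.none? { |pattern| url.include?(pattern) }
      traversable << url
    end
  end
  [traversable, invalid]
end

# Visit every traversable URL exactly once. links_by_page is a stand-in
# for the network: it maps a page URL to the hrefs found on that page.
def crawl(root_url, links_by_page, exclusion_list: [])
  traversed = Set.new
  traversable = Set[root_url]
  invalid = Set.new
  until (pending = traversable - traversed).empty?
    url = pending.first
    traversed << url
    found, bad = classify_hrefs(url, links_by_page.fetch(url, []),
                                exclusion_list: exclusion_list)
    traversable.merge(found)
    invalid.merge(bad)
  end
  { traversed: traversed, invalid: invalid }
end

# Example data (made up for illustration).
links = {
  "http://example.com/" => ["/about#team", "/sign_in?next=1",
                            "mailto:a@example.com", "/about", "bad url"],
  "http://example.com/about" => ["/"]
}
result = crawl("http://example.com/", links, exclusion_list: ["sign_in?"])
puts result[:traversed].to_a
puts result[:invalid].to_a
```

Keeping the three collections as sets means duplicate hrefs such as /about and /about#team collapse into a single entry, so each page is queued at most once.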

Usage

The crawler is built with Nokogiri. An exclusion list can be passed as an optional parameter: {:exclusion_list => }

Crawler.new(root_url).inspect
