Write a simple web crawler.
- Crawler is limited to one domain. E.g. when crawling example.com it crawls all pages within the domain, but not external links, for example to the Facebook and Twitter accounts.
- Given a URL, it outputs a site map, showing which static assets each page depends on, and the links between pages.
- Write it as you would a production piece of code.
- Bonus points for tests and making it as fast as possible!
- URL Frontier Queue (Apache Kafka)
- Fetch module (worker.rb)
- Parse module (page.rb)
- Sitemap store (Redis)
This crawler uses Apache Kafka as a messaging queue.
An arbitrary number of workers can be attached to the queue, from which URLs are read, crawled and new links inserted at the back of the queue again.
Sitemap data for each page is stored in Redis.
The system is failure-tolerant and guarantees that every URL will be crawled at least once.
If a worker should die while crawling a URL, Kafka's Consumer Groups feature will automatically assign the URL to a new worker. There may be duplicated work but never lost URLs.
Speed and Scalability
Due to the distributed architecture you can run the workers on as many machines as you like. So the crawler component is as fast as you need it to be.
At very high concurrency levels Redis might conceivably be a bottleneck. It could be replaced by a distributed data store backend such as Cassandra.
Rendering performance has not been optimized at all and might be quite slow for large sites.
- Apache Kafka
- Ruby >= 2.3.0
config.yml. You will probably need to add your local Kafka and Redis configurations there.
bundle exec ruby start.rb
Note that the workers will run forever, or until you quit using Ctrl-C.
Sitemaps are output in JSON format. You can run this in a separate shell from your workers, or even on another machine.
bundle exec ruby render.rb
bundle exec rspec
The following URL responses are treated as an empty page with no links:
- Any status code other than 200
- Any Content-Type other than text/html