Elasticrawl is a command line tool for launching Hadoop jobs on AWS EMR (Elastic MapReduce) to process Common Crawl data.
Elasticrawl can be used with crawl data from April 2014 onwards.
A list of crawls released by Common Crawl is maintained on the wiki.
Common Crawl announces new crawls on their blog.
Elasticrawl ships with a default configuration that launches the elasticrawl-examples jobs, an implementation of the standard Hadoop Word Count example.
This blog post has a walkthrough of running the example jobs on the November 2014 crawl.
- Elasticrawl needs a Ruby installation (2.1 or higher).
- Install Elasticrawl from RubyGems:
```
gem install elasticrawl --no-rdoc --no-ri
```
If you get the error "EMR service role arn:aws:iam::156793023547:role/EMR_DefaultRole is invalid" when launching a cluster, then you don't have the necessary IAM roles. To fix this, install the AWS CLI and run the command below.
```
aws emr create-default-roles
```
The init command takes an S3 bucket name and your AWS credentials. The S3 bucket will be created and will store your data and logs.
```
~$ elasticrawl init your-s3-bucket
Enter AWS Access Key ID: ************
Enter AWS Secret Access Key: ************
...
Bucket s3://elasticrawl-test created
Config dir /Users/ross/.elasticrawl created
Config complete
```
The parse command takes the crawl name and optional limits on the number of segments and files per segment to parse.
```
~$ elasticrawl parse CC-MAIN-2015-48 --max-segments 2 --max-files 3
Segments
Segment: 1416400372202.67 Files: 150
Segment: 1416400372490.23 Files: 124

Job configuration
Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --

Launch job? (y/n)
y

Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
```
The combine command takes the results of previous parse jobs and produces a combined set of results.
```
~$ elasticrawl combine --input-jobs 1420124830792
Job configuration
Combining: 2 segments

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --

Launch job? (y/n)
y

Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
```
The status command shows crawls and your job history.
```
~$ elasticrawl status
Crawl Status
CC-MAIN-2015-48 Segments: to parse 98, parsed 2, total 100

Job History (last 10)
1420124830792 Launched: 2015-01-01 15:07:10
Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment
```
The reset command resets a crawl so it is parsed again.
```
~$ elasticrawl reset CC-MAIN-2015-48
Reset crawl? (y/n)
y

CC-MAIN-2015-48 Segments: to parse 100, parsed 0, total 100
```
The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.
```
~$ elasticrawl destroy

WARNING: Bucket s3://elasticrawl-test and its data will be deleted
Config dir /home/vagrant/.elasticrawl will be deleted

Delete? (y/n)
y

Bucket s3://elasticrawl-test deleted
Config dir /home/vagrant/.elasticrawl deleted
Config deleted
```
The elasticrawl init command creates the ~/.elasticrawl directory, which contains:
- aws.yml - stores your AWS access credentials. Alternatively, you can set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
- cluster.yml - configures the EC2 instances that are launched to form your EMR cluster
- jobs.yml - stores your S3 bucket name and the config for the parse and combine jobs
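As a rough sketch of how cluster.yml describes the instance groups shown in the job output above, it might contain entries along these lines (the key names and values here are illustrative assumptions, not the exact schema; check the file generated by elasticrawl init):

```yaml
# Hypothetical sketch of ~/.elasticrawl/cluster.yml.
# Key names are assumptions; the generated file is authoritative.
master_instance_group:
  instance_type: m1.medium
  instance_count: 1
  use_spot_instances: true
  bid_price: 0.12
core_instance_group:
  instance_type: m1.medium
  instance_count: 2
  use_spot_instances: true
  bid_price: 0.12
# No task instance group is configured by default ("Task: --" in the output).
```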
Elasticrawl is developed in Ruby and requires Ruby 2.1.0 or later (Ruby 2.3 is recommended). The sqlite3 and nokogiri gems have C extensions, which means you may need to install development headers.
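On Debian/Ubuntu, for example, the headers these gems compile against can usually be installed with the packages below (package names vary by distribution, so treat this as a sketch rather than an exact recipe):

```shell
# Build toolchain plus headers for the sqlite3 and nokogiri C extensions
# (Debian/Ubuntu package names; adjust for your distribution)
sudo apt-get install build-essential libsqlite3-dev libxml2-dev libxslt1-dev
```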
- Add support for Streaming and Pig jobs
- Thanks to everyone at Common Crawl for making this awesome dataset available!
- Thanks to Robert Slifka for the elasticity gem which provides a nice Ruby wrapper for the EMR REST API.
- Thanks to Phusion for creating Traveling Ruby.
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
This code is licensed under the MIT license.