Elasticrawl

  • Command-line tool for launching Hadoop jobs using AWS EMR (Elastic MapReduce) to process Common Crawl data.
  • Elasticrawl can be used with crawl data from April 2014 onwards.
  • A list of crawls released by Common Crawl is maintained on its wiki.
  • Common Crawl announces new crawls on its blog.

  • Ships with a default configuration that launches the elasticrawl-examples jobs, an implementation of the standard Hadoop Word Count example.

This blog post has a walkthrough of running the example jobs on the November 2014 crawl.

Installation

gem install elasticrawl --no-rdoc --no-ri
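
After installing, you can confirm the gem is present using the standard RubyGems list command (this is plain RubyGems, not an Elasticrawl feature):

~$ gem list elasticrawl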

Troubleshooting

If you get the error "EMR service role arn:aws:iam::156793023547:role/EMR_DefaultRole is invalid" when launching a cluster, your AWS account is missing the default IAM roles for EMR. To fix this, install the AWS CLI and run the command below.

aws emr create-default-roles 

Commands

elasticrawl init

The init command takes an S3 bucket name and your AWS credentials. The S3 bucket will be created and will store your data and logs.

~$ elasticrawl init your-s3-bucket

Enter AWS Access Key ID: ************
Enter AWS Secret Access Key: ************

...

Bucket s3://your-s3-bucket created
Config dir /Users/ross/.elasticrawl created
Config complete

elasticrawl parse

The parse command takes the crawl name and optional limits on the number of segments, and files per segment, to parse.

~$ elasticrawl parse CC-MAIN-2015-48 --max-segments 2 --max-files 3
Segments
Segment: 1416400372202.67 Files: 150
Segment: 1416400372490.23 Files: 124

Job configuration
Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB

elasticrawl combine

The combine command takes the results of previous parse jobs and produces a combined set of results.

~$ elasticrawl combine --input-jobs 1420124830792
Job configuration
Combining: 2 segments

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL

elasticrawl status

The status command shows crawls and your job history.

~$ elasticrawl status
Crawl Status
CC-MAIN-2015-48 Segments: to parse 98, parsed 2, total 100

Job History (last 10)
1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment

elasticrawl reset

The reset command resets a crawl so it is parsed again.

~$ elasticrawl reset CC-MAIN-2015-48
Reset crawl? (y/n)
y
CC-MAIN-2015-48 Segments: to parse 100, parsed 0, total 100

elasticrawl destroy

The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.

~$ elasticrawl destroy

WARNING:
Bucket s3://your-s3-bucket and its data will be deleted
Config dir /home/vagrant/.elasticrawl will be deleted
Delete? (y/n)
y

Bucket s3://your-s3-bucket deleted
Config dir /home/vagrant/.elasticrawl deleted
Config deleted

Configuring Elasticrawl

The elasticrawl init command creates the ~/.elasticrawl directory, which contains:

  • aws.yml - stores your AWS access credentials. Alternatively, you can set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables (see the example after this list)

  • cluster.yml - configures the EC2 instances that are launched to form your EMR cluster (a sketch follows this list)

  • jobs.yml - stores your S3 bucket name and the config for the parse and combine jobs
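
If you use the environment variables instead of aws.yml, export them before running any elasticrawl command (the variable names are as listed above; the values shown are placeholders):

~$ export AWS_ACCESS_KEY_ID=your-access-key-id
~$ export AWS_SECRET_ACCESS_KEY=your-secret-access-key

As a rough illustration of cluster.yml, the sketch below matches the cluster shown in the parse example (1 m1.medium master and 2 m1.medium core instances with a 0.12 spot bid). The key names here are an assumption, not a documented schema; the authoritative file is generated from the templates directory in this repo.

# cluster.yml - hypothetical sketch; key names are assumptions,
# see the templates directory for the file elasticrawl actually generates
master_instance_group:
  instance_type: m1.medium
  use_spot_instances: true
  bid_price: '0.12'
core_instance_group:
  instance_count: 2
  instance_type: m1.medium
  use_spot_instances: true
  bid_price: '0.12'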

Development

Elasticrawl is developed in Ruby and requires Ruby 2.1.0 or later (Ruby 2.3 is recommended). The sqlite3 and nokogiri gems have C extensions, which means you may need to install development headers (an example follows).
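
On Debian or Ubuntu, for example, the headers these gems compile against can usually be installed with apt (package names differ on other distributions):

~$ sudo apt-get install build-essential libsqlite3-dev libxml2-dev libxslt1-dev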


TODO

  • Add support for Streaming and Pig jobs

Thanks

  • Thanks to everyone at Common Crawl for making this awesome dataset available!
  • Thanks to Robert Slifka for the elasticity gem which provides a nice Ruby wrapper for the EMR REST API.
  • Thanks to Phusion for creating Traveling Ruby.

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

License

This code is licensed under the MIT license.