Skip to content
No description, website, or topics provided.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

Statistics of the Common Crawl Corpus 2012


MapReduce code

The job run for creating the raw index is It is a 'map only' job which outputs a single line entry for each website found in the CC corpus.

Each line contains the public suffix, domain, media type, charset, ARC file name and byte size of a specific website, all tab separated.

Build the job jar by running:

$ ant dist

to create dist/lib/Teneo-########.jar.

Run job on AWS

The job was run on 35 subsets of 25,000 ARC files of the 2012 corpus. Results were later merged into fewer files.

A job on a subset was invoked by

elastic-mapreduce  --create --credentials credentials.json \
 --jar s3://[bucket]/Teneo-########.jar \
 --main-class com.spiegler.fastindex.FastIndexer \
 --args "[AccessKey],[SecretKey],/home/hadoop/splits/split_1,s3://[bucket]/output/split_1" \
 --instance-group master --instance-type m1.xlarge --instance-count 1 --bid-price [$$$] \
 --instance-group core   --instance-type m1.xlarge --instance-count 5 --bid-price [$$$] \
 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
 --bootstrap-action s3://[bucket]/ \
 --key-pair [YourKey] \
 --log-uri s3n://[bucket] \

where the arguments for the job are the access key, secret key, a file containing the ARC file input list (bootstrapped onto instances) and an output S3 bucket.

The bootstrapping script for copying split files onto the instances

set -e
mkdir -p /home/hadoop/splits/
hadoop fs -copyToLocal s3://[bucket]/splits/* /home/hadoop/splits/

An example for a split, e.g. split_1


Hive code

For the actual aggregation Hive was used. Some examples are provided here.

You can’t perform that action at this time.