
andrewdefries/query_wet_news


Requirements

The requirements are the WARC tools by Hanzo and s3cmd, a command-line S3 client written in Python. For our AMI (Debian Wheezy), these are taken care of by executing:

./Bootstrap.sh

Consider using this via the EMR service.

This will install s3cmd via:

sudo apt-get install s3cmd

s3cmd --configure
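For reference, a minimal Bootstrap.sh might look roughly like the sketch below. This is only a sketch: it assumes pip is present on the AMI and that the Hanzo WARC tools are installed from PyPI under the package name warctools.

#!/bin/bash
# Install s3cmd (command-line S3 client) and pip from the Debian repositories
sudo apt-get update
sudo apt-get install -y s3cmd python-pip
# Install the Hanzo WARC tools (assumed PyPI package name: warctools)
sudo pip install warctools
# Interactive step: prompts for your AWS access key and secret key
s3cmd --configure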

You must supply your AWS credentials, by your own method, in order to download buckets using your AWS account.
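Once s3cmd is configured, a single crawl file can be pulled down by hand. The paths below are illustrative placeholders, not actual keys from the crawl:

# List the contents of a 2014 crawl (the bucket path is illustrative)
s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/
# Download one compressed WET file to the current directory
s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/segments/<segment-id>/wet/<file>.warc.wet.gz .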

Background

The Common Crawl corpus (www.commoncrawl.org) is a non-profit digital archive of "snapshots" of the web, hosted as AWS S3 buckets.

Common Crawl Corpus via AWS S3 buckets

query_wet_news

Looks through the Common Crawl archives for news entries.

Parallelization strategy

Currently prototyping Map-Reduce principles using BASH forking (backgrounding jobs with &). For example:


./CycleThroughFilterShuf.sh & ./CycleThroughFilterShuf.sh ... 
 

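Equivalently, a fixed number of copies can be forked in a loop and collected with wait. A minimal sketch (WORKERS is just an illustrative variable name):

#!/bin/bash
WORKERS=4                          # number of parallel copies to fork (illustrative)
for i in $(seq 1 "$WORKERS"); do
    ./CycleThroughFilterShuf.sh &  # fork one copy into the background
done
wait                               # block until every forked copy has finished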
Program execution sequence

The S3 buckets to query are read from a list. The program CycleThroughFilterShuf.sh shuffles this list with the BASH shuf command and then cycles through it, so that you process a random sample of segments from the total crawl. The bucket list for 2014 is here
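In outline, the sampling behaves roughly like the sketch below. The file name bucket_list_2014.txt, the sample size, and the filtering step are illustrative stand-ins for what CycleThroughFilterShuf.sh actually does:

#!/bin/bash
# Shuffle the bucket list and keep a random sample of 10 segments
shuf bucket_list_2014.txt | head -n 10 | while read bucket; do
    s3cmd get "$bucket" .              # download one WET segment
    # ... filter the downloaded segment for news entries here ...
done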
