
andrewdefries/query_wet_news


Requirements

The requirements are the WARC tools by Hanzo and s3cmd, a command-line S3 client written in Python. For our AMI (Debian Wheezy), these are taken care of by executing:

./Bootstrap.sh

Consider using this via the EMR service.

This will install s3cmd via:

sudo apt-get install s3cmd

s3cmd --configure
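For reference, a minimal Bootstrap.sh might look roughly like the sketch below. This is only a sketch: it assumes pip is present on the AMI and that the Hanzo WARC tools are installed from PyPI under the package name warctools.

#!/bin/bash
# Install s3cmd (command-line S3 client) and pip from the Debian repositories
sudo apt-get update
sudo apt-get install -y s3cmd python-pip
# Install the Hanzo WARC tools (assumed PyPI package name: warctools)
sudo pip install warctools
# Interactive step: prompts for your AWS access key and secret key
s3cmd --configure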

You must supply your AWS credentials, by your own method, in order to download buckets using your AWS account.
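Once s3cmd is configured, a single crawl file can be pulled down by hand. The paths below are illustrative placeholders, not actual keys from the crawl:

# List the contents of a 2014 crawl (the bucket path is illustrative)
s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/
# Download one compressed WET file to the current directory
s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/segments/<segment-id>/wet/<file>.warc.wet.gz .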

Background

The Common Crawl corpus (www.commoncrawl.org) is a non-profit digital archive of "snapshots" of the web, hosted as AWS S3 buckets.

Common Crawl Corpus via AWS S3 buckets

query_wet_news

Looks through the Common Crawl archives for news entries.

Parallelization strategy

Currently prototyping Map-Reduce principles using BASH forking (backgrounding jobs with &). For example:


./CycleThroughFilterShuf.sh & ./CycleThroughFilterShuf.sh ... 
 

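Equivalently, a fixed number of copies can be forked in a loop and collected with wait. A minimal sketch (WORKERS is just an illustrative variable name):

#!/bin/bash
WORKERS=4                          # number of parallel copies to fork (illustrative)
for i in $(seq 1 "$WORKERS"); do
    ./CycleThroughFilterShuf.sh &  # fork one copy into the background
done
wait                               # block until every forked copy has finished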
Program execution sequence

The S3 buckets to query are read from a list. The program CycleThroughFilterShuf.sh shuffles this list with the BASH shuf command and then cycles through it, so that you process a random sample of segments from the total crawl. The bucket list for 2014 is here
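In outline, the sampling behaves roughly like the sketch below. The file name bucket_list_2014.txt, the sample size, and the filtering step are illustrative stand-ins for what CycleThroughFilterShuf.sh actually does:

#!/bin/bash
# Shuffle the bucket list and keep a random sample of 10 segments
shuf bucket_list_2014.txt | head -n 10 | while read bucket; do
    s3cmd get "$bucket" .              # download one WET segment
    # ... filter the downloaded segment for news entries here ...
done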
