Commit
Showing 1 changed file with 29 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,29 @@ | ||
This is a simple library demonstrating the analysis of the CommonCrawl dataset. | ||
This is a simple library demonstrating the analysis of the CommonCrawl dataset | ||
through implementing the canonical Hadoop Hello World program, a simple word | ||
counter. | ||
|
||
To build | ||
-------- | ||
|
||
You'll need to have Apache Ant (http://ant.apache.org/manual/install.html) | ||
installed. Once you do, run: | ||
|
||
# ant dist | ||
|
||
This step will compile the libraries and Hadoop code into an Elastic MapReduce- | ||
friendly JAR at dist/lib/HelloWorld.jar, suitable for use as a custom JAR-based | ||
Elastic MapReduce workflow. | ||
|
||
To run locally | ||
-------------- | ||
|
||
You'll need a running Hadoop installation. If you don't have one, Cloudera | ||
provides a set of OS-specific Hadoop packages that make installation easy. | ||
Check out their site: | ||
|
||
https://ccp.cloudera.com/display/SUPPORT/Downloads | ||
|
||
Once you've got Hadoop installed, you can use the 'hadoop jar' command to | ||
execute the tutorial code. Here's the pattern: | ||
|
||
hadoop jar <checkout location>/dist/lib/HelloWorld.jar org.commoncrawl.tutorial.HelloWorld <Amazon AWS access key ID> <Amazon AWS secret access key> <CommonCrawl crawl files to use as input> <HDFS output location> |
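
The pattern above takes six arguments after 'hadoop jar', and it is easy to get their order wrong. As a minimal sketch (not part of the tutorial code), the small wrapper below assembles the command from named variables and prints it for review. The variable names and every default value here are hypothetical placeholders; substitute your own checkout location, credentials, input files, and output location.

```shell
#!/bin/sh
# Sketch only: all defaults below are hypothetical placeholders.
CHECKOUT_DIR="${CHECKOUT_DIR:-/path/to/checkout}"
ACCESS_KEY_ID="${ACCESS_KEY_ID:-YOUR_ACCESS_KEY_ID}"
SECRET_ACCESS_KEY="${SECRET_ACCESS_KEY:-YOUR_SECRET_ACCESS_KEY}"
INPUT_FILES="${INPUT_FILES:-YOUR_INPUT_FILES}"
OUTPUT_LOCATION="${OUTPUT_LOCATION:-YOUR_HDFS_OUTPUT}"

# These two values come straight from the pattern above.
JAR="$CHECKOUT_DIR/dist/lib/HelloWorld.jar"
MAIN_CLASS="org.commoncrawl.tutorial.HelloWorld"

# Print the assembled command instead of running it, so it can be checked
# first. To run it for real, remove the leading 'echo'.
hello_world_cmd() {
  echo hadoop jar "$JAR" "$MAIN_CLASS" \
    "$ACCESS_KEY_ID" "$SECRET_ACCESS_KEY" \
    "$INPUT_FILES" "$OUTPUT_LOCATION"
}

hello_world_cmd
```

Printing rather than executing keeps the sketch safe to run on a machine without Hadoop, and keeps the secret key out of the script itself as long as you pass it in through the environment.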