hadoop python test

A simple Python word count test, based on:

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

and configured to run on Gordon (HPC cluster at the San Diego Supercomputer Center):

http://www.sdsc.edu/us/resources/gordon/gordon_hadoop.html

It uses the Hadoop streaming interface to send input to, and collect output from, the Python mapper and reducer. HDFS is set up on the local SSD flash drives of the compute nodes; the output is then copied back to local space.
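As a rough illustration of the streaming contract, the sketch below shows the two steps as plain Python functions working on lines of text. It is not the code in mapper.py or reducer.py; the function names `map_lines` and `reduce_lines` and the sample sentences are made up for this example. The mapper emits tab-separated `word<TAB>1` records, and the reducer relies on the records arriving sorted by key, which Hadoop streaming takes care of between the two steps.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer step: sum the counts of consecutive records that share a key.

    Assumes the input is already sorted by key, as it is when Hadoop
    streaming hands mapper output to the reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical sample input; sorted() stands in for Hadoop's sort phase.
text = ["the quick brown fox", "the lazy dog and the fox"]
mapped = sorted(map_lines(text))
counts = list(reduce_lines(mapped))

for record in counts:
    print(record)
```

In a real streaming job each step would read from standard input and write to standard output instead of taking and returning Python iterables.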

How to run:

Clone the repository into your home folder.
Grab the input files by running download-inputs.sh in the gutemberg folder.
Run: qsub run.sh
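Before submitting the job to the queue, it can be useful to check a mapper and reducer on a small input without Hadoop. The sketch below emulates the usual `cat input | mapper | sort | reducer` pipeline with `subprocess`. The helper `run_streaming_locally` and the two inline toy scripts are hypothetical and exist only to make the example self-contained; they are not part of this repository.

```python
import subprocess
import sys


def run_streaming_locally(mapper_cmd, reducer_cmd, input_text):
    """Emulate `cat input | mapper | sort | reducer` on a single machine."""
    mapped = subprocess.run(
        mapper_cmd, input=input_text, capture_output=True, text=True, check=True
    ).stdout
    # Stand-in for Hadoop's sort phase: bring identical keys next to each other.
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run(
        reducer_cmd, input=shuffled, capture_output=True, text=True, check=True
    ).stdout


# Toy stand-ins for a streaming mapper and reducer (assumptions for this demo).
TOY_MAPPER = """
import sys
for line in sys.stdin:
    for word in line.split():
        print(word, 1, sep="\\t")
"""

TOY_REDUCER = """
import sys
from collections import Counter
counts = Counter()
for line in sys.stdin:
    word, count = line.split("\\t")
    counts[word] += int(count)
for word in sorted(counts):
    print(word, counts[word], sep="\\t")
"""

result = run_streaming_locally(
    [sys.executable, "-c", TOY_MAPPER],
    [sys.executable, "-c", TOY_REDUCER],
    "b a\na b b\n",
)
print(result, end="")
```

To try the same check on real scripts, the two commands could be replaced with something like `[sys.executable, "mapper.py"]` and `[sys.executable, "reducer.py"]`, assuming both read standard input and write standard output as Hadoop streaming requires.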

Run on Amazon Elastic MapReduce with mrjob

See the mrjob/ folder; more details at:

http://www.andreazonca.com/2013/09/run-hadoop-python-jobs-on-amazon-with-mrjob.html

Repository contents:

gutemberg-output
gutemberg
mrjob
README.md
mapper.py
reducer.py
run.sh

zonca/python-wordcount-hadoop
