An implementation of a real-world map-reduce workflow in each major framework.

Realistic Hadoop Data Processing Examples

This code is to accompany my blog post on map reduce frameworks

This repository implements a single business question (described below) in each of the major Map Reduce frameworks.

Each implementation gets its own subdirectory with its own build and run instructions. Each framework also gets an accompanying test and an in-depth walkthrough of the implementation details.

The following implementations are complete:

The problem

The Data

We have two datasets: customers and transactions.

Customer Fields:

Transaction Fields:

  • transaction-id (1)
  • product-id (1)
  • user-id (1)
  • purchase-amount (19.99)
  • product-description (a rubber chicken)

These two datasets are stored in tab-delimited files somewhere on HDFS.
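
As a rough illustration of the file format, here is a minimal Python sketch that parses one tab-delimited transaction line. It assumes the columns appear in the same order as the field list above; the `Transaction` type and the `parse_transaction` helper are hypothetical names used only for this example, not code from the repository.

```python
from typing import NamedTuple


class Transaction(NamedTuple):
    transaction_id: str
    product_id: str
    user_id: str
    purchase_amount: float
    product_description: str


def parse_transaction(line: str) -> Transaction:
    # Assumption: columns are tab-separated and ordered as in the field list.
    tid, pid, uid, amount, description = line.rstrip("\n").split("\t")
    return Transaction(tid, pid, uid, float(amount), description)


row = parse_transaction("1\t1\t1\t19.99\ta rubber chicken\n")
print(row.product_description)
```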

The Question

For each product, we want to know the number of locations in which that product was purchased.

That's it!
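
To make the question concrete, here is a small in-memory Python sketch of the logic: join transactions to customers on the user id, then count distinct locations per product. It assumes each customer record carries a user id and a location, and the function name `locations_per_product` is made up for this example. A real job in any of the frameworks would do the join and the distinct count in a distributed way rather than with a dictionary in memory.

```python
from collections import defaultdict


def locations_per_product(customers, transactions):
    # customers: iterable of (user_id, location) pairs (assumed shape)
    # transactions: iterable of (product_id, user_id) pairs
    location_of = dict(customers)
    seen = defaultdict(set)
    for product_id, user_id in transactions:
        location = location_of.get(user_id)
        if location is not None:  # inner join: skip purchases by unknown users
            seen[product_id].add(location)
    # The number of distinct locations is the size of each set.
    return {product_id: len(locs) for product_id, locs in seen.items()}


customers = [("1", "US"), ("2", "US"), ("3", "GB")]
transactions = [("1", "1"), ("1", "2"), ("1", "3"), ("2", "3"), ("2", "9")]
result = locations_per_product(customers, transactions)
print(result)
```

Product 1 was bought by users in two distinct locations and product 2 in one; the purchase by the unknown user 9 is dropped by the join.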

In the real world, we might have other questions, like the number of purchases per location for each product.
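
A follow-up question like this one changes only the aggregation step. The sketch below, again in plain Python with assumed record shapes and a hypothetical function name, counts purchases per (product, location) pair instead of counting distinct locations.

```python
from collections import Counter


def purchases_per_location(customers, transactions):
    # customers: iterable of (user_id, location) pairs (assumed shape)
    # transactions: iterable of (product_id, user_id) pairs
    location_of = dict(customers)
    counts = Counter()
    for product_id, user_id in transactions:
        if user_id in location_of:  # inner join on user id
            counts[(product_id, location_of[user_id])] += 1
    return counts


sample_customers = [("1", "US"), ("2", "US"), ("3", "GB")]
sample_transactions = [("1", "1"), ("1", "2"), ("1", "3"), ("2", "3")]
per_location = purchases_per_location(sample_customers, sample_transactions)
print(per_location)
```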
