Motivation

Kaylee is a small MapReduce implementation, meant mostly as a proof of concept to illustrate the power of ZeroMQ and for educational purposes.

My goal was not to write a Hadoop clone but to build a starting point that one could use to learn about MapReduce.

The main bottleneck in this implementation is the Shuffle phase, which requires all data to be moved back to the Server instance. That is generally bad for performance, but it lets us implement a simple shuffler with a Python defaultdict in just a few lines of code that are easy to understand.

Theory

At a high level, MapReduce can be thought of as a list homomorphism that can be written as a composition of two functions (Reduce . Map). It is parallelizable because of the associativity of the map and reduce operations: the input can be split into chunks, each chunk processed independently, and the partial results combined.

MapReduce :: [(k1, v1)] -> [(k3, v3)]
MapReduce = Reduce .  Map

MapReduce :: a -> [(k3, v3)]
MapReduce = reducefn . shuffle . mapfn . datafn
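The homomorphism property can be illustrated with a small sketch that does not use Kaylee at all. The `count_words` helper below is a hypothetical stand-in for a whole job; the point is only that processing two chunks separately and merging gives the same answer as processing the full list.

```python
from collections import Counter

def count_words(lines):
    """Count words in a list of lines (hypothetical stand-in for a full job)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

lines = ["the whale", "the sea", "the whale again"]

# Process the whole list in one go.
whole = count_words(lines)

# Process two chunks independently, then merge the partial results.
left, right = lines[:1], lines[1:]
merged = count_words(left) + count_words(right)

# Same result either way, which is what makes the work safe to distribute.
assert whole == merged
```

Because merging Counters is associative, the chunks could be split further and combined in any grouping.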

The implementation provides two functions: split (datafn) and shuffle.

shuffle :: [ (k2, v2) ] -> [(k2, [v2])]
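A shuffle with this signature can be sketched with a defaultdict, as the Motivation section suggests. This is a minimal illustration, not necessarily the exact code in Kaylee; the function and variable names are assumptions.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group values by key: [(k2, v2)] -> [(k2, [v2])]."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Keys come out in first-seen order (dicts preserve insertion order).
    return list(groups.items())

grouped = shuffle([("a", 1), ("b", 1), ("a", 1)])
```

Everything is held in one in-memory dict, which is why all mapped data has to come back to a single process before reducing.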

The user provides map and reduce.

map :: (k1,v1) -> [ (k2, v2) ]
reduce :: (k2, [v2]) -> [ (k3, v3) ]
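To make the types concrete, here is a single-process sketch of the whole pipeline for word counting. It involves no ZeroMQ and no workers; `datafn`, `mapfn`, `reducefn`, and `mapreduce` follow the names in the signatures above, but their bodies are assumptions written for illustration, not taken from example.py.

```python
from collections import defaultdict

def datafn(text):
    """Split raw text into (line_number, line) records: a -> [(k1, v1)]."""
    return list(enumerate(text.splitlines()))

def mapfn(record):
    """map :: (k1, v1) -> [(k2, v2)], emitting (word, 1) per word."""
    _, line = record
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """shuffle :: [(k2, v2)] -> [(k2, [v2])]"""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())

def reducefn(group):
    """reduce :: (k2, [v2]) -> [(k3, v3)], summing the counts per word."""
    word, ones = group
    return [(word, sum(ones))]

def mapreduce(text):
    """reducefn . shuffle . mapfn . datafn, run sequentially."""
    mapped = [pair for record in datafn(text) for pair in mapfn(record)]
    return [out for group in shuffle(mapped) for out in reducefn(group)]

result = dict(mapreduce("Call me Ishmael\ncall me"))
```

In a distributed run, the calls to `mapfn` and `reducefn` are the parts that would be farmed out to workers; the split and shuffle steps stay on the server.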

Directions:

Install ZeroMQ:

For Arch Linux

$ pacman -S zeromq

For Ubuntu Linux

$ add-apt-repository ppa:chris-lea/zeromq
$ apt-get update
$ apt-get install zeromq-bin libzmq-dev libzmq0

For Macintosh:

$ brew install zeromq

Build your virtualenv:

$ cd kaylee
$ virtualenv --no-site-packages env
$ source env/bin/activate

Install necessary packages:

$ pip install -r requirements.txt

Example

Let's do the 'hello world' of MapReduce: taking a large corpus and counting the occurrences of each word in parallel.

Grab a large dataset (Moby Dick):

$ wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
$ mv pg2701.txt mobydick.txt

Run server:

$ python example.py

Run worker(s):

$ python -m kaylee.client

Or, from a Python shell:

>>> from kaylee import Client
>>> c = Client()
>>> c.connect()
>>> c.start()

License

Released under an MIT License. Do with it what you please.
