PyToolz tries to support other parallel processing libraries. It does this
by ensuring easy serialization of
toolz functions and providing
architecture-agnostic parallel algorithms.
toolz is developed against
Multiprocessing or distributed computing requires the transmission of functions
between different processes or computers. This is done through serializing the
function into text, sending that text over a wire, and deserializing the text
back into a function. To the extent possible PyToolz functions are compatible
with the standard serialization library
pickle library often fails for complex functions including lambdas,
closures, and class methods. When this occurs we recommend the alternative
Example with parallel map
Most parallel processing tasks may be significantly accelerated using only a
parallel map operation. A number of high quality parallel map operations exist
in other libraries, notably
threading (if your operation is not processor bound).
In the example below we extend our wordcounting solution with a parallel map. We show how one can progress in development from sequential, to multiprocessing, to distributed computation all with the same domain code.
from toolz.curried import map from toolz import frequencies, compose, concat, merge_with def stem(word): """ Stem word to primitive form >>> stem("Hello!") 'hello' """ return word.lower().rstrip(",.!)-*_?:;$'-\"").lstrip("-*'\"(_$'") wordcount = compose(frequencies, map(stem), concat, map(str.split), open) if __name__ == '__main__': # Filenames for thousands of books from which we'd like to count words filenames = ['Book_%d.txt'%i for i in range(10000)] # Start with sequential map for development # pmap = map # Advance to Multiprocessing map for heavy computation on single machine # from multiprocessing import Pool # p = Pool(8) # pmap = p.map # Finish with distributed parallel map for big data from ipyparallel import Client p = Client()[:] pmap = p.map_sync total = merge_with(sum, pmap(wordcount, filenames))
This smooth transition is possible because
mapabstraction is a simple function call and so can be replaced. By contrast, this transformation would be difficult if we had written our code with a for loop or list comprehension.
- The operation
wordcountis separate from the parallel solution.
- The task is embarrassingly parallel, needing only a very simple parallel strategy. Fortunately this is the common case.
PyToolz does not implement parallel processing systems. It does however provide parallel algorithms that can extend existing parallel systems. Our general solution is to build algorithms that operate around a user-supplied parallel map function.
In particular we provide a parallel
This fold can work equally well with