
multiprocessing/threading #24

Closed

mrocklin opened this issue Sep 26, 2013 · 15 comments

Comments

@mrocklin
Member

One benefit of naming and abstracting away common control structures (e.g. map) is that we can swap out their implementations for new technologies. In particular we may want to implement parallel versions of many operations including map, filter and groupby using multiprocessing. These could exist in a separate namespace so that code could be parallelized simply by changing imports.
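
A rough sketch of what that could look like (nothing in toolz today; the module name follows the toolz.multiprocessing suggestion later in this thread, and the code is only illustrative):

# hypothetical module, e.g. toolz.multiprocessing
from multiprocessing import Pool

def map(func, seq, chunksize=1):
    # Same call shape as the builtin map, but evaluated across worker
    # processes.  func must be picklable (e.g. a top-level function).
    with Pool() as pool:
        return iter(pool.map(func, list(seq), chunksize))

With this layout, switching a program between the sequential and parallel implementations is just a change of import, as described above.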

@mrocklin
Member Author

One issue here is that immutable data gets copied between processes. Copy-on-write in Unix forks could avoid this, but that is probably impossible because Python's reference counts live in the same memory pages as the data: https://mail.python.org/pipermail/python-ideas/2012-March/014484.html

@eigenhombre
Member

Question: processes, threads, or both?

@mrocklin
Member Author

mrocklin commented Oct 6, 2013

Question: processes, threads, or both?

I'm personally more motivated by processes (what I do is often compute-bound), but both (and more) are good directions to pursue. I suspect that what gets done will depend on the enthusiasm of whoever does it.

@mrocklin
Member Author

mrocklin commented Oct 6, 2013

A potential issue here is the interface between parallel operations: the sequences produced by one stage and consumed by the next.

For example, in wordcount (where stem is some word-stemming function and corpus is a file path)

from functools import partial
from toolz import concat, frequencies
from toolz import compose as comp  # right-to-left function composition
wordcount = comp(frequencies, partial(map, stem), concat, partial(map, str.split))
with open(corpus) as f:
    counts = wordcount(f)

the operation partial(map, stem) does its work in parallel but is forced to produce a sequential stream of values on the master process, which are then consumed by frequencies. If frequencies were parallel, then this data would again have to be distributed to different processes. This has two problems:

  1. Data is unnecessarily copied back to the master process and sent out again
  2. Flow out of the sequence can be blocked by one slow process

Queues can help with (2) by providing a buffer.

Some thoughts:

Often we only want to deal with sets or collections, not with sequences. We don't care about the sequential nature of map in this case. An unordered map would be more robust.

The serialization point at each parallel function interface is a pain, both because it halts on the slowest process and because it causes unnecessary copies (the bane of Python multiprocessing, IMO). It would be nicer if we could collect several mapped functions and filter predicates together, execute them on the destination process, and only reconcile/sequentialize at the very end. This would require a sequence abstraction (the thing we pass between partial(map, stem) and frequencies) that carried along function information.
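
A very rough sketch of that fusion idea (the names here are made up for illustration, not toolz API): compose the per-element functions into one picklable callable, ship that to the workers, and use an unordered map so results stream back without waiting on the slowest process:

from multiprocessing import Pool

class Fused:
    # Compose single-argument functions, applied left to right.  A top-level
    # class keeps instances picklable so they can be sent to worker processes
    # (the individual funcs must themselves be picklable).
    def __init__(self, *funcs):
        self.funcs = funcs

    def __call__(self, x):
        for f in self.funcs:
            x = f(x)
        return x

def fused_unordered_map(funcs, seq):
    pipeline = Fused(*funcs)
    with Pool() as pool:
        # imap_unordered yields results as workers finish, so one slow
        # process does not hold up the rest; list() just collects them here.
        return list(pool.imap_unordered(pipeline, seq))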

@eigenhombre
Member

I think the threading + queues situation is quite simple and interesting, and am preparing an example gist for you to play with and poke at.

@eigenhombre
Member

https://gist.github.com/eigenhombre/6849176

Curious about your thoughts.

@eigenhombre
Member

All this is notwithstanding the GIL, etc., obviously.

@eigenhombre
Member

Aaaaand here's a prototype of pmap: https://gist.github.com/eigenhombre/6854438

If you think this is interesting I can try cleaning it up into a PR.
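
For readers without the gist handy, a thread-based pmap along these general lines can be sketched with the standard library (an illustrative stand-in, not the actual prototype):

from concurrent.futures import ThreadPoolExecutor

def pmap(func, seq, max_workers=4):
    # Evaluate func over seq on a pool of threads, preserving input order.
    # Threads help most for IO-bound work; the GIL limits gains for
    # pure-Python CPU-bound functions.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, seq))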

@mrocklin
Member Author

mrocklin commented Oct 6, 2013

This breaks a little from tradition but I'm in favor of calling any implementation of map, just map, and not pmap.

The goal here is that someone could write sequential code and then do

# from toolz import map
from toolz.threading import map

And see the effect on their entire program.

You could always rename imports or import a toolz.threading namespace if you want to have both map and pmap active.

The philosophical reason here is that map is an abstraction for which many implementations can exist. I think that all of those implementations should have the same name.

@eigenhombre
Member

OK. Why? It seems like more than a small break from tradition.

I saw this philosophy in action with curried. I don't understand the motivation. It violates, among other things, "explicit is better than implicit", from the Zen of Python.

I can easily imagine code which would want both map and pmap, even in the same function.
I like that in Clojure, map -> pmap is a one-letter change. A zero-letter change (or an indirect multi-letter change via an import statement far from the code in question) obfuscates more than it clarifies, IMO. Having to say from toolz import map; from toolz.threading import map as pmap is something I guess I can live with, but it feels wrong to me. I would argue that such a shift in philosophy, away from both Python and Clojure tradition, needs a clear written motivation at a fairly high level (maybe even in the README).

@mrocklin
Member Author

mrocklin commented Oct 6, 2013

Philosophical:

The term map names an abstract operation that extends beyond any particular sequential or parallel implementation. Better explicit names for map would be __builtin__.map, threaded.map, or multiproc.map. map should not be reserved for a particular implementation except as a convenience chosen by the programmer, e.g. with from foo import map, as happens for __builtin__.map by default.

Practical:

The multiprocessing library uses the term map. The IPython parallel library uses the term map. I don't think that either has a pmap. However, each of these map operators is a method on a parallel computation object, avoiding the overlap with __builtin__.map.

After not looking very hard at all I can't find a pmap in the standard library. Thus I conclusively claim that using map instead of pmap doesn't break with tradition; it follows it.

I would suggest syntax something like the following:

import toolz.multiprocessing as mp
import toolz.threading as thrd

result = mp.map(...)
result = thrd.map(...)

@mrocklin
Member Author

mrocklin commented Oct 8, 2013

A good multithreading example might include the shelve module. It'd be nice to time a particular operation on some dataset in memory and then time the same operation on a much larger dataset on disk, perhaps using seque to serve as a threaded IO buffer.

It'd be really nice to see that we can obtain in-memory performance on large out-of-memory datasets.
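
A rough sketch of that benchmark, assuming seque means a threaded buffer that eagerly pulls items from a lazy sequence on a background thread (seque is only a proposed name here, and frequencies below is a stand-in for toolz.frequencies):

import shelve
import time
from queue import Queue
from threading import Thread

def seque(seq, maxsize=128):
    # Proposed threaded I/O buffer: a background thread fills a bounded
    # queue from seq, so slow disk reads overlap with downstream work.
    q = Queue(maxsize=maxsize)
    done = object()

    def fill():
        for item in seq:
            q.put(item)
        q.put(done)

    Thread(target=fill, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

def frequencies(seq):
    # Stand-in for toolz.frequencies: count occurrences of each item.
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    return counts

data = {str(i): i % 100 for i in range(100000)}

start = time.time()
frequencies(data.values())                       # in-memory timing
print("in memory:", time.time() - start)

with shelve.open("example_shelf") as db:         # same data, but on disk
    db.update(data)
    start = time.time()
    frequencies(seque(db.values()))              # buffered reads from disk
    print("shelve + seque:", time.time() - start)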

@mrocklin
Member Author

mrocklin commented Oct 8, 2013

This benefit might be more noticeable on a spinning-disk hard drive, if you still have one of those around. In particular, shelve is nicer for this than a normal open file because of the performance difference between random and sequential access.

@eigenhombre
Member

Sounds like a good test / benchmark.


@mrocklin
Member Author

Closing this. Practically speaking, toolz doesn't implement explicit parallel operators like pmap, but we do what we can to support them. More here: http://toolz.readthedocs.org/en/latest/parallelism.html

If anyone wants to discuss this further please reopen the issue.
