
multiprocessing/threading #24

Closed

mrocklin opened this issue Sep 26, 2013 · 15 comments

Comments

@mrocklin
Member

One benefit of naming and abstracting away common control structures (e.g. map) is that we can swap out their implementations for new technologies. In particular we may want to implement parallel versions of many operations including map, filter and groupby using multiprocessing. These could exist in a separate namespace so that code could be parallelized simply by changing imports.
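
A rough sketch of what that could look like (nothing in toolz today; the module name follows the toolz.multiprocessing suggestion later in this thread, and the code is only illustrative):

# hypothetical module, e.g. toolz.multiprocessing
from multiprocessing import Pool

def map(func, seq, chunksize=1):
    # Same call shape as the builtin map, but evaluated across worker
    # processes.  func must be picklable (e.g. a top-level function).
    with Pool() as pool:
        return iter(pool.map(func, list(seq), chunksize))

With this layout, switching a program between the sequential and parallel implementations is just a change of import, as described above.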

@mrocklin
Member Author

One issue here is that immutable data gets copied between processes. Copy-on-write in Unix forks could avoid this, but that is probably impossible because Python's reference counts live in the same memory pages as the data: https://mail.python.org/pipermail/python-ideas/2012-March/014484.html

@eigenhombre
Member

Question: processes, threads, or both?

@mrocklin
Member Author

mrocklin commented Oct 6, 2013

Question: processes, threads, or both?

I'm personally more motivated by processes (what I do is often compute-bound), but both (and more) are good directions to pursue. I suspect that what gets done will depend on the enthusiasm of whoever does it.

@mrocklin
Member Author

mrocklin commented Oct 6, 2013

A potential issue here is the interface between parallel operations: the sequences produced by one stage and consumed by the next.

For example, in wordcount (where stem is some word-stemming function and corpus is a file path)

from functools import partial
from toolz import concat, frequencies
from toolz import compose as comp  # right-to-left function composition
wordcount = comp(frequencies, partial(map, stem), concat, partial(map, str.split))
with open(corpus) as f:
    counts = wordcount(f)

the operation partial(map, stem) does its work in parallel but is forced to produce a sequential stream of values on the master process, which are then consumed by frequencies. If frequencies were parallel, then this data would again have to be distributed to different processes. This has two problems:

  1. Data is unnecessarily copied back to the master process and sent out again
  2. Flow out of the sequence can be blocked by one slow process

Queues can help with (2) by providing a buffer.

Some thoughts:

Often we only want to deal with sets or collections, not with sequences. We don't care about the sequential nature of map in this case. An unordered map would be more robust.

The serialization point at each parallel function interface is a pain, both because it halts on the slowest process and because it causes unnecessary copies (the bane of Python multiprocessing, IMO). It would be nicer if we could collect several mapped functions and filter predicates together, execute them on the destination process, and only reconcile/sequentialize at the very end. This would require a sequence abstraction (the thing we pass between partial(map, stem) and frequencies) that carried along function information.
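
A very rough sketch of that fusion idea (the names here are made up for illustration, not toolz API): compose the per-element functions into one picklable callable, ship that to the workers, and use an unordered map so results stream back without waiting on the slowest process:

from multiprocessing import Pool

class Fused:
    # Compose single-argument functions, applied left to right.  A top-level
    # class keeps instances picklable so they can be sent to worker processes
    # (the individual funcs must themselves be picklable).
    def __init__(self, *funcs):
        self.funcs = funcs

    def __call__(self, x):
        for f in self.funcs:
            x = f(x)
        return x

def fused_unordered_map(funcs, seq):
    pipeline = Fused(*funcs)
    with Pool() as pool:
        # imap_unordered yields results as workers finish, so one slow
        # process does not hold up the rest; list() just collects them here.
        return list(pool.imap_unordered(pipeline, seq))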

@eigenhombre
Member

I think the threading + queues situation is quite simple and interesting, and am preparing an example gist for you to play with and poke at.

@eigenhombre
Member

https://gist.github.com/eigenhombre/6849176

Curious about your thoughts.

@eigenhombre
Member

All this is notwithstanding the GIL, etc., obviously.

@eigenhombre
Member

Aaaaand here's a prototype of pmap: https://gist.github.com/eigenhombre/6854438

If you think this is interesting I can try cleaning it up into a PR.
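
For readers without the gist handy, a thread-based pmap along these general lines can be sketched with the standard library (an illustrative stand-in, not the actual prototype):

from concurrent.futures import ThreadPoolExecutor

def pmap(func, seq, max_workers=4):
    # Evaluate func over seq on a pool of threads, preserving input order.
    # Threads help most for IO-bound work; the GIL limits gains for
    # pure-Python CPU-bound functions.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, seq))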

@mrocklin
Member Author

mrocklin commented Oct 6, 2013

This breaks a little from tradition but I'm in favor of calling any implementation of map, just map, and not pmap.

The goal here is that someone could write sequential code and then do

# from toolz import map
from toolz.threading import map

And see the effect on their entire program.

You could always rename imports or import a toolz.threading namespace if you want to have both map and pmap active.

The philosophical reason here is that map is an abstraction for which many implementations can exist. I think that all of those implementations should have the same name.

@eigenhombre
Member

OK. Why? It seems like more than a small break from tradition.

I saw this philosophy in action with curried. I don't understand the motivation. It violates, among other things, "explicit is better than implicit", from the Zen of Python.

I can easily imagine code which would want both map and pmap, even in the same function.
I like that in Clojure, map -> pmap is a one-letter change. A zero-letter change (or an indirect multi-letter change via an import statement far from the code in question) obfuscates more than it clarifies, IMO. Having to say from toolz import map; from toolz.threading import map as pmap is something I guess I can live with, but it feels wrong to me. I would argue that such a shift in philosophy, away from both Python and Clojure tradition, needs a clear written motivation at a fairly high level (maybe even in the README).

@mrocklin
Member Author

mrocklin commented Oct 6, 2013

Philosophical:

The term map names an abstract operation that extends beyond any particular sequential or parallel implementation. Better explicit names for map would be __builtin__.map, threaded.map, or multiproc.map. map should not be reserved for a particular implementation except as a convenience chosen by the programmer, e.g. with from foo import map, as happens for __builtin__.map by default.

Practical:

The multiprocessing library uses the term map. The IPython parallel library uses the term map. I don't think that either has a pmap. However, each of these map operators is a method on a parallel computation object, avoiding the overlap with __builtin__.map.

After not looking very hard at all I can't find a pmap in the standard library. Thus I conclusively claim that using map instead of pmap doesn't break with tradition; it follows it.

I would suggest syntax something like the following:

import toolz.multiprocessing as mp
import toolz.threading as thrd

result = mp.map(...)
result = thrd.map(...)

@mrocklin
Member Author

mrocklin commented Oct 8, 2013

A good multithreading example might include the shelve module. It'd be nice to time a particular operation on some dataset in memory and then time the same operation on a much larger dataset on disk, perhaps using seque to serve as a threaded IO buffer.

It'd be really nice to see that we can obtain in-memory performance on large out-of-memory datasets.
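
A rough sketch of that benchmark, assuming seque means a threaded buffer that eagerly pulls items from a lazy sequence on a background thread (seque is only a proposed name here, and frequencies below is a stand-in for toolz.frequencies):

import shelve
import time
from queue import Queue
from threading import Thread

def seque(seq, maxsize=128):
    # Proposed threaded I/O buffer: a background thread fills a bounded
    # queue from seq, so slow disk reads overlap with downstream work.
    q = Queue(maxsize=maxsize)
    done = object()

    def fill():
        for item in seq:
            q.put(item)
        q.put(done)

    Thread(target=fill, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

def frequencies(seq):
    # Stand-in for toolz.frequencies: count occurrences of each item.
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    return counts

data = {str(i): i % 100 for i in range(100000)}

start = time.time()
frequencies(data.values())                       # in-memory timing
print("in memory:", time.time() - start)

with shelve.open("example_shelf") as db:         # same data, but on disk
    db.update(data)
    start = time.time()
    frequencies(seque(db.values()))              # buffered reads from disk
    print("shelve + seque:", time.time() - start)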

@mrocklin
Member Author

mrocklin commented Oct 8, 2013

This benefit might be more noticeable on a spinning-disk hard drive, if you still have one of those around. In particular, shelve is nicer for this than a normal open file because of the performance difference between random and sequential access.

@eigenhombre
Member

Sounds like a good test / benchmark.


@mrocklin
Member Author

Closing this. Practically speaking, toolz doesn't implement explicit parallel operators like pmap, but we do what we can to support them. More here: http://toolz.readthedocs.org/en/latest/parallelism.html

If anyone wants to discuss this further please reopen the issue.
