multiprocessing/threading #24
One issue here is that immutable data is copied. This could be avoided by copy-on-write in Unix forks, but that is probably impossible because Python keeps reference counts in the same memory space as the data. https://mail.python.org/pipermail/python-ideas/2012-March/014484.html
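The copying behavior is easy to observe. A minimal sketch (the helper names are illustrative, and `get_context("fork")` assumes a POSIX system): data handed to a pool worker is pickled into the child, so mutations there never reach the parent.

```python
import multiprocessing as mp

def mutate(d):
    # runs in a worker process; `d` is an unpickled copy of the parent's dict
    d["x"] = 999
    return d["x"]

def demo():
    data = {"x": 1}
    ctx = mp.get_context("fork")  # fork gives copy-on-write pages on Unix
    with ctx.Pool(2) as pool:
        seen_in_child = pool.apply(mutate, (data,))
    # the worker saw (and changed) its copy; the parent's dict is untouched
    return seen_in_child, data["x"]
```

Even with a fork, CPython stores reference counts inside the objects themselves, so merely reading shared data dirties its pages and defeats copy-on-write, which is the point of the linked thread.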
Question: processes, threads, or both?
I'm personally more motivated by processes (what I do is often compute-bound), but both (and more) are good directions to pursue. I suspect that what gets done will depend on the enthusiasm of whoever does it.
A potential issue here is the interface between sequences that are produced by parallel operations, for example the operations in wordcount.

Queues can help with (2) by providing a buffer. Some thoughts: often we only want to deal with sets or collections, not with sequences, so we don't care about their sequential nature. The serialization point at each parallel function interface is a pain, both because it halts on the slowest process and because it causes unnecessary copies (the bane of Python multiprocessing, IMO). It would be nicer if we could collect several mapped functions and filtered predicates together, execute them on the destination process, and only reconcile/sequentialize at the very end. This would require a sequence abstraction.
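The fusion idea can be sketched without any new abstraction: push a composed map + filter to each worker and concatenate only once at the end. A sketch under assumed names (not a proposed API), using threads for brevity:

```python
from concurrent.futures import ThreadPoolExecutor

def fused(chunk):
    # the map and the filter run together on the worker; only the final
    # list crosses the worker boundary, never the intermediate sequences
    return [x * x for x in chunk if x % 2 == 0]

def square_evens(data, nworkers=4):
    chunks = [data[i::nworkers] for i in range(nworkers)]
    with ThreadPoolExecutor(nworkers) as ex:
        pieces = ex.map(fused, chunks)
    # reconcile/sequentialize only at the very end
    return sorted(x for piece in pieces for x in piece)
```

`square_evens(list(range(10)))` serializes one small list per worker instead of one per operation, which is exactly the copy-avoidance argued for above.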
I think the threading + queues situation is quite simple and interesting, and am preparing an example gist for you to play with and poke at.
https://gist.github.com/eigenhombre/6849176. Curious about your thoughts.
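For anyone skimming without opening the gist, the basic threading + queues pattern looks roughly like this (a sketch, not the gist's actual code):

```python
import queue
import threading

def producer(q, items):
    for item in items:
        q.put(item)   # blocks when the buffer is full: natural backpressure
    q.put(None)       # sentinel: no more work

def consumer(q, out):
    while True:
        item = q.get()
        if item is None:
            break
        out.append(item * 2)   # stand-in for real per-item work

def run(items):
    q = queue.Queue(maxsize=4)  # the bounded queue is the buffer
    out = []
    t1 = threading.Thread(target=producer, args=(q, items))
    t2 = threading.Thread(target=consumer, args=(q, out))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return out
```

Because `queue.Queue` handles its own locking, the producer and consumer need no explicit synchronization.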
All this is GIL, etc. notwithstanding, obviously. |
Aaaaand here's a prototype. If you think this is interesting I can try cleaning it up into a PR.
This breaks a little from tradition, but I'm in favor of calling any such implementation `map`. The goal here is that someone could write sequential code, change an import, and see the effect on their entire program. You could always rename imports or import a toolz.threading namespace if you want to have both. The philosophical reason here is that the name should describe the operation, not how it is executed.
On Oct 6, 2013, at 9:20 AM, Matthew Rocklin notifications@github.com wrote:
OK. Why? It seems like more than a small break from tradition. I've seen this philosophy in action before, and I can easily imagine code which would want both.
There are both philosophical and practical reasons against it. After not looking very hard at all, I can't find a precedent. I would suggest that the syntax be something like the following:

```python
import toolz.multiprocessing as mp
import toolz.threading as thrd

result = mp.map(...)
result = thrd.map(...)
```
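Such a namespace is easy to prototype. A sketch of what a drop-in parallel `map` might look like (hypothetical, not toolz's actual module; shown with the stdlib's thread-backed pool to sidestep pickling):

```python
from multiprocessing.pool import ThreadPool  # thread-backed, same Pool API

def map(func, seq):
    # deliberately shadows builtins.map: swapping the import is the only
    # change needed to parallelize existing call sites
    with ThreadPool() as pool:
        return pool.map(func, seq)
```

A process-backed version would use `multiprocessing.Pool` instead, with the usual caveat that `func` and the items must then be picklable.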
A good multithreading example: it'd be really nice to see that we can obtain in-memory performance on large out-of-memory datasets. This benefit might be more noticeable on a spinning-disk hard drive, if you still have one of those around. In particular, that sounds like a good test / benchmark.

On Oct 8, 2013, at 9:01 AM, Matthew Rocklin notifications@github.com wrote:
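A rough harness for the out-of-memory case could stream a file in fixed-size chunks and let threads overlap disk reads with computation. A sketch with placeholder sizes (not a real benchmark):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def count_words(lines):
    return sum(len(line.split()) for line in lines)

def wordcount(path, chunksize=1024, nworkers=4):
    # read the file lazily, chunksize lines at a time, so the whole file
    # never needs to fit in RAM; workers chew on chunks as they arrive
    with open(path) as f, ThreadPoolExecutor(nworkers) as ex:
        futures = []
        while True:
            chunk = list(itertools.islice(f, chunksize))
            if not chunk:
                break
            futures.append(ex.submit(count_words, chunk))
        # a production version would also bound the number of in-flight
        # chunks so a fast reader can't outrun slow workers
        return sum(fut.result() for fut in futures)
```

Timing this against a plain sequential loop over the same file would give the in-memory-vs-out-of-memory comparison suggested above.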
Closing this. Practically speaking, toolz doesn't implement explicit parallel operators like these. If anyone wants to discuss this further, please reopen the issue.
One benefit of naming and abstracting away common control structures (e.g. `map`) is that we can swap out their implementations for new technologies. In particular we may want to implement parallel versions of many operations, including `map`, `filter` and `groupby`, using `multiprocessing`. These could exist in a separate namespace so that code could be parallelized simply by changing imports.
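The import-swap idea in miniature: the same call site runs sequentially or in parallel depending only on which `map` is in scope (illustrative only, not toolz's API):

```python
from multiprocessing.pool import ThreadPool  # thread-backed pool, stdlib

def total_word_count(map_impl, texts):
    # `map_impl` stands in for whichever `map` the caller imported
    return sum(map_impl(lambda t: len(t.split()), texts))

texts = ["hello world", "foo bar baz"]
sequential = total_word_count(map, texts)        # builtin map

with ThreadPool(2) as pool:
    parallel = total_word_count(pool.map, texts)  # same code, parallel
```

The body of `total_word_count` never changes; only the `map` it receives does, which is the separate-namespace proposal in one screenful.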