Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map should iterate #51

Open
joernhees opened this issue Feb 4, 2017 · 3 comments
Open

map should iterate #51

joernhees opened this issue Feb 4, 2017 · 3 comments

Comments

@joernhees
Copy link
Contributor

passing huge lists / iterables into map or map_as_completed will first "register" them all for computation and only after it exhausted them all, compute them in parallel.

try running the following with python -m scoop example.py and notice how nothing is printed for a long time:

#from scoop.futures import map as parallel_map
from scoop.futures import map_as_completed as parallel_map


def square(x):
    return x * x

if __name__ == '__main__':
    squares = parallel_map(square, range(1000000))
    for sq in squares:
        print(sq)

I think the reason is in https://github.com/soravux/scoop/blob/master/scoop/futures.py#L94 . In python 3 map returns an iterable. Even for python 2 it would be cool if the internal function would batch the input.

@soravux
Copy link
Owner

soravux commented Feb 18, 2017

You are right, SCOOP generates futures from every element of the iterable first. The reason is to be able to distribute those tasks to remote workers. If map_as_completed returned an iterable, tasks would be emitted one after another, making the computation serial and not parallel, unless we batch the inputs, as you said.

Batching the input explicitly would require the user to manually emit batches when he desires. I haven't found a way to provide a clean and simple API to perform this.

Batching implicitly (on SCOOP's end) would delay the scheduling of later tasks, which could hinder load balancing. I am not formally against this, as long as the default value does not cause surprise to the power user and performs suboptimally for the beginner.
The emission of tasks should be done in the communication thread, though, it should not wait for a call to SCOOP API from the user's code.
I'm open to pull requests in this direction.

@mjmdavis
Copy link

Perhaps a solution would be for the user to provide an argument.
'take=1000'
Then the map could take 1000, balance and dispatch then take another 1000 until stop iteration.

@mjmdavis
Copy link

My system runs out of RAM when the iterator is consumed all at once 🤗🤓

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants