
Change default weight of parallel iterators to assume expensive ops #49

Closed
nikomatsakis opened this issue Feb 7, 2016 · 11 comments

@nikomatsakis
Member

Currently, parallel iterators assume cheap operations. I am thinking that should be changed to assume expensive operations (i.e., fine-grained parallel splits), and have people opt in to the current behavior by manually adjusting the weights or calling weight_min. My reasoning is:

  1. The worst case behavior today is that you don't see any parallel speedup, which sucks. I see a lot of questions about this.
  2. The worst case behavior with the new default would be that you see less speedup than you expected. If we do a good job optimizing, this cost should be fairly low.
  3. It seems to be what I want more often.

Thoughts?
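For illustration, the trade-off above can be sketched in plain Rust. This is not rayon's implementation; it is a hand-rolled divide-and-conquer sum where a `threshold` parameter plays the role of the sequential threshold under discussion. A low threshold means fine-grained splits (good when per-element work is expensive); a high threshold means coarse splits (good when it is cheap).

```rust
// Illustrative sketch only (hypothetical names, not rayon's API):
// recursive divide-and-conquer with a sequential cutoff.
fn par_sum(data: &[u64], threshold: usize) -> u64 {
    if data.len() <= threshold {
        // Below the threshold, fall back to sequential execution.
        return data.iter().sum();
    }
    let mid = data.len() / 2;
    let (left, right) = data.split_at(mid);
    // Run the two halves concurrently on scoped threads. Rayon would
    // instead push one half onto a work-stealing deque in a thread pool.
    std::thread::scope(|s| {
        let l = s.spawn(|| par_sum(left, threshold));
        let r = par_sum(right, threshold);
        l.join().unwrap() + r
    })
}

fn main() {
    let data: Vec<u64> = (1..=1000).collect();
    // 1 + 2 + ... + 1000 = 500_500, regardless of how we split.
    assert_eq!(par_sum(&data, 64), 500_500);
    println!("sum = {}", par_sum(&data, 64));
}
```

The default value of `threshold` is exactly what this issue is about: assuming cheap ops means a large default threshold, assuming expensive ops means a small one.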

@dirvine

dirvine commented Feb 7, 2016

Intel et al.'s parallel_for etc. do not seem to assume cheap operations, and in C++ at least they force a refactor of some code to allow parallelism. I wonder if there is a tension between being completely automatic and being specific. I see this lib and think: wow, imagine if it were just a drop-in for iter() or similar; then I think a bit and realise folk really need to design their algorithms to take advantage.

I know this does not directly answer the question, but I feel it is linked. In refactoring for parallelism, code often gets cleaner; it forces/allows the programmer to reason. This is my quandary at the moment, to be honest. I love the idea of automating parallelism, but I have found in the past that being told by the compiler I need to refactor can be handy.

In essence I feel opt-in to be the way forward here, possibly later giving hints at optimisation of algorithms for parallelism at compile time. Hope that helps a little, anyway.

@nikomatsakis
Member Author

@dirvine thanks for the input. A few thoughts:

  1. I'm not trying to make anything completely automatic -- you have to choose to write par_iter. But I do want to make it easy.
  2. Rust's type system has the nice benefit of basically encouraging you (though not forcing you) to write parallel-safe code from the get-go, so I do hope that only minimal refactoring is needed.
  3. Depending on your loop and what it does, though, you may still need to do some manual refactoring. Basically, the more you rely on the iterator adapters, versus writing ad-hoc code in your for loop body, the better off you are (though some adapters, like fold, are not very parallelizable -- hopefully you can make do with reduce).
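The fold-vs-reduce point can be shown in a small std-only sketch (no rayon dependency). A sequential fold threads one accumulator through every element in order, while reduce only needs an identity and an associative combine, so partial results from independently processed chunks can be merged in any grouping:

```rust
fn main() {
    let data: Vec<u32> = (1..=100).collect();

    // Sequential fold: a single running accumulator, inherently ordered.
    let seq: u32 = data.iter().fold(0, |acc, &x| acc + x);

    // The "parallel shape" of reduce: fold each chunk independently,
    // then combine the per-chunk partial sums. Because `+` is
    // associative with identity 0, chunking does not change the answer.
    let par_shape: u32 = data
        .chunks(10)
        .map(|chunk| chunk.iter().fold(0, |acc, &x| acc + x))
        .fold(0, |acc, partial| acc + partial);

    assert_eq!(seq, par_shape); // both are 5050
    println!("{} == {}", seq, par_shape);
}
```

A fold whose accumulator step is not associative (e.g. one that depends on strict left-to-right order) has no such decomposition, which is why it resists parallelization.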

@nikomatsakis nikomatsakis added this to the 1.0 milestone May 16, 2016
@nikomatsakis
Member Author

Hmm. I have been experimenting with this in a branch. One interesting result: when I ported the nbody demo, the par-reduce variant (which can generate quite a lot of inexpensive tasks...) ran ridiculously slowly until I raised the sequential threshold. This isn't really surprising, I guess -- the defaults are very wrong for this case -- but it did, of course, point out the danger of changing our weights.

@nikomatsakis
Member Author

I guess if we did more work on making task spawning cheap (work that would be very profitable in any case), that might help here. (For that matter, par-reduce is still always slower than the more coarse-grained version.)

@nikomatsakis
Member Author

The branch (for the record) is no-more-weight.

@nikomatsakis
Member Author

Definitely significant progress here with @cuviper's https://github.com/nikomatsakis/rayon/pull/106. I still think we want to remove the existing weight stuff before 1.0 -- and maybe add back with some other APIs.

@edre
Contributor

edre commented Oct 21, 2016

Maybe rayon could sample how long leaf nodes take to run and dynamically adjust? Of course some elements may require much more processing than others, but starting with fine grained splitting and dynamically increasing splits may get the best of both worlds.

@nikomatsakis
Member Author

@edre we already do this, effectively, via the mechanism of work stealing as well as the adaptive changes. What we are talking about is tuning that mechanism.

@nikomatsakis
Member Author

In particular, I think the current mechanism should work pretty well except when task costs are both highly variable and the bigger tasks are clumped together.
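To make the adaptive mechanism concrete, here is a rough, self-contained sketch of the splitting heuristic being discussed (hypothetical names like `Splitter` and `try_split`; not rayon's actual code): start with a split budget proportional to the thread count, halve it on each split, and replenish it whenever a piece of work is observed to be stolen, so that clumps of expensive work keep getting subdivided on demand.

```rust
// Illustrative sketch of demand-driven adaptive splitting.
struct Splitter {
    num_threads: usize,
    splits: usize, // remaining split budget
}

impl Splitter {
    fn new(num_threads: usize) -> Self {
        let n = num_threads.max(1);
        // Start with a budget of roughly 2 splits per thread.
        Splitter { num_threads: n, splits: n * 2 }
    }

    // Decide whether to split this job further. `stolen` means another
    // worker took this job off our deque, i.e. there is idle capacity.
    fn try_split(&mut self, stolen: bool) -> bool {
        if stolen {
            // Demand observed: reset the budget so we keep subdividing.
            self.splits = self.num_threads * 2;
        }
        if self.splits > 0 {
            self.splits /= 2;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut s = Splitter::new(4); // budget starts at 8
    assert!(s.try_split(false)); // 8 -> 4
    assert!(s.try_split(false)); // 4 -> 2
    assert!(s.try_split(false)); // 2 -> 1
    assert!(s.try_split(false)); // 1 -> 0
    assert!(!s.try_split(false)); // budget exhausted: stop splitting
    assert!(s.try_split(true)); // a steal replenishes the budget
    println!("ok");
}
```

With this shape, uniform cheap work stops splitting quickly, while a clump of expensive work keeps triggering steals and therefore keeps splitting -- which is exactly the variable-and-clumped case the comment above identifies as the hard one.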

@nikomatsakis
Member Author

Now that @cuviper added dynamic balancing, I think this is basically all done. Or done enough. Closing in favor of #111.
