Include parallel collections in Toolkit? #31

Open
SethTisue opened this issue Sep 14, 2023 · 31 comments
Labels
enhancement New feature or request

Comments

@SethTisue
Member

SethTisue commented Sep 14, 2023

I was a bit surprised recently to realize that we didn't include scala-parallel-collections in the Toolkit.

Just now I looked at the old spreadsheet of candidate libraries, assuming I'd see it was considered and rejected or postponed, but I don't even see a spreadsheet entry for it? Was it really never considered?

I think we should include it, or at least discuss including it. It has a lot of things going for it:

  • It's in the scala. namespace
  • It's applicable in many/all problem domains
  • It doesn't really have any direct competition. (Only heavyweight competition, such as parallel streaming in fs2.)
  • It's mature, good quality (few known bugs), and has needed little maintenance
  • It's cross-published for Scala 2.13 and 3
  • Although it's community-maintained, the Scala org has a strong interest in keeping it alive, as it's mentioned in Scala books and such. It's already a Scala "module" which is the next thing down from being in the standard library
  • We have a strong team of volunteers at @scala/collections who are interested in helping with collections stuff

What about the library's usefulness/importance?

For some tasks, it's extremely convenient and can give a large speed benefit. I think it didn't end up being quite as widely used as we'd originally envisioned, but when the unit of work that you need to happen in parallel is large enough, it's super easy to use parallel collections to get a big speedup.

Here's an example. I have a Scala-CLI script that does several hundred independent GitHub API queries. With only this small diff:

+//> using dep org.scala-lang.modules::scala-parallel-collections:1.0.4
...
+import scala.collection.parallel.CollectionConverters.* // for .par
...
-for file <- files
+for file <- files.par

I sped up the script from taking 5 minutes to taking less than 30 seconds.
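
For readers who want to try the pattern, here is a minimal self-contained sketch of the same idea. The `fetch` function and the file names are hypothetical stand-ins for the real GitHub API queries; only the `using dep` line, the import, and `.par` are taken from the diff above.

```scala
//> using dep org.scala-lang.modules::scala-parallel-collections:1.0.4

import scala.collection.parallel.CollectionConverters.* // for .par

// Hypothetical stand-in for one slow, independent unit of work
// (for example, a single HTTP query).
def fetch(file: String): Int =
  Thread.sleep(100) // simulate latency
  file.length

// Sequential version: the calls run one after another.
def fetchAllSequential(files: Seq[String]): List[Int] =
  files.map(fetch).toList

// Parallel version: the only change is `.par`.
// Mapping over a parallel sequence keeps the element order.
def fetchAllParallel(files: Seq[String]): List[Int] =
  files.par.map(fetch).toList

@main def fetchDemo(): Unit =
  val files = Seq("a.txt", "bb.txt", "ccc.txt", "dddd.txt")
  println(fetchAllParallel(files))
```

How much this helps depends on how slow each unit of work is and on how many threads the default pool provides, so it is worth timing both versions on the real workload.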

@iusildra

Would be nice to have indeed

For bigger projects that would require things like fs2, is the toolkit used over the alternatives from the same stack?

@SethTisue
Member Author

If you depend on Toolkit and other libraries too, it all just ends up piled into your classpath and your application code is free to use or ignore whatever it wants.

@armanbilge

> For bigger projects that would require things like fs2, is the toolkit used over the alternatives from the same stack?

@iusildra see also the typelevel toolkit.

https://typelevel.org/toolkit/

@lihaoyi

lihaoyi commented Sep 16, 2023

I feel like something like parallel collections would be nice, but probably not parallel collections themselves.

IIRC the codebase was pretty old, had a lot of weird non-idiomatic/no-longer-idiomatic APIs (e.g. configuring things via mutability?), wasn't super well maintained, and didn't get widely adopted even after 10+ years.

AFAIK the recommendation for years about using scala-parallel-collections was "Don't", so unless that has changed for some reason it feels very weird to bundle it with the toolkit

@OndrejSpanel
Member

@lihaoyi I am not using toolkit and I am not likely to use it in a near future, but I am using parallel collections in my application to parallelize a few inner loops processing thousands of items. If I am not to use parallel collections, what should I use instead?

> recommendation for years about using scala-parallel-collections was "Don't"

Could you provide some link? I do not see any kind of warning in https://docs.scala-lang.org/overviews/parallel-collections/overview.html. It says:

> The hope was, and still is, that implicit parallelism behind a collections abstraction will bring reliable parallel execution one step closer to the workflow of mainstream developers.

@SethTisue
Member Author

SethTisue commented Sep 17, 2023

I'll answer this point by point.

> IIRC the codebase was pretty old

I would describe it as mature and well-established. No one has needed to write a competitor, because this one does the job. I don't see why it matters how old the code is unless there is actually something wrong with the code.

> had a lot of weird non-idiomatic/no-longer-idiomatic APIs (e.g. configuring things via mutability?)

I acknowledge that configuring the thread pool usage is a little weird, but it's rare that anyone doing Toolkit-y, script-y type things would need to configure it at all. Most of the time you just want to do some stuff in parallel and it does that. Easy things easy, harder things possible.
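
For context, here is a minimal sketch of what "configuring things via mutability" looks like in practice, assuming the module's `tasksupport` setter and `ForkJoinTaskSupport` class. The helper name `squaresWithPool` is hypothetical.

```scala
//> using dep org.scala-lang.modules::scala-parallel-collections:1.0.4

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport
import scala.collection.parallel.CollectionConverters.* // for .par

// Hypothetical helper: square each number using a dedicated pool
// with the requested parallelism level.
def squaresWithPool(xs: Vector[Int], parallelism: Int): Vector[Int] =
  val pool = new ForkJoinPool(parallelism)
  try
    val pxs = xs.par
    // The mutable part: the pool is chosen by assigning to a
    // settable property on the parallel collection itself.
    pxs.tasksupport = new ForkJoinTaskSupport(pool)
    pxs.map(x => x * x).toVector
  finally pool.shutdown()
```

Without the assignment, the collection uses a default pool, which is the case the comment above describes as sufficient for most script-style use.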

> wasn't super well maintained

It doesn't need any maintenance. There aren't any open bugs to speak of.

> didn't get widely adopted even after 10+ years.

No? There are 3500 hits at https://github.com/search?q=scala-parallel-collections&type=code .

You might have this impression if people aren't talking about it, but sometimes there isn't a lot to say about a workhorse library like this.

> AFAIK the recommendation for years about using scala-parallel-collections was "Don't"

I agree with Ondrej. There is no such recommendation.

One aspect of the parallel collections that did have a negative reputation was how much they deepened and complicated the collections hierarchy when they were introduced. But we fixed that in Scala 2.13, when the parallel collections were re-engineered, became a separate library, and stopped being intertwined with serial collections.

@JD557

JD557 commented Sep 17, 2023

From the toolkit introduction:

The Toolkit supports:

  • Scala 3 and Scala 2
  • JVM, Scala.js, and Scala Native

I was under the impression that only libraries supporting Scala.js and Scala Native would be included, which I don't think is the case for the parallel collections.

Although I guess one could publish them for Scala.js and Scala Native 0.4.x with dummy implementations that just wrap the sequential collections.
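
To make the suggestion concrete, a dummy implementation could be as small as the sketch below. This is purely illustrative: `DummyParConverters` is a hypothetical name and no such artifact is published.

```scala
// Hypothetical shim, not part of any published library.
object DummyParConverters:
  extension [A](xs: Seq[A])
    // Returns the collection unchanged, so every later operation
    // runs sequentially on the calling thread.
    def par: Seq[A] = xs

import DummyParConverters.*

// Compiles like parallel code, but runs sequentially.
def doubled(xs: Seq[Int]): Seq[Int] = xs.par.map(_ * 2)
```

Code written against such a shim would cross-compile unchanged, at the cost of `.par` silently doing nothing on those platforms.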

(As an aside, I do use the parallel collections a lot on scala-cli scripts, so I'm excited to see this landing)

@lihaoyi

lihaoyi commented Sep 17, 2023

> One aspect of the parallel collections that did have a negative reputation was how much they deepened and complicated the collections hierarchy when they were introduced. But we fixed that in Scala 2.13, when the parallel collections were re-engineered, became a separate library, and stopped being intertwined with serial collections.

This may be where my impression came from. I must admit, I don't have any concrete objections here, other than a vague feeling. Perhaps it's no longer a problem since it's been modularized.

How do parallel collections play with scala.concurrent.ExecutionContext? My understanding is that implicit ec: ExecutionContext is the de facto standard way to handle thread pools and executors in Scala, being in the standard library and all.

If parallel collections are now a module, does that mean they are open to modifications? If so, then we should definitely consider sanding off some of the rough edges as part of including it in the toolkit, e.g. replacing the mutable threadpool thing with an implicit ec: ExecutionContext, and going over the rest of the doc/API/architecture to see if there's anything else we could modernize. This would be similar to what we're doing with uPickle and OS-Lib and others. If we do that, then I have no objection to including it in the toolkit
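
As a rough sketch of what an ExecutionContext-driven wrapper might look like today, the following assumes the module's `ExecutionContextTaskSupport` class. The helper `parMapOn` is hypothetical and is not a proposal from this thread.

```scala
//> using dep org.scala-lang.modules::scala-parallel-collections:1.0.4

import scala.concurrent.ExecutionContext
import scala.collection.parallel.ExecutionContextTaskSupport
import scala.collection.parallel.CollectionConverters.* // for .par

// Hypothetical helper: run a parallel map on whatever
// ExecutionContext is available in implicit scope.
def parMapOn[A, B](xs: Vector[A])(f: A => B)(using ec: ExecutionContext): Vector[B] =
  val pxs = xs.par
  // Still configured through the mutable property, but the pool
  // now comes from the caller's ExecutionContext.
  pxs.tasksupport = new ExecutionContextTaskSupport(ec)
  pxs.map(f).toVector
```

A caller would bring a context into scope, for example `given ExecutionContext = ExecutionContext.global`, and then write `parMapOn(Vector(1, 2, 3))(_ + 1)`.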

> If I am not to use parallel collections, what should I use instead?

I've generally used scala.concurrent.Future here:

val foo = items.map(x => Future{doThing(x)}).map(Await.result(_, Duration.Inf))

I admit it's a bit more clunky to use than .par, but the API can definitely be simplified if necessary, and having a consistent ExecutionContext interface that all your asynchronous operations use is definitely nice.
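
For completeness, a runnable version of the one-liner above might look like this. It uses only the standard library; `doThing` is a hypothetical stand-in for the real per-item work.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical stand-in for the real per-item work.
def doThing(x: Int): Int = x * 2

// Start one Future per item, then block until each has finished.
// Results come back in the same order as the input.
def parallelMap(items: List[Int])(using ec: ExecutionContext): List[Int] =
  val started = items.map(x => Future(doThing(x))) // all futures start here
  started.map(Await.result(_, Duration.Inf))       // then wait for each in turn
```

Because `List` is strict, every Future is started before the first `Await.result` runs. With a lazy collection or view the two steps could interleave and the work would no longer overlap.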

@He-Pin

He-Pin commented Sep 17, 2023

Nowadays there are fs2, ZIO, and Pekko/Akka Streams out there, so -1 for including parallel collections.

@Ichoran

Ichoran commented Sep 17, 2023

There isn't anything I'm aware of that matches the simplicity of the parallel collections, so I think it's worth including for that reason.

@som-snytt

Just to echo Mr Li's impression, but maybe it's just that (as happens) the real-world effectiveness did not match heightened expectations.

As usual, Mr Ichoran's summation is especially succinct and persuasive.

The reply to He-Pin's objection, and one is liable to mix up one's kerrs, is that the toolkit is about simplicity and scope. So an easy-to-use solution of limited scope is desirable.

@He-Pin

He-Pin commented Sep 17, 2023

What about sending an MR to the scala/dotty compiler and letting the compiler use it first?
@som-snytt

@SethTisue submit a poll on reddit and see?

@som-snytt

I forgot to say that I checked
https://mvnrepository.com/artifact/org.scala-lang.modules/scala-parallel-collections/1.0.4/usages?p=1
but I wasn't sure what it implies.

Released for Scala 3 nullifies any suggestion the project is moribund.

Actually, #14 in this list is impressive:
https://mvnrepository.com/open-source/collections

Worth adding that lack of tickets does not directly imply quality, but may imply usability in proportion to the use cases people use it for.

@He-Pin also worth adding that suitability for a limited use case does not imply suitability for others. But one may ask if "inclusion in the toolkit" constitutes an endorsement, and should the user receive further guidance.

I'm sorry this topic missed the recent survey that folks were complaining was already too long. "And before we let you hit submit, please tell us your thoughts on the parallel collections. Yes, that one. Good for noobs who have no opinions about concurrency and/or parallelism? or is par short for paradigmatic?"

@He-Pin

He-Pin commented Sep 18, 2023

@som-snytt Hmm, interesting, ZIO is on the list too. IIRC, ZIO has zero dependencies.

@SethTisue What's the status of scala/scala-parallel-collections#22?

@SethTisue
Member Author

SethTisue commented Sep 18, 2023

> @SethTisue What's the status of this scala/scala-parallel-collections#22

That's about 2.12, and 2.12 isn't relevant to the Toolkit, which only targets 2.13 and 3.

> I was under the impression that only libraries supporting Scala.js and Scala Native would be included, which I don't think is the case for the parallel collections. Although I guess one could publish them for Scala.js and Scala Native 0.4.x with dummy implementations that just wrap the sequential collections.

Interesting. I've recorded the suggestion at scala/scala-parallel-collections#251 .

Note that the Toolkit includes os-lib, so that's a major precedent for including something that does stuff that only works on the JVM.

@SethTisue
Member Author

I put a little straw poll up at https://twitter.com/SethTisue/status/1703774958789255199 to see what people think

@eed3si9n
Member

The point of having Toolkit is to be batteries-included for day-to-day operation, like parsing JSON a la Python, and hopefully set the new and experienced users on the paved path of Scala.

Parallel operation is an interesting one because it is one of the recognizable benefits of adopting functional programming, and to some extent it has been the battleground of "how we do things" in Scala for the last decade, especially in terms of balancing a high volume of requests against slow/limited IO operations. I feel like, whether you're coming from Akka, Typelevel, or even plain ExecutionContext, one point of consensus is to avoid performing blocking operations without marking them as such.

In other words, the simplicity of parallel collections might do the wrong thing for exactly the kind of use cases where one would want to use them, like reading many files with for file <- files.par and performing side effects, as in Seth's example. Maybe, as they say, the answer is always traverse.
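
As an illustration of "marking" a blocking operation with plain ExecutionContext, here is a minimal standard-library sketch. `slowRead` is a hypothetical stand-in for real file or network I/O, and `Future.traverse` is the standard-library method, which may not be the traverse the comment has in mind.

```scala
import scala.concurrent.{Await, ExecutionContext, Future, blocking}
import scala.concurrent.duration.Duration

// Hypothetical slow I/O call, simulated with a sleep.
def slowRead(name: String): String =
  Thread.sleep(50)
  name.toUpperCase

def readAll(names: List[String])(using ec: ExecutionContext): List[String] =
  val all = Future.traverse(names) { n =>
    Future {
      // `blocking` signals that this thread is about to block, so a
      // context that supports the hint (such as the global one) may
      // compensate, for example by adding threads.
      blocking { slowRead(n) }
    }
  }
  Await.result(all, Duration.Inf)
```

Whether a given context acts on the `blocking` hint is up to that context, so this is a hint rather than a guarantee.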

@philipschwarz

philipschwarz commented Sep 18, 2023

> the Scala org has a strong interest in keeping it alive, as it's mentioned in Scala books and such.

There is also this course: https://www.coursera.org/learn/scala-parallel-programming.

IIRC, when I went through the course, it wasn't spelled out too clearly/often that while the parallel collections used to be integrated with the main Scala release, they later got separated out into module https://github.com/scala/scala-parallel-collections.

image

I see in https://github.com/scala/scala-parallel-collections/blob/439b9c6e7e68c0407d69f7b09074ed03c82271aa/README.md?plain=1#L36 that in older versions of Scala one could just invoke .par on a collection, whereas in later versions, the following import is needed to be able to do that:

import scala.collection.parallel.CollectionConverters._

One thing that might discourage some people from looking further into parallel collections is the sentence in bold in the following passage in https://www.packtpub.com/product/learning-concurrent-programming-in-scala/9781783281411 (1st edition):

> Even when you are sure that parallel collections improve program performance, you should think twice before using them. Donald Knuth once coined the phrase Premature optimization is the root of all evil. It is neither desirable nor necessary to use parallel collections wherever possible. In some cases, parallel collections give negligible or no increase in speed. In other situations, they could be speeding up a part of the program that is not the real bottleneck. Before using parallel collections, make sure to investigate which part of the program takes the most time, and whether it is worth parallelizing. The only practical way of doing so is by correctly measuring the running time of the parts of your application. In Chapter 9, Concurrency in Practice, we will introduce a framework called ScalaMeter, which offers a more robust way to measure program performance than what we saw in this chapter.

Not sure if the second edition differs in this respect.

P.S. if there is ever a third edition, updated for Scala 3, I suggested a photo for the front cover: https://twitter.com/philip_schwarz/status/1530584650481127430

@szymon-rd
Member

szymon-rd commented Sep 19, 2023

parallel-collections may be useful in scripting and prototyping, as it's often the case that even a simple automation app does a batch of IO operations or processing. If we were to choose a library for this purpose, parallel-collections would probably make the most sense - it's very conceptually lightweight. It uses Scala's interfaces and introduces a par method and a few types. That's all, no new concepts. It requires a bit of a deeper look to ensure it's always the case, but it fits well from the perspective of usability and employed approach.

However, it's released only for the JVM. That makes sense, as there is currently no way to do that on other platforms. This will change after the next minor release of Scala Native, so we may consider releasing parallel-collections on Native. If that's done, then two out of three platforms are supported. For Scala.js, however, it will likely never be supported. We may skip the Scala.js support, as a precedent (os-lib) already exists, but that will again require carefully weighing the pros and cons.

Adding a dummy implementation for any platform would do more harm than good. One would expect that .par actually makes the code parallel. A transparent lack of support may lead to confusion and frustration. It's better not to add it to the classpath altogether.

If we added parallel-collections before releasing it for Native, we would break our promise of supporting other Scala platforms. I would argue strongly against that.

@lihaoyi

lihaoyi commented Sep 20, 2023

One question here: does scala-parallel-collections have any issues around blocking?

e.g. A naive Future-based approach, val foo = items.map(x => Future{doThing(x)}).map(Await.result(_, Duration.Inf)), works fine as a top-level call to parallelize things, but it blocks the calling thread while executing. One blocked thread isn't a big deal, but if naively nested (e.g. if doThing does its own Futures and Await.result inside), that can result in an arbitrarily large number of blocked threads, which before Loom/virtual threads cost 1-4 MB each and can quickly add up to a large resource footprint when parallelizing a large collection.

When you use .par.map, it becomes second nature to treat parallel collection transformations the same way as normal collection transformations. Thus it would be expected that people would start nesting them, intentionally or unintentionally, the same way that normal collection transformations can be nested. Is this a footgun we need to be worried about?
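
To make the question concrete, the nesting pattern looks like the sketch below. It only shows the shape of the code and makes no claim about how many threads end up blocked. `nestedSums` is a hypothetical example.

```scala
//> using dep org.scala-lang.modules::scala-parallel-collections:1.0.4

import scala.collection.parallel.CollectionConverters.* // for .par

// Both the outer and the inner transformation run on parallel
// collections, just as nested .map calls would on ordinary ones.
def nestedSums(n: Int): List[Int] =
  (1 to n).toVector.par
    .map(i => (1 to n).toVector.par.map(j => i * j).sum)
    .toList
```

For `n = 4` each inner sum is `i * (1 + 2 + 3 + 4)`, so the result is `List(10, 20, 30, 40)`.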

@SethTisue
Member Author

scala-parallel-collections is implemented using java.util.concurrent.ForkJoinPool

@szymon-rd will the new Scala Native support ForkJoinPool? if so, we have a path forward that wouldn't cause undue delay

@lihaoyi I suspect this answers your question as well. the implementation isn't naive

I don't know if @axel22 sees GitHub notifications, but perhaps he'd like to weigh in, as the first author of https://infoscience.epfl.ch/record/165523/files/techrep.pdf ("On a Generic Parallel Collection Framework", Aleksandar Prokopec, Phil Bagwell, Tiark Rompf, Martin Odersky, June 2011)

@lihaoyi

lihaoyi commented Sep 20, 2023

That it uses ForkJoinPool doesn't really answer my question; Futures can use a ForkJoinPool too, and still suffer this failure mode. One option is you run out of threads; another is you spawn more threads and eventually run out of memory because threads are expensive. I'm not aware of any third option apart from Loom virtual threads, which are a lot cheaper.

Naive or not, these are hard problems in the design space for concurrency/threading frameworks, so I'd want to know how this stuff works. The fact that someone probably thought very hard about it a decade ago doesn't tell me what tradeoffs they ended up choosing, and thus what pitfalls any user-land code will have to be careful to avoid. A quick skim of the docs didn't pull up anything here. Maybe it's not a problem, but it's worth confirming.

@armanbilge

> will the new Scala Native support ForkJoinPool

Yes, it already does in the 0.5.0 snapshots.

@som-snytt

Probably I'm the only one who ported https://github.com/axel22/ScalaDays2012-TrieMap shortly before bed last night. It was a no-brainer and worked the first time in 2023 under WSL, the requisite decade since

Author: Aleksandar Prokopec <axel22@gmail.com>
Date:   Mon May 13 17:34:35 2013 +0200

    Update version to 2.10.0

I wonder if toolkit includes scala-swing and how would I know.

@SethTisue
Member Author

the voting here was sort of incidental and what I really hoped is that more folks would come by here (I included the link) and give some more feedback beyond just a vote 🤷 anyway, it does show substantial usage

there are some "dislike/avoid" answers but we don't know what their reasons are. (one possible explanation is that they might be fs2/Akka/ZIO users who prefer working with a richer API. which is a valid position, but also out of scope for Toolkit)

[Screenshot of the poll results, 2023-09-25]

@som-snytt

occasional/casual users are still a use case

@adpi2 adpi2 added the enhancement New feature or request label Oct 18, 2023
@SethTisue
Member Author

some support at https://users.scala-lang.org/t/scala-toolkit-0-2-0-is-out-discussion/9355/4 :

> Another good addition would be to make .par available in the Toolkit for some quick and dirty parallel computations in the REPL or scripts for everyday usage. Currently I can do it with a using directive in Scala-cli but it’s worth considering adding it to the Toolkit, so it can feel like parallel collections are still part of the standard library like they used to be (it would also help students if used in the online courses).

@spamegg1

spamegg1 commented Nov 5, 2023

Yes, I support this, especially since the students of the Parallel Programming course are currently in dire need of an easy, quick way to use .par along with the now-outdated lecture video code. (I'm telling them to use Scastie or Scala-cli.)

