Skip to content
This repository has been archived by the owner. It is now read-only.

Add "Generator" / "Foreachable" types? #15

Closed
lihaoyi opened this issue Jan 14, 2017 · 18 comments

Comments

@lihaoyi
Copy link

commented Jan 14, 2017

I don't know if this is the right place for this discussion/proposal/idea, so feel free to close this issue if it's not.

I have a Geny library that provides a Generator type, which is basically the core of Traversable: a simple data-type based around .foreach, whose operations are fused/lazy until you call a terminal operation. It is essentially the push-based dual to the existing pull-based scala.Iterators, and is similar to what java.util.Stream provides.

It's currently lacking an abstract interface similar to TraversableOnce, but one could be added that would codify the .foreach method without all the existing fused/lazy transformation methods. For the purpose of this issue let's call this hypothetical interface Foreachable.

trait Foreachable[+A]{
  def foreach(f: A => Unit): Unit
}

Would there be any interest in adding a Foreachable interface or Generator data-type to the collections library?

Where would Foreachable be useful?

Every Iterator/Iterable would be a Foreachable, but not every Foreachable would be an Iterable.

I think having a pure trait that defines foreach would be useful, as there are tons of cases where you can define a foreach method but cannot realistically define an Iterator:

  • The renderTo(sb: StringBuilder) method of a JSON AST, which can be wrapped in an iterable, but needs some non-trivial optics/lenses/paths to make it work and probably at a significant performance cost

  • The renderTo(sb: StringBuilder) method of scalatags.Text.Frag template fragments, which could in theory be wrapped in an iterable, but at a terrible performance cost, as well as requiring quite an extensive rewrite of the internal implementation to make it happen.

  • Third-party SAX parsers, or other event-based recursive-descent parsers. These cannot be easily wrapped in an iterable.

  • The streaming output of the Scala compiler, coming from the scala.tools.nsc.Global object's def compile method. This could be trivially be converted into a Traversable[String], but is impossible to convert to an Iterable[String] without extreme measures (described below)

(taken from a post on reddit I made that goes into more details why a Generator/Foreachable is not "just" an Iterable)

If I wanted to deal with the streaming results from any of these sources, I would not be able to use an Iterator/Iterable (see link above), but with a Foreachable interface I could apply that to all these data-sources and then transform/manipulate them generically just as you would transform/manipulate iterators generically. I think that would be nice.

Where would Generator be useful?

Apart from the broad generality of this Foreachable data-type (not sure you can really call it a collection, since it doesn't hold elements, but neither does Iterator) it is also useful because a fused/lazy implementation such as geny.Generator can guarantee cleanup operations run after the terminal operations are complete. For example, if I call:

val x: C[String] = readFileLines()
println(x.slice(10, 20).mkString("\n"))

To print the lines 10-20 of a particular file, if...

  • If C is an Iterator, we will leak the file handle

  • If C is a "self-closing-iterator" that is common in Scala libraries, it will still leak a file handle since those typically only close when the iterator completes. Even if you try to be clever and override slice (and take and init and ...) someone else using "raw" next/hasNext operations will likely end up leaking your file handle

  • If C is a more concrete collection (Vector, List, etc.) you can ensure that the file gets closed by the time readFileLines() returns, but it will end up loading the whole file into memory just to take the first 20 lines.

  • If C is a geny.Generator, the file will only get opened during the terminal operation mkString, which iterates-and-discards the first 10 lines, reads lines 10-20, joins them together before closing the file.


In conclusion, Foreachable/Generator provides iterator-like "streaming"/fusing/laziness together, with broader and more general applicability than Iterator/Iterable provides (use cases described above) and the potential for deterministic-cleanup (which is a perennial pain, e.g. in scala.io.Source, at least in my experience). Ammonite-Ops now uses geny.Generators as the return type for all it's lazy/"streaming" operations, so it can ensure that the proper resources get released and you don't run out of file descriptors because you did read.iter(path).take(10) to read the first 10 lines of many different files.

The Scala 2.x incantation of Traversable is not suitable for this task, because it brings along with it dozens (hundreds?) of concrete method implementations, most of which are eager/copying, build up the "entire" data-structure in memory each time, and are not compatible with e.g. the concrete fused/lazy implementation of Generator I've linked above.

WDYT? Is this something worth continuing discussing in the context of this strawman?

@odersky

This comment has been minimized.

Copy link
Contributor

commented Jan 15, 2017

I think it would definitely be worth discussion. We have already started on reddit, but for more details here is the right place.

Here are some initial thoughts:

  • There are fewer operations that make sense on Traversable than on Iterable
  • Operations should have different implementations
  • Both of these things mean that the "new" Traversable will be hard to integrate in a collection hierarchy: If we implement traversable ops as methods on Traversable, we have to override them all on Iterable, and we are back to the fat classes mess that we want to get away from. If we implement Traversable ops as decorators, we get a better fit, but it means that we lose runtime overriding so when the static type is Traversable, all operations use the Traversable implementations which might be less efficient than the Iterable ones.
  • So maybe a separate type is better, even though it means it will be a huge pain to keep operations in sync between Iterable and Traversable.
@lihaoyi

This comment has been minimized.

Copy link
Author

commented Jan 15, 2017

I'm not really familiar with the current way the strawman works, but the way I imagined this fitting in, relative to how things worked in Scala 2.x (let's call this imaginary world lihaoyi-strawman collections) would be:

  • TraversableOnce stays more-or-less unchanged, with Iterator still inheriting from it

  • Traversable's concrete implementations of map filter etc (which in Scala 2.x all create concrete copies using builders) get moved down the hierarchy into Seq or some other trait.

  • Traversable is now left with foreach, and abstract methods like map and filter. Thus in my head, Traversable would become the same as Foreachable described in the initial issue.

  • Seq, or some other class down the hierarchy, override these to use their current (in Scala 2.x) builder-taking implementations

  • Generator is now a subclass of Traversable, and overrides the abstract methods like map and filter etc. with "lazy"/fusing implementations like those I've implemented in geny.Generator

Thus, in lihaoyi-strawman collections, Traversable is now a mostly-abstract interface similar to what TraversableOnce is in Scala 2.x collections. The logic of implementing a lazy, fusing, possibly resource-clean-upping Traversable is left to the Generator implementation.

In lihaoyi-strawman collections,

  • If you want something with lazy/fusing operations that may need resource cleanup, you ask for a Generator.

  • A method asking for a Traversable means it does not care whether you get a Seq or Iterable or Generator

  • A method asking for a TraversableOnce does not care if you get a Seq or Iterable or Generator or Iterator


Now, that's only what I had in my head as lihaoyi-collections, abstractly, using the Scala 2.x collections as a baseline. I'm not sure what parts of that would or wouldn't work in practice, nor do I know how it would fit into the "new world" of collection-strawman with your decorators and re-organized trait hierarchy and other things I'm not familiar with =P

@szeiger

This comment has been minimized.

Copy link
Collaborator

commented Jan 16, 2017

What operations would you define on Traversable and TraversableOnce? We'd either have to stick to a small set of methods or require lots of reimplementation for your Generator type.

How do you distinguish between terminal and non-terminal operations? Java 8 Streams makes this very explicit but Scala collections don't. For example, does map on a Traversable return a lazy view or a proper collection?

@lihaoyi

This comment has been minimized.

Copy link
Author

commented Jan 16, 2017

What operations would you define on Traversable and TraversableOnce? We'd either have to stick to a small set of methods or require lots of reimplementation for your Generator type.

I'd expect to see most of the terminal operations (mkString, sum, reduce, fold, ...) implemented on Traversable/TraversableOnce, since those can be implemented the same way on both Generator and all the traditional collections using foreach or similar. I'd expect to see the transformations (map, filter, collect, ...) defined but not implemented, so they can be implemented in different ways (lazy vs eager-with-builder) on Generator and traditional collections

How do you distinguish between terminal and non-terminal operations?

Good question. I don't have an answer here, but how does the distinction between scala.Iterator/scala.Stream and the rest of the collections work in collections-strawman? Iterator/Stream also have this "terminal operation" distinction where some things force (and even exhaust, for Iterator) the computation while others just build up the transformation-chain. If there's some mechanism that collections-strawman uses to distinguish terminal operations for those two, maybe we could do the same thing

@szeiger

This comment has been minimized.

Copy link
Collaborator

commented Jan 17, 2017

I'd expect to see most of the terminal operations (mkString, sum, reduce, fold, ...) implemented on Traversable/TraversableOnce, since those can be implemented the same way on both Generator and all the traditional collections using foreach or similar.

Having basic implementations based on foreach is what we do in the current collections library but we should avoid that in the new design because it is inefficient. It may be faster to have two separate implementations for Generator and Iterable but there's still the risk of losing optimizations because some calls are not guaranteed to be monomorphic anymore. We should have benchmarks to evaluate the performance impacts of such a change.

but how does the distinction between scala.Iterator/scala.Stream and the rest of the collections work in collections-strawman?

It's not made explicit but seems to come down to using "Ops" classes for terminal operations (but there are also operations that do not force a result) and "MonoTransforms" / "PolyTransforms" classes for transformations (which are lazy for lazy collection types).

@Atry

This comment has been minimized.

Copy link

commented Mar 8, 2017

I have a Fastring library, which depends on Traversable as well.

@Atry

This comment has been minimized.

Copy link

commented Mar 8, 2017

I guess a @inline final def foreach on Iterable would resolve the performance issue related to monomorphism. It's not necessary to delete Traversable

@szeiger

This comment has been minimized.

Copy link
Collaborator

commented Mar 9, 2017

Having Traversable and TraversableOnce at the top of the hierarchy wouldn't hurt the new design. Ideally we'd keep them around deprecated for 2.13 to provide a smoother migration path but I don't think we have the necessary features to prevent deprecation warnings from leaking into all kinds of code that isn't deprecated.

@biboudis

This comment has been minimized.

Copy link
Collaborator

commented Apr 27, 2017

I would like to bump this discussion since I think the push vs pull discussion (or producer-owned vs consumer-owned stack) is something that concerns a lot of communities!

For instance, Rust went from push to pull in 2013 [1] and [2]. Recently claims for their inefficiencies re-emerged as discussed in Rust’s iterators are inefficient emphasizing that there is still room for improvement with internal iterators. Databases went from pull to push and Spark is doing as well [3]. Java has push, C# has pull-y LINQ and F# has also pull-y Seq.

Push-based designs may be able to produce stellar performance in some cases when method inlining succeeds for cases of maps, filters, sum of squares demonstrating stream fusion (sufficiently small method sizes needed for inlining to succeed (relevant flags on the JVM MaxInlineSize, MinInliningThreshold, etc)). At the Java side, John Rose, on his perspectives on streams performance mail, wrote about j.u.stream as a retrospection:

External iterators are easier to optimize.

(pull)

HotSpot are less good at internal iterators. If the original point of the user request fails to inline all the way into the internal looping part of the algorithm (a hidden "for" loop), the quality of the loop will be very poor. (Exception: With micro-benchmarks, the loop quality can be partially recovered by relying on a pure, clean profile. But with real system code, the hidden "for" loop will have a polluted profile.) This is the problem I have referred to as "loop customization", even though it applies to non-looping code as well (as long as there is a template that needs expanding in order to gain performance).

(push)

We can carefully study that and take under consideration his points 1, 3 and 4 for our benchmarks.

From my experience, when it comes to parallel loops (i.e., zip) the implementation must transform both streams into pull-based in order to accommodate proper termination checks (in the presence oftake operators, short-ranging the stream) and to maintain proper lazy characteristics.

Last week I created a benchmarking suite porting some of our benchmarks from our strymonas project (for the record it is a staged library based on a pull-based design with elements of push-based composition. Since we are not talking about a solution based on partial evaluation here I am not going into details).

I ported some of our benchmarks in stream-benchmarks just to motivate comparisons between the new and the old design for lazy Views. My solely purpose is to initiate discussion. Here are some very quick results (without fixing my CPU clock and performing three forked-vm executions). Baselines are hand-written fused loops, semantically equivalent with pipelines. Views are the old ones, Strawman are the new ones. I have also included a pull- and a push-based reference implementation (which captures the essence of Java 8 streams as well) to compare them quickly side by side. JVM version (1.8.0_65-b17), scalac (2.12.1) and TieredCompilation off. I am using Long integers everywhere.

preliminary

We can discuss further anything you would recommend in terms of benchmarking. If you have any request, proposal, fix, etc for lazy collections benchmarks (for ArrayViews etc) let me know here #62.

@julienrf

This comment has been minimized.

Copy link
Contributor

commented Apr 27, 2017

Thanks @biboudis for your comment.

Can you also comment on the following?

Every Iterator/Iterable would be a Foreachable, but not every Foreachable would be an Iterable.

In other words, can we safely turn a pull-based collection into a push-based one, and vice versa?

Do you think it is a good idea to have one extend the other or should we model them with unrelated types?

@Ichoran

This comment has been minimized.

Copy link
Contributor

commented Apr 27, 2017

You can't go back and forth. External iteration is trivially internalizable. Internal iteration is inscrutable (save for things like CPS). You can fake the API with caching, in some cases, but it's a lousy solution.

@Ichoran

This comment has been minimized.

Copy link
Contributor

commented Apr 27, 2017

Example:

trait External[A] {
  def hasNext: Boolean
  def next: A
}

trait Internal[A] extends External[A] {
  def foreach[U](f: A => U) { while(hasNext) f(next) }
}
@biboudis

This comment has been minimized.

Copy link
Collaborator

commented Apr 27, 2017

Nice question and accurate answer by @Ichoran! Totally agree.

Without some way to play the JITter yourself (like partial evaluation) you can't do that in an optimized way (maybe not with reasonable effort). You can go from push to pull with caching/buffering and you have to find an elegant and fast solution to short-circuiting on top of that! Check the still open issue in java world JDK-8075939 and also check that this is still unresolved.

Stream.of(new Long[]{1L,2L})
  .flatMap(x -> Stream.iterate(0L, i -> i + 2).map(y -> x * y))
  .iterator()	
  .hasNext();
}

Defunctionalizing Push Arrays shows a unified solution but this cost is noted!

Converting a Pull array to a Push array is cheap and is also subject to fusion. [..] Converting a Push array to a Pull array, however, requires computing and storing all elements to memory [..] Few such conversions is likely to be better.

@szeiger szeiger added this to the Scala 2.13.0-M4 milestone Apr 24, 2018

@SethTisue

This comment has been minimized.

Copy link
Member

commented Jun 27, 2018

not happening for 2.13. perhaps the idea still has a future. regardless, closing the ticket since we are in the process of closing out this repo.

@SethTisue SethTisue closed this Jun 27, 2018

@Atry

This comment has been minimized.

Copy link

commented Jun 27, 2018

@SethTisue Does "not happening" mean "not removal of Traversable"?

@julienrf

This comment has been minimized.

Copy link
Contributor

commented Jun 27, 2018

No, it means “no re-introduction of Traversable”.

@Atry

This comment has been minimized.

Copy link

commented Jun 27, 2018

Is there any difficulty to keep source-level backward compatibility here?

@julienrf

This comment has been minimized.

Copy link
Contributor

commented Jun 29, 2018

Traversable is defined as a deprecated alias to Iterable.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
8 participants
You can’t perform that action at this time.