Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds splitBy extension method on scala.collection.Iterator #4

Merged
merged 1 commit into from
Aug 21, 2019

Conversation

jeantil
Copy link
Contributor

@jeantil jeantil commented May 18, 2018

Iterator#splitBy constructs an iterator where consecutive
elements of the original iterator are accumulated as long as the output
of a key function for each element doesn't change. They are emitted as
an Iterable as soon as the key function changes.

This operation makes sense as soon as you are trying to process an
iterator where you know the elements will be sorted in a certain way and
you need to group them without loading all the data in memory.

For instance

  • processing a file where the ordering is guaranteed but the file
    doesn't fit in the heap,
  • processing a streaming resultset where the underlying database
    guarantees the ordering because of a sort clause.

The same operation is added on Iterables with the difference that the
specific container type of the input is preserved for both collection
levels of the output, thus

  • Set(1,2,3).splitBy(identity) returns
    Set(Set(1), Set(2), Set(3))
  • Vector(1,2,3).splitBy(identity) returns
    Vector(Vector1), Vector2), Vector3))
  • etc.

Copy link

@thebignet thebignet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good method to add and the use cases are fairly common to add to this lib.

@jeantil jeantil force-pushed the iterator-group_until_changed branch 2 times, most recently from 7a3f8ec to 7e5521a Compare May 20, 2018 20:15
@jeantil
Copy link
Contributor Author

jeantil commented May 20, 2018

I made cosmetic changes after the first review :

  • inlined local val key as it was used only once
  • replace custom-defined readHead with Iterator#nextOption since the behaviour is exactly the same.

@jeantil
Copy link
Contributor Author

jeantil commented Jun 29, 2018

Hello

@julienrf I don't mind if this isn't a priority, but just want to make sure this PR was submitted at the right place (not to mention I would love feedback on the actual implementation :) )

cheers

@julienrf
Copy link
Collaborator

Hello Jean! Sorry for being silent. Yes, the PR looks great! We are still unsure about what will happen to scala-collection-contrib, though. I will come back to you soon. Is that ok for you?

@jeantil
Copy link
Contributor Author

jeantil commented Jun 29, 2018

Hi Julien, thats perfectly ok for me. I'll keep monitoring the thread here and at contributors.scala-lang.org

cheers :)

@joshlemer
Copy link
Member

If we're gonna add this splitBy method on Iterators, it probably makes sense to add it on the Iterables as well, which delegates to basically `this`.iterator.splitBy(f).map(`this`.fromSpecific) or something like that. See for instance, how this PR adds both implementations https://github.com/scala/scala-collection-contrib/pull/18/files#diff-ea5808f2cdf7aad91c2879f38e526320R34

@jeantil jeantil force-pushed the iterator-group_until_changed branch from 7e5521a to d659dd1 Compare May 2, 2019 15:44
@jeantil
Copy link
Contributor Author

jeantil commented May 3, 2019

@joshlemer I gave a try to adding this to IterableDecorator.
I am not entirely sure what the resulting signature should be. I went for Iterator[C[A]] according to your hint so that Vector(1,1,2,2,3,3).splitBy(identity) would return Iterator[Vector[A]] but I don't know if rewrapping the inner collections in a Vector actually brings much value.

@julienrf
Copy link
Collaborator

@jeantil I think it should return at least Iterable[CC[A]], but maybe CC[CC[A]] is a more obvious default?

Copy link
Collaborator

@julienrf julienrf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of having the following signature as a Seq[A] decorator?

def splitBy[B](f: A => B): CC[CC[A]]

And, similarly, on Iterator[A]:

def splitBy[B](f: A => B): Iterator[Iterator[A]]

@jeantil jeantil changed the title adds groupUntilChanged extension method on scala.collection.Iterator adds splitBy extension method on scala.collection.Iterator Jul 30, 2019
@joshlemer
Copy link
Member

@julienrf
for Iterators:

We could aim for consistency with the most closely related method on iterator, grouped, it would then return an Iterator[immutable.Seq[A]]

for Seq's:

Since it's not changing the underlying element type, I think it could return CC[C]:
def splitBy[B](f: A => B): CC[C]

This would also be more consistent with def splitAt(n: Int): (C, C)

@jeantil
Copy link
Contributor Author

jeantil commented Jul 30, 2019

Here is what I came up with for a quick shake up :

def splitBy[K, That,CC](f: seq.A => K)(implicit bf: BuildFrom[C, seq.A, That], bff: BuildFrom[C,That,CC]):CC =
    bff.fromSpecific(coll)(seq(coll).iterator.splitBy(f).map(bf.fromSpecific(coll)))

this compiles, but then I get compilation errors on the tests because of ambiguous implicits

[error] Note that implicit conversions are not applicable because they are ambiguous:
[error]  both method IterableDecorator in package decorators of type [C](coll: C)(implicit it: scala.collection.generic.IsIterable[C])scala.collection.decorators.IterableDecorator[C,it.type]
[error]  and method SeqDecorator in package decorators of type [C](coll: C)(implicit seq: scala.collection.generic.IsSeq[C])scala.collection.decorators.SeqDecorator[C,seq.type]
[error]  are possible conversion functions from scala.collection.immutable.Vector[A] to ?{def splitBy: ?}
[error]     val groupedVector:Vector[Vector[Nothing]] = Vector.empty.splitBy(identity)
[error]                                                        ^

But that won't work with CC[C] since there is no guarantee that C is preserved (as per @joshlemer's proposal)

I tried getting it to work on Iterable instead but I get in other kind of errors and I'm short on time to dig deeper :(

@julienrf
Copy link
Collaborator

@jeantil is splitBy defined on both IterableDecorator and SeqDecorator? It should be defined only on one of those.

@julienrf
Copy link
Collaborator

@joshlemer You raised good points ;)

We could aim for consistency with the most closely related method on iterator, grouped, it would then return an Iterator[immutable.Seq[A]]

I don’t know what is the reason for grouped to have this return type, though.

for Seq's:

Since it's not changing the underlying element type, I think it could return CC[C]

👍

@jeantil
Copy link
Contributor Author

jeantil commented Jul 30, 2019

I realize I may not have pushed everything, I had started adding it on IterableDecorator and it seems i never pushed the commit (it was in response to #4 (comment))
(sorry about the account juggling)

def splitBy[K, That](f: it.A => K)(implicit bf: BuildFrom[C, it.A, That]):Iterator[That]

is as far as I managed to compile for IterableDecorator.

@jeantil jeantil force-pushed the iterator-group_until_changed branch from 8a007ef to 226accb Compare August 17, 2019 12:39
@jeantil
Copy link
Contributor Author

jeantil commented Aug 17, 2019

Hi,
I have squashed the iterations on IteratoDecorator#splitBy, narrowed it's signature to Iterator[immutable.Seq[A]]and changed IterableDecorator#splitBy to return CC[CC[A]] when it is given a C[A].

Doing the latter, I switched from delegating to iterator to a specific implementation which
I found easier to work with to achieve the expected type signature.

The corresponding tests have been updated too and I applied a formatter for
consistency.

@jeantil jeantil force-pushed the iterator-group_until_changed branch from 226accb to b732778 Compare August 17, 2019 12:52
Copy link
Collaborator

@julienrf julienrf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your patience, Jean!

Doing the latter, I switched from delegating to iterator to a specific implementation which
I found easier to work with to achieve the expected type signature.

I understand, however the drawback is that your implementation is strict by default. I think it would be nice if it was lazy on LazyList.

@jeantil
Copy link
Contributor Author

jeantil commented Aug 19, 2019 via email

@jeantil
Copy link
Contributor Author

jeantil commented Aug 20, 2019

I had some free time this morning thanks to kids sleeping in :)

@julienrf
Copy link
Collaborator

Looks good to me, thanks! Do you mind squashing the commits into a single one?

@jeantil
Copy link
Contributor Author

jeantil commented Aug 20, 2019 via email

`Iterator#splitBy` constructs an iterator where consecutive elements of
the original iterator are accumulated as long as the output of a key
function for each element doesn't change.

This operation makes sense as soon as you are trying to process an
iterator where you know the elements will be sorted in a certain way and
you need to group them without loading all the data in memory.

For instance
 * processing a file where the ordering is guaranteed but the file
 doesn't fit in the heap,
 * processing a streaming resultset where the underlying database
 guarantees the ordering because of a sort clause.

The same operation is added to `Iterable` with the difference that the
specific container type of the input is preserved for both collection
levels of the output, thus

* `Set(1,2,3).splitBy(identity)` returns
    `Set(Set(1), Set(2), Set(3))`
* `Vector(1,2,3).splitBy(identity)` returns
    `Vector(Vector1), Vector2), Vector3))`
* etc.
@jeantil jeantil force-pushed the iterator-group_until_changed branch from 718abf1 to c57a652 Compare August 21, 2019 07:02
@jeantil
Copy link
Contributor Author

jeantil commented Aug 21, 2019

Done I squashed everything to a single commit

@julienrf julienrf merged commit aa605f7 into scala:master Aug 21, 2019
@julienrf
Copy link
Collaborator

🚀

@nafg
Copy link

nafg commented Sep 12, 2019

How is this different than span?

@jeantil
Copy link
Contributor Author

jeantil commented Sep 12, 2019

The simplified type signatures for both operations look like the following which should give a hint as to the difference

def span(p: (A)  Boolean): (Iterable[A], Iterable[A]) 

vs

def splitBy[K](f: (A) => K):Iterable[Iterable[A]]

But the difference increases in the details :

  • span accumulates all the elements as long as the predicate is satisfied, however it assumes the accumulation condition can be derived from a single element
  • splitBy accumulates all elements which successively resolve to the same key. Writing a predicate for this requires access to the previous element to know if the key changed which isn't available in span.
  • groupBy accumulates all elements which resolve to the same key, consuming all the input in the process.

I think of splitBy as closer to a lazy groupBy that avoids consuming the whole input than to span.

@nafg
Copy link

nafg commented Sep 12, 2019 via email

@jeantil
Copy link
Contributor Author

jeantil commented Sep 12, 2019

It was initially called groupUntilChanged and was renamed to splitBy as per comments during the review process...

I think that chunkBy is an interesting proposal though. I don't mind making another PR to rename it. I would prefer to have @julienrf and @joshlemer 's thoughts on this before I make it :)


edit
I found the discussion that led to the renaming, it seems that the name splitby is consistent with the naming in other tools including Mathematica...

@nafg
Copy link

nafg commented Sep 12, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants