Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for the GroupBy and GroupByMut iterators #80552

Open
2 of 3 tasks
Kerollmops opened this issue Dec 31, 2020 · 23 comments
Open
2 of 3 tasks

Tracking Issue for the GroupBy and GroupByMut iterators #80552

Kerollmops opened this issue Dec 31, 2020 · 23 comments
Labels
A-iterators Area: Iterators A-slice Area: [T] C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. Libs-Tracked Libs issues that are tracked on the team's project board. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@Kerollmops
Copy link
Contributor

Kerollmops commented Dec 31, 2020

Feature gate: #![feature(slice_group_by)]

This is a tracking issue for the GroupBy and GroupByMut iterators.

This feature exposes the group_by and group_by_mut methods on the slice and mutable slice types, these methods return the GroupBy and GroupByMut iterators structs respectively. Those two iterators return subslices of the original slice where a user-defined function returns true for two following elements of the slice.

Public API

These methods can return subslices that contains equal elements:

let slice = &[1, 1, 1, 3, 3, 2, 2, 2];

let mut iter = slice.group_by(|a, b| a == b);

assert_eq!(iter.next(), Some(&[1, 1, 1][..]));
assert_eq!(iter.next(), Some(&[3, 3][..]));
assert_eq!(iter.next(), Some(&[2, 2, 2][..]));
assert_eq!(iter.next(), None);

they can also be used to extract the sorted subslices:

let slice = &[1, 1, 2, 3, 2, 3, 2, 3, 4];

let mut iter = slice.group_by(|a, b| a <= b);

assert_eq!(iter.next(), Some(&[1, 1, 2, 3][..]));
assert_eq!(iter.next(), Some(&[2, 3][..]));
assert_eq!(iter.next(), Some(&[2, 3, 4][..]));
assert_eq!(iter.next(), None);

Steps / History

Unresolved Questions

@Kerollmops Kerollmops added C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Dec 31, 2020
@m-ou-se m-ou-se added A-iterators Area: Iterators A-slice Area: [T] Libs-Tracked Libs issues that are tracked on the team's project board. labels Dec 31, 2020
@purpleP
Copy link

purpleP commented Apr 24, 2021

It's probably too late, but would you consider renaming this functions to a more proper name like split_on_change or group_on_change?

group_by kind of should give a transitive closure of a relationship.

@mrpink76
Copy link

Just an alternative name suggestion:
split_by_neighbour

@marcospb19
Copy link
Contributor

I think that split_on_change and split_by_neighbour are confusing because other split_* iterators skip elements, except for split_inclusive*.

Here a "Group" is a subslice of contiguous elements, each of the subslices is delimited by the element(s) where the predicate fails. Is there other important definition of a group that should be prioritized over this?

Minor thing, but I would prefer is there was a indication that the elements are an adjacent/contiguous subslice, is that a common definition of a group? Some names I thought about:

  • group_subslices
  • group_contiguous
  • subslice_by
  • group_subsequences

@SkiFire13
Copy link
Contributor

Is there other important definition of a group that should be prioritized over this?

I usually think of a group as a set of elements, not necessarily contiguous. This is the meaning it has for example in Java and SQL.

In the original RFC someone proposed calling them chunks_by or split_by, which IMO seems more fitting. Between the two I would go for the chunks_by. split_by seems to be overlapping with the Needle API, although you could also argue that this may be part of it.

@lacasaprivata2
Copy link

Yes - group_by in most languages (and my own macros in c/c++ by example) does not require a contiguous range but rather the entire width of the iterator

@lacasaprivata2
Copy link

lacasaprivata2 commented Sep 6, 2021

In response to the comment in Java, the common approach in that language is to apply a

collect(___)

to the end of a stream (iterator)

and inside the parens choose the type of grouping, such as Collections.toList() or Collections.toMap(key_fn, value_fn)

@vbfox
Copy link

vbfox commented Sep 23, 2021

I'm looking at this issue after a confusion with itertool's group_by on Vec that behave the same (group only contiguous elements) and was wondering if the naming of the std proposal was better.

So i'm adding my voice here to a change of name, group_by doesn't evoke the fact that only contiguous elements are grouped. group_contiguous from @marcospb19 suggestions would feel clearer to me.

@mjhanninen
Copy link

mjhanninen commented Oct 17, 2021

Couple more candidates

  • runs_by
  • streaks_by

@rsalmei
Copy link

rsalmei commented Dec 2, 2021

For me the name is perfect, it has the same meaning as Python's groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby
Looking forward to this!

@purpleP
Copy link

purpleP commented Dec 3, 2021

For me the name is perfect, it has the same meaning as Python's groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby
Looking forward to this!

Not to argue with you as you’re just expressing your subjective opinion, but it got me thinking about it more deeply.

I’m not a linguist, but I think the function with group inside it’s name should produce groups by a particular criterion which doesn’t have anything to do with the ordering (kind of what SQL group by clause). And what pythons groupby (which I think was inspired by Haskell) is doing is more appropriately described as splitting the sequence and not grouping it’s elements.

I would like to use objective criteria here and not historical reasons. Are there any linguists or mathematicians here?

By the way I almost never seen pythons groupby used without sorting the sequence first. I mean it’s main use IMHO is to produce equivalence classes from the sequence.

@obscurans
Copy link

Mathematician here - don't think you're going to get much from math:

  1. The overriding meaning of "group" is the noun for abstract algebra
  2. Haskellers had no problem with going for groupBy

The clearest parallel, in my 2c, is to the str::split series, which returns iterator over slices. My proposal is to invert the sense of the predicate and make it split_between. group_between also seems reasonable.

@jblachly
Copy link

jblachly commented Jan 2, 2022

D calls this operation chunkBy, although Rust already uses chunk to mean fixed group of N elements in std::slice IIRC.

Re @purpleP comment about language and meaning:
Agree that "group" does not semantically convey precisely what's happening. chunk_by IMO better conveys that the operation will "fail" if the Vec is not already sorted. However, given naming constraints (chunk already having a separate meaning in rust) and other languages' naming conventions, group_by seems the most sense at this time.

PS: Looking very much forward to this feature as I'm porting functional-style D to Rust; what's blocking besides naming?

[0] https://dlang.org/library/std/algorithm/iteration/chunk_by.html

@SkiFire13
Copy link
Contributor

SkiFire13 commented Jan 2, 2022

In my opinion split_by (or some other variation starting with split) is the one that makes most sense.

  • split because we already have split, which has very similar semantics with the proposed method. The only difference between them is that split takes an FnMut(&T) -> bool while the proposed method takes an FnMut(&T, &T) -> bool;
  • _by because we already have other methods that end in _by and take a closure that takes two consecutive elements, the most clear example being Vec::dedup_by. Following this naming a _by_key would also make sense.

I would like to voice against group_by because even though some languages use it for this exact same function, others don't. Many people have the expectation that a group is not limited to consecutive elements, and it would be broken if this method was named group_by.
The current code examples (the first one in particular) aggravate this naming issue even more by making it look like the group will contain all the elements that match the predicate, matching the expectation of users whose previous language gave another meaning to group_by.

although Rust already uses chunk to mean fixed group of N elements in std::slice IIRC.

That's not entirely true, chunks can return a slice of less than N elements in the last iteration. And if you take away the fixed size constraint then the meaning is the same, just an iterator yielding subslices.

@leonardo-m
Copy link

I'm finding this function quite useful in my code.

It's similar to a split, but it takes a function (predicate) with two arguments (diadic) instead of a function with one argument. So if you don't like the "group_by" Python-inspired name (that takes a single argument predicate) then do you like split_on_adjacent? :-)

Regarding the implementation, currently it performs a linear scan:

if (self.predicate)(l, r) { len += 1 } else { break }

When (after the sort) the runs are long enough, a more aggressive strategy could be faster. Something like galloping of D language (exponential grow, followed by binary search), or an arithmetic grow followed by binary search. This can't work if the input slice isn't sorted.

@Kerollmops
Copy link
Contributor Author

Kerollmops commented Jan 18, 2022

Hey @leonardo-m,

If you need something more polished than this nightly method in the standard library, I have also worked on a crate with something similar to the galloping algorithm you describe here. The library is called slice-group-by and is available on crates.io.

@leonardo-m
Copy link

I've just tried that crate and it works well (in my case it gives no speedup probably because my runs are generally very short).

@leonardo-m
Copy link

leonardo-m commented Jan 18, 2022

Another note, inside next() of GroupBy there's this line:

let (head, tail) = self.slice.split_at(len);

Inside slice::split_at there's this assert:

assert!(mid <= self.len());

I've seen that such assert isn't removed in the asm of my group_by usages.

I think it's because of this missed LLVM optimization:
https://users.rust-lang.org/t/hare-and-tortoise-indexes-analysis/63985

So are benchmarks telling us that it is performance-wise worth giving a hint (inside next() of GroupBy) to help LLVM remove that assert on mid?

@SkiFire13
Copy link
Contributor

Note that with a carefully placed assert you can convince LLVM to remove any unnecessary bound check. https://godbolt.org/z/c4TxMM9Tr

@leonardo-m
Copy link

SkiFire13, despite I have some experience with optimizing Rust code and finding ways to remove bound tests, your code feels a bit magical. Very nice. I think that change is worth pushing in the Rust stdlib.

@sdroege
Copy link
Contributor

sdroege commented May 9, 2022

One thing that I needed in my use-case in addition to what GroupBy (and the slice-group-by crate) provide is access to the remainder of the iterator at any point in time, i.e. an accessor for the GroupBy::slice field.

@arniu
Copy link

arniu commented Sep 4, 2022

The chunks are slices and do not overlap.

So the name chunks_by / chunks_by_mut is a better one.

@devsnek
Copy link
Contributor

devsnek commented Feb 7, 2023

i wanted to sort items from an iterator into groups so i searched "rust group iterator", found these methods, and was extremely confused for about 90 seconds until i had processed that these do something entirely different from what i am trying to do.

Thinking about this for a bit, find_runs or runs_by would be better names, in my opinion at least.

@beck
Copy link

beck commented Feb 7, 2023

Taking a cue from more-itertools, call it split_when and negate the predicate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-iterators Area: Iterators A-slice Area: [T] C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. Libs-Tracked Libs issues that are tracked on the team's project board. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests