Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for slice::partition_dedup/by/by_key #54279

Open
1 of 4 tasks
Kerollmops opened this issue Sep 16, 2018 · 27 comments
Open
1 of 4 tasks

Tracking issue for slice::partition_dedup/by/by_key #54279

Kerollmops opened this issue Sep 16, 2018 · 27 comments

Comments

@Kerollmops
Copy link
Contributor

@Kerollmops Kerollmops commented Sep 16, 2018

Add three methods to the slice type, the partition_dedup, partition_dedup_by and partition_dedup_by_key. The two other methods are based on slice::partition_dedup_by.

  • Implement the feature as a PR (#54058)
  • Should the methods only return one slice instead of a tuple with two slices? (comment)
  • Annotate it with #[must_use]?
  • Stabilization PR (see instructions on forge)
@SimonSapin
Copy link
Contributor

@SimonSapin SimonSapin commented Jan 9, 2020

This issue tracks:

impl<T> [T] {
    pub fn partition_dedup(&mut self) -> (&mut [T], &mut [T])
    where
        T: PartialEq,
    {
        self.partition_dedup_by(|a, b| a == b)
    }

    pub fn partition_dedup_by<F>(&mut self, mut same_bucket: F) -> (&mut [T], &mut [T])
    where
        F: FnMut(&mut T, &mut T) -> bool,
    {…}

    pub fn partition_dedup_by_key<K, F>(&mut self, mut key: F) -> (&mut [T], &mut [T])
    where
        F: FnMut(&mut T) -> K,
        K: PartialEq,
    {
        self.partition_dedup_by(|a, b| key(a) == key(b))
    }
}

The doc-comment of partition_dedup_by (the others are similar) is:

Moves all but the first of consecutive elements to the end of the slice satisfying a given equality relation.

Returns two slices. The first contains no consecutive repeated elements. The second contains all the duplicates in no specified order.

The same_bucket function is passed references to two elements from the slice and must determine if the elements compare equal. The elements are passed in opposite order from their order in the slice, so if same_bucket(a, b) returns true, a is moved at the end of the slice.

If the slice is sorted, the first returned slice contains no duplicates.

Examples

#![feature(slice_partition_dedup)]

let mut slice = ["foo", "Foo", "BAZ", "Bar", "bar", "baz", "BAZ"];

let (dedup, duplicates) = slice.partition_dedup_by(|a, b| a.eq_ignore_ascii_case(b));

assert_eq!(dedup, ["foo", "BAZ", "Bar", "baz"]);
assert_eq!(duplicates, ["bar", "Foo", "BAZ"]);

From the implementation PR:

The benefits of adding these methods is that it is now possible to:

  • deduplicate a slice without having to allocate and possibly clone elements on the heap, really useful for embedded stuff that can't allocate for example.
  • not loose duplicate elements, because, when using Vec::dedup, duplicates elements are dropped. These methods add more flexibillity to the user.

@SimonSapin
Copy link
Contributor

@SimonSapin SimonSapin commented Jan 9, 2020

These methods feel very niche to me (given we already have the Vec::dedup family of methods), similar to some things we currently have in https://crates.io/crates/itertools rather then std::iter. Still, multiple people were involved in the implementation PR and were evidently happy enough to land it. So I’d be inclined to propose FCP to stabilize.

@rust-lang/libs, any thoughts?

@Lokathor
Copy link
Contributor

@Lokathor Lokathor commented Jan 9, 2020

Note that Vec::dedup is actually implemented in terms of these methods, so it's very important to anyone trying to implement vec-like replacements that these become available on stable.

@LukasKalbertodt
Copy link
Member

@LukasKalbertodt LukasKalbertodt commented Jan 9, 2020

@leonardo-m brought up the question of whether the second element of the returned tuple is even used. If not, it would simplify the API to return only one slice instead of a tuple. So it would be really useful to know how people use this feature. If it's merely used as a way to dedup without allocation, then yes, the first slice is sufficient. All usages I was able to find via a quick GH search indeed only used the first returned slice.

But as I agree this is pretty niche, I think it's fine if the API is a bit more complicated. Niche APIs have a stronger focus on being powerful than being easy to use, I think.

My initial thought also was that this API is too niche and shouldn't be included, but apparently it is used a lot (even excluding Rust forks, there are quite a few projects using it on the first few pages of the search results). So I'd be fine with stabilizing this as soon as the return type question is resolved.

@SimonSapin
Copy link
Contributor

@SimonSapin SimonSapin commented Jan 9, 2020

As far as I can tell if the method returns one slice the other behavior can still be achieved by only keeping the length of that returned slice and then using split_at_mut on the original slice.

@Kerollmops
Copy link
Contributor Author

@Kerollmops Kerollmops commented Jan 9, 2020

I thought that it would be great to also add the #[must_use] attribute to these functions.

@kennytm
Copy link
Member

@kennytm kennytm commented Jan 9, 2020

The link above gives ~1500 results. Excluding the word "slice_internals" and "fmt_internals" and forcing the language to be Rust leaves only 12 results. So 99% of this "lot" are very likely just rustc clones.

A more accurate search would be:

https://github.com/search?l=Rust&q=partition_dedup+NOT+slice_internals+NOT+fmt_internals+NOT+test_binary_search_implementation_details&type=Code (12 results)
https://github.com/search?l=Rust&q=partition_dedup_by+NOT+slice_internals+NOT+fmt_internals+NOT+test_binary_search_implementation_details+NOT+vec_intoiter_debug&type=Code (10 results)
https://github.com/search?l=Rust&q=partition_dedup_by_key+NOT+slice_internals+NOT+fmt_internals+NOT+vec_intoiter_debug&type=Code (1 result)

Only 3 examples actually try to use the second element:

  1. https://github.com/treuherz/AdventOfCode/blob/152da4d32c4d6a258ec669f2fc175db99a4e1df8/18/rs/src/bin/2.rs#L66-L84 uses the second part of partition_dedup() to locate the first duplicated element.

        let (_, dupes) = working.partition_dedup();
        if !dupes.is_empty() {
            assert_eq!(dupes.len(), 1);
            return dupes.first().unwrap().to_string();
        }

    but I don't think it is a good example, as it's easier to just write

        if let Some(first_dupe) = working.windows(2).find(|(p, q)| p == q) {
            return first_dupe.to_string();
        }
  2. https://github.com/sachaarbonel/poker.rs/blob/1f4ef45872307b1a3c40795ff5b125b6d82cd68f/src/bin.rs uses the second part to determine if the poker hand has pairs/triples/4 of a kind. But the message already shows how unreliable it is for the purpose.

    let (dedup, _dup) = numbers.partition_dedup();
    println!("dedup {:?} dup {:?}",dedup, _dup );
    match _dup.len() {
         0 => println!("no pair"),
         1 => println!("one pair"),
         2 => println!("three of a kind or two pairs, do an other match"),
         3 => println!("four of a kind"),
         _ => println!("not handled"),
    }

    The actual implementation in https://github.com/sachaarbonel/poker.rs/blob/1f4ef45872307b1a3c40795ff5b125b6d82cd68f/src/hand/mod.rs uses two calls of partition_dedup to distinguish two-pair (AA22) vs triples (AAA) and four-of-a-kind (AAAA) vs full-house (AAA22).

    IMO partition_dedup is far from useful as a poker hand classifier since it can't handle flushes or straights.

  3. https://github.com/smmalis37/aoc2019/blob/c1015c7d2bba8075176650ac84f4f3c53b63e1d3/src/days/day10.rs#L86-L101 Probably the only legit use case though I can't quite figure out what it's trying to do.

    loop {
        subangles.sort_unstable_by_key(|ad| ad.1);
        subangles.sort_by_key(|ad| ad.0);
        let (left, right) =
            subangles.partition_dedup_by(|&mut ad1, &mut ad2| float_equals(ad1.0, ad2.0));
        if seek < left.len() {
            let (angle, distance) = left[seek];
            let coord = angle_distance_to_coord(part1_coord, angle, distance);
            return (coord, coord.x * 100 + coord.y);
        } else {
            seek -= left.len();
            subangles = right;
        }
    }

Conclusion: The second element is extremely niche, and encourages bad algorithm. As mentioned in #54279 (comment), the original slice is not destroyed, so the two-element version can be easily reimplemented by the one-element version using split_at_mut (heck, the current implementation also uses split_at_mut):

fn partition_dedup(&mut self) -> (&mut Self, &mut Self) {
    let p = self.dedup_in_place().len();
    self.split_at_mut(p)
}

There is no reason to keep the second element from the return type.

@LukasKalbertodt
Copy link
Member

@LukasKalbertodt LukasKalbertodt commented Jan 9, 2020

@kennytm Wow, thanks for the research! I clearly didn't put enough time into my comment and was indeed fooled by my search.

In case we change the return type to only return the first part of the slice, maybe we should also think about renaming the methods. Shepmaster suggested the current name partially because of the 2-tuple return type. I don't mind the current name (the algorithm still does some kind of partitioning), but another name might be more appropriate: the return type is different than Iterator::partition and as we seem to have established, almost no one is interested in the second bucket. What about dedup_in_place?

@Lokathor
Copy link
Contributor

@Lokathor Lokathor commented Jan 9, 2020

Note: The side effect of this method is that the input slice is swapped all around on success so that the initial span of the spice is the deduped portion. Thus, must_use would be inappropriate.

@smmalis37
Copy link
Contributor

@smmalis37 smmalis37 commented Feb 8, 2020

Oh hey I've been mentioned. Yeah that use case is very much artificial, it was part of the Advent of Code, a series of puzzles. The puzzle for day 10 can be found at https://adventofcode.com/2019/day/10, although you'd need to solve part 1 before it reveals part 2 which this was part of. However as you said i could easily just use split_at_mut and be fine.

@dtolnay
Copy link
Member

@dtolnay dtolnay commented Feb 8, 2020

I would be inclined to remove these methods. This is an uncommon use case. The algorithm can be implemented in ~20 lines of safe code if you don't mind the bounds checks. I think the amount of benefit we'd deliver by stabilizing these is much less than what it costs to put these fairly complicated names and signatures in front of everyone in the documentation of slices.

@smmalis37
Copy link
Contributor

@smmalis37 smmalis37 commented Feb 8, 2020

I'd vote against removing these methods. Even if their use is uncommon they are genuinely useful. They also fit with other dedup* methods that are already in std.

@Lokathor
Copy link
Contributor

@Lokathor Lokathor commented Feb 9, 2020

As I said, those are implemented in terms of these, so these cannot be removed entirely.

Users of Vec already see this complexity, in Stable, there's no reason that it can't be available to slices too.

@dtolnay
Copy link
Member

@dtolnay dtolnay commented Feb 9, 2020

The ones on Vec are simpler signatures, simpler names, and simpler behavior to understand, and serve a somewhat more common use case. So I don't find that those necessarily provide justification for the new slice methods.

Regarding "cannot be removed entirely": they can be removed entirely from the API of the standard library.

@shepmaster
Copy link
Member

@shepmaster shepmaster commented Feb 9, 2020

I would be inclined to remove these methods.

Are you talking about only the "complicated" version:

fn partition_dedup(&mut self) -> (&mut [T], &mut [T])

or also the "simplified" version

fn partition_dedup(&mut self) -> &mut [T]

@Lokathor
Copy link
Contributor

@Lokathor Lokathor commented Feb 9, 2020

I would like to see them put on normal slices so that other crates with vec-like APIs can simply use the slice versions.

But i'm fine with the simple version, since that's what Vec offers.

@dtolnay
Copy link
Member

@dtolnay dtolnay commented Feb 9, 2020

If the intended audience is crates that provide a vec-like API, I really don't think that justifies adding these to slice. It's an uncommon use case and they can write the 20 lines themselves. I don't think providing these methods makes a dent in the actual hard parts of designing and implementing a vec-like type downstream.

@Lokathor
Copy link
Contributor

@Lokathor Lokathor commented Feb 9, 2020

Or just anyone using slices who wants to be able to do this.

Unless your argument is that adding this in the first place to Vec was itself a mistake, then the same methods should be on slices too (the simplified versions that is).

@Joey9801
Copy link

@Joey9801 Joey9801 commented Feb 10, 2020

FWIW I found a partition_dedup method useful just now when completing an advent-of-code problem, for stably rearranging of a slice like [a, a, a, b, b, c, c, c] into [a, b, c, a, b, c, a, c]. (I'm sure there's a word for that rearrangement, but can't think of it right now...)

I know it's not a significant use case, but it was convenient to have it available to me (albeit only in nightly), and I did make use of both returned slices. It allowed me to express the rearrangement in only couple of lines of relatively simple code that was O(N) time / constant memory overhead.

Ignore me, I failed to read the documentation properly:

The second contains all the duplicates in no specified order.

Indeed, the relative order of the duplicates is not preserved, and I just got lucky on the AoC problem :(

@Kerollmops
Copy link
Contributor Author

@Kerollmops Kerollmops commented Apr 24, 2020

Hey @SimonSapin,

Do you think that if I update the function name to be dedup_in_place and the signature to return a single mutable slice it could be possible to merge this PR soon or later?

@MarinPostma
Copy link

@MarinPostma MarinPostma commented Apr 24, 2020

I think dedup_in_place should return a tuple. The reason is that you'd expect it to work like it's Vec counterpart, whereas it doesn't, and it would lead to potentially hard to debug bugs:

//this is ok to do with vec:
let vec = vec![1, 1, 2, 3];
vec.dedup()
// can use vec, no problem
// same case with slices
let mut slice = [1, 1, 2, 3];
slice.dedup_in_place();
// uses slice, oops
// or worst:
let mut slice = [1, 1, 2, 3];
let new = slice.dedup_in_place();
//  one may use slice, ignore new and refer to “removed” elements

returning a tuple will prevent the later, making it #[must_use] shall prevent the former

@kennytm
Copy link
Member

@kennytm kennytm commented Apr 24, 2020

let mut slice = [1, 1, 2, 3];
let new = slice.dedup_in_place();
//  one may use slice, ignore new and refer to “removed” elements

You can't ignore new without also ignoring the unused_variables warning.

@MarinPostma
Copy link

@MarinPostma MarinPostma commented Apr 25, 2020

first case holds

@kennytm
Copy link
Member

@kennytm kennytm commented Apr 25, 2020

which suggests #[must_use] is enough.

and returning a tuple prevents none of the issues raised in your sample code anyway.

@MarinPostma
Copy link

@MarinPostma MarinPostma commented Apr 25, 2020

Fair enough. I am just concerned with the potential misuse of this function, but #[must_use] should be enough.

@Robbepop
Copy link
Contributor

@Robbepop Robbepop commented Aug 28, 2020

One of my projects could definitely profit from having these functions stabilized.
To answer the above questions from my point of view as potential user:

Should the methods only return one slice instead of a tuple with two slices? (comment)

For my personal use case I could use the second returned value but infering all information I need can also be done by only the first without overhead.
The main use of these functions is for constrained or performance critical code. If we can infer all information from just the deduplicated slice without noticable overhead we do not need the second slice returned imo.

Annotate it with #[must_use]?

As said earlier in this discussion I do not see a value in putting #[must_use] here since its side effects of scrambling the underlying slice in a certain way could already be useful to some users. Could we not stabilize it with #[must_use] and later remove it if we find use cases that would profit from having it removed? Should be a backwards compatible change, right? So if we stabilize this with #[must_use] we simply delay decisions for or against it. Does not need to be answered in order to stabilize this.

I see a clear value in having those definitions available since they are 20 lines of code that I do not have to write myself and that I can simply trust in my code base to be correctly implemented. If I were to write them myself I would maybe even end up with an unsafe implementation (after all it needs to be efficient) and beyond those 20 lines of mentioned code there is sitting technical doubt and thus a lot of test code.

@Stargateur
Copy link
Contributor

@Stargateur Stargateur commented Aug 23, 2021

If we choice to only return the dedup part, I believe we should rename dedup_in_place or something like that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet