Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pre-RFC: HashSet::pop() #1800

Open
joshtriplett opened this issue Nov 24, 2016 · 28 comments
Open

pre-RFC: HashSet::pop() #1800

joshtriplett opened this issue Nov 24, 2016 · 28 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@joshtriplett
Copy link
Member

(Originally filed as rust-lang/rust#37986, but a pre-RFC belongs in the RFC repo, not the rust repo. Re-filing it here.)

I'd like to have a pop() method available for HashSet, which removes an item from the set if any, and returns an Option<T> (returning None if empty). That would make it easy to iterate over a set with while let element = set.pop() { ... }, without holding a mutable reference to the set as HashSet::drain() does, so you can insert more items into the set as you iterate.

Does that seem like a reasonable method to add to HashSet? Does this need an RFC, or just a patch to std?

@comex
Copy link

comex commented Nov 25, 2016

Only issue is that it'd have to search linearly through the table for a non-empty bucket, so calling it in a loop would be O(n^2)ish.

@joshtriplett
Copy link
Member Author

@comex If you stored a "hint" index for the last location with an item, you should get O(n) amortized time to pop every item out of a set.

If you had a loop that also inserted new items, you should still avoid quadratic behavior, as long as the hash doesn't place items immediately behind the hint index every time.

@eddyb
Copy link
Member

eddyb commented Nov 25, 2016

@joshtriplett Sounds more like what a draining iterator might do.

@nrc nrc added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Nov 27, 2016
@clarfonthey
Copy link
Contributor

@eddyb, @joshtriplett specifically requested this so that they can add more items into the set during iteration, so, drain is out for this use case

@bluss
Copy link
Member

bluss commented Feb 21, 2017

This is efficient (O(1) average) with a hashmap implementation like OrderMap's, but not with Rust's current HashMap.

crates.io already provides ordermap and linked-hash-map for two different use cases, but both have O(1) pop.

@joshtriplett
Copy link
Member Author

@bluss An OrderSet would solve this for me. It'd be nice to have those in core Rust, but I don't mind pulling in another crate.

@clarfonthey
Copy link
Contributor

@joshtriplett These are all part of the contain-rs organisation, so, I'd talk to them about that. A lot of these data structures are deemed as non-essential and many were actually removed from libstd, but if you can make a compelling case otherwise you could probably write up an RFC for it.

I don't know where the original discussion on this was but you could probably search for it.

@Centril
Copy link
Contributor

Centril commented Apr 26, 2018

Triage ping @joshtriplett

@joshtriplett
Copy link
Member Author

I'm still interested in this and would still like to see such a method exist, or some other alternative that allows iteration without holding a reference.

@Centril
Copy link
Contributor

Centril commented Apr 26, 2018

@joshtriplett What I mean is: do you have any plans on making a PR / writing an RFC? 😃

@joshtriplett
Copy link
Member Author

@Centril Making a PR, no. Writing an RFC, maybe; depends heavily on whether one would be welcomed or not, and based on this pre-RFC I'm not sure.

@Centril
Copy link
Contributor

Centril commented Apr 26, 2018

@joshtriplett Personally I like the idea. cc @rust-lang/libs -- what do you think?

@joshtriplett
Copy link
Member Author

Does the switch to hashbrown affect this at all? Would that make this easier/harder?

@eddyb
Copy link
Member

eddyb commented Apr 20, 2019

@joshtriplett I suspect it could make this significantly faster, at least the naive linear scan.
cc @Amanieu

@Amanieu
Copy link
Member

Amanieu commented Apr 20, 2019

Yes the linear scan would be a bit faster, but you would still end up with O(n^2) complexity for a pop loop. I don't think that this is something that we want to encourage.

@jonhoo
Copy link
Contributor

jonhoo commented Apr 23, 2019

I wonder if it would be feasible to add this under a name like scan_remove_next to highlight the potential cost?

@Amanieu
Copy link
Member

Amanieu commented Apr 23, 2019

IMO anyone that needs this functionality should be using IndexMap instead, which guarantees O(1) pop. Incidentally this is similar to the hash table that Python uses for its dict.

@jonhoo
Copy link
Contributor

jonhoo commented Apr 23, 2019

That's what I'm now using in the places where I can, and you're right that that's what to use where possible! It's perhaps somewhat less discoverable, so maybe we should point people towards that from the HashMap docs (there are other "nice" features it adds that people may be used to coming from things like dict)? Unfortunately, IndexMap has the downside that items are popped in (reverse?) insertion order, which means it doesn't fit all use-cases (see, e.g., https://github.com/Amanieu/hashbrown/issues/43). That said, maybe those cases are too niche to really worry about.

@Diggsey
Copy link
Contributor

Diggsey commented Apr 23, 2019

IMO anyone that needs this functionality should be using IndexMap instead, which guarantees O(1) pop. Incidentally this is similar to the hash table that Python uses for its dict.

Unfortunately that's not very discoverable, if all you know is that you need a pop function.

@burdges
Copy link

burdges commented Apr 23, 2019

I'm nervous about such an innocuous sounding method being O(n) time too.

After NLL lands then you could write pop() like

hm.keys().next().and_then( |k| hm.remove_entry(k) )

If you need several sequential pops, or need to skip some, then adapting this code snippet gives better performance. Could both it and IndexMap, etc. be mentioned in some book?

@joshtriplett
Copy link
Member Author

joshtriplett commented Apr 23, 2019 via email

@jonhoo
Copy link
Contributor

jonhoo commented Apr 23, 2019

@joshtriplett I don't quite see how that's true? After the value in question has been removed, you need to search for the next hint value, which is the same cost as the original search anyway.

@joshtriplett
Copy link
Member Author

Not quite; you can skip re-searching large empty areas of the table. That'd help avoid O(n**2) behavior when using pop in a loop, as long as you didn't happen to insert new elements right behind the hint every time.

(I'd also be happy with an equivalent to drain() that doesn't prevent adding new items in the body of the loop.)

@burdges
Copy link

burdges commented Apr 24, 2019

I'd think some Drain::abort method achieves that easiest, no?

I'm worried that adding your hint hurts almost all HashMap usages. Also, inserting corrupts the hint for some use cases like resuming after a partial drain. It's kinda neat your hint makes pop like an undo for insert, except this sounds fragile and encourages bugs.

Instead, one could maybe expose the index, like returning it from Drain::abort, and provide some Drain::resume(HashMap<..>, usize) -> Drain method? In resume, there are explicitly no guarantees about behavior since resizing could make an index invalid, but if you want that behavior then the method exists.

@joshtriplett
Copy link
Member Author

joshtriplett commented Apr 24, 2019

I wonder if it could work to have the drain iterator provide a couple of methods to modify the HashMap that it holds a mutable reference to?

Suppose you could do this:

let iter = foo.drain();
for x in iter {
    // ...
    if bar {
        iter.modify(|foo| foo.insert(...))
    }
}

@Amanieu
Copy link
Member

Amanieu commented Apr 24, 2019

I don't think Drain is the right place for this since it keeps the table in an invalid state while iterating (items have been moved out but not marked as such in the metadata).

@avl
Copy link

avl commented May 19, 2019

Wouldn't it be good if there existed a data structure like indexmap::IndexMap in Rust std library? I've always found it surprising that the default hashmap in Rust has outright bad performance for iteration.

I suppose std HashMap is much faster for inserts and lookups, which means we can't just change std HashMap implementation to that of IndexMap.

But, for certain kinds of problems, for instance breadth-first searches in graphs, the usecase where IndexMap excels is very common. It's unsatisfactory that standard C# beats Rust performance wise for these problems.

Couldn't we arrange for IndexMap to be available in the standard library? And have the documentation for HashMap hint that if iterations is needed, IndexMap may give better performance?

@alecmocatta
Copy link

The O(n^2) for a pop loop is sufficiently surprising that I don't believe pop should be in that form in std.

As best I recall, every time I've wanted pop it's been for a pop loop that may return early and can't take ownership of the remainder of the entries. Is it possible to expose iteration of OccupiedEntrys? In which case that use case could be handled like this:

for (k, v) in map.entries().map(OccupiedEntry::remove_entry) {
    ...
}

Or, to avoid borrowing map. O(n^2) again but at least it's more obvious:

while let Some((k, v)) = map.entries().map(OccupiedEntry::remove_entry).next() {
    ...
}

In many cases of course IndexSet/Map is more appropriate. Perhaps that crate can be referenced in the HashSet/Map docs as providing .pop()?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

No branches or pull requests