Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add item recovery collection APIs #1194

Merged
merged 6 commits into from Aug 27, 2015

Conversation

Projects
None yet
@apasel422
Copy link
Member

apasel422 commented Jul 8, 2015

@Gankro Gankro self-assigned this Jul 8, 2015

@Gankro Gankro added the T-libs label Jul 8, 2015

@Gankro

This comment has been minimized.

Copy link
Contributor

Gankro commented Jul 8, 2015

CC @cmr @eddyb @bluss @seppo0010 (can't remember all the people interested...)


# Alternatives

Do nothing.

This comment has been minimized.

@pnkfelix

pnkfelix Jul 9, 2015

Member

hold on, to be clear: "Do nothing" here means "Do nothing and let users write such caches via, e.g., HashMap<T, ()> " ... right?

I don't particular mind adding the functionality described here to HashSet, but I'm also not sure its strictly necessary, unless I have missed something with how HashMap<T, ()> would work.

Update: Ah, re-reading the RFC, I now see that our current HashMap API would not support that

This comment has been minimized.

@apasel422

apasel422 Jul 9, 2015

Author Member

It's not possible to use HashMap that way, because it doesn't provide any methods that return &K (or K) other than via its iterators.

@blaenk

This comment has been minimized.

Copy link
Contributor

blaenk commented Jul 16, 2015

It'd be cool if you could add another copy of the code that demonstrates the problem but showing how to use your proposed APIs to resolve it.

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Jul 16, 2015

@blaenk I'd actually like to use a more concrete motivating example, but I'll add the revised code as well.

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Jul 17, 2015

@Gankro Do you have any ideas for a better motivating example, like an algorithm that uses a set as a cache?

@eddyb

This comment has been minimized.

Copy link
Member

eddyb commented Jul 17, 2015

@apasel422 that's the usecase in the compiler: sets of hundreds of thousands of elements, used for interning/caching, that have to be identity maps right now, wasting some memory space.

@Gankro Gankro referenced this pull request Jul 17, 2015

Closed

B+ Tree in BTreeMap #27090

@benaryorg

This comment has been minimized.

Copy link
Contributor

benaryorg commented Jul 17, 2015

@eddyb Does that mean, if this RFC is implemented, the compiler can be tweaked to use less memory?

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Jul 19, 2015

I've added a WIP implementation of this RFC for BTreeMap at https://github.com/apasel422/rust/tree/rfc-1194.

@Gankro

This comment has been minimized.

Copy link
Contributor

Gankro commented Jul 20, 2015

@apasel422 It could be argued by metaphor to the current naming in #1195 that these methods could just called get_eq, remove_eq, etc...

@Gankro

This comment has been minimized.

Copy link
Contributor

Gankro commented Jul 20, 2015

This is a bit more dubious for HashMap; but not crazy.

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Jul 20, 2015

@Gankro I actually had that same thought a little while ago, but I don't think it fully translates, and would be even weirder for entries:

impl<'a, K, V> OccupiedEntry<'a, K, V> {
    fn get(&self) -> &V;
    fn get_eq(&self) -> (&K, &V); // what does eq have to do with this?
    ...
}

{key_value, key_value_mut, remove_key_value} are quite verbose, but have parallel set names {element, remove_element}. {kv, kv_mut, remove_kv} are more succinct, but don't have an obvious set version.

@Gankro

This comment has been minimized.

Copy link
Contributor

Gankro commented Jul 20, 2015

Hmm... Entry does seem to mess things up. That said... is it a tragedy if it's a bit misaligned from everything else?

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Jul 20, 2015

I think they should be consistent.

Here are some options that work for both maps and occupied entries (in addition to Map::replace, which I think is non-controversial):

  1. {kv, kv_mut, remove_kv} with VacantEntry::insert_kv
  2. {get_kv, get_kv_mut, remove_kv} with VacantEntry::insert_kv
  3. {key_val, key_val_mut, remove_key_val} with VacantEntry::insert_key_val
  4. {key_value, key_value_mut, remove_key_value} with VacantEntry::insert_key_value

And here are some options for sets (assuming that the changes in rust-lang/rust#27135 canonicalize "element" over "value" when referring to sets):

  1. {elem, remove_elem}
  2. {element, remove_element}

@apasel422 apasel422 referenced this pull request Jul 23, 2015

Closed

Add HashMap::get_pair? #27237

@i30817

This comment has been minimized.

Copy link

i30817 commented Jul 23, 2015

One common optimization that can't be done in java because of set item recovery is to just store hashes instead of elements, for the case where identity-mapping is not desirable. Are you going to give up this special case?

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Jul 23, 2015

@i30817 Rust's HashMap and HashSet already prevent that optimization: The keys (elements) must implement Eq in addition to Hash, and the types provide iterators over their contents. A hypothetical map or set providing that optimization could omit these operations, but would also have to omit iteration. In any case, just storing hashes seems more like a bloom filter than a hash table, due to the absence of a check on Eq.

@i30817

This comment has been minimized.

Copy link

i30817 commented Jul 23, 2015

Mmm makes sense. Still, it's a somewhat common optimization, maybe a bloom filter type could be added to the language.

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Jul 23, 2015

@i30817 Off-topic, but [https://crates.io/search?q=bloom filter](https://crates.io/search?q=bloom filter)

@shepmaster

This comment has been minimized.

Copy link
Member

shepmaster commented Jul 26, 2015

I'm definitely in favor of the ideas laid out here. I have the same motivating problem - a cache of strings. I've used some unsafe code to avoid double-allocating the strings, but I still have essentially a HashMap<&str, &str>.

@bkoropoff

This comment has been minimized.

Copy link

bkoropoff commented Jul 28, 2015

How would you feel about modifying OccupiedEntry to hold on to the key passed to entry so you can recover it?

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Jul 28, 2015

@bkoropoff That could be done, but it has the problem that a new key is only present for OccupiedEntrys that are obtained through Map::entry. The theoretical {max_entry, min_entry, ...} provided by RFC #1195 will not have a new key to return.

I'm therefore more inclined to add that kind of key-recovery functionality as

pub enum Entry<'a, K: 'a, V: 'a> {
    Occupied(OccupiedEntry<'a, K, V>, K),
    Vacant(VacantEntry<'a, K, V>),
}

instead of

pub struct OccupiedEntry<'a, K: 'a, V: 'a> {
    new_key: K,
    // ...
}

impl<'a, K, V> OccupiedEntry<'a, K, V> {
    /// Returns the key that was used to acquire this entry.
    // This could return `Option<K>` in order to better model the `max_entry` situation
    pub fn into_new_key(self) -> K { self.key }
}

but that would not be a backwards-compatible change. Additionally, storing the new key in the struct itself has the benefit of allowing us to provide an additional OccupiedEntry::replace_key method that updates the new key and returns the old one, much as Map::replace would:

impl<'a, K, V> OccupiedEntry<'a, K, V> {
    /// Replaces the entry's key with the one that was used to acquire this entry, if any, and
    /// returns the old key.
    ///
    /// This method always return `None` after the first call to it and for all entries
    /// acquired through `max_entry` etc.
    pub fn replace_key(&mut self) -> Option<K>;
}

This adds some complexity to the API surface and makes it harder to reason about what the behavior is. It's possible that we could add what you're proposing in a subsequent RFC instead.

@Gankro

This comment has been minimized.

Copy link
Contributor

Gankro commented Jul 29, 2015

🔔 HERE YE HERE YE THIS RFC IS ENTERING ITS FINAL COMMENT PERIOD 🔔

@aturon

This comment has been minimized.

Copy link
Member

aturon commented Aug 7, 2015

Sorry to be late to this party (I also had to miss the libs meeting this week). I'm on board with the basic motivation here, and regret the stabilization of the bool-centric methods on sets. As @Gankro said, the team decided to go forward with this RFC, modulo some bikeshedding.

That said, I feel like the RFC is proposing significantly more API expansion than is actually needed to solve the original problem -- in particular, I don't see why any changes to the entry API are needed. Could we instead take the following as a starting point (bikesheds painted in my favorite colors):

impl<T> Set<T> {
    // Like `contains`, but returns a reference to the element if the set contains it.
    fn get<Q: ?Sized>(&self, element: &Q) -> Option<&T>;

    // Like `remove`, but returns the element if the set contained it.
    fn take<Q: ?Sized>(&mut self, element: &Q) -> Option<T>;

    // Like `insert`, but replaces the element with the given one and returns the previous element
    // if the set contained it.
    fn replace(&mut self, element: T) -> Option<T>;
}

impl<K, V> Map<K, V> {
    // Like `get`, but additionally returns a reference to the entry's key.
    fn key_value<Q: ?Sized>(&self, key: &Q) -> Option<(&K, &V)>;

    // Like `get_mut`, but additionally returns a reference to the entry's key.
    fn key_value_mut<Q: ?Sized>(&mut self, key: &Q) -> Option<(&K, &mut V)>;

    // Like `remove`, but additionally returns the entry's key.
    fn remove_key_value<Q: ?Sized>(&mut self, key: &Q) -> Option<(K, V)>;

    // Like `insert`, but additionally replaces the key with the given one and returns the previous
    // key and value if the map contained it.
    fn replace(&mut self, key: K, value: V) -> Option<(K, V)>;
}

In particular, the fact that the entry APIs need an owned key to use (today, at least) seems to make the key-accessing functionality questionable. But maybe I'm missing something?

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Aug 7, 2015

@aturon We will presumably want the entry methods once #1195 is accepted, but they could be omitted for now. I think that both RFCs need to be considered together though, and it probably makes sense to avoid a proliferation of {get, get_mut, remove, and entry} methods when just {get, entry} suffice. We don't need to add take or key_value_mut on Map directly if we just add them to OccupiedEntry now, and this will benefit the queries added in #1195 as well. For example:

impl<K, V> Map<K, V> {
    fn get_pair(&self, key: &Q) -> Option<(&K, &V)>;
    fn get_entry(&mut self, key: &Q) -> Option<OccupiedEntry<K, V>>;
    fn replace(&mut self, key: K, val: V) -> Option<(K, V)>;

    fn get_max(&self) -> Option<(&K, &V)>;
    fn max_entry(&mut self) -> Option<OccupiedEntry<K, V>>>;

    fn get_lt(&self, key: &Q) -> Option<(&K, &V)>;
    fn lt_entry(&mut self, key: &Q) -> Option<OccupiedEntry<K, V>>>;

    // get_* and *_entry for le, ge, gt, min
}

impl<'a, K, V> OccupiedEntry<'a, K, V> {
    fn pair(&self) -> (&K, &V);
    fn pair_mut(&mut self) -> (&K, &mut V);
    fn into_pair_mut(self) -> (&'a K, &'a mut V);
    fn take(self) -> (K, V);
}
@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Aug 12, 2015

The libs team discussed this RFC today, and our conclusion was that it may be best to hone this down to what's precisely necessary to satisfy the motivation in the outset. To that end would it be possible to only include the set methods? Specifically:

impl<T> Set<T> {
    fn get<Q: ?Sized>(&self, element: &Q) -> Option<&T>;
    fn take<Q: ?Sized>(&mut self, element: &Q) -> Option<T>;
    fn replace(&mut self, element: T) -> Option<T>;
}
@Gankro

This comment has been minimized.

Copy link
Contributor

Gankro commented Aug 12, 2015

Specifically, I believe the supporting map methods were also decided to just be doc(hidden) and unstable.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Aug 12, 2015

I'd personally prefer the methods to be freestanding in the module so they're private to the outside world but public to the crate rather than having them in the inherent API at all.

@Gankro

This comment has been minimized.

Copy link
Contributor

Gankro commented Aug 13, 2015

@alexcrichton How does that work? Privacy can only reach up, and not down or sideways. Maps and Sets are defined in sibling modules.

@sfackler

This comment has been minimized.

Copy link
Member

sfackler commented Aug 13, 2015

They could be defined in a crate private trait. That's how I've gotten
around visibility issues before.

On Wed, Aug 12, 2015, 8:01 PM Alexis Beingessner notifications@github.com
wrote:

@alexcrichton https://github.com/alexcrichton How does that work?
Privacy can only reach up, and not down or sideways. Maps and Sets are
defined in sibling modules.


Reply to this email directly or view it on GitHub
#1194 (comment).

@Gankro

This comment has been minimized.

Copy link
Contributor

Gankro commented Aug 20, 2015

@apasel422 Can you amend the RFC to be minimal per aturon's request? I think we're good to go when that's done.

@seppo0010

This comment has been minimized.

Copy link

seppo0010 commented Aug 20, 2015

I don't understand the motivation to allow item recover from a Set but not key recover from a Map.

I was actually expecting that feature to move items from one Map to another without cloning its keys.

@SimonSapin

This comment has been minimized.

Copy link
Contributor

SimonSapin commented Aug 20, 2015

Same here, the use case that got me here was with a HashMap, not a set.

@eddyb

This comment has been minimized.

Copy link
Member

eddyb commented Aug 20, 2015

@seppo0010 The usecase I hit that needed the feature for Set was sort of a cache.
One quick example would be HashSet<Rc<str>>, you want to be able to find an existing key with &str and clone the Rc handle.
To get closer to the compiler, HashSet<&'arena str> could be used as a cache, queried with a temporary &str, that would get copied on the arena if not found in the set.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Aug 27, 2015

@apasel422 ping about the RFC update, would love to merge!

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Aug 27, 2015

@alexcrichton I haven't updated yet because it seems like there's still some dissent, based on the last few comments.

@shepmaster

This comment has been minimized.

Copy link
Member

shepmaster commented Aug 27, 2015

As another voice, only having it on sets would be acceptable for me. I am in the same boat as @eddyb — a cache.

@apasel422

This comment has been minimized.

Copy link
Member Author

apasel422 commented Aug 27, 2015

@alexcrichton Updated.

@apasel422 apasel422 referenced this pull request Aug 27, 2015

Merged

implement RFC 1194 #28043

@alexcrichton alexcrichton merged commit b8648e4 into rust-lang:master Aug 27, 2015

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Aug 27, 2015

Ok, thanks @apasel422! The consensus of the libs team is that this is a great step forward for sets and we can continue to explore the problem space for maps as the needs arise, but it seems like the most pressing parts to work with are sets today.

And of course, thanks again for the RFC @apasel422!

@apasel422 apasel422 deleted the apasel422:collection-recovery branch Aug 28, 2015

@rpjohnst

This comment has been minimized.

Copy link

rpjohnst commented Oct 3, 2017

I find myself needing this for maps.

In my case, I am building a string interner using a HashMap<Box<str>, u32> where the u32s track the order of insertion. I use this to determine whether a string is in one of a few fixed sets of symbols by doing range checks on its corresponding u32.

The OrderMap crate provides this part of the API under the name get_pair, so that name now has some precedent.

Alternatively, it would be useful for the Entry API to provide a reference to the key as well as the value after an or_insert/or_insert_with-like operation. This might be the better option since it allows the lookup to be reused when inserting new entries.

What is the best route forward here? Should I write up a new RFC?

fschutt added a commit to maps4print/polyclip that referenced this pull request Dec 16, 2017

@fschutt

This comment has been minimized.

Copy link

fschutt commented Dec 16, 2017

Well, I needed this for a function where I insert into a set, but then I immediately want an iterator to that last, inserted element in the set. Since a BTreeSet is just a BTreeMap internally, this should be possible.

The application is a scanline algorithm, where the set consists out of ordered points. I need to insert a point into a scanline and then know where it has been inserted (the position), so that I can construct an iterator to the next and previous point in the (ordered) scanline.

So for now I've forked the std::collections::BTreeSet. Watching this for when it eventually gets merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.