Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify HashMap's capacity handling. #36766

Merged
merged 2 commits into from Oct 3, 2016

Conversation

nnethercote
Copy link
Contributor

HashMap has two notions of "capacity":

  • "Usable capacity": the number of elements a hash map can hold without
    resizing. This is the meaning of "capacity" used in HashMap's API,
    e.g. the with_capacity() function.
  • "Internal capacity": the number of allocated slots. Except for the
    zero case, it is always larger than the usable capacity (because some
    slots must be left empty) and is always a power of two.

HashMap's code is confusing because it does a poor job of
distinguishing these two meanings. I propose using two different terms
for these two concepts. Because "capacity" is already used in HashMap's
API to mean "usable capacity", I will use a different word for "internal
capacity". I propose "span", though I'm happy to consider other names.

@rust-highfive
Copy link
Collaborator

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@shepmaster
Copy link
Member

I propose "span",

I'm not in the compiler code much, but IIRC "span" is used during parsing to indicate a section of text (such as to report for an error). That particular usage of the word is fairly far away from this particular usage though, so there may not be a conflict in practice.

@aturon
Copy link
Member

aturon commented Sep 27, 2016

r? @arthurprs

@aturon aturon removed their assignment Sep 27, 2016
@arthurprs
Copy link
Contributor

arthurprs commented Sep 28, 2016

The PR has merits, it's a bit confusing whatever capacity is referring to the hashmap capacity (the exposed one) or the rawtable capacity.

But the proposed naming is a bit odd to me, maybe adding some table_/internal_ prefixes around is enough.

@nnethercote
Copy link
Contributor Author

The PR has merits, it's a bit confusing whatever capacity is referring to the hashmap capacity (the exposed one) or the rawtable capacity.

But the proposed naming is a bit odd to me, maybe adding some table_/internal_ prefixes around is enough.

It's true that the meaning of RawTable::capacity() is different to the meaning of HashMap::capacity(). That is pre-existing, and I haven't changed it with this PR, because I didn't want to change either type's API. What I have done in this PR, however, is introduce HashMap::span(), which is now the one function in the entire file that calls RawTable::capacity(), and is therefore the single place in map.rs in which this difference in meaning is exposed. (I could add a comment to span explaining this.) This is a big improvement over the current code, in which the three meanings of "capacity" (HashMap's two meanings, and RawTable's one meaning) are mixed with abandon.

If you are still unhappy, I could change all uses of "capacity" within RawTable to "span" as well. That would certainly clarify code like this:

    /// The hashtable's capacity, similar to a vector's.
    pub fn capacity(&self) -> usize {
        self.capacity
    }

As for adding prefixes, I think that's less effective. The use of an unprefixed "capacity" is baked into HashMap's API. If we add prefixes to the uses of "capacity" that aren't in the API names, that still leaves us with a mixture of prefixed and unprefixed uses, which leaves the code less clear than this PR does.

Finally, I assume this is the review for my patch? AFAICT it's an r-, but the suggestions on how to change things to move towards an r+ are quite vague. When I r- patches for Firefox I always do my best to make it clear that (a) it is an r-, and (b) what changes are required to move towards r+, or if the patch is irredeemably flawed, make that fact clear.

@arthurprs
Copy link
Contributor

That wasn't a formal review, just a comment trying to create some discussion around the problem at hand.

I think we should preferably avoid introducing another concept (span), even if it's a simple one.
It isn't immediately clear and one would need to look into the span() function to understand it's relation to the rawtable capacity. capacity is a defacto name so replacing it in the table adds yet more friction.

I propose internal_capacity (or a variation) instead, as it has a better chance of being understood upfront. It's a trade-off of course, between disambiguating and not introducing another concept.
The name is already used in the codebase, but not consistently (probably a result of the various refactors throughout the years).

If you don't agree we can always wait/weight more opinions.

@nnethercote
Copy link
Contributor Author

I propose internal_capacity

I can live with that. Do you want it to be used for RawTable as well? I.e. change RawTable::capacity() to RawTable::internal_capacity(), and other corresponding changes?

@arthurprs
Copy link
Contributor

I don't think we need any changes to RawTable. What do you think?

@nnethercote
Copy link
Contributor Author

I'm considering raw_capacity instead of internal_capacity. It's shorter, and ties in nicely with RawTable. Though I'm still unsure if we should change RawTable::capacity() to RawTable::raw_capacity() for consistency.

@arthurprs
Copy link
Contributor

+1 for raw_capacity.

I don't consistency is in play there. RawTable::capacity is fine conceptually and it also shouldn't be aware how it's called elsewhere.

@nnethercote
Copy link
Contributor Author

@arthurps: I have updated the commit to use "capacity" and "raw capacity". I left RawTable unchanged.

Copy link
Contributor

@arthurprs arthurprs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only one optional nit.

This also makes reserve(0) a noop on an empty HashMap, which I think is good.

@@ -34,13 +34,9 @@ use super::table::BucketState::{
Full,
};

const INITIAL_LOG2_CAP: usize = 5;
const INITIAL_CAPACITY: usize = 1 << INITIAL_LOG2_CAP; // 2^5
const MIN_NONZERO_RAW_CAPACITY: usize = 32;
Copy link
Contributor

@arthurprs arthurprs Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should leave something to remind that it must be a power of 2? Like the // 2^5 before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the commit to add a comment saying it must be a power of two.

This commit does the following.

- Changes the terminology for capacities used within HashMap's code.
  "Internal capacity" is now consistently "raw capacity", and "usable
  capacity" is now consistently just "capacity". This makes the code
  easier to understand.

- Reworks capacity and raw capacity computations. Raw capacity
  computations are now handled in a single place:
  `DefaultResizePolicy::raw_capacity()`. This function correctly returns
  zero when given zero, which means that the following cases now result
  in a capacity of zero when they previously did not.

  * `Hash{Map,Set}::with_capacity(0)`
  * `Hash{Map,Set}::with_capacity_and_hasher(0)`
  * `Hash{Map,Set}::shrink_to_fit()`, when used with a hash map/set whose
    elements have all been removed

- Strengthens the language used in the comments describing the above
  functions, to make it clearer when they will result in a map/set with
  a capacity of zero. The new language is based on the language used for
  the corresponding functions in `Vec`.

- Adds tests for the above zero-capacity cases.

- Removes `test_resize_policy` because it is no longer useful.
// 2. Ensure it is a power of two.
// 3. Ensure it is at least the minimum size.
let mut raw_cap = len * 11 / 10;
assert!(raw_cap >= len, "raw_cap overflow");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion is a once-off (on hashmap creation from capacity), isn't it? So it should have no major impact on performance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to double check if this adds more work to the pub fn reserve, it's called before every insert.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it? I was just looking at that. I don't think it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only called if capacity is insufficient, so in resizes from insert. That's ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah the PR actually already changes reserve to do less work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is called from with_capacity_and_header, reserve, and shrink_to_fit. Prior to this PR, those functions had an overflow assertion. (shrink_to_fits's was a debug_assert!, the others were an assert!.)

In the new code, the assertion has moved out of those functions, into DefaultResizePolicy::raw_capacity. So it shouldn't have any effect.

@@ -667,28 +667,23 @@ impl<K, V, S> HashMap<K, V, S>
/// ```
#[stable(feature = "rust1", since = "1.0.0")]
pub fn reserve(&mut self, additional: usize) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, reserve should have a fast path that returns quickly if the additional capacity is already present. This fast path should not have branches to panic. I'm not sure how big a difference it would make, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about

let remaining = self.capacity() - self.len(); # this can't overflow
if remaining < additional {
....
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nnethercote would you like to do this in the PR or want to wrap it up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do want to wrap it up, but might as well do it properly. I'll take a look at this on Monday.

@nnethercote
Copy link
Contributor Author

r? arthurps for the additional commit that avoids the overflow check on reserve's fast path.

Copy link
Contributor

@arthurprs arthurprs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nnethercote
Copy link
Contributor Author

Thank you for the review. Do we need to tell bors?

@arthurprs
Copy link
Contributor

I think it's reserved for the organization members.

@bluss
Copy link
Member

bluss commented Oct 3, 2016

@bors r+

Thank you @nnethercote and @arthurprs

@bors
Copy link
Contributor

bors commented Oct 3, 2016

📌 Commit 607d297 has been approved by bluss

@bors
Copy link
Contributor

bors commented Oct 3, 2016

⌛ Testing commit 607d297 with merge 75df685...

bors added a commit that referenced this pull request Oct 3, 2016
Clarify HashMap's capacity handling.

HashMap has two notions of "capacity":

- "Usable capacity": the number of elements a hash map can hold without
  resizing. This is the meaning of "capacity" used in HashMap's API,
  e.g. the `with_capacity()` function.

- "Internal capacity": the number of allocated slots. Except for the
  zero case, it is always larger than the usable capacity (because some
  slots must be left empty) and is always a power of two.

HashMap's code is confusing because it does a poor job of
distinguishing these two meanings. I propose using two different terms
for these two concepts. Because "capacity" is already used in HashMap's
API to mean "usable capacity", I will use a different word for "internal
capacity". I propose "span", though I'm happy to consider other names.
@bors bors merged commit 607d297 into rust-lang:master Oct 3, 2016
@nnethercote nnethercote deleted the hash-span-capacity branch October 3, 2016 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants