Clarify HashMap's capacity handling. #36766

Merged
merged 2 commits into rust-lang:master from nnethercote:hash-span-capacity on Oct 3, 2016

7 participants
@nnethercote
Contributor

nnethercote commented Sep 27, 2016

HashMap has two notions of "capacity":

  • "Usable capacity": the number of elements a hash map can hold without
    resizing. This is the meaning of "capacity" used in HashMap's API,
    e.g. the with_capacity() function.
  • "Internal capacity": the number of allocated slots. Except for the
    zero case, it is always larger than the usable capacity (because some
    slots must be left empty) and is always a power of two.

HashMap's code is confusing because it does a poor job of
distinguishing these two meanings. I propose using two different terms
for these two concepts. Because "capacity" is already used in HashMap's
API to mean "usable capacity", I will use a different word for "internal
capacity". I propose "span", though I'm happy to consider other names.
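As a rough illustration of the "usable capacity" meaning, here is a minimal sketch that uses only the public `std::collections::HashMap` API. The helper name `fill_without_growing` is hypothetical and not part of the PR. The internal slot count is not observable through the public API, so only the usable capacity appears here.

```rust
use std::collections::HashMap;

/// Hypothetical helper: builds a map with `with_capacity(n)`, inserts `n`
/// elements, and returns the capacity reported before and after the inserts.
fn fill_without_growing(n: usize) -> (usize, usize) {
    let mut map: HashMap<usize, usize> = HashMap::with_capacity(n);
    let before = map.capacity();
    for i in 0..n {
        map.insert(i, i * 2);
    }
    (before, map.capacity())
}

fn main() {
    let (before, after) = fill_without_growing(100);
    // "Usable capacity": at least the requested number of elements fit...
    assert!(before >= 100);
    // ...and inserting that many elements does not force a resize.
    assert_eq!(before, after);
    println!("capacity before: {}, after: {}", before, after);
}
```

The reported capacity may be larger than the requested one, since the number of allocated slots is rounded up internally.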

@rust-highfive

Collaborator

rust-highfive commented Sep 27, 2016

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@shepmaster

Member

shepmaster commented Sep 27, 2016

> I propose "span",

I'm not in the compiler code much, but IIRC "span" is used during parsing to indicate a section of text (such as to report for an error). That particular usage of the word is fairly far away from this particular usage though, so there may not be a conflict in practice.

@aturon

Member

aturon commented Sep 27, 2016

@aturon aturon removed their assignment Sep 27, 2016

@arthurprs

Contributor

arthurprs commented Sep 28, 2016

The PR has merits, it's a bit confusing whether capacity is referring to the hashmap capacity (the exposed one) or the rawtable capacity.

But the proposed naming is a bit odd to me, maybe adding some table_/internal_ prefixes around is enough.

@nnethercote

Contributor

nnethercote commented Sep 28, 2016

> The PR has merits, it's a bit confusing whether capacity is referring to the hashmap capacity (the exposed one) or the rawtable capacity.
>
> But the proposed naming is a bit odd to me, maybe adding some table_/internal_ prefixes around is enough.

It's true that the meaning of RawTable::capacity() is different to the meaning of HashMap::capacity(). That is pre-existing, and I haven't changed it with this PR, because I didn't want to change either type's API. What I have done in this PR, however, is introduce HashMap::span(), which is now the one function in the entire file that calls RawTable::capacity(), and is therefore the single place in map.rs in which this difference in meaning is exposed. (I could add a comment to span explaining this.) This is a big improvement over the current code, in which the three meanings of "capacity" (HashMap's two meanings, and RawTable's one meaning) are mixed with abandon.

If you are still unhappy, I could change all uses of "capacity" within RawTable to "span" as well. That would certainly clarify code like this:

    /// The hashtable's capacity, similar to a vector's.
    pub fn capacity(&self) -> usize {
        self.capacity
    }

As for adding prefixes, I think that's less effective. The use of an unprefixed "capacity" is baked into HashMap's API. If we add prefixes to the uses of "capacity" that aren't in the API names, that still leaves us with a mixture of prefixed and unprefixed uses, which leaves the code less clear than this PR does.

Finally, I assume this is the review for my patch? AFAICT it's an r-, but the suggestions on how to change things to move towards an r+ are quite vague. When I r- patches for Firefox I always do my best to make it clear that (a) it is an r-, and (b) what changes are required to move towards r+, or if the patch is irredeemably flawed, make that fact clear.

@arthurprs

Contributor

arthurprs commented Sep 28, 2016

That wasn't a formal review, just a comment trying to create some discussion around the problem at hand.

I think we should preferably avoid introducing another concept (span), even if it's a simple one.
It isn't immediately clear and one would need to look into the span() function to understand its relation to the rawtable capacity. capacity is a de facto name so replacing it in the table adds yet more friction.

I propose internal_capacity (or a variation) instead, as it has a better chance of being understood upfront. It's a trade-off of course, between disambiguating and not introducing another concept.
The name is already used in the codebase, but not consistently (probably a result of the various refactors throughout the years).

If you don't agree we can always wait for and weigh more opinions.

@nnethercote

Contributor

nnethercote commented Sep 28, 2016

> I propose internal_capacity

I can live with that. Do you want it to be used for RawTable as well? I.e. change RawTable::capacity() to RawTable::internal_capacity(), and other corresponding changes?

@arthurprs

Contributor

arthurprs commented Sep 28, 2016

I don't think we need any changes to RawTable. What do you think?

@nnethercote

Contributor

nnethercote commented Sep 28, 2016

I'm considering raw_capacity instead of internal_capacity. It's shorter, and ties in nicely with RawTable. Though I'm still unsure if we should change RawTable::capacity() to RawTable::raw_capacity() for consistency.

@arthurprs

Contributor

arthurprs commented Sep 28, 2016

+1 for raw_capacity.

I don't think consistency is in play there. RawTable::capacity is fine conceptually and it also shouldn't be aware of how it's called elsewhere.

@nnethercote nnethercote force-pushed the nnethercote:hash-span-capacity branch from e55bb04 to f6f66aa Sep 29, 2016

@nnethercote

Contributor

nnethercote commented Sep 29, 2016

@arthurprs: I have updated the commit to use "capacity" and "raw capacity". I left RawTable unchanged.

@arthurprs

LGTM, only one optional nit.

This also makes reserve(0) a noop on an empty HashMap, which I think is good.

@@ -34,13 +34,9 @@ use super::table::BucketState::{
     Full,
 };
-const INITIAL_LOG2_CAP: usize = 5;
-const INITIAL_CAPACITY: usize = 1 << INITIAL_LOG2_CAP; // 2^5
+const MIN_NONZERO_RAW_CAPACITY: usize = 32;

@arthurprs

arthurprs Sep 29, 2016

Contributor

Maybe we should leave something to remind that it must be a power of 2? Like the // 2^5 before

@nnethercote

nnethercote Sep 29, 2016

Contributor

I've updated the commit to add a comment saying it must be a power of two.

Clarify HashMap's capacity handling.
This commit does the following.

- Changes the terminology for capacities used within HashMap's code.
  "Internal capacity" is now consistently "raw capacity", and "usable
  capacity" is now consistently just "capacity". This makes the code
  easier to understand.

- Reworks capacity and raw capacity computations. Raw capacity
  computations are now handled in a single place:
  `DefaultResizePolicy::raw_capacity()`. This function correctly returns
  zero when given zero, which means that the following cases now result
  in a capacity of zero when they previously did not.

  * `Hash{Map,Set}::with_capacity(0)`
  * `Hash{Map,Set}::with_capacity_and_hasher(0)`
  * `Hash{Map,Set}::shrink_to_fit()`, when used with a hash map/set whose
    elements have all been removed

- Strengthens the language used in the comments describing the above
  functions, to make it clearer when they will result in a map/set with
  a capacity of zero. The new language is based on the language used for
  the corresponding functions in `Vec`.

- Adds tests for the above zero-capacity cases.

- Removes `test_resize_policy` because it is no longer useful.
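The zero-capacity cases listed in this commit message can be checked through the public API. The following is a minimal sketch, assuming a standard library that includes this change; the helper name `capacity_after_emptying_and_shrinking` is hypothetical. Note that the real `with_capacity_and_hasher` takes a hasher as a second argument, which the commit message omits for brevity. The `reserve(0)` check reflects the review remark that `reserve(0)` becomes a no-op on an empty map.

```rust
use std::collections::hash_map::RandomState;
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: capacity of a map after every element has been
/// removed and `shrink_to_fit()` has been called.
fn capacity_after_emptying_and_shrinking() -> usize {
    let mut map: HashMap<i32, i32> = HashMap::new();
    map.insert(1, 1);
    map.insert(2, 2);
    map.remove(&1);
    map.remove(&2);
    map.shrink_to_fit();
    map.capacity()
}

fn main() {
    // `with_capacity(0)` yields a capacity of zero for maps and sets.
    let map: HashMap<i32, i32> = HashMap::with_capacity(0);
    assert_eq!(map.capacity(), 0);
    let set: HashSet<i32> = HashSet::with_capacity(0);
    assert_eq!(set.capacity(), 0);

    // Same for the variant that also takes a hasher.
    let hashed: HashMap<i32, i32, RandomState> =
        HashMap::with_capacity_and_hasher(0, RandomState::new());
    assert_eq!(hashed.capacity(), 0);

    // Shrinking an emptied map brings the capacity back to zero.
    assert_eq!(capacity_after_emptying_and_shrinking(), 0);

    // `reserve(0)` on an empty map does not allocate.
    let mut empty: HashMap<i32, i32> = HashMap::new();
    empty.reserve(0);
    assert_eq!(empty.capacity(), 0);
}
```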

@nnethercote nnethercote force-pushed the nnethercote:hash-span-capacity branch from f6f66aa to 6a9b5e4 Sep 29, 2016

// 2. Ensure it is a power of two.
// 3. Ensure it is at least the minimum size.
let mut raw_cap = len * 11 / 10;
assert!(raw_cap >= len, "raw_cap overflow");

@bluss

bluss Sep 30, 2016

Contributor

This assertion is a once-off (on hashmap creation from capacity), isn't it? So it should have no major impact on performance?

@arthurprs

arthurprs Sep 30, 2016

Contributor

I think we need to double check if this adds more work to the pub fn reserve, it's called before every insert.

@bluss

bluss Sep 30, 2016

Contributor

Is it? I was just looking at that. I don't think it is.

@bluss

bluss Sep 30, 2016

Contributor

It's only called if capacity is insufficient, so in resizes from insert. That's ok.

@arthurprs

arthurprs Sep 30, 2016

Contributor

Yeah you're right.

@bluss

bluss Sep 30, 2016

Contributor

Ah the PR actually already changes reserve to do less work

@nnethercote

nnethercote Oct 3, 2016

Contributor

This function is called from with_capacity_and_hasher, reserve, and shrink_to_fit. Prior to this PR, those functions had an overflow assertion. (shrink_to_fit's was a debug_assert!, the others were an assert!.)

In the new code, the assertion has moved out of those functions, into DefaultResizePolicy::raw_capacity. So it shouldn't have any effect.
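For readers following along, the raw-capacity computation discussed in this thread can be sketched as a free function. This is a reconstruction from the quoted fragment (`len * 11 / 10`, the overflow assertion, steps 2 and 3) and the commit message, not the actual `DefaultResizePolicy::raw_capacity` source. The zero branch, the wording of step 1, and the use of `checked_next_power_of_two` are assumptions.

```rust
use std::cmp::max;

/// Smallest nonzero raw capacity; must be a power of two.
/// (Value taken from the diff quoted earlier in this review.)
const MIN_NONZERO_RAW_CAPACITY: usize = 32;

/// Sketch: number of slots needed to hold `len` elements.
/// Returns zero for zero, otherwise a power of two.
fn raw_capacity(len: usize) -> usize {
    if len == 0 {
        // Assumed structure: the commit message says zero maps to zero.
        return 0;
    }
    // 1. Account for loading: ask for about 10% more slots than elements
    //    (assumed wording; only steps 2 and 3 are quoted in the review).
    let raw_cap = len * 11 / 10;
    assert!(raw_cap >= len, "raw_cap overflow");
    // 2. Ensure it is a power of two.
    let raw_cap = raw_cap
        .checked_next_power_of_two()
        .expect("raw_capacity overflow");
    // 3. Ensure it is at least the minimum size.
    max(MIN_NONZERO_RAW_CAPACITY, raw_cap)
}

fn main() {
    for len in [0usize, 1, 29, 30, 100] {
        println!("len = {:3} -> raw capacity = {}", len, raw_capacity(len));
    }
}
```

Because the assertion lives in this one function, callers such as `reserve` only pay for it when they actually need to compute a new raw capacity.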

@@ -667,28 +667,23 @@ impl<K, V, S> HashMap<K, V, S>
/// ```
#[stable(feature = "rust1", since = "1.0.0")]
pub fn reserve(&mut self, additional: usize) {

@bluss

bluss Sep 30, 2016

Contributor

Ideally, reserve should have a fast path that returns quickly if the additional capacity is already present. This fast path should not have branches to panic. I'm not sure how big a difference it would make, though.

@arthurprs

arthurprs Sep 30, 2016

Contributor

What about

    let remaining = self.capacity() - self.len(); // this can't underflow
    if remaining < additional {
        ....
    }
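The suggested fast path can be sketched on a toy type. `ToyMap`, its fields, and its growth policy are hypothetical stand-ins and not the real `HashMap` internals. The point is only that the common case is a subtraction and a comparison, while the overflow check (`checked_add` plus `expect`) is paid only on the slow path.

```rust
/// Hypothetical stand-in for a map; tracks only element count and capacity.
struct ToyMap {
    len: usize,
    capacity: usize,
    resizes: usize, // counts how often the slow path ran
}

impl ToyMap {
    fn new(capacity: usize) -> Self {
        ToyMap { len: 0, capacity, resizes: 0 }
    }

    fn reserve(&mut self, additional: usize) {
        // Fast path: no `assert!` or `expect` here. The subtraction cannot
        // underflow as long as the invariant `len <= capacity` holds.
        let remaining = self.capacity - self.len;
        if remaining < additional {
            // Slow path: the overflow check happens only when growing.
            let min_cap = self
                .len
                .checked_add(additional)
                .expect("reserve overflow");
            self.capacity = min_cap; // toy policy: grow to exactly what is needed
            self.resizes += 1;
        }
    }

    /// Models an insert: make room for one element, then count it.
    fn push(&mut self) {
        self.reserve(1);
        self.len += 1;
    }
}

fn main() {
    let mut map = ToyMap::new(4);
    for _ in 0..4 {
        map.push(); // stays on the fast path
    }
    assert_eq!(map.resizes, 0);
    map.push(); // capacity exhausted, slow path runs once
    assert_eq!(map.resizes, 1);
    map.reserve(0); // nothing to do
    assert_eq!(map.resizes, 1);
}
```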

@arthurprs

arthurprs Sep 30, 2016

Contributor

@nnethercote would you like to do this in the PR or want to wrap it up?

@nnethercote

nnethercote Sep 30, 2016

Contributor

I do want to wrap it up, but might as well do it properly. I'll take a look at this on Monday.

@nnethercote

Contributor

nnethercote commented Oct 3, 2016

r? @arthurprs for the additional commit that avoids the overflow check on reserve's fast path.

@arthurprs

LGTM

@nnethercote

Contributor

nnethercote commented Oct 3, 2016

Thank you for the review. Do we need to tell bors?

@arthurprs

Contributor

arthurprs commented Oct 3, 2016

I think it's reserved for the organization members.

@bluss

Contributor

bluss commented Oct 3, 2016

@bors r+

Thank you @nnethercote and @arthurprs

@bors

Contributor

bors commented Oct 3, 2016

📌 Commit 607d297 has been approved by bluss

@bors

Contributor

bors commented Oct 3, 2016

⌛️ Testing commit 607d297 with merge 75df685...

bors added a commit that referenced this pull request Oct 3, 2016

Auto merge of #36766 - nnethercote:hash-span-capacity, r=bluss
Clarify HashMap's capacity handling.

@bors bors merged commit 607d297 into rust-lang:master Oct 3, 2016

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
homu Test successful
Details

@nnethercote nnethercote deleted the nnethercote:hash-span-capacity branch Oct 3, 2016
