-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CollisionProofHashMap, a mutable HashMap that degrades to red-black trees in the worst case #7633
Conversation
This comment has been minimized.
This comment has been minimized.
@szeiger is this ready for merge? do you want more review? |
a483afe
to
b149414
Compare
It is ready to merge but it hasn't been approved yet. I just rebased it. The old version would not have compiled after merging. I also investigated adding a special factory type for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few questions/suggestions for improvement but this looks good overall. I think it's a really important addition to the library for code deployed in a hostile environment. I especially like that this implements a principled, knowably-reasonable-worst-case-performance map instead of the Java solution of hoping for the best and disaster at runtime if you aren't sortable. 👍
* @define mayNotTerminateInf | ||
* @define willNotTerminateInf | ||
*/ | ||
final class OrderAwareHashMap[K, V](initialCapacity: Int, loadFactor: Double)(implicit ordering: Ordering[K]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be named after what it's good for, like CollisionProofMap
or RobustMap
. You want people who should be using it to pick it out as interesting when glancing through the collections names; I'm not sure "OrderAwareHashMap" immediately makes one think, "Oh, this is good for user-facing code where bad actors might use hash collisions in a DOS attack".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it also makes sense to put some info in the docs of mutable HashMap to say something like.
This map is open to hash collision DOS attacks. For a more robust HashMap implementation use @see [Name]
I imagine that most people will naturally go to HashMap first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not happy with OrderAwareHashMap
, either. Concrete implementations are usually named after the data structures, whereas abstract traits are named after the properties of the interface. Something like TreeAssistedHashMap
would be better in my opinion. OrderAwareMap
could be the name of a common interface for multiple implementations.
/** The next size value at which to resize (capacity * load factor). */ | ||
private[this] var threshold: Int = newThreshold(table.length) | ||
|
||
private[this] var contentSize = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is enough; in the case where there are a lot of elements in very few bins, splitting based on the content size alone isn't likely to be a winning strategy. Instead, knowing the number of free slots is probably a good thing also (or maybe the only thing to know). Or maybe we want to know the number of detected hash collisions that can't be resolved by adding more bins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The strategy for expanding the hash table is the same as in the standard HashMap
where it performs quite well in the benchmarks with the badly distributed hashes. It does mean that hash tables with random bad distributions can become unnecessarily large in both implementations but it's only a linear factor. In a deliberate collision attack you rely on far-worse-than-linear performance degradation from actual hash collisions (not just bucket collisions).
} | ||
*/ | ||
|
||
///////////////////// RedBlackTree code derived from mutable.RedBlackTree: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of duplicate code. How big of a difference does it make to have this duplicated rather than simply reusing the hopefully-already-optimized RedBlackTree
code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are substantial differences between the two implementations. RedBlackTree
has a separate Tree
object at the root which gets updates when the root node changes and which also keeps track of the size. The implementation here removes this extra Tree
object and either returns the root node from operations (in cases where they don't have to return anything else) or updates the hash table directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken another look to see if we could generalize the implementation here. Removing the common Node
superclass doesn't appear to make difference. I wanted to benchmark the impact of abstracting over an interface to get and set the root node but then I realized that RBNode
also contains the hash of the key. We'd need a lot of key + hash tuples and/or subclasses of RBNode
to abstract over this, both of which has a high impact on performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point, and I guess rehashing is also a bad idea. I just worry that RB tree code is nontrivial enough, and here is in a place hard enough to test, that we might be vulnerable to bugs. But I guess a common implementation has to wait for Valhalla so the tuples are approximately zero-cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You wouldn't do any rehashing unless you convert back from tree to linked list (which we don't support anyway). The tree stores the hashes because they are used for ordering in the tree before falling back to the "real" Ordering
. We can do this because we don't actually need the tree to be sensibly ordered and comparing hashes is usually much faster than comparing the actual keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the benchmarks the difference from rehashing is quite small but that's to be expected since all keys cache their hash codes anyway. For keys with expensive hash calculations this could lead to surprising performance issues because the regular LLNodes cache the hash codes and so do all of our other hash map/set implementations, and Java's as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some preliminary refactoring that would allow abstracting over different nodes with a generic red-black-tree implementation. The performance penalty is around 10 to 20% which is quite significant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"A generic red-black-tree implementation" meaning the one we have in the standard library, or yet another one? Anyway, I agree that the slowdown sounds significant (and thanks for testing!). Then again, if it only hits the hopefully rare heavily-collided case, maybe the slowdown is acceptable if the implementation is simplified sufficiently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation was based on the one here but allows for different node types with or without hashes and values. You would get the same performance degradation if you used it for TreeMap and TreeSet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess to my mind the only advantage comes if we literally use scala.collection.mutable.RedBlackTree
for the impacted nodes here. Anything else involves significant code duplication; if you're already duplicating code, may as well make it ideal for the application (as you already did).
c647c4c
to
b36c246
Compare
The implementation is based on my new HashMap. Hash buckets use linked lists ordered by hashcode by default. When a bucket gets too big due to hash collisions it is converted into a red-black tree based on an implicit Ordering for the keys. This is similar to java.util.HashMap except we are using an Ordering instead of requiring the key type to implement Comparable. Equality semantics are determined by the Ordering, which has to be consistent with the key’s hashCode.
b36c246
to
2a167ee
Compare
Bikeshedding: maybe call it |
@Ichoran Relative to the old name or the new one? Did you notice that I renamed it to |
Relative to the new one, which definitely gets the idea across but is very long. I was hoping to find a name that succinctly attracts attention to the docs rather than putting the doc string as the class name. |
The current name was the preferred one at the last team meeting, so I'm merging this. If anyone wants to rename it, please open a PR. |
gentlemen, start your thesauruses |
Here's a first working version of a mutable
OrderAwareHashMap
. It could use some more polishing and there may be room for additional performance optimizations.There's also a limitation in the design at the moment: We have no abstraction to represent a
Map
that requires anOrdering
other thanSortedMap
, but this is not aSortedMap
. We need something likeSortedMap
with all the same methods (that take an extraOrdering
) and thenSortedMap
can be a subclass of it. It's not terribly important for this mutable version because there is no need to abstract over it (since it's the only one of its kind) and it doesn't pose a problem for the mutating methods, but an immutable version will require this additional abstraction.The implementation is based on my new HashMap. Hash buckets use linked
lists ordered by hashcode by default. When a bucket gets too big due to
hash collisions it is converted into a red-black tree based on an
implicit Ordering for the keys. This is similar to java.util.HashMap
except we are using an Ordering instead of requiring the key type to
implement Comparable.
Equality semantics are determined by the Ordering, which has to be
consistent with the key’s hashCode.
Here are some benchmark results comparing the new
scala.collection.mutable.HashMap
("hm"),java.util.HashMap
("java") andOrderAwareHashMap
("oa"):