Add CollisionProofHashMap, a mutable HashMap that degrades to red-black trees in the worst case #7633

Merged
merged 1 commit into from
Mar 12, 2019

@szeiger szeiger commented Jan 11, 2019

Here's a first working version of a mutable OrderAwareHashMap. It could use some more polishing and there may be room for additional performance optimizations.

There's also a limitation in the design at the moment: We have no abstraction to represent a Map that requires an Ordering other than SortedMap, but this is not a SortedMap. We need something like SortedMap with all the same methods (that take an extra Ordering) and then SortedMap can be a subclass of it. It's not terribly important for this mutable version because there is no need to abstract over it (since it's the only one of its kind) and it doesn't pose a problem for the mutating methods, but an immutable version will require this additional abstraction.


The implementation is based on my new HashMap. Hash buckets use linked
lists ordered by hashcode by default. When a bucket gets too big due to
hash collisions it is converted into a red-black tree based on an
implicit Ordering for the keys. This is similar to java.util.HashMap
except we are using an Ordering instead of requiring the key type to
implement Comparable.

Equality semantics are determined by the Ordering, which has to be
consistent with the key’s hashCode.
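
A minimal usage sketch of the behaviour described above, assuming the class under its merged name `scala.collection.mutable.CollisionProofHashMap` (Scala 2.13). The `BadKey` type, its constant `hashCode` and the `fill` helper are hypothetical and exist only to force every key into the same bucket; they are not part of this PR.

```scala
import scala.collection.mutable

object CollidingKeysDemo {
  // Hypothetical key type: every instance has the same hash code,
  // so all entries land in a single bucket.
  final case class BadKey(id: Int) {
    override def hashCode: Int = 42
  }

  // The Ordering must agree with equals/hashCode: keys that compare as
  // equal here are equal case-class instances and share the hash 42.
  implicit val badKeyOrdering: Ordering[BadKey] = Ordering.by[BadKey, Int](_.id)

  // Fill a map with n fully colliding keys.
  def fill(n: Int): mutable.CollisionProofHashMap[BadKey, Int] = {
    val m = mutable.CollisionProofHashMap.empty[BadKey, Int]
    var i = 0
    while (i < n) {
      m.put(BadKey(i), i)
      i += 1
    }
    m
  }
}
```

With a plain `mutable.HashMap` the same fill walks one ever-growing bucket list, which is the quadratic behaviour visible in the `hmFillColliding` numbers below.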


Here are some benchmark results comparing the new scala.collection.mutable.HashMap ("hm"), java.util.HashMap ("java") and OrderAwareHashMap ("oa"):

[info] Benchmark                                      (size)  Mode  Cnt         Score        Error  Units
[info] OrderAwareHashMapBenchmark.hmBuild                100  avgt   20      1278.382 ±      8.998  ns/op
[info] OrderAwareHashMapBenchmark.hmBuild               1000  avgt   20     14432.822 ±    132.735  ns/op
[info] OrderAwareHashMapBenchmark.hmBuild              10000  avgt   20    151727.174 ±   1066.466  ns/op
[info] OrderAwareHashMapBenchmark.hmFillColliding        100  avgt   20      7658.501 ±     77.745  ns/op
[info] OrderAwareHashMapBenchmark.hmFillColliding       1000  avgt   20    529964.492 ±   4271.993  ns/op
[info] OrderAwareHashMapBenchmark.hmFillColliding      10000  avgt   20  28287147.289 ± 668258.012  ns/op
[info] OrderAwareHashMapBenchmark.hmFillRegular          100  avgt   20      2203.677 ±     22.506  ns/op
[info] OrderAwareHashMapBenchmark.hmFillRegular         1000  avgt   20     20344.985 ±    294.703  ns/op
[info] OrderAwareHashMapBenchmark.hmFillRegular        10000  avgt   20    232439.374 ±   6201.773  ns/op
[info] OrderAwareHashMapBenchmark.hmGetExisting          100  avgt   20       781.545 ±      6.442  ns/op
[info] OrderAwareHashMapBenchmark.hmGetExisting         1000  avgt   20      8905.764 ±     71.248  ns/op
[info] OrderAwareHashMapBenchmark.hmGetExisting        10000  avgt   20    104811.034 ±   5294.304  ns/op
[info] OrderAwareHashMapBenchmark.hmGetNone              100  avgt   20       726.723 ±     52.499  ns/op
[info] OrderAwareHashMapBenchmark.hmGetNone             1000  avgt   20      8283.629 ±    224.081  ns/op
[info] OrderAwareHashMapBenchmark.hmGetNone            10000  avgt   20    122303.024 ±   4705.909  ns/op
[info] OrderAwareHashMapBenchmark.hmIterateEntries       100  avgt   20      1052.933 ±     38.532  ns/op
[info] OrderAwareHashMapBenchmark.hmIterateEntries      1000  avgt   20     11403.256 ±    134.154  ns/op
[info] OrderAwareHashMapBenchmark.hmIterateEntries     10000  avgt   20    117159.529 ±   2443.455  ns/op
[info] OrderAwareHashMapBenchmark.hmIterateKeys          100  avgt   20       732.929 ±      9.441  ns/op
[info] OrderAwareHashMapBenchmark.hmIterateKeys         1000  avgt   20      8651.052 ±    142.330  ns/op
[info] OrderAwareHashMapBenchmark.hmIterateKeys        10000  avgt   20     87809.306 ±    836.859  ns/op
[info] OrderAwareHashMapBenchmark.javaBuild              100  avgt   20      1064.255 ±     23.366  ns/op
[info] OrderAwareHashMapBenchmark.javaBuild             1000  avgt   20     11196.882 ±     58.611  ns/op
[info] OrderAwareHashMapBenchmark.javaBuild            10000  avgt   20    143506.719 ±  19448.291  ns/op
[info] OrderAwareHashMapBenchmark.javaFillColliding      100  avgt   20     12671.089 ±    244.150  ns/op
[info] OrderAwareHashMapBenchmark.javaFillColliding     1000  avgt   20    254598.450 ±   2771.674  ns/op
[info] OrderAwareHashMapBenchmark.javaFillColliding    10000  avgt   20   4147371.899 ±  95794.889  ns/op
[info] OrderAwareHashMapBenchmark.javaFillDoS            100  avgt   20     18455.807 ±   1458.840  ns/op
[info] OrderAwareHashMapBenchmark.javaFillDoS           1000  avgt   20    317349.231 ±   5974.337  ns/op
[info] OrderAwareHashMapBenchmark.javaFillDoS          10000  avgt   20   3931170.230 ± 291880.463  ns/op
[info] OrderAwareHashMapBenchmark.javaFillRegular        100  avgt   20      1895.835 ±     12.992  ns/op
[info] OrderAwareHashMapBenchmark.javaFillRegular       1000  avgt   20     19282.599 ±    199.564  ns/op
[info] OrderAwareHashMapBenchmark.javaFillRegular      10000  avgt   20    254465.350 ±   2953.456  ns/op
[info] OrderAwareHashMapBenchmark.javaGetExisting        100  avgt   20       933.256 ±     10.986  ns/op
[info] OrderAwareHashMapBenchmark.javaGetExisting       1000  avgt   20     10032.797 ±     95.663  ns/op
[info] OrderAwareHashMapBenchmark.javaGetExisting      10000  avgt   20    116287.184 ±   1580.840  ns/op
[info] OrderAwareHashMapBenchmark.javaGetNone            100  avgt   20       748.969 ±     12.333  ns/op
[info] OrderAwareHashMapBenchmark.javaGetNone           1000  avgt   20      8660.554 ±    115.736  ns/op
[info] OrderAwareHashMapBenchmark.javaGetNone          10000  avgt   20    124993.543 ±   4842.760  ns/op
[info] OrderAwareHashMapBenchmark.javaIterateEntries     100  avgt   20       646.117 ±      9.328  ns/op
[info] OrderAwareHashMapBenchmark.javaIterateEntries    1000  avgt   20      7053.389 ±     49.033  ns/op
[info] OrderAwareHashMapBenchmark.javaIterateEntries   10000  avgt   20     91628.388 ±   1493.768  ns/op
[info] OrderAwareHashMapBenchmark.javaIterateKeys        100  avgt   20       660.341 ±      9.498  ns/op
[info] OrderAwareHashMapBenchmark.javaIterateKeys       1000  avgt   20      7221.019 ±    120.893  ns/op
[info] OrderAwareHashMapBenchmark.javaIterateKeys      10000  avgt   20     88045.309 ±   1060.025  ns/op
[info] OrderAwareHashMapBenchmark.oaBuild                100  avgt   20      1478.120 ±     11.852  ns/op
[info] OrderAwareHashMapBenchmark.oaBuild               1000  avgt   20     14882.973 ±    128.385  ns/op
[info] OrderAwareHashMapBenchmark.oaBuild              10000  avgt   20    162302.955 ±   1962.691  ns/op
[info] OrderAwareHashMapBenchmark.oaFillColliding        100  avgt   20      8687.708 ±     68.726  ns/op
[info] OrderAwareHashMapBenchmark.oaFillColliding       1000  avgt   20    194464.736 ±   2540.554  ns/op
[info] OrderAwareHashMapBenchmark.oaFillColliding      10000  avgt   20   2713781.773 ±  37586.106  ns/op
[info] OrderAwareHashMapBenchmark.oaFillDoS              100  avgt   20     12770.699 ±    123.557  ns/op
[info] OrderAwareHashMapBenchmark.oaFillDoS             1000  avgt   20    247698.804 ±   4469.197  ns/op
[info] OrderAwareHashMapBenchmark.oaFillDoS            10000  avgt   20   3452773.773 ±  68316.753  ns/op
[info] OrderAwareHashMapBenchmark.oaFillRegular          100  avgt   20      3511.814 ±     40.394  ns/op
[info] OrderAwareHashMapBenchmark.oaFillRegular         1000  avgt   20     30937.939 ±    341.542  ns/op
[info] OrderAwareHashMapBenchmark.oaFillRegular        10000  avgt   20    303128.611 ±   3761.797  ns/op
[info] OrderAwareHashMapBenchmark.oaGetExisting          100  avgt   20       901.374 ±     56.706  ns/op
[info] OrderAwareHashMapBenchmark.oaGetExisting         1000  avgt   20      9397.086 ±    169.353  ns/op
[info] OrderAwareHashMapBenchmark.oaGetExisting        10000  avgt   20    101547.858 ±   1534.589  ns/op
[info] OrderAwareHashMapBenchmark.oaGetNone              100  avgt   20       725.748 ±     22.510  ns/op
[info] OrderAwareHashMapBenchmark.oaGetNone             1000  avgt   20      8458.053 ±    542.173  ns/op
[info] OrderAwareHashMapBenchmark.oaGetNone            10000  avgt   20    119077.530 ±   2957.570  ns/op
[info] OrderAwareHashMapBenchmark.oaIterateEntries       100  avgt   20      1267.887 ±     15.472  ns/op
[info] OrderAwareHashMapBenchmark.oaIterateEntries      1000  avgt   20     12647.641 ±    194.028  ns/op
[info] OrderAwareHashMapBenchmark.oaIterateEntries     10000  avgt   20    146443.705 ±   8787.324  ns/op
[info] OrderAwareHashMapBenchmark.oaIterateKeys          100  avgt   20       765.175 ±     17.808  ns/op
[info] OrderAwareHashMapBenchmark.oaIterateKeys         1000  avgt   20      7667.263 ±    100.420  ns/op
[info] OrderAwareHashMapBenchmark.oaIterateKeys        10000  avgt   20     90741.446 ±   1834.352  ns/op

@scala-jenkins scala-jenkins added this to the 2.13.0-RC1 milestone Jan 11, 2019
@diesalbla diesalbla added the library:collections PRs involving changes to the standard collection library label Jan 15, 2019
@SethTisue
Member

@szeiger is this ready for merge? do you want more review?

@szeiger
Member Author

szeiger commented Feb 15, 2019

It is ready to merge but it hasn't been approved yet. I just rebased it. The old version would not have compiled after merging.

I also investigated adding a special factory type for mutable.Map that would automatically choose an OrderAwareHashMap when an Ordering for the key type is available by using an implicit ord: Ordering[K] = null parameter. This does not work in practice because it leads the compiler on a wild goose chase for an Ordering which ends in a diverging implicit expansion, rather than just giving up and using the null default.
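
A rough sketch of the factory idea described in this comment, under stated assumptions: `newMap` is a hypothetical helper (not the code that was actually tried), and it uses the merged class name `CollisionProofHashMap`. It only shows the `implicit ord: Ordering[K] = null` pattern resolving for a key type with an obvious `Ordering`; the comment reports that in practice the implicit search can diverge instead of falling back to `null`.

```scala
import scala.collection.mutable

object MapFactorySketch {
  // Hypothetical helper: pick the collision-resistant map when an
  // Ordering[K] is found, otherwise fall back to a regular HashMap.
  def newMap[K, V](implicit ord: Ordering[K] = null): mutable.Map[K, V] =
    if (ord != null) mutable.CollisionProofHashMap.empty[K, V](ord)
    else mutable.HashMap.empty[K, V]

  // Ordering[String] is available implicitly, so the first branch is taken.
  val byString: mutable.Map[String, Int] = newMap[String, Int]
}
```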

Contributor

@Ichoran Ichoran left a comment

I have a few questions/suggestions for improvement but this looks good overall. I think it's a really important addition to the library for code deployed in a hostile environment. I especially like that this implements a principled, knowably-reasonable-worst-case-performance map instead of the Java solution of hoping for the best and disaster at runtime if you aren't sortable. 👍

* @define mayNotTerminateInf
* @define willNotTerminateInf
*/
final class OrderAwareHashMap[K, V](initialCapacity: Int, loadFactor: Double)(implicit ordering: Ordering[K])
Contributor

I think this should be named after what it's good for, like CollisionProofMap or RobustMap. You want people who should be using it to pick it out as interesting when glancing through the collections names; I'm not sure "OrderAwareHashMap" immediately makes one think, "Oh, this is good for user-facing code where bad actors might use hash collisions in a DOS attack".

Perhaps it also makes sense to put some info in the docs of mutable HashMap, to say something like:

This map is open to hash collision DOS attacks. For a more robust HashMap implementation use @see [Name]

I imagine that most people will naturally go to HashMap first.

Member Author

I'm not happy with OrderAwareHashMap, either. Concrete implementations are usually named after the data structures, whereas abstract traits are named after the properties of the interface. Something like TreeAssistedHashMap would be better in my opinion. OrderAwareMap could be the name of a common interface for multiple implementations.

/** The next size value at which to resize (capacity * load factor). */
private[this] var threshold: Int = newThreshold(table.length)

private[this] var contentSize = 0
Contributor

I'm not sure this is enough; in the case where there are a lot of elements in very few bins, splitting based on the content size alone isn't likely to be a winning strategy. Instead, knowing the number of free slots is probably a good thing also (or maybe the only thing to know). Or maybe we want to know the number of detected hash collisions that can't be resolved by adding more bins.

Member Author

The strategy for expanding the hash table is the same as in the standard HashMap where it performs quite well in the benchmarks with the badly distributed hashes. It does mean that hash tables with random bad distributions can become unnecessarily large in both implementations but it's only a linear factor. In a deliberate collision attack you rely on far-worse-than-linear performance degradation from actual hash collisions (not just bucket collisions).
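
To make the point about size-based growth concrete, here is a simplified sketch, not the PR's code. It assumes the table doubles once `contentSize` reaches `capacity * loadFactor`; the defaults of 16 and 0.75 and the `capacityAfter` helper are illustrative assumptions. The capacity depends only on how many entries were inserted, never on how they are spread across buckets.

```scala
object ResizeSketch {
  // Next size at which to resize (capacity * load factor), as in the
  // doc comment on `threshold` above.
  def newThreshold(capacity: Int, loadFactor: Double): Int =
    (capacity * loadFactor).toInt

  // Hypothetical simulation: table capacity after `inserts` distinct keys,
  // regardless of their hash distribution.
  def capacityAfter(inserts: Int, initialCapacity: Int = 16, loadFactor: Double = 0.75): Int = {
    var capacity = initialCapacity
    var threshold = newThreshold(capacity, loadFactor)
    var contentSize = 0
    while (contentSize < inserts) {
      contentSize += 1
      if (contentSize >= threshold) {
        capacity *= 2 // double the table; entries would be redistributed here
        threshold = newThreshold(capacity, loadFactor)
      }
    }
    capacity
  }
}
```

Under these assumptions a badly distributed key set wastes at most a constant factor of table space, which matches the "only a linear factor" remark.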

}
*/

///////////////////// RedBlackTree code derived from mutable.RedBlackTree:
Contributor

This is a lot of duplicate code. How big of a difference does it make to have this duplicated rather than simply reusing the hopefully-already-optimized RedBlackTree code?

Member Author

There are substantial differences between the two implementations. RedBlackTree has a separate Tree object at the root which gets updated when the root node changes and which also keeps track of the size. The implementation here removes this extra Tree object and either returns the root node from operations (in cases where they don't have to return anything else) or updates the hash table directly.

Member Author

I've taken another look to see if we could generalize the implementation here. Removing the common Node superclass doesn't appear to make a difference. I wanted to benchmark the impact of abstracting over an interface to get and set the root node but then I realized that RBNode also contains the hash of the key. We'd need a lot of key + hash tuples and/or subclasses of RBNode to abstract over this, both of which have a high impact on performance.

Contributor

@Ichoran Ichoran Feb 19, 2019

Yeah, good point, and I guess rehashing is also a bad idea. I just worry that RB tree code is nontrivial enough, and here is in a place hard enough to test, that we might be vulnerable to bugs. But I guess a common implementation has to wait for Valhalla so the tuples are approximately zero-cost.

Member Author

You wouldn't do any rehashing unless you convert back from tree to linked list (which we don't support anyway). The tree stores the hashes because they are used for ordering in the tree before falling back to the "real" Ordering. We can do this because we don't actually need the tree to be sensibly ordered and comparing hashes is usually much faster than comparing the actual keys.
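
The hash-first comparison described here can be sketched as follows. This is an illustrative helper only; `HashFirstCompare.compare` and its parameter names are assumptions, not the PR's internal API.

```scala
object HashFirstCompare {
  // Order tree nodes by cached hash first; consult the (usually slower)
  // key Ordering only when the hashes are identical.
  def compare[K](h1: Int, k1: K, h2: Int, k2: K)(implicit ord: Ordering[K]): Int = {
    val byHash = Integer.compare(h1, h2)
    if (byHash != 0) byHash else ord.compare(k1, k2)
  }
}
```

The resulting tree is not sorted by key in any user-visible sense, which is acceptable because the tree is only used for lookup inside a bucket.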

Member Author

In the benchmarks the difference from rehashing is quite small but that's to be expected since all keys cache their hash codes anyway. For keys with expensive hash calculations this could lead to surprising performance issues because the regular LLNodes cache the hash codes and so do all of our other hash map/set implementations, and Java's as well.

Member Author

I did some preliminary refactoring that would allow abstracting over different nodes with a generic red-black-tree implementation. The performance penalty is around 10 to 20% which is quite significant.

Contributor

"A generic red-black-tree implementation" meaning the one we have in the standard library, or yet another one? Anyway, I agree that the slowdown sounds significant (and thanks for testing!). Then again, if it only hits the hopefully rare heavily-collided case, maybe the slowdown is acceptable if the implementation is simplified sufficiently?

Member Author

The implementation was based on the one here but allows for different node types with or without hashes and values. You would get the same performance degradation if you used it for TreeMap and TreeSet.

Contributor

I guess to my mind the only advantage comes if we literally use scala.collection.mutable.RedBlackTree for the impacted nodes here. Anything else involves significant code duplication; if you're already duplicating code, may as well make it ideal for the application (as you already did).

@szeiger szeiger force-pushed the wip/order-aware-hashmap branch 2 times, most recently from c647c4c to b36c246 Compare February 22, 2019 17:27
@Ichoran
Contributor

Ichoran commented Feb 28, 2019

Bikeshedding: maybe call it ResilientMap or ToughMap to get across the idea that it's supposed to be resistant to attack.

@szeiger
Member Author

szeiger commented Mar 1, 2019

@Ichoran Relative to the old name or the new one? Did you notice that I renamed it to CollisionProofHashMap?

@Ichoran
Contributor

Ichoran commented Mar 1, 2019

Relative to the new one, which definitely gets the idea across but is very long. I was hoping to find a name that succinctly attracts attention to the docs rather than putting the doc string as the class name.

@szeiger
Member Author

szeiger commented Mar 12, 2019

The current name was the preferred one at the last team meeting, so I'm merging this. If anyone wants to rename it, please open a PR.

@szeiger szeiger merged commit f8fdd3e into scala:2.13.x Mar 12, 2019
@SethTisue
Member

gentlemen, start your thesauruses

@SethTisue SethTisue added the release-notes worth highlighting in next release notes label Apr 5, 2019
@SethTisue SethTisue changed the title A mutable HashMap that degrades to red-black trees in the worst case Add CollisionProofHashMap, a mutable HashMap that degrades to red-black trees in the worst case Apr 5, 2019