Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment with Open Addressing HashMap and base it on Vector #1902

Closed
l0rinc opened this issue Mar 9, 2017 · 9 comments
Closed

Experiment with Open Addressing HashMap and base it on Vector #1902

l0rinc opened this issue Mar 9, 2017 · 9 comments

Comments

@l0rinc
Copy link
Contributor

l0rinc commented Mar 9, 2017

Since Hash Array Mapped Trie and Bit Mapped Trie are very similar conceptually (not even sure what the differences are), we could eliminate duplicate code if we would base the Maps on Vector, instead of recreating the array-emulating logic (where applicable, i.e. not for TreeMap).
(This is true even if we decide not to switch to Open Addressing.)

In addition to that, we could experiment with Open Addressing also - the current HashMaps use linking for collisions, right?

Pro:

  • Compared to chaining, linear probing has better cache locality, therefore it can be a lot faster
  • Could end up using a lot less memory

Contra:

  • It's difficult to implement
  • It cannot store null values (easily), might be problematic with primitive storage (a BitSet could help here, I think)
  • Delete will leave the keys and just change the values to null, might lead to memory leaks (I'm sure though that others have thought about this also)

The following resources could help us in designing it:

@ruslansennov has the expertise in designing the Maps, I know the Vector and @danieldietrich could mediate and contribute with his overall expertise. It could be a team-effort.
We could either create an experimental branch/package (or both), where we could collaborate on trying to achieve the fastest persistent map for the JVM - with automatic primitive support!

Related to: #1486

@danieldietrich
Copy link
Contributor

Fusing similar code and speeding up things + making them more memory efficient is alway a good idea.

I will create a feature branch that we can work on. The experimental HashMap could use the best of both worlds, BMT and HAMT. I.e. using the hashing stuff of HAMT and building a structure using BMT.

But caution: open-addressed impls may perform very very bad: http://stackoverflow.com/a/2557451/1110815

Open-addressing is usually faster than chained hashing when the load factor is low because you don't have to follow pointers between list nodes. It gets very, very slow if the load factor approaches 1, because you end up usually having to search through many of the slots in the bucket array before you find either the key that you were looking for or an empty slot. Also, you can never have more elements in the hash table than there are entries in the bucket array.

@l0rinc
Copy link
Contributor Author

l0rinc commented Mar 9, 2017

It gets very, very slow if the load factor approaches 1

The mentioned article addresses this with a logarithmic worst-case probing

@danieldietrich
Copy link
Contributor

Yes, but if HAMT is moderately slower (absolutely) for the good case and more stable in direction to the worst case I would prefer HAMT. We have to check that. Scala uses HAMT for HashMap. This might have a reason.

@danieldietrich danieldietrich modified the milestones: 3.0.0, 2.1.0 Mar 9, 2017
@danieldietrich
Copy link
Contributor

@paplorinc I've created the open-addressing-hash-map branch.

@ruslansennov
Copy link
Member

Hi guys.
I believe it will be very interesting experiment. I will try to suggest cons-benchmark when prototype will be ready.

@ruslansennov
Copy link
Member

(hope I can) 😄

@l0rinc
Copy link
Contributor Author

l0rinc commented Mar 10, 2017

a few notes:

  • if we store Vector<Tuple<K, V>>s in the nodes, we cannot take advantage of primitive support
    • unless we generate corresponding Tuples also
    • or we store the K and V in separate Vectors (we might lose cache locality this way)
    • we could store two arrays in a specialized BMT somehow...
  • we could use BitSet to store info about the values, e.g. was it deleted or not (is needed for linear probing)
  • we may want to use better hashing than a simple modulo

I will experiment with these, please share your findings also :)

@danieldietrich
Copy link
Contributor

we could store two arrays in a specialized BMT somehow...

I would not change the BMT for the purpose of an Open Addressing (OA) HashMap. The past development showed that it is best to create specialized backing structures for each purpose.

On the other hand we do not want to have another backing collection, we want to reduce them. Changing BMT in order to make it work for Vector and HashMap will make it too complicated (-> separation of concerns). In the end no one would know which line of code is necessary for which other collection...

we could use BitSet to store info about the values

Maybe this will make it too heavy-weight. We should try to stay on the lowest level possible for our backing collections: primitive types, bit-operations, ...

we may want to use better hashing than a simple modulo

Yes, definitely. The quality of the hashing algorithm is the key to an outstanding hashed collection.


I will try to get more involved after solving the issues I'm currently on - too many parallel tasks 😳

@danieldietrich
Copy link
Contributor

Closed because this issue is stale.
If this is an issue, we need a new investigation, which map impl is state of the art to that moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants