Add immutable TreeSeqMap (a SeqMap implemented via a customized IntMap/HashMap pair) #7146

Merged
merged 1 commit into from
Jan 23, 2019

Conversation

Contributor

@odd odd commented Aug 28, 2018

This is an alternative SeqMap implementation to compare against VectorMap. It uses a customized IntMap to provide insertion/modification iteration ordering and a HashMap to allow efficient lookup by key.
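The two-structure idea described above can be sketched in a few lines. This is a toy illustration only, not the PR's code: `ToySeqMap` and its fields are hypothetical names, and a plain `TreeMap[Int, K]` stands in for the customized `IntMap` used for the ordering part. It also only shows insertion ordering (an updated key keeps its position).

```scala
import scala.collection.immutable.{HashMap, TreeMap}

object ToySeqMapSketch {
  // Hypothetical, simplified structure: NOT the class added by this PR.
  final case class ToySeqMap[K, V](
      ordering: TreeMap[Int, K],     // ordinal -> key, iterated in ordinal order
      mapping: HashMap[K, (Int, V)], // key -> (ordinal, value), for lookup by key
      nextOrdinal: Int) {

    // Lookup goes straight to the hash map.
    def get(key: K): Option[V] = mapping.get(key).map(_._2)

    def updated(key: K, value: V): ToySeqMap[K, V] = mapping.get(key) match {
      case Some((ord, _)) =>
        // Existing key: keep its ordinal, so its position is unchanged.
        ToySeqMap(ordering, mapping.updated(key, (ord, value)), nextOrdinal)
      case None =>
        // New key: append it under the next ordinal.
        ToySeqMap(
          ordering.updated(nextOrdinal, key),
          mapping.updated(key, (nextOrdinal, value)),
          nextOrdinal + 1)
    }

    def removed(key: K): ToySeqMap[K, V] = mapping.get(key) match {
      case Some((ord, _)) => ToySeqMap(ordering - ord, mapping - key, nextOrdinal)
      case None           => this
    }

    // Iteration walks the ordering tree and resolves values via the hash map.
    def toList: List[(K, V)] =
      ordering.valuesIterator.map(k => k -> mapping(k)._2).toList
  }

  object ToySeqMap {
    def empty[K, V]: ToySeqMap[K, V] =
      ToySeqMap(TreeMap.empty[Int, K], HashMap.empty[K, (Int, V)], 0)
  }
}
```

Removal only touches one entry in each structure, which is presumably why removals stay cheap in this layout; the price is that every entry is stored twice.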

@scala-jenkins scala-jenkins added this to the 2.13.0-RC1 milestone Aug 28, 2018
@NthPortal
Contributor

I gave this a quick look this morning, and it looks really good. I will try to give it a full review later this week or this weekend.

cheers!

@joshlemer
Member

I was thinking that since there's nowhere where you're using the TreeMap's lookup functionality, you could get by with a persistent min-max heap. Unfortunately none exist in the standard library, but maybe we could create a private[collection] implementation and see if it really has lower overhead.

@mdedetrich
Contributor

Nice work. I think in general a properly optimized VectorMap should theoretically be faster than this OrderedMap, but we should benchmark the two implementations rigorously. There is also a tradeoff in terms of memory usage (which can affect performance as well, through cache-line effects, if the collection is large enough).

@odd
Contributor Author

odd commented Aug 29, 2018

Thanks, @mdedetrich! I must stress that this PR is highly inspired by, and partly based on, your VectorMap PR, but with the Vector switched out for a TreeMap. My main objective was to create a simple straightforward baseline implementation of SeqMap that other implementations (e.g. VectorMap) could be compared to and benchmarked against. There might also exist better-suited data structures than TreeMap to use for the ordering part in this implementation (as @joshlemer suggests above). Creating a benchmarking suite to run these against each other (both time and memory wise) feels like a good next step. Perhaps your work on the new JSON implementation can provide some relevant real-world samples to test on?

@mdedetrich
Contributor

@odd Yup, completely understand; I am curious what the real-world results will be. Apologies if my comment came off as too critical. The devil is usually in the details with these things. I will also try to optimize both implementations if I have enough time.

@odd odd changed the title Add immutable OrderedMap (a SeqMap implemented via a TreeMap/HashMap pair) Add immutable OrderedMap (a SeqMap implemented via an customized IntMap/HashMap pair) Sep 7, 2018
@odd odd changed the title Add immutable OrderedMap (a SeqMap implemented via an customized IntMap/HashMap pair) Add immutable OrderedMap (a SeqMap implemented via a customized IntMap/HashMap pair) Sep 7, 2018
@odd
Contributor Author

odd commented Sep 7, 2018

I pushed an updated version which replaces the immutable.TreeMap with a customized immutable.IntMap leading to a 25% improvement in performance on average. I will follow up with a comparison between this implementation and immutable.VectorMap but I have started on some performance improvements to immutable.VectorMap that I would like to finish first.

@odd
Contributor Author

odd commented Oct 6, 2018

For reference, the performance improvements to VectorMap I mentioned in my previous comment have been implemented in PR #7193. While that PR improves the performance of consecutive removes (i.e. removing elements in the same order they were added) by a factor of 5 when removing 2048 consecutive elements from a map of size 4096, OrderedMap is still faster than VectorMap by a factor of 200.
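For readers unfamiliar with the scenario, here is a minimal sketch of what "removing 2048 consecutive elements out of a map of size 4096" looks like. It is not the benchmark code from the thread: the object and method names are made up for illustration, it uses the final class name `TreeSeqMap` from the PR title (the class was still called OrderedMap at this point), and it only checks the result, not the timing.

```scala
import scala.collection.immutable.{SeqMap, TreeSeqMap, VectorMap}

object ConsecutiveRemovalSketch {
  val size = 4096
  val removals = 2048

  // Keys 0 until 4096, inserted in ascending order.
  val entries: Seq[(Int, String)] = (0 until size).map(i => i -> s"v$i")

  // Removes the first `removals` keys in the same order they were inserted.
  def removeConsecutive(m: SeqMap[Int, String]): SeqMap[Int, String] =
    (0 until removals).foldLeft(m)((acc, k) => acc.removed(k))

  def viaTreeSeqMap: SeqMap[Int, String] = removeConsecutive(TreeSeqMap.from(entries))
  def viaVectorMap: SeqMap[Int, String] = removeConsecutive(VectorMap.from(entries))
}
```

Both implementations should end up with the same 2048 remaining entries in the same order; the difference discussed in the thread is purely in how long the fold takes.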

@SethTisue SethTisue added the library:collections PRs involving changes to the standard collection library label Nov 8, 2018
@SethTisue
Member

supposing we merge this, would there be any reason to keep VectorMap? (keeping in mind that keeping VectorMap around as an alternative isn't without cost, it has to be documented and maintained and tested in perpetuity)

@odd
Contributor Author

odd commented Nov 8, 2018

@SethTisue Assuming both VectorMap and OrderedMap pass the collection laws, the question becomes which has the most desirable overall performance characteristics (time/memory). I have started on a benchmarking suite in odd/collection-benchmark, which indicates similar performance (time) between them for all operations except removal of sequential elements (see #7146 (comment)), where OrderedMap is significantly faster. I plan on adding more extensive benchmarking once the lawfulness of both has been verified.

@mdedetrich
Contributor

mdedetrich commented Nov 9, 2018

@SethTisue Afaik the issue is that it's hard to say which collection is better overall. Afaik OrderedMap is better if you remove keys very often (for some definition of often), whereas if this isn't the case then VectorMap is better overall. This is one of the drawbacks of such immutable collections: it's hard to get them to be very good in all cases.

Then there is also memory to take into account; I think that VectorMap uses less memory overall. @odd Is there a list of benchmarks somewhere which shows how much faster VectorMap is compared to OrderedMap on common operations (i.e. lookup by key, iteration, etc.)?

@odd
Contributor Author

odd commented Dec 17, 2018

I have now (finally) updated and pushed my latest benchmarking code to collection-benchmark; here are the results comparing the three immutable.SeqMap implementations ListMap, VectorMap and OrderedMap:

@odd odd force-pushed the orderedmap branch 2 times, most recently from 101bf9c to c23b977 Compare December 17, 2018 19:53
@mdedetrich
Contributor

@odd Thanks for the benchmarks, I think they vaguely confirmed my suspicions

@SethTisue I think from the results it's hard to argue that OrderedMap should be a replacement for VectorMap. VectorMap is arguably faster for "common" patterns, although OrderedMap clearly beats it for anything removal based (and the results for some other operations are interesting; I think we need to investigate why). VectorMap does however take a lot less memory in general.

My personal vote is that VectorMap should be the default for this reason, albeit with documentation saying that if you have a removal-heavy scenario you should explore OrderedMap. I still think it's useful to add OrderedMap for people that need it (with brief documentation stating the tradeoffs). @odd Do you agree?

@odd
Contributor Author

odd commented Dec 30, 2018

@mdedetrich Thanks for the feedback, I disagree with your conclusion though 🤓

By looking at the benchmark results a couple of things can be noted:

  • ListMap has really low memory usage and has excellent performance for traversals as long as the traversal is done in the same direction as the entries are linked (i.e. doing the iteration in the opposite direction, like in traverse_headTail, incurs an enormous performance penalty, as does any other operation requiring out-of-order processing).
  • OrderedMap and VectorMap have similar performance for all operations that do not involve removal; for removal, and especially consecutive removal, OrderedMap is several orders of magnitude faster.
  • OrderedMap takes about 50% extra memory compared to VectorMap (although ListMap is by far the best memory wise).

Altogether I believe the points above speak for making OrderedMap the default immutable.SeqMap implementation rather than VectorMap. OrderedMap is safer in that no single operation has drastically worse performance than the other implementations (which is not the case for ListMap or VectorMap). Of course this comes at the cost of greater memory usage, but IMHO the added safety of not having a performance gotcha is more important for a default implementation. I believe VectorMap should be kept as a more specialized implementation offering improved memory usage over OrderedMap (but without the bad out-of-order processing performance of ListMap), preferably with a note in the documentation explaining its performance profile and suitable use cases.
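To make the traverse_headTail case concrete, a head/tail style traversal might look roughly like the sketch below. This is an assumption about the shape of such a traversal, not the actual code from collection-benchmark; `traverseHeadTail` and the object name are hypothetical. It uses the final class name `TreeSeqMap` in place of OrderedMap.

```scala
import scala.annotation.tailrec
import scala.collection.immutable.{ListMap, SeqMap, TreeSeqMap, VectorMap}

object HeadTailTraversalSketch {
  // Hypothetical stand-in for a head/tail traversal; the real benchmark may differ.
  // Repeatedly takes `head` and continues with `tail` until the map is empty.
  def traverseHeadTail[K, V](m: SeqMap[K, V]): List[(K, V)] = {
    @tailrec
    def loop(rest: SeqMap[K, V], acc: List[(K, V)]): List[(K, V)] =
      if (rest.isEmpty) acc.reverse
      else loop(rest.tail, rest.head :: acc)
    loop(m, Nil)
  }

  val entries: Seq[(Int, String)] = (1 to 10).map(i => i -> i.toString)

  def viaListMap: List[(Int, String)] = traverseHeadTail(ListMap.from(entries))
  def viaVectorMap: List[(Int, String)] = traverseHeadTail(VectorMap.from(entries))
  def viaTreeSeqMap: List[(Int, String)] = traverseHeadTail(TreeSeqMap.from(entries))
}
```

All three implementations yield the entries in insertion order; what differs is the cost of each `head`/`tail` step, which is where the ListMap penalty described above would show up.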

@mdedetrich
Contributor

mdedetrich commented Jan 3, 2019

@odd Honestly, I am personally borderline. I completely understand (and agree!) with your points; the issue I have is that VectorMap is already a memory-intensive collection (it's actually what I would consider borderline in regards to amortized memory used vs. collection size) and OrderedMap just makes this much worse.

It's the whole idea of using roughly 50% more memory than a collection which is already borderline in terms of acceptable memory usage that I have an issue with. Collections that use a lot of memory pose their own problems: with such high memory churn you are often competing for cache lines against other resources, which isn't something that shows up in benchmarks. Compounded on top of this, we are dealing with the JVM environment, which doesn't have the best track record when it comes to memory conservation.

So to me the whole equation is how we should balance using a huge amount of memory against an operation which is at least acceptable in terms of performance.

@Ichoran @NthPortal @joshlemer @julienrf What are your thoughts on this?

@Ichoran
Contributor

Ichoran commented Jan 3, 2019

I'm inclined to agree with @odd that in their current incarnations, OrderedMap is a better default collection than VectorMap for most use cases. Despite the substantial per-element overhead (~150 bytes vs. 100 for VectorMap), it's considerably faster at many common operations (e.g. filter, map, collect) without being considerably slower in anything. That seems, to me, to be the more desirable default behavior: pretty good at everything with no serious deficits.

On the other hand, I'm a little surprised at the results for VectorMap, so it may be that VectorMap can be sped up substantially. In that case, if you ask me which architecture seems more promising, Vector + Hash vs. IntHash + Hash, I'd think the former. While Vector isn't great at removals, it's also not that common for immutable collections to have a lot of single-element removals, and in other cases the Vector architecture ought to be better than the custom IntMap. Otherwise, why don't we throw away Vector and replace it with an IntMap?

@mdedetrich, do you have time to track down why VectorMap seems slow with operations like map?

@Ichoran
Contributor

Ichoran commented Jan 3, 2019

Also @odd how do you know that OrderedMap is correct? There are, of course, a lot of ways to go faster if you're not constrained to produce the right answer :)

(I've run collections-laws on VectorMap and it passes all the tests I've got. That said, there aren't extensive tests for the in-order property on maps.)

@joshlemer
Member

@Ichoran if you're talking about the variant of map that takes an ((K, V)) => (X, Y) and returns VectorMap[X, Y], it's probably slow because VectorMap's builder is very slow (see #7508).
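For clarity, the variant of `map` referred to here is the one whose function returns a pair, so that the result is again a `VectorMap` and has to be assembled through the collection's builder. A small sketch (the object and value names are only illustrative):

```scala
import scala.collection.immutable.VectorMap

object MapVariantSketch {
  val source: VectorMap[String, Int] = VectorMap("a" -> 1, "b" -> 2, "c" -> 3)

  // f: ((K, V)) => (X, Y), so the result is a VectorMap[X, Y] built entry by entry.
  val swapped: VectorMap[Int, String] = source.map { case (k, v) => (v, k) }
}
```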

@odd
Contributor Author

odd commented Jan 3, 2019

@Ichoran I've been meaning to run OrderedMap through collections-laws and make sure it passes, but I haven't been able to do so yet (last I checked @SethTisue was going to make a new release of collection-laws; perhaps that has already been done, I'll check tomorrow).

@mdedetrich
Contributor

@odd Do you want to rerun your benchmarks now that #7508 is merged?

@odd
Contributor Author

odd commented Jan 15, 2019

@mdedetrich I will rerun the benchmarks later today on a local Scala build which includes #7508 and also an updated OrderedMap-implementation that uses the same techniques as in #7508 for the mapping and combines this with an in-place append operation for the ordering tree (see 00b7f1e). The benchmark will compare OrderedMap (the previous version), OrderedMapY (the latest version) and VectorMap (the latest version).

@odd
Contributor Author

odd commented Jan 15, 2019

@Ichoran Regarding the name OrderedMap, I agree that it could be somewhat confusing (especially since SortedMap uses an Ordering to sort its elements). Some suggestions for alternative names are:

  • SeqHashMap
  • LinkedMap/LinkedHashMap (although it is not technically linked like the mutable one)

@joshlemer
Member

  • TreeHashMap?

@odd
Contributor Author

odd commented Jan 15, 2019

@joshlemer I considered TreeHashMap but since immutable.HashMap is also implemented as a tree, it was a bit undescriptive. What do you think about SeqHashMap?

@joshlemer
Member

joshlemer commented Jan 15, 2019

<bikeshed>

Hmm, well I don't like it a lot because there isn't actually a Seq anywhere backing it, like a Vector backs a VectorMap or a List backs a ListMap. Maybe a TreeSeqMap?

</bikeshed>

@odd
Contributor Author

odd commented Jan 15, 2019

@mdedetrich Here are the results of the latest benchmark runs:

@odd
Contributor Author

odd commented Jan 15, 2019

@joshlemer But the principal interface is called SeqMap and it isn't a Seq (as in collection.Seq) either, so SeqHashMap would be short for Sequential Hash Map pertaining to the entries in the map being sequential (as opposed to the meaning of Seq in collection.Seq where Seq stands for sequence; whether there's a meaningful difference between the terms I don't know - I'm not a native English speaker so the nuances might be lost on me I'm afraid).

@joshlemer
Member

I think it makes most sense to call it a TreeSeqMap, because what it is is a SeqMap (not a HashMap; though it contains one, that is not what it is). It's a SeqMap which is backed by both a HashMap and a Tree, so I think the last word should be SeqMap, with something put at the front, like TreeSeqMap/TrieSeqMap/TreeHashSeqMap/TrieHashSeqMap.

I think the Tree/Trie aspect is more important to differentiate, because remember that VectorMap also uses a HashMap, so SeqHashMap could equally apply to VectorMap. Anyways, just my $0.02 😄

@SethTisue
Member

👍 to TreeSeqMap

Contributor

@Ichoran Ichoran left a comment


Just need to rename to TreeSeqMap and then I guess it's okay. I still am uneasy about the ordering policy, but I will leave that to someone else to decide.

@odd odd changed the title Add immutable OrderedMap (a SeqMap implemented via a customized IntMap/HashMap pair) Add immutable TreeSeqMap (a SeqMap implemented via a customized IntMap/HashMap pair) Jan 17, 2019
@odd odd force-pushed the orderedmap branch 2 times, most recently from d52cdcc to ad4d0c6 Compare January 17, 2019 08:48
@odd
Contributor Author

odd commented Jan 17, 2019

@Ichoran I have changed the name, squashed and pushed, but it still looks like there is a change requested by you. The problem is that I cannot find your comments any more (I had to do a force push to get rid of the earlier commit that was failed by Codacy, so the change request might have got sidelined somehow). Please let me know if you think anything more should be changed (besides your reservation about the modification ordering).

@Ichoran
Contributor

Ichoran commented Jan 17, 2019

Looks good! (Still uneasy about the insertion/modification thing, but not so uneasy that we should hold this up.)
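The thread does not spell out the insertion/modification ordering policy, so here is a small sketch of the difference. It assumes the API as it appears in the Scala 2.13 standard library (`TreeSeqMap.OrderBy.Insertion`, `TreeSeqMap.OrderBy.Modification` and `TreeSeqMap.empty(orderBy)`); those names are not taken from this conversation, and the object and value names are only illustrative.

```scala
import scala.collection.immutable.TreeSeqMap

object OrderingPolicySketch {
  // Insertion ordering: updating an existing key keeps its original position.
  val byInsertion: TreeSeqMap[String, Int] =
    TreeSeqMap.empty[String, Int](TreeSeqMap.OrderBy.Insertion)
      .updated("a", 1)
      .updated("b", 2)
      .updated("a", 3)

  // Modification ordering: updating an existing key moves it to the end.
  val byModification: TreeSeqMap[String, Int] =
    TreeSeqMap.empty[String, Int](TreeSeqMap.OrderBy.Modification)
      .updated("a", 1)
      .updated("b", 2)
      .updated("a", 3)
}
```

With insertion ordering the key "a" stays first after being updated; with modification ordering it moves behind "b". Both maps hold the same key/value pairs and differ only in iteration order.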

@mdedetrich
Contributor

mdedetrich commented Jan 17, 2019

@odd Thanks for the benchmarks, there are definitely substantial improvements in performance to VectorMap. I think there are still areas where we can squeeze more performance, will check this out on the weekend.

Would it also be possible to document how you generate your benchmarks? I don't want to ping you every time I run them!

@odd
Contributor Author

odd commented Jan 22, 2019

/rebuild

@Ichoran Ichoran merged commit 5888c7d into scala:2.13.x Jan 23, 2019
@SethTisue SethTisue added the release-notes worth highlighting in next release notes label Jan 23, 2019
@SethTisue
Member

#8041 proposes to change the default to TreeSeqMap

Labels
library:collections PRs involving changes to the standard collection library release-notes worth highlighting in next release notes
8 participants