
Changed Vector from HAMT to BMT #1504

Merged
merged 3 commits into from Aug 24, 2016

Conversation

@paplorinc (Contributor) commented Aug 17, 2016

Fixes: Implement Vector as bit-mapped vector trie #568.
Dependency for: #1449, #1522 and #1557.

This is the non-optimized version; for now I have only compared it against Scala's Vector, but feel free to compare it against other implementations.

A bit-mapped trie is a very wide and shallow tree (for integer indices the depth is ≤ 6).
Each node has at most 32 children (configurable).
Access to a given position is done by converting the index to a base-32 number and using each digit to descend the tree.
Modification works similarly, but copies the path along the way, returning a new root each time.
Append inserts into the last leaf, or, if the tree is full on the right, adds another layer on top of it (the old root becomes the first child of the new one).
Prepend works similarly, but needs an offset: adding a new top node (where the current root becomes the last child of the new root) shifts the indices by half of the current tree's full size, and the offset shifts them back to the correct index.
Slice is done by trimming the path from the root and discarding any leading/trailing values in effectively constant time (without the memory leak found in the Java and Clojure implementations).
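
The radix descent described above can be sketched as follows. This is a minimal, hypothetical illustration (not the PR's actual code), assuming a branching factor of 32 and a dense, offset-free trie; the names `BmtSketch` and `depthShift` are illustrative:

```java
// Minimal sketch of radix descent in a bit-mapped trie with branching factor 32.
// Each 5-bit slice of the index is one base-32 "digit" selecting a child node.
final class BmtSketch {
    static final int BITS = 5;               // log2(32): bits consumed per level
    static final int MASK = (1 << BITS) - 1; // 0b11111: selects one base-32 digit

    // Descend one level per digit; the last digit indexes into the leaf array.
    static Object get(Object[] root, int depthShift, int index) {
        Object[] node = root;
        for (int shift = depthShift; shift > 0; shift -= BITS) {
            node = (Object[]) node[(index >> shift) & MASK];
        }
        return node[index & MASK];
    }
}
```

With `depthShift = 5` (a two-level trie), index 33 resolves to child `33 >> 5 = 1` of the root and slot `33 & 31 = 1` of that leaf.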


Memory:

| Elements | `scala.collection.immutable.Vector` | `javaslang.collection.Vector` |
|---------:|-------------------------------------|-------------------------------|
| 32768 | 659 KB (20.7 KB overhead, 0.6 bytes/element) | 659 KB (20.6 KB overhead, 0.6 bytes/element) |
| 1024 | 19.3 KB (672 bytes overhead, 0.7 bytes/element) | 19.3 KB (664 bytes overhead, 0.6 bytes/element) |
| 32 | 536 bytes (32 bytes overhead, 1.0 bytes/element) | 528 bytes (24 bytes overhead, 0.8 bytes/element) |

Speed:

Operation  Ratio                                 32   1024  32768
Create     slang_persistent/scala_persistent  0.33×  0.11×  0.07×
Head       slang_persistent/scala_persistent  0.85×  0.64×  0.49×
Tail       slang_persistent/scala_persistent  1.04×  1.06×  0.96×
Get        slang_persistent/scala_persistent  1.22×  0.90×  0.85×
Update     slang_persistent/scala_persistent  1.04×  1.14×  1.24×
Prepend    slang_persistent/scala_persistent  0.64×  0.58×  0.43×
Append     slang_persistent/scala_persistent  0.94×  0.37×  0.29×
GroupBy    slang_persistent/scala_persistent  0.96×  0.31×  0.22×
Slice      slang_persistent/scala_persistent  0.79×  1.23×  1.64×
Iterate    slang_persistent/scala_persistent  1.72×  1.03×  0.95×

That is, for a small number of elements it often matches Scala's speed; for larger sizes it is often 2-3× slower.
Compared to the optimized version, this one is roughly:

  • create: ~30x slower
  • head: ~2x slower
  • tail: ~4x slower
  • get: ~0.7x slower (i.e. ~1.4x faster)
  • update: ~0.95x slower (i.e. ~1.05x faster)
  • prepend: ~2.5x slower
  • append: ~2.5x slower
  • groupby: ~3.5x slower
  • slice: ~3.5x slower
  • iterate: ~1.1x slower

@codecov-io commented Aug 17, 2016

Current coverage is 96.98% (diff: 90.65%)

Merging #1504 into master will decrease coverage by 0.04%

@@             master      #1504   diff @@
==========================================
  Files            87         88     +1   
  Lines         10541      10638    +97   
  Methods           0          0          
  Messages          0          0          
  Branches       1798       1814    +16   
==========================================
+ Hits          10228      10317    +89   
+ Misses          183        179     -4   
- Partials        130        142    +12   

Powered by Codecov. Last update 8360a92...9205299

* @author Ruslan Sennov
* @since 2.0.0
* @author Pap Lőrinc
* @since 3.0.0
Member:

Maybe it will make it into 2.1.0; the interface did not change (only methods were added).

@paplorinc (Contributor Author):

Force pushed a new version.
Thanks to @danieldietrich, I managed to reduce it to a single depth-modification method, extended with some lambdas. This way most methods are unified now!

We don't need to go down the tree and back up again (which saves time), but the lambdas slow it down a bit, so the end performance is basically the same as before (I updated the numbers in the description, though).

Now the non-optimized BMT is only 150 lines :).

@danieldietrich, @ruslansennov, @zsolt-donca, waiting eagerly for your reviews :)

return modifyLeaf(array, depthShift, index, node, leaf);
}

private static Object[] modifyLeaf(Object[] array, int depthShift, int index, NodeModifier nodeModifier, NodeModifier leafModifier) {
Contributor Author:

prepend, append, update, drop and take now use this iterative method instead
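
The shared iterative technique can be sketched for the update case. This is a hypothetical simplification (the PR generalizes the loop via `NodeModifier` lambdas so that prepend, append, update, drop and take all share it); `PathCopySketch` and its constants are illustrative names, not the PR's API:

```java
import java.util.Arrays;

// Sketch of iterative path copying in a bit-mapped trie (branching factor 32):
// copy only the nodes on the root-to-leaf path; all other subtrees stay shared.
final class PathCopySketch {
    static final int BITS = 5;               // log2(32)
    static final int MASK = (1 << BITS) - 1;

    static Object[] update(Object[] root, int depthShift, int index, Object element) {
        final Object[] newRoot = Arrays.copyOf(root, root.length); // copy the root first
        Object[] node = newRoot;
        for (int shift = depthShift; shift > 0; shift -= BITS) {
            final int childIndex = (index >> shift) & MASK;
            final Object[] child = (Object[]) node[childIndex];
            final Object[] newChild = Arrays.copyOf(child, child.length);
            node[childIndex] = newChild; // safe to mutate: node is a fresh copy
            node = newChild;             // untouched sibling subtrees stay shared
        }
        node[index & MASK] = element;    // set the value in the copied leaf
        return newRoot;
    }
}
```

A single downward pass suffices: since each node on the path is copied before descending, the leaf slot can be written directly without walking back up.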

Member:

I think we can save lambda instances by creating singletons. I'm not sure if the curried NODE version helps...

// names can be changed
private static final NodeModifier NODE = (o, i) -> copy(o, i + 1);
private static final Function<Object, NodeModifier> LEAF = element -> (o, i) -> copyUpdate(o, i, element);

// ...

private static Object[] modifyLeafValue(Object[] array, int depthShift, int index, Object element) {
    return modifyLeaf(array, depthShift, index, NODE, LEAF.apply(element));
}

Update: If the above makes sense, we don't need modifyLeafValue() at all and can use modifyLeaf() directly.

@paplorinc (Contributor Author) Aug 18, 2016:

There's a measurable difference this way 💃
Thanks @danieldietrich!

before:

Operation  Ratio                                32   1024  32768 
Update    slang_persistent/scala_persistent  1.03×  0.93×  1.17×
Prepend   slang_persistent/scala_persistent  0.64×  0.58×  0.44×
Append    slang_persistent/scala_persistent  0.84×  0.33×  0.26×
Slice     slang_persistent/scala_persistent  0.74×  1.19×  1.47×

after:

Operation  Ratio                                 32   1024  32768 
Update     slang_persistent/scala_persistent  1.04×  1.14×  1.24×
Prepend    slang_persistent/scala_persistent  0.64×  0.58×  0.43×
Append     slang_persistent/scala_persistent  0.94×  0.37×  0.29×
Slice      slang_persistent/scala_persistent  0.79×  1.23×  1.64×

Member:

Mmhh, interesting - do I interpret it right that creating more instances performs better?

Roughly how often is modifyLeafValue() called per operation? Zero or one times, or a logarithmic number of times?

I guess that before/after performs equally (modulo measurement error). Because we turned GC off, we don't see the effect of creating instances. However, if the method is not called often, it will not matter.

Contributor Author:

It's called once per operation, e.g. one update will call modifyLeaf once.
But don't worry about GC, through escape analysis short-lived lambdas might end up on the stack, instead of the heap (https://dzone.com/articles/java-lambdas-and-low-latency).

Member:

sounds good!

@danieldietrich danieldietrich added this to the 2.1.0 milestone Aug 18, 2016

/** clone the source and set the value at the given position */
static Object[] copyUpdate(Object arrayObject, int index, Object element) {
final Object[] array = (Object[]) arrayObject;
Contributor Author:

As stated in the JavaDoc above: the parameter is an untyped Object, to get rid of call-site casts (and to make room for non-Object-based arrays later).

@danieldietrich (Member):

Build fails with

[ERROR] /home/travis/build/javaslang/javaslang/javaslang-benchmark/src/test/java/javaslang/collection/CharSeqBenchmark.java:[54,98] method equals in class java.lang.Object cannot be applied to given types;
  required: java.lang.Object
  found: char[],char[]
  reason: actual and formal argument lists differ in length

return branchCount * fullBranchSize;
}

BitMappedTrie<T> prependLeaf(T leading) {
Contributor Author:

should be optimized by appending whole arrays instead...


Why is this method called prependLeaf as opposed to just prepend? Does it make sense to have, or can you imagine, an operation in which we prepend anywhere other than the first leaf?

Contributor Author:

Until now you could only append full leaves (proposed here as an optimization), and I didn't see the point in renaming it back and forth.
Renamed both :)

@ruslansennov (Member):

Sorry guys, I can't take part in the discussion, too busy until Monday.
I'll review this PR later (if it's still relevant).

assertAreEqual(actual, null, constant(actual), expected);
}

System.out.println("Depth " + depth + " ok!");
Member:

we should remove all System.out calls (and assert if necessary)

Contributor Author:

Sure, it was mainly for debugging (I've seen that many other Javaslang tests print out content, but I agree that they shouldn't) :)

Member:

we will clean up the others also (in another PR)

Contributor Author:

(I will keep System.out.println("using seed " + seed); as it's the only way to reproduce it in case of failure)

@danieldietrich (Member):

For me it looks good so far. I can pull it in if the System.outs are removed...

@@ -578,6 +581,7 @@ public void slang_persistent(Blackhole bh) {
for (int i = 1; !values.isEmpty(); i++) {
values = values.slice(1, values.size());
values = values.slice(0, values.size() - 1);
assert sizeOf(values.trie) < (branchingFactor() * sizeOf(copyRange(ELEMENTS, i, javaMutable.size() - i))); /* detect memory leak */
Contributor Author:

Added leak detection: if slice keeps only a few elements out of many, the resulting trie's size should be small.

@danieldietrich (Member) commented Aug 21, 2016:

We can optimize the partition method, which currently first fills an ArrayList, then calls ofAll(Iterable), which internally calls trie.append:

    @Override
    public Tuple2<Vector<T>, Vector<T>> partition(Predicate<? super T> predicate) {
        Objects.requireNonNull(predicate, "predicate is null");
        final ArrayList<T> left = new ArrayList<>(), right = new ArrayList<>();
        for (int i = 0; i < length(); i++) {
            final T t = get(i);
            (predicate.test(t) ? left : right).add(t);
        }
        return Tuple.of(ofAll(left), ofAll(right));
    }

I think we could save CPU + Mem by calling trie.append directly:

    @Override
    public Tuple2<Vector<T>, Vector<T>> partition(Predicate<? super T> predicate) {
        Objects.requireNonNull(predicate, "predicate is null");
        BitMappedTrie<T> left = BitMappedTrie.empty(), right = BitMappedTrie.empty(); // not final: reassigned in the loop
        for (T t : this) {
            if (predicate.test(t)) {
                left = left.append(t);
            } else {
                right = right.append(t);
            }
        }
        return Tuple.of(ofAll(left), ofAll(right));
    }

We can also iterate elements instead of calling get for each element.


Question: Does the JVM optimize length() calls here? It can't know that Vector is immutable and the value never changes...

for (int i = 0; i < length(); i++) {

Wouldn't it be better to first store length() in a variable and then check against that?

final int length = length();
for (int i = 0; i < length; i++) {

@paplorinc (Contributor Author) commented Aug 22, 2016:

I think we could save CPU + Mem by calling trie.append directly:

You're basing these optimizations on the unoptimized version.
In the new optimized version ofAll is ~90x faster (I'm working on that PR in parallel), so this method would become a lot slower if I changed it to append directly.
I suggest doing the optimizations later, when we can measure something concrete :).

Edit:

It can't know that Vector is immutable and the value never changes...

Actually, when optimizing, the JIT makes a few assumptions (e.g. "only these two switch branches seem to be used, so throw out the rest"), as long as the assumption can be corrected at a small cost when it is violated. Therefore it can inline length() (but even if it doesn't, storing it in a local would likely not bring any measurable gain, while complicating the code slightly).

It's different for the iterator, though: when an inner class refers to the outer one, extracting the length makes sense (but again, it has to be measured case by case; unfortunately, I've had too many surprises).

@paplorinc (Contributor Author):

@danieldietrich, @ruslansennov, @zsolt-donca, @eduardmanas, did I answer or apply all your requests?
Can we merge it :)?

*
* @param <T> Component type of the Vector.
* @author Ruslan Sennov
* @author Ruslan Sennov, Pap Lőrinc
* @see BitMappedTrie
Member:

Maybe my last comment :) BitMappedTrie has package visibility, so I believe this line should be removed (?)

Contributor Author:

I think this can be useful when someone's viewing the sources (I commented BMT thoroughly).
@danieldietrich?

Member:

@paplorinc Please remove the line. Instead you could modify the first line and mention that our Vector is based on a Bit Mapped Trie implementation.

Contributor Author:

Done, let's merge it, as I have at least 2 more PRs to prepare :)

@zsolt-donca:

@paplorinc Mine are all addressed!

Keep up the good work!

* @author Pap Lőrinc
* @since 2.1.0
*/
final class BitMappedTrie<T> implements Serializable {
Member:

Hi @paplorinc, great work so far. This implementation is smaller than I expected!

Before pulling it I want to see just one more thing: Please make BitMappedTrie Iterable and implement the iterator() method in the way that it does not use get(index). Instead it should use the internal BMT structure to efficiently iterate the elements (i.e. not getting the array on every step, instead use the offset until the end of the array is reached, right?).

This is inevitable because otherwise either the internal structure would leak to the outside or we could not leverage the knowledge about the internal structure and have to fall back to less efficient ways to access elements. The latter case is currently implemented (using get(index) each step).

I think that it will be a better base for comparing the upcoming optimizations.

It shouldn't be much effort, should it?
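
The requested iterator could be sketched as follows. This is a hypothetical illustration, not the PR's implementation: it caches the current leaf array and descends the trie only when that leaf is exhausted, instead of performing a full get(index) per element. The names `LeafIterator`, `descendToLeaf` and `depthShift` are assumptions:

```java
import java.util.Iterator;

// Sketch: iterate a bit-mapped trie (branching factor 32) by caching the
// current leaf, so the tree is only re-descended once per 32 elements.
final class LeafIterator implements Iterator<Object> {
    static final int BITS = 5;               // log2(32)
    static final int MASK = (1 << BITS) - 1;

    private final Object[] root;
    private final int depthShift;
    private final int length;
    private int globalIndex = 0; // position in the whole trie
    private int leafIndex = 0;   // position within the cached leaf
    private Object[] leaf;       // currently cached leaf array

    LeafIterator(Object[] root, int depthShift, int length) {
        this.root = root;
        this.depthShift = depthShift;
        this.length = length;
        this.leaf = descendToLeaf(0);
    }

    @Override public boolean hasNext() { return globalIndex < length; }

    @Override public Object next() {
        if (leafIndex > MASK) {              // current leaf exhausted
            leaf = descendToLeaf(globalIndex);
            leafIndex = 0;
        }
        globalIndex++;
        return leaf[leafIndex++];
    }

    private Object[] descendToLeaf(int index) {
        Object[] node = root;
        for (int shift = depthShift; shift > 0; shift -= BITS) {
            node = (Object[]) node[(index >> shift) & MASK];
        }
        return node;
    }
}
```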

Contributor Author:

Sure, will do, but I will have to delete it later, when leading and trailing are introduced to optimize append and prepend (to be able to iterate over all arrays; three iterators cannot be combined efficiently otherwise).

Contributor Author:

Force pushed a new version.
@danieldietrich, I thought you were interested in the simplest solution in this commit.
I consider this explicit iterator an optimization.

Member:

We will see. In the design of types it is the right place for iterator(). We will take a look at leading and trailing next.

In Scala Vector is in between Array and List. List is best at prepend() & tail(), Array is best at random access get() and update(). This is the goal for Vector optimizations, it should be good at both and be in between Array and List. That's what I'm looking at, not the optimization of specific methods like groupBy() and slice() for example.

I'm really looking forward to review leading() and trailing(), especially on the performance changes of get(), update(), append(), prepend() and tail().


Iterator<T> iterator() {
return new Iterator<T>() {
int globalIndex = 0;
Member:

Please create `final int length = BitMappedTrie.this.length();`

@paplorinc (Contributor Author):

Pushed; could we please merge this and move on to the optimizations?

@danieldietrich (Member):

ok, thank you, pulling it in now

@danieldietrich danieldietrich merged commit 9311fee into vavr-io:master Aug 24, 2016
@paplorinc paplorinc deleted the NewVector branch August 25, 2016 07:18
@paplorinc (Contributor Author):

ok, thank you, pulling it in now

Please review #1522 also.
