-
-
Notifications
You must be signed in to change notification settings - Fork 630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vector optimizations #1522
Vector optimizations #1522
Conversation
@danieldietrich, @ruslansennov, @zsolt-donca, @eduardmanas these are the optimizations for |
Current coverage is 96.48% (diff: 95.58%)@@ master #1522 diff @@
==========================================
Files 89 89
Lines 10878 10940 +62
Methods 0 0
Messages 0 0
Branches 1832 1850 +18
==========================================
+ Hits 10495 10556 +61
+ Misses 241 240 -1
- Partials 142 144 +2
|
Interesting table of measures :-)) I've removed functionaljava and PCollections from the following list because they are too slow. Also I've reduced the not-so-common methods. Get and Update are most important to us. Vector is a random-access (index sequence) collection and has to be as good as possible in these operations. Head, Tail, Prepend and Append are typical linear sequence operations. These are not so interesting because users might choose a linear sequence collection (List or Stream) for that purpose. This is what remains: Operation Ratio 32 1024 32768
Create slang_persistent/java_mutable 0.87× 0.44× 0.58×
Create slang_persistent/scala_persistent 14.73× 8.65× 6.38×
Get slang_persistent/java_mutable 0.86× 0.34× 0.30×
Get slang_persistent/scala_persistent 1.26× 0.86× 0.81×
Update slang_persistent/java_mutable 0.10× 0.05× 0.02×
Update slang_persistent/scala_persistent 1.17× 0.98× 1.17×
Prepend slang_persistent/java_mutable 1.47× 3.99× 100.43×
Prepend slang_persistent/scala_persistent 1.33× 1.41× 1.44×
Append slang_persistent/java_mutable 0.26× 0.14× 0.10×
Append slang_persistent/scala_persistent 0.84× 0.59× 0.60× Update is slow compared to Java mutable. But I think that is not a surprise. Compared to Scala we are head to head here. Why Get and Append are relatively slower than in Scala? |
Java mutable, are you drunk? Prepend slang_persistent/java_mutable 1.47× 3.99× 100.43× |
This would be a case where |
@danieldietrich, @ruslansennov, @zsolt-donca, @eduardmanas, these are the optimizations for the new Please review it commit-by-commit and take a look at the commit messages to see how much faster it got. Added |
@@ -42,7 +43,7 @@ public void testAsserts() { | |||
|
|||
public static void main(String... args) { | |||
JmhRunner.runDebugWithAsserts(CLASSES); | |||
JmhRunner.runNormalNoAsserts(CLASSES); | |||
JmhRunner.runNormalNoAsserts(CLASSES, JAVA, FUNCTIONAL_JAVA, PCOLLECTIONS, CLOJURE, SCALA, JAVASLANG); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During optimizations I don't want to wait for pcollections
and fjava
, so I just delete these temporarily:
JmhRunner.runNormalNoAsserts(CLASSES, SCALA, JAVASLANG);
Pushed new version. @ruslansennov, I might have found what you meant by double copying, simplified the grouping algorithm to avoid it, thanks! @danieldietrich, the |
@@ -883,6 +884,16 @@ void setCurrentArray() { | |||
} | |||
} | |||
|
|||
void unsafeLeafIterator(LeafVisitor<T> visitor) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simplified this significantly
The branching base is selected to be Operation Ratio base=2 base=3 base=4 base=5 base=6 base=7
Get java_mutable/slang_persistent 8.84× 5.36× 4.23× 3.40× 3.39× 3.32×
Update java_mutable/slang_persistent 82.37× 57.74× 53.07× 44.47× 49.36× 77.90×
Iterate java_mutable/slang_persistent 8.01× 7.06× 3.05× 2.74× 2.83× 2.38× (it's surprising that |
@paplorinc thanks for the great overview. I like it that you not guess but measure! Could you create the same overview for the previous/not optimized Vector version? Would be nice for comparison. |
There's no difference in this regard. |
Update is a little faster than in Scala, I think it is fine. I've recognized your speed-fix regarding |
/** Create a single element array */ | ||
static Object[] asArray(Object element) { | ||
final Object[] copy = copy(empty(), 1); | ||
final Object[] copy = newInstance(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danieldietrich, unified these also, as per your suggestion
} | ||
} | ||
|
||
<U> BitMappedTrie<U> map(Function<? super T, ? extends U> mapper) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be optimized to take advantage of the fast that map keeps the original structure (i.e. almost 2x
speedup), but would result in slightly more complicated code.
@danieldietrich?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also unroll the middle manually, resulting in a measurable change, but distorting the code significantly, i.e.
int i = start;
for (; i < end - 3; i += 4) {
final int val = index.val;
elems[val] = mapper.apply(leaf[i]);
elems[val + 1] = mapper.apply(leaf[i + 1]);
elems[val + 2] = mapper.apply(leaf[i + 2]);
elems[val + 3] = mapper.apply(leaf[i + 3]);
index.val = val + 4;
}
for (; i < end; i++) {
elems[index.val++] = mapper.apply(leaf[i]);
}
before:
Map slang_persistent/scala_persistent 1.26× 1.53× 1.88×
after:
Map slang_persistent/scala_persistent 1.40× 1.54× 1.92×
i.e. not worth it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think we do not need the further improvement (for now).
With the new visitor, the map
method looks like this:
<U> BitMappedTrie<U> map(Function<? super T, ? extends U> mapper) {
Objects.requireNonNull(mapper, "mapper is null");
final Object[] results = new Object[length()];
this.<T> visit((leaf, start, end, index) -> {
for (int i = start; i < end; i++) {
elems[index++] = mapper.apply(leaf[i]);
}
return index;
});
return BitMappedTrie.ofAll(elems, length);
}
final Object[] results = new Object[length()]; | ||
final MutableInt index = new MutableInt(); | ||
this.<T> visit((leaf, start, end) -> { | ||
int resultIndex = index.val; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mutating a local variable and writing it back to MutableInt
at the end makes it ~5%
faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do not need MutableInt
. Instead we could add a visit
method that takes a param index
and returns the updated index
:
@FunctionalInterface
interface LeafVisitor<T> {
int visit(T leaf, int start, int end, int index);
}
The visit
method then looks like this:
@SuppressWarnings("unchecked")
<T2> int visit(LeafVisitor<T2[]> visitor) {
int start = lastDigit(offset);
int currentIndex = 0;
for (int index = 0; index < length; ) {
final T2[] leaf = (T2[]) getLeaf(index);
final int end = Math.min(leaf.length, start + length - index);
currentIndex = visitor.visit(leaf, start, end, currentIndex);
index += end - start;
start = 0;
}
return currentIndex;
}
BEFORE:
...
final MutableInt index = new MutableInt();
this.<T> visit((leaf, start, end) -> {
int resultIndex = index.val;
for (int i = start; i < end; i++) {
final T value = leaf[i];
if (predicate.test(value)) {
results[resultIndex++] = value;
}
}
index.val = resultIndex;
});
return (length() == index.val) ? ...
AFTER:
...
final int resultLength = this.<T> visit((leaf, start, end, index) -> {
for (int i = start; i < end; i++) {
final T value = leaf[i];
if (predicate.test(value)) {
results[index++] = value;
}
}
return index;
});
return (length() == resultLength) ? ...
Note: Similar changes could be applied to all other methods that use visit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hesitated at first, but trying it out seems indeed like a good idea, thanks :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark speeds are basically the same, pushed a new version.
Did I address all your concerns?
} | ||
|
||
/* drop root node while it has a single element */ | ||
private static <T> BitMappedTrie<T> collapsed(Object[] array, int offset, int length, int shift) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could also drop leading null
s, but that wouldn't result in any speed advantage, just aesthetics...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain to me how nulls can occur. I have questions:
- Does array contain these nulls?
- What is the meaning of these nulls?
- Is more and more memory consumed over time be nulled array slots?
The BMT should only contain valid nodes.
Before changing the implementation please explain to me how the tree structure changes over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the tree is conceptually not completely symmetric (i.e. a prepend
changes existing indices, an append
doesn't), we don't need to store every array as if it were a full array, i.e. if we have only 20
elements, the array can be of length < 32
.
If you have a full array and drop 10
from the end, we can create a new array of size 22
.
If however we drop from the beginning, it's easier to make a full array, and only copy the kept elements (and store the difference in the offset), having the effect of prepend
with null
s.
If we would want to eliminate all leading null
s, the computations would be more difficult (though I can try again, if this constant overhead (that's applicable only to non-leaf nodes) bothers you).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is fine as is!
@danieldietrich, @zsolt-donca I pushed a new version with the applied recommendations. Edit: we could optimize |
return leaf[leafIndex]; | ||
} | ||
|
||
/** | ||
* fetch the leaf, corresponding to the given index. | ||
* Node: the offset and length should be taken into consideration as there may be leading and trailing garbage. | ||
* Also, the returned array is mutable, but should not be mutated! | ||
*/ | ||
@SuppressWarnings("unchecked") | ||
T[] getLeaf(int index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also private
now, but since it's accessed from iterator
, this way we can avoid the bridge
|
||
private static final long serialVersionUID = 1L; | ||
|
||
private static final BitMappedTrie<?> EMPTY = new BitMappedTrie<>(Arrays.empty(), 0, 0, 0); | ||
private static final BitMappedTrie<?> EMPTY_BMT = new BitMappedTrie<>(Arrays.empty(), 0, 0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I would it name it EMPTY
to be consistent with all other Javaslang types that have an EMPTY instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, now I can rename it back. Previously it was used from Vector
, which had its own EMPTY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, EMPTY
is better.
public static <T> Vector<T> narrow(Vector<? extends T> vector) { | ||
return (Vector<T>) vector; | ||
} | ||
public static <T> Vector<T> narrow(Vector<? extends T> vector) { return (Vector<T>) vector; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Methods should be no on-liners (for readability) It is ok
public Vector<Vector<T>> combinations() { | ||
return rangeClosed(0, length()).map(this::combinations).flatMap(Function.identity()); | ||
} | ||
public Vector<Vector<T>> combinations() { return rangeClosed(0, length()).map(this::combinations).flatMap(Function.identity()); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question :) Did you change your local formatter? Now we have many single-line methods. It does not really hurt but I think you may have a reason to reformat. Generally ok for me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I formatted these manually, the formatter should keep them intact (mine does).
I wanted the formatting to reflect how it would look in e.g. Scala
or in a Java 8 lambda: very simple methods in the same line, complex ones on multiple, to increase SNR.
@paplorinc
Many thanks for doing it, I really appreciate it! As you said - further optimizations can be applied later. Let's finalize this PR. I will still take a look. Maybe we need another iteration... Thx! |
(Note: I will focus now on this PR again - starting tomorrow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks awesome to me :)
} | ||
|
||
/* drop root node while it has a single element */ | ||
private static <T> BitMappedTrie<T> collapsed(Object[] array, int offset, int length, int shift) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is fine as is!
Hi @paplorinc, now I can't merge because of the conflicts. My fault - I waited too long... |
@ruslansennov, the build seems to be failing because of GWT incompatibility (https://travis-ci.org/javaslang/javaslang/builds/169480221), could you please help (I presume I need some annotations somewhere) |
Hi @paplorinc |
Ok, but how do I find out which method is causing the problem (I can't run the GWT tests locally)? |
Now I see that error log shows not enough info
Why? Just run |
@ruslansennov, it seems to be a GWT bug, i.e. apparently method references don't convert arrays to varargs: 5102b4d |
public void testCompileVector() { | ||
applyCollection(Vector::ofAll); | ||
applyCollection(chars -> Vector.ofAll(chars)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice finding!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ruslansennov, should we open a bug report for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least there are chances that it will be fixed near-time. Our GWT dependency is still an RC (release candidate).
Great, thank you!!! |
Optimizes #1504
Every commit introduces a new type of optimization, i.e.
ofAll
forVector
~150×
faster by building the tree from bottom upOptimizedprepend
andappend
forVector
- removed, as it hinders maintainability~6×
faster by adding thetrailing
andleading
arrays as depthless staging areas (i.e. only modify the tree if these two are full)drop
andtake
forVector
>6×
faster by keeping the leafs intact on drops, as the middle tree already has anoffset
and alength
that can mask themmap
forVector
and Optimizedfilter
forVector
>14×
faster by using the internalvisitor
withofAll
Benchmarks:
i.e. this PR made the following methods faster (for
32768
elements, compared toScala
)and memory usages:
Note: only very simple optimizations were kept, favoring maintainability over raw speed.
Note2: all collections store boxed, non-primitive values for now.