Build custom buffered input stream that can reuse tmp read buffers across indexing ops #302

franz1981 · 2023-03-16T17:01:13Z

Deloying a wildfly instance with many JPA models shows that buffered files reads allocate, by default, a 8K byte[] buffer per each indexed file, causing many out of TLAB allocations.

Ideally we could reuse such buffer across different indexed ops, saving zeroing and TLAB bandwidth that could be used elsewhere.

I'm assuming 2 things:

the same Indexer to be reused
Indexer is a single-threaded class that can just borrow/release the same buffer again and again without being bothered by concurrency issues

@Ladicek these are valid assumptions?

franz1981 · 2023-03-16T17:06:44Z

I'll add more commits addressing other allocation(s) hot paths, but probably these were happening because of this big byte[] allocations, leaving scarse bandwith for further allocations, ending up allocating in a new TLAB (or out of the TLAB).

I'll provided a JMH bench and/or an end 2 end results if required/possible

franz1981 · 2023-03-16T17:25:32Z

The relevant stack traces I've found are:

https://github.com/wildfly/wildfly-core/blob/main/server/src/main/java/org/jboss/as/server/deployment/annotation/ResourceRootIndexer.java#L84

that seems to match the assumption of the same Indexer reused over and over, but sadly...

https://github.com/hibernate/hibernate-orm/blob/6cf9d2d4801962b659eb4fb936a58abd461fbfc1/hibernate-core/src/main/java/org/hibernate/boot/archive/scan/spi/ClassFileArchiveEntryHandler.java#L64

nope.
I have to think if it worth to change how the Indexer is used on Hibernate (making it to be reused/shared somehow, if possible) and/or change the optimization by providing an indexer configuration with the initial buffer size (or the buffer itself, provided externally).
@Sanne any suggestion?

scottmarlow · 2023-03-16T18:04:50Z

I'd like to build jandex locally with this change to see if I can see improvements on deployment time of https://github.com/scottmarlow/tribe-krd-quarkus/tree/wildfly. I tried building with different JDK versions (jdk 1.6/1.8/11 but still get failure like:

Failed to execute goal net.revelc.code:impsort-maven-plugin:1.6.2:sort (sort-imports) on project jandex: Execution sort-imports of goal net.revelc.code:impsort-maven-plugin:1.6.2:sort failed: A required class was missing while executing net.revelc.code:impsort-maven-plugin:1.6.2:sort: org/codehaus/plexus/util/DirectoryScanner
[ERROR] -----------------------------------------------------
[ERROR] realm = plugin>net.revelc.code:impsort-maven-plugin:1.6.2--1736414650
[ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
[ERROR] urls[0] = file:/home/smarlow/.m2/repository/net/revelc/code/impsort-maven-plugin/1.6.2/impsort-maven-plugin-1.6.2.jar
[ERROR] urls[1] = file:/home/smarlow/.m2/repository/com/github/javaparser/javaparser-core/3.22.1/javaparser-core-3.22.1.jar
[ERROR] urls[2] = file:/home/smarlow/.m2/repository/com/google/guava/guava/30.1.1-jre/guava-30.1.1-jre.jar
[ERROR] urls[3] = file:/home/smarlow/.m2/repository/com/google/guava/failureaccess/1.0.1/failureaccess-1.0.1.jar
[ERROR] urls[4] = file:/home/smarlow/.m2/repository/com/google/guava/listenablefuture/9999.0-empty-to-avoid-conflict-with-guava/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
[ERROR] urls[5] = file:/home/smarlow/.m2/repository/com/google/code/findbugs/jsr305/3.0.2/jsr305-3.0.2.jar
[ERROR] urls[6] = file:/home/smarlow/.m2/repository/org/checkerframework/checker-qual/3.8.0/checker-qual-3.8.0.jar
[ERROR] urls[7] = file:/home/smarlow/.m2/repository/com/google/errorprone/error_prone_annotations/2.5.1/error_prone_annotations-2.5.1.jar
[ERROR] urls[8] = file:/home/smarlow/.m2/repository/com/google/j2objc/j2objc-annotations/1.3/j2objc-annotations-1.3.jar
[ERROR] Number of foreign imports: 1
[ERROR] import: Entry[import from realm ClassRealm[project>io.smallrye:jandex:3.0.6-SNAPSHOT, parent: ClassRealm[maven.api, parent: null]]]
[ERROR]

Any suggestions on how to avoid ^ failure? Thanks!

scottmarlow · 2023-03-16T18:30:55Z

Thanks @jamezp for pointing out that Maven 3.9 has ^ problem (3.8.1 works fine now). Also thanks for suggesting other solution of mvn clean install -Dversion.impsort.plugin=1.8.0

Ladicek · 2023-03-17T09:04:53Z

Yeah, I need to attend to #298 some day, this is pretty unfortunate :-/

When it comes to Indexer assumptions, those are correct: it is never supposed to be used concurrently, and it is supposed to be reused, though it doesn't have to. It is relatively common for a single Indexer to be used to index just one class. With Jandex 2.x, an Indexer returned a ClassInfo after indexing each class, but Jandex 3.x does some postprocessing during Indexer.complete(), so it doesn't return a ClassInfo for the just-indexed class anymore.

We could have an Indexer take an instance of some class that would manage those buffers I guess?

franz1981 · 2023-03-17T11:07:29Z

We could have an Indexer take an instance of some class that would manage those buffers I guess?

Today I am on PTO but yes, that would be the perfect deal - and would help the hibernate case as well.
I see there are others allocations but if we solve the bigger one, probably it won't be necessary to fix them too

Ladicek · 2023-03-17T11:27:16Z

Enjoy your PTO! :-) I'm on vacation most of next week (except Mon), but feel free to ping me the week after that.

And thanks for looking into this -- I didn't have much time (and I lack the experience 😆), so the only thing I noticed is that a lot of time is spent on java.io.DataInputStream parsing the modified UTF-8 encoding (which likely only exists because the JVM is written in a language with null-terminated strings...). I didn't even look at allocations.

core/src/main/java/org/jboss/jandex/Utils.java

franz1981 · 2023-03-20T11:03:07Z

@Ladicek Not a big fan of what's I've done here, but let me provide some context

74f33e9 allows users of Indexer to recycle the underlying read stream without being aware of the existing APIs and us to not rewrite a BufferedInputStream impl, although not complex, will still force us to maintain more code (adding more tests etc etc).

With this solution:

if Indexer is reused by user code (the wildfly use case): it will benefit from reusing buffered input streams (and the underlying buffer)
if Indexer is not reused (hibernate): it will be a duty of user code to snapshot it's memento, storing it somewhere, and providing it to any subsequent Indexer instance.

The reason why it will use Object is both to save users to mess up with it and to allow us, in near future (maybe in this PR) to expose a whole set of maps/byte[] in order to be reused again and again (saving more and more memory), wdyt?

Ladicek · 2023-03-20T11:18:46Z

Okay, this is pretty nice actually!

On the API front, I think we can hide the "memento object" by providing an IndexerFactory for the non-reused use case. It should be pretty easy for Hibernate to reuse the IndexerFactory, and then it doesn't matter that they don't reuse the Indexer.

If you're fine with the improvements now, I can take it from here and massage the API a little bit per above.

franz1981 · 2023-03-20T12:24:47Z

I would like to run few benchs and see if together with a modified version of hibernate I'm getting the expected speedup.
If yes, we can move this forward and addressing later other (minors) byte[] allocation(s) into the same memento (and decodeUTF8, that doesn't appear in the test with wildfly run by @scottmarlow but maybe you got more data points why you saw that, for quarkus, likely).

Ladicek · 2023-03-20T12:32:04Z

Okay, just let me know when you think this is ready. For my testing, I assembled a somewhat bigger JAR (160K classes, 250 MB) and indexing that shows decodeUtf8Entry as taking quite a lot of time. But I'm hardly an expert, I might have measured it wrong.

franz1981 · 2023-03-20T12:45:56Z

and indexing that shows decodeUtf8Entry as taking quite a lot of time

I believe the reason for that is due to the many intermediate obj (with their buffers) created to decode in what (very likely) is just a latin String: I think I'll create a fast path for this to cover the most common case, but in a different commit/pr, likely

franz1981 · 2023-03-20T16:03:15Z

I've further extended the pool concept, but is very tied to the way we use the tmp buffers for indexing, and some are not as safe as I wish
i.e. very tied to how processConstantPool fill data
while in others I won't reset the tmp buffers content, meaning that a misbehaving class could potentially cause read into something that would have thrown an out of bounds exception with non pooled buffers, but now can read "junk" data coming from a previous recycling.

The easier solution is to cleanup everything when borrowed again, but it will cause some perf hit (to be measured).
I'm paused working on other stuff, but feel free to add commits to make this one better from an API perspective, or let me know if there are other parts you want me to improve.

Ladicek · 2023-03-20T16:14:24Z

Thank you! As mentioned above, I'm on PTO starting tomorrow until end of week, so I'll get to this next week. I'd also like to learn more about performance work, so I'll be happy to help with this (once I'm back).

franz1981 · 2023-03-20T21:37:23Z

core/src/main/java/org/jboss/jandex/Indexer.java

        return new DataInputStream(new ByteArrayInputStream(pool, pos, len + 2)).readUTF();
    }

+    private static String tryDecodeAsciiEntry(byte[] chars, int pos, int len) {


@Ladicek
this has been benchmarked in undertow-io/undertow#1424 vs a simple copy loop vs different JDK versions - BUT both wildfly and quarkus maybe would run non warmed up code (i.e. C1 or interpreted) - and having a micro bench here was seems a good idea.

When Jdk 11 will be the main version we can use VarHandle to batch read from the byte[] in one go (that will be hell faster!)

franz1981 · 2023-03-20T21:55:48Z

@scottmarlow I should add few tests but this could be used already as it is, improving the wildfly use case at
https://github.com/wildfly/wildfly-core/blob/main/server/src/main/java/org/jboss/as/server/deployment/annotation/ResourceRootIndexer.java#L84

while

https://github.com/hibernate/hibernate-orm/blob/6cf9d2d4801962b659eb4fb936a58abd461fbfc1/hibernate-core/src/main/java/org/hibernate/boot/archive/scan/spi/ClassFileArchiveEntryHandler.java#L61-71

should have a new field Indexer.Factory indexerFactory in ClassFileArchiveEntryHandler that should perform

        private ClassDescriptor toClassDescriptor(ArchiveEntry entry) {
		try (InputStream inputStream = entry.getStreamAccess().accessInputStream()) {
                        // DELETED -> Indexer indexer = new Indexer();
                        // REPLACED WITH AN INDEXER REUSING BUFFERS
                        Indexer indexer = indexerFactory.newIndexer();
			ClassSummary classSummary = indexer.indexWithSummary( inputStream );
			Index index = indexer.complete();
			return toClassDescriptor( classSummary, index, entry );
		}
		catch (IOException e) {
			throw new ArchiveException( "Could not build ClassInfo", e );
		}
	}

@Ladicek can confirm that's the right way to do it or not...

@scottmarlow please remember that a IndexerFactory cannot be used in parallel by different threads i.e. is not thread-safe!

franz1981 · 2023-03-22T15:07:32Z

I've sadly added 2 Arrays::fill while borrowing 2 recycled arrays, but I'm sure this could be avoided by lazy checking if there's no usage of the borrowed array instances, saving precious cpu cycles.

In some subsequent commits I would like to remove all the field loads for the fields (by passing them as method parameters):

    private byte[] constantPool;
    private int[] constantPoolOffsets;
    private byte[] constantPoolAnnoAttrributes;

in many tiny methods, because it:

is super slow while interpreted because of field load
is super slow due to additional bound checks happening very time
it likely nullify the benefit of inlining these small methods using such fields

franz1981 · 2023-03-24T15:22:59Z

Good news! The allocation profiling data for this PR are very promising: the last addressable thing is

that I was hoping was addressed by reusing the same constantPool buffer again and again, and hoping the heuristic that enlarge it 20 times the required size to be enough were enough, but I was wrong.
Maybe worth thinking to a different heuristic here

franz1981 · 2023-03-27T04:48:19Z

The last versions of the changes (that include a different growing strategy that mimic what array list does) reports, for wildfly startup in a dummy project an allocation reductions of 1.2 GB from 3.9 GB i.e. new allocations are 2.68 GB and almost all the out of TLAB allocations are gone as well.

CPU-wise I think that if @Ladicek got something that can be turned into a micro/smoke perf test for this that heavily make use of decodeUtf8 we will see some gain there as well

franz1981 · 2023-03-27T09:33:30Z

I've prepared a microbenchmark that load a big jar (~25 MB) and place it into a tmpfs folder to avoid I/O to be a relevant factor here and I see that I'm getting a "small" improvement (mem-wise, huge, in loading speed, not "wow") i.e.

with the changes in the PR:

Benchmark                          (count)                                  (jarFile)  Mode  Cnt      Score   Error  Units
IndexerBenchmark.indexJars               1  /tmp/quarkus-tribe-krd-1.0.0-SNAPSHOT.jar    ss    2  21174.930          ms/op

before the changes in the PR:

Benchmark                          (count)                                  (jarFile)  Mode  Cnt      Score   Error  Units
IndexerBenchmark.indexJars               1  /tmp/quarkus-tribe-krd-1.0.0-SNAPSHOT.jar    ss    2  23488.840          ms/op

The microbenchmark is using SingleShot benchmark run, loading that single jar file, to mimic real world scenario (no hot code compiled at last tiered comp level etc etc) and still, appear disappointing.

What I've discovered so far is that:

beside the small load speed improvement (%10%), the actual cpu usage improvement is significant: we now use 87% of the cpu cycles while loading it faster
the test I've build is using Index of(File... files) that's not likely the best case to test the imrprovements made, given that it perform its own class file unzipping and file walking - meaning that it won't use the optimized/pooled buffered input stream provided on the shared Indexer
org/jboss/jandex/StrongInternPool is performing a chain of instanceof that breaking the profiler I'm using, likely due to Unknown Java Frame on JDK-8180450 reproducer async-profiler/async-profiler#673

Below the flamegraph of a benchmark run:

it shows the StrongInternPool usage as a top cpu cycles consumer, and some broken stack frames due to instanceof abuse in the left bottom

The suggestion would be to have separate factory method for intern pools hiding an enum field that can be used as n hint to always apply (per-type) a specific equals/hash method expecting the right type(s) upfront, hence saving type checks to discover it each time.

Ladicek · 2023-04-05T14:05:52Z

I'm closing this, because I just submitted #303, which uses most of the commits from here, except:

~~the commit "Modify sizeToFit by reusing the same enlarging strategy of ArrayList" is dropped as it doesn't bring measurable improvement, sometimes it even leads to slightly worse performance;~~
the Indexer.Factory is dropped, because it isn't needed; after calling Indexer.complete(), the Indexer instance is fresh as new and can be used to build a new, independent index -- so the thing that should be shared is not a factory, it's the indexer directly.

I also added a few commits of my own, created with a lot of @franz1981's help -- thanks!

Together, I've observed roughly 25% speedup on indexing a 256 MB JAR containing more than 160K classes. Decent improvement for now, and I need to focus on other things as well :-)

Ladicek reviewed Mar 17, 2023

View reviewed changes

core/src/main/java/org/jboss/jandex/Utils.java Outdated Show resolved Hide resolved

franz1981 force-pushed the reduce_allocations branch from 9ee451e to ba1bea2 Compare March 20, 2023 15:59

franz1981 commented Mar 20, 2023

View reviewed changes

franz1981 force-pushed the reduce_allocations branch 2 times, most recently from 9e8f0ae to aef0444 Compare March 20, 2023 21:50

franz1981 marked this pull request as ready for review March 20, 2023 21:51

franz1981 force-pushed the reduce_allocations branch 2 times, most recently from 1bd32fc to 3971a5f Compare March 22, 2023 15:04

franz1981 force-pushed the reduce_allocations branch from 3971a5f to 3a94312 Compare March 22, 2023 15:19

franz1981 added 2 commits March 23, 2023 16:09

Adding Indexer's Factory to favour reusable buffers pools

8409d45

Adding fast-path ASCII decoding

2dec148

franz1981 force-pushed the reduce_allocations branch from 3a94312 to 2dec148 Compare March 23, 2023 15:09

Modify sizeToFit by reusing the same enlarging strategy of ArrayList

7b01188

franz1981 force-pushed the reduce_allocations branch from ed54c4c to 7b01188 Compare March 25, 2023 21:42

Saving wrapping buffered streams twice

469e447

franz1981 force-pushed the reduce_allocations branch from 1ecf650 to 469e447 Compare March 25, 2023 22:31

Ladicek mentioned this pull request Apr 5, 2023

performance improvements #303

Merged

Ladicek closed this Apr 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Build custom buffered input stream that can reuse tmp read buffers across indexing ops #302

Build custom buffered input stream that can reuse tmp read buffers across indexing ops #302

franz1981 commented Mar 16, 2023 •

edited

franz1981 commented Mar 16, 2023

franz1981 commented Mar 16, 2023

scottmarlow commented Mar 16, 2023 •

edited

scottmarlow commented Mar 16, 2023

Ladicek commented Mar 17, 2023

franz1981 commented Mar 17, 2023

Ladicek commented Mar 17, 2023

franz1981 commented Mar 20, 2023

Ladicek commented Mar 20, 2023

franz1981 commented Mar 20, 2023 •

edited

Ladicek commented Mar 20, 2023

franz1981 commented Mar 20, 2023

franz1981 commented Mar 20, 2023 •

edited

Ladicek commented Mar 20, 2023

franz1981 Mar 20, 2023 •

edited

franz1981 Mar 21, 2023 •

edited

franz1981 commented Mar 20, 2023

franz1981 commented Mar 22, 2023 •

edited

franz1981 commented Mar 24, 2023

franz1981 commented Mar 27, 2023

franz1981 commented Mar 27, 2023 •

edited

Ladicek commented Apr 5, 2023 •

edited

Build custom buffered input stream that can reuse tmp read buffers across indexing ops #302

Build custom buffered input stream that can reuse tmp read buffers across indexing ops #302

Conversation

franz1981 commented Mar 16, 2023 • edited

franz1981 commented Mar 16, 2023

franz1981 commented Mar 16, 2023

scottmarlow commented Mar 16, 2023 • edited

scottmarlow commented Mar 16, 2023

Ladicek commented Mar 17, 2023

franz1981 commented Mar 17, 2023

Ladicek commented Mar 17, 2023

franz1981 commented Mar 20, 2023

Ladicek commented Mar 20, 2023

franz1981 commented Mar 20, 2023 • edited

Ladicek commented Mar 20, 2023

franz1981 commented Mar 20, 2023

franz1981 commented Mar 20, 2023 • edited

Ladicek commented Mar 20, 2023

franz1981 Mar 20, 2023 • edited

Choose a reason for hiding this comment

franz1981 Mar 21, 2023 • edited

Choose a reason for hiding this comment

franz1981 commented Mar 20, 2023

franz1981 commented Mar 22, 2023 • edited

franz1981 commented Mar 24, 2023

franz1981 commented Mar 27, 2023

franz1981 commented Mar 27, 2023 • edited

Ladicek commented Apr 5, 2023 • edited

franz1981 commented Mar 16, 2023 •

edited

scottmarlow commented Mar 16, 2023 •

edited

franz1981 commented Mar 20, 2023 •

edited

franz1981 commented Mar 20, 2023 •

edited

franz1981 Mar 20, 2023 •

edited

franz1981 Mar 21, 2023 •

edited

franz1981 commented Mar 22, 2023 •

edited

franz1981 commented Mar 27, 2023 •

edited

Ladicek commented Apr 5, 2023 •

edited