Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build custom buffered input stream that can reuse tmp read buffers across indexing ops #302

Closed
wants to merge 4 commits into from

Conversation

franz1981
Copy link
Contributor

@franz1981 franz1981 commented Mar 16, 2023

Deloying a wildfly instance with many JPA models shows that buffered files reads allocate, by default, a 8K byte[] buffer per each indexed file, causing many out of TLAB allocations.

Ideally we could reuse such buffer across different indexed ops, saving zeroing and TLAB bandwidth that could be used elsewhere.

I'm assuming 2 things:

  • the same Indexer to be reused
  • Indexer is a single-threaded class that can just borrow/release the same buffer again and again without being bothered by concurrency issues

@Ladicek these are valid assumptions?

@franz1981
Copy link
Contributor Author

I'll add more commits addressing other allocation(s) hot paths, but probably these were happening because of this big byte[] allocations, leaving scarse bandwith for further allocations, ending up allocating in a new TLAB (or out of the TLAB).

I'll provided a JMH bench and/or an end 2 end results if required/possible

@franz1981
Copy link
Contributor Author

The relevant stack traces I've found are:

that seems to match the assumption of the same Indexer reused over and over, but sadly...

https://github.com/hibernate/hibernate-orm/blob/6cf9d2d4801962b659eb4fb936a58abd461fbfc1/hibernate-core/src/main/java/org/hibernate/boot/archive/scan/spi/ClassFileArchiveEntryHandler.java#L64

nope.
I have to think if it worth to change how the Indexer is used on Hibernate (making it to be reused/shared somehow, if possible) and/or change the optimization by providing an indexer configuration with the initial buffer size (or the buffer itself, provided externally).
@Sanne any suggestion?

@scottmarlow
Copy link

scottmarlow commented Mar 16, 2023

I'd like to build jandex locally with this change to see if I can see improvements on deployment time of https://github.com/scottmarlow/tribe-krd-quarkus/tree/wildfly. I tried building with different JDK versions (jdk 1.6/1.8/11 but still get failure like:

Failed to execute goal net.revelc.code:impsort-maven-plugin:1.6.2:sort (sort-imports) on project jandex: Execution sort-imports of goal net.revelc.code:impsort-maven-plugin:1.6.2:sort failed: A required class was missing while executing net.revelc.code:impsort-maven-plugin:1.6.2:sort: org/codehaus/plexus/util/DirectoryScanner
[ERROR] -----------------------------------------------------
[ERROR] realm = plugin>net.revelc.code:impsort-maven-plugin:1.6.2--1736414650
[ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
[ERROR] urls[0] = file:/home/smarlow/.m2/repository/net/revelc/code/impsort-maven-plugin/1.6.2/impsort-maven-plugin-1.6.2.jar
[ERROR] urls[1] = file:/home/smarlow/.m2/repository/com/github/javaparser/javaparser-core/3.22.1/javaparser-core-3.22.1.jar
[ERROR] urls[2] = file:/home/smarlow/.m2/repository/com/google/guava/guava/30.1.1-jre/guava-30.1.1-jre.jar
[ERROR] urls[3] = file:/home/smarlow/.m2/repository/com/google/guava/failureaccess/1.0.1/failureaccess-1.0.1.jar
[ERROR] urls[4] = file:/home/smarlow/.m2/repository/com/google/guava/listenablefuture/9999.0-empty-to-avoid-conflict-with-guava/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
[ERROR] urls[5] = file:/home/smarlow/.m2/repository/com/google/code/findbugs/jsr305/3.0.2/jsr305-3.0.2.jar
[ERROR] urls[6] = file:/home/smarlow/.m2/repository/org/checkerframework/checker-qual/3.8.0/checker-qual-3.8.0.jar
[ERROR] urls[7] = file:/home/smarlow/.m2/repository/com/google/errorprone/error_prone_annotations/2.5.1/error_prone_annotations-2.5.1.jar
[ERROR] urls[8] = file:/home/smarlow/.m2/repository/com/google/j2objc/j2objc-annotations/1.3/j2objc-annotations-1.3.jar
[ERROR] Number of foreign imports: 1
[ERROR] import: Entry[import from realm ClassRealm[project>io.smallrye:jandex:3.0.6-SNAPSHOT, parent: ClassRealm[maven.api, parent: null]]]
[ERROR]

Any suggestions on how to avoid ^ failure? Thanks!

@scottmarlow
Copy link

Thanks @jamezp for pointing out that Maven 3.9 has ^ problem (3.8.1 works fine now). Also thanks for suggesting other solution of mvn clean install -Dversion.impsort.plugin=1.8.0

@Ladicek
Copy link
Collaborator

Ladicek commented Mar 17, 2023

Yeah, I need to attend to #298 some day, this is pretty unfortunate :-/

When it comes to Indexer assumptions, those are correct: it is never supposed to be used concurrently, and it is supposed to be reused, though it doesn't have to. It is relatively common for a single Indexer to be used to index just one class. With Jandex 2.x, an Indexer returned a ClassInfo after indexing each class, but Jandex 3.x does some postprocessing during Indexer.complete(), so it doesn't return a ClassInfo for the just-indexed class anymore.

We could have an Indexer take an instance of some class that would manage those buffers I guess?

@franz1981
Copy link
Contributor Author

We could have an Indexer take an instance of some class that would manage those buffers I guess?

Today I am on PTO but yes, that would be the perfect deal - and would help the hibernate case as well.
I see there are others allocations but if we solve the bigger one, probably it won't be necessary to fix them too

@Ladicek
Copy link
Collaborator

Ladicek commented Mar 17, 2023

Enjoy your PTO! :-) I'm on vacation most of next week (except Mon), but feel free to ping me the week after that.

And thanks for looking into this -- I didn't have much time (and I lack the experience 😆), so the only thing I noticed is that a lot of time is spent on java.io.DataInputStream parsing the modified UTF-8 encoding (which likely only exists because the JVM is written in a language with null-terminated strings...). I didn't even look at allocations.

@franz1981
Copy link
Contributor Author

@Ladicek Not a big fan of what's I've done here, but let me provide some context

74f33e9 allows users of Indexer to recycle the underlying read stream without being aware of the existing APIs and us to not rewrite a BufferedInputStream impl, although not complex, will still force us to maintain more code (adding more tests etc etc).

With this solution:

  • if Indexer is reused by user code (the wildfly use case): it will benefit from reusing buffered input streams (and the underlying buffer)
  • if Indexer is not reused (hibernate): it will be a duty of user code to snapshot it's memento, storing it somewhere, and providing it to any subsequent Indexer instance.

The reason why it will use Object is both to save users to mess up with it and to allow us, in near future (maybe in this PR) to expose a whole set of maps/byte[] in order to be reused again and again (saving more and more memory), wdyt?

@Ladicek
Copy link
Collaborator

Ladicek commented Mar 20, 2023

Okay, this is pretty nice actually!

On the API front, I think we can hide the "memento object" by providing an IndexerFactory for the non-reused use case. It should be pretty easy for Hibernate to reuse the IndexerFactory, and then it doesn't matter that they don't reuse the Indexer.

If you're fine with the improvements now, I can take it from here and massage the API a little bit per above.

@franz1981
Copy link
Contributor Author

franz1981 commented Mar 20, 2023

I would like to run few benchs and see if together with a modified version of hibernate I'm getting the expected speedup.
If yes, we can move this forward and addressing later other (minors) byte[] allocation(s) into the same memento (and decodeUTF8, that doesn't appear in the test with wildfly run by @scottmarlow but maybe you got more data points why you saw that, for quarkus, likely).

@Ladicek
Copy link
Collaborator

Ladicek commented Mar 20, 2023

Okay, just let me know when you think this is ready. For my testing, I assembled a somewhat bigger JAR (160K classes, 250 MB) and indexing that shows decodeUtf8Entry as taking quite a lot of time. But I'm hardly an expert, I might have measured it wrong.

@franz1981
Copy link
Contributor Author

and indexing that shows decodeUtf8Entry as taking quite a lot of time

I believe the reason for that is due to the many intermediate obj (with their buffers) created to decode in what (very likely) is just a latin String: I think I'll create a fast path for this to cover the most common case, but in a different commit/pr, likely

@franz1981
Copy link
Contributor Author

franz1981 commented Mar 20, 2023

I've further extended the pool concept, but is very tied to the way we use the tmp buffers for indexing, and some are not as safe as I wish
i.e. very tied to how processConstantPool fill data
while in others I won't reset the tmp buffers content, meaning that a misbehaving class could potentially cause read into something that would have thrown an out of bounds exception with non pooled buffers, but now can read "junk" data coming from a previous recycling.

The easier solution is to cleanup everything when borrowed again, but it will cause some perf hit (to be measured).
I'm paused working on other stuff, but feel free to add commits to make this one better from an API perspective, or let me know if there are other parts you want me to improve.

@Ladicek
Copy link
Collaborator

Ladicek commented Mar 20, 2023

Thank you! As mentioned above, I'm on PTO starting tomorrow until end of week, so I'll get to this next week. I'd also like to learn more about performance work, so I'll be happy to help with this (once I'm back).

return new DataInputStream(new ByteArrayInputStream(pool, pos, len + 2)).readUTF();
}

private static String tryDecodeAsciiEntry(byte[] chars, int pos, int len) {
Copy link
Contributor Author

@franz1981 franz1981 Mar 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ladicek
this has been benchmarked in undertow-io/undertow#1424 vs a simple copy loop vs different JDK versions - BUT both wildfly and quarkus maybe would run non warmed up code (i.e. C1 or interpreted) - and having a micro bench here was seems a good idea.

Copy link
Contributor Author

@franz1981 franz1981 Mar 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Jdk 11 will be the main version we can use VarHandle to batch read from the byte[] in one go (that will be hell faster!)

@franz1981 franz1981 force-pushed the reduce_allocations branch 2 times, most recently from 9e8f0ae to aef0444 Compare March 20, 2023 21:50
@franz1981 franz1981 marked this pull request as ready for review March 20, 2023 21:51
@franz1981
Copy link
Contributor Author

@scottmarlow I should add few tests but this could be used already as it is, improving the wildfly use case at
https://github.com/wildfly/wildfly-core/blob/main/server/src/main/java/org/jboss/as/server/deployment/annotation/ResourceRootIndexer.java#L84

while

https://github.com/hibernate/hibernate-orm/blob/6cf9d2d4801962b659eb4fb936a58abd461fbfc1/hibernate-core/src/main/java/org/hibernate/boot/archive/scan/spi/ClassFileArchiveEntryHandler.java#L61-71

should have a new field Indexer.Factory indexerFactory in ClassFileArchiveEntryHandler that should perform

        private ClassDescriptor toClassDescriptor(ArchiveEntry entry) {
		try (InputStream inputStream = entry.getStreamAccess().accessInputStream()) {
                        // DELETED -> Indexer indexer = new Indexer();
                        // REPLACED WITH AN INDEXER REUSING BUFFERS
                        Indexer indexer = indexerFactory.newIndexer();
			ClassSummary classSummary = indexer.indexWithSummary( inputStream );
			Index index = indexer.complete();
			return toClassDescriptor( classSummary, index, entry );
		}
		catch (IOException e) {
			throw new ArchiveException( "Could not build ClassInfo", e );
		}
	}

@Ladicek can confirm that's the right way to do it or not...

@scottmarlow please remember that a IndexerFactory cannot be used in parallel by different threads i.e. is not thread-safe!

@franz1981 franz1981 force-pushed the reduce_allocations branch 2 times, most recently from 1bd32fc to 3971a5f Compare March 22, 2023 15:04
@franz1981
Copy link
Contributor Author

franz1981 commented Mar 22, 2023

I've sadly added 2 Arrays::fill while borrowing 2 recycled arrays, but I'm sure this could be avoided by lazy checking if there's no usage of the borrowed array instances, saving precious cpu cycles.

In some subsequent commits I would like to remove all the field loads for the fields (by passing them as method parameters):

    private byte[] constantPool;
    private int[] constantPoolOffsets;
    private byte[] constantPoolAnnoAttrributes;

in many tiny methods, because it:

  • is super slow while interpreted because of field load
  • is super slow due to additional bound checks happening very time
  • it likely nullify the benefit of inlining these small methods using such fields

@franz1981
Copy link
Contributor Author

Good news! The allocation profiling data for this PR are very promising: the last addressable thing is

image

that I was hoping was addressed by reusing the same constantPool buffer again and again, and hoping the heuristic that enlarge it 20 times the required size to be enough were enough, but I was wrong.
Maybe worth thinking to a different heuristic here

@franz1981
Copy link
Contributor Author

The last versions of the changes (that include a different growing strategy that mimic what array list does) reports, for wildfly startup in a dummy project an allocation reductions of 1.2 GB from 3.9 GB i.e. new allocations are 2.68 GB and almost all the out of TLAB allocations are gone as well.

CPU-wise I think that if @Ladicek got something that can be turned into a micro/smoke perf test for this that heavily make use of decodeUtf8 we will see some gain there as well

@franz1981
Copy link
Contributor Author

franz1981 commented Mar 27, 2023

I've prepared a microbenchmark that load a big jar (~25 MB) and place it into a tmpfs folder to avoid I/O to be a relevant factor here and I see that I'm getting a "small" improvement (mem-wise, huge, in loading speed, not "wow") i.e.

with the changes in the PR:

Benchmark                          (count)                                  (jarFile)  Mode  Cnt      Score   Error  Units
IndexerBenchmark.indexJars               1  /tmp/quarkus-tribe-krd-1.0.0-SNAPSHOT.jar    ss    2  21174.930          ms/op

before the changes in the PR:

Benchmark                          (count)                                  (jarFile)  Mode  Cnt      Score   Error  Units
IndexerBenchmark.indexJars               1  /tmp/quarkus-tribe-krd-1.0.0-SNAPSHOT.jar    ss    2  23488.840          ms/op

The microbenchmark is using SingleShot benchmark run, loading that single jar file, to mimic real world scenario (no hot code compiled at last tiered comp level etc etc) and still, appear disappointing.

What I've discovered so far is that:

  • beside the small load speed improvement (%10%), the actual cpu usage improvement is significant: we now use 87% of the cpu cycles while loading it faster
  • the test I've build is using Index of(File... files) that's not likely the best case to test the imrprovements made, given that it perform its own class file unzipping and file walking - meaning that it won't use the optimized/pooled buffered input stream provided on the shared Indexer
  • org/jboss/jandex/StrongInternPool is performing a chain of instanceof that breaking the profiler I'm using, likely due to Unknown Java Frame on JDK-8180450 reproducer async-profiler/async-profiler#673

Below the flamegraph of a benchmark run:
image

it shows the StrongInternPool usage as a top cpu cycles consumer, and some broken stack frames due to instanceof abuse in the left bottom

The suggestion would be to have separate factory method for intern pools hiding an enum field that can be used as n hint to always apply (per-type) a specific equals/hash method expecting the right type(s) upfront, hence saving type checks to discover it each time.

@Ladicek Ladicek mentioned this pull request Apr 5, 2023
@Ladicek
Copy link
Collaborator

Ladicek commented Apr 5, 2023

I'm closing this, because I just submitted #303, which uses most of the commits from here, except:

  • the commit "Modify sizeToFit by reusing the same enlarging strategy of ArrayList" is dropped as it doesn't bring measurable improvement, sometimes it even leads to slightly worse performance;
  • the Indexer.Factory is dropped, because it isn't needed; after calling Indexer.complete(), the Indexer instance is fresh as new and can be used to build a new, independent index -- so the thing that should be shared is not a factory, it's the indexer directly.

I also added a few commits of my own, created with a lot of @franz1981's help -- thanks!

Together, I've observed roughly 25% speedup on indexing a 256 MB JAR containing more than 160K classes. Decent improvement for now, and I need to focus on other things as well :-)

@Ladicek Ladicek closed this Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants