Performance improvements under Thymeleaf 3 #102

Closed · 5 tasks done
ultraq opened this issue May 2, 2016 · 9 comments

ultraq commented May 2, 2016

I'm currently implementing the layout dialect in Thymeleaf 3 using 'model processors', which makes the transition to the new event-based architecture a lot easier for me, but may cause developers to lose a lot of the performance benefits of said architecture.

So yeah, I'm concerned about performance. Some things to try that should help:

  • Create a benchmark suite to make sure further changes are headed in the right direction
    Currently using a very basic web app - should think about publishing it to GitHub or creating some other microbenchmarks as suggested in the comments below
  • Re-implement processors as tag processors
    This was done for FragmentProcessor and doesn't seem to work anywhere else 😢 The reason is that every other processor needs to peek at its model to perform optimally; otherwise I'm going to do more template fetches, which have already been shown to be slow, particularly in a dev environment where caching is turned off, and even more so when the template resolver involves the network!
  • Create a utility or IModel extension that can extract models from the events within itself (I currently always go to the ModelFinder class, which uses the template manager to extract models, which is a very roundabout way of obtaining a sub-model - see the sketch after this list)
  • Have decorate and merge interfaces and implementations return a new decorated/merged object, instead of modifying the first argument (the target of the operation), so as to keep in line with Thymeleaf's architecture of immutability (goes hand-in-hand with implementing things as tag processors, which must return changes as new models)
    This actually adds a few more memory allocations, so a barely noticeable increase in memory usage. All GC-able, so still good.
  • layout:fragment being run unnecessarily in certain cases, e.g. captured as a fragment for decoration/inclusion which then ends up in the resulting template to process. See the TitlePattern-DynamicContent.thtest for an example.
    Doesn't seem to happen any more?
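
As a rough idea of what the third bullet point above is getting at, here's a minimal sketch of an extension-style method that pulls a sub-model out of an existing model by balancing open/close tag events, instead of going through ModelFinder and the template manager. The method name, parameters, and signature are illustrative only, not the dialect's actual API:

import org.thymeleaf.model.ICloseElementTag
import org.thymeleaf.model.IModel
import org.thymeleaf.model.IModelFactory
import org.thymeleaf.model.IOpenElementTag

// Sketch only: starting from an element's opening (or standalone) tag event, copy
// events into a new model until the matching close tag brings the nesting depth
// back down to zero.
static IModel getModel(IModel self, IModelFactory modelFactory, int startIndex) {
    def subModel = modelFactory.createModel()
    def depth = 0
    for (def i = startIndex; i < self.size(); i++) {
        def event = self.get(i)
        subModel.add(event)
        if (event instanceof IOpenElementTag) {
            depth++
        }
        else if (event instanceof ICloseElementTag) {
            depth--
        }
        if (depth == 0) {
            break
        }
    }
    return subModel
}
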
ultraq self-assigned this May 2, 2016
ultraq added this to the Thymeleaf Layout Dialect 2.0 milestone May 2, 2016
ultraq changed the title from "Create benchmark suite" to "Performance improvements under Thymeleaf 3" May 6, 2016

ultraq commented May 7, 2016

These can probably wait until after the 2.0.0 release if I don't want to hold that up.

ultraq removed this from the Thymeleaf Layout Dialect 2.0 milestone May 18, 2016

ultraq commented May 24, 2016

See #108 for a performance issue from loading templates too often (made worse by the fact that templates are pulled over the internet). That ties in to the 3rd bullet point above.


ultraq commented May 24, 2016

Something else that's come up in talks with Daniel: performing model equality checks might be a bit of a hit, particularly for text nodes, which will need to resolve their underlying CharSequence in memory.
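
To illustrate the concern, here's a minimal sketch of the kind of comparison involved (the method name is made up for this example and isn't actual layout dialect code) - comparing two text events means each one has to resolve its full character data first:

import org.thymeleaf.model.IText

// Sketch only: each getText() call forces the event's underlying character data
// to be resolved into a String before the comparison can take place.
static boolean textEventsEqual(IText text1, IText text2) {
    return text1.text == text2.text
}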

Antibrumm commented

If you'd like to implement benchmarks with the help of JMH, I have a benchmark profile implemented in one of my GitHub projects.

It works pretty well for getting an overview and was fairly easy to get running. The project is small so it could be good for some hints :) Take a look at the pom.xml:

https://github.com/Antibrumm/jackson-antpathfilter/blob/master/pom.xml


ultraq commented Jul 9, 2016

Hey @Antibrumm, sorry to take so long getting back to you, but just getting the layout dialect updated for Thymeleaf 3 has been keeping me busy!

Anyway, I don't know anything about JMH (I had to Google it and do some reading), but how has it been helping you and your project?


ultraq commented Jul 14, 2016

See some of the comments on the thread about a potential memory leak (#122): ExpandoMetaClass came up as a memory dominator (due to my reliance on Groovy's metaprogramming features to make life much simpler for me), so it might be good to cut back on them.


Antibrumm commented Jul 19, 2016

I have not used it in an extreme way (yet), as the library is not meant to be a high-throughput piece of work at the moment. But for playing around with different optimization possibilities I found JMH quite helpful, as it's very easy to use. Similar to JUnit, all you have to do is place some annotations like in the code piece below, and JMH takes care of the warmup phase and all the other stuff you easily forget when you write benchmarks yourself. At least when you're not an expert :)

In my case I write JSON dynamically based on the include/exclude patterns given to the Jackson2Helper. I wanted to measure the impact of serialization using wildcards, explicit path declarations, and the standard ObjectMapper.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

// Jackson2Helper and User are classes from the jackson-antpathfilter project
// (their imports are omitted here)

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@Threads(5)
public class ComparisonBenchmark {

    private final Jackson2Helper jackson2Helper = new Jackson2Helper();
    private final List<User> users = new ArrayList<User>();
    private ObjectMapper objectMapper = new ObjectMapper();

    public ComparisonBenchmark() {
        for (int i = 0; i < 10; i++) {
            users.add(User.buildRecursive(100));
        }
    }

    @Benchmark
    public void measureFiltered() {
        jackson2Helper.writeFiltered(users,
                "*", "**.manager.*", "**.address.streetName", "**.reports.lastName");
    }

    @Benchmark
    public void measureSpecificFilters() {
        jackson2Helper.writeFiltered(users,
                "firstName", "lastName", "email", "manager.firstName", "manager.lastName");
    }

    @Benchmark
    public void measureStandard() throws JsonProcessingException {
        objectMapper.writer().writeValueAsString(users);
    }
}

And here is an example of the output it generates:

[INFO] --- exec-maven-plugin:1.4.0:exec (run-benchmarks) @ jackson-antpathfilter ---
# JMH 1.11.1 (released 298 days ago, please consider updating!)
# VM version: JDK 1.8.0_40, VM 25.40-b25
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 10 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 5 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: ch.mfrey.jackson.antpathfilter.test.ComparisonBenchmark.measureFiltered

# Run progress: 0.00% complete, ETA 00:06:00
# Fork: 1 of 3
# Warmup Iteration   1: 331.001 ±(99.9%) 82.782 ms/op
....
# Warmup Iteration  10: 249.697 ±(99.9%) 21.749 ms/op
Iteration   1: 248.342 ±(99.9%) 12.490 ms/op
...
Iteration  10: 256.905 ±(99.9%) 26.289 ms/op

# Run progress: 5.56% complete, ETA 00:08:04
# Fork: 2 of 3
# Warmup Iteration   1: 353.631 ±(99.9%) 124.372 ms/op
...
# Warmup Iteration  10: 267.917 ±(99.9%) 26.909 ms/op
Iteration   1: 268.767 ±(99.9%) 12.481 ms/op
...
Iteration  10: 271.785 ±(99.9%) 31.168 ms/op

# Run progress: 11.11% complete, ETA 00:07:26
# Fork: 3 of 3
# Warmup Iteration   1: 337.758 ±(99.9%) 128.091 ms/op
...
# Warmup Iteration  10: 268.730 ±(99.9%) 40.317 ms/op
Iteration   1: 277.757 ±(99.9%) 15.678 ms/op
...
Iteration  10: 275.310 ±(99.9%) 27.292 ms/op


Result "measureFiltered":
  267.000 ±(99.9%) 7.360 ms/op [Average]
  (min, avg, max) = (247.731, 267.000, 284.497), stdev = 11.016
  CI (99.9%): [259.640, 274.360] (assumes normal distribution)


ultraq commented Oct 17, 2016

Have removed more uses of dynamic objects/classes as a follow-on from #122, taken to the extreme, and have found ridiculous performance gains O_o Was gonna post some of them here, but someone just raised this exact issue, so I'll post more there.


ultraq commented Nov 6, 2016

Final state of performance improvements between 2.0 and 2.1

I've uploaded the very basic web app that I've been using for all of my memory tests since the first report of a memory leak in 2.0.0. The GitHub repo of that project can be found here: https://github.com/ultraq/thymeleaf-layout-dialect-benchmark

It does very little, using only layout dialect features so that other things that would normally be part of an application don't impact the tests. It does use Spring MVC, however, which might be unnecessary overhead, but I have enough faith in the Spring project's robustness that I don't think it'll affect the results too badly.

That web app is started with the YourKit profiling agent attached, and object allocation profiling is also enabled. The web app is then stress tested with a simple JMeter test plan (included in the repo) that simulates concurrent users and load to exaggerate any problems in the layout dialect. At the end of the test, a forced GC is done and a memory snapshot is taken to see how the app is at rest.

Thymeleaf Layout Dialect 2.0.4

[screenshot: memory usage, 2.0.4]

Main takeaways:

  • The JMeter test took about 3 minutes to complete (started around the 30 second mark), with requests taking an average of 1.674 seconds each

  • Old generation space at 99MB

  • 35 garbage collections

  • 27 million object allocations

  • 4 seconds spent in GC

  • Several items taking over 10MB of retained memory (none of them appearing as dominators however, so they're potentially GC-able, but they don't seem to have been collected)

    [screenshot]

  • Majority of the object allocations taking place in the IModelExtensions.findModel closure, which uses Groovy's dynamic metaclass creation

    [screenshot]

Thymeleaf Layout Dialect 2.1.0-SNAPSHOT (the latest snapshot of what will become 2.1.0)

[screenshot: memory usage, 2.1.0-SNAPSHOT]

Differences:

  • The JMeter test took about 1 minute to complete (also started around the 30 second mark), with requests taking an average of 452ms to complete (at least 3x faster)

  • Old generation space at 22MB (memory footprint 1/5th the size)

  • 21 garbage collections (40% fewer GCs)

  • 1 second spent in GC (75% less time spent in GC)

  • Only 1 item taking over 10MB of retained memory (dominator profile looking mostly the same however)

    [screenshot]

  • Majority of the object allocations no longer in a Groovy dynamic meta class method, but in one of Thymeleaf's utility projects, unbescape

    [screenshot]

Changes made and lessons learned

The change that had the biggest impact on the performance profile of the layout dialect was the removal of Groovy's dynamic metaclass creation. Here's a line representative of what that looked like (the identifiers below are illustrative, not the dialect's actual code):
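
event.metaClass.startIndex = index   // attach a startIndex property to this one object, via a per-instance ExpandoMetaClass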

What that line did was add a property, startIndex, to the object, which I would use later on in the dialect to know exactly where a model started. A similar property was added for where the model ended. This was done because Thymeleaf 3 did away with DOM nodes, making it much harder for dialects to track the things that make up an "element"; instead, they rely on queues of what Thymeleaf calls "events" (text, tags, comments, anything that you write into a template), with no clear demarcation of elements.

This was a huge convenience for the code but, as shown in the object allocation profiling above, something about it incurred a massive cost in memory. This is Groovy's under-the-hood code, so I don't know exactly what goes on in there to provide programmers this convenience.

So I removed lines like that throughout the layout dialect, and implemented an additional step for calling code to search for the start/end of a model after the model was received. Theoretically this should have been slower, because it's an additional O(n) lookup on top of code that already did an O(n) lookup to retrieve the model in the first place, but in practice it beat out the dynamic metaclass allocation.
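
A rough sketch of that extra step (the method name and signature are made up for this example, not the dialect's actual API) - calling code now does a linear scan over the parent model's events to find the one it's interested in, rather than reading a startIndex property off the model:

import org.thymeleaf.model.IModel

// Sketch only: return the index of the first event in the model that satisfies
// the given condition, or -1 if no event matches.
static int findIndexOf(IModel self, Closure<Boolean> matcher) {
    for (def i = 0; i < self.size(); i++) {
        if (matcher(self.get(i))) {
            return i
        }
    }
    return -1
}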

Lesson learned: be careful/sparing with dynamic metaclasses. The convenience they provide is a boon for programmers, but in a performance-critical part of the code their benefits may not outweigh the costs.

Notes, caveats, and final words

These numbers are specific to the benchmark that I ran them on, so don't expect to see improvements of a similar scale in your own app. What it does mean is that the layout dialect should now be even less of a drain on your own app's memory and CPU profiles, letting you focus on the performance of your app rather than the performance of your libraries.

If you continue to experience performance problems though, feel free to raise an issue, but please also provide memory profiles if at all possible. I've actually quite enjoyed digging into and fixing up these things, as I learn a lot in the process.
