Performance improvements under Thymeleaf 3 #102

Closed · 5 tasks done
ultraq opened this issue May 2, 2016 · 9 comments

ultraq commented May 2, 2016

I'm currently implementing the layout dialect in Thymeleaf 3 using 'model processors', which makes the transition to the new event-based architecture a lot easier for me, but may cause developers to lose a lot of the performance benefits of said architecture.

So yeah, I'm concerned about performance. Some things to try that should help:

  • Create a benchmark suite to make sure further changes are headed in the right direction
    Currently using a very basic web app - should think about publishing it to GitHub or creating some other microbenchmarks as suggested in the comments below
  • Re-implement processors as tag processors
    This was done for FragmentProcessor and doesn't seem to work anywhere else 😢 The reason is that every other processor needs to peek at its model to perform optimally; otherwise I'm going to do more template fetches, which have already been shown to be slow, particularly in a dev environment where caching is turned off, and even more so when the template resolver involves the network!
  • Create a utility or IModel extension that can extract models from the events within itself (I currently always go to the ModelFinder class, which uses the template manager to extract models, which is a very roundabout way of obtaining a sub-model - see the sketch after this list)
  • Have decorate and merge interfaces and implementations return a new decorated/merged object, instead of modifying the first argument (the target of the operation), so as to keep in line with Thymeleaf's architecture of immutability (goes hand-in-hand with implementing things as tag processors, which must return changes as new models)
    This actually adds a few more memory allocations, so a barely noticeable increase in memory usage. All GC-able, so still good.
  • layout:fragment being run unnecessarily in certain cases, e.g. captured as a fragment for decoration/inclusion which then ends up in the resulting template to process. See the TitlePattern-DynamicContent.thtest for an example.
    Doesn't seem to happen any more?
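
As a rough idea of what the third bullet point above is getting at, here's a minimal sketch of an extension-style method that pulls a sub-model out of an existing model by balancing open/close tag events, instead of going through ModelFinder and the template manager. The method name, parameters, and signature are illustrative only, not the dialect's actual API:

import org.thymeleaf.model.ICloseElementTag
import org.thymeleaf.model.IModel
import org.thymeleaf.model.IModelFactory
import org.thymeleaf.model.IOpenElementTag

// Sketch only: starting from an element's opening (or standalone) tag event, copy
// events into a new model until the matching close tag brings the nesting depth
// back down to zero.
static IModel getModel(IModel self, IModelFactory modelFactory, int startIndex) {
    def subModel = modelFactory.createModel()
    def depth = 0
    for (def i = startIndex; i < self.size(); i++) {
        def event = self.get(i)
        subModel.add(event)
        if (event instanceof IOpenElementTag) {
            depth++
        }
        else if (event instanceof ICloseElementTag) {
            depth--
        }
        if (depth == 0) {
            break
        }
    }
    return subModel
}
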
ultraq self-assigned this May 2, 2016
ultraq added this to the Thymeleaf Layout Dialect 2.0 milestone May 2, 2016
ultraq changed the title from "Create benchmark suite" to "Performance improvements under Thymeleaf 3" May 6, 2016

ultraq commented May 7, 2016

These can probably wait until after the 2.0.0 release if I don't want to hold that up.

ultraq removed this from the Thymeleaf Layout Dialect 2.0 milestone May 18, 2016

ultraq commented May 24, 2016

See #108 for a performance issue from loading templates too often (made worse by the fact that templates are pulled over the internet). That ties in to the 3rd bullet point above.


ultraq commented May 24, 2016

Something else that's come up in talks with Daniel: performing model equality checks might be a bit of a hit, particularly for text nodes, which will need to resolve their underlying CharSequence in memory.
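
To illustrate the concern, here's a minimal sketch of the kind of comparison involved (the method name is made up for this example and isn't actual layout dialect code) - comparing two text events means each one has to resolve its full character data first:

import org.thymeleaf.model.IText

// Sketch only: each getText() call forces the event's underlying character data
// to be resolved into a String before the comparison can take place.
static boolean textEventsEqual(IText text1, IText text2) {
    return text1.text == text2.text
}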

Antibrumm commented

If you'd like to implement benchmarks with the help of JMH, I have a benchmark profile implemented in one of my GitHub projects.

It works pretty well for getting an overview and was fairly easy to get running. The project is small so it could be good for some hints :) Take a look at the pom.xml:

https://github.com/Antibrumm/jackson-antpathfilter/blob/master/pom.xml


ultraq commented Jul 9, 2016

Hey @Antibrumm, sorry to take so long getting back to you, but just getting the layout dialect updated for Thymeleaf 3 has been keeping me busy!

Anyway, I don't know anything about JMH (I had to Google it and do some reading), but how has it been helping you and your project?


ultraq commented Jul 14, 2016

See some of the comments on the thread about a potential memory leak (#122): ExpandoMetaClass came up as a memory dominator (due to my reliance on Groovy's metaprogramming features to make life much simpler for me), so it might be good to cut back on them.


Antibrumm commented Jul 19, 2016

I have not used it in an extreme way (yet), as the library is not meant to be a high-throughput piece of work at the moment. But for playing around with different optimization possibilities I found JMH quite helpful, as it's very easy to use. Similar to JUnit, all you have to do is place some annotations like in the code piece below, and JMH takes care of the warmup phase and all the other stuff you easily forget when you write benchmarks yourself. At least when you're not an expert :)

In my case I write JSON dynamically based on the include/exclude patterns given to the Jackson2Helper. I wanted to measure the impact of serialization using wildcards, explicit path declarations, and the standard ObjectMapper.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

// Jackson2Helper and User are classes from the jackson-antpathfilter project
// (their imports are omitted here)

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@Threads(5)
public class ComparisonBenchmark {

    private final Jackson2Helper jackson2Helper = new Jackson2Helper();
    private final List<User> users = new ArrayList<User>();
    private ObjectMapper objectMapper = new ObjectMapper();

    public ComparisonBenchmark() {
        for (int i = 0; i < 10; i++) {
            users.add(User.buildRecursive(100));
        }
    }

    @Benchmark
    public void measureFiltered() {
        jackson2Helper.writeFiltered(users,
                "*", "**.manager.*", "**.address.streetName", "**.reports.lastName");
    }

    @Benchmark
    public void measureSpecificFilters() {
        jackson2Helper.writeFiltered(users,
                "firstName", "lastName", "email", "manager.firstName", "manager.lastName");
    }

    @Benchmark
    public void measureStandard() throws JsonProcessingException {
        objectMapper.writer().writeValueAsString(users);
    }
}

And here is an example of the output it generates:

[INFO] --- exec-maven-plugin:1.4.0:exec (run-benchmarks) @ jackson-antpathfilter ---
# JMH 1.11.1 (released 298 days ago, please consider updating!)
# VM version: JDK 1.8.0_40, VM 25.40-b25
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 10 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 5 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: ch.mfrey.jackson.antpathfilter.test.ComparisonBenchmark.measureFiltered

# Run progress: 0.00% complete, ETA 00:06:00
# Fork: 1 of 3
# Warmup Iteration   1: 331.001 ±(99.9%) 82.782 ms/op
....
# Warmup Iteration  10: 249.697 ±(99.9%) 21.749 ms/op
Iteration   1: 248.342 ±(99.9%) 12.490 ms/op
...
Iteration  10: 256.905 ±(99.9%) 26.289 ms/op

# Run progress: 5.56% complete, ETA 00:08:04
# Fork: 2 of 3
# Warmup Iteration   1: 353.631 ±(99.9%) 124.372 ms/op
...
# Warmup Iteration  10: 267.917 ±(99.9%) 26.909 ms/op
Iteration   1: 268.767 ±(99.9%) 12.481 ms/op
...
Iteration  10: 271.785 ±(99.9%) 31.168 ms/op

# Run progress: 11.11% complete, ETA 00:07:26
# Fork: 3 of 3
# Warmup Iteration   1: 337.758 ±(99.9%) 128.091 ms/op
...
# Warmup Iteration  10: 268.730 ±(99.9%) 40.317 ms/op
Iteration   1: 277.757 ±(99.9%) 15.678 ms/op
...
Iteration  10: 275.310 ±(99.9%) 27.292 ms/op


Result "measureFiltered":
  267.000 ±(99.9%) 7.360 ms/op [Average]
  (min, avg, max) = (247.731, 267.000, 284.497), stdev = 11.016
  CI (99.9%): [259.640, 274.360] (assumes normal distribution)


ultraq commented Oct 17, 2016

Have removed more uses of dynamic objects/classes as a follow-on from #122, taken to the extreme, and have found ridiculous performance gains O_o Was gonna post some of them here, but someone just raised this exact issue, so I'll post more there.


ultraq commented Nov 6, 2016

Final state of performance improvements between 2.0 and 2.1

I've uploaded the very basic web app that I've been using for all of my memory tests since the first report of a memory leak in 2.0.0. The GitHub repo of that project can be found here: https://github.com/ultraq/thymeleaf-layout-dialect-benchmark

It does very little, using only layout dialect features so that other things that would normally be part of an application don't impact the tests. It does use Spring MVC, however, which might be unnecessary overhead, but I have enough faith in the Spring project's robustness that I don't think it'll affect the results too badly.

That web app is started with the YourKit profiling agent attached, and object allocation profiling is also enabled. The web app is then stress tested with a simple JMeter test plan (included in the repo) that simulates concurrent users and load to exaggerate any problems in the layout dialect. At the end of the test, a forced GC is done and a memory snapshot is taken to see how the app is at rest.

Thymeleaf Layout Dialect 2.0.4

[screenshot: memory usage, 2.0.4]

Main takeaways:

  • The JMeter test took about 3 minutes to complete (started around the 30 second mark), with requests taking an average of 1.674 seconds each

  • Old generation space at 99MB

  • 35 garbage collections

  • 27 million object allocations

  • 4 seconds spent in GC

  • Several items taking over 10MB of retained memory (none of them appearing as dominators however, so they're potentially GC-able, but they don't seem to have been collected)

    [screenshot]

  • Majority of the object allocations taking place in the IModelExtensions.findModel closure, which uses Groovy's dynamic metaclass creation

    [screenshot]

Thymeleaf Layout Dialect 2.1.0-SNAPSHOT (the latest snapshot of what will become 2.1.0)

[screenshot: memory usage, 2.1.0-SNAPSHOT]

Differences:

  • The JMeter test took about 1 minute to complete (also started around the 30 second mark), with requests taking an average of 452ms to complete (at least 3x faster)

  • Old generation space at 22MB (memory footprint 1/5th the size)

  • 21 garbage collections (40% fewer GCs)

  • 1 second spent in GC (75% less time spent in GC)

  • Only 1 item taking over 10MB of retained memory (dominator profile looking mostly the same however)

    [screenshot]

  • Majority of the object allocations no longer in a Groovy dynamic meta class method, but in one of Thymeleaf's utility projects, unbescape

    [screenshot]

Changes made and lessons learned

The change that had the biggest impact on the performance profile of the layout dialect was the removal of Groovy's dynamic metaclass creation. Here's a line representative of what that looked like (the identifiers below are illustrative, not the dialect's actual code):
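
event.metaClass.startIndex = index   // attach a startIndex property to this one object, via a per-instance ExpandoMetaClass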

What that line did was add a property, startIndex, to the object, which I would use later on in the dialect to know exactly where a model started. A similar property was added for where the model ended. This was done because Thymeleaf 3 did away with DOM nodes, making it much harder for dialects to track the things that make up an "element"; instead, they rely on queues of what Thymeleaf calls "events" (text, tags, comments, anything that you write into a template), with no clear demarcation of elements.

This was a huge convenience for the code but, as shown in the object allocation profiling above, something about it incurred a massive cost in memory. This is Groovy's under-the-hood code, so I don't know exactly what goes on in there to provide programmers this convenience.

So I removed lines like that throughout the layout dialect, and implemented an additional step for calling code to search for the start/end of a model after the model was received. Theoretically this should have been slower, because it's an additional O(n) lookup on top of code that already did an O(n) lookup to retrieve the model in the first place, but in practice it beat out the dynamic metaclass allocation.
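
A rough sketch of that extra step (the method name and signature are made up for this example, not the dialect's actual API) - calling code now does a linear scan over the parent model's events to find the one it's interested in, rather than reading a startIndex property off the model:

import org.thymeleaf.model.IModel

// Sketch only: return the index of the first event in the model that satisfies
// the given condition, or -1 if no event matches.
static int findIndexOf(IModel self, Closure<Boolean> matcher) {
    for (def i = 0; i < self.size(); i++) {
        if (matcher(self.get(i))) {
            return i
        }
    }
    return -1
}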

Lesson learned: be careful/sparing with dynamic metaclasses. The convenience they provide is a boon for programmers, but in a performance-critical part of the code their benefits may not outweigh the costs.

Notes, caveats, and final words

These numbers are specific to the benchmark that I ran them on, so don't expect to see improvements of a similar scale in your own app. What it does mean is that the layout dialect should now be even less of a drain on your own app's memory and CPU profiles, letting you focus on the performance of your app rather than the performance of your libraries.

If you continue to experience performance problems though, feel free to raise an issue, but please also provide memory profiles if at all possible. I've actually quite enjoyed digging into and fixing up these things, as I learn a lot in the process.
