Persistence - Performance Improvement #60

Closed

sageserpent-open opened this issue Apr 13, 2019 · 157 comments

Comments

@sageserpent-open
Owner

Improve performance of the persistent implementation introduced in #16.

@sageserpent-open
Owner Author

Under investigation: tuples are being proxied at the moment, and there are a lot of delayed loads made through the proxy to access the '_1' component. Why is this the case, and what would be the impact of not proxying tuples?

@sageserpent-open
Owner Author

sageserpent-open commented Apr 14, 2019

Another thought - as expected, there is a substantial penalty for using synchronous calls to H2. Overall, there is quadratic scaling of the runtime for the benchmark in its form from commit 18a3cd8.

Is this connected with Doobie / H2 / poor synchronous use, or is this higher up the stack in 'ImmutableObjectStorage' / Kryo etc?

I should experiment with the benchmark running with a hacked world implementation that uses 'FakeTranches'.

@sageserpent-open
Owner Author

Hmm - with 'Tuple2' proxying disabled, the world spec takes:
5 minutes 30 seconds
6 minutes 47 seconds
to complete.

@sageserpent-open
Owner Author

OK, committed 35c249f with proxying disabled on 'Tuple2' to see the effects on the Travis build.

Running the benchmark still shows the quadratic scaling, the coefficients are only slightly smaller than at 18a3cd8.

The next stop is to go back to using 'FakeTranches' and to get to the bottom of that quadratic scaling.

@sageserpent-open
Owner Author

sageserpent-open commented Apr 14, 2019

The world spec takes a mere 4 minutes and 11 seconds when 'H2Tranches' is swapped out for a copy of 'FakeTranches'; this is from commit c066192.

Irrespective of the quadratic scaling, the interaction with H2 badly needs finessing. This is made more obvious by some simple profiling showing that H2 is definitely not the bottleneck - it is consuming very little CPU time.

@sageserpent-open
Owner Author

Benchmarking at commit c066192, there is still a quadratic scaling even with 'FakeTranches'. This doesn't come as a surprise, and points at 'ImmutableObjectStorage' as being a possible suspect.

@sageserpent-open
Owner Author

Facepalm moment: realised that the optimization to the old 'QuestionableTranches' implementation was never copied over to 'ImmutableObjectStorage.FakeTranches', which has in turn been copied again (!) into 'WorldH2StorageImplementation.FakeTranches' as a temporary measure while investigating. Commit 7328295 reapplies the optimization across both copies; this brings the time for the world spec down to a mere 2 minutes and 36 seconds.

@sageserpent-open
Owner Author

Benchmarking the code at commit 7328295 reveals a definite decrease in the quadratic scaling. Let's see where the residual scaling lies...

@sageserpent-open
Owner Author

Aha! Looking through a debug session reveals a lot of forced loading of underlying objects by their proxies due to mundane method calls such as '.hashCode' or 'itemClazz' or 'timeOrdering'. Could these be cached when the underlying object is stored?
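As a thought experiment, something along these lines might avoid those forced loads. This is only a conceptual sketch with made-up names, not the actual proxy machinery:

```scala
// Conceptual sketch only (made-up names): capture cheap, frequently-requested
// results such as the hash code at storage time, so the proxy can answer them
// without resolving its underlying object.
class CachingProxy[Underlying](cachedHashCode: Int, load: () => Underlying) {
  private lazy val underlying: Underlying = load() // the delayed load through the proxy

  override def hashCode: Int = cachedHashCode // served without touching the tranche

  def resolve: Underlying = underlying // anything else still forces the load
}
```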

@sageserpent-open
Owner Author

sageserpent-open commented Apr 15, 2019

Looking at method counts reveals a quadratic scaling via hash trie map use - specifically via the immutable hash map implementation that is used by both 'AllEventsImplementation' and 'BlobStorageInMemory'. This would appear at first glance to be unrelated to 'ImmutableObjectStorage'; does this quadratic scaling exist in 'WorldEfficientInMemoryImplementation' too?

@sageserpent-open
Owner Author

Thinking about the performance work in #18, omitting the queries proved to be a useful gambit in locating the final source of quadratic scaling.

@sageserpent-open
Owner Author

More musing ... is it worth proxying any case class? May as well just pull them in wholesale - this is prompted especially by 'UniqueItemSpecification', which has instances all over the place.

@sageserpent-open
Owner Author

Another musing - when making a query, could the structure of 'BlobStorageInMemory' cause its internal proxies to load across the board for all ids / lifecycles?

@sageserpent-open
Owner Author

OK, so running the method counts benchmark against the older 'WorldEfficientInMemoryImplementation' reveals linear scaling and the complete absence of calls to hash trie maps in the top ten highest counts.

So the good news is that the common code shared between the two implementations isn't the source of the quadratic behavior - at least not in its own right; perhaps it interacts poorly with 'ImmutableObjectStorage'. Let's revisit those calls to the hash trie map again...

@sageserpent-open
Owner Author

So, 'WorldEfficientInMemoryImplementation' also makes lots of calls to hash trie maps via 'AllEventsImplementation' and 'BlobStorageInMemory', just not in the vast numbers that 'WorldH2StorageImplementation' does. Perhaps this was a red herring - is there another source of calls peculiar to 'WorldH2StorageImplementation'.... ?

@sageserpent-open
Owner Author

A cursory look via debugging indicates not - perhaps it is (de)serialization hitting the values of the hash trie maps that is causing the churn. This would indeed fit what is seen in the method counts, where scala.collection.immutable.HashMap$HashTrieMap.bitmap() and scala.collection.immutable.HashMap$HashTrieMap.elems() are being hammered.

@sageserpent-open
Owner Author

sageserpent-open commented Apr 15, 2019

Running the method counts benchmark on 'WorldH2StorageImplementation' with querying disabled still reveals quadratic scaling via scala.collection.immutable.HashMap$HashTrieMap.bitmap() and scala.collection.immutable.HashMap$HashTrieMap.elems().

This leads me to suspect that any loading of a tranche for a timeline is forcing wholesale loading of other tranches - this may be a case of the structure sharing between tranches acting pathologically, where a simple reference to something in another tranche leads to everything in the other tranche being loaded, which somehow cascades to the next tranche and so on...

@sageserpent-open
Owner Author

One source of the quadratic scaling is a 'tranche avalanche', whereby the Chill collection serializer on deserialization starts inserting elements one by one into a collection builder. For a hash map, this requires that each newly added element be queried for its hash code, thus forcing its proxy to load its underlying object. Not only can this recurse, it also causes incidental loading of collections stored elsewhere in the tranche needed to resolve the underlying object, which causes still more recursive loading to take place.

As of commit ebfe12a, the Chill serializers have been omitted and only the default Kryo ones are used to circumvent this. This brings the world spec time down to 2 minutes, 22 seconds, which is definitely progress...
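For the record, the mechanism is roughly as below. This is a hedged sketch with stand-in names, not the actual Chill or 'ImmutableObjectStorage' code:

```scala
import scala.collection.immutable.HashMap

// A stand-in for a serialization proxy that lazily loads its underlying object.
class LazyProxy(load: () => AnyRef) {
  private lazy val underlying: AnyRef = load()     // touching this pulls in the referenced tranche
  override def hashCode: Int = underlying.hashCode // so hashing forces the load
}

object AvalancheSketch {
  // A Chill-style collection serializer rebuilds a map one entry at a time on
  // deserialization; each insertion hashes the key, so every proxied key resolves
  // its underlying object - and whatever that object references comes along too.
  def rebuild(entries: Seq[(LazyProxy, AnyRef)]): HashMap[LazyProxy, AnyRef] = {
    val builder = HashMap.newBuilder[LazyProxy, AnyRef]
    entries.foreach(builder += _) // each += calls key.hashCode
    builder.result()
  }
}
```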

@sageserpent-open
Owner Author

sageserpent-open commented Apr 16, 2019

It looks like the tranche avalanche has been circumvented - debugging and logging shows an absence of any proxies being loaded in the context of another proxy load from a referring tranche.

There are a lot of requests to load hash maps which seem to be unavoidable - these stem from the code in 'TimelineImplementation' that has to load these maps to calculate update timelines, just as it does for 'WorldEfficientInMemoryImplementation', which has linear scaling.

Currently there is still some quadratic dependency via underlying object loading. One possibility is that this is caused by the constructor precondition checking for 'AllEventsImplementation', which scans through a bounded sample of lifecycles and thus forces loading. The bound is 100 lifecycles; this is of the same order of magnitude as the number of items dealt with by the benchmark for sizes up to 500 steps, so perhaps that quadratic term is due to the number of lifecycles being scanned 'filling up' to 100 as the number of steps goes up.

If so, then reducing the sample size bound to, say, 5 should reveal linear scaling.
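The precondition being blamed has roughly this shape - a hypothetical sketch, not the actual 'AllEventsImplementation' code:

```scala
object PreconditionSketch {
  // Hypothetical shape of the precondition (made-up names): even a bounded sample
  // scan resolves up to 'sampleSizeBound' lifecycle proxies on every construction.
  def requireSampledInvariant[Lifecycle](
      lifecycles: Iterable[Lifecycle],
      invariantHoldsFor: Lifecycle => Boolean,
      sampleSizeBound: Int = 100 // dropping this to, say, 5 limits the forced loads
  ): Unit =
    require(lifecycles.take(sampleSizeBound).forall(invariantHoldsFor))
}
```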

@sageserpent-open
Owner Author

Note to self: consider reverting commit a9a1645 if the cutover to using tree maps to underpin Quiver doesn't yield significant performance benefits?

@sageserpent-open
Owner Author

Tried reducing the sample size in commit d6ec90a.

This did knock down the quadratic dependency a little; perhaps the constructor preconditions in 'AllEventsImplementation.LifecycleImplementation' are also causing trouble?

@sageserpent-open
Owner Author

sageserpent-open commented Apr 16, 2019

More musing: the benchmark shotguns item ids across a whole bunch of events, where the number of item ids is proportional to the benchmark size. This means that the same item id can appear in multiple revisions, and thus be involved in multiple tranches. Perhaps this is causing loading of multiple tranches when the item state update dag is used to perform recalculation?

@sageserpent-open
Owner Author

Disabling the contracts altogether in 'AllEventsImplementation' and 'AllEventsImplementation.LifecycleImplementation' only knocked a little bit of the quadratic scaling off, so I'm backing out of that line of attack now.

@sageserpent-open
Owner Author

Tried an experiment in which tranches are preserved wholesale between distinct sessions by parking them in the tranches instance passed to a session. This immediately gives linear scaling, even though new tranches are being stored.

Clearly we can't just store all the tranches in memory (not least because it breaks some of the 'ImmutableObjectStorageSpec' tests concerning tranche corruption), but it does point out that the remaining scaling issue is down to a tranche avalanche.

Is there perhaps some obvious thing that links the tranches in long chains, or should the code be cut over to maintaining a bounded cache of tranches?

@sageserpent-open
Owner Author

Two ideas:

  1. Cache loaded tranches in a cache of bounded size with an LRU policy (see the sketch after this list).
  2. Introduce the notion of banning inter-tranche references for certain classes, but continue to use local reference counting.
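For idea 1, a minimal sketch of the sort of thing intended - the type parameters are stand-ins, and this is not a concrete design decision:

```scala
import java.util.{LinkedHashMap, Map => JavaMap}

object LruCacheSketch {
  // A size-bounded cache with least-recently-used eviction, built on
  // java.util.LinkedHashMap in access order; the type parameters stand in for
  // whatever the real tranche id and tranche types are.
  def lruTrancheCache[TrancheId, Tranche](maximumSize: Int): JavaMap[TrancheId, Tranche] =
    new LinkedHashMap[TrancheId, Tranche](16, 0.75f, true /* access order */ ) {
      override def removeEldestEntry(eldest: JavaMap.Entry[TrancheId, Tranche]): Boolean =
        size() > maximumSize // evict the least recently used entry once over budget
    }
}
```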

@sageserpent-open
Owner Author

Interesting - running the benchmark and looking at the number of tranches loaded in each session, it appears that this scales linearly with the number of revisions while making revisions, and also linearly, with a massive scatter, when making queries.

Is this really down to the nature of the benchmark, or is something still forcing inter-tranche references to be loaded?

@sageserpent-open
Owner Author

OK, done a lot of work on this and found option 2 to be a bad idea.

Option 1 morphed into caching of objects between sessions by both reference id and tranche id, and that has yielded fruit...

@sageserpent-open
Owner Author

So, here is a graph of benchmark time versus the number of revision / query steps in the benchmark for commit eacb316:

[Graph: benchmark timings for commit eacb316]

This is using 'FakeTranches' as the backend.

@sageserpent-open
Owner Author

sageserpent-open commented Jul 8, 2019

Ah - one thing: what happens nowadays if queries are omitted? There are results above for up to 1 000 000 steps, but this was before the benchmark was revamped. I should do that again...

@sageserpent-open
Owner Author

Reopening to add discussion of subsequent work...

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

Removing queries revealed some residual non-linear timing dependency.

This was initially investigated by looking at the number of cached associations from objects to their reference ids, from reference ids to object proxies and from tranche ids to completed operations.

As of commit 1e1fadd, this revealed a steady increase in the number of object to reference id associations, as well as in the number of reference id to proxy associations; in addition, the numbers of completed operations (and objects and proxies) all exploded at around 250 000 revisions.

Number of cached object to reference id associations:

1e1faddbfd02e9a1daa00c1c4a8effb16c1784f5 - objectAssociations

Number of reference id to proxy associations:
1e1faddbfd02e9a1daa00c1c4a8effb16c1784f5 - Proxies

Number of tranche id to cached completed operations:

1e1faddbfd02e9a1daa00c1c4a8effb16c1784f5 - Completed Operations

@sageserpent-open
Owner Author

As the Caffeine cache implementation clears the cache via a background thread, it was hypothesised that it might not be catching up with the workload, possibly due to its default thread pool being starved by other libraries.

Note that in the previous graphs for commit 1e1fadd, the caches were being periodically cleared in an attempt to force cleanup; this clearly wasn't working.

Commit af6f546 swapped out Caffeine for the Guava cache implementation, and also dropped the periodic forced clearance.

Number of cached object to reference id associations:

af6f546a4a90f7111599b1a4992efb0d112de26e - Objects

Number of reference id to proxy associations:
af6f546a4a90f7111599b1a4992efb0d112de26e - Proxies

Number of tranche id to cached completed operations:
af6f546a4a90f7111599b1a4992efb0d112de26e - Completed Operations

This has clearly fixed the explosion issue, but there is still the linear trend in cached objects and proxies, which is unacceptable if the memory usage is to be kept bounded.

@sageserpent-open
Owner Author

The linear trend turned out to be caused by 'ImmutableObjectStorage' caching completed operation objects on storage. Doing this holds on to 'fresh' objects that were created in a session rather than loaded from an existing tranche; even though successive stored tranches will correctly partition up the graph of live in-memory objects, those objects will never have been restored from a tranche, so no proxies are generated, and as more revisions are made the in-memory object graph gets bigger and bigger. Reloading tranches after earlier cache eviction mitigates this somewhat, but even then, a request for an inter-tranche reference can be resolved via a completed operation that contains fresh objects rather than reloaded ones.

Commit efc78bb fixed this problem.

Number of cached object to reference id associations:
efc78bb73fb9716272401b1dcffbf156db1da3ef - Objects

Number of reference id to proxy associations:
efc78bb73fb9716272401b1dcffbf156db1da3ef - Proxies

Number of tranche id to cached completed operations:
efc78bb73fb9716272401b1dcffbf156db1da3ef - Completed Operations

@sageserpent-open
Owner Author

Given that the linear dependency coefficients are an order of magnitude smaller for the object and proxy associations, and are negative for the object associations and the completed operation associations, the working hypothesis is that these are due to the caching 'settling in'.

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

It has been confirmed by inspection using VisualVM that the memory footprint remains bounded, alternating between a roughly constant maximum and a roughly constant minimum due to garbage collection; this is based on the results for commits from efc78bb onwards.

However, the timings are still non-linear, even with queries omitted:

efc78bb73fb9716272401b1dcffbf156db1da3ef - Timings

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

The focus moved to looking at the number of tranches loaded overall and per 50 revisions, the amount of tranche data loaded overall and the maximum size of tranche loaded to satisfy fetching an underlying object for a proxy.

@sageserpent-open
Owner Author

What was immediately apparent is that the largest tranches loaded to satisfy fetching a proxy's underlying object are always due to the proxy being a hash trie map or a more basic hash map, usually a trie.

@sageserpent-open
Owner Author

A brief excursion was made post commit efc78bb to see whether blobs were exhibiting any nonlinear trend; this turned out not to be the case:

Cumulative size of stored blobs:
[Graph: Cumulative Snapshot Store Size]

@sageserpent-open
Owner Author

Given that the size of a stored blob is roughly constant per revision, reading blobs cannot be the cause of the non-linearity.

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

Commit f476096 yields interesting results.

Cumulative number of tranche loads (both direct and on behalf of proxies):
f476096ff74f9fae29ed7367cae788ca94573f4f - Cumulative Tranche Loads

Cumulative size of tranche loads (both direct and on behalf of proxies):
f476096ff74f9fae29ed7367cae788ca94573f4f - Cumulative Tranche Sizes

Maximum size of tranche loaded on behalf of a proxy:
f476096ff74f9fae29ed7367cae788ca94573f4f - Maximum tranche size for proxy

Number of tranches loaded per 50 revisions:
f476096ff74f9fae29ed7367cae788ca94573f4f - Number of tranche loads per revision

@sageserpent-open
Owner Author

There is a rather pleasing logarithmic scaling to the maximum tranche size loaded on behalf of a proxy, and indeed the number of tranches loaded per revision also scales logarithmically, presumably because there is some balanced tree structure that requires a logarithmic number of nodes to be replaced by new nodes as a new tree is generated.

So the first two curves are not actually quadratic; they are simply best fits to a sum of logarithms plus constant terms, that is: a * (N * ln(N) - N) + b * N (using Stirling's approximation).
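A quick numerical check of that claim - just a throwaway script, not part of the benchmark:

```scala
object StirlingCheck extends App {
  // The cumulative sum of logarithmic per-step costs, sum of ln(n) for n = 1..N,
  // is approximately N * ln(N) - N by Stirling's approximation, so the cumulative
  // curves are superlinear without being quadratic.
  val N        = 500000
  val exactSum = (1 to N).iterator.map(n => math.log(n.toDouble)).sum
  val stirling = N * math.log(N.toDouble) - N
  println(f"exact: $exactSum%.1f, Stirling: $stirling%.1f") // differ only by ~0.5 * ln(2 * Pi * N)
}
```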

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

So, is this the origin of the non-linear time dependency? Again, this is commit f476096.

Time taken per 50 revisions, linear and logarithmic fits:
f476096ff74f9fae29ed7367cae788ca94573f4f - Time per revision

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

Apart from the slight change in the goodness of fit from 0.0941 to 0.0943 in favour of the linear fit, there is no evidence either way. I'd be wary of choosing one over the other given the order of magnitude of the goodness of fit, ~1e-1, not to mention that the difference is in the third significant figure.

So what can be said with confidence is that:

  1. Incremental recalculation scales linearly in the number of recalculations.
  2. The overall time scaling has a small non-linear dependency.
  3. Memory usage is bounded.
  4. Blob storage size is roughly constant per revision.
  5. Tranche loads per revision scale logarithmically.
  6. Tranche sizes scale logarithmically.
  7. Retrieval of large object graphs doesn't force proxies to load their underlying objects en masse.

So, good, but tantalisingly inconclusive regarding the timings...

@sageserpent-open
Owner Author

Screenshot of VisualVM showing memory usage at around 242850 revisions:

@sageserpent-open
Owner Author

Note that the benchmark is heavily IO bound; CPU usage is much higher for the in-memory world implementation.

@ben-manes

It looks like you're doing a really nice job at performance analysis.

> As the Caffeine cache implementation clears the cache via a background thread, it was hypothesised that it might not be catching up with the workload, possibly due to its default thread pool being starved by other libraries.

You can set Caffeine.executor(Runnable::run) to execute work on a calling thread, like Guava does. Since that work is fairly cheap, it should be fine and might help resolve your issues. We try to give good defaults for minimizing response latencies, but also be configurable as users know best.

It sounds like you found and fixed a possible memory leak (ImmutableObjectStorage) which could have also been the culprit, such that the default executor would work fine. Since the cache is bounded by weakValues, I would have thought the caching libraries wouldn't have much of an impact as the heavyweight objects are managed by the garbage collector. Thus, the clean up work of Caffeine/Guava is only to discard these collected map entries, which has an insignificant memory/time overhead.

If possible, I wouldn't rely too heavily on the GC for bounding a large cache. It is a good strategy for smaller sizes, but may not be aggressive enough and can cause performance issues due to longer GC cycles when under load for large caches. A maximumSize / maximumWeight offers better predictability, lower GC cost, and often a better hit rate. However that might be hard since your proxies are loaded lazily, so a weight cannot be calculated upfront when the storage object is retrieved. In that case you might consider having the proxies fetched through their own cache, which can then be sized appropriately, rather than retaining them on the storage object explicitly.
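For reference, the caller-thread executor configuration described above looks roughly like this in Scala; the key type is an illustrative stand-in, not the project's actual type:

```scala
object CallerThreadExecutorSketch {
  import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

  type ReferenceId = java.util.UUID // stand-in for the real reference id type

  val proxyCache: Cache[ReferenceId, AnyRef] =
    Caffeine
      .newBuilder()
      .executor(_.run()) // run maintenance on the calling thread, as Guava does
      .weakValues()      // heavyweight values remain reclaimable by the GC
      .build[ReferenceId, AnyRef]()
}
```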

@sageserpent-open
Owner Author

sageserpent-open commented Jul 18, 2019

@ben-manes , thanks for the nice commentary, good to know that this topic has provoked interest.

Your suggestion about using a bounded size cache instead of GC/ weak keys or values is interesting; back in commit 1e1fadd the tranches cache was indeed size bounded, with an additional periodic forced clearance via ‘invalidateAll’. This still failed to clear that cache as I recall, so at that point I tried the Guava implementation, which sorted out the cache explosion problem.

There was indeed a second problem, namely the memory creep due to caching of fresh objects as opposed to their proxies, but this was fixed in a subsequent commit. Having that fix in place prior to the Guava cutover would have ameliorated the problem, but I don’t think it would have eliminated it altogether.

I was aware that Caffeine can be made to use a custom executor, this is what got me thinking about thread starvation, having seen similar issues before in Plutonium with other pieces of code using the same default executor from Scala and grinding to a halt.

However, I didn’t know about the trick of passing a callable to be executed in the caller’s thread, thanks for the tip, I may return to that in the future.

@sageserpent-open
Owner Author

In addition, you are right about the motivation for using a weak cache instead of a bounded size cache - the application has no idea as to how large the loaded data structure can be, so I decided to use a weak cache to let it find its own sweet spot based on how much memory was claimed by the JVM. This does indeed hammer the garbage collector, so I may yet revisit that decision.

For now, though, my focus is on providing a more realistic use case than the existing benchmark, and doing some basic local optimisations as well as better use of threads in Plutonium; irrespective of the choice of cache implementation, Plutonium itself is by far the worst offender in terms of CPU and IO time.

Of course, once that’s done, it may be worth revisiting this if it gets to the point that the cache implementation itself becomes a bottleneck.

@ben-manes

That all makes sense and sounds good. I was merely trying to think through why Guava would have done better in this case, given that both used weakValues. Since the GC is collecting the heavy objects automatically, the difference should have been minor. At worst the rate of map entry removals would be different (due to the executor), but those entries wouldn't use much memory comparably. So I'm still a bit confused why Guava was helpful, but glad there are a variety of options regardless.

Soft references are of course the bigger pain. In that case they linger for many GC cycles and have a global LRU eviction, so it is very unpredictable. Since LRU is not a good policy in scanny workloads (like a database scanning rows with a predicate filter), that eviction policy breaks all of the GC hypothesis. Weak references are far better, but also means your hit rates might look much different under a stress test than real world due to whenever the GC decides to clear items (in a load test is more resident or less? does the GC run periodically on idle to free?). Lots of unknowns, in addition to penalizing the GC thread(s) rather than your own. I don't know if its good or bad for you, just an opaque process.

@sageserpent-open
Owner Author

Reopening for follow up on review comment from @ben-manes.

@sageserpent-open
Owner Author

Commit 7d7d755 reintroduces Caffeine instead of Guava, bounds the size of the tranche id to completed operations cache, and performs cleanup of all three caches in 'IntersessionState' using an executor that simply runs in the cache API caller's thread.
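Roughly speaking, the tranche cache now has this shape - a sketch only, not the actual 'IntersessionState' code, with illustrative type names and bound:

```scala
object TrancheCacheSketch {
  import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

  type TrancheId          = java.util.UUID // illustrative stand-ins for the real types
  type CompletedOperation = AnyRef

  val trancheIdToCompletedOperation: Cache[TrancheId, CompletedOperation] =
    Caffeine
      .newBuilder()
      .executor(_.run()) // cleanup runs on the cache API caller's thread
      .maximumSize(100)  // size-bounded, unlike the weak-value object and proxy caches
      .build[TrancheId, CompletedOperation]()
}
```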

@sageserpent-open
Owner Author

Number of cached object to reference id associations:
7d7d755e9c97a68b3b98aee201b37c5551902658 - ObjectToReferenceId

Number of reference id to proxy associations:
7d7d755e9c97a68b3b98aee201b37c5551902658 - ReferenceIdToProxy

@sageserpent-open
Owner Author

sageserpent-open commented Jul 22, 2019

The results are similar to commit efc78bb, so this retains the same slight negative trend for the object to reference id cache size growth and slight positive trend for the reference id to proxy cache size growth.

Note that neither undergoes an explosion at around 250 000 or anywhere else, for that matter.

The size of the tranche id to completed operation cache remains constant at exactly 100 entries throughout the benchmark, so clearly cleanup is working with the new choice of executor.
