Persistence - Performance Improvement #60

Closed

sageserpent-open opened this issue Apr 13, 2019 · 157 comments

Comments

@sageserpent-open
Owner

Improve performance of the persistent implementation introduced in #16.

@sageserpent-open
Owner Author

Under investigation: tuples are being proxied at the moment, and there are a lot of delayed loads made through the proxy to access the '_1' component. Why is this the case, and what would be the impact of not proxying tuples?

@sageserpent-open
Owner Author

sageserpent-open commented Apr 14, 2019

Another thought - as expected, there is a substantial penalty for using synchronous calls to H2. Overall, there is quadratic scaling of the runtime for the benchmark in its form from commit 18a3cd8.

Is this connected with Doobie / H2 / poor synchronous use, or is this higher up the stack in 'ImmutableObjectStorage' / Kryo etc?

I should experiment with the benchmark running with a hacked world implementation that uses 'FakeTranches'.

@sageserpent-open
Owner Author

Hmm - with 'Tuple2' proxying disabled, the world spec takes:
5 minutes 30 seconds
6 minutes 47 seconds
to complete.

@sageserpent-open
Owner Author

OK, committed 35c249f with proxying disabled on 'Tuple2' to see the effects on the Travis build.

Running the benchmark still shows the quadratic scaling, the coefficients are only slightly smaller than at 18a3cd8.

The next stop is to go back to using 'FakeTranches' and to get to the bottom of that quadratic scaling.

@sageserpent-open
Owner Author

sageserpent-open commented Apr 14, 2019

The world spec takes a mere 4 minutes and 11 seconds when 'H2Tranches' is swapped out for a copy of 'FakeTranches'; this is from commit c066192.

Irrespective of the quadratic scaling, the interaction with H2 badly needs finessing. This is made more obvious by some simple profiling showing that H2 is definitely not the bottleneck - it is consuming very little CPU time.

@sageserpent-open
Owner Author

Benchmarking at commit c066192, there is still a quadratic scaling even with 'FakeTranches'. This doesn't come as a surprise, and points at 'ImmutableObjectStorage' as being a possible suspect.

@sageserpent-open
Owner Author

Facepalm moment: realised that the optimization to the old 'QuestionableTranches' implementation was never copied over to 'ImmutableObjectStorage.FakeTranches', which has in turn been copied again (!) into 'WorldH2StorageImplementation.FakeTranches' as a temporary measure while investigating. Commit 7328295 reapplies the optimization across both copies; this brings the time for the world spec down to a mere 2 minutes and 36 seconds.

@sageserpent-open
Owner Author

Benchmarking the code at commit 7328295 reveals a definite decrease in the quadratic scaling. Let's see where the residual scaling lies...

@sageserpent-open
Owner Author

Aha! Looking through a debug session reveals a lot of forced loading of underlying objects by their proxies due to mundane method calls such as '.hashCode' or 'itemClazz' or 'timeOrdering'. Could these be cached when the underlying object is stored?
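As a thought experiment, something along these lines might avoid those forced loads. This is only a conceptual sketch with made-up names, not the actual proxy machinery:

```scala
// Conceptual sketch only (made-up names): capture cheap, frequently-requested
// results such as the hash code at storage time, so the proxy can answer them
// without resolving its underlying object.
class CachingProxy[Underlying](cachedHashCode: Int, load: () => Underlying) {
  private lazy val underlying: Underlying = load() // the delayed load through the proxy

  override def hashCode: Int = cachedHashCode // served without touching the tranche

  def resolve: Underlying = underlying // anything else still forces the load
}
```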

@sageserpent-open
Owner Author

sageserpent-open commented Apr 15, 2019

Looking at method counts reveals a quadratic scaling via hash trie map use - specifically via the immutable hash map implementation that is used by both 'AllEventsImplementation' and 'BlobStorageInMemory'. This would appear at first glance to be unrelated to 'ImmutableObjectStorage'; does this quadratic scaling exist in 'WorldEfficientInMemoryImplementation' too?

@sageserpent-open
Owner Author

Thinking about the performance work in #18, omitting the queries proved to be a useful gambit in locating the final source of quadratic scaling.

@sageserpent-open
Owner Author

More musing ... is it worth proxying any case class? May as well just pull them in wholesale - this is prompted especially by 'UniqueItemSpecification', which has instances all over the place.

@sageserpent-open
Owner Author

Another musing - when making a query, could the structure of 'BlobStorageInMemory' cause its internal proxies to load across the board for all ids / lifecycles?

@sageserpent-open
Owner Author

OK, so running the method counts benchmark against the older 'WorldEfficientInMemoryImplementation' reveals linear scaling and the complete absence of calls to hash trie maps in the top ten highest counts.

So the good news is that the common code shared between the two implementations isn't the source of the quadratic behavior - at least not in its own right; perhaps it interacts poorly with 'ImmutableObjectStorage'. Let's revisit those calls to the hash trie map again...

@sageserpent-open
Owner Author

So, 'WorldEfficientInMemoryImplementation' also makes lots of calls to hash trie maps via 'AllEventsImplementation' and 'BlobStorageInMemory', just not in the vast numbers that 'WorldH2StorageImplementation' does. Perhaps this was a red herring - is there another source of calls peculiar to 'WorldH2StorageImplementation'.... ?

@sageserpent-open
Owner Author

A cursory look via debugging indicates not - perhaps it is (de)serialization hitting the values of the hash trie maps that is causing the churn. This would indeed fit what is seen in the method counts, where scala.collection.immutable.HashMap$HashTrieMap.bitmap() and scala.collection.immutable.HashMap$HashTrieMap.elems() are being hammered.

@sageserpent-open
Owner Author

sageserpent-open commented Apr 15, 2019

Running the method counts benchmark on 'WorldH2StorageImplementation' with querying disabled still reveals quadratic scaling via scala.collection.immutable.HashMap$HashTrieMap.bitmap() and scala.collection.immutable.HashMap$HashTrieMap.elems().

This leads me to suspect that any loading of a tranche for a timeline is forcing wholesale loading of other tranches - this may be a case of the structure sharing between tranches acting pathologically, where a simple reference to something in another tranche leads to everything in the other tranche being loaded, which somehow cascades to the next tranche and so on...

@sageserpent-open
Owner Author

One source of the quadratic scaling is a 'tranche avalanche', whereby the Chill collection serializer on deserialization starts inserting elements one by one into a collection builder. For a hash map, this requires that each newly added element be queried for its hash code, thus forcing its proxy to load its underlying object. Not only can this recurse, it also causes incidental loading of collections stored elsewhere in the tranche needed to resolve the underlying object, which causes still more recursive loading to take place.

As of commit ebfe12a, the Chill serializers have been omitted and only the default Kryo ones are used to circumvent this. This brings the world spec time down to 2 minutes, 22 seconds, which is definitely progress...
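For the record, the mechanism is roughly as below. This is a hedged sketch with stand-in names, not the actual Chill or 'ImmutableObjectStorage' code:

```scala
import scala.collection.immutable.HashMap

// A stand-in for a serialization proxy that lazily loads its underlying object.
class LazyProxy(load: () => AnyRef) {
  private lazy val underlying: AnyRef = load()     // touching this pulls in the referenced tranche
  override def hashCode: Int = underlying.hashCode // so hashing forces the load
}

object AvalancheSketch {
  // A Chill-style collection serializer rebuilds a map one entry at a time on
  // deserialization; each insertion hashes the key, so every proxied key resolves
  // its underlying object - and whatever that object references comes along too.
  def rebuild(entries: Seq[(LazyProxy, AnyRef)]): HashMap[LazyProxy, AnyRef] = {
    val builder = HashMap.newBuilder[LazyProxy, AnyRef]
    entries.foreach(builder += _) // each += calls key.hashCode
    builder.result()
  }
}
```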

@sageserpent-open
Owner Author

sageserpent-open commented Apr 16, 2019

It looks like the tranche avalanche has been circumvented - debugging and logging shows an absence of any proxies being loaded in the context of another proxy load from a referring tranche.

There are a lot of requests to load hash maps which seem to be unavoidable - these stem from the code in 'TimelineImplementation' that has to load these maps to calculate update timelines, just as it does for 'WorldEfficientInMemoryImplementation', which has linear scaling.

Currently there is still some quadratic dependency via underlying object loading. One possibility is that this is caused by the constructor precondition checking for 'AllEventsImplementation', which scans through a bounded sample of lifecycles and thus forces loading. The bound is 100 lifecycles; this is of the same order of magnitude as the number of items dealt with by the benchmark for sizes up to 500 steps, so perhaps that quadratic term is due to the number of lifecycles being scanned 'filling up' to 100 as the number of steps goes up.

If so, then reducing the sample size bound to, say, 5 should reveal linear scaling.
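The precondition being blamed has roughly this shape - a hypothetical sketch, not the actual 'AllEventsImplementation' code:

```scala
object PreconditionSketch {
  // Hypothetical shape of the precondition (made-up names): even a bounded sample
  // scan resolves up to 'sampleSizeBound' lifecycle proxies on every construction.
  def requireSampledInvariant[Lifecycle](
      lifecycles: Iterable[Lifecycle],
      invariantHoldsFor: Lifecycle => Boolean,
      sampleSizeBound: Int = 100 // dropping this to, say, 5 limits the forced loads
  ): Unit =
    require(lifecycles.take(sampleSizeBound).forall(invariantHoldsFor))
}
```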

@sageserpent-open
Owner Author

Note to self: consider reverting commit a9a1645 if the cutover to using tree maps to underpin Quiver doesn't yield significant performance benefits?

@sageserpent-open
Owner Author

Tried reducing the sample size in commit d6ec90a.

This did knock down the quadratic dependency a little; perhaps the constructor preconditions in 'AllEventsImplementation.LifecycleImplementation' are also causing trouble?

@sageserpent-open
Owner Author

sageserpent-open commented Apr 16, 2019

More musing: the benchmark shotguns item ids across a whole bunch of events, where the number of item ids is proportional to the benchmark size. This means that the same item id can appear in multiple revisions, and thus be involved in multiple tranches. Perhaps this is causing loading of multiple tranches when the item state update dag is used to perform recalculation?

@sageserpent-open
Owner Author

Disabling the contracts altogether in 'AllEventsImplementation' and 'AllEventsImplementation.LifecycleImplementation' only knocked a little bit of the quadratic scaling off, so I'm backing out of that line of attack now.

@sageserpent-open
Owner Author

Tried an experiment in which tranches are preserved wholesale between distinct sessions by parking them in the tranches instance passed to a session. This immediately gives linear scaling, even though new tranches are being stored.

Clearly we can't just store all the tranches in memory (not least because it breaks some of the 'ImmutableObjectStorageSpec' tests concerning tranche corruption), but it does point out that the remaining scaling issue is down to a tranche avalanche.

Is there perhaps some obvious thing that links the tranches in long chains, or should the code be cut over to maintaining a bounded cache of tranches?

@sageserpent-open
Owner Author

Two ideas:

  1. Cache loaded tranches in a cache of bounded size with an LRU policy (see the sketch after this list).
  2. Introduce the notion of banning inter-tranche references for certain classes, but continue to use local reference counting.
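For idea 1, a minimal sketch of the sort of thing intended - the type parameters are stand-ins, and this is not a concrete design decision:

```scala
import java.util.{LinkedHashMap, Map => JavaMap}

object LruCacheSketch {
  // A size-bounded cache with least-recently-used eviction, built on
  // java.util.LinkedHashMap in access order; the type parameters stand in for
  // whatever the real tranche id and tranche types are.
  def lruTrancheCache[TrancheId, Tranche](maximumSize: Int): JavaMap[TrancheId, Tranche] =
    new LinkedHashMap[TrancheId, Tranche](16, 0.75f, true /* access order */ ) {
      override def removeEldestEntry(eldest: JavaMap.Entry[TrancheId, Tranche]): Boolean =
        size() > maximumSize // evict the least recently used entry once over budget
    }
}
```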

@sageserpent-open
Owner Author

Interesting - running the benchmark and looking at the number of tranches loaded in each session, it appears that this scales linearly with the number of revisions while making revisions, and also linearly, with a massive scatter, when making queries.

Is this really down to the nature of the benchmark, or is something still forcing inter-tranche references to be loaded?

@sageserpent-open
Owner Author

OK, done a lot of work on this and found option 2 to be a bad idea.

Option 1 morphed into caching of objects between sessions by both reference id and tranche id, and that has yielded fruit...

@sageserpent-open
Owner Author

So, here is a graph of benchmark time versus the number of revision / query steps in the benchmark for commit eacb316:

[Graph: benchmark timings for commit eacb316]

This is using 'FakeTranches' as the backend.

@sageserpent-open
Owner Author

sageserpent-open commented Jul 8, 2019

Ah - one thing: what happens nowadays if queries are omitted? There are results above for up to 1 000 000 steps, but this was before the benchmark was revamped. I should do that again...

@sageserpent-open
Owner Author

Reopening to add discussion of subsequent work...

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

Removing queries revealed some residual non-linear timing dependency.

This was initially investigated by looking at the number of cached associations from objects to their reference ids, from reference ids to object proxies and from tranche ids to completed operations.

As of commit 1e1fadd, this revealed a steady increase in the number of object to reference id associations, as well as in the number of reference id to proxy associations; in addition, the numbers of completed operations (and objects and proxies) all exploded at around 250 000 revisions.

Number of cached object to reference id associations:

1e1faddbfd02e9a1daa00c1c4a8effb16c1784f5 - objectAssociations

Number of reference id to proxy associations:
1e1faddbfd02e9a1daa00c1c4a8effb16c1784f5 - Proxies

Number of tranche id to cached completed operations:

1e1faddbfd02e9a1daa00c1c4a8effb16c1784f5 - Completed Operations

@sageserpent-open
Owner Author

As the Caffeine cache implementation clears the cache via a background thread, it was hypothesised that it might not be catching up with the workload, possibly due to its default thread pool being starved by other libraries.

Note that in the previous graphs for commit 1e1fadd, the caches were being periodically cleared in an attempt to force cleanup; this clearly wasn't working.

Commit af6f546 swapped out Caffeine for the Guava cache implementation, and also dropped the periodic forced clearance.

Number of cached object to reference id associations:

af6f546a4a90f7111599b1a4992efb0d112de26e - Objects

Number of reference id to proxy associations:
af6f546a4a90f7111599b1a4992efb0d112de26e - Proxies

Number of tranche id to cached completed operations:
af6f546a4a90f7111599b1a4992efb0d112de26e - Completed Operations

This has clearly fixed the explosion issue, but there is still the linear trend in cached objects and proxies, which is unacceptable if the memory usage is to be kept bounded.

@sageserpent-open
Owner Author

The linear trend turned out to be caused by 'ImmutableObjectStorage' caching completed operation objects on storage. Doing this holds on to 'fresh' objects that were created in a session rather than loaded from an existing tranche; even though successive stored tranches will correctly partition up the graph of live in-memory objects, those objects will never have been restored from a tranche, so no proxies are generated, and as more revisions are made the in-memory object graph gets bigger and bigger. Reloading tranches after earlier cache eviction mitigates this somewhat, but even then, a request for an inter-tranche reference can be resolved via a completed operation that contains fresh objects rather than reloaded ones.

Commit efc78bb fixed this problem.

Number of cached object to reference id associations:
efc78bb73fb9716272401b1dcffbf156db1da3ef - Objects

Number of reference id to proxy associations:
efc78bb73fb9716272401b1dcffbf156db1da3ef - Proxies

Number of tranche id to cached completed operations:
efc78bb73fb9716272401b1dcffbf156db1da3ef - Completed Operations

@sageserpent-open
Owner Author

Given that the linear dependency coefficients are an order of magnitude smaller for the object and proxy associations, and are negative for the object associations and the completed operation associations, the working hypothesis is that these are due to the caching 'settling in'.

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

It has been confirmed by inspection using VisualVM that the memory footprint remains bounded, alternating between a roughly constant maximum and a roughly constant minimum due to garbage collection; this is based on the results for commits from efc78bb onwards.

However, the timings are still non-linear, even with queries omitted:

efc78bb73fb9716272401b1dcffbf156db1da3ef - Timings

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

The focus moved to looking at the number of tranches loaded overall and per 50 revisions, the amount of tranche data loaded overall and the maximum size of tranche loaded to satisfy fetching an underlying object for a proxy.

@sageserpent-open
Owner Author

What was immediately apparent is that the largest tranches loaded to satisfy fetching a proxy's underlying object are always due to the proxy being a hash trie map or a more basic hash map, usually a trie.

@sageserpent-open
Owner Author

A brief excursion was made post commit efc78bb to see whether blobs were exhibiting any nonlinear trend; this turned out not to be the case:

Cumulative size of stored blobs:
[Graph: Cumulative Snapshot Store Size]

@sageserpent-open
Owner Author

Given that the size of a stored blob is roughly constant per revision, reading blobs cannot be the cause of the non-linearity.

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

Commit f476096 yields interesting results.

Cumulative number of tranche loads (both direct and on behalf of proxies):
f476096ff74f9fae29ed7367cae788ca94573f4f - Cumulative Tranche Loads

Cumulative size of tranche loads (both direct and on behalf of proxies):
f476096ff74f9fae29ed7367cae788ca94573f4f - Cumulative Tranche Sizes

Maximum size of tranche loaded on behalf of a proxy:
f476096ff74f9fae29ed7367cae788ca94573f4f - Maximum tranche size for proxy

Number of tranches loaded per 50 revisions:
f476096ff74f9fae29ed7367cae788ca94573f4f - Number of tranche loads per revision

@sageserpent-open
Owner Author

There is a rather pleasing logarithmic scaling to the maximum tranche size loaded on behalf of a proxy, and indeed the number of tranches loaded per revision also scales logarithmically, presumably because there is some balanced tree structure that requires a logarithmic number of nodes to be replaced by new nodes as a new tree is generated.

So the first two curves are not actually quadratic; they are simply best fits to a sum of logarithms plus constant terms, that is: a * (N * ln(N) - N) + b * N (using Stirling's approximation).
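A quick numerical check of that claim - just a throwaway script, not part of the benchmark:

```scala
object StirlingCheck extends App {
  // The cumulative sum of logarithmic per-step costs, sum of ln(n) for n = 1..N,
  // is approximately N * ln(N) - N by Stirling's approximation, so the cumulative
  // curves are superlinear without being quadratic.
  val N        = 500000
  val exactSum = (1 to N).iterator.map(n => math.log(n.toDouble)).sum
  val stirling = N * math.log(N.toDouble) - N
  println(f"exact: $exactSum%.1f, Stirling: $stirling%.1f") // differ only by ~0.5 * ln(2 * Pi * N)
}
```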

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

So, is this the origin of the non-linear time dependency? Again, this is commit f476096.

Time taken per 50 revisions, linear and logarithmic fits:
f476096ff74f9fae29ed7367cae788ca94573f4f - Time per revision

@sageserpent-open
Owner Author

sageserpent-open commented Jul 17, 2019

Apart from the slight change in the goodness of fit from 0.0941 to 0.0943 in favour of the linear fit, there is no evidence either way. I'd be wary of choosing one over the other given the order of magnitude of the goodness of fit, ~1e-1, not to mention that the difference is in the third significant figure.

So what can be said with confidence is that:

  1. Incremental recalculation scales linearly in the number of recalculations.
  2. The overall time scaling has a small non-linear dependency.
  3. Memory usage is bounded.
  4. Blob storage size is roughly constant per revision.
  5. Tranche loads per revision scale logarithmically.
  6. Tranche sizes scale logarithmically.
  7. Retrieval of large object graphs doesn't force proxies to load their underlying objects en masse.

So, good, but tantalisingly inconclusive regarding the timings...

@sageserpent-open
Owner Author

Screenshot of VisualVM showing memory usage at around 242850 revisions:

@sageserpent-open
Owner Author

Note that the benchmark is heavily IO bound; CPU usage is much higher for the in-memory world implementation.

@ben-manes

It looks like you're doing a really nice job at performance analysis.

> As the Caffeine cache implementation clears the cache via a background thread, it was hypothesised that it might not be catching up with the workload, possibly due to its default thread pool being starved by other libraries.

You can set Caffeine.executor(Runnable::run) to execute work on a calling thread, like Guava does. Since that work is fairly cheap, it should be fine and might help resolve your issues. We try to give good defaults for minimizing response latencies, but also be configurable as users know best.

It sounds like you found and fixed a possible memory leak (ImmutableObjectStorage) which could have also been the culprit, such that the default executor would work fine. Since the cache is bounded by weakValues, I would have thought the caching libraries wouldn't have much of an impact as the heavyweight objects are managed by the garbage collector. Thus, the clean up work of Caffeine/Guava is only to discard these collected map entries, which has an insignificant memory/time overhead.

If possible, I wouldn't rely too heavily on the GC for bounding a large cache. It is a good strategy for smaller sizes, but may not be aggressive enough and can cause performance issues due to longer GC cycles when under load for large caches. A maximumSize / maximumWeight offers better predictability, lower GC cost, and often a better hit rate. However that might be hard since your proxies are loaded lazily, so a weight cannot be calculated upfront when the storage object is retrieved. In that case you might consider having the proxies fetched through their own cache, which can then be sized appropriately, rather than retaining them on the storage object explicitly.
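For reference, the caller-thread executor configuration described above looks roughly like this in Scala; the key type is an illustrative stand-in, not the project's actual type:

```scala
object CallerThreadExecutorSketch {
  import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

  type ReferenceId = java.util.UUID // stand-in for the real reference id type

  val proxyCache: Cache[ReferenceId, AnyRef] =
    Caffeine
      .newBuilder()
      .executor(_.run()) // run maintenance on the calling thread, as Guava does
      .weakValues()      // heavyweight values remain reclaimable by the GC
      .build[ReferenceId, AnyRef]()
}
```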

@sageserpent-open
Owner Author

sageserpent-open commented Jul 18, 2019

@ben-manes , thanks for the nice commentary, good to know that this topic has provoked interest.

Your suggestion about using a bounded size cache instead of GC/ weak keys or values is interesting; back in commit 1e1fadd the tranches cache was indeed size bounded, with an additional periodic forced clearance via ‘invalidateAll’. This still failed to clear that cache as I recall, so at that point I tried the Guava implementation, which sorted out the cache explosion problem.

There was indeed a second problem, namely the memory creep due to caching of fresh objects as opposed to their proxies, but this was fixed in a subsequent commit. Having that fix in place prior to the Guava cutover would have ameliorated the problem, but I don’t think it would have eliminated it altogether.

I was aware that Caffeine can be made to use a custom executor, this is what got me thinking about thread starvation, having seen similar issues before in Plutonium with other pieces of code using the same default executor from Scala and grinding to a halt.

However, I didn’t know about the trick of passing a callable to be executed in the caller’s thread, thanks for the tip, I may return to that in the future.

@sageserpent-open
Owner Author

In addition, you are right about the motivation for using a weak cache instead of a bounded size cache - the application has no idea as to how large the loaded data structure can be, so I decided to use a weak cache to let it find its own sweet spot based on how much memory was claimed by the JVM. This does indeed hammer the garbage collector, so I may yet revisit that decision.

For now, though, my focus is on providing a more realistic use case than the existing benchmark, and doing some basic local optimisations as well as better use of threads in Plutonium; irrespective of the choice of cache implementation, Plutonium itself is by far the worst offender in terms of CPU and IO time.

Of course, once that’s done, it may be worth revisiting this if it gets to the point that the cache implementation itself becomes a bottleneck.

@ben-manes

That all makes sense and sounds good. I was merely trying to think through why Guava would have done better in this case, given that both used weakValues. Since the GC is collecting the heavy objects automatically, the difference should have been minor. At worst the rate of map entry removals would be different (due to the executor), but those entries wouldn't use much memory comparably. So I'm still a bit confused why Guava was helpful, but glad there are a variety of options regardless.

Soft references are of course the bigger pain. In that case they linger for many GC cycles and have a global LRU eviction, so it is very unpredictable. Since LRU is not a good policy in scanny workloads (like a database scanning rows with a predicate filter), that eviction policy breaks all of the GC hypothesis. Weak references are far better, but also means your hit rates might look much different under a stress test than real world due to whenever the GC decides to clear items (in a load test is more resident or less? does the GC run periodically on idle to free?). Lots of unknowns, in addition to penalizing the GC thread(s) rather than your own. I don't know if its good or bad for you, just an opaque process.

@sageserpent-open
Owner Author

Reopening for follow up on review comment from @ben-manes.

@sageserpent-open
Owner Author

Commit 7d7d755 reintroduces Caffeine instead of Guava, bounds the size of the tranche id to completed operations cache, and performs cleanup of all three caches in 'IntersessionState' using an executor that simply runs in the cache API caller's thread.
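Roughly speaking, the tranche cache now has this shape - a sketch only, not the actual 'IntersessionState' code, with illustrative type names and bound:

```scala
object TrancheCacheSketch {
  import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

  type TrancheId          = java.util.UUID // illustrative stand-ins for the real types
  type CompletedOperation = AnyRef

  val trancheIdToCompletedOperation: Cache[TrancheId, CompletedOperation] =
    Caffeine
      .newBuilder()
      .executor(_.run()) // cleanup runs on the cache API caller's thread
      .maximumSize(100)  // size-bounded, unlike the weak-value object and proxy caches
      .build[TrancheId, CompletedOperation]()
}
```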

@sageserpent-open
Owner Author

Number of cached object to reference id associations:
7d7d755e9c97a68b3b98aee201b37c5551902658 - ObjectToReferenceId

Number of reference id to proxy associations:
7d7d755e9c97a68b3b98aee201b37c5551902658 - ReferenceIdToProxy

@sageserpent-open
Owner Author

sageserpent-open commented Jul 22, 2019

The results are similar to commit efc78bb, so this retains the same slight negative trend for the object to reference id cache size growth and slight positive trend for the reference id to proxy cache size growth.

Note that neither undergoes an explosion at around 250 000 or anywhere else, for that matter.

The size of the tranche id to completed operation cache remains constant at exactly 100 entries throughout the benchmark, so clearly cleanup is working with the new choice of executor.
