
[WIP] Substantial improvements to persistent cache efficiency #243

Merged: 38 commits merged into master from persistent-cache-efficiency on Jun 11, 2019

Conversation

@jamadden jamadden commented Jun 4, 2019

Previously, RelStorage's persistent cache files were notoriously ineffective, achieving hit rates of only 1 to 2% --- if you were lucky. Adding a global memcache server could improve that, but part of the point of the persistent cache is to avoid needing a global memcache server in the first place. I've been working rather feverishly recently to try to improve the situation.

Good news, everybody! It appears the situation can be improved, and substantially. Here's zodbshootout's 'cold' benchmark for reading 500 objects not found in the ZODB Connection caches:

+-----------+----------------+------------------------------+
| Benchmark | MySQL No Cache | MySQL Persistent Cache       |
+===========+================+==============================+
| cold      | 152 ms         | 40.8 ms: 3.73x faster (-73%) |
+-----------+----------------+------------------------------+

In this workload, the persistent cache is able to get a 100% hit rate.

Here's another, more recent benchmark, showing both 'add' and 'cold' speed:

psycopg2-pcache: add 500 objects: Mean +- std dev: 70.0 ms +- 5.0 ms
psycopg2:        add 500 objects: Mean +- std dev: 67.0 ms +- 2.8 ms
psycopg2-pcache: read 500 cold objects: Mean +- std dev: 23.0 ms +- 2.6 ms
psycopg2:        read 500 cold objects: Mean +- std dev: 103 ms +- 2 ms

Once again, the persistent cache gets 100% hit rates and delivers roughly 4.5x better performance than going to the database server. (And that's a local server, doing nothing else, so that's likely only the floor of potential improvements.)

The Idea

The core of the idea is to be smarter about what we store, and smarter about how we read that data back into memory.

  • The in-memory cache tends to accumulate various revisions of an object. But when we're shutting down and writing to disk, we only need to store the most recent revision (for the normal case of a non-historical connection, of course; this doesn't accelerate historical connections). A sketch of this filtering follows the list.
  • The in-memory cache's "checkpoint" system means that we can use a few different cache keys to find object data. We need to save those checkpoints, and reconstitute the keys from disk in a way that will fit with those checkpoints. (You can't use the natural keys, because those are almost certainly wrong for the bulk of the objects. And you can't just shove everything into a single checkpoint because that leads to surprising conflicts. You have to keep the spread. Actually, the checkpoints concept could potentially use some more revisiting down the line.)
  • The MVCC nature of RelStorage means each connection gets its own cache object; we need to provide some of the data (e.g., the checkpoints) we read from the persistent file to each cache object so it can then begin mutating on its own.
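
To make the first point concrete, here's a minimal sketch of the "newest revision only" filtering. The (oid, tid, state) triples and the function name are illustrative, not RelStorage's actual internal API:

```python
# Hypothetical sketch: given the (oid, tid, state) entries accumulated in the
# in-memory cache, keep only the most recent revision of each object before
# writing to disk; non-historical connections never need the older ones.
def newest_revisions(entries):
    newest = {}  # oid -> (tid, state)
    for oid, tid, state in entries:
        current = newest.get(oid)
        if current is None or tid > current[0]:
            newest[oid] = (tid, state)
    return newest
```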

So this PR is working on a few things in parallel (I've left all the gnarly intermediate commits in, for now):

  • Re-structuring the in-memory cache to keep more information about its keys and values. It knows what's an OID and what's a TID, so that we can use this information later.
  • Doing what I outlined above to preserve checkpoints and delta maps in the cache files.
  • Implementing a layer that makes those checkpoints and delta maps usefully cooperate between multiple storage instances (e.g., web workers) sharing the same persistent cache. The idea is that instead of just hoping to find the best data, we actually do get the best data. And it's self-consistent.

Storage Layer

That last part has involved moving to a structured storage system instead of the ad-hoc multi-file thing we were doing before. Fortunately, Python ships with one that works great for this kind of thing: sqlite3 (as long as you have a version of sqlite from 2010 or newer, and of course the cache files shouldn't be on a network filesystem --- again, that would kind of defeat the point).
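
As a rough illustration of what "structured storage" means here, the cache file is just a normal SQLite database that the standard library can open. The table and column names below are hypothetical, not the exact schema this PR uses:

```python
import sqlite3

# Hypothetical sketch of a cache-file schema: one row of checkpoints, plus
# pickled object states keyed by (zoid, tid).
conn = sqlite3.connect("relstorage-cache.sqlite3")
conn.executescript("""
CREATE TABLE IF NOT EXISTS checkpoints (
    cp0 INTEGER NOT NULL,
    cp1 INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS object_state (
    zoid  INTEGER NOT NULL,
    tid   INTEGER NOT NULL,
    state BLOB,
    PRIMARY KEY (zoid, tid)
);
""")
conn.commit()
conn.close()
```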

Unfortunately, it's possible to produce benchmarks where this storage layer appears to perform rather poorly compared to what we did before. Here's reading and writing a 100MB persistent cache:

+-----------+--------------+--------------------------------+-------------------------------+
| Benchmark | RelStorage 2 | sqlite, all new                | sqlite, all shared            |
+===========+==============+================================+===============================+
| write     | 327 ms       | 2.36 sec: 7.22x slower (+622%) | 633 ms: 1.94x slower (+94%)   |
+-----------+--------------+--------------------------------+-------------------------------+
| read      | 828 ms       | 1.23 sec: 1.49x slower (+49%)  | 1.23 sec: 1.49x slower (+49%) |
+-----------+--------------+--------------------------------+-------------------------------+

The first column is the old code. The second column is the new code, writing the 100mb cache to disk for the first time (worst case scenario). The third column is it noticing that there's nothing at all to do and not writing anything except some maintenance checks.

We can see that writing is between 2 and 7 times slower; that ratio appears to hold regardless of the size of the data (on a default 10MB cache, the difference is roughly 200ms). But this is absolutely the worst scenario for the new code: the cache is filled with completely distinct objects with no overlap at all. Most of the time the cache will have the same object at different versions, and we account for that.

Because writing caches occurs only at shutdown, I'm hopeful that even at this performance level, adding about 2s per 100MB of cache isn't a major problem. Feedback is welcome, as always.

Reading is 50% slower. That doesn't seem too bad to get almost-guaranteed useful data, either in relative or absolute terms. And keep in mind that it was probable that the old code would actually go through that process for at least two files, as it tried to hunt for the best data. So we may wind up about the same.

This isn't particularly optimized yet; there's probably room for improvement.

There's still a big TODO list (see below) but I'm excited to get this out there for real-world testing as part of RelStorage 3.0a1 sooner rather than later, so lots of that might get deferred. To reiterate, feedback is welcome.

TODOs

  • Specific testing. There are lots of places where I left 'TODO' comments about specific needed unit tests.
  • General conflict handling testing --- I've done lots of ad-hoc things, but I need to formalize that into unit tests. In the early days I was able to easily produce conflicts, but those issues mostly got resolved fairly quickly.
  • Performance tuning (memory and speed) --- make sure this hasn't had an impact on non-persistent cache operations. On read, maybe we specifically want to only partly fill the cache instead of reading everything? Handle the pre-allocated memory for the cache nodes better.
  • Improved persistent trimming algorithm --- we need to frequency age somehow, just like the in-memory version does
  • Improved in-memory trimming algorithm --- deciding what should go out to disk can be slow (especially when, as in some benchmarks, the answer is "everything"), and it's not clear how much it buys us right now. Can we do better? Is it worth just simplifying?

/cc @jzuech3 @cutz

jamadden commented Jun 4, 2019

Having made it this far, it occurs to me that an in-memory sqlite database might make a compelling replacement for our custom Segmented LRU Cache --- that is, it might allow for a more efficient implementation of a Segmented LRU Cache combined with persistence.

(EDIT: Or indeed, a temporary database; that's expected to be in memory, but allows SQLite to page it out to disk itself as needed.)
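
For reference, the difference between the two is just the connection string. An empty filename gives a temporary database that is private to the connection and can be paged out to disk by SQLite under memory pressure, while ":memory:" never leaves RAM:

```python
import sqlite3

# ":memory:" is strictly RAM; an empty filename gives a temporary database
# that SQLite can spill to disk itself as needed, and that is deleted when
# the connection closes.
in_memory = sqlite3.connect(":memory:")
temporary = sqlite3.connect("")
```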

Advantages

More SQL :)

Reading in and writing out should both benefit by attaching databases. Whereas the current code took 11s to write a 500MB cache and 5s to read it, doing the same with SQL copies took 4s and 1s, respectively (obviously that's rough; there's more overhead to account for).
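
The "SQL copies" here are essentially ATTACH DATABASE plus a single INSERT ... SELECT, so rows never round-trip through Python objects. A rough sketch, reusing the illustrative object_state table from above (not the actual schema):

```python
import sqlite3

# The working cache lives in a temporary database while the process runs.
mem = sqlite3.connect("")
mem.execute("CREATE TABLE object_state (zoid INTEGER, tid INTEGER, state BLOB)")

# ... populate and use the cache ...

# At shutdown, attach the on-disk cache file and copy everything across in
# one statement instead of shuttling each row through Python.
mem.execute("ATTACH DATABASE 'relstorage-cache.sqlite3' AS disk")
mem.execute(
    "CREATE TABLE IF NOT EXISTS disk.object_state (zoid INTEGER, tid INTEGER, state BLOB)"
)
with mem:  # commit the bulk copy as a single transaction
    mem.execute("INSERT INTO disk.object_state SELECT zoid, tid, state FROM main.object_state")
mem.execute("DETACH DATABASE disk")
```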

By keeping the data in the SQL space, we avoid the need for all the little objects (and the large allocations and surrounding questions) that make #186 so annoying (we should expect improved GC times, maybe even better processor cache locality).

By dropping our native code, we fix #184.

Unsure

Presumably much of the bookkeeping I'm doing in Python could be done in the SQLite virtual machine (e.g., the min_allowed_writeback, currently an LLBTree, could be an actual indexed table). The question is whether that's equally fast...but it does drop the GIL, so there's that. (And there's set_progress_handler, a hook we could use to prevent blocking gevent for too long.)
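
set_progress_handler is part of the standard sqlite3 module; a sketch of how it could be used to periodically yield to the gevent hub during long statements (the interval of 10,000 virtual-machine instructions is just a guess):

```python
import sqlite3
import gevent

conn = sqlite3.connect("cache.sqlite3")

def _yield_to_hub():
    # Let other greenlets run; returning a non-zero value would instead
    # abort the current SQL statement.
    gevent.sleep(0)
    return 0

# Called roughly every 10,000 SQLite virtual-machine instructions.
conn.set_progress_handler(_yield_to_hub, 10000)
```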

We could drop our Python-level locks and rely on transactional semantics (BEGIN IMMEDIATE) instead, which might be cleaner. That would play just as well with the GIL, but I don't know about gevent. (OTOH, these are very small operations, so it may not matter. I should profile to see how hot our lock is.)
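
A sketch of what relying on BEGIN IMMEDIATE instead of a Python lock might look like; isolation_level=None keeps sqlite3 from issuing its own implicit transactions, so the explicit BEGIN takes the write lock up front (the object_state table is again illustrative):

```python
import sqlite3

conn = sqlite3.connect("cache.sqlite3", isolation_level=None)

def update_state(zoid, state):
    # Take the database write lock immediately, instead of a Python-level lock.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "UPDATE object_state SET state = ? WHERE zoid = ?",
            (state, zoid),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```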

Disadvantages

More SQL :(

Getting data in and out would require a copy between SQL and Python. Those copies should only be transient, though; we don't need the pickle data for long.

The (efficient) implementation of Segmented LRU might be more complex?

I feel like it deserves some experimentation.

A test I wrote in 3.7 crashed in 2.7 (but not PyPy), apparently having something to do with the CFFI allocations; I've disabled pre-allocation for now until I can look more closely. (But hopefully one of two things will happen: either more people will use persistent caches and get the contiguous benefits that way, or a different LRU implementation such as sqlite will replace the custom one and we can drop CFFI altogether.)
Early results show that it can be a large win for memory usage, if we use a temporary database instead of in-memory (which has about the same performance, it seems).

More work needed to allow efficient use. It's still going to be slower at the micro level, it looks like, but the next step will be to try it at the macro level.
…sense.

This also allows for a more fair test of the (unoptimized, incomplete) SQL-based storage.

Performance numbers below; what they don't show is the memory usage. The CFFI ring takes between 80 and 215MB to do what the temp SQL storage does in about 2MB.

+-----------+----------+--------------------------------+
| Benchmark | CFFI_100 | SQL_100                        |
+===========+==========+================================+
| pop_eq    | 1.04 sec | 3.13 sec: 3.00x slower (+200%) |
+-----------+----------+--------------------------------+
| pop_ne    | 1.11 sec | 3.66 sec: 3.29x slower (+229%) |
+-----------+----------+--------------------------------+
| epop      | 947 ms   | 2.62 sec: 2.77x slower (+177%) |
+-----------+----------+--------------------------------+
| read      | 370 ms   | 1.14 sec: 3.09x slower (+209%) |
+-----------+----------+--------------------------------+
| mix       | 1.32 sec | 4.23 sec: 3.20x slower (+220%) |
+-----------+----------+--------------------------------+
@jamadden

I was able to get a version of the cache working using a sqlite temporary storage. The good news: the memory overhead is much better. Whereas a cache that stores 10MB of data in the current code takes about 40MB+ of actual memory (these are small objects so the overhead is maximized), in the SQL-based code, there's essentially 0MB of memory used (the data resides in ephemeral temporary indexed files, much like what ZEO does).

Of course we pay for this in other ways. The SQL cache itself isn't as fast as the memory cache at getting entries out, though it depends on the concurrency situation.

Python 3.7, postgresql, 5 concurrent processes reading 500 cold objects

+------------+---------+
| Type       | Time    |
+============+=========+
| no cache   | 98.9 ms |
+------------+---------+
| CFFI cache | 58.1 ms |
+------------+---------+
| SQL cache  | 42.1 ms |
+------------+---------+

Python 3.7, postgresql, 5 concurrent threads (single process) reading 500 cold objects

+------------+---------+
| Type       | Time    |
+============+=========+
| no cache   | 52.4 ms |
+------------+---------+
| CFFI cache | 79.6 ms |
+------------+---------+
| SQL cache  | 150 ms  |
+------------+---------+

That may still be fast enough, though; it needs more profiling, and more tuning may be possible. I haven't yet gotten to the point of the exercise, either, which was to speed up persistent cache loading/saving.

My plan at this point is to push one last round of test updates and then release this new persistence code, with the in-memory cache defaulting to the current CFFI ring cache (but with an undocumented option to use the SQL-based low-overhead cache in case people want to try it; it could be that the benefits of a larger cache outweigh it being slightly slower).

Then I can circle back for another round of optimizations and cleanups.

Update the docs for persistence changes, and add change note.
@jamadden jamadden merged commit a891940 into master Jun 11, 2019
@jamadden jamadden deleted the persistent-cache-efficiency branch June 11, 2019 15:21