
[WIP] Substantial improvements to persistent cache efficiency #243

Merged: 38 commits merged into master from persistent-cache-efficiency on Jun 11, 2019

Conversation

@jamadden jamadden commented Jun 4, 2019

Previously, RelStorage's persistent cache files were notoriously ineffective, achieving hit rates of only 1 to 2% --- if you were lucky. Adding a global memcache server could improve that, but part of the point of the persistent cache is to avoid needing a global memcache server in the first place. I've been working rather feverishly recently to try to improve the situation.

Good news, everybody! It appears the situation can be improved, and substantially. Here's zodbshootout's 'cold' benchmark for reading 500 objects not found in the ZODB Connection caches:

+-----------+----------------+------------------------------+
| Benchmark | MySQL No Cache | MySQL Persistent Cache       |
+===========+================+==============================+
| cold      | 152 ms         | 40.8 ms: 3.73x faster (-73%) |
+-----------+----------------+------------------------------+

In this workload, the persistent cache is able to get a 100% hit rate.

Here's another, more recent benchmark, showing both 'add' and 'cold' speed:

psycopg2-pcache: add 500 objects: Mean +- std dev: 70.0 ms +- 5.0 ms
psycopg2:        add 500 objects: Mean +- std dev: 67.0 ms +- 2.8 ms
psycopg2-pcache: read 500 cold objects: Mean +- std dev: 23.0 ms +- 2.6 ms
psycopg2:        read 500 cold objects: Mean +- std dev: 103 ms +- 2 ms

Once again, the persistent cache gets 100% hit rates and delivers roughly 4.5x better performance than going to the database server. (And that's a local server, doing nothing else, so that's likely only the floor of potential improvements.)

The Idea

The core of the idea is to be smarter about what we store, and smarter about how we read that data back into memory.

  • The in-memory cache tends to accumulate various revisions of an object. But when we're shutting down and writing to disk, we only need to store the most recent revision (for the normal case of a non-historical connection, of course; this doesn't accelerate historical connections). A sketch of this filtering follows the list.
  • The in-memory cache's "checkpoint" system means that we can use a few different cache keys to find object data. We need to save those checkpoints, and reconstitute the keys from disk in a way that will fit with those checkpoints. (You can't use the natural keys, because those are almost certainly wrong for the bulk of the objects. And you can't just shove everything into a single checkpoint because that leads to surprising conflicts. You have to keep the spread. Actually, the checkpoints concept could potentially use some more revisiting down the line.)
  • The MVCC nature of RelStorage means each connection gets its own cache object; we need to provide some of the data (e.g., the checkpoints) we read from the persistent file to each cache object so it can then begin mutating on its own.
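
To make the first point concrete, here's a minimal sketch of the "newest revision only" filtering. The (oid, tid, state) triples and the function name are illustrative, not RelStorage's actual internal API:

```python
# Hypothetical sketch: given the (oid, tid, state) entries accumulated in the
# in-memory cache, keep only the most recent revision of each object before
# writing to disk; non-historical connections never need the older ones.
def newest_revisions(entries):
    newest = {}  # oid -> (tid, state)
    for oid, tid, state in entries:
        current = newest.get(oid)
        if current is None or tid > current[0]:
            newest[oid] = (tid, state)
    return newest
```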

So this PR is working on a few things in parallel (I've left all the gnarly intermediate commits in, for now):

  • Re-structuring the in-memory cache to keep more information about its keys and values. It knows what's an OID and what's a TID, so that we can use this information later.
  • Doing what I outlined above to preserve checkpoints and delta maps in the cache files.
  • Implementing a layer that makes those checkpoints and delta maps usefully cooperate between multiple storage instances (e.g., web workers) sharing the same persistent cache. The idea is that instead of just hoping to find the best data, we actually do get the best data. And it's self-consistent.

Storage Layer

That last part has involved moving to a structured storage system instead of the ad-hoc multi-file thing we were doing before. Fortunately, Python ships with one that works great for this kind of thing: sqlite3 (as long as you have a version of sqlite from 2010 or newer, and of course the cache files shouldn't be on a network filesystem --- again, that would kind of defeat the point).
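
As a rough illustration of what "structured storage" means here, the cache file is just a normal SQLite database that the standard library can open. The table and column names below are hypothetical, not the exact schema this PR uses:

```python
import sqlite3

# Hypothetical sketch of a cache-file schema: one row of checkpoints, plus
# pickled object states keyed by (zoid, tid).
conn = sqlite3.connect("relstorage-cache.sqlite3")
conn.executescript("""
CREATE TABLE IF NOT EXISTS checkpoints (
    cp0 INTEGER NOT NULL,
    cp1 INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS object_state (
    zoid  INTEGER NOT NULL,
    tid   INTEGER NOT NULL,
    state BLOB,
    PRIMARY KEY (zoid, tid)
);
""")
conn.commit()
conn.close()
```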

Unfortunately, it's possible to produce benchmarks where this storage layer appears to perform rather poorly compared to what we did before. Here's reading and writing a 100MB persistent cache:

+-----------+--------------+--------------------------------+-------------------------------+
| Benchmark | RelStorage 2 | sqlite, all new                | sqlite, all shared            |
+===========+==============+================================+===============================+
| write     | 327 ms       | 2.36 sec: 7.22x slower (+622%) | 633 ms: 1.94x slower (+94%)   |
+-----------+--------------+--------------------------------+-------------------------------+
| read      | 828 ms       | 1.23 sec: 1.49x slower (+49%)  | 1.23 sec: 1.49x slower (+49%) |
+-----------+--------------+--------------------------------+-------------------------------+

The first column is the old code. The second column is the new code, writing the 100mb cache to disk for the first time (worst case scenario). The third column is it noticing that there's nothing at all to do and not writing anything except some maintenance checks.

We can see that writing is between 2 and 7 times slower; that ratio appears to hold regardless of the size of the data (on a default 10MB cache, the difference is roughly 200ms). But this is absolutely the worst scenario for the new code: the cache is filled with completely distinct objects with no overlap at all. Most of the time the cache will have the same object at different versions, and we account for that.

Because writing caches occurs only at shutdown, I'm hopeful that even at this performance level, adding about 2s per 100MB of cache isn't a major problem. Feedback is welcome, as always.

Reading is 50% slower. That doesn't seem too bad to get almost-guaranteed useful data, either in relative or absolute terms. And keep in mind that it was probable that the old code would actually go through that process for at least two files, as it tried to hunt for the best data. So we may wind up about the same.

This isn't particularly optimized yet; there's probably room for improvement.

There's still a big TODO list (see below) but I'm excited to get this out there for real-world testing as part of RelStorage 3.0a1 sooner rather than later, so lots of that might get deferred. To reiterate, feedback is welcome.

TODOs

  • Specific testing. There are lots of places where I left 'TODO' comments about specific needed unit tests.
  • General conflict handling testing --- I've done lots of ad-hoc things, but I need to formalize that into unit tests. In the early days I was able to easily produce conflicts, but those issues mostly got resolved fairly quickly.
  • Performance tuning (memory and speed) --- make sure this hasn't had an impact on non-persistent cache operations. On read, maybe we specifically want to only partly fill the cache instead of reading everything? Handle the pre-allocated memory for the cache nodes better.
  • Improved persistent trimming algorithm --- we need to frequency age somehow, just like the in-memory version does
  • Improved in-memory trimming algorithm --- deciding what should go out to disk can be slow (especially when, as in some benchmarks, the answer is "everything"), and it's not clear how much it buys us right now. Can we do better? Is it worth just simplifying?

/cc @jzuech3 @cutz

jamadden commented Jun 4, 2019

Having made it this far, it occurs to me that an in-memory sqlite database might make a compelling replacement for our custom Segmented LRU Cache --- that is, it might allow for a more efficient implementation of a Segmented LRU Cache combined with persistence.

(EDIT: Or indeed, a temporary database; that's expected to be in memory, but allows SQLite to page it out to disk itself as needed.)
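
For reference, the difference between the two is just the connection string. An empty filename gives a temporary database that is private to the connection and can be paged out to disk by SQLite under memory pressure, while ":memory:" never leaves RAM:

```python
import sqlite3

# ":memory:" is strictly RAM; an empty filename gives a temporary database
# that SQLite can spill to disk itself as needed, and that is deleted when
# the connection closes.
in_memory = sqlite3.connect(":memory:")
temporary = sqlite3.connect("")
```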

Advantages

More SQL :)

Reading in and writing out should both benefit by attaching databases. Whereas the current code took 11s to write a 500MB cache and 5s to read it, doing the same with SQL copies took 4s and 1s, respectively (obviously that's rough; there's more overhead to account for).
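
The "SQL copies" here are essentially ATTACH DATABASE plus a single INSERT ... SELECT, so rows never round-trip through Python objects. A rough sketch, reusing the illustrative object_state table from above (not the actual schema):

```python
import sqlite3

# The working cache lives in a temporary database while the process runs.
mem = sqlite3.connect("")
mem.execute("CREATE TABLE object_state (zoid INTEGER, tid INTEGER, state BLOB)")

# ... populate and use the cache ...

# At shutdown, attach the on-disk cache file and copy everything across in
# one statement instead of shuttling each row through Python.
mem.execute("ATTACH DATABASE 'relstorage-cache.sqlite3' AS disk")
mem.execute(
    "CREATE TABLE IF NOT EXISTS disk.object_state (zoid INTEGER, tid INTEGER, state BLOB)"
)
with mem:  # commit the bulk copy as a single transaction
    mem.execute("INSERT INTO disk.object_state SELECT zoid, tid, state FROM main.object_state")
mem.execute("DETACH DATABASE disk")
```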

By keeping the data in the SQL space, we avoid the need for all the little objects (and the large allocations and surrounding questions) that make #186 so annoying (we should expect improved GC times, maybe even better processor cache locality).

By dropping our native code, we fix #184.

Unsure

Presumably much of the bookkeeping I'm doing in Python could be done in the SQLite virtual machine (e.g., the min_allowed_writeback, currently an LLBTree, could be an actual indexed table). The question is whether that's equally fast...but it does drop the GIL, so there's that. (And there's set_progress_handler, a hook we could use to prevent blocking gevent for too long.)
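
set_progress_handler is part of the standard sqlite3 module; a sketch of how it could be used to periodically yield to the gevent hub during long statements (the interval of 10,000 virtual-machine instructions is just a guess):

```python
import sqlite3
import gevent

conn = sqlite3.connect("cache.sqlite3")

def _yield_to_hub():
    # Let other greenlets run; returning a non-zero value would instead
    # abort the current SQL statement.
    gevent.sleep(0)
    return 0

# Called roughly every 10,000 SQLite virtual-machine instructions.
conn.set_progress_handler(_yield_to_hub, 10000)
```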

We could drop our Python-level locks and rely on transactional semantics (BEGIN IMMEDIATE) instead, which might be cleaner. That would play just as well with the GIL, but I don't know about gevent. (OTOH, these are very small operations, so it may not matter. I should profile to see how hot our lock is.)
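
A sketch of what relying on BEGIN IMMEDIATE instead of a Python lock might look like; isolation_level=None keeps sqlite3 from issuing its own implicit transactions, so the explicit BEGIN takes the write lock up front (the object_state table is again illustrative):

```python
import sqlite3

conn = sqlite3.connect("cache.sqlite3", isolation_level=None)

def update_state(zoid, state):
    # Take the database write lock immediately, instead of a Python-level lock.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "UPDATE object_state SET state = ? WHERE zoid = ?",
            (state, zoid),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```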

Disadvantages

More SQL :(

Getting data in and out would require a copy between SQL and Python. Those copies should only be transient, though; we don't need the pickle data for long.

The (efficient) implementation of Segmented LRU might be more complex?

I feel like it deserves some experimentation.

A test I wrote in 3.7 crashed in 2.7 (but not PyPy), apparently having something to do with the CFFI allocations; I've disabled pre-allocation for now until I can look more closely. (But hopefully one of two things will happen: either more people will use persistent caches and get the contiguous benefits that way, or a different LRU implementation such as sqlite will replace the custom one and we can drop CFFI altogether.)
Early results show that it can be a large win for memory usage, if we use a temporary database instead of in-memory (which has about the same performance, it seems).

More work needed to allow efficient use. It's still going to be slower at the micro level, it looks like, but the next step will be to try it at the macro level.
…sense.

This also allows for a more fair test of the (unoptimized, incomplete) SQL-based storage.

Performance numbers below; what they don't show is the memory usage. The CFFI ring takes between 80 and 215MB to do what the temp SQL storage does in about 2MB.

+-----------+----------+--------------------------------+
| Benchmark | CFFI_100 | SQL_100                        |
+===========+==========+================================+
| pop_eq    | 1.04 sec | 3.13 sec: 3.00x slower (+200%) |
+-----------+----------+--------------------------------+
| pop_ne    | 1.11 sec | 3.66 sec: 3.29x slower (+229%) |
+-----------+----------+--------------------------------+
| epop      | 947 ms   | 2.62 sec: 2.77x slower (+177%) |
+-----------+----------+--------------------------------+
| read      | 370 ms   | 1.14 sec: 3.09x slower (+209%) |
+-----------+----------+--------------------------------+
| mix       | 1.32 sec | 4.23 sec: 3.20x slower (+220%) |
+-----------+----------+--------------------------------+
@jamadden

I was able to get a version of the cache working using a sqlite temporary storage. The good news: the memory overhead is much better. Whereas a cache that stores 10MB of data in the current code takes about 40MB+ of actual memory (these are small objects so the overhead is maximized), in the SQL-based code, there's essentially 0MB of memory used (the data resides in ephemeral temporary indexed files, much like what ZEO does).

Of course we pay for this in other ways. The SQL cache itself isn't as fast as the memory cache at getting entries out, though it depends on the concurrency situation.

Python 3.7, postgresql, 5 concurrent processes reading 500 cold objects

+------------+---------+
| Type       | Time    |
+============+=========+
| no cache   | 98.9 ms |
+------------+---------+
| CFFI cache | 58.1 ms |
+------------+---------+
| SQL cache  | 42.1 ms |
+------------+---------+

Python 3.7, postgresql, 5 concurrent threads (single process) reading 500 cold objects

+------------+---------+
| Type       | Time    |
+============+=========+
| no cache   | 52.4 ms |
+------------+---------+
| CFFI cache | 79.6 ms |
+------------+---------+
| SQL cache  | 150 ms  |
+------------+---------+

That may still be fast enough, though; it needs more profiling, and more tuning may be possible. I haven't yet gotten to the point of the exercise, either, which was to speed up persistent cache loading/saving.

My plan at this point is to push one last round of test updates and then release this new persistence code, with the in-memory cache defaulting to the current CFFI ring cache (but with an undocumented option to use the SQL-based low-overhead cache in case people want to try it; it could be that the benefits of a larger cache outweigh it being slightly slower).

Then I can circle back for another round of optimizations and cleanups.

Update the docs for persistence changes, and add change note.
@jamadden jamadden merged commit a891940 into master Jun 11, 2019
@jamadden jamadden deleted the persistent-cache-efficiency branch June 11, 2019 15:21