[WIP] Substantial improvements to persistent cache efficiency #243
Conversation
Don't use strings, use the native ints. Only when we go to externalize for memcache do we need to go to byte strings. This lets us have better knowledge of what's in the local client.
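A minimal sketch of the distinction, with hypothetical helper names (not RelStorage's actual API):

```python
# Hypothetical sketch: keys stay native integer pairs inside the local
# client; they become byte strings only at the memcache boundary.

def local_key(oid_int, tid_int):
    # Native ints are cheap to hash and compare, and preserve exact
    # knowledge of what's in the local client.
    return (oid_int, tid_int)

def memcache_key(oid_int, tid_int):
    # Externalize only when talking to memcache, which wants bytes.
    return ('%d:%d' % (oid_int, tid_int)).encode('ascii')
```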
…here. And move its cleanup to a background.
… not store things it can know are stale.
This should also fix a pylint error.
Having made it this far, it occurs to me that an in-memory SQLite database might make a compelling replacement for our custom Segmented LRU Cache --- that is, it might allow for a more efficient implementation of a Segmented LRU Cache combined with persistence. (EDIT: Or indeed, a temporary database; that's expected to be in memory, but allows SQLite to page it out to disk itself as needed.)

Advantages

- More SQL :)
- Reading in and writing out should both benefit by attaching databases. Whereas the current code took 11s to write a 500MB cache, and 5s to read it, doing the same with SQL copies took 4s and 1s, respectively (obviously that's rough; there's more overhead to account for).
- By keeping the data in the SQL space, we avoid the need for all the little objects (and the large allocations and surrounding questions) that make #186 so annoying (we should expect improved GC times, maybe even better processor cache locality).
- By dropping our native code, we fix #184.

Unsure

- Presumably much of the bookkeeping I'm doing in Python could be done in the SQLite virtual machine (e.g., the …).
- We could drop our Python-level locks and rely on transactional semantics (…).

Disadvantages

- More SQL :(
- Getting data in and out would require a copy between SQL and Python. Those should only be transient, though; we don't need the pickle data for long.
- The (efficient) implementation of Segmented LRU might be more complex?

I feel like it deserves some experimentation.
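To make the "attaching databases" point concrete, here's a minimal sketch (the table and file names are invented for the example, not taken from this PR): persisting an in-memory cache becomes a single ATTACH plus a bulk INSERT ... SELECT that SQLite executes entirely in its own space.

```python
import sqlite3

# Illustrative only: persist an in-memory cache database by attaching
# a disk file and letting SQLite do the bulk copy in one statement.
mem = sqlite3.connect(':memory:')
mem.execute('CREATE TABLE object_state (zoid INTEGER, tid INTEGER, state BLOB)')
# ... populate `mem` during normal operation ...

mem.commit()  # ATTACH cannot run inside an open transaction
mem.execute("ATTACH DATABASE 'cache.sqlite3' AS disk")
mem.execute('CREATE TABLE IF NOT EXISTS disk.object_state'
            ' (zoid INTEGER, tid INTEGER, state BLOB)')
mem.execute('INSERT INTO disk.object_state SELECT * FROM object_state')
mem.commit()
mem.execute('DETACH DATABASE disk')
```

Reading the cache back in at startup would be the mirror image: attach the file and copy in the other direction.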
Also some logging around the pragmas we set.
The file was way too big.
Refactor and simplify after_poll somewhat and add copious comments.
A test I wrote on Python 3.7 crashed on 2.7 (but not PyPy), apparently having something to do with the CFFI allocations; I've disabled pre-allocation for now until I can look more closely. (But hopefully one of two things will happen: either more people will use persistent caches and get the contiguous-allocation benefits that way, or a different LRU implementation such as sqlite will replace the custom one and we can drop CFFI altogether.)
… testing the sqlite implementation.
…to make it truly tenable
Early results show that it can be a large win for memory usage, if we use a temporary database instead of in-memory (which has about the same performance, it seems). More work needed to allow efficient use. It's still going to be slower at the micro level, it looks like, but the next step will be to try it at the macro level.
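For reference, the difference between the two is just the connection string (standard sqlite3 behavior, not code from this PR): ':memory:' keeps everything resident, while an empty filename asks SQLite for a private temporary database that it may page out to disk itself under memory pressure.

```python
import sqlite3

in_memory = sqlite3.connect(':memory:')  # always resident in RAM
temporary = sqlite3.connect('')  # private temp db; SQLite may spill it to disk
```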
…sense. This also allows for a fairer test of the (unoptimized, incomplete) SQL-based storage. Performance numbers below; what they don't show is the memory usage. The CFFI ring takes between 80 and 215MB to do what the temp SQL storage does in about 2MB.

```
+-----------+----------+--------------------------------+
| Benchmark | CFFI_100 | SQL_100                        |
+===========+==========+================================+
| pop_eq    | 1.04 sec | 3.13 sec: 3.00x slower (+200%) |
+-----------+----------+--------------------------------+
| pop_ne    | 1.11 sec | 3.66 sec: 3.29x slower (+229%) |
+-----------+----------+--------------------------------+
| epop      | 947 ms   | 2.62 sec: 2.77x slower (+177%) |
+-----------+----------+--------------------------------+
| read      | 370 ms   | 1.14 sec: 3.09x slower (+209%) |
+-----------+----------+--------------------------------+
| mix       | 1.32 sec | 4.23 sec: 3.20x slower (+220%) |
+-----------+----------+--------------------------------+
```
I was able to get a version of the cache working using sqlite temporary storage. The good news: the memory overhead is much better. Whereas a cache that stores 10MB of data in the current code takes about 40MB+ of actual memory (these are small objects, so the overhead is maximized), in the SQL-based code there's essentially 0MB of memory used (the data resides in ephemeral temporary indexed files, much like what ZEO does). Of course we pay for this in other ways. The SQL cache itself isn't as fast as the memory cache at getting entries out, though it depends on the concurrency situation.

Python 3.7, postgresql, 5 concurrent processes reading 500 cold objects
[benchmark output not preserved]

Python 3.7, postgresql, 5 concurrent threads (single process) reading 500 cold objects
[benchmark output not preserved]
That may still be fast enough, though; it needs more profiling, and more tuning may be possible. I haven't gotten to the point of the exercise, either, which was to speed up persistent cache loading/saving. My plan at this point is to push one last round of test updates and then release this new persistence code, with the in-memory cache defaulting to the current CFFI ring cache (but with an undocumented option to use the SQL-based low-overhead cache in case people want to try it; it could be that the benefits of a larger cache outweigh it being slightly slower). Then I can circle back for another round of optimizations and cleanups.
Update the docs for persistence changes, and add change note.
Previously, the hit rates achievable by RelStorage's persistent cache files were notoriously poor --- 1 to 2%, if you were lucky. If you threw a global memcache server in there, they could be better, but part of the point was to be able to avoid a global memcache server in the first place. I've been working rather feverishly recently to try to improve the situation.
Good news, everybody! It appears the situation can be improved, and substantially. Here's zodbshootout's 'cold' benchmark for reading 500 objects not found in the ZODB Connection caches:

[benchmark output not preserved]
In this workload, the persistent cache is able to get a 100% hit rate.
Here's another, more recent benchmark, showing both 'add' and 'cold' speed:

[benchmark output not preserved]
Once again, the persistent cache gets 100% hit rates and delivers nearly 5x better performance than going to the database server. (And that's a local server, doing nothing else, so that's likely only the floor of potential improvements.)
The Idea
The core of the idea is to be smarter about what we store, and smarter about how we read that data back into memory.
So this PR is working on a few things in parallel (I've left all the gnarly intermediate commits in, for now):
Storage Layer
That last part has involved moving to a structured storage system instead of the ad-hoc multi-file thing we were doing before. Fortunately, Python ships with one that works great for this kind of thing: sqlite3 (as long as you have a version of sqlite from 2010 or newer, and of course the cache files shouldn't be on a network filesystem --- again, that would kind of defeat the point).
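As a sketch of what such a structured cache file might look like (the schema and pragmas here are invented for illustration; the PR's actual schema may differ):

```python
import sqlite3

# Illustrative schema: one row per cached object, keyed by oid.
conn = sqlite3.connect('relstorage-cache.sqlite3')
# For a local, expendable cache, durability can be traded for speed.
conn.execute('PRAGMA journal_mode = WAL')
conn.execute('PRAGMA synchronous = OFF')
conn.executescript("""
    CREATE TABLE IF NOT EXISTS object_state (
        zoid  INTEGER PRIMARY KEY,
        tid   INTEGER NOT NULL,
        state BLOB
    );
""")
```

(WAL mode is also one concrete reason the cache files shouldn't live on a network filesystem.)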
Unfortunately, it's possible to produce benchmarks where this storage layer appears to perform rather poorly compared to what we did before. Here's reading and writing a 100MB persistent cache:

[benchmark output not preserved]
The first column is the old code. The second column is the new code, writing the 100mb cache to disk for the first time (worst case scenario). The third column is it noticing that there's nothing at all to do and not writing anything except some maintenance checks.
We can see that writing is between 2 and 7 times slower; this ratio appears to hold regardless of the size of the data (on a default 10MB cache, the difference is roughly 200ms). But this is absolutely the worst scenario for the new code: the cache is filled with completely distinct objects with no overlap at all. Most of the time the cache will have the same object at different versions, and we account for that.
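A hedged illustration of that accounting, reusing the invented schema above (not the PR's actual code): only write rows whose tid is newer than what the file already holds, so re-writing a mostly-unchanged cache does very little work.

```python
def write_newer_entries(conn, entries):
    # entries: iterable of (zoid, tid, state) tuples from the in-memory
    # cache. Skip rows the file already has at the same or a newer tid.
    for zoid, tid, state in entries:
        row = conn.execute(
            'SELECT tid FROM object_state WHERE zoid = ?', (zoid,)).fetchone()
        if row is None or row[0] < tid:
            conn.execute(
                'INSERT OR REPLACE INTO object_state (zoid, tid, state)'
                ' VALUES (?, ?, ?)', (zoid, tid, state))
    conn.commit()
```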
Because writing caches occurs only at shutdown, I'm hopeful that even at this performance level, adding 2s per 100MB of cache isn't a major problem? Feedback is welcome, as always.
Reading is 50% slower. That doesn't seem too bad to get almost-guaranteed useful data, either in relative or absolute terms. And keep in mind that it was probable that the old code would actually go through that process for at least two files, as it tried to hunt for the best data. So we may wind up about the same.
This isn't particularly optimized yet; there's probably room for improvement.
There's still a big TODO list (see below) but I'm excited to get this out there for real-world testing as part of RelStorage 3.0a1 sooner rather than later, so lots of that might get deferred. To reiterate, feedback is welcome.
TODOs
/cc @jzuech3 @cutz