
Restructure the cache to a windowed-LFU admission policy on top of a segmented LRU #128

Merged
merged 87 commits into master from w-tinylru-cffi
Sep 26, 2016

Conversation

jamadden
Member

See #127 for information on the motivation.

Benchmarks and simulations are in https://github.com/zodb/relstorage/blob/w-tinylru-cffi/relstorage/cache/micro_benchmark_results.rst Long story short, this code is always faster, and almost always produces better hit rates, sometimes much better (and when the rates aren't better, they're comparable).

CFFI is now required, not optional (though given how poorly the non-CFFI cache performed, it was effectively required before). I'll plan to distribute OS X and manylinux wheels. We haven't had any requests from Windows users about a wheel before, but if we get them we can use AppVeyor to generate Windows wheels (separate PR, this one's big enough!).

This PR is not ready to merge yet; I've got at least another day of cleanup to do on it. But I wanted to open it up now for any comments or concerns on the general direction.

There's a lot to look through here, but I tried to keep each commit as close to single-purpose as I could.

…load, and write down some numbers from my machine before I get started. I suspect compression might dominate...
I still probably need to segment the main portion into protected/probation segments
to get the right behaviour.

The code is unoptimized but all tests pass (with minor modifications to
account for the frequency-based preservation, which is actually kind of
cool---although we will ultimately need more specific tests).

The mixed micro-benchmark is actually faster. Populating is notably
slower. Reading is basically unchanged (slightly slower).

Start using the frequency information to avoid writing out objects that
haven't been accessed in a long time. We should probably run an aging pass
over these objects before we write to make it more accurate.
… (maybe a bug? All tests pass though), but everything else slowed down, including the mixed workload.
Age the entries before we write them and only include those that survive
the aging process (see the aging sketch after this commit list). Also stop
writing the cache file when we've written enough to fill up the protected
generation, since that's what we store into (80% of the cache size by default).
…pre-populate the protected segment directly from the eden segment; this is a modification primarily for the sake of the existing microbenchmarks and may go away.
…lem, although it introduces a test regression that I'm not going to debug right now.
…ng the fact that we're filling protected first
…sions that order dicts. PyPy is stupid fast.
Tests pass for me, but won't anywhere else because I modified the
persistent ring.[ch] locally. If that's going to be a necessary step
then there are some other questions to answer (i.e., do we need our own
custom CFFI extension?)
Current timings:

epop average 4.351013101666467 stddev 0.04419633374727681
mix  average 5.273094533001616 stddev 0.056560372799909574
pop  average 3.9593097216663105 stddev 0.08276509526733669
read average 1.076361087335196 stddev 0.014496181325770149

epop is slightly worse than before any changes (3.8). mix and pop are
both better (6.0 and 4.8, respectively). Read is still slightly
worse (0.77, but just tracking frequency took us to 0.86).
And do some minor attribute caching.

All times improved:

epop average 4.179437022665904 stddev 0.02725543525889035
mix  average 4.767989358665848 stddev 0.029150437240123107
pop  average 3.847747696330771 stddev 0.05402757351317516
read average 0.924939800335172 stddev 0.008914988112135936
This avoids CFFI overhead of creating handles and Ring pointers.

Current times all improved or stayed the same; we are now beating the
original code on *all* timings (except read):

epop average 3.7955654890004857 stddev 0.045522296233750786
mix  average 4.36776723600633 stddev 0.040494291535130046
pop  average 3.503353017338668 stddev 0.08680920255355108
read average 0.939945668001504 stddev 0.013126699924166357
Current times are:
epop average 3.5930002473372347 stddev 0.046088702157717376
mix  average 4.21050848266168 stddev 0.04612077189551625
pop  average 3.6549476809983994 stddev 0.4561708743548922
read average 0.9901350849977462 stddev 0.0089
This lets us see the effect of key rolling and checkpoints.

Include benchmark numbers (the LRU times still need to be updated;
they're from an earlier revision that didn't use memory efficiently).
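
(Illustration only, not code from this branch: the aging pass mentioned in the commit messages above typically amounts to halving every frequency counter and dropping entries whose counter falls to zero, roughly as in this sketch; `frequencies` is assumed to be a plain dict of key -> counter, and only the survivors would be written to the persistent cache file.)

def age(frequencies):
    # Halve every counter; forget keys whose counter reaches zero.
    # This keeps counters from growing without bound and lets the cache
    # adapt when the workload shifts.
    for key, freq in list(frequencies.items()):
        freq //= 2
        if freq:
            frequencies[key] = freq
        else:
            del frequencies[key]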
@jamadden
Member Author

In c80c885 I adapted the storage simulations to use the true cache implementation that RelStorage uses (it's a layer on top of this list), which is where the rolling takes place. By default, the rolling happens every 10,000 write operations. It's interesting to see the effect this has on cache hits: in some of the simulations they go up, in others they go down.

In the non-rolling simulations, this implementation was losing (slightly) on 3 of the write-heavy workloads. In the rolling simulations, it loses on all 6 of the write-heavy workloads---even in a case where the hit ratio went up! In one of the cases it goes far down.

I can probably produce a trace file containing the reads and writes we do to the cache during these simulations.
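
(For illustration only, not RelStorage's actual code: a minimal sketch of what the rolling means for the simulations. `cache` is assumed to be any dict-like cache, and the `generation` counter is a made-up stand-in for the checkpoint portion of the real keys.)

ROLL_INTERVAL = 10000  # rolling happens every 10,000 writes by default

class RollingKeySimulation:
    def __init__(self, cache):
        self.cache = cache        # any dict-like cache
        self.writes = 0
        self.generation = 0       # stand-in for the checkpoint part of the key

    def _key(self, oid):
        return (self.generation, oid)

    def read(self, oid):
        return self.cache.get(self._key(oid))

    def write(self, oid, value):
        self.cache[self._key(oid)] = value
        self.writes += 1
        if self.writes % ROLL_INTERVAL == 0:
            # After a roll, previously cached keys no longer match new lookups.
            self.generation += 1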

@ben-manes

Note that the Financial (OLTP) traces are the worst case for a frequency strategy. They appear to be disk traces from the database commit log, where the changes are stored and then replayed to catch up. The database's in-memory cache would have been more interesting to trace, and is probably large enough to service most accesses. That means we only see a few extra operations on a miss, and LRU is pretty close to optimal.

So I think the OLTP traces are good for finding weaknesses in the algorithm, but might not be representative of most workloads.

@coveralls

coveralls commented Sep 23, 2016


Coverage increased (+0.06%) to 95.697% when pulling 5217d92 on w-tinylru-cffi into 1b39101 on master.

@jamadden
Member Author

Here are the traces from the OLTP runs. The keys reset 201 times in Financial1 and 26 times in Financial2.

https://s3.amazonaws.com/dev.nextthought.com/traces-Financial1.tar.xz
https://s3.amazonaws.com/dev.nextthought.com/traces-Financial2.tar.xz

@ben-manes

Thanks! I'll give it a run tonight. This is one of the harder traces for TinyLFU, though, and I don't think it matches most application workloads.

I do think it's important to evaluate based on RelStorage's workload. So a trace of cache accesses from a production-facing machine would be ideal.

@jamadden
Member Author

I agree that the OLTP traces are an unlikely workload, and it makes sense that they're a difficult case.

I agree that it would be good to run trace files from an actual RelStorage instance, but I don't have any 😿. Cache tracing is a very new feature and I don't know of anyone who's using it in production yet. I could probably get it turned on in a test environment, but that's not a realistic environment (at all).

In the meantime, I added the trace files from the cache2k project. The ORM ones for an e-commerce site should be substantially more similar to a RelStorage workload.

I added proof-of-concept code to save the frequency information from evicted nodes, and restore it if we get the key again (which we do on the small workloads). I've only run a handful of simulations, but so far that isn't making any appreciable difference in any of the workloads at any cache size.
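
(Not the proof-of-concept itself, just a sketch of the idea: keep a bounded side table of frequencies for evicted keys and seed a re-added entry from it. The class and method names here are made up; a newly added entry's starting frequency would then be something like max(1, history.restore(key)).)

from collections import OrderedDict

class EvictionHistory:
    """Remember the frequency of recently evicted keys, bounded FIFO-style."""

    def __init__(self, limit=10000):
        self.limit = limit
        self._freqs = OrderedDict()

    def note_eviction(self, key, frequency):
        self._freqs[key] = frequency
        if len(self._freqs) > self.limit:
            self._freqs.popitem(last=False)   # drop the oldest remembered key

    def restore(self, key):
        # Return the remembered frequency (and forget it), or 0 if unknown.
        return self._freqs.pop(key, 0)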

@ben-manes

I don't know enough about the source of cache2k's traces and have my concerns. Regardless, they are also recency-skewed, and from the names they appear to come from an ORM's session cache. That means the entries are very short-lived and flushed when the transaction completes. This is why LRU is pretty close to optimal in that workload, too.

I don't mean to sound like I'm making excuses to invalidate some traces. All data is good, and problems help identify where to make improvements. It's just that those two are very special-case scenarios, and LRU being close to optimal on them makes sense. So a RelStorage trace would be interesting, because I'd expect frequency to play a large role normally, but recency to spike when the rollover occurs. That's kind of a cool new workload to play with when trying to design an adaptive algorithm.

I'm hoping to get a trace from Facebook soon, though I'm not clear what the workload is. They've got a cache with 14M entries taking 16 GB of JVM heap and a 93% hit rate. Dropping that down to 10M entries still kept a high hit rate (they didn't say what). I thought a trace would hint at what a good cache size is so they don't oversize it. Supposedly the cache is reducing database CPU by 50%, which they're happy about.

@ben-manes

I'm sorry, but I am not sure what to do with these files. Can you point me to the format specification so that I can write a parser and feed them into the simulator?

Document some more info about the types of the workloads and how they
respond to generation sizes.
@coveralls


Coverage increased (+0.06%) to 95.699% when pulling 47cb68b on w-tinylru-cffi into 1b39101 on master.

@jamadden
Member Author

I'm sorry, but I am not sure what to do with these files. Can you point me to the format specification so that I can write a parser and feed them into the simulator?

Sorry, I missed that comment or I would have gotten back to you sooner. There's no spec; it's just an ad-hoc dump, similar in spirit to the storage traces. It's a comma-separated list of operations in time order. The first field is the opcode, r or w, indicating a read from or write to the cache. The second field is the string form of the cache key (which itself is structured around colons, but you can probably ignore that). The third field is the size of the entry; it's always -1 for reads because I don't have that information. (Size matters for the RelStorage use case because the cache is bounded by the byte size of the entries, which varies, not by a fixed entry count.)
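
(In Python terms, a minimal parser for one line of that format would be something like the following; not code from this branch.)

def parse_trace_line(line):
    # e.g. "r,:state:2:55596,-1" -> ('r', ':state:2:55596', -1)
    opcode, key, size = line.rstrip('\n').split(',')
    return opcode, key, int(size)   # size is -1 for reads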

@jamadden
Member Author

jamadden commented Sep 26, 2016

I've also run a comparison of this code versus the old code in the zodbshootout synthetic benchmark. While this is not a realistic workload, it does demonstrate the common scenario of a BTree-based index or container, where the data in the nodes of the BTree changes frequently but the BTree itself changes less often.

With a 10MB cache size but roughly 12MB of data stored in BTrees, the old code gets a meager hit ratio of 0.00035 (13 hits against 37,209 misses) and results of:

add 2.6156s, update 3.4087s, warm 2.0236s, cold 1.4378s, hot 1.4761s

The new code gets a much nicer hit ratio of 0.248 (66,665 hits and 202,008 misses) and results of:

add 2.1291s, update 2.5819s, warm 0.3759s, cold 1.4691s, hot 0.9331s

Overall, better results, especially in the warm case: the old code would have pushed out most of the BTree nodes in favor of the data produced by the update step, but the new code recognizes their frequency and keeps them, so there are many fewer backend database hits.
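
(For reference, the quoted ratios are just hits / (hits + misses):)

old_ratio = 13 / (13 + 37209)           # ~0.00035
new_ratio = 66665 / (66665 + 202008)    # ~0.248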

Unfortunately, I discovered a crashing bug as a result of the load test, which I have to debug first.

@ben-manes

I did a quick run of two traces. The adaptive approach is experimental; it corrects small, recency-skewed traces by using a Bloom filter as feedback on whether the policy made a misprediction. If so, it increases the window size; otherwise it shrinks it.

max-size = 512

Policy Financial1.spc.1024.ctrace Financial2.spc.1024.ctrace
ARC 20.29 27.75
LIRS 17.55 25.57
LRU 18.15 22.70
SLRU 14.66 25.20
S4LRU 15.36 26.29
Optimal 32.90 41.61
Unbounded 24.48 76.16
W-TinyLFU 1% 14.07 26.58
W-TinyLFU 10% 17.18 29.50
Adaptive W-TinyLFU 18.43 29.59
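
(A rough sketch of that feedback loop, assuming the filter remembers candidates the admission policy previously rejected; the names are illustrative and a plain set could stand in for the Bloom filter.)

def record_miss(key, window, rejected_filter, step=1):
    if key in rejected_filter:
        # We rejected this key earlier and now it's wanted again: a
        # misprediction, so favor recency by growing the admission window.
        window.grow(step)
    else:
        # Predictions look fine; slowly shrink the window to favor frequency.
        window.shrink(step)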

Below is the Java code I wrote for your review. Let me know if I did anything wrong.

public final class RelStorageTraceReader extends TextTraceReader {

  public RelStorageTraceReader(List<String> filePaths) {
    super(filePaths);
  }

  @Override
  public LongStream events() throws IOException {
    return lines()
        .filter(line -> line.charAt(0) == 'r')
        .map(line -> line.split(":", 3))
        .filter(part -> part[1].equals("state"))
        .map(parts -> parts[2])
        .map(key -> key.split(",", 2)[0])
        .mapToLong(key -> Hashing.murmur3_128().hashString(key, UTF_8).asLong());
  }
}

…big one; don't put big ones in that won't fit.
@coveralls

coveralls commented Sep 26, 2016


Coverage increased (+0.1%) to 95.749% when pulling 46ab276 on w-tinylru-cffi into 1b39101 on master.

@jamadden
Member Author

Thanks for running that. Very interesting numbers. The Bloom filter as a recency corrector is an interesting idea, and it's neat to see that it has an effect in one case and not the other. (I had also manually simulated different window sizes and noticed a dramatic difference.) But not enough of an effect that I expect to try to implement it for this PR.

Regarding the parsing code, if I'm reading it correctly, it's first splitting on ':', filtering that, only passing on the third element (index 2), and only then splitting that on ','. But that would wind up with the wrong key.

>>> 'r,:state:2:55596,-1'.split(':', 3)[2].split(',')[0]
'2'

It's CSV first, and the whole second field is the key. (And the 'checkpoint' keys are important; we read and write them fairly frequently.)

If I were to write a Python iterator to turn the string keys into hashes, it would look like this:

def iter_read_keys(lines):
    # Yield a hash for each key that is read from the cache.
    for line in lines:
        if not line.startswith('r'):
            continue
        opcode, key, size = line.split(',')
        yield hash(key)

Here's a long section about writes that you can skip because it turns out to be theoretical:

Something I notice is that you're discarding all the writes. I imagine you're filtering away writes because the cache "auto-populates", and with a fixed number of entries the write size is irrelevant. The RelStorage cache always treats writing to a key as a "hit": it moves the entry to the most-recently-used position (possibly promoting it from probation to protected) and increments its frequency. It was this way when I got here, and I thought it made sense for RelStorage: the cache is shared, and sits underneath a higher-level cache (which is not shared). Writing to a key invalidates the higher-level cache for all the clients, so they'll have to read it again from the RelStorage cache when they need it, and so we shouldn't take a chance on it being evicted.
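
(In code terms, the current write handling is roughly this sketch; the attribute and method names are illustrative, not the real ones.)

def on_write(cache, entry, value):
    # A write is treated like a hit: update the value, bump the frequency,
    # and move the entry to the MRU position (possibly promoting it from
    # probation to protected).
    entry.value = value
    entry.frequency += 1
    cache.move_to_mru(entry)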

But maybe, while that made sense for a pure LRU cache, having the frequency information now will let us make a smarter decision: even if we don't move it to the MRU position or promote it, and just update the frequency, we're at less risk of bumping it out of the cache; if it's really needed again, it'll be requested, and that will move it to the MRU position and/or promote it.

All that said, that's pretty much theoretical and only of interest for the simulations. When an object is written to, part of its key (the transaction id portion) changes (that's how the database implements multi-version concurrency control). With the exception of that one checkpoints key, we never actually write to an existing key.

@ben-manes

The actual reason I disregard writes is a little less intelligent than you're theorizing. I agree writes should be viewed as a hit and moved to the MRU position. This is done in my cache, too.

The original traces I had were key hashes, e.g. the LIRS traces, so there was no distinction of a read or write. The simulator's Policy interface has a record(key) method to simulate an auto-populating get. That updates the hit/miss counters for the report. The trace is then a stream of long keys.

A write, as in the UMass or ARC traces, would need a new policy API, as it's not a hit or miss but an insertion. Then the stream needs to provide an operation code, which means either I steal the sign bit or use a wrapper object (creating a little GC pressure). If I instead treated it as an access, it adds frequency if the key is immediately reused, but it hurts the miss rate if the key is never read again. So blindly passing it into record is probably wrong.

For the simplicity of not changing the APIs or thinking it through too much, I filtered writes out in all the traces. I'm not sure what makes the most sense, and I get decent insight from the current behavior. I'd be fine changing that if we talk through it a little more.

I'll take another stab this evening.

@jamadden
Member Author

Thanks for all your help and discussion! I'm more than happy to continue discussing and I welcome any further insights.

However, this PR is a blocker for some other work I need to do, and I'm happy enough with its state for the moment (I'm basically convinced there won't be any major performance regressions), so I'm going to go ahead and merge it.
