
Restructure the cache to a windowed-LFU admission policy on top of a segmented LRU #128

Merged
merged 87 commits into master from w-tinylru-cffi
Sep 26, 2016

Conversation

jamadden
Member

See #127 for information on the motivation.

Benchmarks and simulations are in https://github.com/zodb/relstorage/blob/w-tinylru-cffi/relstorage/cache/micro_benchmark_results.rst Long story short, this code is always faster, and almost always produces better hit rates, sometimes much better (and when the rates aren't better, they're comparable).

CFFI is now required, not optional (though given how poorly the non-CFFI cache performed, it was effectively required before). I'll plan to distribute OS X and manylinux wheels. We haven't had any requests from Windows users about a wheel before, but if we get them we can use AppVeyor to generate Windows wheels (separate PR, this one's big enough!).

This PR is not ready to merge yet; I've got at least another day of cleanup to do on it. But I wanted to open it up now for any comments or concerns on the general direction.

There's a lot to look through here, but I tried to keep each commit as close to single-purpose as I could.

…load, and write down some numbers from my machine before I get started. I suspect compression might dominate...
I still probably need to segment the main portion into protected/probation segments
to get the right behaviour.

The code is unoptimized but all tests pass (with minor modifications to
account for the frequency-based preservation, which is actually kind of
cool---although we will ultimately need more specific tests).

The mixed micro-benchmark is actually faster. Populating is notably
slower. Reading is basically unchanged (slightly slower).

Start using the frequency information to avoid writing out objects that
haven't been accessed in a long time. We should probably run an aging pass
over these objects before we write to make it more accurate.
… (maybe a bug? All tests pass though), but everything else slowed down, including the mixed workload.
Age the entries before we write them and only include those that survive
the aging process (see the aging sketch after this commit list). Also stop
writing the cache file when we've written enough to fill up the protected
generation, since that's what we store into (80% of the cache size by default).
…pre-populate the protected segment directly from the eden segment; this is a modification primarily for the sake of the existing microbenchmarks and may go away.
…lem, although it introduces a test regression that I'm not going to debug right now.
…ng the fact that we're filling protected first
…sions that order dicts. PyPy is stupid fast.
Tests pass for me, but won't anywhere else because I modified the
persistent ring.[ch] locally. If that's going to be a necessary step
then there are some other questions to answer (i.e., do we need our own
custom CFFI extension?)
Current timings:

epop average 4.351013101666467 stddev 0.04419633374727681
mix  average 5.273094533001616 stddev 0.056560372799909574
pop  average 3.9593097216663105 stddev 0.08276509526733669
read average 1.076361087335196 stddev 0.014496181325770149

epop is slightly worse than before any changes (3.8). mix and pop are
both better (6.0 and 4.8, respectively). Read is still slightly
worse (0.77, but just tracking frequency took us to 0.86).
And do some minor attribute caching.

All times improved:

epop average 4.179437022665904 stddev 0.02725543525889035
mix  average 4.767989358665848 stddev 0.029150437240123107
pop  average 3.847747696330771 stddev 0.05402757351317516
read average 0.924939800335172 stddev 0.008914988112135936
This avoids CFFI overhead of creating handles and Ring pointers.

Current times all improved or stayed the same; we are now beating the
original code on *all* timings (except read):

epop average 3.7955654890004857 stddev 0.045522296233750786
mix  average 4.36776723600633 stddev 0.040494291535130046
pop  average 3.503353017338668 stddev 0.08680920255355108
read average 0.939945668001504 stddev 0.013126699924166357
Current times are:
epop average 3.5930002473372347 stddev 0.046088702157717376
mix  average 4.21050848266168 stddev 0.04612077189551625
pop  average 3.6549476809983994 stddev 0.4561708743548922
read average 0.9901350849977462 stddev 0.0089
This lets us see the effect of key rolling and checkpoints.

Include benchmark numbers (the LRU times still need to be updated;
they're from an earlier revision that didn't use memory efficiently).
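
(Illustration only, not code from this branch: the aging pass mentioned in the commit messages above typically amounts to halving every frequency counter and dropping entries whose counter falls to zero, roughly as in this sketch; `frequencies` is assumed to be a plain dict of key -> counter, and only the survivors would be written to the persistent cache file.)

def age(frequencies):
    # Halve every counter; forget keys whose counter reaches zero.
    # This keeps counters from growing without bound and lets the cache
    # adapt when the workload shifts.
    for key, freq in list(frequencies.items()):
        freq //= 2
        if freq:
            frequencies[key] = freq
        else:
            del frequencies[key]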
@jamadden
Member Author

In c80c885 I adapted the storage simulations to use the true cache implementation that RelStorage uses (it's a layer on top of this list), which is where the rolling takes place. By default, the rolling happens every 10,000 write operations. It's interesting to see the effect this has on cache hits: in some of the simulations they go up, in others they go down.

In the non-rolling simulations, this implementation was losing (slightly) on 3 of the write-heavy workloads. In the rolling simulations, it loses on all 6 of the write-heavy workloads---even in a case where the hit ratio went up! In one of the cases it goes far down.

I can probably produce a trace file containing the reads and writes we do to the cache during these simulations.
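
(For illustration only, not RelStorage's actual code: a minimal sketch of what the rolling means for the simulations. `cache` is assumed to be any dict-like cache, and the `generation` counter is a made-up stand-in for the checkpoint portion of the real keys.)

ROLL_INTERVAL = 10000  # rolling happens every 10,000 writes by default

class RollingKeySimulation:
    def __init__(self, cache):
        self.cache = cache        # any dict-like cache
        self.writes = 0
        self.generation = 0       # stand-in for the checkpoint part of the key

    def _key(self, oid):
        return (self.generation, oid)

    def read(self, oid):
        return self.cache.get(self._key(oid))

    def write(self, oid, value):
        self.cache[self._key(oid)] = value
        self.writes += 1
        if self.writes % ROLL_INTERVAL == 0:
            # After a roll, previously cached keys no longer match new lookups.
            self.generation += 1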

@ben-manes

Note that the Financial (OLTP) traces are the worst case for a frequency strategy. They appear to be disk traces from the database commit log, where the changes are stored and then replayed to catch up. The database's in-memory cache would have been more interesting to trace, and is probably large enough to service most accesses. That means we only see a few extra operations on a miss, and LRU is pretty close to optimal.

So I think the OLTP traces are good for finding weaknesses in the algorithm, but might not be representative of most workloads.

@coveralls

coveralls commented Sep 23, 2016


Coverage increased (+0.06%) to 95.697% when pulling 5217d92 on w-tinylru-cffi into 1b39101 on master.

@jamadden
Member Author

Here are the traces from the OLTP runs. The keys reset 201 times in Financial1 and 26 times in Financial2.

https://s3.amazonaws.com/dev.nextthought.com/traces-Financial1.tar.xz
https://s3.amazonaws.com/dev.nextthought.com/traces-Financial2.tar.xz

@ben-manes

Thanks! I'll give it a run tonight. This is one of the harder traces for TinyLFU, though, and I don't think it matches most application workloads.

I do think it's important to evaluate based on RelStorage's workload. So a trace of cache accesses from a production-facing machine would be ideal.

@jamadden
Member Author

I agree that the OLTP traces are an unlikely workload, and it makes sense that they're a difficult case.

I agree that it would be good to run trace files from an actual RelStorage instance, but I don't have any 😿. Cache tracing is a very new feature and I don't know of anyone who's using it in production yet. I could probably get it turned on in a test environment, but that's not a realistic environment (at all).

In the meantime, I added the trace files from the cache2k project. The ORM ones for an e-commerce site should be substantially more similar to a RelStorage workload.

I added proof-of-concept code to save the frequency information from evicted nodes, and restore it if we get the key again (which we do on the small workloads). I've only run a handful of simulations, but so far that isn't making any appreciable difference in any of the workloads at any cache size.
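
(Not the proof-of-concept itself, just a sketch of the idea: keep a bounded side table of frequencies for evicted keys and seed a re-added entry from it. The class and method names here are made up; a newly added entry's starting frequency would then be something like max(1, history.restore(key)).)

from collections import OrderedDict

class EvictionHistory:
    """Remember the frequency of recently evicted keys, bounded FIFO-style."""

    def __init__(self, limit=10000):
        self.limit = limit
        self._freqs = OrderedDict()

    def note_eviction(self, key, frequency):
        self._freqs[key] = frequency
        if len(self._freqs) > self.limit:
            self._freqs.popitem(last=False)   # drop the oldest remembered key

    def restore(self, key):
        # Return the remembered frequency (and forget it), or 0 if unknown.
        return self._freqs.pop(key, 0)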

@ben-manes

I don't know enough about the source of cache2k's traces and have my concerns. Regardless, they are also recency-skewed, and from the names they appear to come from an ORM's session cache. That means the entries are very short-lived and flushed when the transaction completes. This is why LRU is pretty close to optimal in that workload, too.

I don't mean to sound like I'm making excuses to invalidate some traces. All data is good, and problems help identify where to make improvements. It's just that those two are very special-case scenarios, and LRU being close to optimal on them makes sense. So a RelStorage trace would be interesting, because I'd expect frequency to play a large role normally, but recency to spike when the rollover occurs. That's kind of a cool new workload to play with when trying to design an adaptive algorithm.

I'm hoping to get a trace from Facebook soon, though I'm not clear what the workload is. They've got a cache with 14M entries taking 16 GB of JVM heap and a 93% hit rate. Dropping that down to 10M entries still kept a high hit rate (they didn't say what). I thought a trace would hint at what a good cache size is so they don't oversize it. Supposedly the cache is reducing database CPU by 50%, which they're happy about.

@ben-manes

I'm sorry, but I am not sure what to do with these files. Can you point me to the format specification so that I can write a parser and feed them into the simulator?

Document some more info about the types of the workloads and how they
respond to generation sizes.
@coveralls


Coverage increased (+0.06%) to 95.699% when pulling 47cb68b on w-tinylru-cffi into 1b39101 on master.

@jamadden
Member Author

I'm sorry, but I am not sure what to do with these files. Can you point me to the format specification so that I can write a parser and feed them into the simulator?

Sorry, I missed that comment or I would have gotten back to you sooner. There's no spec; it's just an ad-hoc dump, similar in spirit to the storage traces. It's a comma-separated list of operations in time order. The first field is the opcode, r or w, indicating a read from or write to the cache. The second field is the string form of the cache key (which itself is structured around colons, but you can probably ignore that). The third field is the size of the entry; it's always -1 for reads because I don't have that information. (Size matters for the RelStorage use case because the cache is bounded by the byte size of the entries, which varies, not by a fixed entry count.)
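
(In Python terms, a minimal parser for one line of that format would be something like the following; not code from this branch.)

def parse_trace_line(line):
    # e.g. "r,:state:2:55596,-1" -> ('r', ':state:2:55596', -1)
    opcode, key, size = line.rstrip('\n').split(',')
    return opcode, key, int(size)   # size is -1 for reads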

@jamadden
Member Author

jamadden commented Sep 26, 2016

I've also run a comparison of this code versus the old code in the zodbshootout synthetic benchmark. While this is not a realistic workload, it does demonstrate the common scenario of a BTree-based index or container, where the data in the nodes of the BTree changes frequently but the BTree itself changes less often.

With a 10MB cache size but roughly 12MB of data stored in BTrees, the old code gets a meager hit ratio of 0.00035 (13 hits against 37,209 misses) and results of:

add 2.6156s, update 3.4087s, warm 2.0236s, cold 1.4378s, hot 1.4761s

The new code gets a much nicer hit ratio of 0.248 (66,665 hits and 202,008 misses) and results of:

add 2.1291s, update 2.5819s, warm 0.3759s, cold 1.4691s, hot 0.9331s

Overall, better results, especially in the warm case: the old code would have pushed out most of the BTree nodes in favor of the data produced by the update step, but the new code recognizes their frequency and keeps them, so there are many fewer backend database hits.
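
(For reference, the quoted ratios are just hits / (hits + misses):)

old_ratio = 13 / (13 + 37209)           # ~0.00035
new_ratio = 66665 / (66665 + 202008)    # ~0.248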

Unfortunately, I discovered a crashing bug as a result of the load test, which I have to debug first.

@ben-manes

I did a quick run of two traces. The adaptive approach is experimental; it corrects small, recency-skewed traces by using a Bloom filter as feedback on whether the policy made a misprediction. If so, it increases the window size; otherwise it shrinks it.

max-size = 512

Policy Financial1.spc.1024.ctrace Financial2.spc.1024.ctrace
ARC 20.29 27.75
LIRS 17.55 25.57
LRU 18.15 22.70
SLRU 14.66 25.20
S4LRU 15.36 26.29
Optimal 32.90 41.61
Unbounded 24.48 76.16
W-TinyLFU 1% 14.07 26.58
W-TinyLFU 10% 17.18 29.50
Adaptive W-TinyLFU 18.43 29.59
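
(A rough sketch of that feedback loop, assuming the filter remembers candidates the admission policy previously rejected; the names are illustrative and a plain set could stand in for the Bloom filter.)

def record_miss(key, window, rejected_filter, step=1):
    if key in rejected_filter:
        # We rejected this key earlier and now it's wanted again: a
        # misprediction, so favor recency by growing the admission window.
        window.grow(step)
    else:
        # Predictions look fine; slowly shrink the window to favor frequency.
        window.shrink(step)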

Below is the Java code I wrote for your review. Let me know if I did anything wrong.

public final class RelStorageTraceReader extends TextTraceReader {

  public RelStorageTraceReader(List<String> filePaths) {
    super(filePaths);
  }

  @Override
  public LongStream events() throws IOException {
    return lines()
        .filter(line -> line.charAt(0) == 'r')
        .map(line -> line.split(":", 3))
        .filter(part -> part[1].equals("state"))
        .map(parts -> parts[2])
        .map(key -> key.split(",", 2)[0])
        .mapToLong(key -> Hashing.murmur3_128().hashString(key, UTF_8).asLong());
  }
}

…big one; don't put big ones in that won't fit.
@coveralls

coveralls commented Sep 26, 2016


Coverage increased (+0.1%) to 95.749% when pulling 46ab276 on w-tinylru-cffi into 1b39101 on master.

@jamadden
Member Author

Thanks for running that. Very interesting numbers. The Bloom filter as a recency corrector is an interesting idea, and it's neat to see that it has an effect in one case and not the other. (I had also manually simulated different window sizes and noticed a dramatic difference.) But not enough of an effect that I expect to try to implement it for this PR.

Regarding the parsing code, if I'm reading it correctly, it's first splitting on ':', filtering that, only passing on the third element (index 2), and only then splitting that on ','. But that would wind up with the wrong key.

>>> 'r,:state:2:55596,-1'.split(':', 3)[2].split(',')[0]
'2'

It's CSV first, and the whole second field is the key. (And the 'checkpoint' keys are important; we read and write them fairly frequently.)

If I were to write a Python iterator to turn the string keys into hashes, it would look like this:

def iter_read_keys(lines):
    # Yield a hash for each key that is read from the cache.
    for line in lines:
        if not line.startswith('r'):
            continue
        opcode, key, size = line.split(',')
        yield hash(key)

Here's a long section about writes that you can skip because it turns out to be theoretical:

Something I notice is that you're discarding all the writes. I imagine you're filtering away writes because the cache "auto-populates", and with a fixed number of entries the write size is irrelevant. The RelStorage cache always treats writing to a key as a "hit": it moves the entry to the most-recently-used position (possibly promoting it from probation to protected) and increments its frequency. It was this way when I got here, and I thought it made sense for RelStorage: the cache is shared, and sits underneath a higher-level cache (which is not shared). Writing to a key invalidates the higher-level cache for all the clients, so they'll have to read it again from the RelStorage cache when they need it, and so we shouldn't take a chance on it being evicted.
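
(In code terms, the current write handling is roughly this sketch; the attribute and method names are illustrative, not the real ones.)

def on_write(cache, entry, value):
    # A write is treated like a hit: update the value, bump the frequency,
    # and move the entry to the MRU position (possibly promoting it from
    # probation to protected).
    entry.value = value
    entry.frequency += 1
    cache.move_to_mru(entry)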

But maybe, while that made sense for a pure LRU cache, having the frequency information now will let us make a smarter decision: even if we don't move it to the MRU position or promote it, and just update the frequency, we're at less risk of bumping it out of the cache; if it's really needed again, it'll be requested, and that will move it to the MRU position and/or promote it.

All that said, that's pretty much theoretical and only of interest for the simulations. When an object is written to, part of its key (the transaction id portion) changes (that's how the database implements multi-version concurrency control). With the exception of that one checkpoints key, we never actually write to an existing key.

@ben-manes

The actual reason I disregard writes is a little less intelligent than you're theorizing. I agree writes should be viewed as a hit and moved to the MRU position. This is done in my cache, too.

The original traces I had were key hashes, e.g. the LIRS traces, so there was no distinction of a read or write. The simulator's Policy interface has a record(key) method to simulate an auto-populating get. That updates the hit/miss counters for the report. The trace is then a stream of long keys.

A write, as in the UMass or ARC traces, would need a new policy API, as it's not a hit or miss but an insertion. Then the stream needs to provide an operation code, which means either I steal the sign bit or use a wrapper object (creating a little GC pressure). If I instead treated it as an access, it adds frequency if the key is immediately reused, but it hurts the miss rate if the key is never read again. So blindly passing it into record is probably wrong.

For the simplicity of not changing the APIs or thinking it through too much, I filtered writes out in all the traces. I'm not sure what makes the most sense, and I get decent insight from the current behavior. I'd be fine changing that if we talk through it a little more.

I'll take another stab this evening.

@jamadden
Member Author

Thanks for all your help and discussion! I'm more than happy to continue discussing and I welcome any further insights.

However, this PR is a blocker for some other work I need to do, and I'm happy enough with its state for the moment (I'm basically convinced there won't be any major performance regressions), so I'm going to go ahead and merge it.
