I apologize if I'm grumpy at all, as this sort of thing is a reoccuring theme for us over at memcached central.
Have you folks seen at all the performance and slab balancing changes that have gone in six months ago? From what I see on your end most of the other changes are around stats and logging, which are pretty trivial.
In my own testing (with mc-crusher) I wasn't able to get the stats locks (as of 1.4.13) to cause any contention (most of the stats were split into per-thread locks ages ago). This is probably untrue with a NUMA machine, and I would be curious to know how you caused those issues.
Most relevant, probably:
http://code.google.com/p/memcached/wiki/ReleaseNotes1410 (massive performance boost)
http://code.google.com/p/memcached/wiki/ReleaseNotes1411 (slab rebalancing)
http://groups.google.com/group/memcached/browse_thread/thread/972a4cf1f2c1b017/d55dcfb56fba9592 (more notes on the performance changes. this version eventually became 1.4.10 instead of 1.4.9 though).
It's worth noting that I'd also tried something similar to your background thread aggregation, and was able to measure a loss of performance rather than a boost, and disabling the per-thread locks didn't do anything at all. It should still be easy to remove the "pointless" per-thread locks and kill some of the few remaining global stats, but as I said above I wasn't able to measure much contention.
twemcache essentially was built to make memcached suitable, robust and manageable for the large scale production environment at twitter.
slab eviction changes have been running in our production since nov '10 and has gone through a lot of real world testing for our traffic; also because of the scale at which we operate, doing a client triggered, request-based slab eviction is not ideal for us.
the lock-less stat implementation have been running since may '11 and per-thread stats local-buffer is actually a precursor to any future changes that we would make cache_lock fine grained
klog ends up being extremely useful to us in doing in doing in-depth analysis on the cached data access pattern across our cache cluster
finally, as we mentioned in the blog post, our initial goal was make memcached work well for us and in the long term, we look forward to sharing our code and ideas with the memcached community at large and push it upstream.
I wasn't asking about justifying the work you did. Every time people release code I jump for joy and click my heels together. The only sadness is in the time scale, since I could've avoided doing some work if we had this stuff sooner.
Here's a bit that's off:
1) You are not the only company with traffic. I regularly tune clusters for top 20 websites. Memcached's own features are designed for scale, and are tested very thoroughly. Every time twitter seems to talk about memcached you have this disbelief that anyone could ever run into the same problems you have. We went through this years ago when you ran 1.2.3 and it was crashing, although I'm told your team was not aware of this.
2) Your blog post is damaging since people will read that as though twitter's branch is the only one with slab reassignment and that it will have better performance (which I'm willing to bet it doesn't, if you haven't modified cache_lock much).
As for technical responses:
If your system truly relies on freeing slabs as soon (or perhaps right before) memory fills up, I think it's trivial to get that to happen via an option. I wasn't able to get slab rebalance to not cause microblocking without doing the background thread I did though.
Is there a performance comparison of the latest Memcached against Twemcache?
@dormando re: "This is probably untrue with a NUMA machine" – you don't have one to test?
Slab reassignment aka random eviction in twemcache is on-demand, that enables cache to adapt immediately to changing access patterns. We don't use a background thread because it does not give us the deterministic behavior that we would like in our use case. For comparison with memcached see: https://github.com/twitter/twemcache/blob/master/notes/random_eviction.md
To reiterate the stats feature, the thread-local stats is a precursor to any changes that would make cache_lock fine-grained.
@tsuna https://gist.github.com/3088511 I did a really basic read test here. I need to modify my tool more to do set load tests since they dropped binary protocol support, and I have to add "set with noreply" tests to mc-crusher. Pre-1.4.10 memc had very poor read performance while mixed with significant sets. 1.4.10+ does very well by comparison.
When I was doing the lock scaling and slab rebalance work I had only brief access to a NUMA machine, which showed a near 50% performance drop when spread across nodes (however it still had a 2-3x speedup compared to 1.4.9). I noted this in the 1.4.10/11 release notes. I didn't have enough access until recently to determine the biggest contributors of that drop (http://memcached.org/feedme you can see me unsuccessfully begging here).
@manjuraj Ok, so that tells me how it's different. Slab rebalance in mainline has two background threads, actually: one that moves slab pages between slab classes, and one which observes and runs a crappy algorithm.
If you read the release notes and related mailing list messages, I went with this methodology:
Add commands to move memory from one class to another.
Add a rudimentary built-in automover
Include a perl script which also implements the automover algorithm, which people can modify for their own needs then submit patches/tunable ideas for the built-in automover.
because we're like, a community or something, d'awg.
In your case you want triggered purges. We still can't do them fully inline because that would block the process and we want deterministic performance. I can round out a few trivial things, then add a flag to allow triggering immediate moves in an almost identical way to yours. I'll use your example test as reference and see if we can match it. Should only take a few hours at most.
There is an important point here: I wrote the slab rebalancer after cutting the cache_lock up severely. I've seen 6+ million key reads/sec, and 930,000 sets per second. I can make it go faster but wanted to limit how quickly we changed the locks around to limit potential impact to stability.
I took a look through your code, and the slab mover just erases and shuffles memory while holding a global lock and stopping the world. We can't do that in the general case as some users prefer deterministic response times, but we can actually free items in bulk via a flag :) It was much harder to free memory out from under the system when you can't rely on holding the cache_lock and want to preserve zero-copy client writes. I'm not sure yet what you've done with the latter.
TL;DR in a few days I'll poke a few things on our end which should match your slab rebalance method while retaining our high performance.
@dormando I was surprised because you said you regularly tune clusters for top 20 websites, and I can't imagine that at least 10 of them don't run on NUMA architectures. You can get a dual-Sandy Bridge with 64GB of RAM for around $3000 these days.
@tsuna Box I developed a feature on != the only box this software runs on. I validated it on NUMA but didn't have access to something I could crash without causing someone to pay for remote hands to reboot it.
That and NUMA tuning was basically a separate problem back then and would've taken a lot longer to code around. I made it 2-3x faster on NUMA and almost 9x faster on non-NUMA, so I released it. :)
EDIT: Also to point out that if you have two-socket NUMA, starting two instances and binding them gave 100% performance, but a single instance was always overwhelming the network after my tuning. Needs 10ge to hit a bottleneck now
Haven't had a lot of hacking time this weekend, but I spent a few hours rearranging code and got this:
It's slightly different from twemcache in that it triggers a slab page move on an eviction, instead of moving memory at the time of an eviction. In practice that's completely negligable. Feel free to take the branch and beat on it a bit...
I have a few patches to merge and want to give a whack at removing all the uncontested stats locks again and see how that changes on a NUMA box, then this'll turn into 1.4.14. I'm not sure how practical the automove=2 option is at all, since most of the clusters I've seen are just out of memory across the board and instead need something more like the less aggressive automover (which itself still needs more runtime options).
However, in order to make it aggressive I've fixed a few legitimate issues I had promised to fix back in february anyway, so thanks for pushing on that :)
alan, the (100K size and 10 call case) still has 67 errors out of 1000 requests. I believe you might want to tune your background 'automove' algorithm to be more aggressive so that it can keep up with the request rate.
I'm not sure what caused those errors yet. I'm sure it can be more aggressive, but given that my test ended up with 64 slabs in the 100k test, I think if you look at how many evictions each test had we're still much better than twemcache...
I don't mean to say that from a "well if you look at THIS side of the bench..." viewpoint. the original test wasn't so great. In designing this feature you're attempting to optimize the hit rate by minimizing the eviction rate. In this set only test you can only measure how many slab pages were moved and how many items got evicted.
In another test, you could load up 100,000 items, then set another 100,000 items into another slab class, then try to fetch the latter 100,000 items and see what the hit rate may be
To contrive a test that my branch will lose:
Set 100,000 items into class 3, so that memcached is fully out of slab pages to assign.
Set 100,000 new items into class 2 with automove=2. Such that there should be more than enough slab pages to hold all 100,000 items. Then fetch all those items back. In that test my branch should not end up with 100,000 items available in class 2, but yours should be a lot closer, depending on how well the random selection works.
server errors signify that 67 out of 1000 'set' requests failed and didn't make it to the cache
(apologies for the spam, you made me curious about those errors :P)
I took a verbose log of the run (doing each mcperf in order then recording the full log of the final run). The errors are "out of memory storing object", and almost all happen at the beginning of the session.
This is because you can't actually fit 10 100k items into a single slab page, you can fit 9 due to header overhead. So the first 9 items are being set, and the 10th connection spins on OOM errors as the other 9 are still uploading the 100k item. Since we store uploads directly into the memory assigned to an item, they appear "locked" and can't be evicted. Since it's not an eviction my little code didn't fire at first.
As a test I added a rebalance trigger to the OOM condition and that dropped the errors from 67 to 25. Still occassionally the item in the bottom LRU is locked due to upload/whatever and it OOM's. I'm actually fine with this, as it doesn't affect production folks very often.
I did remove the item depth search as that turned out to be pointless and slow, however after talking with another user about a month back it seems like it's a good idea to add a short search back in that ticks up only if the bottom items are in use (being fetched or written to), but otherwise not searching for expired items. As is you can cause this OOM condition if you have very few pages in a very large slab class, but even when people hit it, it's extremely rare.
We can't avoid it entirely without doing tradeoffs I don't want... like buffering an item into temporary memory then copying it in or something. In order for twemcache to not error out on that you must've completely disabled zero copy reads/writes...
we didn't disable zero copy in twemcache, and just like memcached we copy directly into the item struct carved out of slab; the reason we don't hit this in twemcache is because we don't use a background thread and our eviction is inline
Even if you don't use a background thread, it should be impossible for you to enter this scenario without clobbering memory somewhere... but I think I see what's happening...
(answering my own question again: looks like twemcache will actually reassign a page from its own slab class?)
Ah, it looks like you're doing this by adding a refcount for the entire slab class! (slab_acquire_refcount). Then you skip evicting a page from anything with a refcount incremented.
So if I fetch hard against all slabs on twemcache, it'll never be able to move pages around? Since at hardly any point in time each slab class with pages in it would be idle. It looks like if you hit this condition you return a NOMEM as well. If you're actually hitting twemcache fairly hard, shouldn't this be spewing NOMEM's instead of rebalancing pages? I guess if you suddenly set hard against another class you'll still hit enough times where other classes are fully idle...
If you look at mc-crusher's conf/ directory, you'll see "slab_rebal_torture[1-4]", which load up gets and sets on slab classes 1-14. On my desktop it'll be doing 1.4 million key fetches/sec while doing 400,000 sets per second. Then I run the "slab-tosser" script to throw memory between each class at random. Then to validate I run this for 12 hours... It found a few crashes and problems during development.
I have a feeling twemcache would have a hard time with that test (which is similar to a highly loaded production box), wherein memcached is able to move slab pages arbitrarily regardless of read load. This is why we do it in the background. As I've been splitting the cache_lock and allowing higher and higher fetch rates, this becomes a more important feature. You'll run into similar issues when you look at splitting cache_lock as well.
I can cause some server_error's on twemcache by using an even bigger item size. Perhaps we can make that better on our end, but I'm still not really convinced it matters for those large slabs, and our error rate should go down more by adding a search crawl on encountering a locked item.
refcount are on slab not slabclass; refcount on slab implies that the slab is being used by some other thread; this is necessary to support zero copy into item structs
Ah, gotcha. Sorry for getting hung up on terminology in the code. You have a refcount per slab page, then randomly look for available pages within the entire pool. That'll scale a bit better for sure, though I still cringe at holding the cache_lock for so long.
If you have a higher percentage of pages in the slab class receiving a set, doesn't that end up mostly just evicting items randomly out of its own class? In that case I want to pull just one item from the bottom of the LRU in that class.
Or is there something deeper going on about preserving a page distribution across classes despite pressure increasing in one area?
In random eviction, all slabs across all slab classes are treated as equal class citizens and the the algorithm for choosing the candidate slab is O(1). Random eviction works well when data access pattern is changing and not predetermined. For a LRU like access pattern, we have a 'LRU slab eviction' available in 2.5.0