Reduce the cost of hash collisions for set objects #62971
Comments
I'm working on a patch for the lookkey() functions in Objects/setobject.c. The core idea is to follow the probe sequence as usual but to also check an adjacent entry for each probe (i.e. check &table[i & mask] as usual and then check &table[(i & mask) ^ 1] before going on to the next probes, which are scattered around memory). On modern processors (anything in the last decade), this is likely to reduce the cost of hash collisions and reduce the number of cache lines fetched from memory.

+ Cache line benefit: improves the odds that the adjacent probe will be on the same cache line, thereby reducing the total number of cache lines fetched. This benefit applies to set insertions as well as set lookups (a partial write to a cache line requires that a full cache line be read prior to writing, so it helps if we've just examined another slot on the same cache line).

+ Parallel execution: because the second lookup has no data dependency on the first lookup, some of its instructions can be executed in parallel by the out-of-order engine.

+ Reduced loop overhead: the second lookup doesn't require a new computation of the index *i* (we just XOR with 1), a new perturb shift, a new masking of *i*, or a jump to the top of the loop. In other words, it has all the usual benefits of loop unrolling.
? Branch prediction: I'm unsure whether the additional if-statements will have a net positive or net negative effect on branch prediction rates (positive because each branch can be tracked separately, or negative because additional branches use up more space in the branch prediction tables). Once the patch is ready, I'll run Cachegrind to get a better sense of what is happening.
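To make the probe order concrete, here is a small self-contained toy in C (my own sketch, not the patch, and not CPython's actual setentry layout or lookkey() signature): a table of non-zero size_t keys where hash == key, probed with the extra adjacent check described above.

/* Toy model (not CPython code): an open-addressed table of non-zero
 * size_t keys, probed the way the proposed patch probes a set table --
 * check slot j, then the adjacent slot j ^ 1 (usually on the same
 * 64-byte cache line), and only then perturb to the next scattered slot. */
#include <stddef.h>
#include <stdio.h>

#define PERTURB_SHIFT 5          /* same constant CPython uses */
#define TABLE_SIZE    16         /* must be a power of two */

static size_t table[TABLE_SIZE]; /* 0 means "empty slot" */

static size_t *
lookup_slot(size_t hash)
{
    size_t mask = TABLE_SIZE - 1;
    size_t perturb = hash;
    size_t i = hash & mask;
    size_t j = i;

    while (1) {
        if (table[j] == 0 || table[j] == hash)
            return &table[j];
        /* Adjacent probe: no new hash/perturb work, and a good chance
         * the entry is already in the cache line we just touched. */
        if (table[j ^ 1] == 0 || table[j ^ 1] == hash)
            return &table[j ^ 1];
        i = i * 5 + perturb + 1;     /* usual open-addressing recurrence */
        j = i & mask;
        perturb >>= PERTURB_SHIFT;
    }
}

int main(void)
{
    *lookup_slot(7) = 7;             /* insert */
    *lookup_slot(23) = 23;           /* collides with 7 (23 & 15 == 7) */
    printf("23 found in slot %td\n", lookup_slot(23) - table);
    return 0;
}

With 16-byte entries in the real table, slots j and j ^ 1 usually sit on the same 64-byte cache line, which is where the savings come from.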
How will it affect the performance of set(range(n))?
Serhiy, I've posted a patch so you can examine the effects in detail. For something like set(range(n)), I would expect no change because the dataset is collision-free due to Tim's design where hash(someint) == someint. That said, I'm trying to optimize the general case and don't really mind if rare cases are modestly slowed. The new code only kicks in when there is a collision; the logic for the initial probe is unchanged.

The idea is to pick up some of the cache-locality benefits of collision resolution by separate chaining, without incurring the malloc (for 100% of entries) that chaining typically requires. When adjacent entries are in the same cache line, the extra adjacent probe is much cheaper (nearly free) than a probe somewhere else in memory.
Do you expect a visible difference on a microbenchmark? Do you have benchmark numbers yet?
I just made the patch a few minutes ago and am just starting to work on benchmarks. I posted here so that you guys could help find the strengths and weaknesses of the approach. The theory is sound, but it is going to take a good deal of effort to isolate the effects (either good or bad) in realistic benchmarks. The short, tight loops in timeit tend to put everything in cache and hide the effects of good memory access patterns.

I would love to have you all team up with me on this one. I could use help on timings, examining disassembly, and runs with Cachegrind. This is still in the experimental phase, so it is premature to be throwing up obstacles with "show me your data" or "what does it do in situation x". At this point, you could help out by running some timings designed to isolate the effects on high-collision data big enough not to fit in the L2 cache. I hope you'll assist in evaluating the idea with an open mind.
Oh, I was just asking for a microbenchmark on set(); I don't mind either way. I asked because I don't understand all these complex things of the CPU. All your arguments sound good, and I'm always happy to see people working on this kind of optimization.
Here's an excerpt from the patch that gives a pretty good idea of what is being changed (the three lines of new logic are marked):

static void
set_insert_clean(PySetObject *so, PyObject *key, Py_hash_t hash)
{
    setentry *table = so->table;
    setentry *entry;
    size_t perturb = hash;
    size_t mask = (size_t)so->mask;
    size_t i, j;                        // the j variable is new

    i = j = (size_t)hash & mask;
    while (1) {
        entry = &table[j];
        if (entry->key == NULL)
            break;
        entry = &table[j ^ 1];          // this line is new
        if (entry->key == NULL)         // this line is new
            break;                      // this line is new
        i = i * 5 + perturb + 1;
        j = i & mask;
        perturb >>= PERTURB_SHIFT;
    }
    so->fill++;
    entry->key = key;
    entry->hash = hash;
    so->used++;
}

"+ Cache line benefit: improves the odds that the adjacent probe will [be on the same cache line...]"

With j=11, is the lookup at index 10 in the same cache line? Does the CPU prefetch backwards? I read that the CPU likes to prefetch data forward, but I didn't know whether it also prefetches backward.
It might or it might not. Cache lines are typically 64 bytes, which holds four key/hash pairs. So the odds are 75% in favor of an adjacent entry being in the same cache line. If malloc were guaranteed to return 64-byte-aligned memory blocks, the odds would be 100%, but my understanding is that 16-byte alignment is the norm.

The integrated memory controller can also prefetch backwards, but that won't help us. The prefetching doesn't start until you've done a series of monotonically increasing or decreasing accesses, and there is no prefetch on the first probe regardless of whether you're going up or down. The whole problem with large sets is that the data is scattered all over memory and is unlikely to be in cache or to benefit from prefetching the way lists and deques do (their memory access patterns are consecutive).

The core concept of the patch is that a cache-line fetch may be expensive, so you might as well check more than one slot once a line has been fetched. Working against us is that we don't have control over where malloc starts a memory block (it may start in the middle of a cache line).
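The 75% figure can be sanity-checked with a small standalone C toy (mine, not part of the issue): for 16-byte entries and 64-byte lines, it counts how often slot j and slot j ^ 1 share a line at each possible 16-byte-aligned placement of the table.

/* Count how often slot j and slot j ^ 1 land on the same 64-byte cache
 * line for a 16-byte entry, as a function of the table's alignment. */
#include <stdio.h>
#include <stdint.h>

#define ENTRY_SIZE 16   /* one pointer-sized key plus one hash on 64-bit */
#define LINE_SIZE  64

int main(void)
{
    /* Try the four possible 16-byte-aligned offsets of the table within
     * a cache line (malloc typically guarantees 16-byte alignment). */
    for (unsigned offset = 0; offset < LINE_SIZE; offset += ENTRY_SIZE) {
        unsigned same = 0, total = 0;
        for (unsigned j = 0; j < 1024; j++, total++) {
            uintptr_t a = offset + (uintptr_t)j * ENTRY_SIZE;
            uintptr_t b = offset + (uintptr_t)(j ^ 1) * ENTRY_SIZE;
            if (a / LINE_SIZE == b / LINE_SIZE)
                same++;
        }
        printf("table offset %2u: %u%% of adjacent probes share a line\n",
               offset, 100 * same / total);
    }
    return 0;
}

It prints 100% for table offsets 0 and 32 and 50% for offsets 16 and 48, which averages out to the 75% quoted above.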
Here's a simple benchmark to start with. On my machine (2.4 GHz Core 2 Duo with 2 MB L3 cache), compiled with GCC 4.8, the benchmark shows a 6% speedup for sets that spill over the L2 cache and 11% for sets that spill over the L3 cache. The theoretically expected result is 9.75% (based on a 26% collision rate for tables that are 64% full).

Most programs will see less benefit (they may have smaller tables that fit in the L1 cache, or tables that are less densely filled, resulting in fewer collisions). My quick timings show either no change or an inconsequential improvement in the performance of those programs.
Attaching the detailed results of a Cachegrind analysis. Here are the high points (all of which were expected): the last-level cache misses are the most important. Per the Cachegrind docs, "an L1 miss will typically cost around 10 cycles, [and] an LL miss can cost as much as 200 cycles."
Also, Cachegrind shows a 3% improvement in the branch misprediction rate (from 9.9% to 9.6%):

==82814== Mispred rate: 9.9% ( 9.4% + 15.5% ...
Instead of only a second lookup, could you try, for example, 4 lookups, and align j so that all of them fall in the same cache line?
Accessing 4 entries per probe is a tempting line of development, but it will be subject to diminishing returns (second and third collisions aren't frequent). I like the idea of aligning j to fit in a cache line, but the computation would need to be cheap and portable (standard C with no pointer tricks that rely on non-guaranteed behavior). Have you had a chance to run the benchmarks on your machine? I'm curious how this works out on other processors and with other compilers.
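For what it's worth, one cheap and portable way to keep a group of probes inside a cache-line-sized region is to round the index down with a mask rather than doing alignment tricks on pointers. A hypothetical sketch (the group size of 4, the entry struct, and the helper name are my assumptions, not part of the patch):

/* Hypothetical sketch (not the patch): scan the 4-entry group that
 * contains slot j before perturbing. Rounding j down with an index
 * mask is plain standard C -- no pointer-alignment tricks needed.
 * If the table happens to be 64-byte aligned, the whole group is one
 * cache line; with malloc's usual 16-byte alignment it spans at most
 * two. Assumes the table size is a power of two >= ENTRIES_PER_GROUP,
 * which keeps base + k inside the table. */
#include <stddef.h>

typedef struct {            /* mirrors the 16-byte key/hash pair */
    void   *key;
    size_t  hash;
} entry_t;

#define ENTRIES_PER_GROUP 4 /* 64-byte line / 16-byte entry (assumption) */

/* Return a free slot in j's group, or NULL if the group is full
 * (the caller would then perturb and probe another group). */
entry_t *
probe_group(entry_t *table, size_t j)
{
    size_t base = j & ~(size_t)(ENTRIES_PER_GROUP - 1);
    for (size_t k = 0; k < ENTRIES_PER_GROUP; k++) {
        if (table[base + k].key == NULL)
            return &table[base + k];
    }
    return NULL;
}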
Attaching an updated patch (same algorithm but with a little more factoring to make the patch smaller).
The following benchmarks were run with the second version of the patch (so2.diff), once before and once after.

OSX 10.8 on a MacBook Pro, 2.5 GHz Core i5, two cores each with 256 KB L2 cache, 3 MB L3 cache: 1.32% less time for the first benchmark (7.07 before, 6.98 after)

Ubuntu 12.10 64-bit, AMD Athlon II X2 250 (3.0 GHz, 2 MB L2 cache, 256 KB total L1 per processor): 5% less time for the first benchmark (13.6 before, 12.9 after)

Win7 64-bit, Intel Xeon E3-1230 V2 (quad-core, 3.30 GHz, 256 KB L2 cache per core, 8 MB L3 cache): 1.8% less time for the first benchmark (3.90 before, 3.83 after). However, this processor has "Intel SmartCache", which as far as I can tell means one core can use the other cores' caches if they are inactive. So I ran the benchmarks again with 4x larger data sets (revised benchmark script attached): 3.8% less time for the first benchmark (19.3 before, 18.6 after)
Because of the unusually low differences, I tried running the benchmarks again on the MacBook (OSX 10.8). The first run was actually with so.diff; this time it was with so2.diff: 0.69% less time for the first benchmark (7.30 before, 7.25 after). Strange...
Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz on Ubuntu 13.04 x86_64 with GCC 4.7.3:

1MB L2 cache test: 11.33sec / 10.83sec (4.4% faster)
8MB L3 cache test: 22.16sec / 19.67sec (11.2% faster)
Sorry to be pedantic, but while 19.67sec is about 11.2% less than 22.16sec, it is actually about 12.6% faster (22.16/19.67 - 1). Regardless, that's a good result! And I'm happy that the modified benchmark is useful.
Oh *curse*, you are right. Why do I always confuse the order? :|

It looks like the performance boost increases with CPU cache size as well as the CPU's generation. The i7-4770K box at work is less than a week old; the CPU is currently the fastest desktop CPU from Intel.
Wow, thanks for running these tests and posting the results :-)
New changeset a887596b841f by Raymond Hettinger in branch 'default':
How about a Misc/NEWS and a Doc/whatsnew/3.4 entry, too?
Can this optimization also be applied to the dict type?
Victor, this should work for dictionaries as well. Fundamentally, the concepts of loop unrolling and exploiting cache locality should apply to any hash table implementation. That said, dicts are more complex than sets due to (1) storing hash/key/value triples, (2) key-sharing, and (3) different access patterns, so it will require more effort than just porting over the patch. Feel free to open a new tracker item for that one. It will take more work but will potentially have a much greater pay-off.
New changeset 9353c611f897 by Raymond Hettinger in branch 'default':