-
Notifications
You must be signed in to change notification settings - Fork 23.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CRC64 can be faster, if we want #11988
Conversation
* 53-73% faster on Xeon 2670 v0 @ 2.6ghz * 2-2.5x faster on Core i3 8130U @ 2.2 ghz * 1.6-2.46 bytes/cycle on i3 8130U * likely >2x faster than crcspeed on newer CPUs with more resources than a 2012-era Xeon 2670 * crc64 combine function runs in <50 nanoseconds typical with vector + cache optimizations (~8 *microseconds* without vector optimizations, ~80 *microseconds without cache, the combination is extra effective) * redis-benchmark gets --crc <buffer size> testing option * still single-threaded
Some performance comparison information from the Xeon 2670:
|
I definitely want to get this merged in if it's faster. One additional note is we also want to improve CRC16 speed, it's a pretty big chunk of the CME cost. |
We can definitely make CRC16 faster with the same trick, though we'll need to add a separate 8k cache and a model argument to the combine function. I'll see what I can do this weekend / Monday. ETA: it's definitely faster on all hardware and sizes I've tested, with vector + static caching of the crc combine arrays. |
Actually, how long are the values that are passed into crc16? Because that's really a question about whether it's worth the time to do this for crc16 too. At least on my 2670, data > 128 bytes are faster with crcdual than crcspeed, but need to see about 2k before crctri is faster than crcdual. Are keys averaging >128 bytes? |
Thanks @josiahcarlson.
@madolson do you think you'll have the capacity to review this? |
Yeah, I'll take a look a bit later this week. |
@oranagra In attempting to put this into Trying to wrap the On the other hand, ETA: +1 on not worrying about crc16. |
Agree on this too. I was playing around with merging the crc16 code and was only testing 50 byte keys and was seeing some benefit. Didn't fully think through that there would be a limit to parallelizing. |
The parallelization offers linear speedup by itself only if you have extra hardware (my 2670 doesn't, but my 8130U does, as does lot of hardware). With the merge function being nonzero (if not tiny at 15-60ns; or 40-50 bytes hashed), the combination looks something like:
... and the resulting 15-60 ns per merge ends up being substantial at sizes < 1k. If you are seeing improvement in crc16 speed, it's likely due to the re-arrangement of the loop. I found that putting the increment at the top (as it is now) got 5-10%, depending, because otherwise the compiler will put the increment just before the loop compare + jump, causing a pipeline stall. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mostly convinced the implementation (or at least the theory behind it) is sound. Some comments about clarity and organization.
env options for defines in crccombine.h?
If they are better for modern hardware (last couple of years) then I don't think we need to expose them as environments.
sanity check
Ideally yeah we would add some simple testing. We still don't have great unit testing as a project. There are some tests in crc64.c that probably work for now.
only keep gf_matrix_times_vec2, remove other variations as they seem to be slower on all platforms tested.
I would be pro removing code we don't think is useful.
src/crc64.c
Outdated
// https://graphics.stanford.edu/~seander/bithacks.html#ReverseParallel | ||
// Extended to 64 bits, and added byteswap for final 3 steps. | ||
// 16-30x 64-bit operations, no comparisons (16 for native byteswap, 30 for pure C) | ||
// should be ~9-16x faster than before. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once this is committed there will be no before.
Will address review comments when I get back in ~2 weeks. Thank you for
your patience.
…On Sat, Apr 8, 2023, 6:18 PM Madelyn Olson ***@***.***> wrote:
***@***.**** commented on this pull request.
I'm mostly convinced the implementation (or at least the theory behind it)
is sound. Some comments about clarity and organization.
env options for defines in crccombine.h?
If they are better for modern hardware (last couple of years) then I don't
think we need to expose them as environments.
sanity check
Ideally yeah we would add some simple testing. We still don't have great
unit testing as a project. There are some tests in crc64.c that probably
work for now.
only keep gf_matrix_times_vec2, remove other variations as they seem to be
slower on all platforms tested.
I would be pro removing code we don't think is useful.
------------------------------
In src/crc64.c
<#11988 (comment)>:
> @@ -67,14 +67,32 @@ static uint64_t crc64_table[8][256] = {{0}};
* \return The reflected data.
*****************************************************************************/
static inline uint_fast64_t crc_reflect(uint_fast64_t data, size_t data_len) {
- uint_fast64_t ret = data & 0x01;
-
- for (size_t i = 1; i < data_len; i++) {
- data >>= 1;
- ret = (ret << 1) | (data & 0x01);
- }
-
- return ret;
+ // only ever called for data_len == 64;
+
+ // Borrowed from bit twiddling hacks, original in the public domain.
+ // https://graphics.stanford.edu/~seander/bithacks.html#ReverseParallel
+ // Extended to 64 bits, and added byteswap for final 3 steps.
+ // 16-30x 64-bit operations, no comparisons (16 for native byteswap, 30 for pure C)
+ // should be ~9-16x faster than before.
Once this is committed there will be no before.
------------------------------
In src/crc64.c
<#11988 (comment)>:
> @@ -67,14 +67,32 @@ static uint64_t crc64_table[8][256] = {{0}};
* \return The reflected data.
*****************************************************************************/
static inline uint_fast64_t crc_reflect(uint_fast64_t data, size_t data_len) {
- uint_fast64_t ret = data & 0x01;
-
- for (size_t i = 1; i < data_len; i++) {
- data >>= 1;
- ret = (ret << 1) | (data & 0x01);
- }
-
- return ret;
+ // only ever called for data_len == 64;
Is this an assertion for correctness or just a fact of our implementation?
It seems like the code is correctly handling everything.
------------------------------
In src/redis-benchmark.c
<#11988 (comment)>:
> @@ -1771,6 +1823,162 @@ int main(int argc, char **argv) {
tag = "";
+#if BENCH_CRC_AVAILABLE
Generally agree with oran that I would move this into crc64.c or some
other similar C file that has normal unit like tests.
------------------------------
In src/crcspeed.c
<#11988 (comment)>:
> +/* Note: doing all of our crc/next modifications *before* the crc table
+ * references is an absolute speedup on all CPUs tested. So... keep these
+ * macros separate.
+ */
Is this referring to the DO_8 macros? Also why is keeping them as two
separate macros a performance speedup, do we have any idea why?
------------------------------
In src/crcspeed.c
<#11988 (comment)>:
> +#define MERGE_CRC(I) \
+ crc1 = crc64_combine(crc1, crc##I, next2 - next1, CRC64_REVERSED_POLY, 64)
⬇️ Suggested change
-#define MERGE_CRC(I) \
- crc1 = crc64_combine(crc1, crc##I, next2 - next1, CRC64_REVERSED_POLY, 64)
+#define MERGE_CRC(crc_i) \
+ crc1 = crc64_combine(crc1, (crc_i), next2 - next1, CRC64_REVERSED_POLY, 64)
I don't know if there was some method to these naming conventions, but
suggesting a couple of things that would make it easy for me to read.
------------------------------
In src/crcspeed.c
<#11988 (comment)>:
> len--;
}
- /* fast middle processing, 8 bytes (aligned!) per loop */
+ if (len > CRC64_TRI_CUTOFF) {
+ /* 24 bytes per loop, doing 3 parallel 8 byte chunks at a time */
+ unsigned char *next2, *next3;
+ uint64_t olen, crc2=0, crc3=0;
+ CRC64_SPLIT(3);
+ // len is now the length of the shortest segment
What is the "shortest" segment in this context mean? This comment is also
duplicated in the 2x pathway.
------------------------------
In src/crcspeed.c
<#11988 (comment)>:
> +#define DO_8_2(I) \
+ crc##I = big_table[0][(uint8_t)crc##I] ^ \
+ big_table[1][(uint8_t)(crc##I >> 8)] ^ \
+ big_table[2][(uint8_t)(crc##I >> 16)] ^ \
+ big_table[3][(uint8_t)(crc##I >> 24)] ^ \
+ big_table[4][(uint8_t)(crc##I >> 32)] ^ \
+ big_table[5][(uint8_t)(crc##I >> 40)] ^ \
+ big_table[6][(uint8_t)(crc##I >> 48)] ^ \
+ big_table[7][crc##I >> 56]
Would rather just omit the code if it's not used/implemented.
------------------------------
In src/crcspeed.h
<#11988 (comment)>:
> @@ -34,6 +34,8 @@
typedef uint64_t (*crcfn64)(uint64_t, const void *, const uint64_t);
typedef uint16_t (*crcfn16)(uint16_t, const void *, const uint64_t);
+void set_crc64_cutoffs(size_t a, size_t b);
⬇️ Suggested change
-void set_crc64_cutoffs(size_t a, size_t b);
+void set_crc64_cutoffs(size_t dual_cutoff, size_t tri_cutoff);
------------------------------
In src/redis-benchmark.c
<#11988 (comment)>:
> +
+#define RPOLY UINT64_C(0x95ac9329ac4bc9b5)
+#define INIT_SIZE UINT64_C(0xffffffffffffffff)
+
+ if (config.crc64_test_size > 0 || config.crc64_combine) do {
+ unsigned char* data = NULL;
+ if (config.crc64_test_size > 0) {
+ data = zmalloc(config.crc64_test_size);
+ if (!data) {
+ fprintf(stderr, "Could not allocate sufficient data to test crc64 "
+ "%lld\n", config.crc64_test_size);
+ exit(1);
+ }
+ }
+
+/* Will fix later; minimizing CPU used for getting nstime, as shifts and even
I don't think we need to fix this. I don't think we need to be that
precise about nstime, these benchmarks don't need to be that precise in
terms of time taken, just that they are consistent to use as an evaluation
tool.
------------------------------
In src/redis-benchmark.c
<#11988 (comment)>:
> + for (uint64_t csize = (uint64_t)config.crc64_test_size;
+ csize > 3;
+ csize -= (csize >> 2) + (csize >> 3) + (csize >> 6)) {
I'm sure there is valid explanations for all of this stuff, but I have no
idea why weren't doing half of this math/comparisons.
------------------------------
In src/redis-benchmark.c
<#11988 (comment)>:
> + if (config.csv) printf("algorithm,buffer,performance,crc64_matches\n");
+ } else {
+ // just need some nonzero bytes here
+ genBenchmarkRandomData((char*)(&expect), sizeof(uint64_t));
+ }
+
+ // We have nanoseconds, and we need megs/second from bytes. Great,
+ // just a multiply and a division.
+ // By sampling over a range of --crc <sizes>, you can determine the
+ // best values for for setting CRC64_[DUAL|TRI]_CUTOFF on your machine.
+ // NOTE: we needed nstime here for handling timing of processing --crc
+ // values < 4k or so, as they can complete in under 1 microsecond on
+ // some platforms.
+
+ // save some copy / paste
+#define bench(WHICH) \
Not quite clear why these are macros and not functions. I can't piece out
anything that seems useful as a macro, and the macro replacements make this
code much less readable.
------------------------------
In src/crccombine.c
<#11988 (comment)>:
> @@ -0,0 +1,456 @@
+#include <stdint.h>
Didn't really look through this code, sort of just assuming it works as
described in the .h file.
—
Reply to this email directly, view it on GitHub
<#11988 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABTDQTN4BM2I4FIOMCPGG3XAGMWBANCNFSM6AAAAAAWPEL2AU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
@filipecosta90 I don't think we have any benchmarks that would test this right? I believe all of these configurations don't use a replica, so this would only be hit if they were using restore, which we don't have a benchmark around. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, sorry I think we just missed each other for the review. The next revision makes a lot more sense and is pretty clean. Some minor comments, I'm going to check it out as well and run on a couple of instances and report the performance. (That'll be tomorrow though)
src/Makefile
Outdated
@@ -116,6 +116,7 @@ endif | |||
# Override default settings if possible | |||
-include .make-settings | |||
|
|||
STD+=-DLINUX_PLATFORM=1 -DREDIS_TEST=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
STD+=-DLINUX_PLATFORM=1 -DREDIS_TEST=1 |
What is LINUX_PLATFORM supposed to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to enable nstime on my local machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the configuration of your local machine, I don't really want to add random stuff into the makefile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My apologies. New patch uses env var to decide, as there is no other way that REDIS_TEST is defined in the repo (that I can find).
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
@josiahcarlson Thanks. I'm going to make a couple of small tweaks but other than that it looks good to merge (just a little pre-occupied with some other tasks). |
|
Given the change in licensing of the project, I withdraw my submission. |
@josiahcarlson Hey. I wanted to apologize for what happened to this CR. It got delayed because of politics related to 7.2, and then something was clearly up after the launch. If you're still interested, we would love to have it here: https://github.com/placeholderkv/placeholderkv/issues. |
@madolson You don't need to apologize, I should have pinged the issue months ago. I'll create an updated PR in the next couple weeks (have my own product releases pending). |
Big parts:
Maybe: