CRC64 can be faster, if we want #11988

Closed
wants to merge 14 commits into from

@josiahcarlson josiahcarlson commented Mar 31, 2023

Big parts:

  • 53-73% faster on Xeon 2670 v0 @ 2.6 GHz
  • 2-2.5x faster on Core i3 8130U @ 2.2 GHz
  • 1.6-2.46 bytes/cycle on i3 8130U
  • likely >2x faster than crcspeed on newer CPUs with more resources than a 2012-era Xeon 2670
  • crc64 combine function typically runs in <50 nanoseconds with vector + cache optimizations (~8 microseconds without vector optimizations, ~80 microseconds without cache, up to 900 microseconds without either; the combination is extra effective)
  • redis-benchmark gets a --crc <buffer size> testing option
  • still single-threaded

Maybe:

  • A test should call redis-benchmark, and make sure that all of our outputs have matching crcs
  • Macros for non-linux platform testing, and/or removing benchmarks in those cases where nstime() is not provided
  • env options for defines in crccombine.h?
  • sanity check
  • only keep gf_matrix_times_vec2, remove other variations as they seem to be slower on all platforms tested

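
For readers unfamiliar with what a "crc64 combine function" does, here is a minimal sketch of the unoptimized idea, assuming the zlib-style GF(2) matrix approach and the reflected form of the Jones polynomial with init 0 and no final xor. The names `crc64_bitwise` and `crc64_combine_sketch` are hypothetical, and this is not the PR's vectorized or cached code; it only shows why two independently computed CRCs can be merged without rehashing the data.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: bit-reflected form of the Jones polynomial 0xad93d23594c935a9. */
#define CRC64_POLY_REFLECTED UINT64_C(0x95ac9329ac4bc9b5)
#define GF2_DIM 64

/* Reference bit-at-a-time CRC64 (init 0, no final xor); slow, for checking only. */
uint64_t crc64_bitwise(uint64_t crc, const unsigned char *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1) ? CRC64_POLY_REFLECTED : 0);
    }
    return crc;
}

/* Multiply a 64x64 GF(2) matrix (one uint64_t per column) by a bit vector. */
static uint64_t gf2_matrix_times(const uint64_t *mat, uint64_t vec) {
    uint64_t sum = 0;
    while (vec) {
        if (vec & 1) sum ^= *mat;
        vec >>= 1;
        mat++;
    }
    return sum;
}

/* square = mat * mat, column by column. */
static void gf2_matrix_square(uint64_t *square, const uint64_t *mat) {
    for (int n = 0; n < GF2_DIM; n++)
        square[n] = gf2_matrix_times(mat, mat[n]);
}

/* Given crc1 = CRC(A), crc2 = CRC(B) and len2 = length of B in bytes,
 * return CRC(A followed by B) without touching the data again. */
uint64_t crc64_combine_sketch(uint64_t crc1, uint64_t crc2, uint64_t len2) {
    uint64_t even[GF2_DIM], odd[GF2_DIM], row = 1;
    if (len2 == 0) return crc1;

    /* Operator that advances a CRC over one zero bit. */
    odd[0] = CRC64_POLY_REFLECTED;
    for (int n = 1; n < GF2_DIM; n++) {
        odd[n] = row;
        row <<= 1;
    }
    gf2_matrix_square(even, odd); /* two zero bits */
    gf2_matrix_square(odd, even); /* four zero bits */

    /* Apply len2 zero bytes to crc1 by repeated squaring; the first
     * square below yields the operator for one zero byte. */
    do {
        gf2_matrix_square(even, odd);
        if (len2 & 1) crc1 = gf2_matrix_times(even, crc1);
        len2 >>= 1;
        if (len2 == 0) break;
        gf2_matrix_square(odd, even);
        if (len2 & 1) crc1 = gf2_matrix_times(odd, crc1);
        len2 >>= 1;
    } while (len2 != 0);

    return crc1 ^ crc2;
}
```

Rebuilding and squaring the matrices on every call is the sort of work the "cache" and "vector" optimizations in the list above would be expected to avoid, which would be consistent with the gap between the microsecond and nanosecond figures.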
@josiahcarlson
Author

Some performance comparison information from the Xeon 2670:

$ cat yes_static_yes_vector.txt
algorithm,buffer,performance,crc64_matches
crc_1byte,5480000,407,1
crcspeed,5480000,1402,1
crcdual,5480000,2189,1
crctri,5480000,2346,1

operation,size,nanoseconds
init_64,18446744073709551615,132411
largest_combine,18446744073709551615,18
hash_as_size_combine,16673763985951615754,58
combine,5480000,27

algorithm,buffer,performance,crc64_matches
crc_1byte,548000,408,1
crcspeed,548000,1404,1
crcdual,548000,2199,1
crctri,548000,2356,1

operation,size,nanoseconds
init_64,18446744073709551615,134355
largest_combine,18446744073709551615,17
hash_as_size_combine,13081725533802567739,50
combine,548000,25

$ cat no_static_yes_vector.txt
algorithm,buffer,performance,crc64_matches
crc_1byte,5480000,407,1
crcspeed,5480000,1409,1
crcdual,5480000,2219,1
crctri,5480000,2488,1

operation,size,nanoseconds
init_64,18446744073709551615,114511
largest_combine,18446744073709551615,82544
hash_as_size_combine,16673763985951615754,82128
combine,5480000,31155

algorithm,buffer,performance,crc64_matches
crc_1byte,548000,408,1
crcspeed,548000,1415,1
crcdual,548000,2036,1
crctri,548000,2052,1

operation,size,nanoseconds
init_64,18446744073709551615,114131
largest_combine,18446744073709551615,82594
hash_as_size_combine,13081725533802567739,82100
combine,548000,27400

$ cat no_static_no_vector_no_switch.txt
algorithm,buffer,performance,crc64_matches
crc_1byte,5480000,407,1
crcspeed,5480000,1402,1
crcdual,5480000,2080,1
crctri,5480000,2131,1

operation,size,nanoseconds
init_64,18446744073709551615,914909
largest_combine,18446744073709551615,702088
hash_as_size_combine,16673763985951615754,695149
combine,5480000,238155

algorithm,buffer,performance,crc64_matches
crc_1byte,548000,408,1
crcspeed,548000,1405,1
crcdual,548000,1272,1
crctri,548000,961,1

operation,size,nanoseconds
init_64,18446744073709551615,907235
largest_combine,18446744073709551615,701880
hash_as_size_combine,13081725533802567739,693871
combine,548000,202534

@madolson
Contributor

I definitely want to get this merged in if it's faster. One additional note: we also want to improve CRC16 speed; it's a pretty big chunk of the CME cost.

@josiahcarlson
Author

josiahcarlson commented Mar 31, 2023

We can definitely make CRC16 faster with the same trick, though we'll need to add a separate 8k cache and a model argument to the combine function. I'll see what I can do this weekend / Monday.

ETA: it's definitely faster on all hardware and sizes I've tested, with vector + static caching of the crc combine arrays.

@josiahcarlson
Author

Actually, how long are the values that are passed into crc16? That really determines whether it's worth the time to do this for crc16 too. At least on my 2670, inputs > 128 bytes are faster with crcdual than crcspeed, but they need to reach about 2k before crctri is faster than crcdual.

Are keys averaging >128 bytes?
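
To make the crossover points concrete, here is a small sketch of size-based selection. The thresholds are only the rough figures quoted above for one Xeon 2670 ("about 2k" is taken as 2048), and the enum and function names are hypothetical, not part of the patch.

```c
#include <stddef.h>

/* Hypothetical labels for the three variants named in the benchmarks. */
typedef enum {
    CRC_VARIANT_SPEED = 1, /* crcspeed: single stream */
    CRC_VARIANT_DUAL  = 2, /* crcdual: two streams + one combine */
    CRC_VARIANT_TRI   = 3  /* crctri: three streams + two combines */
} crc_variant;

/* Pick a variant from the input length. The cutoffs are assumptions
 * based on one machine and would need re-measuring elsewhere. */
crc_variant pick_crc_variant(size_t len) {
    if (len <= 128) return CRC_VARIANT_SPEED;
    if (len < 2048) return CRC_VARIANT_DUAL;
    return CRC_VARIANT_TRI;
}
```

Under these assumed cutoffs, a typical short key would never leave the single-stream path, which is why the key-length question matters for crc16.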

@oranagra
Member

oranagra commented Apr 2, 2023

Thanks @josiahcarlson.
My 2 cents:

  • I don't think we want to optimize for keys longer than 128 bytes.
  • I'm not sure redis-benchmark.c is the right place for the crc benchmark; it may be more suitable to follow the dict benchmarks in the REDIS_TEST infra.

@madolson do you think you'll have the capacity to review this?

@madolson madolson self-requested a review April 2, 2023 23:02
@madolson
Contributor

madolson commented Apr 3, 2023

Yeah, I'll take a look a bit later this week.

@josiahcarlson
Author

josiahcarlson commented Apr 4, 2023

@oranagra In attempting to put this into crc64.c, to be run as part of redis-server test crc64, I discovered that there seems to be an issue compiling. With the combination of -std=c99 or -std=c11 and the .h files needed by redis-server, the following symbols aren't available because c standards != posix: CLOCK_MONOTONIC, CLOCK_REALTIME, clock_getres, clock_gettime. Those are all necessary for nanosecond-resolution timing.

Trying to wrap the #include with some define magic didn't seem to work.

On the other hand, redis-benchmark seems to not include whichever .h file that causes the posix / c standard issue, and at least seems to compile for me. I'll see about trying to make it work for your workflow here, assuming I get output from it.

ETA: +1 on not worrying about crc16.
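
For context on the missing POSIX symbols: with a strict `-std=c99` or `-std=c11`, glibc hides `clock_gettime` and the `CLOCK_*` constants unless a feature-test macro is defined before the first system header is included. The sketch below shows that general technique in isolation. The helper name `monotonic_ns` is hypothetical, and whether this fits the header order inside redis-server is not something this thread establishes.

```c
/* Must come before any #include, or the first header seen decides
 * which declarations are visible. */
#define _POSIX_C_SOURCE 200809L

#include <stdint.h>
#include <time.h>

/* Nanoseconds from the monotonic clock; returns 0 if the clock is unavailable. */
uint64_t monotonic_ns(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) return 0;
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}
```

Timing a region is then two calls and a subtraction. If an earlier header in the same translation unit has already pulled in `<time.h>` without the macro, defining it later has no effect, which may be the kind of failure described above.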

@madolson
Contributor

madolson commented Apr 4, 2023

ETA: +1 on not worrying about crc16.

Agree on this too. I was playing around with merging the crc16 code and was only testing 50 byte keys and was seeing some benefit. Didn't fully think through that there would be a limit to parallelizing.

@josiahcarlson
Author

The parallelization offers linear speedup by itself only if you have extra hardware (my 2670 doesn't, but my 8130U does, as does a lot of hardware). With the merge function's cost being nonzero (albeit tiny at 15-60 ns, or roughly the time to hash 40-50 bytes), the combination looks something like:

total_time = <size> / <crcspeed bytes/second> / <possibly non-linear parallelization_factor> + <parts - 1> * <15-60 ns>
reported_bytes_per_second = size / total_time

... and the resulting 15-60 ns per merge ends up being substantial at sizes < 1k.
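
The model above can be written out as a small function to see where the merge cost starts to dominate. This is only a sketch of the formula as stated; the function and parameter names are hypothetical, and any numbers fed to it are illustrative inputs rather than measurements.

```c
/* reported_bytes_per_second from the model above.
 *   size_bytes            : buffer length
 *   base_bytes_per_second : single-stream (crcspeed) throughput
 *   parallel_factor       : speedup from running streams side by side
 *   parts                 : number of streams (1 means no merge at all)
 *   merge_seconds         : cost of one combine call, e.g. 15e-9 to 60e-9 */
double modeled_bytes_per_second(double size_bytes,
                                double base_bytes_per_second,
                                double parallel_factor,
                                unsigned parts,
                                double merge_seconds) {
    double hash_time = size_bytes / base_bytes_per_second / parallel_factor;
    double merge_time = (parts > 0 ? (double)(parts - 1) : 0.0) * merge_seconds;
    return size_bytes / (hash_time + merge_time);
}
```

As an illustration with assumed inputs: at a base rate of 1e9 bytes/second, a perfect 2x parallel factor, and a 50 ns merge, a 100-byte input spends 50 ns hashing and 50 ns merging, so the modeled throughput is back at the single-stream rate and the second stream buys nothing.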

If you are seeing improvement in crc16 speed, it's likely due to the re-arrangement of the loop. I found that putting the increment at the top (as it is now) got 5-10%, depending, because otherwise the compiler will put the increment just before the loop compare + jump, causing a pipeline stall.

Contributor

@madolson madolson left a comment

I'm mostly convinced the implementation (or at least the theory behind it) is sound. Some comments about clarity and organization.

env options for defines in crccombine.h?

If they are better for modern hardware (last couple of years) then I don't think we need to expose them as environments.

sanity check

Ideally yeah we would add some simple testing. We still don't have great unit testing as a project. There are some tests in crc64.c that probably work for now.

only keep gf_matrix_times_vec2, remove other variations as they seem to be slower on all platforms tested.

I would be pro removing code we don't think is useful.

src/crc64.c Outdated
// https://graphics.stanford.edu/~seander/bithacks.html#ReverseParallel
// Extended to 64 bits, and added byteswap for final 3 steps.
// 16-30x 64-bit operations, no comparisons (16 for native byteswap, 30 for pure C)
// should be ~9-16x faster than before.
Contributor

Once this is committed there will be no before.
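
For reference, the technique the code comment describes (the Stanford "ReverseParallel" bit hack widened to 64 bits) can be sketched in pure C as below. The function name is hypothetical and this is not the code under review; the last three steps together amount to a byte swap, which is where a native byteswap instruction could replace them.

```c
#include <stdint.h>

/* Reverse the bit order of a 64-bit word with masks and shifts only. */
uint64_t reverse_bits64(uint64_t v) {
    /* Swap adjacent bits, then bit pairs, then nibbles. */
    v = ((v >> 1) & UINT64_C(0x5555555555555555)) | ((v & UINT64_C(0x5555555555555555)) << 1);
    v = ((v >> 2) & UINT64_C(0x3333333333333333)) | ((v & UINT64_C(0x3333333333333333)) << 2);
    v = ((v >> 4) & UINT64_C(0x0F0F0F0F0F0F0F0F)) | ((v & UINT64_C(0x0F0F0F0F0F0F0F0F)) << 4);
    /* The remaining three steps reverse the byte order; on GCC/Clang
     * they could be replaced by __builtin_bswap64(v). */
    v = ((v >> 8) & UINT64_C(0x00FF00FF00FF00FF)) | ((v & UINT64_C(0x00FF00FF00FF00FF)) << 8);
    v = ((v >> 16) & UINT64_C(0x0000FFFF0000FFFF)) | ((v & UINT64_C(0x0000FFFF0000FFFF)) << 16);
    v = (v >> 32) | (v << 32);
    return v;
}
```

There are no loops or comparisons, which matches the "no comparisons" note in the comment under review.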

@josiahcarlson
Author

josiahcarlson commented Apr 11, 2023 via email

@filipecosta90 filipecosta90 added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Apr 17, 2023
josiahcarlson and others added 2 commits April 26, 2023 19:10
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
@madolson
Contributor

@filipecosta90 I don't think we have any benchmarks that would test this, right? I believe none of these configurations use a replica, so this would only be hit if they were using restore, which we don't have a benchmark around.

Contributor

@madolson madolson left a comment

Hey, sorry, I think we just missed each other for the review. The next revision makes a lot more sense and is pretty clean. Some minor comments. I'm going to check it out as well, run it on a couple of instances, and report the performance. (That'll be tomorrow, though.)

src/Makefile Outdated
@@ -116,6 +116,7 @@ endif
# Override default settings if possible
-include .make-settings

STD+=-DLINUX_PLATFORM=1 -DREDIS_TEST=1
Contributor

Suggested change
STD+=-DLINUX_PLATFORM=1 -DREDIS_TEST=1

What is LINUX_PLATFORM supposed to do?

Author

This was to enable nstime on my local machine.

Contributor

What is the configuration of your local machine? I don't really want to add random stuff into the makefile.

Author

My apologies. New patch uses env var to decide, as there is no other way that REDIS_TEST is defined in the repo (that I can find).

josiahcarlson and others added 6 commits July 3, 2023 16:34
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
@josiahcarlson josiahcarlson requested a review from madolson July 4, 2023 00:14
@madolson
Contributor

@josiahcarlson Thanks. I'm going to make a couple of small tweaks but other than that it looks good to merge (just a little pre-occupied with some other tasks).

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@josiahcarlson
Author

Given the change in licensing of the project, I withdraw my submission.

@madolson
Contributor

@josiahcarlson Hey. I wanted to apologize for what happened to this CR. It got delayed because of politics related to 7.2, and then something was clearly up after the launch. If you're still interested, we would love to have it here: https://github.com/placeholderkv/placeholderkv/issues.

@josiahcarlson
Author

@madolson You don't need to apologize, I should have pinged the issue months ago. I'll create an updated PR in the next couple weeks (have my own product releases pending).
