-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of Rabin fingerprint calculation #24
Conversation
wpos is once again limited using modulo. The performance issue with > wpos = wpos % windowsSize Did not stem from the usage of modulo itself but from wpos being a signed integer which forces the compiler to emit a modulo implementation that can handle negative numbers instead of the much faster > wpos = wpos & 0x3f This optimized version is possible as windowSize is equal to 64. Performance: Old CPU: Intel(R) Xeon(R) CPU E5506 @ 2.13GHz benchmark old MB/s new MB/s speedup BenchmarkChunker 218.53 239.68 1.10x Modern CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz benchmark old MB/s new MB/s speedup BenchmarkChunker 542.50 542.46 1.00x
Bounding wpos before using it or afterwards makes in general no difference. Here it allows the compiler to elide array bound checks when accessing win. Performance: Old CPU: Intel(R) Xeon(R) CPU E5506 @ 2.13GHz benchmark old MB/s new MB/s speedup BenchmarkChunker 239.68 243.10 1.01x Modern CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz benchmark old MB/s new MB/s speedup BenchmarkChunker 542.46 547.28 1.01x
The code only uses polynomials of degree up to 53. Therefore let the compiler know of this restriction which avoids costly checks to handle the case where polShift is larger than the value type width, which would result in unexpected behavior on x86. > byte(digest >> polShift) Performance: Old CPU: Intel(R) Xeon(R) CPU E5506 @ 2.13GHz benchmark old MB/s new MB/s speedup BenchmarkChunker 243.10 268.83 1.11x Modern CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz benchmark old MB/s new MB/s speedup BenchmarkChunker 547.28 602.37 1.10x
The speedups shown in the commit messages are still valid (to a large degree) for recent go versions such as 1.13.5 . |
Oh, wow, this PR fell through the cracks, I'm sorry for not responding earlier. I've rerun the benchmarks 10 times and benchstat tells me that it's a solid 8-10% speedup for my CPU (Intel(R) Core(TM) i5-6440HQ CPU @ 2.60GHz):
Thank you very much! |
Improve performance of Rabin fingerprint calculation
I've also released 0.3.0 with all the new commits. |
Should I open a pull request to update the chunker version in restic? |
Already done :) |
These patches improve the chunking throughput by about 10-23% depending on the CPU used. The speedup is more noticeable on older CPUs.
The test were run on Linux using go 1.11.6 with
go test -bench 'BenchmarkChunker$' -cpu 1 -benchtime 20s
. I've also repeated the complete performance measurements three times. As the results were quite stable (except for 1 extreme slow outlier), I've just included the results from the second measurement run.Would it be possible to create a new release for use in restic? The version there is already missing commit 7c2a180 which improves the performance to the baseline for the first commit in the merge request. For tag v0.2.0, which is the version used in restic, I've measured the following throughput:
This totals up to 15-36% more throughput than v0.2.0 .