Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of Rabin fingerprint calculation #24

Merged
merged 3 commits into from
Feb 12, 2020

Conversation

MichaelEischer
Copy link
Member

These patches improve the chunking throughput by about 10-23% depending on the CPU used. The speedup is more noticeable on older CPUs.

The test were run on Linux using go 1.11.6 with go test -bench 'BenchmarkChunker$' -cpu 1 -benchtime 20s. I've also repeated the complete performance measurements three times. As the results were quite stable (except for 1 extreme slow outlier), I've just included the results from the second measurement run.

Would it be possible to create a new release for use in restic? The version there is already missing commit 7c2a180 which improves the performance to the baseline for the first commit in the merge request. For tag v0.2.0, which is the version used in restic, I've measured the following throughput:

Old CPU: Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
BenchmarkChunker                     200         169683667 ns/op         197.75 MB/s

Modern CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
BenchmarkChunker                     500          64142688 ns/op         523.12 MB/s

This totals up to 15-36% more throughput than v0.2.0 .

wpos is once again limited using modulo. The performance issue with
> wpos = wpos % windowsSize
Did not stem from the usage of modulo itself but from wpos being a
signed integer which forces the compiler to emit a modulo implementation
that can handle negative numbers instead of the much faster
> wpos = wpos & 0x3f
This optimized version is possible as windowSize is equal to 64.

Performance:
Old CPU: Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
benchmark                      old MB/s     new MB/s     speedup
BenchmarkChunker               218.53       239.68       1.10x

Modern CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
benchmark                      old MB/s     new MB/s     speedup
BenchmarkChunker               542.50       542.46       1.00x
Bounding wpos before using it or afterwards makes in general no
difference. Here it allows the compiler to elide array bound checks when
accessing win.

Performance:
Old CPU: Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
benchmark                      old MB/s     new MB/s     speedup
BenchmarkChunker               239.68       243.10       1.01x

Modern CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
benchmark                      old MB/s     new MB/s     speedup
BenchmarkChunker               542.46       547.28       1.01x
The code only uses polynomials of degree up to 53. Therefore let the compiler
know of this restriction which avoids costly checks to handle the case
where polShift is larger than the value type width, which would result
in unexpected behavior on x86.
> byte(digest >> polShift)

Performance:
Old CPU: Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
benchmark                      old MB/s     new MB/s     speedup
BenchmarkChunker               243.10       268.83       1.11x

Modern CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
benchmark                      old MB/s     new MB/s     speedup
BenchmarkChunker               547.28       602.37       1.10x
@MichaelEischer
Copy link
Member Author

The speedups shown in the commit messages are still valid (to a large degree) for recent go versions such as 1.13.5 .

@fd0
Copy link
Member

fd0 commented Feb 12, 2020

Oh, wow, this PR fell through the cracks, I'm sorry for not responding earlier. I've rerun the benchmarks 10 times and benchstat tells me that it's a solid 8-10% speedup for my CPU (Intel(R) Core(TM) i5-6440HQ CPU @ 2.60GHz):

$ benchstat before after
name               old time/op   new time/op   delta
ChunkerWithSHA256    163ms ± 0%    155ms ± 0%  -4.79%   (p=0.000 n=8+10)
Chunker             70.6ms ± 0%   65.2ms ± 2%  -7.65%   (p=0.000 n=9+10)
NewChunker          33.9µs ± 5%   34.3µs ± 4%    ~     (p=0.190 n=10+10)

name               old speed     new speed     delta
ChunkerWithSHA256  206MB/s ± 0%  216MB/s ± 0%  +5.04%   (p=0.000 n=8+10)
Chunker            475MB/s ± 0%  515MB/s ± 2%  +8.30%   (p=0.000 n=9+10)

Thank you very much!

fd0 added a commit that referenced this pull request Feb 12, 2020
Improve performance of Rabin fingerprint calculation
@fd0 fd0 merged commit 42a27a1 into restic:master Feb 12, 2020
@fd0
Copy link
Member

fd0 commented Feb 12, 2020

I've also released 0.3.0 with all the new commits.

@MichaelEischer
Copy link
Member Author

Should I open a pull request to update the chunker version in restic?

@fd0
Copy link
Member

fd0 commented Feb 12, 2020

Already done :)

See restic/restic#2576

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants