Make [u8]::reverse() 5x faster #41764

Merged: 3 commits merged into rust-lang:master from scottmcm:faster-reverse on May 10, 2017

scottmcm (Member) commented May 5, 2017

Since LLVM doesn't vectorize the loop for us, do unaligned reads of a larger type and use LLVM's bswap intrinsic to do the reversing of the actual bytes. cfg!-restricted to x86 and x86_64, as I assume it wouldn't help on things like ARMv5.
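
As a rough illustration only (a hand-written sketch for this write-up, not the PR's actual diff; names like `reverse_bytes` and `CHUNK` are mine), the chunk-and-bswap idea looks something like this:

```rust
use std::ptr;

/// Sketch of the chunk-and-bswap idea: swap usize-sized chunks from both
/// ends, byte-reversing each chunk with swap_bytes (LLVM's bswap), then
/// finish the small middle section with ordinary byte swaps. The real change
/// additionally gates the fast path behind a cfg!(target_arch = ...) check.
fn reverse_bytes(slice: &mut [u8]) {
    const CHUNK: usize = std::mem::size_of::<usize>();
    let base = slice.as_mut_ptr();
    let mut i = 0;
    let mut j = slice.len();

    // As long as the front and back chunks cannot overlap, swap whole chunks.
    while j - i >= 2 * CHUNK {
        unsafe {
            let front = base.add(i) as *mut usize;
            let back = base.add(j - CHUNK) as *mut usize;
            // Unaligned loads, reverse the bytes within each word, then
            // store each word at the opposite end of the slice.
            let a = ptr::read_unaligned(front).swap_bytes();
            let b = ptr::read_unaligned(back).swap_bytes();
            ptr::write_unaligned(front, b);
            ptr::write_unaligned(back, a);
        }
        i += CHUNK;
        j -= CHUNK;
    }

    // Fewer than two chunks left: plain element-by-element swaps.
    while i + 1 < j {
        slice.swap(i, j - 1);
        i += 1;
        j -= 1;
    }
}
```

Because the fast path only runs while at least two non-overlapping chunks remain, the tail is always handled by ordinary swaps, so correctness does not depend on the slice length being a multiple of the chunk size.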

Also makes [u16]::reverse() a more modest 1.5x faster by loading/storing u32 and swapping the u16s with ROT16.
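
A minimal standalone illustration of that u16 trick (again my own example, not the PR's code): viewed as one u32, a pair of adjacent u16s is swapped by a 16-bit rotation, and this works regardless of endianness.

```rust
use std::ptr;

/// Swap two adjacent u16s by loading them as one (possibly unaligned) u32
/// and rotating it by 16 bits.
fn swap_u16_pair(pair: &mut [u16; 2]) {
    let p = pair.as_mut_ptr() as *mut u32;
    unsafe {
        let word = ptr::read_unaligned(p);
        ptr::write_unaligned(p, word.rotate_left(16));
    }
}

fn main() {
    let mut pair = [0xAAAAu16, 0x5555];
    swap_u16_pair(&mut pair);
    assert_eq!(pair, [0x5555, 0xAAAA]);
}
```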

Thank you ptr::*_unaligned for making this easy :)

Benchmark results (from my i5-2500K):

```text
# Before
test slice::reverse_u8      ... bench:  273,836 ns/iter (+/- 15,592) =  3829 MB/s
test slice::reverse_u16     ... bench:  139,793 ns/iter (+/- 17,748) =  7500 MB/s
test slice::reverse_u32     ... bench:   74,997 ns/iter  (+/- 5,130) = 13981 MB/s
test slice::reverse_u64     ... bench:   47,452 ns/iter  (+/- 2,213) = 22097 MB/s

# After
test slice::reverse_u8      ... bench:   52,170 ns/iter (+/- 3,962) = 20099 MB/s
test slice::reverse_u16     ... bench:   93,330 ns/iter (+/- 4,412) = 11235 MB/s
test slice::reverse_u32     ... bench:   74,731 ns/iter (+/- 1,425) = 14031 MB/s
test slice::reverse_u64     ... bench:   47,556 ns/iter (+/- 3,025) = 22049 MB/s
```

If you're curious about the assembly, instead of doing this

```
movzx	eax, byte ptr [rdi]
movzx	ecx, byte ptr [rsi]
mov	byte ptr [rdi], cl
mov	byte ptr [rsi], al
```

it does this

```
mov	rax, qword ptr [rdx]
mov	rbx, qword ptr [r11 + rcx - 8]
bswap	rbx
mov	qword ptr [rdx], rbx
bswap	rax
mov	qword ptr [r11 + rcx - 8], rax
```

rust-highfive (Collaborator) commented May 5, 2017

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

brson (Contributor) commented May 5, 2017

This is a neat patch!

It combines several low-level abstractions to do something very precise and unusual, it's written clearly, and all the abstractions disappear into a blip of assembly. Good Rust.

If you think the compiler should be doing this optimization (or better) itself, maybe add a comment saying so, in case it can someday be removed.

It looks right to me but somebody else should review. Maybe r? @bluss?

arielb1 (Contributor) commented May 9, 2017

@bluss are you planning on reviewing this PR?

alexcrichton added the T-libs label May 9, 2017

brson (Contributor) commented May 9, 2017

@bors r+

bors (Contributor) commented May 9, 2017

📌 Commit da91361 has been approved by brson

frewsxcv added a commit to frewsxcv/rust that referenced this pull request May 10, 2017

Rollup merge of #41764 - scottmcm:faster-reverse, r=brson
Make [u8]::reverse() 5x faster

bors added a commit that referenced this pull request May 10, 2017

Auto merge of #41869 - frewsxcv:rollup, r=frewsxcv
Rollup of 11 pull requests

- Successful merges: #41531, #41536, #41659, #41684, #41764, #41815, #41843, #41861, #41862, #41863, #41864
- Failed merges:

frewsxcv added a commit to frewsxcv/rust that referenced this pull request May 10, 2017

Rollup merge of #41764 - scottmcm:faster-reverse, r=brson
Make [u8]::reverse() 5x faster

bors added a commit that referenced this pull request May 10, 2017

Auto merge of #41872 - frewsxcv:rollup, r=frewsxcv
Rollup of 9 pull requests

- Successful merges: #41531, #41536, #41764, #41809, #41815, #41861, #41862, #41863, #41864
- Failed merges:

bors (Contributor) commented May 10, 2017

⌛️ Testing commit da91361 with merge 2b97174...

bors added a commit that referenced this pull request May 10, 2017

Auto merge of #41764 - scottmcm:faster-reverse, r=brson
Make [u8]::reverse() 5x faster

bors (Contributor) commented May 10, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: brson
Pushing 2b97174 to master...

bors merged commit da91361 into rust-lang:master May 10, 2017

2 checks passed:

- continuous-integration/travis-ci/pr: The Travis CI build passed
- homu: Test successful

scottmcm deleted the scottmcm:faster-reverse branch May 10, 2017

gnzlbg (Contributor) commented May 18, 2017

LLVM's byte swap intrinsic works fine on ARM as well :/

scottmcm (Member) commented May 18, 2017

@gnzlbg Right, but older ARM (like v4 & v5) doesn't have fast unaligned access, so I doubted that read_unaligned+swap_bytes+write_unaligned would be better there than just directly reversing the bytes. Hopefully someone has more hardware for benchmarking than I and can make a better cfg! check that applies this on the ARM versions where it's worthwhile, if any. (As a guess, maybe ARMv6+ that only accesses Normal, not Device, memory?)
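
For context, the kind of gate being discussed could be shaped roughly like the sketch below (reusing the hypothetical `reverse_bytes` chunk-swapping sketch from earlier in this thread; the `target_arch` list is illustrative, and picking the right ARM targets is exactly the open question):

```rust
/// Sketch only: dispatch between the fast chunked path and a portable
/// fallback based on the target architecture. Which ARM variants belong on
/// the fast list would need benchmarking on real hardware.
fn reverse_u8_slice(slice: &mut [u8]) {
    let fast_unaligned = cfg!(any(target_arch = "x86", target_arch = "x86_64"));
    if fast_unaligned {
        // Chunked read_unaligned + swap_bytes + write_unaligned path,
        // as in the reverse_bytes sketch shown earlier.
        reverse_bytes(slice);
    } else {
        // Portable fallback: swap elements one at a time.
        let mut i = 0;
        let mut j = slice.len();
        while i + 1 < j {
            slice.swap(i, j - 1);
            i += 1;
            j -= 1;
        }
    }
}
```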

gnzlbg (Contributor) commented May 18, 2017

> @gnzlbg Right, but older ARM (like v4 & v5) doesn't have fast unaligned access,

Oh, I did not know that.

> Hopefully someone has more hardware for benchmarking than I and can make a better cfg! check that applies this on the ARM versions where it's worthwhile, if any. (As a guess, maybe ARMv6+ that only accesses Normal, not Device, memory?)

This makes sense; maybe we should file an issue and mark it with the easy tag + ARM or something. One would just need to try this on ARMv6/v7/v8 and choose an appropriate set of features.
