Optimize pointer alignment in utf8 validation #61339

jridgewell · 2019-05-30T03:37:39Z

This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned.

I initially thought that this would just move the expensive align_offset call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too...

before: https://rust.godbolt.org/z/WIPvWl
after: https://rust.godbolt.org/z/-jBPoW

Benchmarks

https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment

test from_utf8_2_bytes_fast      ... bench:         310 ns/iter (+/- 42) = 1290 MB/s
test from_utf8_2_bytes_regular   ... bench:         309 ns/iter (+/- 24) = 1294 MB/s

test from_utf8_3_bytes_fast      ... bench:       1,027 ns/iter (+/- 62) = 1168 MB/s
test from_utf8_3_bytes_regular   ... bench:       1,513 ns/iter (+/- 611) = 793 MB/s

test from_utf8_4_bytes_fast      ... bench:       1,788 ns/iter (+/- 26) = 1342 MB/s
test from_utf8_4_bytes_regular   ... bench:       1,907 ns/iter (+/- 181) = 1258 MB/s

test from_utf8_all_bytes_fast    ... bench:       3,463 ns/iter (+/- 97) = 1155 MB/s
test from_utf8_all_bytes_regular ... bench:       4,083 ns/iter (+/- 89) = 979 MB/s

test from_utf8_ascii_fast        ... bench:          88 ns/iter (+/- 4) = 28988 MB/s
test from_utf8_ascii_regular     ... bench:          88 ns/iter (+/- 8) = 28988 MB/s

test from_utf8_cyr_fast          ... bench:       7,707 ns/iter (+/- 531) = 665 MB/s
test from_utf8_cyr_regular       ... bench:       8,202 ns/iter (+/- 135) = 625 MB/s

test from_utf8_enwik8_fast       ... bench:   1,135,756 ns/iter (+/- 84,450) = 8804 MB/s
test from_utf8_enwik8_regular    ... bench:   1,145,468 ns/iter (+/- 79,601) = 8730 MB/s

test from_utf8_jawik10_fast      ... bench:  12,723,844 ns/iter (+/- 473,247) = 785 MB/s
test from_utf8_jawik10_regular   ... bench:  13,384,596 ns/iter (+/- 666,997) = 747 MB/s

test from_utf8_mixed_fast        ... bench:       2,321 ns/iter (+/- 123) = 2081 MB/s
test from_utf8_mixed_regular     ... bench:       2,702 ns/iter (+/- 408) = 1788 MB/s

test from_utf8_mostlyasc_fast    ... bench:         249 ns/iter (+/- 10) = 14666 MB/s
test from_utf8_mostlyasc_regular ... bench:         276 ns/iter (+/- 5) = 13231 MB/s

rust-highfive · 2019-05-30T03:37:50Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @TimNN (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

rust-highfive · 2019-05-30T03:43:49Z

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.

travis_time:end:25b5bbb8:start=1559187520651196332,finish=1559187521388138197,duration=736941865
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
$ export AWS_ACCESS_KEY_ID=AKIA46X5W6CZEJZ6XT55
---
[00:04:36]    Compiling backtrace v0.3.25
[00:04:37] error[E0425]: cannot find value `ptr` in this scope
[00:04:37]     --> src/libcore/str/mod.rs:1420:17
[00:04:37]      |
[00:04:37] 1420 |     let align = ptr.align_offset(usize_bytes);
[00:04:37] 
[00:04:40]    Compiling compiler_builtins v0.1.15
[00:04:40]    Compiling cmake v0.1.38
[00:04:40]    Compiling backtrace-sys v0.1.27

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned. I initially thought that this would just move the expensive `align_offset` call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too... before: https://rust.godbolt.org/z/WIPvWl after: https://rust.godbolt.org/z/-jBPoW https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment ``` test from_utf8_2_bytes_fast ... bench: 310 ns/iter (+/- 42) = 1290 MB/s test from_utf8_2_bytes_regular ... bench: 309 ns/iter (+/- 24) = 1294 MB/s test from_utf8_3_bytes_fast ... bench: 1,027 ns/iter (+/- 62) = 1168 MB/s test from_utf8_3_bytes_regular ... bench: 1,513 ns/iter (+/- 611) = 793 MB/s test from_utf8_4_bytes_fast ... bench: 1,788 ns/iter (+/- 26) = 1342 MB/s test from_utf8_4_bytes_regular ... bench: 1,907 ns/iter (+/- 181) = 1258 MB/s test from_utf8_all_bytes_fast ... bench: 3,463 ns/iter (+/- 97) = 1155 MB/s test from_utf8_all_bytes_regular ... bench: 4,083 ns/iter (+/- 89) = 979 MB/s test from_utf8_ascii_fast ... bench: 88 ns/iter (+/- 4) = 28988 MB/s test from_utf8_ascii_regular ... bench: 88 ns/iter (+/- 8) = 28988 MB/s test from_utf8_cyr_fast ... bench: 7,707 ns/iter (+/- 531) = 665 MB/s test from_utf8_cyr_regular ... bench: 8,202 ns/iter (+/- 135) = 625 MB/s test from_utf8_enwik8_fast ... bench: 1,135,756 ns/iter (+/- 84,450) = 8804 MB/s test from_utf8_enwik8_regular ... bench: 1,145,468 ns/iter (+/- 79,601) = 8730 MB/s test from_utf8_jawik10_fast ... bench: 12,723,844 ns/iter (+/- 473,247) = 785 MB/s test from_utf8_jawik10_regular ... bench: 13,384,596 ns/iter (+/- 666,997) = 747 MB/s test from_utf8_mixed_fast ... bench: 2,321 ns/iter (+/- 123) = 2081 MB/s test from_utf8_mixed_regular ... bench: 2,702 ns/iter (+/- 408) = 1788 MB/s test from_utf8_mostlyasc_fast ... bench: 249 ns/iter (+/- 10) = 14666 MB/s test from_utf8_mostlyasc_regular ... bench: 276 ns/iter (+/- 5) = 13231 MB/s ```

Centril · 2019-05-30T09:17:33Z

r? @SimonSapin

jridgewell · 2019-05-30T14:58:38Z

We could probably make this alignment calculation into a real API? Something like:

struct Alignment {
    block_size: usize,
    align: usize,
}

impl Alignment {
    pub fn is_aligned(&self, offset: usize) -> bool {
        self.align.wrapping_sub(offset) % self.block_size == 0
    }
}

And tied to the lifetime of the pointer.

verification of alignment calculation

[src/main.rs:22] index = 0
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 1
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 2
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 3
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 4
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 5
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 6
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 7
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 8
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 9
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 10
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 11
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 12
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 13
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 14
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 15
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 16
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 17
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 18
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 19
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 20
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 21
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 22
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 23
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 24
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 25
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 26
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 27
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 28
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 29
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 30
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 31
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 32
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 33
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 34
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 35
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 36
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 37
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 38
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 39
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 40
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 41
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 42
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 43
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 44
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 45
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 46
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 47
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 48
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 49
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 50
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 51
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 52
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 53
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 54
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 55
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 56
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 57
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 58
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 59
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 60
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 61
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 62
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 63
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 64
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 65
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 66
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 67
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 68
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 69
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 70
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 71
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 72
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 73
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 74
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 75
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 76
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 77
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 78
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 79
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 80
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 81
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 82
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 83
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 84
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 85
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 86
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 87
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 88
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 89
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 90
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 91
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 92
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 93
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 94
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 95
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 96
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 97
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 98
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 99
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 100
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

SimonSapin · 2019-05-31T11:01:41Z

I’m on vacation this week and this looks non-obvious.

r? @BurntSushi

jridgewell · 2019-06-13T21:23:04Z

Friendly ping @BurntSushi

Sample test to prove that the calculation is the same

#[test]
fn test_pointer_alignment() {
    let bytes = ASCII.as_bytes();
    assert_ne!(intrinsic_alignment(bytes), 0);
    test(bytes);
}

fn intrinsic_alignment(bytes: &[u8]) -> usize {
    let usize_bytes = std::mem::size_of::<usize>();
    bytes.as_ptr().align_offset(usize_bytes)
}

fn test(bytes: &[u8]) {
    let usize_bytes = std::mem::size_of::<usize>();
    let align = bytes.as_ptr().align_offset(usize_bytes);
    let len = bytes.len();
    let mut index = 0;

    while index < len {
        let wrapping_sub = align.wrapping_sub(index) % usize_bytes;
        let align_offset = unsafe { bytes.as_ptr().add(index).align_offset(usize_bytes) };
        assert_eq!(wrapping_sub, align_offset);

        index += 1;
    }
}

static ASCII: &'static str = "\
hello this is a test \
hello this is a test \
hello this is a test \
hello this is a test \
hello this is a test \
hello this is a test \
";

Dylan-DPC-zz · 2019-07-01T15:42:58Z

ping from triage @BurntSushi awaiting your review on this

BurntSushi · 2019-07-05T12:18:15Z

I buy it! @bors r+

Note that it is fairly straight-forward to write a SIMD-accelerated version of an ASCII check using only SSE2 instructions, which means that should work in core on x86_64 only. Here's my version of said algorithm: https://github.com/BurntSushi/bstr/blob/ac4e670d95ab88e41375e0e4f7914e995e93fbed/src/ascii.rs#L118-L197

Also, more generally, I've found that code like this results in both better readability and better codegen if one just commits to using raw pointers everywhere, instead of trying to manage a back-and-forth between raw pointers and indices.

tesuji · 2019-07-05T13:47:26Z

Bors doesn't respond to the approvement action.

Centril · 2019-07-05T13:49:13Z

@bors r=BurntSushi

bors · 2019-07-05T13:49:14Z

📌 Commit 3d2c4ff has been approved by BurntSushi

bors · 2019-07-05T15:30:37Z

⌛ Testing commit 3d2c4ff with merge 1437609c16a41fee5391dc06309830cd38b72854...

…ntSushi Optimize pointer alignment in utf8 validation This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned. I initially thought that this would just move the expensive `align_offset` call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too... before: https://rust.godbolt.org/z/WIPvWl after: https://rust.godbolt.org/z/-jBPoW ## Benchmarks https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment ``` test from_utf8_2_bytes_fast ... bench: 310 ns/iter (+/- 42) = 1290 MB/s test from_utf8_2_bytes_regular ... bench: 309 ns/iter (+/- 24) = 1294 MB/s test from_utf8_3_bytes_fast ... bench: 1,027 ns/iter (+/- 62) = 1168 MB/s test from_utf8_3_bytes_regular ... bench: 1,513 ns/iter (+/- 611) = 793 MB/s test from_utf8_4_bytes_fast ... bench: 1,788 ns/iter (+/- 26) = 1342 MB/s test from_utf8_4_bytes_regular ... bench: 1,907 ns/iter (+/- 181) = 1258 MB/s test from_utf8_all_bytes_fast ... bench: 3,463 ns/iter (+/- 97) = 1155 MB/s test from_utf8_all_bytes_regular ... bench: 4,083 ns/iter (+/- 89) = 979 MB/s test from_utf8_ascii_fast ... bench: 88 ns/iter (+/- 4) = 28988 MB/s test from_utf8_ascii_regular ... bench: 88 ns/iter (+/- 8) = 28988 MB/s test from_utf8_cyr_fast ... bench: 7,707 ns/iter (+/- 531) = 665 MB/s test from_utf8_cyr_regular ... bench: 8,202 ns/iter (+/- 135) = 625 MB/s test from_utf8_enwik8_fast ... bench: 1,135,756 ns/iter (+/- 84,450) = 8804 MB/s test from_utf8_enwik8_regular ... bench: 1,145,468 ns/iter (+/- 79,601) = 8730 MB/s test from_utf8_jawik10_fast ... bench: 12,723,844 ns/iter (+/- 473,247) = 785 MB/s test from_utf8_jawik10_regular ... bench: 13,384,596 ns/iter (+/- 666,997) = 747 MB/s test from_utf8_mixed_fast ... bench: 2,321 ns/iter (+/- 123) = 2081 MB/s test from_utf8_mixed_regular ... bench: 2,702 ns/iter (+/- 408) = 1788 MB/s test from_utf8_mostlyasc_fast ... bench: 249 ns/iter (+/- 10) = 14666 MB/s test from_utf8_mostlyasc_regular ... bench: 276 ns/iter (+/- 5) = 13231 MB/s ```

Centril · 2019-07-05T15:35:45Z

@bors retry rolled up.

bors · 2019-07-05T17:14:10Z

⌛ Testing commit 3d2c4ff with merge d8a21c4d14accefd54b2483f4bae01ca4045fb3e...

alexreg · 2019-07-05T17:52:52Z

@bors retry (same reason as @Centril's above)

bors · 2019-07-05T17:59:18Z

⌛ Testing commit 3d2c4ff with merge 4ef8922d6e2cab03a7329c185fd87f9860a2080e...

…ntSushi Optimize pointer alignment in utf8 validation This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned. I initially thought that this would just move the expensive `align_offset` call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too... before: https://rust.godbolt.org/z/WIPvWl after: https://rust.godbolt.org/z/-jBPoW ## Benchmarks https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment ``` test from_utf8_2_bytes_fast ... bench: 310 ns/iter (+/- 42) = 1290 MB/s test from_utf8_2_bytes_regular ... bench: 309 ns/iter (+/- 24) = 1294 MB/s test from_utf8_3_bytes_fast ... bench: 1,027 ns/iter (+/- 62) = 1168 MB/s test from_utf8_3_bytes_regular ... bench: 1,513 ns/iter (+/- 611) = 793 MB/s test from_utf8_4_bytes_fast ... bench: 1,788 ns/iter (+/- 26) = 1342 MB/s test from_utf8_4_bytes_regular ... bench: 1,907 ns/iter (+/- 181) = 1258 MB/s test from_utf8_all_bytes_fast ... bench: 3,463 ns/iter (+/- 97) = 1155 MB/s test from_utf8_all_bytes_regular ... bench: 4,083 ns/iter (+/- 89) = 979 MB/s test from_utf8_ascii_fast ... bench: 88 ns/iter (+/- 4) = 28988 MB/s test from_utf8_ascii_regular ... bench: 88 ns/iter (+/- 8) = 28988 MB/s test from_utf8_cyr_fast ... bench: 7,707 ns/iter (+/- 531) = 665 MB/s test from_utf8_cyr_regular ... bench: 8,202 ns/iter (+/- 135) = 625 MB/s test from_utf8_enwik8_fast ... bench: 1,135,756 ns/iter (+/- 84,450) = 8804 MB/s test from_utf8_enwik8_regular ... bench: 1,145,468 ns/iter (+/- 79,601) = 8730 MB/s test from_utf8_jawik10_fast ... bench: 12,723,844 ns/iter (+/- 473,247) = 785 MB/s test from_utf8_jawik10_regular ... bench: 13,384,596 ns/iter (+/- 666,997) = 747 MB/s test from_utf8_mixed_fast ... bench: 2,321 ns/iter (+/- 123) = 2081 MB/s test from_utf8_mixed_regular ... bench: 2,702 ns/iter (+/- 408) = 1788 MB/s test from_utf8_mostlyasc_fast ... bench: 249 ns/iter (+/- 10) = 14666 MB/s test from_utf8_mostlyasc_regular ... bench: 276 ns/iter (+/- 5) = 13231 MB/s ```

Centril · 2019-07-05T18:03:59Z

@bors retry rolled up.

Centril · 2019-07-05T18:24:18Z

Let's wait until the issue has resolved itself, @bors r-

RalfJung · 2019-07-05T18:59:55Z

I opened an issue for proper discussion: #62420

In this PR, we might discuss whether it is possible to change it such that it will fall back to the unaligned case if align_offset failed.

jridgewell

I've found that code like this results in both better readability and better codegen if one just commits to using raw pointers everywhere, instead of trying to manage a back-and-forth between raw pointers and indices.

I also think it might be better as pointers, but that's a larger code change to do. I can try that in a follow up PR?

jridgewell · 2019-07-05T19:27:45Z

src/libcore/str/mod.rs

-                ptr.add(index).align_offset(usize_bytes)
-            };
-            if align == 0 {
+            if align.wrapping_sub(index) % usize_bytes == 0 {


I should have read align_offset's doc better, I didn't realize it might not be possible to align the offsets. I can manually check for align != usize::max_value() here to avoid this. Interestingly, it seems that check will disappear entirely on supported platforms? Compare with and without, which generate the same assembly.

This could also be added to a Alignment impl, if we wanted to go that route instead.

BurntSushi · 2019-07-05T20:18:58Z

I also think it might be better as pointers, but that's a larger code change to do. I can try that in a follow up PR?

Oh yeah, of course! I was just throwing it out there. :)

RalfJung · 2019-07-05T20:50:04Z

I should have read align_offset's doc better, I didn't realize it might not be possible to align the offsets. I can manually check for align != usize::max_value() here to avoid this. Interestingly, it seems that check will disappear entirely on supported platforms? Compare with and without, which generate the same assembly.

Thanks! Good to hear that this optimizes away, that means there should be no perf cost. :)

This new version looks good to me from a Miri perspective.

jridgewell · 2019-07-17T03:46:24Z

Does this need anything more from me? I think the miri issue is resolved, just needs a final review and merge.

RalfJung · 2019-07-17T06:25:19Z

Agreed. @BurntSushi the changes since your last r+ are tiny, could you have a look?

BurntSushi · 2019-07-17T10:25:29Z

@bors r+

bors · 2019-07-17T10:25:30Z

📌 Commit 3b6e8ed has been approved by BurntSushi

bors · 2019-07-17T12:13:43Z

⌛ Testing commit 3b6e8ed with merge e34d4ae...

Optimize pointer alignment in utf8 validation This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned. I initially thought that this would just move the expensive `align_offset` call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too... before: https://rust.godbolt.org/z/WIPvWl after: https://rust.godbolt.org/z/-jBPoW ## Benchmarks https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment ``` test from_utf8_2_bytes_fast ... bench: 310 ns/iter (+/- 42) = 1290 MB/s test from_utf8_2_bytes_regular ... bench: 309 ns/iter (+/- 24) = 1294 MB/s test from_utf8_3_bytes_fast ... bench: 1,027 ns/iter (+/- 62) = 1168 MB/s test from_utf8_3_bytes_regular ... bench: 1,513 ns/iter (+/- 611) = 793 MB/s test from_utf8_4_bytes_fast ... bench: 1,788 ns/iter (+/- 26) = 1342 MB/s test from_utf8_4_bytes_regular ... bench: 1,907 ns/iter (+/- 181) = 1258 MB/s test from_utf8_all_bytes_fast ... bench: 3,463 ns/iter (+/- 97) = 1155 MB/s test from_utf8_all_bytes_regular ... bench: 4,083 ns/iter (+/- 89) = 979 MB/s test from_utf8_ascii_fast ... bench: 88 ns/iter (+/- 4) = 28988 MB/s test from_utf8_ascii_regular ... bench: 88 ns/iter (+/- 8) = 28988 MB/s test from_utf8_cyr_fast ... bench: 7,707 ns/iter (+/- 531) = 665 MB/s test from_utf8_cyr_regular ... bench: 8,202 ns/iter (+/- 135) = 625 MB/s test from_utf8_enwik8_fast ... bench: 1,135,756 ns/iter (+/- 84,450) = 8804 MB/s test from_utf8_enwik8_regular ... bench: 1,145,468 ns/iter (+/- 79,601) = 8730 MB/s test from_utf8_jawik10_fast ... bench: 12,723,844 ns/iter (+/- 473,247) = 785 MB/s test from_utf8_jawik10_regular ... bench: 13,384,596 ns/iter (+/- 666,997) = 747 MB/s test from_utf8_mixed_fast ... bench: 2,321 ns/iter (+/- 123) = 2081 MB/s test from_utf8_mixed_regular ... bench: 2,702 ns/iter (+/- 408) = 1788 MB/s test from_utf8_mostlyasc_fast ... bench: 249 ns/iter (+/- 10) = 14666 MB/s test from_utf8_mostlyasc_regular ... bench: 276 ns/iter (+/- 5) = 13231 MB/s ```

bors · 2019-07-17T16:02:02Z

☀️ Test successful - checks-azure
Approved by: BurntSushi
Pushing e34d4ae to master...

rust-highfive assigned TimNN May 30, 2019

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 30, 2019

jridgewell force-pushed the pointer-alignment branch from 9792561 to 3d2c4ff Compare May 30, 2019 03:52

rust-highfive assigned SimonSapin and unassigned TimNN May 30, 2019

rust-highfive assigned BurntSushi and unassigned SimonSapin May 31, 2019

Centril closed this Jul 5, 2019

Centril reopened this Jul 5, 2019

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 5, 2019

Centril mentioned this pull request Jul 5, 2019

Rollup of 14 pull requests #62413

Closed

Centril mentioned this pull request Jul 5, 2019

Rollup of 14 pull requests #62418

Closed

bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jul 5, 2019

RalfJung mentioned this pull request Jul 5, 2019

align_offset guarantees #62420

Closed

jridgewell commented Jul 5, 2019

View reviewed changes

Do not use pointer alignment on unsupported platforms

3b6e8ed

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 17, 2019

bors mentioned this pull request Jul 17, 2019

Update mdbook, cargo, books #62733

Merged

bors added the merged-by-bors This PR was explicitly merged by bors. label Jul 17, 2019

bors merged commit 3b6e8ed into rust-lang:master Jul 17, 2019

jridgewell deleted the pointer-alignment branch July 17, 2019 16:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize pointer alignment in utf8 validation #61339

Optimize pointer alignment in utf8 validation #61339

jridgewell commented May 30, 2019 •

edited

Loading

rust-highfive commented May 30, 2019

rust-highfive commented May 30, 2019

Centril commented May 30, 2019

jridgewell commented May 30, 2019

SimonSapin commented May 31, 2019

jridgewell commented Jun 13, 2019

Dylan-DPC-zz commented Jul 1, 2019

BurntSushi commented Jul 5, 2019

tesuji commented Jul 5, 2019

Centril commented Jul 5, 2019

bors commented Jul 5, 2019

bors commented Jul 5, 2019

Centril commented Jul 5, 2019

bors commented Jul 5, 2019

alexreg commented Jul 5, 2019

bors commented Jul 5, 2019

Centril commented Jul 5, 2019

Centril commented Jul 5, 2019

RalfJung commented Jul 5, 2019

jridgewell left a comment

jridgewell Jul 5, 2019 •

edited

Loading

BurntSushi commented Jul 5, 2019

RalfJung commented Jul 5, 2019

jridgewell commented Jul 17, 2019

RalfJung commented Jul 17, 2019

BurntSushi commented Jul 17, 2019

bors commented Jul 17, 2019

bors commented Jul 17, 2019

bors commented Jul 17, 2019

Optimize pointer alignment in utf8 validation #61339

Optimize pointer alignment in utf8 validation #61339

Conversation

jridgewell commented May 30, 2019 • edited Loading

Benchmarks

rust-highfive commented May 30, 2019

rust-highfive commented May 30, 2019

Centril commented May 30, 2019

jridgewell commented May 30, 2019

SimonSapin commented May 31, 2019

jridgewell commented Jun 13, 2019

Dylan-DPC-zz commented Jul 1, 2019

BurntSushi commented Jul 5, 2019

tesuji commented Jul 5, 2019

Centril commented Jul 5, 2019

bors commented Jul 5, 2019

bors commented Jul 5, 2019

Centril commented Jul 5, 2019

bors commented Jul 5, 2019

alexreg commented Jul 5, 2019

bors commented Jul 5, 2019

Centril commented Jul 5, 2019

Centril commented Jul 5, 2019

RalfJung commented Jul 5, 2019

jridgewell left a comment

Choose a reason for hiding this comment

jridgewell Jul 5, 2019 • edited Loading

Choose a reason for hiding this comment

BurntSushi commented Jul 5, 2019

RalfJung commented Jul 5, 2019

jridgewell commented Jul 17, 2019

RalfJung commented Jul 17, 2019

BurntSushi commented Jul 17, 2019

bors commented Jul 17, 2019

bors commented Jul 17, 2019

bors commented Jul 17, 2019

jridgewell commented May 30, 2019 •

edited

Loading

jridgewell Jul 5, 2019 •

edited

Loading