Add fast path for ASCII in UTF-8 validation #30740
Conversation
rust-highfive assigned brson (Jan 6, 2016)
r? @brson (rust_highfive has picked a reviewer for you, use r? to override)
Benchmarks using long texts are here: https://gist.github.com/bluss/bf45e07e711238e22b7a. There is a 2-3% slowdown on Japanese and Cyrillic texts that are mostly non-ASCII. I don't have a problem championing that regression, given the speedup on UTF-8 validation for predominantly ASCII input. The example texts are pretty arbitrary, the Wikipedia texts a *little* less so.
shepmaster reviewed (Jan 6, 2016)
@@ -468,6 +468,18 @@ fn test_is_utf8() {
    assert!(from_utf8(&[0xEF, 0xBF, 0xBF]).is_ok());
    assert!(from_utf8(&[0xF0, 0x90, 0x80, 0x80]).is_ok());
    assert!(from_utf8(&[0xF4, 0x8F, 0xBF, 0xBF]).is_ok());

    // deny embedded in long stretches of ascii
shepmaster (Member), Jan 6, 2016
I don't really know what this specific set of tests is doing.
I always have a bit of a sad when there are these giant "test everything" tests; my personal pref would be another test like is_utf8_is_not_tricked_by_non_ascii_in_long_stretches_of_ascii. No need to add test_, no need to have a comment, a failed test tells you what failed.
bluss (Author, Contributor), Jan 6, 2016
Entirely reasonable, no reason to share test name there, no common setup or anything. Fixed to have its own test function.
shepmaster reviewed (Jan 6, 2016)
for i in 32..64 {
    for elt in data.iter_mut() {
        *elt = 0;
    }
shepmaster (Member), Jan 6, 2016
Just move the array into here, instead of re-zeroing it?

    for i in 32..64 {
        let mut data = [0; 128];
        // ...
    }
shepmaster reviewed (Jan 6, 2016)
    debug_assert!(from <= v.len());
    slice::from_raw_parts(v.as_ptr().offset(from as isize),
                          v.len() - from)
}
shepmaster (Member), Jan 6, 2016
Yeah, which I think is why I thought it existed for slices. Maybe a future PR :-)
bluss force-pushed the bluss:ascii-is-the-best branch from 7ddcac2 to 6fd108d (Jan 6, 2016)
shepmaster reviewed (Jan 6, 2016)
let ptr = v.as_ptr();

let mut offset = 0;
if len >= 2 * usize::BYTES {
shepmaster (Member), Jan 6, 2016
Apologies, I wasn't very clear. I guess it's a two-part question:
- Why unroll at all?
- Why only unroll by 2?
bluss (Author, Contributor), Jan 6, 2016
It's a bit arbitrary; I've only tried 1, 2, and 4 and compared performance, and it's a trade-off. In the memchr code, where this is taken from, it's to fill a 16-byte register on x86-64, but that doesn't happen here.
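For context, the shape being discussed can be sketched like this. It is a simplified stand-in for the PR's validator, not the actual code: `contains_nonascii` and `NONASCII_MASK` mirror names from the diff, while `ascii_prefix_len` is a hypothetical helper used here only to show the unroll-by-2 word loop.

```rust
use std::mem::size_of;

// 0x80 in every byte; truncates to 0x80808080 on 32-bit targets
const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;

#[inline]
fn contains_nonascii(x: usize) -> bool {
    (x & NONASCII_MASK) != 0
}

/// Length of the leading all-ASCII run in `v` (hypothetical helper).
fn ascii_prefix_len(v: &[u8]) -> usize {
    let word = size_of::<usize>();
    let mut offset = 0;
    // word-wise loop, unrolled by 2: read two words per iteration and
    // OR them together before the single non-ASCII test
    while offset + 2 * word <= v.len()
        && (v.as_ptr() as usize + offset) % word == 0
    {
        // SAFETY: bounds and alignment were checked just above
        let (u, w) = unsafe {
            let p = v.as_ptr().add(offset) as *const usize;
            (p.read(), p.add(1).read())
        };
        if contains_nonascii(u | w) {
            break;
        }
        offset += 2 * word;
    }
    // byte-wise loop: handles an unaligned start and the tail after a break
    while offset < v.len() && v[offset] < 128 {
        offset += 1;
    }
    offset
}
```

A higher unroll factor reads more bytes per test but wastes more work when a non-ASCII byte is found early, which is the trade-off described above.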
brson (Contributor), Jan 12, 2016
Can you extract the 2 to a const with a descriptive name about unrolling? Since I don't see any hand-unrolling here, I am guessing that the if statement allows the compiler to do the unrolling according to the unrolling factor. This is not obvious to me. Can you also add a comment explaining?
Edit: Oh, are the duplicated contains_nonascii calls the loop unrolling?
Pedantically, I'd say it should be ASCII (all caps) when in comments or prose, as it's an acronym. Also
ranma42 reviewed (Jan 6, 2016)
    }
}

// find the byte after the point the loop stopped
ranma42 (Contributor), Jan 6, 2016
If the result of (x & 0x80808080_80808080) is non-zero, you can "immediately" find which byte it is using leading_zeros() / 8
bluss (Author, Contributor), Jan 6, 2016
It depends on endianness; it works fine with .trailing_zeros() on x86-64. It deserved to be tried for sure, but I couldn't make it an improvement.
What LLVM compiles the current code into, the beast it is, is actually if contains_nonascii(u | v) { break; }, which seems to make for a much simpler computation inside the loop, and a tight loop.
I'm not 100% happy with the code in find_nonascii, so any suggestion for improvement would be super welcome; feel free to take the code (from the benchmark link) and find something.
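A sketch of the bit trick under discussion, written for the little-endian case bluss mentions (`first_nonascii_byte_le` is a hypothetical name; on a big-endian target one would use leading_zeros() instead, as ranma42 suggested):

```rust
const NONASCII_MASK: u64 = 0x8080_8080_8080_8080;

/// Index (0..8) of the first non-ASCII byte, in memory order, of a
/// word loaded from little-endian bytes; None if the word is pure ASCII.
fn first_nonascii_byte_le(word: u64) -> Option<usize> {
    let bits = word & NONASCII_MASK;
    if bits == 0 {
        None
    } else {
        // on little-endian, the lowest set high bit belongs to the
        // earliest byte in memory, so count zeros from the low end
        Some((bits.trailing_zeros() / 8) as usize)
    }
}
```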
ranma42 (Contributor), Jan 6, 2016
I downloaded the gist, but I am having some trouble getting the datasets you used. Specifically, I assumed that enwik8 should be http://mattmahoney.net/dc/enwik8.zip and that the specific version of the Japanese wiki should not matter much, but I have no idea about big10.
bluss (Author, Contributor), Jan 6, 2016
Yes, maybe you can just skip the datasets you don't have? I could have provided everything better.
big10 is the dataset in http://vaskir.blogspot.ru/2015/09/regular-expressions-rust-vs-f.html
so it's the first 10MB of the unzipped file from https://drive.google.com/open?id=0B8HLQUKik9VtUWlOaHJPdG0xbnM
bluss (Author, Contributor), Jan 6, 2016
jawik10 is the first 10MB from the unzip of http://dumps.wikimedia.org/archive/2006/2006-07/jawiki/20061016/jawiki-20061016-pages-articles.xml.bz2
bluss force-pushed the bluss:ascii-is-the-best branch from 6fd108d to 7037ef7 (Jan 6, 2016)
Sweet wins. r=me but please do extract
brson added the relnotes label (Jan 12, 2016)
Ok, I'll look into whether there's a neat way to write the unrolling factor
bluss force-pushed the bluss:ascii-is-the-best branch from 7037ef7 to 11e3de3 (Jan 12, 2016)
bluss changed the title from "Add fast path for ascii in UTF-8 validation" to "Add fast path for ASCII in UTF-8 validation" (Jan 12, 2016)
I received an improved version by @dotdash (with permission to incorporate, of course!) and it's an improvement you wouldn't believe.
The updated PR description & benchmarks are in there. @brson I addressed loop unrolling only by adding another comment for it; I don't see a nice way to factor it out to a constant
Wow, awesome stuff!
Pushed a fix, there was a missing conditional, let's try this in travis. I can't measure any difference in perf. Oh and the fix actually has a
As a bit of "real world" performance information, I pulled this down and used it for SXD. Parsing a 16M XML file, Valgrind reported that
And I measured a ~1.25% overall speedup in the program. Parsing a 111M XML file
And I measured a ~1.1% overall speedup in the program. Thanks for the awesome performance gains!
Thanks a lot @shepmaster! Always encouraging to get that kind of feedback!
@shepmaster Awesome to see some numbers! I'm guessing your data files are almost purely ASCII (as a lot of the data in the world is).
@brson This is ready for re-review. It's the same algorithm, though with indexed access, and the fast skip-ahead loop is simpler because it's only attempted at aligned locations. The main loop will progress to an aligned location quickly anyway, if the input is mostly ASCII.
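To illustrate the alignment bookkeeping this implies, a minimal sketch (a hypothetical helper for exposition, not the PR's code):

```rust
use std::mem::size_of;

/// Bytes from `ptr_addr` to the next usize-aligned address
/// (0 if already aligned). Relies on the alignment being a power
/// of two, so the modulo distributes over wrapping negation.
fn bytes_until_aligned(ptr_addr: usize) -> usize {
    let align = size_of::<usize>();
    ptr_addr.wrapping_neg() % align
}
```

The byte-wise main loop covers at most this many bytes before the word-wise fast path becomes attemptable, which is why starting unaligned costs little on mostly-ASCII input.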
Ah, yes, I meant to mention that. They indeed are pure-ASCII.
shepmaster reviewed (Jan 13, 2016)
    }
}
// step from the point where the wordwise loop stopped
while offset < len && v[offset] < 128 {
shepmaster (Member), Jan 13, 2016
Reading through this, I thought at first that 128 was another number relating to byte widths, then realized it is the ASCII cutoff value. Since this is also used above (first >= 128), perhaps another constant could be in order?
shepmaster reviewed (Jan 13, 2016)
*/

// use truncation to fit u64 into usize
const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;
shepmaster (Member), Jan 13, 2016
Below, you chose to define the constant inside the method; here you did not. Is there a method to these decisions? FWIW, even if you put it in the method, I'd still vote to leave it as a named constant for readability.
bluss (Author, Contributor), Jan 13, 2016
Didn't think much about it, but the second constant would be in the function by necessity, so that it's not defined too far from its use.
shepmaster reviewed (Jan 13, 2016)
        break;
    }
}
offset += UNROLLED_BY * usize::BYTES;
shepmaster (Member), Jan 13, 2016
You could choose to introduce a constant like const STRIDE_BYTES: usize = UNROLLED_BY * usize::BYTES to avoid duplicating the math in 3 places.
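Roughly what the suggested constant could look like (a sketch; `usize::BYTES` was an unstable constant at the time, so this uses the stable `size_of::<usize>()` spelling instead):

```rust
use std::mem::size_of;

// unroll factor from the discussion above
const UNROLLED_BY: usize = 2;

// bytes consumed per iteration of the word-wise loop; naming it once
// keeps the loop bound, the advance, and the tail handling in sync
const STRIDE_BYTES: usize = UNROLLED_BY * size_of::<usize>();

/// Number of whole strides the word-wise loop can take over `len` bytes.
fn full_strides(len: usize) -> usize {
    len / STRIDE_BYTES
}
```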
bluss force-pushed the bluss:ascii-is-the-best branch from 4cc87ee to cadcd70 (Jan 14, 2016)
I updated the second commit to use a constant for 2 * usize::BYTES instead, roughly following shepmaster's suggestion.
@bors r+
bors added a commit that referenced this pull request (Jan 16, 2016)
bors merged commit cadcd70 into rust-lang:master (Jan 16, 2016)
bluss deleted the bluss:ascii-is-the-best branch (Jan 16, 2016)
Awesome. Thanks @brson and everyone.
bluss commented Jan 6, 2016
Add fast path for ASCII in UTF-8 validation
This speeds up the ASCII case (and long stretches of ASCII in otherwise
mixed UTF-8 data) when checking UTF-8 validity.
Benchmark results suggest that on purely ASCII input, we can improve
throughput (megabytes verified / second) by a factor of 13 to 14 (smallish input).
On XML and mostly English language input (en.wikipedia XML dump),
throughput improves by a factor 7 (large input).
On mostly non-ASCII input, performance increases slightly or is the
same.
The UTF-8 validation is rewritten to use indexed access; since all access is preceded by a (mandatory for validation) length check, bounds checks are statically elided by LLVM, and this formulation is in fact the best for performance. A previous version had losses due to slice-to-iterator conversions.
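As a minimal illustration of the bounds-check-elision pattern described here (a hypothetical toy loop, not the PR's actual validator):

```rust
/// Count ASCII bytes using indexed access. Every `v[i]` is dominated
/// by the `i < len` loop condition, so LLVM can prove the index is in
/// range and elide the bounds check.
fn count_ascii(v: &[u8]) -> usize {
    let len = v.len();
    let mut i = 0;
    let mut n = 0;
    while i < len {
        if v[i] < 128 {
            n += 1;
        }
        i += 1;
    }
    n
}
```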
A large credit to Björn Steinbrink who improved this patch immensely,
writing this second version.
Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3.
Old code is
regular, this PR is calledfast.Datasets:
asciiis just ASCII (2.5 kB)cyris cyrillic script with ascii spaces (5 kB)dewik10is 10MB of a de.wikipedia XML dumpenwik8is 100MB of an en.wikipedia XML dumpjawik10is 10MB of a ja.wikipedia XML dumpCo-authored-by: Björn Steinbrink bsteinbr@gmail.com