[utf8-validator] eliminate unnecessary comparison from must_be_2_3_continuation #2113

Validark · 2024-01-27T12:17:30Z

Hello, I just thought of this optimization while refactoring the code in simdjzon.

For Zig, which uses LLVM, this optimization does not occur automatically. For that, I have submitted an issue to LLVM.

However, this project builds with multiple compilers, which may or may not perform this optimization automatically, so I have edited your code to apply it by hand. The change is very basic. I did not test the modified C++ code, but it looks right to me. (I did test my Zig implementation.)

Hope it works for you.

‒ Validark

P.S. I assumed simd8<uint8_t> must23_80 = must23 & uint8_t(0x80); splats the 0x80. Is that correct?
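For readers following along, here is a scalar sketch of the kind of rewrite described above. It assumes the kernels flag candidate bytes with an unsigned saturating subtraction and that the caller only ever keeps bit 7 (the `& uint8_t(0x80)` in the P.S.); that kernel code is not quoted in this thread, so treat the exact constants and all function names below as hypothetical, not as the PR's code.

```cpp
#include <cassert>
#include <cstdint>

// Scalar stand-in for one lane of an unsigned saturating subtract.
// Hypothetical helper for illustration, not a simdjson function.
uint8_t saturating_sub(uint8_t a, uint8_t b) {
    return a > b ? static_cast<uint8_t>(a - b) : uint8_t(0);
}

// Assumed "before" shape: flag the lane, widen the flag to an all-ones
// mask with a comparison, then keep only bit 7.
uint8_t flag_with_compare(uint8_t byte, uint8_t threshold) {
    uint8_t flagged = saturating_sub(byte, static_cast<uint8_t>(threshold - 1));
    uint8_t mask = flagged > 0 ? 0xff : 0x00;
    return static_cast<uint8_t>(mask & 0x80);
}

// Assumed "after" shape: shift the subtraction constant so that bit 7 of
// the difference is already the flag, then keep only bit 7. No comparison.
uint8_t flag_without_compare(uint8_t byte, uint8_t threshold) {
    uint8_t flagged = saturating_sub(byte, static_cast<uint8_t>(threshold - 0x80));
    return static_cast<uint8_t>(flagged & 0x80);
}

// Exhaustively checks that both forms agree for every byte value, for the
// two thresholds that appear in this thread (0xe0 and 0xf0).
bool forms_agree() {
    const uint8_t thresholds[] = {0xe0, 0xf0};
    for (uint8_t threshold : thresholds) {
        for (int b = 0; b < 256; ++b) {
            uint8_t byte = static_cast<uint8_t>(b);
            if (flag_with_compare(byte, threshold) !=
                flag_without_compare(byte, threshold)) {
                return false;
            }
        }
    }
    return true;
}
```

The check is per lane only: it shows that the comparison is redundant once the result is masked with 0x80, and says nothing about how the two per-lane flags are later combined.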


lemire · 2024-01-27T15:10:42Z

@Validark I think your assumption is correct.

I am running the tests, and I expect to merge your PR.
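To make the "splat" question concrete, here is a small portable model of what AND-ing a byte vector with a scalar would mean if the scalar is broadcast to every lane. `Vec16` is a hypothetical stand-in written for this illustration; it is not simdjson's `simd8<uint8_t>`, whose behavior is only confirmed here by the reply above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-lane byte vector; a portable model for illustration,
// not simdjson's simd8<uint8_t>.
struct Vec16 {
    std::array<uint8_t, 16> lanes{};

    // Broadcast ("splat") one scalar into every lane.
    static Vec16 splat(uint8_t value) {
        Vec16 v;
        v.lanes.fill(value);
        return v;
    }

    // Lane-wise AND of two vectors.
    Vec16 operator&(const Vec16 &other) const {
        Vec16 out;
        for (std::size_t i = 0; i < lanes.size(); ++i) {
            out.lanes[i] = static_cast<uint8_t>(lanes[i] & other.lanes[i]);
        }
        return out;
    }

    // AND with a scalar: the scalar is splatted first, so every lane is
    // masked with the same value.
    Vec16 operator&(uint8_t scalar) const {
        return *this & splat(scalar);
    }
};
```

With this model, `must23 & uint8_t(0x80)` keeps bit 7 of each lane independently, which is the behavior the P.S. assumes.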

lemire · 2024-01-27T16:12:59Z

src/arm64.cpp

-    simd8<bool> is_third_byte  = prev2 >= uint8_t(0xe0u);
-    simd8<bool> is_fourth_byte = prev3 >= uint8_t(0xf0u);
-    return is_third_byte ^ is_fourth_byte;
+simdjson_inline simd8<uint8_t> must_be_2_3_continuation(const simd8<uint8_t> prev2, const simd8<uint8_t> prev3) {


I am not sure that the ARM change buys you something. Does it?

My expectation is that it is not going to affect the performance. Do you disagree?

Validark · 2024-01-27

Based on my use of analogous techniques in Zig, I would expect the ARM codegen to have the same number of instructions either way. I changed the ARM implementation solely because I thought it had to have the same return type as the other implementations. I suppose I could have achieved that in a different way, but I'm not sure it makes a difference.

lemire · 2024-01-28

That's a good answer. I just wanted to make sure I understood the change.

lemire · 2024-01-28T17:28:55Z

@Validark I am pushing your proposal to another library (simdutf), and I am getting a nice boost:

simdutf/simdutf#365

lemire · 2024-01-28T17:42:54Z

I am merging this. This is very nice.

Validark added 2 commits January 27, 2024 05:07

[utf8-validator] eliminate unnecessary comparison from must_be_2_3_co…

5784818

…ntinuation

Fix comment in

bf11df3

lemire reviewed Jan 27, 2024


lemire merged commit 9b0435d into simdjson:master Jan 28, 2024
41 checks passed

This was referenced Jan 29, 2024

Version 3.7.0 FourierTransformer/lua-simdjson#70

Open

Version 3.6.4 FourierTransformer/lua-simdjson#71

Open

travisstaloch mentioned this pull request Jan 29, 2024

optimize must_be_2_3_continuation travisstaloch/simdjzon#22

Merged

chenrui333 mentioned this pull request Feb 1, 2024

simdjson 3.6.4 Homebrew/homebrew-core#161501

Merged
