Closed
Labels
performance (Performance or resource usage), type-feature (A feature request or enhancement)
Description
Feature or enhancement
Proposal:
I have spotted a few inefficiencies in the stringlib implementations that hinder the compiler's ability to optimize the code. These could be fixed.
- `find_max_char`, the 1-byte version. It is unrolled to check 4- or 8-byte chunks. Alignment (which does not matter on x86-64 but may be important on other platforms) is reached by checking one character at a time. This can be sped up by bitwise OR-ing all the leading characters together and testing the result with a single check. Furthermore, the loop can be unrolled to 32-byte chunks (4 `size_t` integers on 64-bit platforms). The compiler then needs only a few extra instructions for the bitwise OR and can use 16-byte vectors, which are available on both x86-64 and ARM64 and are easy for the compiler to target. The remainder of fewer than 32 bytes can be handled by bitwise OR-ing those characters together and performing one check.
- `find_max_char`, the 2-byte and 4-byte versions. These currently use an unroll of 4, which for the 2-byte version means an 8-byte load. Increasing the unroll to 8 gives 16-byte and 32-byte loads for the 2-byte and 4-byte versions respectively, which the compiler can vectorize.
- stringlib `codecs.h`, `utf8_decode`: the comment on line 47 describes a fast unrolled copy. These statements can be replaced by `memcpy(*_p, *_s, SIZEOF_SIZE_T);`. Using `restrict`, the compiler should understand that a read does not need to be performed twice, and a `memcpy` with a fixed size is optimized away by the compiler (typically into plain loads and stores).
- `ascii_decode`: same as `find_max_char`. This can be optimized using larger chunks.
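To make the OR-accumulation idea concrete, here is a minimal sketch of a 1-byte scan that works in 32-byte chunks and handles the tail with a single check. It is not the stringlib code: `has_non_ascii` is a hypothetical helper that only reports whether any byte has its high bit set, and it uses `uint64_t` instead of `size_t` so that four words are 32 bytes regardless of platform. The fixed-size `memcpy` stands in for the unaligned load, so no byte-by-byte alignment prologue is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not CPython's find_max_char): returns 1 if any byte
   in [s, s + n) has its high bit set, i.e. is non-ASCII, and 0 otherwise. */
int has_non_ascii(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    const unsigned char *end = s + n;

    /* Main loop: 32 bytes (four 64-bit words) per iteration. */
    while ((size_t)(end - s) >= 32) {
        uint64_t w[4];
        /* Fixed-size memcpy: compilers normally turn this into plain
           (possibly vector) loads, and it is safe for unaligned input. */
        memcpy(w, s, sizeof w);
        /* OR the four words together and test the high bits once. */
        uint64_t acc = w[0] | w[1] | w[2] | w[3];
        if (acc & UINT64_C(0x8080808080808080)) {
            return 1;
        }
        s += 32;
    }

    /* Tail of fewer than 32 bytes: OR everything, then check once. */
    unsigned char tail = 0;
    while (s < end) {
        tail |= *s++;
    }
    return (tail & 0x80) != 0;
}
```

The early exit happens once per 32-byte chunk rather than once per word, which keeps the loop body branch-light and leaves the OR reduction in a form a compiler can map onto 16-byte vector instructions. Whether that actually beats the existing code would have to be measured per platform.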
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response