@Catfish-Man
Contributor

Catfish-Man commented Jul 15, 2025

Fixes rdar://141789595

Catfish-Man self-assigned this Jul 15, 2025
} else {
  isASCII = false
  var tmp: (
    UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32
Contributor Author

Making a temporary buffer here is sort of awful and I want to improve it at some point, but it's also not really hurting anything, and it simplifies the rest of the code a lot.
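
For readers unfamiliar with the idiom, here is a minimal sketch of using a homogeneous tuple as a fixed-size stack scratch buffer. The function name `widenBlock` and what gets stored in the buffer are assumptions for illustration; the snippet quoted above only shows the tuple declaration, not how the PR uses it.

```swift
// Hypothetical helper: copies up to 8 UTF-16 code units into a
// tuple-backed scratch buffer of UInt32 values, then reads them back.
func widenBlock(_ units: [UInt16]) -> [UInt32] {
  precondition(units.count <= 8, "scratch buffer holds 8 elements")

  // A homogeneous tuple is laid out contiguously, so it can stand in
  // for a fixed-size array without a heap allocation.
  var tmp: (
    UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32
  ) = (0, 0, 0, 0, 0, 0, 0, 0)

  return withUnsafeMutableBytes(of: &tmp) { raw -> [UInt32] in
    // View the tuple's storage as a buffer of 8 UInt32 values.
    let scratch = raw.bindMemory(to: UInt32.self)
    for (i, unit) in units.enumerated() {
      scratch[i] = UInt32(unit)
    }
    // Copy out only the part that was written.
    return Array(scratch[..<units.count])
  }
}
```

The pointer is only valid inside the closure, which is part of why this pattern tends to feel awkward compared with a real fixed-size array type.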

@Catfish-Man
Contributor Author

@swift-ci please test

@Catfish-Man
Contributor Author

@swift-ci please benchmark

@Catfish-Man
Contributor Author

@swift-ci please Apple Silicon benchmark

@Catfish-Man
Contributor Author

------- Performance (arm64): -Osize -------
REGRESSION                               OLD        NEW        DELTA       RATIO    
NSString.bridged.byteCount.ascii.utf8    0.0        0.339      +33900.0%   **0.00x (?)**
Calculator                               115.5      128.056    +10.9%      **0.90x**
Chars2                                   2731.25    2982.432   +9.2%       **0.92x**

IMPROVEMENT                              OLD        NEW        DELTA       RATIO    
UTF16Decode.initDecoding                 69.2       4.077      -94.1%      **16.97x**
UTF16Decode.initFromCustom.cont          251.375    21.8       -91.3%      **11.53x**
ArrayAppendGenericStructs                1134.545   870.0      -23.3%      **1.30x (?)**
Array.removeAll.keepingCapacity.Object   2.522      2.238      -11.3%      **1.13x (?)**
InsertCharacterEndIndex                  58.865     54.923     -6.7%       **1.07x**

I'll take that. (The NSString.bridged.byteCount.ascii.utf8 regression is noise: after earlier speedups, that benchmark runs too fast to measure reliably.)
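
For reading the tables, the DELTA and RATIO columns appear to be derived from OLD and NEW as below. This is inferred from the numbers in the table rather than taken from the benchmark driver, so treat the formulas as an assumption.

```swift
// Assumed relationship between the columns, inferred from the rows above.
// DELTA: relative change from OLD to NEW, in percent (negative is faster).
func delta(old: Double, new: Double) -> Double {
  (new - old) / old * 100
}

// RATIO: how many times faster NEW is than OLD (above 1 is an improvement).
func ratio(old: Double, new: Double) -> Double {
  old / new
}

// UTF16Decode.initDecoding row: OLD 69.2, NEW 4.077
let d = delta(old: 69.2, new: 4.077)   // roughly -94.1
let r = ratio(old: 69.2, new: 4.077)   // roughly 16.97
```

With OLD close to zero, both values blow up, which is consistent with the NSString.bridged.byteCount.ascii.utf8 row being flagged with "(?)".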

@Catfish-Man
Contributor Author

Some of those failures do look real, so this'll stay as a draft for now

@Catfish-Man
Contributor Author

Somehow the x86 results look better despite not using the hand-vectorized path? I guess I should try using the fallback path on arm64 and see if it does OK there 😂

IMPROVEMENT                                   OLD       NEW       DELTA    RATIO
UTF16Decode.initDecoding                      176.167   6.55      -96.3%   **26.89x**
UTF16Decode.initFromCustom.cont               475.5     37.615    -92.1%   **12.64x**
Breadcrumbs.CopyAllUTF16CodeUnits.longMixed   223.364   160.133   -28.3%   **1.39x**
Breadcrumbs.CopyAllUTF16CodeUnits.Mixed       226.2     162.733   -28.1%   **1.39x**
Breadcrumbs.CopyUTF16CodeUnits.longMixed      229.6     165.643   -27.9%   **1.39x**

@Catfish-Man
Contributor Author

@swift-ci please test

@Catfish-Man
Contributor Author

@swift-ci please Apple Silicon benchmark

@Catfish-Man
Contributor Author

@swift-ci please benchmark

@Catfish-Man
Contributor Author

Catfish-Man commented Jul 16, 2025

Turns out not accidentally processing twice as much data improves the speedup!

------- Performance (arm64): -Osize -------

REGRESSION                        OLD       NEW       DELTA    RATIO    
MapReduceClass2                   59.048    64.658    +9.5%    **0.91x**
MapReduceClassShort2              91.654    100.0     +9.1%    **0.92x (?)**

IMPROVEMENT                       OLD       NEW       DELTA    RATIO    
UTF16Decode.initDecoding          72.619    2.239     -96.9%   **32.42x**
UTF16Decode.initFromCustom.cont   252.375   22.0      -91.3%   **11.47x**
BufferFillFromSlice               11.326    10.068    -11.1%   **1.12x (?)**
ArrayAppendToGeneric              179.5     165.496   -7.8%    **1.08x (?)**
String.replaceSubrange.String     6.076     5.615     -7.6%    **1.08x**
InsertCharacterEndIndex           60.472    56.405    -6.7%    **1.07x**

@Catfish-Man
Contributor Author

@swift-ci please Apple Silicon benchmark

@Catfish-Man
Contributor Author

@swift-ci please benchmark

@Catfish-Man
Contributor Author

@swift-ci please Apple Silicon benchmark

@Catfish-Man
Contributor Author

Catfish-Man commented Jul 19, 2025

IMPROVEMENT                                  OLD         NEW         DELTA       RATIO    
UTF16Decode.initDecoding                     69.524      2.135       -96.9%      **32.55x**
UTF16Decode.initFromCustom.cont              252.0       21.648      -91.4%      **11.64x**
Calculator                                   128.056     115.55      -9.8%       **1.11x**
StringHasPrefixUnicode                       24014.085   21890.411   -8.8%       **1.10x**

Just as good as before, so I think that means I get to delete all the architecture-specific bits of the patch :)

@Catfish-Man
Contributor Author

@swift-ci please benchmark

@Catfish-Man
Contributor Author

@swift-ci please Apple Silicon benchmark

@Catfish-Man
Contributor Author

@swift-ci please test

@Catfish-Man
Contributor Author

Alas, still not quite there

@Catfish-Man
Contributor Author

@swift-ci please test

@Catfish-Man
Contributor Author

@swift-ci please Apple Silicon benchmark

@Catfish-Man
Contributor Author

woo, Linux tests pass! Unfortunately I did some more detailed benchmarking today, and the current non-ASCII handling is several times slower than Foundation's, so I've got a bit more to do there.

@Catfish-Man
Contributor Author

Catfish-Man commented Jul 22, 2025

[Charts: ASCII, Emoji, BMP, Alternating]

Current local benchmark results compared with Foundation's C implementation. Don't put too much weight on the very small counts; the CF bridging check plus the cross-module call probably accounts for the difference there.

In short: compared with the C implementation, we trade a small regression on non-ASCII for a large improvement on runs of ASCII and a small improvement on alternating ASCII/non-ASCII.

@Catfish-Man
Contributor Author

@swift-ci please test

@Catfish-Man
Contributor Author

@swift-ci please Apple Silicon benchmark

@Catfish-Man
Contributor Author

@swift-ci please benchmark

Catfish-Man changed the title from "WIP vectorization for UTF16->UTF8" to "Vectorize UTF16->UTF8 transcoding" Jul 22, 2025
Catfish-Man marked this pull request as ready for review July 22, 2025 09:23
Catfish-Man requested a review from a team as a code owner July 22, 2025 09:23
  typealias Word = UInt
  #endif
- let mask = Word(truncatingIfNeeded: 0xFF80FF80_FF80FF80 as UInt64)
+ @_transparent var mask:Word {
Contributor Author

This is a computed var because I was hitting a really weird issue locally: when it was a statically initialized let constant, it would sometimes be initialized to zero. I'll see if I can pin down what's going on and file a compiler bug, but this appears to work.
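
As a rough illustration of what a mask like this is for: ANDing a word of packed UTF-16 code units with 0xFF80 in each 16-bit lane leaves zero only if every code unit is below 0x80, i.e. ASCII. The sketch below assumes a 64-bit word holding four code units; `allASCII` and `pack` are hypothetical helpers, not names from the patch, and `@_transparent` is left off.

```swift
// Assumption: a 64-bit word holding four UTF-16 code units.
typealias Word = UInt64

// Computed property rather than a stored `let`, mirroring the
// workaround described in the comment above.
var mask: Word {
  Word(truncatingIfNeeded: 0xFF80FF80_FF80FF80 as UInt64)
}

// True when all four 16-bit lanes hold values below 0x80.
func allASCII(_ word: Word) -> Bool {
  word & mask == 0
}

// Hypothetical helper: packs exactly four code units into one word,
// first unit in the lowest 16 bits.
func pack(_ units: [UInt16]) -> Word {
  precondition(units.count == 4)
  var word: Word = 0
  for (i, unit) in units.enumerated() {
    word |= Word(unit) << (16 * i)
  }
  return word
}

let asciiWord = pack(Array("Swif".utf16))
let mixedWord = pack([0x53, 0xE9, 0x66, 0x74])  // second unit is U+00E9
```

Here `allASCII(asciiWord)` is true and `allASCII(mixedWord)` is false, since 0xE9 has bit 7 set.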

@Catfish-Man
Contributor Author

The macOS builder failed with error: cannot find type 'SIMD16' in scope… that's interesting. Guess I'll need to add the vector types enabled guard after all, how annoying.
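
For context, a SIMD block check of the kind such a guard would protect might look like the sketch below. It uses SIMD8<UInt16> purely for illustration; the vector width, the function shape, and the guard itself in the actual patch are not shown in this thread, so everything here is an assumption.

```swift
// Hypothetical sketch: if all eight UTF-16 code units in the block are
// ASCII, narrow them to eight bytes; otherwise report failure.
func allASCIIBlock(_ block: SIMD8<UInt16>) -> SIMD8<UInt8>? {
  // Any bit at or above bit 7 means the lane is not ASCII.
  let highBits = SIMD8<UInt16>(repeating: 0xFF80)
  guard (block & highBits) == SIMD8<UInt16>(repeating: 0) else {
    return nil
  }
  // Safe to truncate: every lane is below 0x80.
  return SIMD8<UInt8>(truncatingIfNeeded: block)
}

let hello = SIMD8<UInt16>(0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x21, 0x21, 0x21)
let accented = SIMD8<UInt16>(0x48, 0xE9, 0x6C, 0x6C, 0x6F, 0x21, 0x21, 0x21)
```

Code like this cannot compile where the SIMD types are unavailable, which is why a conditional-compilation guard with a non-vector fallback is needed.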

@Catfish-Man
Contributor Author

@swift-ci please smoke test

#else
@_transparent var blockSize:Int { 1 }
@_transparent
func allASCIIBlock(at pointer: UnsafePointer<UInt16>) -> CollectionOfOne<UInt8>? {
Contributor Author

Once we can use InlineArray in the stdlib, this should be easier to convert into a SWAR-style thing that does 8 elements at a time.

…if we need to. I don't think anyone is really depending on UTF16 transcoding perf in an environment that can't use SIMD, but hey, it's a big world, who knows.
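
A minimal sketch of how a one-element fallback like this could slot into a block-at-a-time loop. The signature of `allASCIIBlock(at:)` and the `blockSize` of 1 come from the snippet above; the function body and the `asciiPrefix(of:)` driver are assumptions for illustration only.

```swift
// Fallback block size: one code unit per "block".
var blockSize: Int { 1 }

// Assumed body: returns the single ASCII byte, or nil if the code unit
// at `pointer` is not ASCII.
func allASCIIBlock(at pointer: UnsafePointer<UInt16>) -> CollectionOfOne<UInt8>? {
  let unit = pointer.pointee
  guard unit < 0x80 else { return nil }
  return CollectionOfOne(UInt8(truncatingIfNeeded: unit))
}

// Hypothetical driver: copies the leading run of ASCII code units,
// one block at a time, stopping at the first non-ASCII block.
func asciiPrefix(of input: [UInt16]) -> [UInt8] {
  return input.withUnsafeBufferPointer { buffer -> [UInt8] in
    guard let base = buffer.baseAddress else { return [] }
    var output: [UInt8] = []
    var index = 0
    while index + blockSize <= buffer.count,
          let block = allASCIIBlock(at: base + index) {
      output.append(contentsOf: block)
      index += blockSize
    }
    return output
  }
}
```

Because the driver only depends on `blockSize` and on the block being a collection of bytes, a wider SWAR or SIMD block could replace the fallback without changing the loop.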

@Catfish-Man
Contributor Author

Woo, macOS builder passed. I’m on my phone so haven’t checked the logs for the others yet. Fingers crossed for unrelated issues.

@Catfish-Man
Contributor Author

@swift-ci please smoke test

@Catfish-Man
Contributor Author

Catfish-Man commented Jul 29, 2025

#83407 is an experiment with an additional optimization on top of this. Currently it seems to make -Onone perf completely unusable, so I'll probably save it for later.

[EDIT] Actually that's pre-existing slowness apparently! So maybe I'll go with it after all.

@Catfish-Man
Contributor Author

Benchmark results from the other PR:

------- Performance (arm64): -Osize -------
 REGRESSION                        OLD        NEW        DELTA    RATIO  
 StringHasPrefixAscii              1420.625   1596.429   +12.4%   **0.89x**
 Calculator                        115.5      128.056    +10.9%   **0.90x**
 Chars2                            2731.25    2981.579   +9.2%    **0.92x**
 
 IMPROVEMENT                       OLD        NEW        DELTA    RATIO  
 UTF16Decode.initDecoding          71.19      1.819      -97.4%   **39.12x**
 UTF16Decode.initFromCustom.cont   251.625    18.222     -92.8%   **13.81x**

That corresponds to a big speedup for length calculations for low-BMP characters, a small speedup for ASCII, and a small slowdown for astral/high-BMP characters.
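
To make the three categories concrete, here is a scalar sketch of computing the UTF-8 length of UTF-16 input. It is not the vectorized code from either PR, and it assumes well-formed input (every surrogate is part of a valid pair); the function name is hypothetical.

```swift
// Scalar reference sketch: UTF-8 byte count for well-formed UTF-16.
func utf8Length(ofUTF16 units: [UInt16]) -> Int {
  var count = 0
  for unit in units {
    switch unit {
    case 0..<0x80:
      count += 1        // ASCII
    case 0x80..<0x800:
      count += 2        // low BMP
    case 0xD800..<0xE000:
      count += 2        // half of a surrogate pair: 4 bytes per astral scalar
    default:
      count += 3        // high BMP
    }
  }
  return count
}

let sample = "héllo€😀"
let sampleLength = utf8Length(ofUTF16: Array(sample.utf16))
```

For well-formed input this should agree with `sample.utf8.count`. An unpaired surrogate would normally be repaired to U+FFFD (3 bytes), which this sketch does not handle.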

@Catfish-Man
Contributor Author

[Image: chart]

Comparison between String(decoding:as:UTF16.self) and a simulated version of that using the new implementation
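
For reference, the baseline being measured is the standard library initializer shown below. The inputs are made up for illustration; the benchmark's actual workloads are not given in this thread.

```swift
// Well-formed UTF-16 input round-trips to the original string.
let units: [UInt16] = Array("héllo 😀".utf16)
let decoded = String(decoding: units, as: UTF16.self)

// Ill-formed input is repaired: the unpaired surrogate 0xD800 is
// replaced with U+FFFD rather than causing a failure.
let broken: [UInt16] = [0x61, 0xD800, 0x62]
let repaired = String(decoding: broken, as: UTF16.self)
```

Since Swift's native string storage is UTF-8, this initializer has to transcode UTF-16 to UTF-8, which is the work the PR vectorizes.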

@Catfish-Man
Contributor Author

Tests just passed in #83407, so closing this in favor of it

@Catfish-Man
Contributor Author

Not going to delete this quite yet, though, because despite being measurably faster, the codegen for the other one is REALLY odd.
