Attempt to transcode into smol strings #78913

Catfish-Man · 2025-01-25T09:55:22Z

Fixes rdar://144114867

Catfish-Man · 2025-01-25T10:03:32Z

@swift-ci please smoke test

Catfish-Man · 2025-01-25T10:03:40Z

@swift-ci please Apple Silicon benchmark

stdlib/public/core/String.swift

Catfish-Man · 2025-01-25T10:05:02Z

stdlib/public/core/StringCreate.swift

  ) -> (String, repairsMade: Bool)?
  where Input.Element == Encoding.CodeUnit {
-    // TODO(String Performance): Attempt to form smol strings
+    if input.count < _SmallString.capacity {


"14 or fewer" is operating under the assumption that "all ASCII except for one" is a relatively common case

I think early-exiting for small should be done in the single transcoding loop. We do not know how many UTF-8 code units input.count is.

I'm not sure I understand. This check is a heuristic to avoid even attempting a small string, not the actual check to confirm that we can. We know that 15 UTF16 code units is very unlikely to succeed because if they were all ASCII then we'd probably have an ASCII buffer to start with, and if they're not all ASCII then we'll have >15 UTF8, so 14 is the largest number that could possible work, unless I'm missing something.

My understanding is this algorithm proceeds as follows:

We check a might-be-small heuristic (which could be either an overestimate or underestimate)

If heuristic passes, we run a transcode loop to try to form and return the small string if we can, otherwise:

We run a second transcode loop into an Array<UInt8>

We copy that array's contents into a String, either making a _SmallString or __StringStorage, depending on its length

I'm wondering if we can simplify the code and logic, as well as improve the general case, by changing the existing transcode loop to skip the Array if it's going to be returning a small string anyways.

Would it be possible to just decode the input's scalars and append them to a var result = "" (or its guts)?. That would avoid there ever being an array and keep it in small form for every input that could be small.

oh that's an interesting approach. I wonder if resizing (i.e. multiple mallocs) is actually faster than two transcode passes (one throwing away the data but writing down the size, then one actually copying in) in common cases

I like that idea. UnicodeScalarView has an append(_: UnicodeScalar), so that seems viable. The implementation is naïve, though; we would need to improve it substantially. (Or we can add something better.)

I was curious about this and tried to dig it a bit.
What is the agreement on ideal flow?
I tried to look at the suggestions above, but there are some hurdles.

The basic issue is that we don't know if result can be smol until transcoding is done.

Starting with "" and resizing it on the go seems inefficient. It will be re-creating _SmallString on each append.

IIUC one of the big desires is to remove the intermediate array. I was thinking maybe __StringStorage could be a starting point. It is updated as part of transcoding. In the end it is converted to _SmallString if it fits the criteria.

EDIT: Looks like __StringStorage itself is not meant for growing. And dealing with buffer pointers directly they really want to know the size upfront. The only alternative to doing 2 transcoding passes I can think of is allocating 2x memory (worst case scenario) and then shrinking it. It probably does not help with this PR optimistic scenario, but it gets rid of intermediate Array<UInt8>.

Re-creating a _SmallString for a few appends is fine, it's two (or three) words held in registers, with a few bytes changed every time. For the ultimate version of this, we need to improve reserveCapacity() and add an internal way to append a known-good UTF8-encoded code point.

The relevant entry points for appending to Strings are mostly in StringRangeReplaceableCollection.swift

Catfish-Man · 2025-01-25T10:05:22Z

stdlib/public/core/StringCreate.swift

  @_specialize(
    where Input == Array<UInt8>, Encoding == Unicode.ASCII)
+  @_specialize(
+    where Input == UnsafeBufferPointer<UInt16>, Encoding == UTF16)


Not sure about these either

As a separate point, we might want a separate fully concrete code path for UTF-16 anyways, rather than going through generic transcoding.

Catfish-Man · 2025-01-25T18:19:01Z


------- Performance (arm64): -Osize -------

REGRESSION                                 OLD        NEW        DELTA     RATIO     
Calculator                                 115.5      128.056    +10.9%    **0.90x** 
Chars2                                     2731.25    2982.051   +9.2%     **0.92x (?)**
KeyPathWritePerformance                    20.375     22.204     +9.0%     **0.92x** 
String.initRepeating.1AsciiChar.Count100   210.818    228.2      +8.2%     **0.92x** 
StringFromLongWholeSubstring               1.854      1.998      +7.8%     **0.93x** 
DictionaryBridgeToObjC_Bridge              5.782      6.228      +7.7%     **0.93x (?)**

IMPROVEMENT                                OLD        NEW        DELTA     RATIO     
DataToStringEmpty                          361.607    0.0        -100.0%   **361608.00x**
UTF16Decode.initDecoding                   67.737     2.584      -96.2%    **26.20x**
StringWithCString2                         0.001      0.0        -50.0%    **2.00x (?)**
RawBufferCopyBytes                         11.171     9.564      -14.4%    **1.17x (?)**
CSVParsing.Scalar                          160.143    142.125    -11.3%    **1.13x** 
ArrayAppendGenericStructs                  1077.273   957.692    -11.1%    **1.12x (?)**
LineSink.scalars.alpha                     34.479     31.653     -8.2%     **1.09x** 
CxxStringConversion.cxx.to.swift           60.0       56.0       -6.7%     **1.07x (?)**

------- Code size: -Osize -------

REGRESSION          OLD     NEW     DELTA   RATIO  
StringSplitting.o   23441   24089   +2.8%   **0.97x**
UTF8Decode.o        13600   13824   +1.6%   **0.98x**

stdlib/public/core/StringCreate.swift

Catfish-Man · 2025-07-30T16:43:12Z

Closing this in favor of #83407, which adds a vectorized transcoder for this as well as making small strings.

Catfish-Man self-assigned this Jan 25, 2025

Attempt to transcode into smol strings

d2ee1ce

Catfish-Man force-pushed the smolify branch from 9983fcc to d2ee1ce Compare January 25, 2025 10:03

Catfish-Man commented Jan 25, 2025

View reviewed changes

stdlib/public/core/String.swift Show resolved Hide resolved

Catfish-Man commented Jan 25, 2025

View reviewed changes

Catfish-Man marked this pull request as ready for review January 30, 2025 20:05

Catfish-Man requested a review from a team as a code owner January 30, 2025 20:05

glessard reviewed Feb 3, 2025

View reviewed changes

stdlib/public/core/StringCreate.swift Show resolved Hide resolved

glessard approved these changes Feb 7, 2025

View reviewed changes

Catfish-Man closed this Jul 30, 2025

Attempt to transcode into smol strings #78913

Attempt to transcode into smol strings #78913

Uh oh!

Conversation

Catfish-Man commented Jan 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Catfish-Man commented Jan 25, 2025

Uh oh!

Catfish-Man commented Jan 25, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iThinker Feb 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

glessard Feb 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Catfish-Man commented Jan 25, 2025

Uh oh!

Uh oh!

Catfish-Man commented Jul 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Catfish-Man commented Jan 25, 2025 •

edited

Loading

iThinker Feb 6, 2025 •

edited

Loading

glessard Feb 7, 2025 •

edited

Loading