Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions stdlib/public/core/String.swift
Original file line number Diff line number Diff line change
Expand Up @@ -445,6 +445,10 @@ extension String {
public init<C: Collection, Encoding: Unicode.Encoding>(
decoding codeUnits: C, as sourceEncoding: Encoding.Type
) where C.Iterator.Element == Encoding.CodeUnit {
guard codeUnits.count > 0 else {
self = ""
return
}
guard _fastPath(sourceEncoding == UTF8.self) else {
self = String._fromCodeUnits(
codeUnits, encoding: sourceEncoding, repair: true)!.0
Expand Down
34 changes: 32 additions & 2 deletions stdlib/public/core/StringCreate.swift
Original file line number Diff line number Diff line change
Expand Up @@ -203,7 +203,6 @@ extension String {
return contents.withUnsafeBufferPointer { String._uncheckedFromUTF8($0) }
}

@inline(never) // slow path
private static func _slowFromCodeUnits<
Input: Collection,
Encoding: Unicode.Encoding
Expand All @@ -213,7 +212,34 @@ extension String {
repair: Bool
) -> (String, repairsMade: Bool)?
where Input.Element == Encoding.CodeUnit {
// TODO(String Performance): Attempt to form smol strings
if input.count < _SmallString.capacity {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"14 or fewer" is operating under the assumption that "all ASCII except for one" is a relatively common case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think early-exiting for small should be done in the single transcoding loop. We do not know how many UTF-8 code units input.count is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. This check is a heuristic to avoid even attempting a small string, not the actual check to confirm that we can. We know that 15 UTF16 code units is very unlikely to succeed because if they were all ASCII then we'd probably have an ASCII buffer to start with, and if they're not all ASCII then we'll have >15 UTF8, so 14 is the largest number that could possible work, unless I'm missing something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is this algorithm proceeds as follows:

  1. We check a might-be-small heuristic (which could be either an overestimate or underestimate)
  2. If heuristic passes, we run a transcode loop to try to form and return the small string if we can, otherwise:
  3. We run a second transcode loop into an Array<UInt8>
  4. We copy that array's contents into a String, either making a _SmallString or __StringStorage, depending on its length

I'm wondering if we can simplify the code and logic, as well as improve the general case, by changing the existing transcode loop to skip the Array if it's going to be returning a small string anyways.

Would it be possible to just decode the input's scalars and append them to a var result = "" (or its guts)?. That would avoid there ever being an array and keep it in small form for every input that could be small.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that's an interesting approach. I wonder if resizing (i.e. multiple mallocs) is actually faster than two transcode passes (one throwing away the data but writing down the size, then one actually copying in) in common cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. UnicodeScalarView has an append(_: UnicodeScalar), so that seems viable. The implementation is naïve, though; we would need to improve it substantially. (Or we can add something better.)

Copy link
Contributor

@iThinker iThinker Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious about this and tried to dig it a bit.
What is the agreement on ideal flow?
I tried to look at the suggestions above, but there are some hurdles.

  1. The basic issue is that we don't know if result can be smol until transcoding is done.
  2. Starting with "" and resizing it on the go seems inefficient. It will be re-creating _SmallString on each append.
  3. IIUC one of the big desires is to remove the intermediate array. I was thinking maybe __StringStorage could be a starting point. It is updated as part of transcoding. In the end it is converted to _SmallString if it fits the criteria.

EDIT: Looks like __StringStorage itself is not meant for growing. And dealing with buffer pointers directly they really want to know the size upfront. The only alternative to doing 2 transcoding passes I can think of is allocating 2x memory (worst case scenario) and then shrinking it. It probably does not help with this PR optimistic scenario, but it gets rid of intermediate Array<UInt8>.

Copy link
Contributor

@glessard glessard Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-creating a _SmallString for a few appends is fine, it's two (or three) words held in registers, with a few bytes changed every time. For the ultimate version of this, we need to improve reserveCapacity() and add an internal way to append a known-good UTF8-encoded code point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relevant entry points for appending to Strings are mostly in StringRangeReplaceableCollection.swift

var repaired = false
var overflow = false
let result = _SmallString(initializingUTF8With: { buffer in
var bytesUsed = 0
repaired = transcode(
input.makeIterator(),
from: encoding,
to: UTF8.self,
stoppingOnError: false,
into: {
if bytesUsed < buffer.count {
buffer[bytesUsed] = $0
}
bytesUsed &+= 1
}
)
guard bytesUsed <= buffer.count else {
overflow = true
return 0
}
return bytesUsed
})
if !overflow {
return repair || !repaired
? (String(_StringGuts(result)), repairsMade: repaired) : nil
}
}

// TODO(String performance): Skip intermediary array, transcode directly
// into a StringStorage space.
Expand All @@ -236,6 +262,10 @@ extension String {
where Input == UnsafeBufferPointer<UInt8>, Encoding == Unicode.ASCII)
@_specialize(
where Input == Array<UInt8>, Encoding == Unicode.ASCII)
@_specialize(
where Input == UnsafeBufferPointer<UInt16>, Encoding == UTF16)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about these either

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a separate point, we might want a separate fully concrete code path for UTF-16 anyways, rather than going through generic transcoding.

@_specialize(
where Input == Array<UInt16>, Encoding == UTF16)
internal static func _fromCodeUnits<
Input: Collection,
Encoding: Unicode.Encoding
Expand Down