Attempt to transcode into smol strings #78913
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -203,7 +203,6 @@ extension String { | |
| return contents.withUnsafeBufferPointer { String._uncheckedFromUTF8($0) } | ||
| } | ||
|
|
||
| @inline(never) // slow path | ||
| private static func _slowFromCodeUnits< | ||
| Input: Collection, | ||
| Encoding: Unicode.Encoding | ||
|
|
@@ -213,7 +212,34 @@ extension String { | |
| repair: Bool | ||
| ) -> (String, repairsMade: Bool)? | ||
| where Input.Element == Encoding.CodeUnit { | ||
| // TODO(String Performance): Attempt to form smol strings | ||
| if input.count < _SmallString.capacity { | ||
|
Contributor
Author
"14 or fewer" is operating under the assumption that "all ASCII except for one" is a relatively common case
Member
I think early-exiting for small should be done in the single transcoding loop. We do not know how many UTF-8 code units
Contributor
Author
I'm not sure I understand. This check is a heuristic to avoid even attempting a small string, not the actual check to confirm that we can. We know that 15 UTF-16 code units are very unlikely to succeed: if they were all ASCII then we'd probably have an ASCII buffer to start with, and if they're not all ASCII then we'll have >15 UTF-8 code units. So 14 is the largest number that could possibly work, unless I'm missing something.
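The counting argument above can be checked with a small sketch. It is not the PR's code: `utf8Count(ofUTF16:)` and `smallCapacity` are hypothetical names, and the value 15 is an assumption standing in for `_SmallString.capacity` on 64-bit platforms. It uses only the public `transcode(_:from:to:stoppingOnError:into:)` function.

```swift
// Hypothetical helper: count the UTF-8 code units that a UTF-16 input
// would occupy after transcoding, without storing the output.
func utf8Count(ofUTF16 input: [UInt16]) -> Int {
  var count = 0
  _ = transcode(
    input.makeIterator(),
    from: UTF16.self,
    to: UTF8.self,
    stoppingOnError: false,
    into: { _ in count += 1 })
  return count
}

// Assumption: stands in for _SmallString.capacity on 64-bit platforms.
let smallCapacity = 15

// 14 ASCII code units: 14 UTF-8 bytes.
let ascii14 = Array("abcdefghijklmn".utf16)
// 13 ASCII code units plus U+00E9 (2 UTF-8 bytes): 14 UTF-16 units, 15 bytes.
let mixed14 = Array("abcdefghijklm\u{E9}".utf16)
// 14 ASCII code units plus U+00E9: 15 UTF-16 units, 16 bytes.
let mixed15 = Array("abcdefghijklmn\u{E9}".utf16)

for input in [ascii14, mixed14, mixed15] {
  let bytes = utf8Count(ofUTF16: input)
  print(input.count, bytes, bytes <= smallCapacity)
}
```

With one non-ASCII code unit, 14 UTF-16 units still fit in 15 bytes, while 15 units do not, which is the case the `input.count < _SmallString.capacity` heuristic is meant to screen out.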
Member
My understanding is this algorithm proceeds as follows:
I'm wondering if we can simplify the code and logic, as well as improve the general case, by changing the existing transcode loop to skip the Array if it's going to be returning a small string anyways. Would it be possible to just decode the input's scalars and append them to a
Contributor
Author
oh that's an interesting approach. I wonder if resizing (i.e. multiple mallocs) is actually faster than two transcode passes (one throwing away the data but writing down the size, then one actually copying in) in common cases
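For reference, the "two transcode passes" alternative mentioned here could look roughly like the sketch below. This is only an illustration using public API, not what the PR does: `twoPassString(fromUTF16:)` is a hypothetical name, and `String(unsafeUninitializedCapacity:initializingUTF8With:)` (Swift 5.3 and later) stands in for the internal storage initializers. Whether this beats repeated resizing would need to be benchmarked.

```swift
// Hypothetical sketch of a two-pass transcode: the first pass only counts
// the UTF-8 code units, the second writes them into exactly-sized storage.
func twoPassString(fromUTF16 input: [UInt16]) -> String {
  // Pass 1: discard the output, remember the size.
  var needed = 0
  _ = transcode(
    input.makeIterator(),
    from: UTF16.self,
    to: UTF8.self,
    stoppingOnError: false,
    into: { _ in needed += 1 })

  // Pass 2: transcode again, this time writing into the string's storage.
  return String(unsafeUninitializedCapacity: needed) { buffer in
    var written = 0
    _ = transcode(
      input.makeIterator(),
      from: UTF16.self,
      to: UTF8.self,
      stoppingOnError: false,
      into: {
        buffer[written] = $0
        written += 1
      })
    return written
  }
}

let sample = Array("h\u{E9}llo, w\u{F6}rld".utf16)
print(twoPassString(fromUTF16: sample))
```

Both passes use `stoppingOnError: false`, so invalid input is repaired identically each time and the counted size matches the number of bytes written.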
Contributor
I like that idea.
Contributor
I was curious about this and tried to dig into it a bit.
EDIT: Looks like
Contributor
Re-creating a
Contributor
Author
The relevant entry points for appending to Strings are mostly in StringRangeReplaceableCollection.swift |
||
| var repaired = false | ||
| var overflow = false | ||
| let result = _SmallString(initializingUTF8With: { buffer in | ||
| var bytesUsed = 0 | ||
| repaired = transcode( | ||
| input.makeIterator(), | ||
| from: encoding, | ||
| to: UTF8.self, | ||
| stoppingOnError: false, | ||
| into: { | ||
| if bytesUsed < buffer.count { | ||
| buffer[bytesUsed] = $0 | ||
| } | ||
| bytesUsed &+= 1 | ||
| } | ||
| ) | ||
| guard bytesUsed <= buffer.count else { | ||
| overflow = true | ||
| return 0 | ||
glessard marked this conversation as resolved.
|
||
| } | ||
| return bytesUsed | ||
| }) | ||
| if !overflow { | ||
| return repair || !repaired | ||
| ? (String(_StringGuts(result)), repairsMade: repaired) : nil | ||
| } | ||
| } | ||
|
|
||
| // TODO(String performance): Skip intermediary array, transcode directly | ||
| // into a StringStorage space. | ||
|
|
@@ -236,6 +262,10 @@ extension String { | |
| where Input == UnsafeBufferPointer<UInt8>, Encoding == Unicode.ASCII) | ||
| @_specialize( | ||
| where Input == Array<UInt8>, Encoding == Unicode.ASCII) | ||
| @_specialize( | ||
| where Input == UnsafeBufferPointer<UInt16>, Encoding == UTF16) | ||
|
Contributor
Author
Not sure about these either
Member
As a separate point, we might want a separate fully concrete code path for UTF-16 anyways, rather than going through generic transcoding. |
||
| @_specialize( | ||
| where Input == Array<UInt16>, Encoding == UTF16) | ||
| internal static func _fromCodeUnits< | ||
| Input: Collection, | ||
| Encoding: Unicode.Encoding | ||
|
|
||